Bug 2295 - peerlist receives no job-information, causing all jobs to be resubmitted after the set 60 seconds

Status ASSIGNED
Reported 2013-09-20 18:05:00 +0200
Modified 2013-09-23 11:13:01 +0200
Product: FieldTrip
Component: peer
Version: unspecified
Hardware: PC
Operating System: Windows
Importance: P3 normal
Assigned to: Robert Oostenveld
URL:
Tags:
Depends on:
Blocks:
See also:

Roemer van der Meij - 2013-09-20 18:05:27 +0200

I came across this while using the p2p toolbox on our new DCC cluster. Everything runs within a single machine. I start one master and a set of slaves with specific group, allowgroup and allowuser settings. The jobs get submitted using peercellfun and are executed nicely. However, after the 60 s delay at line 471, the lastseen variable is inf for all submitted jobs. The jobs are correctly marked as submitted (all ones).

After closer inspection, the problem appears to be in peerlist. When calling list = peerlist('busy'), I get a mostly correct structure array for all running peers, except that list(i).current (containing the job info) is largely 'empty'.

An example. On some MATLAB terminals:

    peerslave('allowuser','roevdmei','allowgroup','arch2','memavail',4294967296,'timavail',1209600)

On my main MATLAB terminal:

    peermaster('group','arch2','allowgroup','arch2','allowuser','roevdmei')

I then submit some jobs, which get executed on the peers (and retrieved later on), and call:

    list = peerlist('busy')

    list(i) =
        hostid: 2.1561e+09
        hostname: 'archimedes'
        user: 'roevdmei'
        group: 'arch2'
        socket: ''
        port: 1701
        status: 3
        timavail: 1209600
        memavail: 4.2950e+09
        cpuavail: 0
        current: [1x1 struct]

These are the correct settings I gave the peer. However, while the peer is actually executing a job:

    list(i).current
        hostid: 0
        jobid: 0
        hostname: ''
        user: ''
        group: ''
        timreq: 0
        memreq: 0
        cpureq: 0

This likely means the master never knows when the slaves are busy with its jobs, and thus keeps resubmitting them. When the originals finish, it correctly reverts to the original results and finishes/quits nicely. The resubmitted jobs keep the peers busy for much longer after this, though.

Sorry for being inactive/not very active the past weeks/months; I'm a bit swamped by thesis work and have been sick for a long time the past months. Will catch up very soon, hopefully next Monday.

Cheers,
Roemer
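The symptom described above could be checked with a short diagnostic along the following lines. This is only a sketch: the variable name nojobinfo is made up for illustration, and the list variable is filled with a fake entry that mimics the reported output so the snippet runs without any peers. On a live system it would instead come from list = peerlist('busy'). Treating jobid == 0 as "no job information" is an assumption based on the output shown in the report, not documented behaviour.

```matlab
% Fake peer entry mimicking the reported output (assumption: on a live
% system this would be replaced by   list = peerlist('busy');  ).
list = struct('hostname', 'archimedes', 'status', 3, ...
              'current', struct('hostid', 0, 'jobid', 0, ...
                                'hostname', '', 'user', ''));

% Assumption: a peer has no usable job info when its 'current' field is
% empty or when the jobid in it is still zero.
nojobinfo = arrayfun(@(p) isempty(p.current) || p.current.jobid == 0, list);

% Report every listed peer for which the job info is missing.
for i = find(nojobinfo(:))'
  fprintf('peer %s (status %d) reports no job information\n', ...
          list(i).hostname, list(i).status);
end
```

If every busy peer shows up in this report while jobs are demonstrably running, that would be consistent with the master never seeing its jobs and resubmitting them after the timeout.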


Robert Oostenveld - 2013-09-20 19:20:12 +0200

Oh dear, I was actually planning to retire p2p as deprecated...


Roemer van der Meij - 2013-09-23 11:13:01 +0200

Ooh, that's a pity, but understandable. I guess you haven't really worked on it in the past year? It is still a good, simple poor man's alternative for working efficiently when you lack a dedicated server, although there are of course other systems as well. I'm currently using it on our cluster because our Torque system is not live yet (that will still take quite a while). Could you perhaps give some tips? I have dug through peer.c and some other mex files, looking for the function that collects the job info. The chance that I can fix it is not very large (hacking away the symptom is easier). Maybe you have some pointers off the top of your head? Otherwise, leave it be, and deprecate.