Bug 309 - fail to submit a job

Status CLOSED FIXED
Reported 2010-12-15 13:18:00 +0100
Modified 2011-01-05 12:01:08 +0100
Product: FieldTrip
Component: peer
Version: unspecified
Hardware: PC
Operating System: Windows
Importance: P1 normal
Assigned to: Robert Oostenveld
URL:
Tags:
Depends on:
Blocks:
See also:

Marcel Zwiers - 2010-12-15 13:18:24 +0100

peerlist abc
>> ..
>> there are 79 peers running on 32 hosts as idle slave
...
peercellfun('exp',{2})
>> warning: resubmitting job 1 because it takes too long to get started


Robert Oostenveld - 2010-12-15 15:52:13 +0100

This is probably not a bug. The job is submitted to a slave; the slave tries to start the engine, finds that it cannot get a license (because of license limitations during office hours), drops the job, and switches to zombie. The master then resubmits (to another slave) because the job never started on the first slave.
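
The sequence described above can be sketched as a toy model. This is Python, not the actual peer code; the names `SlaveState`, `handle_job` and `submit` are hypothetical, and the assumption that the master simply tries the slaves in order is made only for illustration.

```python
from enum import Enum
from typing import List, Tuple


class SlaveState(Enum):
    IDLE = "idle"
    BUSY = "busy"
    ZOMBIE = "zombie"


def handle_job(license_available: bool) -> Tuple[SlaveState, bool]:
    """Return the slave's new state and whether the job actually started."""
    if license_available:
        # The engine starts, so the slave executes the job.
        return SlaveState.BUSY, True
    # No license: the engine cannot start, the job is dropped,
    # and the slave switches to zombie.
    return SlaveState.ZOMBIE, False


def submit(slaves_have_license: List[bool]) -> Tuple[int, int]:
    """Try the slaves in order (an assumption of this sketch).

    Returns (index of the slave that ran the job, number of failed attempts).
    The index is -1 if no slave could start the job.
    """
    failed_attempts = 0
    for index, licensed in enumerate(slaves_have_license):
        _state, started = handle_job(licensed)
        if started:
            return index, failed_attempts
        # The job never started on this slave, so the master resubmits.
        failed_attempts += 1
    return -1, failed_attempts
```

With two unlicensed slaves ahead of a licensed one, `submit([False, False, True])` runs the job on the third slave after two failed attempts.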


Marcel Zwiers - 2010-12-15 16:12:03 +0100

If it's not a bug it must be a feature :-) FYI, I just ran peercellfun('exp',{2}, 'timreq',0.1) again and it hasn't finished yet (after more than 20 resubmissions and 10 minutes elapsed time)...


Marcel Zwiers - 2010-12-15 16:40:40 +0100

Another half hour has passed and it just found a slave that was willing to process my job in 0 sec... :-)


Robert Oostenveld - 2010-12-15 20:14:58 +0100

The bug/feature has been there from the beginning and is a design consequence of the command-line peerslaves. If there are no licenses available, the peerslaves cannot start an engine and the job cannot be executed. It would be a bug if the job never executed at all, but there is never a guarantee that peercellfun will actually speed up the job. With competing users (in this case one with many big jobs and another with a single small job), the single job is at a disadvantage. Had the single job been bigger, it would not have been different. The disappointing performance (and the resubmissions every 30 seconds) has to do with the many peerslaves that cannot get a license but are still running. What do you suggest to solve the problem?


Marcel Zwiers - 2010-12-16 10:38:51 +0100

I suggest that the slave should switch to zombie mode for an hour if it can't get a license.


Robert Oostenveld - 2010-12-19 09:34:10 +0100

r2468 | roboos | 2010-12-19 09:32:18 +0100 (Sun, 19 Dec 2010) | 7 lines

increase zombietimeout in peerslave.exe to 900 seconds (15 minutes)
peerslave.exe returns an error if the matlab engine fails to start
peerslave catches the error and resubmits immediately (used to take 30 seconds)
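
As a rough illustration of what this change means for waiting time, here is a small arithmetic sketch in Python (not the peer code). The 30-second and 900-second figures come from this thread; treating "immediately" as zero delay is an assumption, and the function names are hypothetical.

```python
OLD_RESUBMIT_DELAY = 30.0  # seconds lost per failed start before r2468 (from the thread)
NEW_RESUBMIT_DELAY = 0.0   # "immediately"; assumed here to be negligible
ZOMBIE_TIMEOUT = 900.0     # seconds a slave stays zombie after r2468 (15 minutes)


def time_lost(failed_attempts: int, delay_per_failure: float) -> float:
    """Total time spent waiting on slaves that could not start the engine."""
    return failed_attempts * delay_per_failure


def zombie_until(failure_time: float, timeout: float = ZOMBIE_TIMEOUT) -> float:
    """Time (in seconds) at which a slave that failed at failure_time leaves zombie mode."""
    return failure_time + timeout


if __name__ == "__main__":
    # 20 failed attempts at 30 seconds each is 600 seconds, i.e. 10 minutes.
    print(time_lost(20, OLD_RESUBMIT_DELAY))
    print(time_lost(20, NEW_RESUBMIT_DELAY))
```

Twenty failed attempts at 30 seconds each add up to 600 seconds, which matches the "more than 20 resubmissions and 10 minutes" reported earlier in this thread.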


Robert Oostenveld - 2011-01-05 11:57:03 +0100

selected a long list of resolved bugs from roboos and changed the status into "RESOLVED"


Robert Oostenveld - 2011-01-05 12:01:08 +0100

selected all old bugs from roboos with status RESOLVED and changed it into CLOSED