
Bug 306 - too many resubmissions with small nr of collected jobs

Status CLOSED FIXED
Reported 2010-12-15 12:13:00 +0100
Modified 2011-01-05 12:01:08 +0100
Product: FieldTrip
Component: peer
Version: unspecified
Hardware: PC
Operating System: Windows
Importance: P1 enhancement
Assigned to: Robert Oostenveld

Marcel Zwiers - 2010-12-15 12:13:57 +0100

peercellfun resubmits massively when all jobs have been submitted but only a small number (e.g. 1) have been collected, i.e. when (estimated_max - estimated_min) is very small (e.g. zero) and unreliable.

Suggested solution: gradually move from the situation in which no jobs have been collected (see line 399):

    estimated = 3*timreq

to the situation in which jobs have been collected (line 396):

    estimated = estimated_avg + 2*(estimated_max - estimated_min)

I suggest replacing lines 393 (which also contains a logical bug) through 399 with the following weighted average of the two:

    estimated_avg = mean(collecttime(collected) - submittime(collected));
    estimated = (3*timreq + sum(collected)*(estimated_avg + 2*(estimated_max - estimated_min))) / (1 + sum(collected))
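To illustrate how the proposed weighting behaves, here is a minimal Python sketch of the formula above. It is not the peercellfun code: the function name `blended_estimate` and the `elapsed` argument (the list of collecttime - submittime values for the collected jobs) are hypothetical, and it assumes that estimated_max and estimated_min are the maximum and minimum of those values.

```python
def blended_estimate(timreq, elapsed):
    """Weighted average of the prior guess (3*timreq) and the data-driven estimate.

    timreq  -- requested time per job (assumed to be a positive number)
    elapsed -- hypothetical list of collecttime - submittime for collected jobs
    """
    n = len(elapsed)  # plays the role of sum(collected) in the MATLAB expression
    if n == 0:
        # No jobs collected yet: only the prior guess is available.
        return 3 * timreq
    estimated_avg = sum(elapsed) / n
    # Assumption: estimated_max/estimated_min are the extremes of the same values.
    spread = max(elapsed) - min(elapsed)
    data_driven = estimated_avg + 2 * spread
    # The prior counts as one observation, the data-driven estimate as n.
    return (3 * timreq + n * data_driven) / (1 + n)


if __name__ == "__main__":
    print(blended_estimate(10, []))      # 30
    print(blended_estimate(10, [4]))     # 17.0
    print(blended_estimate(10, [4, 6]))  # 16.0
```

With a single collected job that took 4 seconds, the unweighted estimate would be 4 (the spread is zero), whereas the blended estimate is still 17, so jobs are not resubmitted prematurely. As more jobs are collected, the influence of 3*timreq shrinks.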


Marcel Zwiers - 2010-12-15 13:09:08 +0100

P.S. Line 389 should, of course, also be adapted (N.B. timreq is never empty) to: elseif ~isempty(timreq)


Robert Oostenveld - 2010-12-15 15:54:00 +0100

If you specify an appropriate timreq or resubmittime, you should not have this problem. Can you please try with either one of these two options?


Marcel Zwiers - 2010-12-15 16:18:19 +0100

Passing an appropriate timreq does not do anything (that is obvious from the code). Passing resubmittime does avoid the problem (that is also obvious from the code), but it is an undesirable workaround rather than a good solution, basically because resubmittime is static and typically very hard to estimate beforehand.


Robert Oostenveld - 2010-12-15 20:05:42 +0100

timreq is indeed not currently acting as expected and should be fixed. Should resubmittime be removed from the code again (it was added at your request)?


Robert Oostenveld - 2010-12-19 09:36:58 +0100

r2468 | roboos | 2010-12-19 09:32:18 +0100 (Sun, 19 Dec 2010) | 7 lines

fixed, at the moment it does not use the distribution at all, only the estimated timreq (which is nanmax of the collecttime-submittime). It uses 3*timreq, just like the killswitch on the peerslave. Note that the timreq might slightly increase over time (with more jobs returning) and does not reflect the timreq that was used when submitting.
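The rule described in this commit message can be sketched in Python as follows. This is an illustration under stated assumptions, not the code from r2468: the function names are hypothetical, uncollected jobs are assumed to carry NaN as their collecttime, and the strict ">" comparison and the behaviour when no job has returned yet are guesses.

```python
import math


def estimate_timreq(submittime, collecttime):
    """Largest observed collecttime - submittime, ignoring NaN (like nanmax)."""
    durations = [c - s for s, c in zip(submittime, collecttime)]
    finished = [d for d in durations if not math.isnan(d)]
    # Assumption: with no finished jobs there is no estimate yet.
    return max(finished) if finished else None


def should_resubmit(now, submitted_at, timreq):
    """Resubmit once a job has been out for more than 3*timreq."""
    if timreq is None:
        return False  # assumption: never resubmit without an estimate
    return (now - submitted_at) > 3 * timreq


if __name__ == "__main__":
    nan = float("nan")
    timreq = estimate_timreq([0, 0, 0], [5, nan, 8])
    print(timreq)                          # 8
    print(should_resubmit(25, 0, timreq))  # True
    print(should_resubmit(24, 0, timreq))  # False
```

Because the estimate is a maximum over the jobs collected so far, it can only grow as more jobs return, which matches the note that timreq might increase slightly over time.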


Robert Oostenveld - 2011-01-05 11:57:03 +0100

selected a long list of resolved bugs from roboos and changed the status into "RESOLVED"


Robert Oostenveld - 2011-01-05 12:01:08 +0100

selected all old bugs from roboos with status RESOLVED and changed it into CLOSED