
Bug 1328 - add support for SLURM backend

Status CLOSED FIXED
Reported 2012-02-13 21:45:00 +0100
Modified 2012-04-11 16:48:30 +0200
Product: FieldTrip
Component: qsub
Version: unspecified
Hardware: PC
Operating System: Mac OS
Importance: P3 minor
Assigned to: Robert Oostenveld

Robert Oostenveld - 2012-02-13 21:45:11 +0100

This feature request follows an extended email discussion. I won't post the whole discussion here, but I want to use this bugzilla page to keep everyone involved informed and as a placeholder for items that still need to be done.


Robert Oostenveld - 2012-02-13 21:47:59 +0100

Merged code from Kai and Craig into trunk, tested that the usual stuff still works on mentat, and committed to svn:

roboos@mentat001> svn commit
Sending qsub/qsubcellfun.m
Sending qsub/qsubcompile.m
Sending qsub/qsubfeval.m
Sending qsub/qsublist.m
Transmitting file data ....
Committed revision 5281.


Robert Oostenveld - 2012-02-13 21:48:24 +0100

On 13 Feb 2012, at 15:25, Roennburg, Kai wrote:

btw. Robert, how about the requirement to have either [memreq and timereq] or the queue name specified?


Robert Oostenveld - 2012-02-13 22:00:15 +0100

(In reply to comment #2)

The previous commit already made the queue an option in qsubfeval and qsubcellfun.


Robert Oostenveld - 2012-02-13 22:04:16 +0100

(In reply to comment #2)

I changed the code to allow nan or inf. These will not cause an error from within matlab, and the value will not be passed to srun/qsub on the linux command line. Specifying [] is still invalid. So the user can specify timreq and/or memreq as nan to allow the queue default to be used.

roboos@mentat001> svn commit
Sending qsub/qsubfeval.m
Transmitting file data .
Committed revision 5282.
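To illustrate the nan/inf behavior described above, here is a minimal sketch of how the resource part of the submit command could be assembled. The helper name requirementoptions is hypothetical and this is not the code that was committed to qsubfeval; the "-l walltime=" and "-l mem=" options follow the torque qsub style and are only an example, since the options for other backends differ.

```matlab
function opts = requirementoptions(timreq, memreq)
% REQUIREMENTOPTIONS is a hypothetical helper (not part of fieldtrip/qsub)
% that builds the resource options for the submit command line.
% Values that are nan or inf are skipped, so that the queue default applies.

if isempty(timreq) || isempty(memreq)
  % an empty value remains invalid, as stated in the comment above
  error('timreq and memreq should not be empty, use nan for the queue default');
end

opts = '';
if isfinite(timreq)
  % isfinite is false for both nan and inf
  opts = [opts sprintf(' -l walltime=%.0f', timreq)];
end
if isfinite(memreq)
  opts = [opts sprintf(' -l mem=%.0f', memreq)];
end
opts = strtrim(opts);
end
```

With this sketch, requirementoptions(nan, 1024) yields only the memory option and requirementoptions(nan, inf) yields an empty string, so nothing is passed on to the scheduler.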


Kai Roennburg - 2012-02-14 13:08:57 +0100

Submitting jobs with qsubcellfun now produces the following output:

submitting job roennburgk_esi-svhpc8_p24532_b1_j144... qstat job id /bin/matlab2011a

which is a result of:

[status, result] = system(cmdline);
fprintf(' qstat job id %s\n', strtrim(result));

Since srun does not give back the job id after submission, we need to code an if-then around this to prevent the output. I will incorporate this in our changes.

Kai


Robert Oostenveld - 2012-02-14 13:11:56 +0100

(In reply to comment #5) At this moment the job id cannot be known at the time that the system call returns. Instead of complicated solutions, please first try without the nohup.


Kai Roennburg - 2012-02-14 13:27:48 +0100

Slurm's submit program "srun" normally runs silently, without giving any feedback, so removing the nohup will not change the behavior. The only way I see at the moment would be to change to verbose output and grep the interesting information from the result string given below:

roennburgk@esi-svhpc8:/opt/fieldtrip/qsub$ srun -v uptime
srun-llnl: auth plugin for Munge (http://home.gna.org/munge/) loaded
srun-llnl: Consumable Resources (CR) Node Selection plugin loaded with argument 20
srun-llnl: jobid 17844: nodes(1):`ESI-svHPC6', cpu counts: 1(x1)
srun-llnl: switch NONE plugin loaded
srun-llnl: launching 17844.0 on host ESI-svHPC6, 1 tasks: 0
srun-llnl: Node ESI-svHPC6, 1 tasks started
13:21:05 up 36 days, 1:44, 2 users, load average: 8.34, 8.13, 7.96
srun-llnl: Received task exit notification for 1 task (status=0x0000).
srun-llnl: ESI-svHPC6: task 0: Completed

In that case I'd prefer to skip this information instead of string handling for the output!?

Kai


Robert Oostenveld - 2012-02-14 13:39:53 +0100

(In reply to comment #7)
> In that case I'd prefer to skip this information instead of string handling for
> the output!?

I don't have a preference whether the string handling is done in matlab or in the linux shell. I think having it in matlab makes it slightly easier to maintain. E.g.

[status, result] = system(cmdline);
switch backend
  case 'slurm'
    result = ... % e.g. use the regexp function
  otherwise
    % for torque and sge it is enough to remove the white space
    result = strtrim(result);
end
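As a sketch of what the regexp-based string handling could look like, the hypothetical helper below pulls the numeric job id out of verbose srun output such as the "jobid 17844:" line pasted in comment #7. The function name slurmjobid and the pattern are assumptions for illustration; this is not the code that ended up in qsubfeval, and it assumes that the verbose lines are captured in the result string returned by system.

```matlab
function jobid = slurmjobid(result)
% SLURMJOBID is a hypothetical helper (not part of fieldtrip/qsub) that
% extracts the job id from the verbose output of "srun -v".
% It returns an empty string if no "jobid <number>" pattern is present.

% look for the first occurrence of "jobid" followed by one or more digits
tok = regexp(result, 'jobid\s+(\d+)', 'tokens', 'once');

if isempty(tok)
  jobid = '';
else
  jobid = tok{1};
end
end
```

For example, slurmjobid('srun-llnl: jobid 17844: nodes(1)') returns '17844', whereas output without a job id line returns an empty string that the caller can use to skip the fprintf.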


Robert Oostenveld - 2012-02-16 15:40:03 +0100

On 15 Feb 2012, at 17:33, Roennburg, Kai wrote:

Hi Robert,

I adapted the new sections to slurm and put in the handling for queue, time and memory requirement. The queue handling works fine. mem and time need more testing with real jobs to see whether the restrictions really apply as wanted.

I commented out the nohup, since it only makes sense when the submit time comes close to the compute time (which was the case in my testing).

As discussed, I changed the section for qsublist('add') to use the jobname instead of the jobid. Since the job cancellation was already coded on the jobname, this was straightforward.

I'm now seeing the matlab executable printed with every call to qsubfeval, which originates from the "which matlabcmd" below.

if system(sprintf('which %s', matlabcmd))==1
  % the linux command "which" returns 0 on success and 1 on failure
  warning('the executable for "%s" could not be found, trying "matlab" instead', matlabcmd);
  % use whatever is available as default
  matlabcmd = 'matlab';
end

I must admit I don't see a way to avoid it besides redirecting the output, but I haven't tested that yet:

if system(sprintf('which %s > /dev/null ', matlabcmd))==1


Robert Oostenveld - 2012-02-16 15:45:20 +0100

(In reply to comment #9)

I don't see any harm in the output redirect for which, so I'll copy that. I have also changed the section of code that determines the matlab command so that it runs only once, using a persistent variable.

manzana> svn commit
Sending qsub/qsubfeval.m
Transmitting file data .
Committed revision 5293.
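The combination of the redirected "which" and the persistent variable could look roughly like the sketch below. The function name getmatlabcmd and the variable name cached are hypothetical, and the code committed in revision 5293 may be organized differently; the sketch assumes a linux or mac system where "which" and /dev/null are available.

```matlab
function cmd = getmatlabcmd(preferred)
% GETMATLABCMD is a hypothetical helper (not part of fieldtrip/qsub) that
% determines the matlab executable once and remembers it for later calls.

persistent cached

if isempty(cached)
  % this branch runs only on the first call in a matlab session
  cached = preferred;
  % the redirect keeps the path printed by "which" off the screen
  if system(sprintf('which %s > /dev/null', cached)) ~= 0
    warning('the executable for "%s" could not be found, trying "matlab" instead', cached);
    % use whatever is available as default
    cached = 'matlab';
  end
end

% on later calls the input argument is ignored and the remembered value is used
cmd = cached;
end
```

A consequence of this design is that the system call to "which" happens once per session rather than once per submitted job; "clear getmatlabcmd" resets the remembered value.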


Robert Oostenveld - 2012-03-10 16:17:37 +0100

I believe that this bug has been fully resolved and that in general SLURM works with fieldtrip/qsub. Small changes can of course be made in the future. If you don't agree, please reopen and explain...


Robert Oostenveld - 2012-04-11 16:48:30 +0200

I cleaned up my bugzilla list by changing the status from resolved (either fixed or wontfix) into closed. If you don't agree, please reopen the bug. Robert