SLURM/JobStatus
Revision as of 17:43, 12 July 2016
Job Status
SLURM offers a variety of tools to check the status of your jobs before, during, and after they run. When you first submit your job, SLURM should give you a job ID, which represents the resources allocated to your job. Individual calls to srun spawn job steps, which can also be queried individually.
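Since every command below takes the job ID, it can help to capture it at submission time. The following is a minimal sketch, assuming sbatch prints its usual confirmation line of the form "Submitted batch job 162"; the function name extract_job_id and the script name job.sh are hypothetical.

```shell
# Print the numeric job ID from an sbatch confirmation line read on stdin.
# Assumes the line looks like: "Submitted batch job 162"
extract_job_id() {
    awk '/^Submitted batch job [0-9]+/ { print $4; exit }'
}

# Possible usage on a submit node (job.sh is a placeholder name):
#   jobid=$(sbatch job.sh | extract_job_id)
#   squeue -j "$jobid"
```

If only the ID is wanted, sbatch --parsable should print it without the surrounding text, which avoids the parsing step.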
squeue
The squeue command shows job status in the queue. If your job has not yet started, you can ask for an estimated start time with squeue --start.
<pre>
tgray26@opensub00:squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               162     test2 helloWor  tgray26  R       0:03      2 openlab[00-01]
</pre>
If you want to see the status of individual job steps you can use squeue -s
<pre>
tgray26@opensub00:squeue -s
         STEPID     NAME PARTITION     USER      TIME NODELIST
          162.0    sleep     test2  tgray26      0:05 openlab00
          162.1    sleep     test2  tgray26      0:05 openlab01
</pre>
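When checking on a job from a script, the state column is usually the part of interest. The sketch below assumes the default squeue column order shown above (JOBID, PARTITION, NAME, USER, ST, ...) and job names without spaces; job_state is a hypothetical helper name.

```shell
# Print the ST (state) column for one job from default squeue output on stdin.
# $1 is the job ID. Prints nothing if the job is not listed.
job_state() {
    awk -v id="$1" 'NR > 1 && $1 == id { print $5; exit }'
}

# Possible usage on a submit node:
#   squeue | job_state 162
```

An empty result means the job is no longer in the queue. squeue can also do the filtering itself, for example squeue -h -j 162 -o %T, which should print only the state.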
sstat
The sstat command shows metrics from currently running job steps. If you don't specify a job step, the lowest-numbered job step is displayed.
sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize <$JOBID>.<$JOBSTEP>
<pre>
tgray26@opensub00: sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 171
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ -------- -------------------- ---------- ---------- ---------- ----------
171.0               1            openlab00          0    186060K          0    107900K
tgray26@opensub00: sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 171.1
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ -------- -------------------- ---------- ---------- ---------- ----------
171.1               1            openlab01          0    186060K          0    107900K
</pre>
Notice that if your job has no running job steps, sstat prints only the header and returns an error:
<pre>
tgray26@opensub00: sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 172
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ -------- -------------------- ---------- ---------- ---------- ----------
sstat: error: no steps running for job 172
</pre>
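A monitoring script may want to tell the two cases above apart: output with step rows versus a bare header. This is a rough sketch that treats anything after the two header lines as a step row, except lines starting with "sstat:"; has_step_rows is a hypothetical name, and the check assumes the tabular layout shown above.

```shell
# Exit 0 if sstat output on stdin has at least one step row after the
# header and separator lines; exit 1 otherwise.
has_step_rows() {
    awk 'NR > 2 && NF > 0 && $1 != "sstat:" { found = 1 } END { exit !found }'
}

# Possible usage on a submit node:
#   if sstat --format JobID,MaxRSS,MaxVMSize 171 2>&1 | has_step_rows; then
#       echo "job 171 has running steps"
#   fi
```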
If you do not run any srun commands, you will not create any job steps, and metrics will not be available for your job. Your batch scripts should follow this format:
<pre>
#!/bin/bash
#SBATCH ...
#SBATCH ...

# set environment up
module load ...

# launch job steps
srun <command to run> # that would be step 0
srun <command to run> # that would be step 1
</pre>
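To make the template concrete, here is a sketch of a helper that writes a small two-step batch script to a file. The job name, node and task counts, and the sleep commands are illustrative assumptions loosely modelled on the helloWorld job in the listings; adjust them for your site. write_batch_script is a hypothetical name. Step IDs count from 0, matching the 162.0 and 162.1 entries in the squeue -s output.

```shell
# Write a sample two-step batch script to the path given as $1.
# All #SBATCH values below are placeholders, not site requirements.
write_batch_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --job-name=helloWorld
#SBATCH --nodes=2
#SBATCH --ntasks=2

# launch job steps; each srun call creates one step
srun --nodes=1 --ntasks=1 sleep 30   # step 0
srun --nodes=1 --ntasks=1 sleep 30   # step 1
EOF
}

# Possible usage (helloWorld.sh is a placeholder name):
#   write_batch_script helloWorld.sh
#   sbatch helloWorld.sh
```

Written this way the two steps run one after the other; running them at the same time would need them backgrounded and followed by wait.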
sacct
The sacct command shows metrics from past jobs:
<pre>
tgray26@opensub00:sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
162          helloWorld      test2      staff          2  COMPLETED      0:0
162.batch         batch                 staff          1  COMPLETED      0:0
162.0             sleep                 staff          1  COMPLETED      0:0
162.1             sleep                 staff          1  COMPLETED      0:0
163          helloWorld      test2      staff          2  COMPLETED      0:0
163.batch         batch                 staff          1  COMPLETED      0:0
163.0             sleep                 staff          1  COMPLETED      0:0
</pre>
To check one specific job, you can run something like the following (if you omit <$JOBSTEP>, all job steps will be shown):
sacct --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize,Elapsed -j <$JOBID>.<$JOBSTEP>
<pre>
tgray26@opensub00:sacct --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize,Elapsed -j 171
       JobID    JobName   NTasks        NodeList     MaxRSS  MaxVMSize     AveRSS  AveVMSize    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ---------- ----------
171          helloWorld           openlab[00-01]                                               00:00:30
171.batch         batch        1       openlab00          0    119784K          0    113120K   00:00:30
171.0             sleep        1       openlab00          0    186060K          0    107900K   00:00:30
171.1             sleep        1       openlab01          0    186060K          0    107900K   00:00:30
</pre>
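Because some columns are empty for the top-level job row, the aligned table is awkward to parse by whitespace. A sketch of one alternative: ask sacct for pipe-delimited output (-P, with -n to drop the header) and pick the step with the largest MaxVMSize. The helper name peak_vmsize is hypothetical, and the code assumes every value carries a K suffix as in the listing above; other unit suffixes are not handled.

```shell
# Read "JobID|MaxVMSize" lines on stdin and print the step with the
# largest MaxVMSize. Rows with an empty MaxVMSize are skipped.
# Assumes values end in K (kibibytes); on a tie the first row wins.
peak_vmsize() {
    awk -F'|' '
        $2 != "" {
            v = $2
            sub(/K$/, "", v)
            if (v + 0 > max) { max = v + 0; id = $1 }
        }
        END { if (id != "") print id, max "K" }
    '
}

# Possible usage on a submit node:
#   sacct -n -P --format JobID,MaxVMSize -j 171 | peak_vmsize
```

With the values from job 171 above, this would report step 171.0 at 186060K.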