Difference between revisions of "SLURM/JobStatus"

From UMIACS

Revision as of 17:43, 12 July 2016

Job Status

SLURM offers a variety of tools to check the status of your jobs before, during, and after they run. When you first submit your job, SLURM should give you a job ID, which represents the resources allocated to your job. Individual calls to srun spawn job steps, which can also be queried individually.
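
The job ID and job step appear together in identifiers such as 171.0 or 171.batch in the tool output further down this page. As a minimal sketch (the function name split_step_id is hypothetical and not part of SLURM), such an identifier can be split in a script with plain shell parameter expansion:

```shell
#!/bin/bash
# Hypothetical helper: split "JOBID.STEP" into its two parts.
split_step_id() {
    local id="$1"
    local job="${id%%.*}"      # everything before the first dot
    local step="none"          # default when no step suffix is present
    case "$id" in
        *.*) step="${id#*.}" ;;   # everything after the first dot
    esac
    printf '%s %s\n' "$job" "$step"
}

split_step_id 171.1       # prints "171 1"
split_step_id 171.batch   # prints "171 batch"
split_step_id 171         # prints "171 none"
```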

squeue

The squeue command shows the status of jobs in the queue. If your job has not yet started, you can ask for an estimated start time with squeue --start.

tgray26@opensub00:squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               162     test2 helloWor  tgray26  R       0:03      2 openlab[00-01]

If you want to see the status of individual job steps, you can use squeue -s:

tgray26@opensub00:squeue -s
         STEPID     NAME PARTITION     USER      TIME NODELIST
          162.0    sleep     test2  tgray26      0:05 openlab00
          162.1    sleep     test2  tgray26      0:05 openlab01
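
For scripting, the columns of the default squeue listing can be picked apart with awk. The sketch below is an illustration only: the function name job_state is hypothetical, and it parses a copy of the sample output from this page rather than calling squeue, so it runs on any machine. It assumes the default column layout and a job name without spaces.

```shell
#!/bin/bash
# Hypothetical helper: read squeue-style output on stdin and print the
# ST (state) column for the given job ID. Skips the header line.
job_state() {
    awk -v id="$1" 'NR > 1 && $1 == id { print $5 }'
}

# Sample copied from the squeue output on this page.
sample='             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               162     test2 helloWor  tgray26  R       0:03      2 openlab[00-01]'

printf '%s\n' "$sample" | job_state 162   # prints "R"

# On a login node (assumes squeue is installed):
#   squeue | job_state 162
```

If the job ID is not in the listing, the function prints nothing, which a calling script can treat as "not queued or running".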

sstat

The sstat command shows metrics from currently running job steps. If you don't specify a job step, the lowest-numbered job step is displayed.

sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize <$JOBID>.<$JOBSTEP>
tgray26@opensub00: sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 171
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize 
------------ -------- -------------------- ---------- ---------- ---------- ---------- 
171.0               1            openlab00          0    186060K          0    107900K 
tgray26@opensub00: sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 171.1
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize 
------------ -------- -------------------- ---------- ---------- ---------- ---------- 
171.1               1            openlab01          0    186060K          0    107900K 

Note that if your job does not have any running job steps, sstat will return an error:

tgray26@opensub00: sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 172
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize 
------------ -------- -------------------- ---------- ---------- ---------- ---------- 
sstat: error: no steps running for job 172
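
In the output above, a job without running steps produces only the two header lines plus a message. A script can check for that case by looking for data rows after the header. This is a rough sketch: has_step_rows is a hypothetical name, and the sample text is copied from this page instead of being read from a live sstat call.

```shell
#!/bin/bash
# Hypothetical helper: succeed if sstat-style output on stdin contains at
# least one data row after the two header lines. Lines that begin with
# "sstat:" are treated as messages, not data.
has_step_rows() {
    awk 'NR > 2 && NF > 0 && $1 != "sstat:" { found = 1 }
         END { exit !found }'
}

with_steps='       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ -------- -------------------- ---------- ---------- ---------- ----------
171.0               1            openlab00          0    186060K          0    107900K'

no_steps='       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ -------- -------------------- ---------- ---------- ---------- ----------
sstat: error: no steps running for job 172'

if printf '%s\n' "$with_steps" | has_step_rows; then
    echo "171: metrics available"
fi
if ! printf '%s\n' "$no_steps" | has_step_rows; then
    echo "172: no running job steps"
fi
```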

If you do not run any srun commands, you will not create any job steps and metrics will not be available for your job. Your batch scripts should follow this format:

#!/bin/bash
#SBATCH ...
#SBATCH ...
# set environment up
module load ...

# launch job steps
srun <command to run> # that would be step 0
srun <command to run> # that would be step 1
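
As a concrete illustration of the template, the sketch below writes a small batch script with two sleep steps, similar to the helloWorld job in the examples on this page, and checks its syntax without submitting it. The option values (node count, task count, time limit) and the file name are assumptions chosen for the example, not site defaults.

```shell
#!/bin/bash
# Write a small example batch script to a temporary directory and
# syntax-check it. All option values below are illustrative assumptions.
workdir=$(mktemp -d)
script="$workdir/helloWorld.sh"

cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=helloWorld
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --time=00:05:00

# launch job steps; each srun call creates one step
srun --nodes=1 --ntasks=1 sleep 30   # first job step
srun --nodes=1 --ntasks=1 sleep 30   # second job step
EOF

# bash -n parses the script without executing it.
bash -n "$script" && echo "syntax OK: $script"

# On a login node you would then submit it with:
#   sbatch "$script"
```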

sacct

The sacct command shows metrics from past jobs:

tgray26@opensub00:sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
162          helloWorld      test2      staff          2  COMPLETED      0:0 
162.batch         batch                 staff          1  COMPLETED      0:0 
162.0             sleep                 staff          1  COMPLETED      0:0 
162.1             sleep                 staff          1  COMPLETED      0:0 
163          helloWorld      test2      staff          2  COMPLETED      0:0 
163.batch         batch                 staff          1  COMPLETED      0:0 
163.0             sleep                 staff          1  COMPLETED      0:0 
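
A listing like this can be summarised in a script. The sketch below counts the step records (rows whose JobID contains a dot, including the batch step) for each job. The function name steps_per_job is hypothetical, and the input is a shortened copy of the sample rows above rather than live sacct output.

```shell
#!/bin/bash
# Hypothetical helper: read sacct-style output on stdin and print, for each
# job ID, how many step records (JOBID.something rows) it has.
steps_per_job() {
    awk 'NR > 2 && index($1, ".") > 0 {
             split($1, parts, ".")                  # parts[1] is the job ID
             if (!(parts[1] in count)) order[++n] = parts[1]
             count[parts[1]]++
         }
         END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }'
}

# Sample rows copied from the sacct output on this page.
sample='       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
162          helloWorld      test2      staff          2  COMPLETED      0:0
162.batch         batch                 staff          1  COMPLETED      0:0
162.0             sleep                 staff          1  COMPLETED      0:0
162.1             sleep                 staff          1  COMPLETED      0:0
163          helloWorld      test2      staff          2  COMPLETED      0:0
163.batch         batch                 staff          1  COMPLETED      0:0
163.0             sleep                 staff          1  COMPLETED      0:0'

printf '%s\n' "$sample" | steps_per_job
# prints:
# 162 3
# 163 2
```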

To check one specific job, you can run something like the following (if you omit <$JOBSTEP>, all job steps will be shown):

sacct  --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize,Elapsed -j <$JOBID>.<$JOBSTEP>
tgray26@opensub00:sacct  --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize,Elapsed -j 171
       JobID    JobName   NTasks        NodeList     MaxRSS  MaxVMSize     AveRSS  AveVMSize    Elapsed 
------------ ---------- -------- --------------- ---------- ---------- ---------- ---------- ---------- 
171          helloWorld           openlab[00-01]                                               00:00:30 
171.batch         batch        1       openlab00          0    119784K          0    113120K   00:00:30 
171.0             sleep        1       openlab00          0    186060K          0    107900K   00:00:30 
171.1             sleep        1       openlab01          0    186060K          0    107900K   00:00:30
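
To pull a single number out of these metrics, pipe-delimited output is easier to parse than the aligned table. The sketch below finds the step with the largest MaxVMSize. The function name peak_vmsize is hypothetical, and the sample is the table above rewritten by hand as JobID|MaxVMSize pairs, which is the shape sacct is assumed to produce with --parsable2 --noheader.

```shell
#!/bin/bash
# Hypothetical helper: read "JobID|MaxVMSize" lines on stdin and print the
# record with the largest MaxVMSize, normalised to kilobytes.
peak_vmsize() {
    awk -F'|' '
        # Convert a value such as "186060K", "12M" or "1G" to kilobytes.
        function to_kb(v,    n, u) {
            n = v + 0                      # numeric prefix of the value
            u = substr(v, length(v))       # trailing unit letter
            if (u == "M") return n * 1024
            if (u == "G") return n * 1024 * 1024
            return n
        }
        $2 != "" {
            kb = to_kb($2)
            if (kb > max) { max = kb; id = $1 }   # first record wins on ties
        }
        END { if (id != "") printf "%s %dK\n", id, max }'
}

# Values copied from the sacct output for job 171 on this page.
sample='171|
171.batch|119784K
171.0|186060K
171.1|186060K'

printf '%s\n' "$sample" | peak_vmsize   # prints "171.0 186060K"

# On a login node (assumes sacct is installed):
#   sacct -j 171 --format JobID,MaxVMSize --parsable2 --noheader | peak_vmsize
```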