SLURM/JobStatus

From UMIACS
Revision as of 15:44, 7 May 2021 by Mbaney (talk | contribs)

Job Status

SLURM offers a variety of tools to check the status of your jobs before, during, and after execution. When you submit a job, SLURM gives you a job ID that identifies the job and the resources allocated to it. Each call to srun spawns a job step, and job steps can also be queried individually.

squeue

The squeue command shows job status in the queue. Helpful flags:

  • -u username to show only your jobs (replace username with your UMIACS username)
  • --start to show the estimated start time of a job that has not yet started and the reason it is waiting
  • -s to show the status of individual job steps for a job (e.g. batch jobs)

Examples:

username@opensub00:squeue -u username
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               162     test2 helloWor username  R       0:03      2 openlab[00-01]
username@opensub00:squeue --start -u username
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
               163     test2 helloWo2 username PD 2020-05-11T18:36:49      1 openlab02            (Priority)
username@opensub00:squeue -s -u username
         STEPID     NAME PARTITION     USER      TIME NODELIST
          162.0    sleep     test2 username      0:05 openlab00
          162.1    sleep     test2 username      0:05 openlab01
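For scripting, it can help to ask squeue for a fixed, headerless format instead of parsing the default table. The following is a minimal sketch, not an official UMIACS tool: summarize_states is a hypothetical helper name, and it assumes input lines of the form "<jobid> <state>", such as those produced by the squeue options -h (no header) and -o "%i %t" (job ID and compact state).

```shell
#!/bin/bash
# Hypothetical helper: count jobs per state.
# Reads lines of the form "<jobid> <state>" on stdin and prints
# "<state> <count>" pairs, sorted by state code.
summarize_states() {
    awk 'NF >= 2 { count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage on a submission node (requires squeue):
#   squeue -h -u "$USER" -o "%i %t" | summarize_states
```

With one running and two pending jobs, this would print one line for PD and one for R, each followed by its count.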

sstat

The sstat command shows metrics from currently running job steps. If you don't specify a job step, the lowest-numbered job step is displayed.

sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize <$JOBID>.<$JOBSTEP>
username@opensub00: sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 171
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize 
------------ -------- -------------------- ---------- ---------- ---------- ---------- 
171.0               1            openlab00          0    186060K          0    107900K 
username@opensub00: sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 171.1
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize 
------------ -------- -------------------- ---------- ---------- ---------- ---------- 
171.1               1            openlab01          0    186060K          0    107900K 
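The memory columns above are reported with a unit suffix (for example 186060K), which is awkward to compare by eye. As a rough sketch, the hypothetical helper sizes_to_mib below converts such values to MiB. It assumes pipe-delimited input of the form "<stepid>|<size>", as produced by the sstat options -n (no header) and -P (parsable output), and it assumes the K/M/G suffixes are binary multiples.

```shell
#!/bin/bash
# Hypothetical helper: convert "<stepid>|<size>" lines into "<stepid> <MiB>".
# <size> may look like 186060K, 512M, 2G, or a bare number of bytes.
# Assumption: K/M/G are binary multiples (1K = 1024 bytes).
sizes_to_mib() {
    awk -F'|' '
        NF >= 2 {
            value = $2 + 0                  # numeric prefix, e.g. 186060
            unit = substr($2, length($2))   # last character, e.g. "K"
            if (unit == "K")      mib = value / 1024
            else if (unit == "M") mib = value
            else if (unit == "G") mib = value * 1024
            else                  mib = value / (1024 * 1024)
            printf "%s %.1f\n", $1, mib
        }'
}

# Usage on a submission node (requires sstat and a running job step):
#   sstat -n -P --format JobID,MaxVMSize -j 171 | sizes_to_mib
```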

Note that if your job has no running job steps, sstat will return an error.

username@opensub00: sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 172
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize 
------------ -------- -------------------- ---------- ---------- ---------- ----------
sstat: error: no steps running for job 172
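One way to avoid this error in a script is to check for running job steps before calling sstat. The sketch below is illustrative only: steps_of_job is a hypothetical helper that filters the output of squeue -s (shown earlier on this page) down to the step IDs belonging to one job.

```shell
#!/bin/bash
# Hypothetical helper: print the step IDs that belong to one job.
# Reads "squeue -s" output on stdin; the first column is the step ID
# (for example 162.0). The header line never matches and is skipped.
steps_of_job() {
    local jobid="$1"
    awk -v prefix="${jobid}." 'index($1, prefix) == 1 { print $1 }'
}

# Usage on a submission node (requires squeue and sstat):
#   if [ -n "$(squeue -s -u "$USER" | steps_of_job 172)" ]; then
#       sstat --format JobID,MaxRSS,MaxVMSize 172
#   else
#       echo "job 172 has no running job steps"
#   fi
```

Matching on the prefix "172." rather than "172" keeps a job such as 1720 from being mistaken for job 172.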

If you do not run any srun commands, you will not create any job steps and metrics will not be available for your job. Your batch scripts should follow this format:

#!/bin/bash
#SBATCH ...
#SBATCH ...
# set environment up
module load ...

# launch job steps
srun <command to run> # this becomes step 0
srun <command to run> # this becomes step 1

sacct

The sacct command shows metrics from past jobs.

username@opensub00:sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
162          helloWorld      test2      staff          2  COMPLETED      0:0 
162.batch         batch                 staff          1  COMPLETED      0:0 
162.0             sleep                 staff          1  COMPLETED      0:0 
162.1             sleep                 staff          1  COMPLETED      0:0 
163          helloWorld      test2      staff          2  COMPLETED      0:0 
163.batch         batch                 staff          1  COMPLETED      0:0 
163.0             sleep                 staff          1  COMPLETED      0:0 

To check one specific job, you can run something like the following (if you omit .<$JOBSTEP>, all job steps will be shown):

sacct  --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize,Elapsed -j <$JOBID>.<$JOBSTEP>
username@opensub00:sacct  --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize,Elapsed -j 171
       JobID    JobName   NTasks        NodeList     MaxRSS  MaxVMSize     AveRSS  AveVMSize    Elapsed 
------------ ---------- -------- --------------- ---------- ---------- ---------- ---------- ---------- 
171          helloWorld           openlab[00-01]                                               00:00:30 
171.batch         batch        1       openlab00          0    119784K          0    113120K   00:00:30 
171.0             sleep        1       openlab00          0    186060K          0    107900K   00:00:30 
171.1             sleep        1       openlab01          0    186060K          0    107900K   00:00:30
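When a job has many steps, it can be useful to list only the ones that did not finish cleanly. The following is a minimal sketch under stated assumptions: not_completed is a hypothetical helper name, and it expects pipe-delimited lines of the form "<jobid>|<state>|<exitcode>", as produced by the sacct options -n (no header) and -P (parsable output) with --format JobID,State,ExitCode.

```shell
#!/bin/bash
# Hypothetical helper: print job and step records whose state is
# anything other than COMPLETED.
# Reads "<jobid>|<state>|<exitcode>" lines on stdin.
not_completed() {
    awk -F'|' 'NF >= 3 && $2 != "COMPLETED" { print $1, $2, $3 }'
}

# Usage on a submission node (requires sacct):
#   sacct -n -P --format JobID,State,ExitCode -j 171 | not_completed
```

An empty result means every record for the job reported COMPLETED.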