Job Status
SLURM offers a variety of tools to check the status of your jobs before, during, and after execution. When you first submit your job, SLURM gives you a job ID that represents the resources allocated to your job. Individual calls to srun spawn job steps, which can also be queried individually.
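For example (the script name and job number below are illustrative), submitting a batch script prints the job ID that the commands in the rest of this page operate on:

username@opensub00:sbatch myjob.sh
Submitted batch job 171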
squeue
The squeue command shows job status in the queue. Helpful flags:
* -u username - show only your jobs (replace username with your UMIACS username)
* --start - show the estimated start time for a job that has not yet started and the reason why it is waiting
* -s - show the status of individual job steps for a job (e.g. batch jobs)
Examples:
username@opensub00:squeue -u username
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               162     test2 helloWor username  R       0:03      2 openlab[00-01]

username@opensub00:squeue --start -u username
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES     NODELIST(REASON)
               163     test2 helloWo2 username PD 2020-05-11T18:36:49      1 openlab02      (Priority)

username@opensub00:squeue -s -u username
         STEPID     NAME PARTITION     USER      TIME NODELIST
          162.0    sleep     test2 username      0:05 openlab00
          162.1    sleep     test2 username      0:05 openlab01
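squeue also accepts a format string if you want different columns in the output; a minimal sketch, with an arbitrary choice of fields (job ID, partition, state, elapsed time, time limit, and reason/nodes):

# customize the columns squeue prints for your jobs
squeue -u username -o "%.10i %.9P %.8T %.10M %.10l %R"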
sstat
The sstat command shows metrics from currently running job steps. If you don't specify a job step, the lowest job step is displayed.
sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize <$JOBID>.<$JOBSTEP>
username@opensub00: sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 171
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ -------- -------------------- ---------- ---------- ---------- ----------
171.0               1            openlab00          0    186060K          0    107900K

username@opensub00: sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 171.1
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ -------- -------------------- ---------- ---------- ---------- ----------
171.1               1            openlab01          0    186060K          0    107900K
Note that if you do not have any job steps, sstat will return an error.
username@opensub00: sstat --format JobID,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize 172
       JobID   NTasks             Nodelist     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ -------- -------------------- ---------- ---------- ---------- ----------
sstat: error: no steps running for job 237
If you do not run any srun commands, you will not create any job steps and metrics will not be available for your job. Your batch scripts should follow this format:
#!/bin/bash
#SBATCH ...
#SBATCH ...

# set environment up
module load ...

# launch job steps
srun <command to run>  # this will be job step 0
srun <command to run>  # this will be job step 1
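As a concrete illustration (the job name, resources, module, and commands are all hypothetical; substitute your own), a script following this pattern might look like:

#!/bin/bash
#SBATCH --job-name=helloWorld
#SBATCH --ntasks=2
#SBATCH --time=00:05:00

# set environment up (hypothetical module name)
module load Python3

# launch job steps
srun sleep 30    # becomes job step 0 (e.g. 171.0)
srun sleep 30    # becomes job step 1 (e.g. 171.1)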
sacct
The sacct command shows metrics from past jobs.
username@opensub00:sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
162          helloWorld      test2      staff          2  COMPLETED      0:0
162.batch         batch                 staff          1  COMPLETED      0:0
162.0             sleep                 staff          1  COMPLETED      0:0
162.1             sleep                 staff          1  COMPLETED      0:0
163          helloWorld      test2      staff          2  COMPLETED      0:0
163.batch         batch                 staff          1  COMPLETED      0:0
163.0             sleep                 staff          1  COMPLETED      0:0
To check one specific job, you can run something like the following (if you omit .<$JOBSTEP>, all job steps will be shown):
sacct --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize,Elapsed -j <$JOBID>.<$JOBSTEP>
username@opensub00:sacct --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize,Elapsed -j 171
       JobID    JobName   NTasks        NodeList     MaxRSS  MaxVMSize     AveRSS  AveVMSize    Elapsed
------------ ---------- -------- --------------- ---------- ---------- ---------- ---------- ----------
171          helloWorld           openlab[00-01]                                                00:00:30
171.batch         batch        1       openlab00          0    119784K          0    113120K    00:00:30
171.0             sleep        1       openlab00          0    186060K          0    107900K    00:00:30
171.1             sleep        1       openlab01          0    186060K          0    107900K    00:00:30
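sacct can also narrow results by time window and job state, which helps when looking for jobs that failed during a specific period; a minimal sketch (the dates and state filter are illustrative):

# one line per job (no steps) for jobs that failed between the two dates
sacct -X --state=FAILED -S 2022-04-18 -E 2022-04-19 --format JobID,JobName,Partition,State,ExitCode,Elapsed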
Job Codes
When you list the currently running jobs and your job is in the PD (Pending) state, SLURM provides some information on the reason for this in the NODELIST(REASON) column. You can use scontrol show job <jobid> to get all the parameters for your job, which may be required to identify why your job is not running.
# squeue -u testuser
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            581530     dpart     bash testuser PD       0:00      1 (AssocGrpGRES)
            581533     dpart     bash testuser PD       0:00      1 (Resources)
            581534     dpart     bash testuser PD       0:00      1 (QOSMaxGRESPerUser)
            581535 scavenger     bash testuser PD       0:00      1 (ReqNodeNotAvail, Reserved for maintenance)
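The full scontrol show job output is long; one way to pull out just the state, pending reason, time limit, and requested resources is to filter it (a sketch; the exact field names can vary between SLURM versions):

# JobState and Reason appear on one line, TimeLimit and TRES on others
scontrol show job 581530 | grep -E 'JobState|TimeLimit|TRES'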
Some common reason codes are as follows:
* Resources - The cluster does not currently have the resources to fit your job.
* QOSMaxGRESPerUser - The quality of service (QoS) your job is running in has a per-user limit on resources. Use show_qos to identify the limits and then use scontrol show job <jobid> for each of your jobs running in that QoS.
* AssocGrpGRES - The SLURM account you are using has a limit on the total resources available to the account. Use sacctmgr show assoc account=<account_name> to identify the GrpTRES limit. You can see all jobs running under the account by running squeue -A account_name and then find out more information on each job with scontrol show job <jobid> (see the sketch after this list).
* ReqNodeNotAvail - If you have requested a specific node and it is currently in use, you can get this job code. You can also get this code alongside Reserved for maintenance, which means a reservation is in place (often for a maintenance window). You can see the current reservations by running scontrol show reservation. Often the culprit is that you have requested a TimeLimit that conflicts with the reservation; lower your TimeLimit or leave your job to wait until the reservation completes.
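As a sketch of how these investigations fit together for the account-related and reservation-related cases above (the account name is hypothetical; substitute your own):

# AssocGrpGRES: what is the account's total (GrpTRES) limit, and what is it already running?
sacctmgr show assoc account=myaccount format=Account,User,GrpTRES
squeue -A myaccount -t RUNNING

# ReqNodeNotAvail / Reserved for maintenance: is a reservation blocking the job?
scontrol show reservation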