SLURM/JobSubmission
Job Submission
SLURM offers a variety of ways to run jobs. It is important to understand the different options available and how to request the resources required for a job in order for it to run successfully. All job submission should be done from submit nodes; any computational code should be run in a job allocation on compute nodes. The following commands outline how to allocate resources on the compute nodes and submit processes to be run on the allocated nodes.
The cluster that everyone with a UMIACS account has access to is Nexus. Please visit the Nexus page for instructions on how to connect to your assigned submit nodes.
Please note that the hard maximum number of jobs that the SLURM scheduler on Nexus can handle at once is 50000. Keep the number of jobs you have submitted at any given time significantly below this limit, in case another user also wants to submit a large number of jobs.
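One way to stay well under this limit is to check how many jobs you already have in the queue before submitting more. The following is only a minimal sketch, not an official UMIACS tool. The personal cap of 1000 and the function names below_cap and count_my_jobs are hypothetical. It relies on squeue's -u (user) and -h (no header) options.

```shell
#!/bin/bash
# Hypothetical personal cap, far below the scheduler-wide maximum of 50000.
MAX_MY_JOBS=1000

# Succeeds (exit status 0) if the given job count is below the cap.
below_cap() {
    [ "$1" -lt "$MAX_MY_JOBS" ]
}

# Count your jobs currently known to the scheduler (-h suppresses the header).
# Falls back to 0 if squeue is unavailable, e.g. when not on a submit node.
count_my_jobs() {
    if command -v squeue >/dev/null 2>&1; then
        squeue -u "$USER" -h | wc -l
    else
        echo 0
    fi
}

if below_cap "$(count_my_jobs)"; then
    echo "OK to submit more jobs"
else
    echo "Too many jobs queued; wait before submitting more"
fi
```

A submission loop could call below_cap before each sbatch and sleep while it fails.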
Computational jobs run on submit nodes will be terminated. Please use compute nodes for running computational jobs.
For details on how SLURM decides how to schedule jobs when multiple jobs are waiting in a scheduler's queue, please see SLURM/Priority.
srun
The srun command is used to run a process on the compute nodes in the cluster. If you pass it a normal shell command (or command that executes a script), it will submit a job to run that shell command/script on a compute node and then return. srun accepts many command line options to specify the resources required by the command passed to it. Some common command line arguments are listed below and full documentation of all available options is available in the man page for srun, which can be accessed by running man srun.
    $ srun --qos=default --mem=100mb --time=1:00:00 bash -c 'echo "Hello World from" `hostname`'
    Hello World from tron33.umiacs.umd.edu
It is important to understand that srun is an interactive command. By default, input to srun is broadcast to all compute nodes running your process, and output from the compute nodes is redirected to srun. This behavior can be changed; however, srun will always wait for the command passed to it to finish before exiting, so if you start a long-running process and end your terminal session, your process will stop running on the compute nodes and your job will end. To run a non-interactive submission that will remain running after you log out, you will need to wrap your srun commands in a batch script and submit it with sbatch.
Common srun arguments
- --job-name=helloWorld: name of your job
- --mem=1gb: request 1GB of memory; if no unit is given, MB is assumed
- --ntasks=2: request 2 "tasks", which map to cores on a CPU; if passed to srun, the given command will be run concurrently on each core
- --nodes=2: if passed to srun, the given command will be run concurrently on each node
- --nodelist=$NODENAME: request to run your job on the $NODENAME node
- --time=hh:mm:ss: time needed to run your job
- --error=filename: file to redirect stderr to
- --partition=$PNAME: request that your job run in the $PNAME partition
- --qos=default: to see the available QOS options on a cluster, run show_qos
- --account=accountname: use a QOS specific to an account
- --output=filename: file to redirect stdout to
- --requeue: automatically requeue your job if it is preempted
Interactive Shell Sessions
An interactive shell session on a compute node can be useful for debugging or developing code that isn't ready to be run as a batch job. To get an interactive shell on a node, use srun to invoke a shell:
    $ srun --pty --qos=default --mem 1gb --time=01:00:00 bash
    $ hostname
    tron33.umiacs.umd.edu
Please do not leave interactive shells running for long periods of time when you are not working. This blocks resources from being used by everyone else.
salloc
The salloc command can also be used to request resources be allocated without needing a batch script. Running salloc with a list of resources will allocate the resources you requested, create a job, and drop you into a subshell with the environment variables necessary to run commands in the newly created job allocation. When your time is up or you exit the subshell, your job allocation will be relinquished.
    $ salloc --qos=default -N 1 --mem=2gb --time=01:00:00
    salloc: Granted job allocation 159
    $ srun /usr/bin/hostname
    tron33.umiacs.umd.edu
    $ exit
    exit
    salloc: Relinquishing job allocation 159
Please note that any commands not invoked with srun will be run locally on the submit node. Please be careful when using salloc.
sbatch
The sbatch command allows you to write a batch script to be submitted and run non-interactively on the compute nodes. To run a simple Hello World command on the compute nodes, you could write a file, helloWorld.sh, with the following contents:
    #!/bin/bash
    srun bash -c 'echo Hello World from `hostname`'
Then you need to submit the script with sbatch and request resources:
    $ sbatch --qos=default --mem=1gb --time=1:00:00 helloWorld.sh
    Submitted batch job 121
SLURM will return a job number that you can use to check the status of your job with squeue:
    $ squeue
       JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         121      tron helloWor username  R       0:01      1 tron32
Advanced Batch Scripts
You can also write a batch script with all of your resources/options defined in the script itself. This is useful for jobs that need to be run tens, hundreds, or thousands of times. Within the script, you can handle any necessary environment setup and then run commands on the resources you requested by invoking them with srun. The srun commands can also be more complex and can be told to use only portions of your entire job allocation; each of these distinct srun commands makes up one "job step". The batch script itself runs on the first node allocated to your job, and each job step runs on whatever resources you tell it to use.
In the following example, we have a batch job that requests 2 nodes in the cluster. The script then submits two job steps, each using one node, that print the node's hostname and its Python version. Since srun blocks until the command finishes, we use the '&' operator to background each process so that both job steps can run at once; however, this means that we then need to use the wait command to block until all background processes have finished.
    #!/bin/bash

    # Lines that begin with #SBATCH specify commands to be used by SLURM for scheduling

    #SBATCH --job-name=helloWorld         # sets the job name
    #SBATCH --output=helloWorld.out.%j    # indicates a file to redirect STDOUT to; %j is the jobid. If set, must be set to a file instead of a directory or else submission will fail.
    #SBATCH --error=helloWorld.out.%j     # indicates a file to redirect STDERR to; %j is the jobid. If set, must be set to a file instead of a directory or else submission will fail.
    #SBATCH --time=00:05:00               # how long you would like your job to run; format=hh:mm:ss
    #SBATCH --qos=default                 # set QOS, this will determine what resources can be requested
    #SBATCH --nodes=2                     # number of nodes to allocate for your job
    #SBATCH --ntasks=4                    # request that 4 cpu cores in total be reserved for your job
    #SBATCH --ntasks-per-node=2           # request 2 cpu cores be reserved per node
    #SBATCH --mem=1gb                     # memory required by job; if unit is not specified MB will be assumed

    srun -N 1 --mem=512mb bash -c "hostname; python3 --version" &   # use srun to invoke commands within your job; using an '&'
    srun -N 1 --mem=512mb bash -c "hostname; python3 --version" &   # will background the process allowing them to run concurrently
    wait                                                            # wait for any background processes to complete

    # once the end of the batch script is reached your job allocation will be revoked
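Note that a bare wait returns success even if one of the backgrounded commands failed. The sketch below shows one way to collect each background process's exit status by waiting on its PID. Here fake_step is a hypothetical stand-in for an srun job step, so the pattern can be tried outside of a job allocation; in a real batch script you would background the srun commands instead.

```shell
#!/bin/bash
# Hypothetical stand-in for a job step (an srun command in a real batch script).
# Prints a message and exits with the status given as its second argument.
fake_step() {
    sleep 1
    echo "step $1 finished"
    return "$2"
}

fake_step A 0 &
pid_a=$!        # PID of the most recently backgrounded process
fake_step B 3 &
pid_b=$!

# Waiting on a specific PID returns that process's exit status.
status_a=0; wait "$pid_a" || status_a=$?
status_b=0; wait "$pid_b" || status_b=$?

echo "step A exit status: $status_a"
echo "step B exit status: $status_b"
```

Both steps still run concurrently; only the collection of their exit statuses is sequential.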
Another useful thing to know is that you can pass additional arguments into your sbatch scripts on the command line and reference them as ${1} for the first argument and so on.
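As a minimal sketch of this, the script below reads two positional arguments and falls back to defaults when they are omitted. The argument meanings (an input file and an iteration count), the default values, and the function name describe_args are all hypothetical. The echo stands in for whatever srun command would consume the arguments.

```shell
#!/bin/bash
#SBATCH --job-name=argsDemo
#SBATCH --time=00:05:00
#SBATCH --qos=default
#SBATCH --mem=1gb

# Hypothetical arguments: ${1} is an input file, ${2} is an iteration count.
# The ${N:-default} form substitutes a default when the argument is missing.
describe_args() {
    local input="${1:-input.txt}"
    local iterations="${2:-10}"
    # In a real job this could be something like: srun ./my_program "$input" "$iterations"
    echo "input=${input} iterations=${iterations}"
}

# "$@" forwards every argument given after the script name on the sbatch command line.
describe_args "$@"
```

If this were saved as argsDemo.sh (a hypothetical name), submitting it with sbatch argsDemo.sh data.csv 5 would set ${1} to data.csv and ${2} to 5.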
More Examples
scancel
The scancel command can be used to cancel job allocations or job steps that are no longer needed. It can be passed individual job IDs or an option to delete all of your jobs or jobs that meet certain criteria.
- scancel 255: cancel job 255
- scancel 255.3: cancel job step 3 of job 255
- scancel --user username --partition=tron: cancel all jobs for username in the tron partition
Identifying Resources and Features
The sinfo command can show you additional features of nodes in the cluster, but you need to ask it to show some non-default options using a command like sinfo -o "%40N %8c %8m %35f %35G".
    $ sinfo -o "%40N %8c %8m %35f %35G"
    NODELIST                                 CPUS     MEMORY   AVAIL_FEATURES                      GRES
    legacy00                                 48       125940   rhel8,Zen,EPYC-7402                 (null)
    legacy[01-11,13-19,22-28,30]             12+      61804+   rhel8,Xeon,E5-2620                  (null)
    cbcb[23-24],twist[02-05]                 24       255150   rhel8,Xeon,E5-2650                  (null)
    cbcb26                                   128      513243   rhel8,Zen,EPYC-7763,Ampere          gpu:rtxa5000:8
    cbcb27                                   64       255167   rhel8,Zen,EPYC-7513,Ampere          gpu:rtxa6000:8
    cbcb[00-21]                              32       2061175  rhel8,Zen,EPYC-7313                 (null)
    cbcb22,cmlcpu[00,06-07],legacy20         24+      384270+  rhel8,Xeon,E5-2680                  (null)
    cbcb25                                   24       255278   rhel8,Xeon,E5-2650,Pascal,Turing    gpu:rtx2080ti:1,gpu:gtx1080ti:1
    legacy21                                 8        61746    rhel8,Xeon,E5-2623                  (null)
    tron[06-09,12-15,21]                     16       126214+  rhel8,Zen,EPYC-7302P,Ampere         gpu:rtxa4000:4
    tron[10-11,16-20,34]                     16       126217   rhel8,Zen,EPYC-7313P,Ampere         gpu:rtxa4000:4
    tron[22-33,35-45]                        16       126214+  rhel8,Zen,EPYC-7302,Ampere          gpu:rtxa4000:4
    clip11                                   16       126217   rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa4000:4
    clip00                                   32       255276   rhel8,Xeon,E5-2683,Pascal           gpu:titanxpascal:3
    clip02                                   20       126255   rhel8,Xeon,E5-2630,Pascal           gpu:gtx1080ti:3
    clip03                                   20       126243   rhel8,Xeon,E5-2630,Pascal,Turing    gpu:rtx2080ti:1,gpu:gtx1080ti:2
    clip04                                   32       255233   rhel8,Zen,EPYC-7302,Ampere          gpu:rtx3090:4
    clip[05-06]                              24       126216   rhel8,Zen,EPYC-7352,Ampere          gpu:rtxa6000:2
    clip07                                   8        255263   rhel8,Xeon,E5-2623,Pascal           gpu:gtx1080ti:3
    clip09                                   32       383043   rhel8,Xeon,6130,Pascal,Turing       gpu:rtx2080ti:5,gpu:gtx1080ti:3
    clip13,cml30,vulcan[29-32]               32       255218+  rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa6000:8
    clip08,vulcan[08-22,25]                  32       255258+  rhel8,Xeon,E5-2683,Pascal           gpu:gtx1080ti:8
    clip12,gammagpu[10-17]                   16       126203+  rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa6000:4
    clip01                                   32       255276   rhel8,Xeon,E5-2683,Pascal           gpu:titanxpascal:1,gpu:titanxp:2
    clip10                                   44       1029404  rhel8,Xeon,E5-2699                  (null)
    cml[00,02-11,13-14],tron[62-63,65-66,68- 32       351530+  rhel8,Xeon,4216,Turing              gpu:rtx2080ti:8
    cml01                                    32       383030   rhel8,Xeon,4216,Turing              gpu:rtx2080ti:6
    cml12                                    32       383038   rhel8,Xeon,4216,Turing,Ampere       gpu:rtx2080ti:7,gpu:rtxa4000:1
    cml[15-16]                               32       383038   rhel8,Xeon,4216,Turing              gpu:rtx2080ti:7
    cml[17-28],gammagpu05                    32       255225+  rhel8,Zen,EPYC-7282,Ampere          gpu:rtxa4000:8
    cml31                                    32       384094   rhel8,Zen,EPYC-9124,Ampere          gpu:a100:1
    cml32                                    64       512999   rhel8,Zen,EPYC-7543,Ampere          gpu:a100:4
    cmlcpu[01-04]                            20       384271   rhel8,Xeon,E5-2660                  (null)
    gammagpu00                               32       255233   rhel8,Zen,EPYC-7302,Ampere          gpu:rtxa5000:8
    mbrc[00-01]                              20       189498   rhel8,Xeon,4114,Turing              gpu:rtx2080ti:8
    twist[00-01]                             8        61727    rhel8,Xeon,E5-1660                  (null)
    legacygpu08                              20       513327   rhel8,Xeon,E5-2640,Maxwell          gpu:m40:2
    brigid[16-17]                            48       512897   rhel8,Zen,EPYC-7443                 (null)
    brigid[18-19]                            20       61739    rhel8,Xeon,E5-2640                  (null)
    legacygpu06                              20       255249   rhel8,Xeon,E5-2699,Maxwell          gpu:gtxtitanx:4
    tron[00-05]                              32       255233   rhel8,Zen,EPYC-7302,Ampere          gpu:rtxa6000:8
    tron[46-61]                              48       255232   rhel8,Zen,EPYC-7352,Ampere          gpu:rtxa5000:8
    tron[64,67]                              32       383028+  rhel8,Xeon,4216,Turing,Ampere       gpu:rtx2080ti:7,gpu:rtx3070:1
    vulcan00                                 32       255259   rhel8,Xeon,E5-2683,Pascal           gpu:p6000:7,gpu:p100:1
    vulcan[01-04,06-07]                      32       255259   rhel8,Xeon,E5-2683,Pascal           gpu:p6000:8
    vulcan05                                 32       255259   rhel8,Xeon,E5-2683,Pascal           gpu:p6000:7
    janus[02-04]                             40       383025   rhel8,Xeon,6248,Turing              gpu:rtx2080ti:10
    legacygpu00                              20       255249   rhel8,Xeon,E5-2650,Pascal           gpu:titanxp:4
    legacygpu[01-02,07]                      20       255249+  rhel8,Xeon,E5-2650,Maxwell          gpu:gtxtitanx:4
    legacygpu[03-04]                         16       255268   rhel8,Xeon,E5-2630,Maxwell          gpu:gtxtitanx:2
    legacygpu05                              44       513193   rhel8,Xeon,E5-2699,Pascal           gpu:gtx1080ti:4
    vulcan23                                 32       383030   rhel8,Xeon,4612,Turing              gpu:rtx2080ti:8
    vulcan26                                 24       770126   rhel8,Xeon,6146,Pascal              gpu:titanxp:10
    vulcan[27-28]                            56       770093   rhel8,Xeon,8280,Turing              gpu:rtx2080ti:10
    vulcan24                                 16       126216   rhel8,Zen,7282,Ampere               gpu:rtxa6000:4
    gammagpu[01-04,06-09],vulcan[33-37]      32       255215+  rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa5000:8
    vulcan[38-44]                            32       255215   rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa4000:8
Note that not all of the nodes shown by this command are necessarily in a partition you are able to submit to.
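Because this sinfo format produces whitespace-separated columns, the output can be filtered with standard text tools. The sketch below, which is an illustration rather than a UMIACS-provided tool, prints the node names whose GRES column contains a given string. The function name nodes_with_gres is hypothetical, and the sample rows are copied from the output above so that the filter can be tried without access to the cluster.

```shell
#!/bin/bash
# Print the NODELIST column of every row whose GRES column (field 5)
# contains the given string. Expects sinfo output produced with
# -o "%40N %8c %8m %35f %35G" on stdin; NR > 1 skips the header row.
nodes_with_gres() {
    awk -v want="$1" 'NR > 1 && index($5, want) > 0 { print $1 }'
}

# A few sample rows taken from the sinfo output shown above.
sample='NODELIST CPUS MEMORY AVAIL_FEATURES GRES
legacy00 48 125940 rhel8,Zen,EPYC-7402 (null)
cbcb26 128 513243 rhel8,Zen,EPYC-7763,Ampere gpu:rtxa5000:8
tron[46-61] 48 255232 rhel8,Zen,EPYC-7352,Ampere gpu:rtxa5000:8
clip04 32 255233 rhel8,Zen,EPYC-7302,Ampere gpu:rtx3090:4'

printf '%s\n' "$sample" | nodes_with_gres rtxa5000
```

On a submit node, the same function could be fed live data with sinfo -o "%40N %8c %8m %35f %35G" | nodes_with_gres rtxa5000.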
You can identify further specific information about a node using scontrol with various flags.
There are also two command aliases developed by UMIACS staff to show various node information in aggregate. They are show_nodes and show_available_nodes.
show_nodes
The show_nodes command alias shows each node's name, number of CPUs, memory, {OS, CPU architecture, CPU type, GPU architecture (if the node has GPUs)} (as AVAIL_FEATURES), GRES (GPUs), and State. It essentially wraps the sinfo command with some pre-determined output format options and shows each node on its own line, in alphabetical order.
To only view nodes in a specific partition, append -p <partition name> to the command alias.
Examples
    $ show_nodes
    NODELIST    CPUS    MEMORY    AVAIL_FEATURES                GRES              STATE
    brigid16    48      512897    rhel8,Zen,EPYC-7443           (null)            idle
    brigid17    48      512897    rhel8,Zen,EPYC-7443           (null)            idle
    ...         ...     ...       ...                           ...               ...
    vulcan44    32      255215    rhel8,Zen,EPYC-7313,Ampere    gpu:rtxa4000:8    idle
(specific partition)
    $ show_nodes -p tron
    NODELIST    CPUS    MEMORY    AVAIL_FEATURES                GRES              STATE
    tron00      32      255233    rhel8,Zen,EPYC-7302,Ampere    gpu:rtxa6000:8    idle
    tron01      32      255233    rhel8,Zen,EPYC-7302,Ampere    gpu:rtxa6000:8    idle
    ...         ...     ...       ...                           ...               ...
    tron69      32      383030    rhel8,Xeon,4216,Turing        gpu:rtx2080ti:8   idle
show_available_nodes
The show_available_nodes command alias takes zero or more arguments that correspond to resources or features that you are looking to request a job for and tells you what nodes could theoretically[0,1] run a job with these arguments immediately. It assumes your job is a single-node job.
These arguments are:
- --partition: Only include nodes in the specified partition(s).
- --account: Only include nodes from partitions that can use the specified account(s).
- --qos: Only include nodes from partitions that can use the specified QoS(es).
- --cpus: Only include nodes with at least this many CPUs free.
- --mem: Only include nodes with at least this much memory free. The default unit is MB if unspecified, but any of {K,M,G,T} can be suffixed to the number provided (will then be interpreted as KB, MB, GB, or TB, respectively).
- GRES-related arguments:
  - --or-gres: Only include nodes whose list of GRES contains any of the specified GRES type/quantity pairings.
  - --and-gres: Only include nodes whose list of GRES contains all of the specified GRES type/quantity pairings. Functionally identical to --or-gres if only one GRES type/quantity pairing is specified.
- GPU-related arguments:
  - --or-gpus: Only include nodes whose list of GPUs (a subset of GRES) contains any of the specified GPU type/quantity pairings.
  - --and-gpus: Only include nodes whose list of GPUs (a subset of GRES) contains all of the specified GPU type/quantity pairings. Functionally identical to --or-gpus if only one GPU type/quantity pairing is specified.
- Feature-related arguments:
  - --or-feature: Only include nodes whose list of features contains any of the specified feature(s).
  - --and-feature: Only include nodes whose list of features contains all of the specified feature(s). Functionally identical to --or-feature if only one feature is specified.
These arguments are also viewable by running show_available_nodes -h.
Examples
TODO
Footnotes
[0] - As of now, this command alias does not factor in resources occupied by jobs that could be preempted (based on the partition(s) passed to it, if present). This is soon to come.
[1] - This command alias also does not factor in potentially higher priority jobs in the same partition(s) blocking execution of a job submitted with the arguments checked by the command alias. This is due to the complexity of calculating a job's priority before it is actually submitted.
Requesting GPUs
If you need to do processing on a GPU, you will need to request that your job have access to GPUs just as you need to request processors or CPU cores. In SLURM, GPUs are considered "generic resources", also known as GRES. To request some number of GPUs be reserved/available for your job, you can use the flag --gres=gpu:# (with the actual number of GPUs you want). If there are multiple types of GPUs available in the cluster and you need a specific type, you can provide the type option to the gres flag, e.g. --gres=gpu:rtxa5000:#. If you do not request a specific type of GPU, you are likely to be scheduled on an older, lower spec'd GPU.
Note that some QoSes may have limits on the number of GPUs you can request per job, so you may need to specify a different QoS to request more GPUs.
    $ srun --pty --qos=medium --gres=gpu:2 nvidia-smi
    ...
    Wed Mar  6 16:59:39 2024
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:3D:00.0 Off |                  N/A |
    | 32%   23C    P8               1W / 250W |      0MiB / 11264MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    |   1  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:40:00.0 Off |                  N/A |
    | 32%   25C    P8               1W / 250W |      0MiB / 11264MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+
Please note that your job will only be able to see/access the GPUs you requested. If you only need 1 GPU, please only request 1 GPU. The others on the node (if any) will be left available for other users.
    $ srun --pty --gres=gpu:rtxa5000:1 nvidia-smi
    Thu Aug 25 15:22:15 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA RTX A5000    Off  | 00000000:01:00.0 Off |                  Off |
    | 30%   23C    P8    20W / 230W |      0MiB / 24256MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
As with all other flags, the --gres flag may also be passed to sbatch and salloc rather than directly to srun.
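Inside a job, one quick way to confirm how many GPUs were granted is to inspect the CUDA_VISIBLE_DEVICES environment variable. This is only a sketch under an assumption not stated on this page: that the cluster's SLURM GPU configuration exports CUDA_VISIBLE_DEVICES as a comma-separated list of device indices, which is common but not guaranteed. The function name count_visible_gpus is hypothetical.

```shell
#!/bin/bash
# Count the entries in a CUDA_VISIBLE_DEVICES-style value such as "0,1".
# An empty value is treated as zero visible GPUs.
count_visible_gpus() {
    local devices="$1"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    # Split the value on commas and count the resulting fields.
    local IFS=','
    set -- $devices
    echo "$#"
}

# Assumption: SLURM exports CUDA_VISIBLE_DEVICES inside GPU jobs on this cluster.
echo "GPUs visible to this job: $(count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```

Run inside a job submitted with --gres=gpu:2, this would be expected to report 2 under the assumption above; nvidia-smi remains the authoritative check.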
MPI example
To run MPI jobs, you will need to include the --mpi=pmix flag in your submission arguments.
    #!/usr/bin/bash
    #SBATCH --job-name=mpi_test       # Job name
    #SBATCH --nodes=4                 # Number of nodes
    #SBATCH --ntasks=8                # Number of MPI ranks
    #SBATCH --ntasks-per-node=2       # Number of MPI ranks per node
    #SBATCH --ntasks-per-socket=1     # Number of tasks per processor socket on the node
    #SBATCH --time=00:30:00           # Time limit hrs:min:sec

    srun --mpi=pmix /nfshomes/username/testing/mpi/a.out