SLURM/JobSubmission

From UMIACS

Latest revision as of 16:00, 26 October 2022

Job Submission

SLURM offers a variety of ways to run jobs. To run a job successfully, it is important to understand the available options and how to request the resources the job requires. All jobs should be submitted from submit nodes, and any computational code should be run in a job allocation on compute nodes. The following commands outline how to allocate resources on the compute nodes and submit processes to be run on the allocated nodes.

Please note that the hard maximum number of jobs that the SLURM scheduler can handle is 10000. It is best to keep the number of jobs you have submitted at any given time below half this amount in case another user also wants to submit a large number of jobs.
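
As a rough illustration of staying under that guideline, the sketch below computes how many more jobs could be submitted before reaching a personal cap. The cap of 5000 and the function name remaining_submissions are assumptions made for this example, not a cluster setting; the squeue call in the comments shows one way the current count might be obtained.

```shell
#!/bin/bash
# Personal cap: half of the 10000-job scheduler limit (an illustrative choice).
MAX_OWN_JOBS=5000

# Print how many more jobs can be submitted before reaching the cap,
# given the number of jobs currently queued or running.
remaining_submissions() {
    local current=$1
    local left=$(( MAX_OWN_JOBS - current ))
    if (( left < 0 )); then
        left=0
    fi
    echo "$left"
}

# On a cluster, the current count could be taken from squeue, for example:
#   current=$(squeue --noheader --user="$USER" | wc -l)
#   remaining_submissions "$current"
```

A submission loop could call remaining_submissions before each batch of sbatch calls and pause when it prints 0.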

An important notice: computational jobs run on submission nodes will be terminated. Please use the compute nodes for that purpose.

srun

srun is the command used to run a process on the compute nodes in the cluster. You pass it a command (which could be a script); srun runs that command on a compute node and returns once the command finishes. srun accepts many command line options to specify the resources required by the command passed to it. Some common command line arguments are listed below; full documentation of all available options is in the man page for srun, which can be accessed by running man srun.

username@nexuscml010:srun --qos=dpart --mem=100mb --time=1:00:00 bash -c 'echo "Hello World from" `hostname`'
Hello World from tron33.umiacs.umd.edu

It is important to understand that srun is an interactive command. By default, input to srun is broadcast to all compute nodes running your process, and output from the compute nodes is redirected to srun. This behavior can be changed; however, srun will always wait for the command passed to it to finish before exiting, so if you start a long-running process and end your terminal session, your process will stop running on the compute nodes and your job will end. To run a non-interactive submission that will remain running after you log out, you will need to wrap your srun commands in a batch script and submit it with sbatch.

Common srun arguments

  • --mem=1gb (memory to request; if no unit is given, MB is assumed)
  • --nodes=2 (if passed to srun, the given command will be run concurrently on each node)
  • --qos=dpart (to see the available QOS options on a cluster, run show_qos)
  • --time=hh:mm:ss (time needed to run your job)
  • --job-name=helloWorld (name for the job)
  • --output=filename (file to redirect stdout to)
  • --error=filename (file to redirect stderr to)
  • --partition=$PNAME (request that the job run in the $PNAME partition)
  • --ntasks=2 (request 2 "tasks", which map to cores on a CPU; if passed to srun, the given command will be run concurrently on each core)
  • --account=accountname (use the QOS specific to an account)
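
Several of these arguments are usually combined in one command. The sketch below, which assumes bash, collects a few of the options listed above in an array (srun_opts is a name chosen for this example) and prints the resulting command line rather than running it, so it can be reviewed first.

```shell
#!/bin/bash
# Collect the options in an array so each one can be commented or toggled.
srun_opts=(
    --qos=dpart              # QOS used in the examples on this page
    --mem=1gb                # memory for the job
    --time=00:10:00          # ten-minute time limit
    --job-name=helloWorld
    --ntasks=2               # the command runs once per task
)

# Print the assembled command instead of running it.
echo "srun ${srun_opts[*]} hostname"
```

To actually run the command on a submit node, the echo line could be replaced with srun "${srun_opts[@]}" hostname.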

Interactive Shell Sessions

An interactive shell session on a compute node can be useful for debugging or developing code that isn't ready to be run as a batch job. To get an interactive shell on a node, use srun to invoke a shell:

username@nexuscml00:srun --pty --qos=dpart --mem 1gb --time=01:00:00 bash
username@tron33:

Please do not leave interactive shells running for long periods of time when you are not working. This blocks resources from being used by everyone else.

salloc

The salloc command can also be used to request resources be allocated without needing a batch script. Running salloc with a list of resources will allocate the resources you requested, create a job, and drop you into a subshell with the environment variables necessary to run commands in the newly created job allocation. When your time is up or you exit the subshell, your job allocation will be relinquished.

username@nexuscml00:salloc --qos=dpart -N 1 --mem=2gb --time=01:00:00
salloc: Granted job allocation 159
username@nexuscml00:srun /usr/bin/hostname
tron33.umiacs.umd.edu
username@nexuscml00:exit
exit
salloc: Relinquishing job allocation 159

Please note that any commands not invoked with srun will be run locally on the submit node. Please be careful when using salloc.
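
One way to tell whether the current shell is the subshell created by salloc is to check the SLURM_JOB_ID environment variable, which SLURM sets inside a job allocation. The helper name in_allocation below is made up for this sketch. Note that the check only reports whether an allocation exists; commands still run on the submit node unless they are invoked with srun.

```shell
#!/bin/bash
# Succeeds when the current shell carries a SLURM job ID in its environment.
in_allocation() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_allocation; then
    echo "Inside job allocation ${SLURM_JOB_ID}; prefix commands with srun to reach the compute nodes."
else
    echo "No job allocation detected; commands here run on the local node."
fi
```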

sbatch

The sbatch command allows you to write a batch script to be submitted and run non-interactively on the compute nodes. To run a simple Hello World command on the compute nodes, you could write a file, helloWorld.sh, with the following contents:

#!/bin/bash

srun bash -c 'echo Hello World from `hostname`'

Then you need to submit the script with sbatch and request resources:

username@nexuscml00:sbatch --qos=dpart --mem=1gb --time=1:00:00 helloWorld.sh
Submitted batch job 121

SLURM will return a job number that you can use to check the status of your job with squeue:

username@nexuscml00:squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               121     dpart helloWor username  R       0:01      1 tron32
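
When submitting from a script, it can be handy to capture the job number instead of reading it off the terminal. The sketch below assumes the confirmation line has the form shown above ("Submitted batch job 121"); parse_job_id is a helper name invented for this example.

```shell
#!/bin/bash
# Extract the job ID from sbatch's confirmation line by keeping
# only the text after the last space, e.g. "Submitted batch job 121" -> "121".
parse_job_id() {
    local line=$1
    echo "${line##* }"
}

# On a cluster this could be combined with sbatch and squeue:
#   job_id=$(parse_job_id "$(sbatch --qos=dpart --mem=1gb --time=1:00:00 helloWorld.sh)")
#   squeue --jobs="$job_id"
parse_job_id "Submitted batch job 121"
```

sbatch also accepts a --parsable option that prints only the job ID (followed by the cluster name on multi-cluster setups), which avoids the parsing step.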

Advanced Batch Scripts

You can also write a batch script with all of your resources/options defined in the script itself. This is useful for jobs that need to be run 10s/100s/1000s of times. You can then handle any necessary environment setup and run commands on the resources you requested by invoking commands with srun. The srun commands can also be more complex and be told to use only portions of your entire job allocation; each of these distinct srun commands makes up one "job step". The batch script will be run on the first node allocated as part of your job allocation, and each job step will be run on whatever resources you tell it to use.

In the following example, we have a batch job that will request 2 nodes in the cluster. We then load a specific version of Python into our environment and submit two job steps, each one using one node. Since srun blocks until the command finishes, we use the '&' operator to background each process so that both job steps can run at once; however, this means that we then need to use the wait command to block processing until all background processes have finished.

#!/bin/bash

# Lines that begin with #SBATCH specify options to be used by SLURM for scheduling

#SBATCH --job-name=helloWorld                                   # sets the job name
#SBATCH --output=helloWorld.out.%j                              # indicates a file to redirect STDOUT to; %j is the jobid. Must be set to a file instead of a directory or else submission will fail.
#SBATCH --error=helloWorld.out.%j                               # indicates a file to redirect STDERR to; %j is the jobid. Must be set to a file instead of a directory or else submission will fail.
#SBATCH --time=00:05:00                                         # how long you think your job will take to complete; format=hh:mm:ss
#SBATCH --qos=dpart                                             # set QOS, this will determine what resources can be requested
#SBATCH --nodes=2                                               # number of nodes to allocate for your job
#SBATCH --ntasks=4                                              # request 4 cpu cores be reserved for your job in total
#SBATCH --ntasks-per-node=2                                     # request 2 cpu cores be reserved per node
#SBATCH --mem=1gb                                               # memory required by job; if unit is not specified MB will be assumed

module load Python/2.7.9                                        # run any commands necessary to set up your environment

srun -N 1 --mem=512mb bash -c "hostname; python --version" &    # use srun to invoke commands within your job; using an '&'
srun -N 1 --mem=512mb bash -c "hostname; python --version" &    # will background the process allowing them to run concurrently
wait                                                            # wait for any background processes to complete

# once the end of the batch script is reached your job allocation will be revoked

Another useful thing to know is that you can pass additional arguments into your sbatch scripts on the command line and reference them as ${1} for the first argument and so on.
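
As an illustration of passing arguments, here is a minimal sketch of a batch script that reads two positional arguments. The script name argsDemo.sh, the variable names, the default values, and the describe_run helper are all hypothetical; in a real job the echo would typically be replaced by an srun invocation.

```shell
#!/bin/bash

#SBATCH --job-name=argsDemo          # hypothetical job name
#SBATCH --time=00:05:00              # format=hh:mm:ss
#SBATCH --mem=1gb

# Arguments given after the script name on the sbatch command line
# arrive as ${1}, ${2}, ... just as in any other shell script.
input_file="${1:-input.txt}"         # first argument, with a fallback default
repeat_count="${2:-1}"               # second argument, with a fallback default

# describe_run is a hypothetical stand-in for the real work of the job.
describe_run() {
    echo "Processing ${1} ${2} time(s)"
}

describe_run "${input_file}" "${repeat_count}"
```

Assuming the script is saved as argsDemo.sh, it could be submitted with sbatch argsDemo.sh data.csv 3, in which case ${1} is data.csv and ${2} is 3. Any sbatch options must come before the script name; everything after it is passed to the script.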

More Examples

scancel

The scancel command can be used to cancel job allocations or job steps that are no longer needed. It can be passed individual job IDs or an option to delete all of your jobs or jobs that meet certain criteria.

  • scancel 255: cancel job 255
  • scancel 255.3: cancel job step 3 of job 255
  • scancel --user username --partition=dpart: cancel all jobs for username in the dpart partition

Identifying Resources and Features

The sinfo command can show you additional features of nodes in the cluster, but you need to ask it to show some non-default columns (here the node list, CPU count, memory in MB, features, and generic resources) using a command like sinfo -o "%40N %8c %8m %20f %35G".

username@nexuscml00:sinfo -o "%40N %8c %8m %20f %35G"
NODELIST                                 CPUS     MEMORY   AVAIL_FEATURES       GRES
tron[22-33,35-45]                        16       128521+  rhel8,Zen,EPYC-7302  gpu:rtxa4000:4
tron[06-09,12-15,21]                     16       128520+  rhel8,Zen,EPYC-7302P gpu:rtxa4000:4
tron[10-11,16-20,34]                     16       128524   rhel8,Zen,EPYC-7313P gpu:rtxa4000:4
legacy00                                 48       128248   rhel8,Zen,EPYC-7402  (null)
legacy[01-09]                            12       128436   rhel8,Xeon,E5-2620   (null)
clip07                                   8        257570   rhel8,Xeon,E5-2623   gpu:gtx1080ti:3
clip08                                   32       257565   rhel8,Xeon,E5-2683   gpu:gtx1080ti:8
clip09                                   32       385350   rhel8,Xeon,6130      gpu:rtx2080ti:5,gpu:gtx1080ti:3
clip00                                   32       257583   rhel8,Xeon,E5-2683   gpu:gtxtitanx:3
clip01                                   32       257583   rhel8,Xeon,E5-2683   gpu:gtxtitanx:1,gpu:gtxtitanxp:2
clip02                                   20       128562   rhel8,Xeon,E5-2630   gpu:gtx1080ti:3
clip03                                   20       128562   rhel8,Xeon,E5-2630   gpu:rtx2080ti:1,gpu:gtx1080ti:2
clip04                                   32       257540   rhel8,Zen,EPYC-7302  gpu:rtx3090:4
clip[05-06]                              24       128523   rhel8,Zen,EPYC-7352  gpu:rtxa6000:2
gammagpu[01-03]                          32       257541   rhel8,Zen,EPYC-7313  gpu:rtxa5000:8
legacy14                                 20       322068   rhel8,Xeon,E5-2650   gpu:gtxtitanx:4
legacy[15-16]                            16       257587   rhel8,Xeon,E5-2630   gpu:teslak80:2
legacy17                                 44       515501   rhel8,Xeon,E5-2699   gpu:gtx1080ti:4
twist[00-01]                             16       64031    rhel8,Xeon,E5-1660   (null)
twist[02-05]                             48       257452   rhel8,Xeon,E5-2650   (null)
tron[00-05]                              32       257540   rhel8,Zen,EPYC-7302  gpu:rtxa6000:8
tron[46-61]                              48       257539   rhel8,Zen,EPYC-7352  gpu:rtxa5000:8

Note that not all of the nodes shown here are necessarily in a partition you are able to submit to.

There is also a prewritten alias show_nodes on all of our SLURM computing clusters that shows each node's name, number of CPUs, memory, processor type (as AVAIL_FEATURES), GRES, State, and partitions that can submit to it.

You can get more detailed information about a specific node using scontrol, e.g. scontrol show node tron32.
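
The sinfo listing above can also be filtered in a script, for example to find which nodes carry a particular GPU type. This is a minimal sketch: nodes_with_gres is a hypothetical helper name, and it assumes the column layout produced by the sinfo format string shown above, with GRES as the last column.

```shell
#!/bin/bash

# nodes_with_gres is a hypothetical helper (not a SLURM command).
# It reads sinfo output in the format used above on standard input and prints
# the NODELIST entry of every row whose GRES column mentions the given GPU type.
nodes_with_gres() {
    gpu_type="$1"
    # NR > 1 skips the header row; the GRES column is the last field ($NF).
    # The trailing ':' keeps e.g. gtxtitanx from also matching gtxtitanxp.
    awk -v type="${gpu_type}" 'NR > 1 && index($NF, "gpu:" type ":") { print $1 }'
}

# Example using three rows copied from the listing above (no cluster needed):
nodes_with_gres rtxa5000 <<'EOF'
NODELIST                                 CPUS     MEMORY   AVAIL_FEATURES       GRES
clip04                                   32       257540   rhel8,Zen,EPYC-7302  gpu:rtx3090:4
gammagpu[01-03]                          32       257541   rhel8,Zen,EPYC-7313  gpu:rtxa5000:8
tron[46-61]                              48       257539   rhel8,Zen,EPYC-7352  gpu:rtxa5000:8
EOF
# prints:
#   gammagpu[01-03]
#   tron[46-61]

# On a submit node the helper could be fed live data:
#   sinfo -o "%40N %8c %8m %20f %35G" | nodes_with_gres rtxa5000
```

A node appearing in the result does not mean you can submit to it; the partition caveat above still applies.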

Requesting GPUs

If you need to do processing on a GPU, you will need to request that your job have access to GPUs just as you need to request processors or CPU cores. You will also need to make sure that you submit your job to the correct partition since nodes with GPUs are often put into their own partition to prevent the nodes from being tied up by jobs that don't utilize GPUs. In SLURM, GPUs are considered "generic resources" also known as GRES. To request some number of GPUs be reserved/available for your job, you can use the flag --gres=gpu:2. If there are multiple types of GPUs available in the cluster and you need a specific type, you can provide the type option to the gres flag e.g. --gres=gpu:rtxa5000:1. If you do not request a specific type of GPU, you may be scheduled on the oldest/lowest specced GPU available.

username@nexuscml00:srun --pty --gres=gpu:2 nvidia-smi
Thu Aug 25 15:22:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:01:00.0 Off |                  Off |
| 30%   23C    P8    20W / 140W |      0MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000    Off  | 00000000:41:00.0 Off |                  Off |
| 30%   24C    P8    15W / 140W |      0MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Please note that your job will only be able to see/access the GPUs you requested. If you only need 1 GPU, please request only 1 GPU. The other one will be left available for other users.

username@nexuscml00:srun --pty --gres=gpu:rtxa5000:1 nvidia-smi
Thu Aug 25 15:22:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    Off  | 00000000:01:00.0 Off |                  Off |
| 30%   23C    P8    20W / 230W |      0MiB / 24256MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

As with all other flags, the --gres flag may also be passed to sbatch and salloc rather than directly to srun.
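
For instance, a batch script can carry the GPU request as an #SBATCH line. The sketch below is illustrative only: the job name, the output file name, and the count_visible_gpus helper are hypothetical, the dpart QOS is simply the one used in the earlier examples and may not be the right choice for GPU nodes, and the script assumes that SLURM exports CUDA_VISIBLE_DEVICES for the GPUs it allocates, which depends on how GRES is configured on the cluster.

```shell
#!/bin/bash

#SBATCH --job-name=gpuTest           # hypothetical job name
#SBATCH --output=gpuTest.out.%j      # %j is the jobid
#SBATCH --time=00:05:00              # format=hh:mm:ss
#SBATCH --qos=dpart                  # QOS from the earlier examples; yours may differ
#SBATCH --gres=gpu:1                 # request a single GPU of any type
#SBATCH --mem=1gb

# count_visible_gpus is a hypothetical helper. It counts the comma-separated
# entries in CUDA_VISIBLE_DEVICES and prints 0 when the variable is unset or empty.
count_visible_gpus() {
    devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "${devices}" ]; then
        echo 0
        return
    fi
    printf '%s\n' "${devices}" | awk -F',' '{ print NF }'
}

echo "GPUs visible to this job: $(count_visible_gpus)"

# Inside a real allocation, the GPUs themselves can be listed with:
#   srun nvidia-smi
```

Run outside of a job allocation, the script typically reports 0 GPUs, since nothing has set CUDA_VISIBLE_DEVICES.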

MPI example

To run MPI jobs, you will need to load an MPI module and include the --mpi flag in your submission arguments.

#!/usr/bin/bash 
#SBATCH --job-name=mpi_test # Job name 
#SBATCH --nodes=4 # Number of nodes 
#SBATCH --ntasks=8 # Number of MPI ranks 
#SBATCH --ntasks-per-node=2 # Number of MPI ranks per node 
#SBATCH --ntasks-per-socket=1 # Number of tasks per processor socket on the node 
#SBATCH --time=00:30:00 # Time limit hrs:min:sec 

module load mpi 

srun --mpi=openmpi /nfshomes/username/testing/mpi/a.out