SLURM/JobSubmission: Difference between revisions

From UMIACS
No edit summary
 
(109 intermediate revisions by 11 users not shown)
Line 1: Line 1:
=Job Submission=
=Job Submission=
SLURM offers a variety of ways to run jobs. It is important to understand the different options available and how to request the resources required for a job in order for it to run successfully. All job submission should be done from submit nodes; any computational code should be run in a job allocation on compute nodes. The following commands outline how to allocate resources on the compute nodes and submit processes to be run on the allocated nodes.
The cluster that everyone with a [[Accounts#UMIACS_Account | UMIACS account]] has access to is [[Nexus]]. Please visit the Nexus page for instructions on how to connect to your assigned submit nodes.
'''Computationally intensive processes run on submission nodes will be terminated. Please submit jobs to be scheduled on compute nodes for this purpose.'''


SLURM offers a variety of ways to run jobs. It is important to understand the different options available and how to request the resources required for a job in order for it to run successfully. All job submission should be done from submit nodes; any computational code should be run in a job allocation on compute nodes. The following commands outline how to allocate resources on the compute nodes and submit processes to be run on the allocated nodes.
For details on how SLURM decides how to schedule jobs when multiple jobs are waiting in a scheduler's queue, please see [[SLURM/Priority]].


==srun==
==srun==
srun is the command used to run a process on the compute nodes in the cluster. It works by passing it a command (this could be a script) which will be run on a compute node and then srun will return. srun accepts many command line options to specify the resources required by the command passed to it, some common command line arguments are listed below and full documentation of all available options is available in the man page for srun which can be accessed by running <code>man srun</code>.  
The <code>srun</code> command is used to run a process on the compute nodes in the cluster. If you pass it a normal shell command (or command that executes a script), it will submit a job to run that shell command/script on a compute node and then return. <code>srun</code> accepts many command line options to specify the resources required by the command passed to it. Some common command line arguments are listed below and full documentation of all available options is available in the man page for <code>srun</code>, which can be accessed by running <code>man srun</code>.
 
<pre>
<pre>
tgray26@opensub01:srun --mem=100mb --time=1:00:00 bash -c 'echo "Hello World from" `hostname`'
$ srun --qos=default --mem=100mb --time=1:00:00 bash -c 'echo "Hello World from" `hostname`'
Hello World from openlab06.umiacs.umd.edu
Hello World from tron33.umiacs.umd.edu
</pre>
</pre>
It is important to understand that srun is an interactive command. By default input to srun is broadcast to all compute nodes running your process and output from the compute nodes is redirected to srun, this behavior can be changed; however, '''srun will always wait for the command passed to finish before exiting, so if you start a long running process and end your terminal session, your process will stop running on the compute nodes and your job will end'''. To run a non-interactive session that you can submit to the cluster and will remain running after you logout, you will need to wrap your srun commands in a batch script and submit it with [[#sbatch | sbatch]]
 
===Common srun arguments===
It is important to understand that <code>srun</code> is an interactive command. By default, input to <code>srun</code> is broadcast to all compute nodes running your process and output from the compute nodes is redirected to <code>srun</code>. This behavior can be changed; however, '''srun will always wait for the command passed to it to finish before exiting, so if you start a long-running process and end your terminal session, your process will stop running on the compute nodes and your job will end'''. To run a non-interactive submission that will remain running after you log out, you will need to wrap your <code>srun</code> commands in a batch script and submit it with [[#sbatch | sbatch]].
* <code>--mem=1gb</code> ''if no unit is given MB is assumed''
 
* <code>--nodes=2</code> ''if passed to srun, the given command will be run concurrently on each node''
===Common srun Arguments===
* <code>--qos=dpart</code>
* <code>--job-name=<JOBNAME></code> ''Requests your job be named <JOBNAME>''
* <code>--time=hh:mm:ss</code> ''time needed to run your job''
* <code>--mem=1g</code> ''Requests 1GB of memory for your job; if no unit is given, MB is assumed''
* <code>--job-name=helloWorld</code>
* <code>--ntasks=2</code> ''Requests 2 "tasks" which map to cores on a CPU for your job; if passed to srun, runs the given command concurrently on each core''
* <code>--output filename</code> ''file to redirect stdout to''
* <code>--nodes=2</code> ''Requests 2 nodes be allocated to your job; if passed to srun, runs the given command concurrently on each node''
* <code>--error filename</code> ''file to redirect stderr''
* <code>--nodelist=<NODENAME></code> ''Requests to run your job on the <NODENAME> node''
* <code>--partition $PNAME</code> ''request job run in the $PNAME partition''
* <code>--time=dd-hh:mm:ss</code> ''Requests your job run for dd days, hh hours, mm minutes, and ss seconds''
* <code>--ntasks 2</code> ''request 2 "tasks" which map to cores on a CPU, if passed to srun the given command will be run concurrently on each core''
* <code>--error=<ERRNAME></code> ''Redirects stderr for your job to the <ERRNAME> file''
* <code>--partition=<PARTITIONNAME></code> ''Requests your job run in the <PARTITIONNAME> partition''
* <code>--qos=<QOSNAME></code> ''Requests your job run with the <QOSNAME> QOS; to see the available QOS options on a cluster, run'' <code>show_qos</code>
* <code>--account=<ACCOUNTNAME></code> ''Requests your job run under the <ACCOUNTNAME> Slurm account; different accounts have different available partitions/QOS''
* <code>--output=<OUTNAME></code> ''Redirects stdout for your job to the <OUTNAME> file''
* <code>--requeue</code> ''Requests your job be automatically requeued if it is preempted''
* <code>--exclusive</code> ''Requests your job be the only one running on the node(s) it is assigned to. This requires that your job be allocated all of the resources on the node(s). The scheduler '''does not''' automatically give your job all of the node's/nodes' resources, however, so if you need more than the default, you still need to request these with'' <code>--ntasks</code> ''and'' <code>--mem</code>
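
Several of the arguments listed above are usually combined in a single call. The following is a minimal sketch rather than a recommended configuration: the job name, file names, and resource values are illustrative assumptions, and the QOS must be one that is actually available to you. It prints the command line first and only calls <code>srun</code> if it is installed, so it can be tried safely anywhere.

```shell
#!/bin/bash
# Collect the options as positional parameters so each one sits on its own line.
# All values below (job name, file names, limits) are illustrative assumptions.
set -- \
    --job-name=helloWorld \
    --qos=default \
    --mem=1g \
    --ntasks=2 \
    --time=0-00:10:00 \
    --output=helloWorld.out \
    --error=helloWorld.err

# Show the full command line that will be run.
echo "srun $* hostname"

# Only submit when srun is actually present (i.e. on a submit node).
if command -v srun >/dev/null 2>&1; then
    srun "$@" hostname
else
    echo "srun not found; run this from a submit node" >&2
fi
```

Because <code>--ntasks=2</code> is requested, <code>hostname</code> would be run once per task, and its output would be written to the file given with <code>--output</code> rather than to the terminal.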


===Interactive Shell Sessions===
===Interactive Shell Sessions===
An interactive shell session on a compute node can be useful for debugging or developing code that isn't ready to be run as a batch job. To get an interactive shell on a node, use srun to invoke a shell:
An interactive shell session on a compute node can be useful for debugging or developing code that isn't ready to be run as a batch job. To get an interactive shell on a node, use <code>srun</code> with the <code>--pty</code> argument to invoke a shell:
<pre>
<pre>
tgray26@opensub01:srun --pty --mem 1gb --time=01:00:00 bash
$ srun --pty --qos=default --mem=1g --time=01:00:00 bash
tgray26@openlab06:
$ hostname
tron33.umiacs.umd.edu
</pre>
</pre>
'''Please do not leave interactive shells running for long periods of time when you are not working. This blocks resources from being used by everyone else.'''
'''Please do not leave interactive shells running for long periods of time when you are not working. This blocks resources from being used by everyone else.'''


==salloc==
==salloc==
The salloc command can also be used to request resources be allocated without needing a batch script. Running salloc with a list of resources will allocate the resources you requested, create a job, and drop you into a subshell with the environment variables necessary to run commands in the newly created job allocation. When your time is up or you exit the subshell, your job allocation will be relinquished.
The <code>salloc</code> command can also be used to request resources be allocated without needing a batch script. Running salloc with a list of resources will allocate the resources you requested, create a job, and drop you into a subshell with the environment variables necessary to run commands in the newly created job allocation. When your time is up or you exit the subshell, your job allocation will be relinquished.
 
<pre>
<pre>
tgray26@opensub00:salloc -N 1 --mem=2gb --time=01:00:00
$ salloc --qos=default -N 1 --mem=2g --time=01:00:00
salloc: Granted job allocation 159
salloc: Granted job allocation 159
tgray26@opensub00:srun /usr/bin/hostname
$ srun /usr/bin/hostname
openlab00.umiacs.umd.edu
tron33.umiacs.umd.edu
tgray26@opensub00:exit
$ exit
exit
exit
salloc: Relinquishing job allocation 159
salloc: Relinquishing job allocation 159
</pre>
</pre>
'''Please note that any commands not invoked with srun will be run locally on the submit node. Please be careful when using salloc.'''
'''Please note that any commands not invoked with srun will be run locally on the submit node. Please be careful when using salloc.'''
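
One way to check where you are before running anything expensive is sketched below. It assumes that <code>salloc</code> exports the <code>SLURM_JOB_ID</code> environment variable into the subshell it starts (this is standard Slurm behavior, but the variable is also set inside batch jobs and job steps), and it uses <code>hostname</code> only as a harmless stand-in for a real command.

```shell
#!/bin/bash
# salloc exports SLURM_JOB_ID into the subshell it starts, so its presence
# is a quick way to tell whether this shell belongs to a job allocation.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    where="inside job allocation ${SLURM_JOB_ID}"
else
    where="not inside a job allocation"
fi
echo "This shell is ${where}."

# A bare command runs on the machine the shell itself is on (the submit node
# when using salloc as described above); the same command prefixed with srun
# runs on the allocated compute node(s).
echo "Local host: $(hostname)"
if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
    echo "Allocated host(s): $(srun hostname | sort -u | tr '\n' ' ')"
fi
```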


==sbatch==
==sbatch==
The sbatch command allows you to write a batch script to be submitted and run non-interactively on the compute nodes. To run a simple Hello World command on the compute nodes you could write a file, helloWorld.sh with the following contents:
The <code>sbatch</code> command allows you to write a batch script to be submitted and run non-interactively on the compute nodes. To run a simple Hello World command on the compute nodes, you could write a file, helloWorld.sh, with the following contents:
 
<pre>
<pre>
#!/bin/bash
#!/bin/bash
Line 49: Line 65:
srun bash -c 'echo Hello World from `hostname`'
srun bash -c 'echo Hello World from `hostname`'
</pre>
</pre>
Then you need to submit the script with sbatch and request resources:
Then you need to submit the script with sbatch and request resources:
<pre>tgray26@opensub00:sbatch --mem=1gb --time=1:00:00 helloWorld.sh
 
<pre>
$ sbatch --qos=default --mem=1g --time=1:00:00 helloWorld.sh
Submitted batch job 121
Submitted batch job 121
</pre>
</pre>
SLURM will return a job number that you can use to check the status of your job with squeue:
SLURM will return a job number that you can use to check the status of your job with squeue:
<pre>
<pre>
tgray26@opensub00:squeue
$ squeue
             JOBID PARTITION    NAME    USER ST      TIME  NODES NODELIST(REASON)
             JOBID PARTITION    NAME    USER ST      TIME  NODES NODELIST(REASON)
               121     dpart helloWor tgray26 R      0:01      2 openlab[00-01]
               121     tron helloWor username R      0:01      1 tron32
</pre>
</pre>
====Advanced Batch Scripts====
====Advanced Batch Scripts====
You can also write a batch script with all of your resources/options defined in the script itself. This is useful for jobs that need to be run 10s/100s/1000s of times. You can then handle any necessary environment setup and run commands on the resources you requested by invoking commands with srun. The srun commands can also be more complex and be told to only use portions of your entire job allocation, each of these distinct srun commands makes up one "job step". The batch script will be run on the first node allocated as part of your job allocation and each job step will be run on whatever resources you tell them to. In the following example I have a batch job that will request 2 nodes in the cluster, then I load a specific version of Python into my environment and submit two job steps, each one using one node. Since srun is blocks until the command finishes, I use the '&' operator to background the process so that both job steps can run at once; however, this means that I then need to use the wait command to block processing until all background processes have finished.
You can also write a batch script with all of your resources/options defined in the script itself. This is useful for jobs that need to be run tens/hundreds/thousands of times. You can then handle any necessary environment setup and run commands on the resources you requested by invoking commands with srun. The srun commands can also be more complex and be told to only use portions of your entire job allocation; each of these distinct srun commands makes up one "job step". The batch script will be run on the first node allocated as part of your job allocation and each job step will be run on whatever resources you tell it to use. In the following example, we have a batch job that will request 2 nodes in the cluster. We then load a specific version of [[Python]] into our environment and submit two job steps, each one using one node. Since srun blocks until the command finishes, we use the '&' operator to background the process so that both job steps can run at once; however, this means that we then need to use the wait command to block processing until all background processes have finished.
 
<pre>
<pre>
#!/bin/bash
#!/bin/bash
Line 66: Line 89:
# Lines that begin with #SBATCH specify commands to be used by SLURM for scheduling
# Lines that begin with #SBATCH specify commands to be used by SLURM for scheduling


#SBATCH --job-name=helloWorld                                   # sets the job name
#SBATCH --job-name=helloWorld                                       # sets the job name
#SBATCH --output helloWorld.out.%j                             # indicates a file to redirect STDOUT to; %j is the jobid  
#SBATCH --output=helloWorld.out.%j                                   # indicates a file to redirect STDOUT to; %j is the jobid. If set, must be set to a file instead of a directory or else submission will fail.
#SBATCH --error helloWorld.out.%j                               # indicates a file to redirect STDERR to; %j is the jobid
#SBATCH --error=helloWorld.out.%j                                   # indicates a file to redirect STDERR to; %j is the jobid. If set, must be set to a file instead of a directory or else submission will fail.
#SBATCH --time=00:05:00                                         # how long you think your job will take to complete; format=hh:mm:ss
#SBATCH --time=00:05:00                                             # how long you would like your job to run; format=dd-hh:mm:ss
#SBATCH --qos=dpart                                            # set QOS, this will determine what resources can be requested
#SBATCH --qos=default                                                # set QOS, this will determine what resources can be requested
#SBATCH --nodes=2                                               # number of nodes to allocate for your job
#SBATCH --nodes=2                                                   # number of nodes to allocate for your job
#SBATCH --ntasks=4                                             # request 4 cpu cores be reserved for your node total
#SBATCH --ntasks=4                                                   # request 4 cpu cores be reserved for your job in total
#SBATCH --ntasks-per-node=2                                     # request 2 cpu cores be reserved per node
#SBATCH --ntasks-per-node=2                                         # request 2 cpu cores be reserved per node
#SBATCH --mem 1gb                                              # memory required by job; if unit is not specified MB will be assumed
#SBATCH --mem=1g                                                    # memory required by job; if unit is not specified MB will be assumed. for multi-node jobs, this argument allocates this much memory *per node*


module load Python/2.7.9                                        # run any commands necessary to setup your environment
srun --nodes=1 --mem=512m bash -c "hostname; python3 --version" &    # use srun to invoke commands within your job; using an '&'
 
srun --nodes=1 --mem=512m bash -c "hostname; python3 --version" &    # will background the process allowing them to run concurrently
srun -N 1 --mem=512mb bash -c "hostname; python --version" &    # use srun to invoke commands within your job; using an '&'
wait                                                                 # wait for any background processes to complete
srun -N 1 --mem=512mb bash -c "hostname; python --version" &    # will background the process allowing them to run concurrently
wait                                                           # wait for any background processes to complete


# once the end of the batch script is reached your job allocation will be revoked
# once the end of the batch script is reached your job allocation will be revoked
</pre>
</pre>
Another useful thing to know is that you can pass additional arguments into your sbatch scripts on the command line and reference them as <code>${1}</code> for the first argument and so on.
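
As a small illustration of this, the sketch below writes a batch script that reads its first argument and then runs it locally to check the argument handling. The file name <code>process.sh</code>, the argument <code>data01.txt</code>, and the resource values are made up for the example. Outside of SLURM the <code>#SBATCH</code> lines are ordinary comments, so running the script with <code>bash</code> is a cheap way to test it before submitting.

```shell
#!/bin/bash
# Write a small batch script (hypothetical name: process.sh).
cat > process.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=process
#SBATCH --time=00:05:00
#SBATCH --mem=1g

# ${1} is the first argument given after the script name on the sbatch
# command line; abort with a usage message if it is missing.
input="${1:?usage: sbatch process.sh <input>}"
echo "Processing ${input}"
EOF

# Check the argument handling locally; the #SBATCH lines are ignored by bash.
bash process.sh data01.txt

# On a submit node, the equivalent submission would be:
#   sbatch --qos=default process.sh data01.txt
```

Arguments for <code>sbatch</code> itself go before the script name, and everything after the script name is handed to the script.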
====More Examples====
====More Examples====
More examples of how to use batch scripts to setup your environment for processing will be coming soon
* [[SLURM/ArrayJobs]]
 
===scancel===
===scancel===
The scancel command can be used to cancel job allocations or job steps that are no longer needed. It can be passed individual job IDs or an option to delete all of your jobs or jobs that meet certain criteria.
The <code>scancel</code> command can be used to cancel job allocations or job steps that are no longer needed. It can be passed individual job IDs or an option to delete all of your jobs or jobs that meet certain criteria.
*<code>scancel 255</code>    ''cancel job 255''
*<code>scancel 255</code>    ''cancel job 255''
*<code>scancel 255.3</code>    ''cancel job step 3 of job 255''
*<code>scancel 255.3</code>    ''cancel job step 3 of job 255''
*<code>scancel --user tgray26 --partition dpart</code>    ''cancel all jobs for tgray26 in the dpart partition''
*<code>scancel --user username --partition=tron</code>    ''cancel all jobs for username in the tron partition''
 
=Identifying Resources and Features=
The <code>sinfo</code> command can show you additional features of nodes in the cluster but you need to ask it to show some non-default options using a command like <code>sinfo -o "%40N %8c %8m %35f %35G"</code>.
 
<pre>
$ sinfo -o "%40N %8c %8m %35f %35G"
NODELIST                                CPUS    MEMORY  AVAIL_FEATURES                      GRES
legacy00                                48      125940  rhel8,Zen,EPYC-7402                (null)
legacy[01-11,13-19,22-28,30]            12+      61804+  rhel8,Xeon,E5-2620                  (null)
cbcb[23-24],twist[02-05]                24      255150  rhel8,Xeon,E5-2650                  (null)
cbcb26                                  128      513243  rhel8,Zen,EPYC-7763,Ampere          gpu:rtxa5000:8
cbcb27                                  64      255167  rhel8,Zen,EPYC-7513,Ampere          gpu:rtxa6000:8
cbcb[00-21]                              32      2061175  rhel8,Zen,EPYC-7313                (null)
cbcb22,cmlcpu[00,06-07],legacy20        24+      384270+  rhel8,Xeon,E5-2680                  (null)
cbcb25                                  24      255278  rhel8,Xeon,E5-2650,Pascal,Turing    gpu:rtx2080ti:1,gpu:gtx1080ti:1
legacy21                                8        61746    rhel8,Xeon,E5-2623                  (null)
tron[06-09,12-15,21]                    16      126214+  rhel8,Zen,EPYC-7302P,Ampere        gpu:rtxa4000:4
tron[10-11,16-20,34]                    16      126217  rhel8,Zen,EPYC-7313P,Ampere        gpu:rtxa4000:4
tron[22-33,35-45]                        16      126214+  rhel8,Zen,EPYC-7302,Ampere          gpu:rtxa4000:4
clip11                                  16      126217  rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa4000:4
clip00                                  32      255276  rhel8,Xeon,E5-2683,Pascal          gpu:titanxpascal:3
clip02                                  20      126255  rhel8,Xeon,E5-2630,Pascal          gpu:gtx1080ti:3
clip03                                  20      126243  rhel8,Xeon,E5-2630,Pascal,Turing    gpu:rtx2080ti:1,gpu:gtx1080ti:2
clip04                                  32      255233  rhel8,Zen,EPYC-7302,Ampere          gpu:rtx3090:4
clip[05-06]                              24      126216  rhel8,Zen,EPYC-7352,Ampere          gpu:rtxa6000:2
clip07                                  8        255263  rhel8,Xeon,E5-2623,Pascal          gpu:gtx1080ti:3
clip09                                  32      383043  rhel8,Xeon,6130,Pascal,Turing      gpu:rtx2080ti:5,gpu:gtx1080ti:3
clip13,cml30,vulcan[29-32]              32      255218+  rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa6000:8
clip08,vulcan[08-22,25]                  32      255258+  rhel8,Xeon,E5-2683,Pascal          gpu:gtx1080ti:8
clip12,gammagpu[10-17]                  16      126203+  rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa6000:4
clip01                                  32      255276  rhel8,Xeon,E5-2683,Pascal          gpu:titanxpascal:1,gpu:titanxp:2
clip10                                  44      1029404  rhel8,Xeon,E5-2699                  (null)
cml[00,02-11,13-14],tron[62-63,65-66,68- 32      351530+  rhel8,Xeon,4216,Turing              gpu:rtx2080ti:8
cml01                                    32      383030  rhel8,Xeon,4216,Turing              gpu:rtx2080ti:6
cml12                                    32      383038  rhel8,Xeon,4216,Turing,Ampere      gpu:rtx2080ti:7,gpu:rtxa4000:1
cml[15-16]                              32      383038  rhel8,Xeon,4216,Turing              gpu:rtx2080ti:7
cml[17-28],gammagpu05                    32      255225+  rhel8,Zen,EPYC-7282,Ampere          gpu:rtxa4000:8
cml31                                    32      384094  rhel8,Zen,EPYC-9124,Ampere          gpu:a100:1
cml32                                    64      512999  rhel8,Zen,EPYC-7543,Ampere          gpu:a100:4
cmlcpu[01-04]                            20      384271  rhel8,Xeon,E5-2660                  (null)
gammagpu00                              32      255233  rhel8,Zen,EPYC-7302,Ampere          gpu:rtxa5000:8
mbrc[00-01]                              20      189498  rhel8,Xeon,4114,Turing              gpu:rtx2080ti:8
twist[00-01]                            8        61727    rhel8,Xeon,E5-1660                  (null)
legacygpu08                              20      513327  rhel8,Xeon,E5-2640,Maxwell          gpu:m40:2
brigid[16-17]                            48      512897  rhel8,Zen,EPYC-7443                (null)
brigid[18-19]                            20      61739    rhel8,Xeon,E5-2640                  (null)
legacygpu06                              20      255249  rhel8,Xeon,E5-2699,Maxwell          gpu:gtxtitanx:4
tron[00-05]                              32      255233  rhel8,Zen,EPYC-7302,Ampere          gpu:rtxa6000:8
tron[46-61]                              48      255232  rhel8,Zen,EPYC-7352,Ampere          gpu:rtxa5000:8
tron[64,67]                              32      383028+  rhel8,Xeon,4216,Turing,Ampere      gpu:rtx2080ti:7,gpu:rtx3070:1
vulcan00                                32      255259  rhel8,Xeon,E5-2683,Pascal          gpu:p6000:7,gpu:p100:1
vulcan[01-04,06-07]                      32      255259  rhel8,Xeon,E5-2683,Pascal          gpu:p6000:8
vulcan05                                32      255259  rhel8,Xeon,E5-2683,Pascal          gpu:p6000:7
janus[02-04]                            40      383025  rhel8,Xeon,6248,Turing              gpu:rtx2080ti:10
legacygpu00                              20      255249  rhel8,Xeon,E5-2650,Pascal          gpu:titanxp:4
legacygpu[01-02,07]                      20      255249+  rhel8,Xeon,E5-2650,Maxwell          gpu:gtxtitanx:4
legacygpu[03-04]                        16      255268  rhel8,Xeon,E5-2630,Maxwell          gpu:gtxtitanx:2
legacygpu05                              44      513193  rhel8,Xeon,E5-2699,Pascal          gpu:gtx1080ti:4
vulcan23                                32      383030  rhel8,Xeon,4612,Turing              gpu:rtx2080ti:8
vulcan26                                24      770126  rhel8,Xeon,6146,Pascal              gpu:titanxp:10
vulcan[27-28]                            56      770093  rhel8,Xeon,8280,Turing              gpu:rtx2080ti:10
vulcan24                                16      126216  rhel8,Zen,7282,Ampere              gpu:rtxa6000:4
gammagpu[01-04,06-09],vulcan[33-37]      32      255215+  rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa5000:8
vulcan[38-44]                            32      255215  rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa4000:8
</pre>
 
Note that not all of the nodes shown by this command are necessarily in a partition you are able to submit to.
 
You can identify further specific information about a node using [[SLURM/ClusterStatus#scontrol | scontrol]] with various flags.
 
There are also two command aliases developed by UMIACS staff to show various node information in aggregate. They are <code>show_nodes</code> and <code>show_available_nodes</code>.
 
==show_nodes==
The <code>show_nodes</code> command alias shows each node's name, number of CPUs, memory, {OS, CPU architecture, CPU type, GPU architecture (if the node has GPUs)} (as AVAIL_FEATURES), GRES (GPUs), and State. It essentially wraps the <tt>sinfo</tt> command with some pre-determined output format options and shows each node on its own line, in alphabetical order.
 
To only view nodes in a specific partition, append <code>-p <partition name></code> to the command alias.
 
===Examples===
<pre>
$ show_nodes
NODELIST            CPUS      MEMORY    AVAIL_FEATURES                          GRES                            STATE
brigid16            48        512897    rhel8,x86_64,Zen,EPYC-7443              (null)                          idle
brigid17            48        512897    rhel8,x86_64,Zen,EPYC-7443              (null)                          idle
...                  ...        ...        ...                                      ...                              ...
vulcan45            32        513250    rhel8,x86_64,Zen,EPYC-7313,Ampere        gpu:rtxa6000:8                  idle
</pre>
 
(specific partition)
<pre>
$ show_nodes -p tron
NODELIST            CPUS      MEMORY    AVAIL_FEATURES                          GRES                            STATE
tron00              32        255233    rhel8,x86_64,Zen,EPYC-7302,Ampere        gpu:rtxa6000:8                  idle
tron01              32        255233    rhel8,x86_64,Zen,EPYC-7302,Ampere        gpu:rtxa6000:8                  idle
...                  ...        ...        ...                                      ...                              ...
tron69              32        383030    rhel8,x86_64,Xeon,4216,Turing            gpu:rtx2080ti:8                  idle
</pre>
 
==show_available_nodes==
The <code>show_available_nodes</code> command alias takes zero or more arguments that correspond to Slurm constructs, resources, or features that you are looking to request a job with and tells you what nodes could '''theoretically'''[0,1] run a job with these arguments immediately. It assumes your job is a single-node job.
 
These arguments are:
* <code>--partition</code>: Only include nodes in the specified partition(s).
* <code>--account</code>: Only include nodes from partitions that can use the specified account(s).
* <code>--qos</code>: Only include nodes from partitions that can use the specified QoS(es).
* <code>--cpus</code>: Only include nodes with at least this many CPUs free.
* <code>--mem</code>: Only include nodes with at least this much memory free. The default unit is MB if unspecified, but any of {K,M,G,T} can be suffixed to the number provided (will then be interpreted as KB, MB, GB, or TB, respectively).
* GRES-related arguments:
** <code>--gres</code>, <code>--and-gres</code>: Only include nodes whose list of GRES contains ''all'' of the specified GRES type/quantity pairings.
** <code>--or-gres</code>: Only include nodes whose list of GRES contains ''any'' of the specified GRES type/quantity pairings. Functionally identical to <tt>--and-gres</tt> if only one GRES type/quantity pairing is specified.
* GPU-related arguments:
** <code>--gpus</code>, <code>--and-gpus</code>: Only include nodes whose list of GPUs (a subset of GRES) contains ''all'' of the specified GPU type/quantity pairings.
** <code>--or-gpus</code>: Only include nodes whose list of GPUs (a subset of GRES) contains ''any'' of the specified GPU type/quantity pairings. Functionally identical to <tt>--and-gpus</tt> if only one GPU type/quantity pairing is specified.
* Feature-related arguments:
** <code>--feature</code>, <code>--and-feature</code>: Only include nodes whose list of features contains ''all'' of the specified feature(s).
** <code>--or-feature</code>: Only include nodes whose list of features contains ''any'' of the specified feature(s). Functionally identical to <tt>--and-feature</tt> if only one feature is specified.
 
These arguments are also viewable by running <code>show_available_nodes -h</code>.
 
If your passed argument set does not contain any resource-based arguments (CPUs/RAM/GRES or GPUs), a node is defined as available if it has at least 1 CPU and 1MB of RAM available.
 
If there are no nodes available that meet your passed argument set, you will receive the message <tt>There are no nodes that have currently free resources that meet this argument set.</tt>
 
===Footnotes===
[0] - As of now, this command alias does not factor in resources occupied by jobs that could be preempted (based on the partition(s) passed to it, if present). This is soon to come.


[1] - This command alias also does not factor in jobs with higher priority values requesting more resources, in the same partition(s), blocking execution of a job submitted with the arguments checked by the command alias. This is due to the complexity of calculating a job's priority value before it is actually submitted.
===Examples===
Show all available nodes:
<pre>
$ show_available_nodes
brigid17
  cpus=16,mem=414593M
brigid18
  cpus=8,mem=24875M
...
</pre>


=Identifying Resources and Features=
Show nodes available in the <tt>tron</tt> partition:
The sinfo can show you additional features of nodes in the cluster but you need to ask it to show some non-default options using a command like this
<pre>
<code>sinfo -o "%15N %10c %10m  %25f %10G"</code>.
$ show_available_nodes --partition tron
tron00
  cpus=14,mem=50433M,gres=gpu:rtxa6000:1
tron01
  cpus=10,mem=17665M,gres=gpu:rtxa6000:2
...
</pre>
 
Show nodes with one or more RTX A5000 or RTX A6000 GPUs available to the <tt>vulcan</tt> account:
<pre>
$ show_available_nodes --account vulcan --or-gpus rtxa5000:1,rtxa6000:1
vulcan32
  cpus=16,mem=193778M,gres=gpu:rtxa6000:4
vulcan33
  cpus=15,mem=181499M,gres=gpu:rtxa5000:3
...
</pre>
 
Show nodes with 4 or more CPUs, 48G or more memory, and one or more RTX A6000 GPUs available in the <tt>scavenger</tt> partition:
<pre>
$ show_available_nodes --partition=scavenger --cpus=4 --mem=48g --or-gpus=rtxa6000:1
cbcb27
  cpus=59,mem=218303M,gres=gpu:rtxa6000:6
clip06
  cpus=20,mem=93448M,gres=gpu:rtxa6000:1
...
</pre>
 
Show nodes with [https://www.nvidia.com/en-us/geforce/turing Turing] or [https://www.nvidia.com/en-us/data-center/ampere-architecture Ampere] architecture GPUs available in the <tt>scavenger</tt> partition:
<pre>
$ show_available_nodes --partition=scavenger --or-feature=Ampere,Turing
cbcb25
  cpus=24,mem=255278M,gres=gpu:rtx2080ti:1,gpu:gtx1080ti:1
cbcb26
  cpus=127,mem=447707M,gres=gpu:rtxa5000:7
...
</pre>


Show nodes with [https://www.amd.com/en/technologies/zen-core Zen] architecture CPUs and [https://www.nvidia.com/en-us/data-center/ampere-architecture Ampere] architecture GPUs available in the <tt>scavenger</tt> partition:
<pre>
$ show_available_nodes --partition=scavenger --and-feature=Zen,Ampere
cbcb26
  cpus=127,mem=447707M,gres=gpu:rtxa5000:7
cbcb27
  cpus=59,mem=218303M,gres=gpu:rtxa6000:6
...
</pre>

<pre>
$ sinfo -o "%15N %10c %10m  %25f %10G"
NODELIST        CPUS      MEMORY      AVAIL_FEATURES            GRES
openlab[00-07]  8          7822        (null)                    (null)
openlab08      16        128720      (null)                    gpu:k20:2
openlab09      16        128722      (null)                    gpu:3
</pre>


You can also identify further specific information about a node using [https://wiki.umiacs.umd.edu/umiacs/index.php/SLURM/ClusterStatus#scontrol scontrol].
(bogus example) Attempt to show nodes available in the <tt>bogus</tt> partition:
<pre>
$ show_available_nodes --partition=bogus
There are no nodes that have currently free resources that meet this argument set.
</pre>


=Requesting GPUs=
=Requesting GPUs=
If you need to do processing on a GPU, you will need to request that your job have access to GPUs just as you need to request processors or cpu cores. You will also need to make sure that you submit your job to the correct partition since nodes with GPUs are often put into their own partition to prevent the nodes from being tied up by jobs that don't utilize GPUs. In SLURM, GPUs are considered "generic resources" also known as GRES. To request some number of GPUs be reserved/available for your job you can use the flag <code>--gres:gpu:2</code> or if there are multiple types of GPUs available in the cluster and you need a specific type, you can provide the type option to the gres flag <code>--gres:k20:1</code>
If you need to do processing on a GPU, you will need to request that your job have access to GPUs just as you need to request processors or CPU cores. In SLURM, GPUs are considered "generic resources" also known as GRES. To request some number of GPUs be reserved/available for your job, you can use the flag <code>--gres=gpu:#</code> (with the actual number of GPUs you want). If there are multiple types of GPUs available in the cluster and you need a specific type, you can provide the type option to the gres flag e.g. <code>--gres=gpu:rtxa5000:#</code>. If you do not request a specific type of GPU, you are likely to be scheduled on an older, lower spec'd GPU.
 
Note that some QoSes may have limits on the number of GPUs you can request per job, so you may need to specify a different QoS to request more GPUs.
 
<pre>
$ srun --pty --qos=medium --gres=gpu:2 nvidia-smi
...
Wed Mar  6 16:59:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03            Driver Version: 535.129.03  CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf          Pwr:Usage/Cap |        Memory-Usage | GPU-Util  Compute M. |
|                                        |                      |              MIG M. |
|=========================================+======================+======================|
|  0  NVIDIA GeForce RTX 2080 Ti    Off | 00000000:3D:00.0 Off |                  N/A |
| 32%  23C    P8              1W / 250W |      0MiB / 11264MiB |      0%      Default |
|                                        |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|  1  NVIDIA GeForce RTX 2080 Ti    Off | 00000000:40:00.0 Off |                  N/A |
| 32%  25C    P8              1W / 250W |      0MiB / 11264MiB |      0%      Default |
|                                        |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
 
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU  GI  CI        PID  Type  Process name                            GPU Memory |
|        ID  ID                                                            Usage      |
|=======================================================================================|
|  No running processes found                                                          |
+---------------------------------------------------------------------------------------+
</pre>
 
Please note that your job will only be able to see/access the GPUs you requested. If you only need 1 GPU, please only request 1 GPU. The others on the node (if any) will be left available for other users.
 
<pre>
$ srun --pty --gres=gpu:rtxa5000:1 nvidia-smi
Thu Aug 25 15:22:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06  Driver Version: 470.129.06  CUDA Version: 11.4    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|        Memory-Usage | GPU-Util  Compute M. |
|                              |                      |              MIG M. |
|===============================+======================+======================|
|  0  NVIDIA RTX A5000    Off  | 00000000:01:00.0 Off |                 Off |
| 30%  23C   P8   20W / 230W |     0MiB / 24256MiB |      0%      Default |
|                               |                     |                 N/A |
+-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+
</pre>

<pre>
tgray26@opensub01:srun --pty --partition gpu --gres=gpu:2 nvidia-smi
Wed Jul 13 15:33:18 2016
+------------------------------------------------------+
| NVIDIA-SMI 361.28    Driver Version: 361.28        |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|        Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|  0  Tesla K20c          Off  | 0000:03:00.0     Off |                   0 |
| 30%  24C   P0   48W / 225W |     11MiB / 4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  1  Tesla K20c          Off  | 0000:84:00.0    Off |                   0 |
| 30%  23C    P0    52W / 225W |    11MiB / 4799MiB |    93%      Default |
+-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+
</pre>
Please note that your job will only be able to see/access the GPUs you requested. If you only need 1 GPU, please request only 1 GPU and the other one will be left available for other users:
<pre>
tgray26@opensub01:srun --pty --partition gpu --gres=gpu:k20:1 nvidia-smi
Wed Jul 13 15:31:29 2016
+------------------------------------------------------+
| NVIDIA-SMI 361.28    Driver Version: 361.28        |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|        Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|  0  Tesla K20c          Off  | 0000:03:00.0    Off |                    0 |
| 30%  24C    P0    50W / 225W |    11MiB /  4799MiB |    92%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                      GPU Memory |
|  GPU      PID  Type  Process name                              Usage      |
|=============================================================================|
|  No running processes found                                                |
+-----------------------------------------------------------------------------+
</pre>
The <code>--gres</code> flag may also be passed to [[#sbatch | sbatch]] and [[#salloc | salloc]] rather than directly to [[#srun | srun]]

As with all other flags, the <code>--gres</code> flag may also be passed to [[#sbatch | sbatch]] and [[#salloc | salloc]] rather than directly to [[#srun | srun]].

=MPI example=
To run [https://en.wikipedia.org/wiki/Message_Passing_Interface MPI] jobs, you will need to include the <code>--mpi=pmix</code> flag in your submission arguments.

<pre>
#!/usr/bin/bash
#SBATCH --job-name=mpi_test # Job name
#SBATCH --nodes=4 # Number of nodes
#SBATCH --ntasks=8 # Number of MPI ranks
#SBATCH --ntasks-per-node=2 # Number of MPI ranks per node
#SBATCH --ntasks-per-socket=1 # Number of tasks per processor socket on the node
#SBATCH --time=00:30:00 # Time limit hrs:min:sec

srun --mpi=pmix /nfshomes/username/testing/mpi/a.out
</pre>

Latest revision as of 16:21, 13 August 2024

Job Submission

SLURM offers a variety of ways to run jobs. It is important to understand the different options available and how to request the resources required for a job in order for it to run successfully. All job submission should be done from submit nodes; any computational code should be run in a job allocation on compute nodes. The following commands outline how to allocate resources on the compute nodes and submit processes to be run on the allocated nodes.

The cluster that everyone with a UMIACS account has access to is Nexus. Please visit the Nexus page for instructions on how to connect to your assigned submit nodes.

Computationally intensive processes run on submission nodes will be terminated. Please submit jobs to be scheduled on compute nodes for this purpose.

For details on how SLURM decides how to schedule jobs when multiple jobs are waiting in a scheduler's queue, please see SLURM/Priority.

srun

The srun command is used to run a process on the compute nodes in the cluster. If you pass it a normal shell command (or command that executes a script), it will submit a job to run that shell command/script on a compute node and then return. srun accepts many command line options to specify the resources required by the command passed to it. Some common command line arguments are listed below and full documentation of all available options is available in the man page for srun, which can be accessed by running man srun.

$ srun --qos=default --mem=100mb --time=1:00:00 bash -c 'echo "Hello World from" `hostname`'
Hello World from tron33.umiacs.umd.edu

It is important to understand that srun is an interactive command. By default input to srun is broadcast to all compute nodes running your process and output from the compute nodes is redirected to srun. This behavior can be changed; however, srun will always wait for the command passed to finish before exiting, so if you start a long running process and end your terminal session, your process will stop running on the compute nodes and your job will end. To run a non-interactive submission that will remain running after you logout, you will need to wrap your srun commands in a batch script and submit it with sbatch.

Common srun Arguments

  • --job-name=<JOBNAME> Requests your job be named <JOBNAME>
  • --mem=1g Requests 1GB of memory for your job; if no unit is given, MB is assumed
  • --ntasks=2 Requests 2 "tasks" which map to cores on a CPU for your job; if passed to srun, runs the given command concurrently on each core
  • --nodes=2 Requests 2 nodes be allocated to your job; if passed to srun, runs the given command concurrently on each node
  • --nodelist=<NODENAME> Requests to run your job on the <NODENAME> node
  • --time=dd-hh:mm:ss Requests your job run for dd days, hh hours, mm minutes, and ss seconds
  • --error=<ERRNAME> Redirects stderr for your job to the <ERRNAME> file
  • --partition=<PARTITIONNAME> Requests your job run in the <PARTITIONNAME> partition
  • --qos=<QOSNAME> Requests your job run with the <QOSNAME> QOS; to see the available QOS options on a cluster, run show_qos
  • --account=<ACCOUNTNAME> Requests your job run under the <ACCOUNTNAME> Slurm account; different accounts have different available partitions/QOS
  • --output=<OUTNAME> Redirects stdout for your job to the <OUTNAME> file
  • --requeue Requests your job be automatically requeued if it is preempted
  • --exclusive Requests your job be the only one running on the node(s) it is assigned to. This requires that your job be allocated all of the resources on the node(s). The scheduler does not automatically give your job all of the node's/nodes' resources, however, so if you need more than the default, you still need to request these with --ntasks and --mem
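
As a rough illustration of the --time format listed above, the following sketch converts a value written as [dd-]hh:mm:ss into seconds, which can be handy when sanity-checking a request before submitting. The function name slurm_time_to_seconds is hypothetical and not part of Slurm, and the sketch assumes all three of hh, mm, and ss are present.

```shell
#!/bin/bash
# slurm_time_to_seconds: hypothetical helper, not a Slurm command.
# Assumes the full [dd-]hh:mm:ss form; shorter forms are not handled here.
slurm_time_to_seconds() {
    local spec="$1" days=0 h m s
    # Split off the optional "dd-" day prefix.
    if [[ "$spec" == *-* ]]; then
        days="${spec%%-*}"
        spec="${spec#*-}"
    fi
    # Split the remainder on ':' into hours, minutes, seconds.
    IFS=: read -r h m s <<< "$spec"
    # The 10# prefix forces base 10 so values like "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}
```

For example, slurm_time_to_seconds 1:00:00 prints the number of seconds in the one-hour limit used in the srun example above.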

Interactive Shell Sessions

An interactive shell session on a compute node can be useful for debugging or developing code that isn't ready to be run as a batch job. To get an interactive shell on a node, use srun with the --pty argument to invoke a shell:

$ srun --pty --qos=default --mem=1g --time=01:00:00 bash
$ hostname
tron33.umiacs.umd.edu

Please do not leave interactive shells running for long periods of time when you are not working. This blocks resources from being used by everyone else.

salloc

The salloc command can also be used to request resources be allocated without needing a batch script. Running salloc with a list of resources will allocate the resources you requested, create a job, and drop you into a subshell with the environment variables necessary to run commands in the newly created job allocation. When your time is up or you exit the subshell, your job allocation will be relinquished.

$ salloc --qos=default -N 1 --mem=2g --time=01:00:00
salloc: Granted job allocation 159
$ srun /usr/bin/hostname
tron33.umiacs.umd.edu
$ exit
exit
salloc: Relinquishing job allocation 159

Please note that any commands not invoked with srun will be run locally on the submit node. Please be careful when using salloc.

sbatch

The sbatch command allows you to write a batch script to be submitted and run non-interactively on the compute nodes. To run a simple Hello World command on the compute nodes you could write a file, helloWorld.sh with the following contents:

#!/bin/bash

srun bash -c 'echo Hello World from `hostname`'

Then you need to submit the script with sbatch and request resources:

$ sbatch --qos=default --mem=1g --time=1:00:00 helloWorld.sh
Submitted batch job 121

SLURM will return a job number that you can use to check the status of your job with squeue:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               121      tron helloWor username  R       0:01      1 tron32
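
If you want to use the returned job number in a script, one possible approach is to pull it out of the confirmation line that sbatch prints. The sketch below assumes the line has the "Submitted batch job <ID>" form shown above; extract_job_id is a hypothetical helper name, not a Slurm command.

```shell
#!/bin/bash
# extract_job_id: hypothetical helper, not a Slurm command.
# Reads sbatch's confirmation line on stdin and prints the job ID,
# assuming the "Submitted batch job <ID>" format shown above.
extract_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Possible usage on a submit node (left commented out so that this
# file can be sourced without submitting anything):
#   jobid=$(sbatch --qos=default --mem=1g --time=1:00:00 helloWorld.sh | extract_job_id)
#   squeue --jobs "$jobid"
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which may be simpler if you do not need the full message.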

Advanced Batch Scripts

You can also write a batch script with all of your resources/options defined in the script itself. This is useful for jobs that need to be run tens/hundreds/thousands of times. You can then handle any necessary environment setup and run commands on the resources you requested by invoking commands with srun. The srun commands can also be more complex and be told to only use portions of your entire job allocation; each of these distinct srun commands makes up one "job step". The batch script will be run on the first node allocated as part of your job allocation and each job step will be run on whatever resources you tell it to. In the following example, we have a batch job that will request 2 nodes in the cluster. We then load a specific version of Python into our environment and submit two job steps, each one using one node. Since srun blocks until the command finishes, we use the '&' operator to background the process so that both job steps can run at once; however, this means that we then need to use the wait command to block processing until all background processes have finished.

#!/bin/bash

# Lines that begin with #SBATCH specify commands to be used by SLURM for scheduling

#SBATCH --job-name=helloWorld                                        # sets the job name
#SBATCH --output=helloWorld.out.%j                                   # indicates a file to redirect STDOUT to; %j is the jobid. If set, must be set to a file instead of a directory or else submission will fail.
#SBATCH --error=helloWorld.out.%j                                    # indicates a file to redirect STDERR to; %j is the jobid. If set, must be set to a file instead of a directory or else submission will fail.
#SBATCH --time=00:05:00                                              # how long you would like your job to run; format=dd-hh:mm:ss
#SBATCH --qos=default                                                # set QOS, this will determine what resources can be requested
#SBATCH --nodes=2                                                    # number of nodes to allocate for your job
#SBATCH --ntasks=4                                                   # request 4 cpu cores be reserved for your job in total
#SBATCH --ntasks-per-node=2                                          # request 2 cpu cores be reserved per node
#SBATCH --mem=1g                                                     # memory required by job; if unit is not specified MB will be assumed. for multi-node jobs, this argument allocates this much memory *per node*

srun --nodes=1 --mem=512m bash -c "hostname; python3 --version" &    # use srun to invoke commands within your job; using an '&'
srun --nodes=1 --mem=512m bash -c "hostname; python3 --version" &    # will background the process allowing them to run concurrently
wait                                                                 # wait for any background processes to complete

# once the end of the batch script is reached your job allocation will be revoked

Another useful thing to know is that you can pass additional arguments into your sbatch scripts on the command line and reference them as ${1} for the first argument and so on.
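
A minimal sketch of this, assuming a script saved as args.sh: the job name, the resource values, and the describe_input helper are all made up for illustration. Defaults are supplied with ${1:-...} so the script still runs if an argument is left off.

```shell
#!/bin/bash
#SBATCH --job-name=argDemo
#SBATCH --time=00:05:00
#SBATCH --mem=100mb

# describe_input: hypothetical helper used only for illustration.
# ${1} is the first argument given after the script name on the
# sbatch command line, ${2} is the second, and so on.
describe_input() {
    local input="${1:-none}"     # first argument, or "none" if missing
    local repeats="${2:-1}"      # second argument, or 1 if missing
    echo "input=${input} repeats=${repeats}"
}

# Forward every argument the batch script received to the helper.
describe_input "$@"
```

Submitting this with sbatch --qos=default args.sh data.txt 3 would pass data.txt as ${1} and 3 as ${2}.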

More Examples

scancel

The scancel command can be used to cancel job allocations or job steps that are no longer needed. It can be passed individual job IDs or an option to delete all of your jobs or jobs that meet certain criteria.

  • scancel 255 cancel job 255
  • scancel 255.3 cancel job step 3 of job 255
  • scancel --user username --partition=tron cancel all jobs for username in the tron partition

Identifying Resources and Features

The sinfo command can show you additional features of nodes in the cluster but you need to ask it to show some non-default options using a command like sinfo -o "%40N %8c %8m %35f %35G".

$ sinfo -o "%40N %8c %8m %35f %35G"
NODELIST                                 CPUS     MEMORY   AVAIL_FEATURES                      GRES
legacy00                                 48       125940   rhel8,Zen,EPYC-7402                 (null)
legacy[01-11,13-19,22-28,30]             12+      61804+   rhel8,Xeon,E5-2620                  (null)
cbcb[23-24],twist[02-05]                 24       255150   rhel8,Xeon,E5-2650                  (null)
cbcb26                                   128      513243   rhel8,Zen,EPYC-7763,Ampere          gpu:rtxa5000:8
cbcb27                                   64       255167   rhel8,Zen,EPYC-7513,Ampere          gpu:rtxa6000:8
cbcb[00-21]                              32       2061175  rhel8,Zen,EPYC-7313                 (null)
cbcb22,cmlcpu[00,06-07],legacy20         24+      384270+  rhel8,Xeon,E5-2680                  (null)
cbcb25                                   24       255278   rhel8,Xeon,E5-2650,Pascal,Turing    gpu:rtx2080ti:1,gpu:gtx1080ti:1
legacy21                                 8        61746    rhel8,Xeon,E5-2623                  (null)
tron[06-09,12-15,21]                     16       126214+  rhel8,Zen,EPYC-7302P,Ampere         gpu:rtxa4000:4
tron[10-11,16-20,34]                     16       126217   rhel8,Zen,EPYC-7313P,Ampere         gpu:rtxa4000:4
tron[22-33,35-45]                        16       126214+  rhel8,Zen,EPYC-7302,Ampere          gpu:rtxa4000:4
clip11                                   16       126217   rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa4000:4
clip00                                   32       255276   rhel8,Xeon,E5-2683,Pascal           gpu:titanxpascal:3
clip02                                   20       126255   rhel8,Xeon,E5-2630,Pascal           gpu:gtx1080ti:3
clip03                                   20       126243   rhel8,Xeon,E5-2630,Pascal,Turing    gpu:rtx2080ti:1,gpu:gtx1080ti:2
clip04                                   32       255233   rhel8,Zen,EPYC-7302,Ampere          gpu:rtx3090:4
clip[05-06]                              24       126216   rhel8,Zen,EPYC-7352,Ampere          gpu:rtxa6000:2
clip07                                   8        255263   rhel8,Xeon,E5-2623,Pascal           gpu:gtx1080ti:3
clip09                                   32       383043   rhel8,Xeon,6130,Pascal,Turing       gpu:rtx2080ti:5,gpu:gtx1080ti:3
clip13,cml30,vulcan[29-32]               32       255218+  rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa6000:8
clip08,vulcan[08-22,25]                  32       255258+  rhel8,Xeon,E5-2683,Pascal           gpu:gtx1080ti:8
clip12,gammagpu[10-17]                   16       126203+  rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa6000:4
clip01                                   32       255276   rhel8,Xeon,E5-2683,Pascal           gpu:titanxpascal:1,gpu:titanxp:2
clip10                                   44       1029404  rhel8,Xeon,E5-2699                  (null)
cml[00,02-11,13-14],tron[62-63,65-66,68- 32       351530+  rhel8,Xeon,4216,Turing              gpu:rtx2080ti:8
cml01                                    32       383030   rhel8,Xeon,4216,Turing              gpu:rtx2080ti:6
cml12                                    32       383038   rhel8,Xeon,4216,Turing,Ampere       gpu:rtx2080ti:7,gpu:rtxa4000:1
cml[15-16]                               32       383038   rhel8,Xeon,4216,Turing              gpu:rtx2080ti:7
cml[17-28],gammagpu05                    32       255225+  rhel8,Zen,EPYC-7282,Ampere          gpu:rtxa4000:8
cml31                                    32       384094   rhel8,Zen,EPYC-9124,Ampere          gpu:a100:1
cml32                                    64       512999   rhel8,Zen,EPYC-7543,Ampere          gpu:a100:4
cmlcpu[01-04]                            20       384271   rhel8,Xeon,E5-2660                  (null)
gammagpu00                               32       255233   rhel8,Zen,EPYC-7302,Ampere          gpu:rtxa5000:8
mbrc[00-01]                              20       189498   rhel8,Xeon,4114,Turing              gpu:rtx2080ti:8
twist[00-01]                             8        61727    rhel8,Xeon,E5-1660                  (null)
legacygpu08                              20       513327   rhel8,Xeon,E5-2640,Maxwell          gpu:m40:2
brigid[16-17]                            48       512897   rhel8,Zen,EPYC-7443                 (null)
brigid[18-19]                            20       61739    rhel8,Xeon,E5-2640                  (null)
legacygpu06                              20       255249   rhel8,Xeon,E5-2699,Maxwell          gpu:gtxtitanx:4
tron[00-05]                              32       255233   rhel8,Zen,EPYC-7302,Ampere          gpu:rtxa6000:8
tron[46-61]                              48       255232   rhel8,Zen,EPYC-7352,Ampere          gpu:rtxa5000:8
tron[64,67]                              32       383028+  rhel8,Xeon,4216,Turing,Ampere       gpu:rtx2080ti:7,gpu:rtx3070:1
vulcan00                                 32       255259   rhel8,Xeon,E5-2683,Pascal           gpu:p6000:7,gpu:p100:1
vulcan[01-04,06-07]                      32       255259   rhel8,Xeon,E5-2683,Pascal           gpu:p6000:8
vulcan05                                 32       255259   rhel8,Xeon,E5-2683,Pascal           gpu:p6000:7
janus[02-04]                             40       383025   rhel8,Xeon,6248,Turing              gpu:rtx2080ti:10
legacygpu00                              20       255249   rhel8,Xeon,E5-2650,Pascal           gpu:titanxp:4
legacygpu[01-02,07]                      20       255249+  rhel8,Xeon,E5-2650,Maxwell          gpu:gtxtitanx:4
legacygpu[03-04]                         16       255268   rhel8,Xeon,E5-2630,Maxwell          gpu:gtxtitanx:2
legacygpu05                              44       513193   rhel8,Xeon,E5-2699,Pascal           gpu:gtx1080ti:4
vulcan23                                 32       383030   rhel8,Xeon,4612,Turing              gpu:rtx2080ti:8
vulcan26                                 24       770126   rhel8,Xeon,6146,Pascal              gpu:titanxp:10
vulcan[27-28]                            56       770093   rhel8,Xeon,8280,Turing              gpu:rtx2080ti:10
vulcan24                                 16       126216   rhel8,Zen,7282,Ampere               gpu:rtxa6000:4
gammagpu[01-04,06-09],vulcan[33-37]      32       255215+  rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa5000:8
vulcan[38-44]                            32       255215   rhel8,Zen,EPYC-7313,Ampere          gpu:rtxa4000:8

Note that not all of the nodes shown by this command are necessarily in a partition you are able to submit to.

You can identify further specific information about a node using scontrol with various flags.

There are also two command aliases developed by UMIACS staff to show various node information in aggregate. They are show_nodes and show_available_nodes.

show_nodes

The show_nodes command alias shows each node's name, number of CPUs, memory, {OS, CPU architecture, CPU type, GPU architecture (if the node has GPUs)} (as AVAIL_FEATURES), GRES (GPUs), and State. It essentially wraps the sinfo command with some pre-determined output format options and shows each node on its own line, in alphabetical order.

To only view nodes in a specific partition, append -p <partition name> to the command alias.

Examples

$ show_nodes
NODELIST             CPUS       MEMORY     AVAIL_FEATURES                           GRES                             STATE
brigid16             48         512897     rhel8,x86_64,Zen,EPYC-7443               (null)                           idle
brigid17             48         512897     rhel8,x86_64,Zen,EPYC-7443               (null)                           idle
...                  ...        ...        ...                                      ...                              ...
vulcan45             32         513250     rhel8,x86_64,Zen,EPYC-7313,Ampere        gpu:rtxa6000:8                   idle

(specific partition)

$ show_nodes -p tron
NODELIST             CPUS       MEMORY     AVAIL_FEATURES                           GRES                             STATE
tron00               32         255233     rhel8,x86_64,Zen,EPYC-7302,Ampere        gpu:rtxa6000:8                   idle
tron01               32         255233     rhel8,x86_64,Zen,EPYC-7302,Ampere        gpu:rtxa6000:8                   idle
...                  ...        ...        ...                                      ...                              ...
tron69               32         383030     rhel8,x86_64,Xeon,4216,Turing            gpu:rtx2080ti:8                  idle

show_available_nodes

The show_available_nodes command alias takes zero or more arguments that correspond to Slurm constructs, resources, or features that you are looking to request a job with and tells you what nodes could theoretically[0,1] run a job with these arguments immediately. It assumes your job is a single-node job.

These arguments are:

  • --partition: Only include nodes in the specified partition(s).
  • --account: Only include nodes from partitions that can use the specified account(s).
  • --qos: Only include nodes from partitions that can use the specified QoS(es).
  • --cpus: Only include nodes with at least this many CPUs free.
  • --mem: Only include nodes with at least this much memory free. The default unit is MB if unspecified, but any of {K,M,G,T} can be suffixed to the number provided (will then be interpreted as KB, MB, GB, or TB, respectively).
  • GRES-related arguments:
    • --gres, --and-gres: Only include nodes whose list of GRES contains all of the specified GRES type/quantity pairings.
    • --or-gres: Only include nodes whose list of GRES contains any of the specified GRES type/quantity pairings. Functionally identical to --and-gres if only one GRES type/quantity pairing is specified.
  • GPU-related arguments:
    • --gpus, --and-gpus: Only include nodes whose list of GPUs (a subset of GRES) contains all of the specified GPU type/quantity pairings.
    • --or-gpus: Only include nodes whose list of GPUs (a subset of GRES) contains any of the specified GPU type/quantity pairings. Functionally identical to --and-gpus if only one GPU type/quantity pairing is specified.
  • Feature-related arguments:
    • --feature, --and-feature: Only include nodes whose list of features contains all of the specified feature(s).
    • --or-feature: Only include nodes whose list of features contains any of the specified feature(s). Functionally identical to --and-feature if only one feature is specified.

These arguments are also viewable by running show_available_nodes -h.
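
To illustrate how a --mem value with an optional unit suffix might be read, here is a small sketch that normalizes such a value to MB. It is not how show_available_nodes is implemented; mem_to_mb is a hypothetical helper, and the use of 1024 as the step between units is an assumption.

```shell
#!/bin/bash
# mem_to_mb: hypothetical helper, not part of show_available_nodes.
# Converts a value such as 48g, 2048K, or 100 into whole MB.
# Assumption: each unit is 1024 times the previous one.
mem_to_mb() {
    local spec="$1"
    local number="${spec%[kKmMgGtT]}"   # strip one trailing unit letter, if any
    local unit="${spec#"$number"}"      # the stripped letter (empty if none)
    case "$unit" in
        [kK])    echo $(( number / 1024 )) ;;        # KB -> MB (integer division)
        ""|[mM]) echo "$number" ;;                   # MB is the default unit
        [gG])    echo $(( number * 1024 )) ;;        # GB -> MB
        [tT])    echo $(( number * 1024 * 1024 )) ;; # TB -> MB
    esac
}
```

Under these assumptions, the 48g used in the examples on this page corresponds to 49152 MB.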

If your passed argument set does not contain any resource-based arguments (CPUs/RAM/GRES or GPUs), a node is defined as available if it has at least 1 CPU and 1MB of RAM available.

If there are no nodes available that meet your passed argument set, you will receive the message There are no nodes that have currently free resources that meet this argument set.

Footnotes

[0] - As of now, this command alias does not factor in resources occupied by jobs that could be preempted (based on the partition(s) passed to it, if present). This is soon to come.

[1] - This command alias also does not factor in jobs with higher priority values requesting more resources, in the same partition(s), blocking execution of a job submitted with the arguments checked by the command alias. This is due to the complexity of calculating a job's priority value before it is actually submitted.

Examples

Show all available nodes:

$ show_available_nodes
brigid17
  cpus=16,mem=414593M
brigid18
  cpus=8,mem=24875M
...

Show nodes available in the tron partition:

$ show_available_nodes --partition tron
tron00
  cpus=14,mem=50433M,gres=gpu:rtxa6000:1
tron01
  cpus=10,mem=17665M,gres=gpu:rtxa6000:2
...

Show nodes with one or more RTX A5000 or RTX A6000 GPUs available to the vulcan account:

$ show_available_nodes --account vulcan --or-gpus rtxa5000:1,rtxa6000:1
vulcan32
  cpus=16,mem=193778M,gres=gpu:rtxa6000:4
vulcan33
  cpus=15,mem=181499M,gres=gpu:rtxa5000:3
...

Show nodes with 4 or more CPUs, 48G or more memory, and one or more RTX A6000 GPUs available in the scavenger partition:

$ show_available_nodes --partition=scavenger --cpus=4 --mem=48g --or-gpus=rtxa6000:1
cbcb27
  cpus=59,mem=218303M,gres=gpu:rtxa6000:6
clip06
  cpus=20,mem=93448M,gres=gpu:rtxa6000:1
...

Show nodes with Turing or Ampere architecture GPUs available in the scavenger partition:

$ show_available_nodes --partition=scavenger --or-feature=Ampere,Turing
cbcb25
  cpus=24,mem=255278M,gres=gpu:rtx2080ti:1,gpu:gtx1080ti:1
cbcb26
  cpus=127,mem=447707M,gres=gpu:rtxa5000:7
...

Show nodes with Zen architecture CPUs and Ampere architecture GPUs available in the scavenger partition:

$ show_available_nodes --partition=scavenger --and-feature=Zen,Ampere
cbcb26
  cpus=127,mem=447707M,gres=gpu:rtxa5000:7
cbcb27
  cpus=59,mem=218303M,gres=gpu:rtxa6000:6
...

(bogus example) Attempt to show nodes available in the bogus partition:

$ show_available_nodes --partition=bogus
There are no nodes that have currently free resources that meet this argument set.
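
The two-line-per-node output shown in these examples can be awkward to filter with tools like grep. The sketch below joins each node name with its resource line, assuming the output format is exactly as shown above (an unindented node name followed by an indented resource line); flatten_available_nodes is a hypothetical helper name.

```shell
#!/bin/bash
# flatten_available_nodes: hypothetical helper, not provided by UMIACS.
# Reads show_available_nodes output on stdin and prints one line per node
# in the form "<node> <resources>". Assumes node names are unindented and
# resource lines are indented, as in the examples above.
flatten_available_nodes() {
    awk '
        /^[^[:space:]]/ { node = $1; next }   # unindented line: remember node name
        /^[[:space:]]+/ {                     # indented line: resources for that node
            sub(/^[[:space:]]+/, "")
            print node, $0
        }
    '
}

# Possible usage on a submit node (commented out so sourcing is harmless):
#   show_available_nodes --partition tron | flatten_available_nodes | grep rtxa6000
```

If no nodes match, the single-line message shown above produces no output from this helper, since it is never followed by an indented line.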

Requesting GPUs

If you need to do processing on a GPU, you will need to request that your job have access to GPUs just as you need to request processors or CPU cores. In SLURM, GPUs are considered "generic resources" also known as GRES. To request some number of GPUs be reserved/available for your job, you can use the flag --gres=gpu:# (with the actual number of GPUs you want). If there are multiple types of GPUs available in the cluster and you need a specific type, you can provide the type option to the gres flag e.g. --gres=gpu:rtxa5000:#. If you do not request a specific type of GPU, you are likely to be scheduled on an older, lower spec'd GPU.

Note that some QoSes may have limits on the number of GPUs you can request per job, so you may need to specify a different QoS to request more GPUs.

$ srun --pty --qos=medium --gres=gpu:2 nvidia-smi
...
Wed Mar  6 16:59:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:3D:00.0 Off |                  N/A |
| 32%   23C    P8               1W / 250W |      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:40:00.0 Off |                  N/A |
| 32%   25C    P8               1W / 250W |      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Please note that your job can only see and access the GPUs it requested. If you need only 1 GPU, please request only 1 GPU. Any others on the node will remain available for other users.

$ srun --pty --gres=gpu:rtxa5000:1 nvidia-smi
Thu Aug 25 15:22:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    Off  | 00000000:01:00.0 Off |                  Off |
| 30%   23C    P8    20W / 230W |      0MiB / 24256MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

As with all other flags, the --gres flag may also be passed to sbatch and salloc rather than directly to srun.
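As a sketch of the sbatch form, the script below requests one RTX A5000 through an #SBATCH directive and reports how many GPUs the job can see. It assumes the cluster's SLURM configuration sets CUDA_VISIBLE_DEVICES for GPU jobs, which is common but not confirmed on this page. The job name, the time limit, and the helper count_visible_gpus are illustrative only.

```shell
#!/usr/bin/bash
#SBATCH --job-name=gpu_check      # Illustrative job name
#SBATCH --gres=gpu:rtxa5000:1     # One RTX A5000, as in the srun example
#SBATCH --time=00:05:00           # Illustrative time limit hrs:min:sec

# Hypothetical helper: count the comma-separated entries in
# CUDA_VISIBLE_DEVICES (assumed to be set by SLURM for GPU jobs).
count_visible_gpus() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    local IFS=','
    # Split on commas into the positional parameters of this function
    set -- $devices
    echo "$#"
}

echo "GPUs visible to this job: $(count_visible_gpus)"

# Print the GPU table only where nvidia-smi is installed
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
fi
```

Saved to a file and submitted with sbatch, the script should report one visible GPU if the assumption about CUDA_VISIBLE_DEVICES holds.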

MPI example

To run MPI jobs, you will need to pass the --mpi=pmix flag to srun when launching your program.

#!/usr/bin/bash
#SBATCH --job-name=mpi_test      # Job name
#SBATCH --nodes=4                # Number of nodes
#SBATCH --ntasks=8               # Number of MPI ranks
#SBATCH --ntasks-per-node=2      # Number of MPI ranks per node
#SBATCH --ntasks-per-socket=1    # Number of tasks per processor socket on the node
#SBATCH --time=00:30:00          # Time limit hrs:min:sec

srun --mpi=pmix /nfshomes/username/testing/mpi/a.out
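
The directive values in the MPI script are meant to agree: 4 nodes with 2 ranks per node gives the 8 ranks requested by --ntasks. A small guard placed before the srun line can flag a layout that differs from this even distribution. This is a sketch only; it assumes SLURM exports SLURM_JOB_NUM_NODES, SLURM_NTASKS_PER_NODE, and SLURM_NTASKS into the batch environment when the corresponding options are given, and the function name check_layout is hypothetical.

```shell
#!/usr/bin/bash
# Hypothetical guard: compare nodes x tasks-per-node with the total task count.
check_layout() {
    local nodes="${SLURM_JOB_NUM_NODES:-0}"
    local per_node="${SLURM_NTASKS_PER_NODE:-0}"
    local total="${SLURM_NTASKS:-0}"

    # Without a task count we are probably not inside a SLURM job
    if [ "$total" -eq 0 ]; then
        echo "SLURM_NTASKS is not set; not inside a job allocation?" >&2
        return 1
    fi

    if [ "$((nodes * per_node))" -ne "$total" ]; then
        echo "layout mismatch: ${nodes} nodes x ${per_node} tasks/node != ${total} tasks" >&2
        return 1
    fi

    echo "layout ok: ${total} ranks on ${nodes} nodes"
}
```

When --ntasks is also given, SLURM treats --ntasks-per-node as a maximum, so a mismatch is not necessarily an error. The check only reports when the allocation differs from the even spread intended here, for example by calling check_layout || exit 1 before the srun line.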