SLURM/JobSubmission: Difference between revisions

Revision as of 19:02, 13 July 2016

Job Submission

SLURM offers a variety of ways to run jobs. It is important to understand the different options available and how to request the resources required for a job in order for it to run successfully. All job submission should be done from submit nodes; any computational code should be run in a job allocation on compute nodes. The following commands outline how to allocate resources on the compute nodes and submit processes to be run on the allocated nodes.

srun

srun is the command used to run a process on the compute nodes in the cluster. You pass it a command (this could be a script), which is run on a compute node; srun returns when the command finishes. srun accepts many command line options to specify the resources required by the command passed to it. Some common arguments are listed below; full documentation of all available options is in the man page, which can be accessed by running man srun.

tgray26@opensub01:srun --mem=100mb --time=1:00:00 bash -c 'echo "Hello World from" `hostname`'
Hello World from openlab06.umiacs.umd.edu

It is important to understand that srun is an interactive command. By default, input to srun is broadcast to all compute nodes running your process, and output from the compute nodes is redirected to srun. This behavior can be changed; however, srun will always wait for the command passed to it to finish before exiting. If you start a long-running process and end your terminal session, your process will stop running on the compute nodes and your job will end. To run a non-interactive job that you can submit to the cluster and that will keep running after you log out, you will need to wrap your srun commands in a batch script and submit it with sbatch.

Common srun arguments

--mem=1gb (if no unit is given MB is assumed)
--nodes=2 (the given command will be run concurrently on each node)
--qos=dpart
--time=hh:mm:ss (time needed to run your job)
--job-name=helloWorld
--output filename (file to redirect stdout to)
--error filename (file to redirect stderr to)
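
As a rough sketch of how several of these arguments combine, the snippet below collects them in a bash array and prints the resulting srun command. The values are illustrative only, and RUN_ON_CLUSTER is a hypothetical variable introduced here so that the command is only executed when you explicitly ask for it on a submit node.

```shell
#!/bin/bash
# Sketch: collect srun options in an array so each one can be commented.
# The values are illustrative; dpart is the QOS used elsewhere on this page.
srun_args=(
    --qos=dpart             # QOS to run under
    --nodes=2               # run the command concurrently on two nodes
    --mem=1gb               # memory required
    --time=00:10:00         # time needed, hh:mm:ss
    --job-name=helloWorld   # name shown by squeue
)

cmd=(srun "${srun_args[@]}" hostname)

# Always print the assembled command.
echo "${cmd[*]}"

# RUN_ON_CLUSTER is a hypothetical switch for this sketch; set it to 1
# on a submit node to actually run the command.
if [ "${RUN_ON_CLUSTER:-0}" = 1 ]; then
    "${cmd[@]}"
fi
```

Keeping the options in an array avoids quoting problems when an argument contains spaces.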

Interactive Shell Sessions

An interactive shell session on a compute node can be useful for debugging or developing code that isn't ready to be run as a batch job. To get an interactive shell on a node, use srun to invoke a shell:

tgray26@opensub01:srun --pty --mem 1gb --time=01:00:00 bash
tgray26@openlab06:

Please do not leave interactive shells running for long periods of time when you are not working. This blocks resources from being used by everyone else.

salloc

The salloc command can also be used to request resources be allocated without needing a batch script. Running salloc with a list of resources will allocate the resources you requested, create a job, and drop you into a subshell with the environment variables necessary to run commands in the newly created job allocation. When your time is up or you exit the subshell, your job allocation will be relinquished.

tgray26@opensub00:salloc -N 1 --mem=2gb --time=01:00:00
salloc: Granted job allocation 159
tgray26@opensub00:srun /usr/bin/hostname
openlab00.umiacs.umd.edu
tgray26@opensub00:exit
exit
salloc: Relinquishing job allocation 159

Please note that any commands not invoked with srun will be run locally on the submit node. Please be careful when using salloc.
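
One way to guard against accidentally running work on the submit node is to check whether the shell is inside a job allocation. The sketch below assumes the SLURM_JOB_ID environment variable, which SLURM sets inside an allocation; the in_allocation helper is a hypothetical name used only for illustration.

```shell
#!/bin/bash
# Succeeds only when the shell is inside a SLURM job allocation.
# Assumption: SLURM_JOB_ID is exported into the subshell started by salloc.
in_allocation() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_allocation; then
    echo "inside job allocation $SLURM_JOB_ID; use srun to reach the compute nodes"
else
    echo "no job allocation; commands run locally on ${HOSTNAME:-this host}"
fi
```

Even inside an allocation, only commands started with srun run on the compute nodes, so this check is a reminder, not a substitute for srun.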

sbatch

The sbatch command allows you to write a batch script to be submitted and run non-interactively on the compute nodes. To run a simple Hello World command on the compute nodes, you could write a file, helloWorld.sh, with the following contents:

#!/bin/bash

srun bash -c 'echo Hello World from `hostname`'
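
If you prefer to create the file from the command line, a here-document is one option. This is only a sketch: it writes helloWorld.sh into the current directory and then uses bash -n to parse the script without executing it, which can catch syntax errors before the job is queued.

```shell
#!/bin/bash
# Write the batch script shown above into the current directory.
# The quoted 'EOF' keeps the backticks from being expanded at this point.
cat > helloWorld.sh <<'EOF'
#!/bin/bash

srun bash -c 'echo Hello World from `hostname`'
EOF

# Parse the script without executing it; a syntax error gives a non-zero status.
if bash -n helloWorld.sh; then
    echo "helloWorld.sh parses cleanly"
fi
```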

Then you need to submit the script with sbatch and request resources:

tgray26@opensub00:sbatch --mem=1gb --time=1:00:00 helloWorld.sh
Submitted batch job 121

SLURM will return a job number that you can use to check the status of your job with squeue:

tgray26@opensub00:squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               121     dpart helloWor  tgray26  R       0:01      2 openlab[00-01]
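
When submitting from a script, it can help to capture the job number instead of reading it off the screen. The sketch below parses the "Submitted batch job N" message shown above; job_id_from is a hypothetical helper name, and the sample message is copied from the example rather than produced by a live sbatch call.

```shell
#!/bin/bash
# Extract the numeric job ID from sbatch's "Submitted batch job N" message.
job_id_from() {
    local message=$1
    echo "${message##* }"    # keep everything after the last space
}

message="Submitted batch job 121"    # sample text copied from the example above
job_id=$(job_id_from "$message")
echo "job id: $job_id"

# On a submit node the ID could then be used to show only that job:
#   squeue -j "$job_id"
```

On a real submit node, message would instead come from something like message=$(sbatch helloWorld.sh).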

Advanced Batch Scripts

You can also write a batch script with all of your resources and options defined in the script itself. This is useful for jobs that need to be run tens, hundreds, or thousands of times. You can then handle any necessary environment setup and run commands on the resources you requested by invoking them with srun. The srun commands can also be more complex and be told to use only a portion of your job allocation; each distinct srun command makes up one "job step". The batch script itself runs on the first node allocated to your job, and each job step runs on whatever resources you tell it to use.

In the following example, I have a batch job that requests 2 nodes in the cluster. I then load a specific version of Python into my environment and submit two job steps, each using one node. Since srun blocks until its command finishes, I use the '&' operator to background each step so that both can run at once; this means I then need to use the wait command to block until all background processes have finished.

#!/bin/bash

# Lines that begin with #SBATCH specify commands to be used by SLURM for scheduling

#SBATCH --job-name=helloWorld                                   # sets the job name
#SBATCH --output helloWorld.out.%j                              # indicates a file to redirect STDOUT to; %j is the jobid 
#SBATCH --error helloWorld.out.%j                               # indicates a file to redirect STDERR to; %j is the jobid
#SBATCH --time=00:05:00                                         # how long you think your job will take to complete; format=hh:mm:ss
#SBATCH --qos=dpart                                             # set QOS, this will determine what resources can be requested
#SBATCH --nodes=2                                               # number of nodes to allocate for your job
#SBATCH --mem 1gb                                               # memory required by job; if unit is not specified MB will be assumed

module load Python/2.7.9                                        # run any commands necessary to setup your environment

srun -N 1 --mem=512mb bash -c "hostname; python --version" &    # use srun to invoke commands within your job; using an '&'
srun -N 1 --mem=512mb bash -c "hostname; python --version" &    # will background the process allowing them to run concurrently
wait                                                            # wait for any background processes to complete

# once the end of the batch script is reached your job allocation will be revoked
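
A bare wait returns a zero status no matter how the background processes exited, so a failed job step can go unnoticed. The sketch below shows one possible way to check each step by waiting on its process ID. To keep it runnable without a cluster, the hypothetical step function stands in for the srun lines, and its second argument is an exit status chosen purely for illustration.

```shell
#!/bin/bash
# Stand-in for a job step; in a real batch script this would be an srun line.
step() {
    sleep 1
    echo "step $1 finished"
    return "$2"    # illustrative exit status, not something srun takes
}

step 1 0 &    # succeeds
pid1=$!
step 2 3 &    # fails with status 3
pid2=$!

# Waiting on a specific PID returns that process's exit status.
failed=0
wait "$pid1" || failed=$((failed + 1))
wait "$pid2" || failed=$((failed + 1))

echo "failed steps: $failed"
```

Both steps still run concurrently; the only change from the script above is that each background process is waited on individually.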

More Examples

More examples of how to use batch scripts to set up your environment for processing will be coming soon.