Job Submission
SLURM offers a variety of ways to run jobs. It is important to understand the different options available and how to request the resources required for a job in order for it to run successfully. All job submission should be done from submit nodes; any computational code should be run in a job allocation on compute nodes. The following commands outline how to allocate resources on the compute nodes and submit processes to be run on the allocated nodes.
srun
Interactive Jobs
An interactive session can be useful for debugging or developing code that isn't ready to be run as a batch job. To get an interactive shell on a node, use srun to invoke a shell:
tgray26@opensub00:srun --pty --mem 1gb --time=01:00:00 bash
tgray26@openlab00:
Please do not leave interactive shells running for long periods of time when you are not working; idle shells block resources from being used by everyone else.
Batch Jobs
If you only have one command to run or just want to run a script, you can use the srun command. By default, all output from the compute nodes is redirected to srun's stdout, and any input given to srun's stdin is broadcast to all allocated compute nodes. This behavior can be changed with the --output, --error, and --input flags (if --output is defined and --error is not, both streams are redirected to --output). To request resources and set up mail or other options, pass the corresponding command-line options to srun. Some common options are:
- --mem=1gb (if no unit is given MB is assumed)
- --nodes=2 (the given command will be run concurrently on each node)
- --qos=dpart
- --time=hh:mm:ss (time needed to run your job)
- --job-name=helloWorld
- --output filename (file to redirect stdout to)
- --error filename (file to redirect stderr)
tgray26@opensub00:srun --nodes=2 --mem=100mb --time=00:01:00 /usr/bin/hostname
openlab00.umiacs.umd.edu
openlab01.umiacs.umd.edu
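The --time option above expects hh:mm:ss, which is easy to get wrong when you think of a run time in seconds or minutes. The following is a minimal sketch of a conversion helper; to_slurm_time is a hypothetical name and not part of SLURM, and the sketch is only intended for durations shorter than a day.

```shell
# Hypothetical helper (not provided by SLURM): convert a number of
# seconds into the hh:mm:ss form used by --time.
to_slurm_time() {
    total=$1
    printf '%02d:%02d:%02d\n' \
        $((total / 3600)) \
        $(((total % 3600) / 60)) \
        $((total % 60))
}

# Example use on a submit node (left commented out so that the snippet
# can be sourced without submitting anything):
# srun --nodes=1 --mem=100mb --time="$(to_slurm_time 5400)" /usr/bin/hostname
```

With this helper, 5400 seconds becomes 01:30:00, so the commented command would request ninety minutes.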
salloc
The salloc command can also be used to request that resources be allocated without needing a batch script. Running salloc with a list of resources will allocate the resources you requested, create a job, and drop you into a subshell with the environment variables necessary to run commands in the newly created job allocation. When your time is up or you exit the subshell, your job allocation will be relinquished.
tgray26@opensub00:salloc -N 1 --mem=2gb --time=01:00:00
salloc: Granted job allocation 159
tgray26@opensub00:srun /usr/bin/hostname
openlab00.umiacs.umd.edu
tgray26@opensub00:exit
exit
salloc: Relinquishing job allocation 159
Please note that any commands not invoked with srun will be run locally on the submit node. Please be careful when using salloc.
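One way to reduce the risk of accidentally running work on the submit node is to wrap commands in a small guard. The sketch below assumes that SLURM sets the SLURM_JOB_ID environment variable inside the salloc subshell; run_in_alloc is a hypothetical name chosen for illustration, not a SLURM command.

```shell
# Hypothetical wrapper: only launch a command through srun when a job
# allocation is active, so it never runs directly on the submit node.
run_in_alloc() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "run_in_alloc: no job allocation found; run salloc first" >&2
        return 1
    fi
    # Inside an allocation: hand the command to srun so it runs on the
    # allocated compute node(s).
    srun "$@"
}

# Example use inside an salloc subshell (commented out here):
# run_in_alloc /usr/bin/hostname
```

Outside an allocation the wrapper prints a message and returns a non-zero status instead of running anything.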
sbatch
The sbatch command allows you to write a batch script with all of your resources/options defined in the script itself. You can then handle any necessary environment setup and then run commands on the resources you requested by invoking commands with srun. The batch script will be run on the first node allocated as part of your job allocation.
#!/bin/bash
# Lines that begin with #SBATCH specify commands to be used by SLURM for scheduling

#SBATCH --job-name=helloWorld       # sets the job name
#SBATCH --output helloWorld.out.%j  # indicates a file to redirect STDOUT to; %j is the jobid
#SBATCH --error helloWorld.err.%j   # indicates a file to redirect STDERR to; %j is the jobid
#SBATCH --time=00:05:00             # how long you think your job will take to complete; format=hh:mm:ss
#SBATCH --qos=default               # set QOS, this will determine what resources can be requested
#SBATCH --nodes=2                   # number of nodes to allocate for your job
#SBATCH --mem 1gb                   # memory required by job; if unit is not specified MB will be assumed

module load Python/2.7.9            # run any commands necessary to setup your environment

srun -N 1 bash -c "hostname; python --version" &   # use srun to invoke commands within your job; using an '&'
srun -N 1 bash -c "hostname; python --version" &   # will background the process allowing them to run concurrently
wait                                               # wait for any background processes to complete

# once the end of the batch script is reached your job allocation will be revoked
If your script were named batchScript.sh, you could submit it by running:
tgray26@opensub00:sbatch batchScript.sh
Submitted batch job 121
SLURM will return a job number that you can use to check the status of your job with squeue:
tgray26@opensub00:squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  121     test2 helloWor  tgray26  R       0:01      2 openlab[00-01]
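When submitting from a script, it can help to capture the job number rather than copy it by hand. The sketch below assumes sbatch prints a confirmation line of the form shown above ("Submitted batch job 121"); parse_job_id is a hypothetical helper, and the output may look different on clusters configured in other ways.

```shell
# Hypothetical helper: extract the job number from sbatch's
# confirmation line by keeping its last whitespace-separated field.
parse_job_id() {
    echo "$1" | awk '{ print $NF }'
}

# Example use on a submit node (commented out so that nothing is
# submitted when this snippet is sourced):
# jobid=$(parse_job_id "$(sbatch batchScript.sh)")
# squeue -j "$jobid"                 # show only this job
# cat "helloWorld.out.$jobid"        # %j in the script expands to the jobid
```

sbatch also has a --parsable option that prints the job number without the surrounding text, which may be more robust than parsing the message.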