SLURM/JobSubmission

From UMIACS
Revision as of 01:40, 11 July 2016 by Tgray26

Job Submission

SLURM offers a variety of ways to run jobs. To make your job run successfully, you need to understand the available options and how to request the resources your job requires.

Batch Jobs

Batch processing runs a series of jobs non-interactively, without manual intervention. To submit a batch job, create a submission script like the following:

#!/bin/bash

# Lines that begin with #SBATCH specify options that SLURM uses to schedule the job

#SBATCH --job-name=helloWorld                       # sets the job name
#SBATCH --output helloWorld.out.%j                  # indicates a file to redirect STDOUT to; %j is the jobid 
#SBATCH --error helloWorld.err.%j                   # indicates a file to redirect STDERR to; %j is the jobid
#SBATCH --time=00:05:00                             # wall-clock time limit; the job is terminated if it runs longer; format=hh:mm:ss
#SBATCH --qos=default                               # set the QOS, which determines what resources can be requested
#SBATCH --nodes=2                                   # number of nodes to allocate for your job
#SBATCH --mem 1gb                                   # memory required per node; if no unit is given, MB is assumed

module load Python/2.7.9                            # run any commands needed to set up your environment

srun -N 1 bash -c "hostname; python --version" &    # use srun to run each command as a job step; the trailing '&'
srun -N 1 bash -c "hostname; python --version" &    # backgrounds each step so that both run concurrently
wait                                                # wait for any background processes to complete

# when the end of the batch script is reached, the job allocation is released
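In bash, a bare `wait` with no arguments always returns 0, so the script above reports success even if one of its job steps fails. The minimal sketch below instead waits on each background step by PID and counts failures. The function `run_steps` and the `SRUN` override variable are hypothetical names introduced only for illustration; `SRUN` defaults to the real `srun` and exists so the logic can be dry-run without SLURM.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): launch each command as its own
# single-node job step, then wait on every step and count failures.
# SRUN can be overridden (e.g. for a dry run); it defaults to the real srun.
SRUN=${SRUN:-srun}

run_steps() {
  local cmd pid failed=0
  local pids=()
  for cmd in "$@"; do
    "$SRUN" -N 1 bash -c "$cmd" &   # background each step so they run concurrently
    pids+=("$!")                    # remember the PID of the backgrounded step
  done
  for pid in "${pids[@]}"; do
    # wait with a PID returns that process's exit status, unlike a bare wait
    wait "$pid" || failed=$((failed + 1))
  done
  echo "${failed} step(s) failed"
  [ "$failed" -eq 0 ]               # non-zero return status if any step failed
}

# Example use inside a batch script, after the #SBATCH lines:
# run_steps "hostname; python --version" "hostname; python --version" || exit 1
```

Returning a non-zero status from the batch script lets SLURM record the job as failed rather than completed.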

If your script is named batchScript.sh, you can submit it by running:

tgray26@shadosub$ sbatch batchScript.sh
Submitted batch job 121
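When scripting submissions, it is handy to capture the job ID for later commands. sbatch has a `--parsable` option that prints only the job ID (followed by `;cluster` when a cluster name is present), which avoids parsing entirely. If you are working with the human-readable message shown above instead, a minimal sketch is to match it with a regular expression. The function name `job_id_from_sbatch` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from sbatch's
# "Submitted batch job <id>" message.
job_id_from_sbatch() {
  local msg=$1
  local re='Submitted batch job ([0-9]+)'
  if [[ $msg =~ $re ]]; then
    echo "${BASH_REMATCH[1]}"       # first capture group: the job ID
  else
    echo "unexpected sbatch output: $msg" >&2
    return 1
  fi
}

# Usage on the submit node (requires SLURM):
#   job_id=$(job_id_from_sbatch "$(sbatch batchScript.sh)")
#   squeue -j "$job_id"
```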

SLURM will return a job number that you can use to check the status of your job with squeue:

tgray26@shadosub$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               121     test2 helloWor  tgray26  R       0:01      2 shado[00-01]
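To block a script until a submitted job has finished, one option is to poll squeue for that job. With `-h` (no header) and `-j <jobid>`, squeue prints nothing once the job has left the queue. The sketch below assumes that behavior. `wait_for_job` and the `SQUEUE` override variable are hypothetical names, and the 30-second default interval is an arbitrary choice meant to avoid loading the scheduler.

```shell
#!/bin/bash
# Hypothetical helper: poll squeue until a job no longer appears in the queue.
# SQUEUE can be overridden for testing; it defaults to the real squeue.
SQUEUE=${SQUEUE:-squeue}

wait_for_job() {
  local job_id=$1 interval=${2:-30}   # poll interval in seconds (default 30)
  # -h suppresses the header and -j restricts output to this job, so any
  # output at all means the job is still pending or running.
  while [ -n "$("$SQUEUE" -h -j "$job_id" 2>/dev/null)" ]; do
    sleep "$interval"
  done
  echo "job ${job_id} has left the queue"
}

# Usage on the submit node: wait_for_job 121 60
```

Because errors from squeue are discarded, a temporary failure to reach the scheduler would also end the loop. Treat this as a convenience, not a guarantee that the job completed successfully.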

Interactive Jobs

An interactive session can be useful for debugging or developing code that isn't ready to be run as a batch job. To get an interactive shell on a node, use srun to invoke a shell:

tgray26@shadosub:srun --pty --mem 1gb --time=01:00:00 bash
tgray26@shado00:

Please do not leave interactive shells running for long periods when you are not actively using them. An idle session still holds its allocated resources and blocks everyone else from using them.
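Always passing a short `--time` limit is one way to guarantee that a forgotten interactive session is eventually released. The sketch below wraps the srun command shown above with default values. The function name `interactive`, its positional arguments, and the `SRUN` override (set it to `echo` to print the command instead of running it) are hypothetical.

```shell
#!/bin/bash
# Hypothetical convenience wrapper (not part of SLURM): start an interactive
# shell with a short default time limit so a forgotten session is released.
# SRUN can be overridden (e.g. SRUN=echo for a dry run).
SRUN=${SRUN:-srun}

interactive() {
  local time_limit=${1:-01:00:00}   # wall-clock limit, hh:mm:ss
  local mem=${2:-1gb}               # memory per node
  "$SRUN" --pty --mem "$mem" --time="$time_limit" bash
}

# Usage: interactive            -> 1 hour, 1gb
#        interactive 00:30:00 2gb
```

Typing `exit` in the shell ends the job early and frees its resources immediately.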

salloc

The salloc command can also be used to request resources without a batch script. Running salloc with a list of resources allocates the resources you requested, creates a job, and drops you into a subshell with the environment variables needed to run commands in the new job allocation. When your time limit expires or you exit the subshell, the job allocation is relinquished.

tgray26@shadosub:salloc -N 1 --mem=2gb --time=01:00:00
salloc: Granted job allocation 159
tgray26@shadosub:srun /usr/bin/hostname
shado00.umiacs.umd.edu
tgray26@shadosub:exit
exit
salloc: Relinquishing job allocation 159

Note that any command not launched with srun runs locally on the submit node, not on the allocated compute node(s), so be careful when using salloc.
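Inside the salloc subshell, SLURM sets variables such as `SLURM_JOB_ID` and `SLURM_JOB_NODELIST`, but the shell itself is still running on the submit node. In the example above, it would report job 159 on shado00 while the prompt is on shadosub. A minimal sketch for checking this before running anything heavy follows. The function name `describe_allocation` is hypothetical, and it relies on bash's built-in `HOSTNAME` variable.

```shell
#!/bin/bash
# Hypothetical helper: report whether this shell is inside a SLURM job
# allocation, and where the shell itself is actually running.
describe_allocation() {
  if [ -z "${SLURM_JOB_ID:-}" ]; then
    echo "not inside a SLURM allocation"
    return 1
  fi
  # SLURM_JOB_NODELIST lists the allocated nodes; HOSTNAME is this shell's host.
  echo "job ${SLURM_JOB_ID} on ${SLURM_JOB_NODELIST:-unknown nodes}; this shell runs on ${HOSTNAME}"
}
```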

srun

If you only have one command to run, you can use srun on its own, requesting resources with the appropriate flags. SLURM allocates a job with the requested resources, runs the command concurrently on all allocated nodes, and then relinquishes the job. By default, stdout and stderr from all tasks are redirected to srun's own stdout and stderr, and anything sent to srun's stdin is forwarded to all tasks. This behavior can be changed with the --output, --error, and --input flags.

tgray26@shadosub:srun -N2 --mem=100mb --time=00:01:00 /usr/bin/hostname
shado00.umiacs.umd.edu
shado01.umiacs.umd.edu
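If the combined output from several nodes is hard to read, srun's --output flag accepts filename patterns. `%N` expands to the short hostname, which should give one output file per node. The sketch below assumes that pattern and reuses the resource flags from the example above. The wrapper name `run_per_node` and the `SRUN` override (use `SRUN=echo` for a dry run) are hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper: run one command on N nodes and write each node's
# stdout to its own file (srun expands %N to the node's short hostname).
SRUN=${SRUN:-srun}

run_per_node() {
  local nodes=$1 prefix=$2
  shift 2                            # remaining arguments are the command
  "$SRUN" -N "$nodes" --mem=100mb --time=00:01:00 --output="${prefix}.%N.out" "$@"
}

# Usage: run_per_node 2 hostname /usr/bin/hostname
# would be expected to create hostname.shado00.out and hostname.shado01.out
```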