SLURM: Difference between revisions

From UMIACS

Revision as of 15:00, 6 April 2022

Simple Linux Utility for Resource Management (SLURM)

SLURM is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Documentation

Submitting Jobs
Checking Job Status
Checking Cluster Status
Official Documentation
FAQ

Commands

Below are some of the common commands used in Slurm. Further information on how to use these commands is found in the documentation linked above. To see all flags available for a command, please check the command's manual by using man $COMMAND on the command line.

srun

srun runs a parallel job on a cluster managed by Slurm. If necessary, it will first create a resource allocation in which to run the parallel job.
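As a rough illustration of the description above, the sketch below wraps two common srun invocations in shell functions. The function names and the default task count and time limit are assumptions made for this example; your cluster may also require partition, account, or QoS options that are not shown here.

```shell
#!/bin/sh
# Hypothetical helper: run `hostname` as N parallel tasks (default 4).
# srun creates a resource allocation first if none exists yet.
run_hostname_tasks() {
    srun --ntasks="${1:-4}" hostname
}

# Hypothetical helper: open an interactive shell on a compute node.
# The 30-minute default time limit is only an illustrative value.
start_interactive_shell() {
    srun --ntasks=1 --time="${1:-00:30:00}" --pty bash
}

# Only call srun where Slurm is actually installed.
if command -v srun >/dev/null 2>&1; then
    run_hostname_tasks 2 || echo "srun failed; check partition/account requirements"
else
    echo "srun not found; run this on a Slurm submission node"
fi
```

Calling start_interactive_shell by hand from a submission node would then give you a shell on whichever compute node Slurm assigns.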

salloc

salloc obtains a Slurm job allocation, which is a set of resources (nodes), possibly with some set of constraints (e.g. number of processors per node). When salloc successfully obtains the requested allocation, it runs the command specified by the user. When the user-specified command completes, salloc relinquishes the job allocation. If no command is specified, salloc runs the user's default shell.

sbatch

sbatch submits a batch script to Slurm. The batch script may be given to sbatch through a file name on the command line, or if no file name is specified, sbatch will read in a script from standard input. The batch script may contain options preceded with "#SBATCH" before any executable commands in the script.
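A minimal sketch of such a batch script follows. The file name hello_job.sh and every resource value are assumptions chosen for illustration, and your cluster may require further options (for example a partition or account) that are not shown.

```shell
#!/bin/sh
# Write a minimal batch script to disk; all values are illustrative.
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --mem=1024
#SBATCH --output=hello-%j.out

# Executable commands start here; sbatch stops reading #SBATCH
# options once it reaches the first executable command.
echo "Running on $(hostname)"
EOF

# Submit only where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello_job.sh || echo "submission failed; check required options"
else
    echo "sbatch not found; script written to hello_job.sh"
fi
```

In the output file name, %j is replaced by the job ID, so each run writes to its own file.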

squeue

squeue views job and job step information for jobs managed by Slurm.

scancel

scancel signals or cancels jobs, job arrays, or job steps. An arbitrary number of jobs or job steps may be signaled using job specification filters or a space-separated list of specific job and/or job step IDs.
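squeue and scancel are often used together with the job ID that sbatch prints on submission. The sketch below assumes sbatch's usual "Submitted batch job <id>" message; the helper name extract_job_id and the job ID 12345 are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: pull the numeric job ID out of sbatch's
# "Submitted batch job <id>" message read from standard input.
extract_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Demonstrate with a captured message (12345 is a made-up job ID).
job_id=$(echo "Submitted batch job 12345" | extract_job_id)
echo "job id: ${job_id}"

# On a Slurm submission node, list your own pending and running jobs.
if command -v squeue >/dev/null 2>&1; then
    squeue --user="${USER:-$(id -un)}" || echo "squeue failed"
    # To inspect or cancel a single job, one could then run:
    #   squeue --jobs="${job_id}"
    #   scancel "${job_id}"
fi
```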

sacct

sacct displays job accounting data stored in the job accounting log file or Slurm database in a variety of forms for your analysis. The sacct command displays information on jobs, job steps, status, and exit codes by default. You can tailor the output with the use of the --format= option to specify the fields to be shown.

sstat

sstat displays job status information for your analysis. The sstat command displays information pertaining to CPU, Task, Node, Resident Set Size (RSS) and Virtual Memory (VM). You can tailor the output with the use of the --fields= option to specify the fields to be shown.
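To make the field-selection options for sacct and sstat concrete, here is a small sketch. The field lists and function names are assumptions picked for illustration; run `sacct --helpformat` or `sstat --helpformat` to see the fields your installation offers.

```shell
#!/bin/sh
# Illustrative field selections; adjust them to what you need.
SACCT_FIELDS="JobID,JobName,State,ExitCode,Elapsed,MaxRSS"
SSTAT_FIELDS="JobID,AveCPU,MaxRSS,MaxVMSize"

# Hypothetical helper: accounting data for a job (finished or running).
show_job_accounting() {
    sacct --jobs="$1" --format="${SACCT_FIELDS}"
}

# Hypothetical helper: live usage of a running job. Depending on the
# job, you may need to name a step explicitly, such as "<job_id>.batch".
show_job_usage() {
    sstat --jobs="$1" --fields="${SSTAT_FIELDS}"
}

echo "sacct fields: ${SACCT_FIELDS}"
echo "sstat fields: ${SSTAT_FIELDS}"
```

With a real job ID, `show_job_accounting 12345` would print one row per job step with the chosen columns.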

Modules

If you are trying to use GNU Modules in a Slurm job, please read the section of our Modules documentation on non-interactive shell sessions. The same steps are also needed if the OS version of the compute node your job is scheduled on differs from the OS version of the submission node you submit the job from.

Running Jupyter Notebook on a Compute Node

The steps to run a Jupyter Notebook from a compute node are listed below.

Setting up Python Virtual Environment

To set up your Python virtual environment, first follow the steps listed here to create a Python virtual environment on the compute node you are assigned. Then, activate it using the steps listed here. Next, install Jupyter using pip by following the steps here.

Running Jupyter Notebook

After you've set up the Python virtual environment, run the following command on the compute node you are assigned:

jupyter notebook --no-browser --port=8889 --ip=0.0.0.0

This will start running the notebook on port 8889. Note: You must keep this shell window open to be able to connect. Then, on your local machine, run:

ssh -N -f -L localhost:8888:$(NODENAME):8889 $(USERNAME)@$(SUBMISSIONNODE).umiacs.umd.edu

This will tunnel port 8889 from the compute node to port 8888 on your local machine, using the submission node as an intermediate node. $(NODENAME), $(USERNAME), and $(SUBMISSIONNODE) are placeholders to replace by hand, not shell syntax; typed literally, a shell would treat $(...) as command substitution. Replace $(NODENAME) with the name of the compute node you are assigned, $(USERNAME) with your username, and $(SUBMISSIONNODE) with the name of the submission node you want to use. For example, the last argument might become username@opensub02.umiacs.umd.edu. You can then open a web browser and type in localhost:8888 to access the notebook. Note: You must be on a machine connected to the UMIACS network or connected to our VPN in order to access the Jupyter notebook.

  • If the port on the compute node mentioned in the example above (8889) is not working, it may be that someone else has already started a notebook using that port. You can replace it with any other free port; just make sure to change it in both the command you run on the compute node and the ssh command you run from your local machine.
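One way to keep the two ports consistent is to hold them in shell variables on your local machine. The sketch below is only an illustration: the node name computenode00 is hypothetical, username is a stand-in for your own account, and the function name open_tunnel is made up for this example.

```shell
#!/bin/sh
# Placeholder values; replace them with your own.
NODENAME="computenode00"      # hypothetical compute node name
USERNAME="username"           # your UMIACS username
SUBMISSIONNODE="opensub02"    # submission node used as the intermediate hop
REMOTE_PORT=8889              # must match --port= given to jupyter notebook
LOCAL_PORT=8888               # port you will open in your local browser

# The -L argument: local address and port, then remote host and port.
FORWARD="localhost:${LOCAL_PORT}:${NODENAME}:${REMOTE_PORT}"

# Hypothetical helper: open the tunnel in the background.
open_tunnel() {
    ssh -N -f -L "${FORWARD}" "${USERNAME}@${SUBMISSIONNODE}.umiacs.umd.edu"
}

# Print what would be run so it can be checked before connecting.
echo "ssh -N -f -L ${FORWARD} ${USERNAME}@${SUBMISSIONNODE}.umiacs.umd.edu"
echo "then browse to http://localhost:${LOCAL_PORT}"
```

If you change REMOTE_PORT here, start Jupyter on the compute node with the same value in its --port= option.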

Quick Guide to translate PBS/Torque to SLURM

User commands

Action                    PBS/Torque           SLURM
Job submission            qsub [filename]      sbatch [filename]
Job deletion              qdel [job_id]        scancel [job_id]
Job status (by job)       qstat [job_id]       squeue --job [job_id]
Full job status (by job)  qstat -f [job_id]    scontrol show job [job_id]
Job status (by user)      qstat -u [username]  squeue --user=[username]

Environment variables

Variable          PBS/Torque      SLURM
Job ID            $PBS_JOBID      $SLURM_JOBID
Submit Directory  $PBS_O_WORKDIR  $SLURM_SUBMIT_DIR
Node List         $PBS_NODEFILE   $SLURM_JOB_NODELIST

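The Slurm environment variables can be read from a job script roughly as sketched below. The fallback values after ":-" are assumptions added only so the snippet also runs outside a job; inside a job Slurm sets the variables itself. Note that $PBS_NODEFILE is a path to a file, whereas $SLURM_JOB_NODELIST holds a compact host list string that scontrol can expand.

```shell
#!/bin/sh
# Inside a job these come from Slurm; the defaults are for illustration.
job_id="${SLURM_JOB_ID:-${SLURM_JOBID:-none}}"
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
node_list="${SLURM_JOB_NODELIST:-$(uname -n)}"

echo "Job ID:           ${job_id}"
echo "Submit directory: ${submit_dir}"
echo "Node list:        ${node_list}"

# Expand the compact host list into one hostname per line.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show hostnames "${node_list}" || true
fi
```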
Job specification

Setting                PBS/Torque                   SLURM
Script directive       #PBS                         #SBATCH
Job Name               -N [name]                    --job-name=[name] OR -J [name]
Node Count             -l nodes=[count]             --nodes=[min[-max]] OR -N [min[-max]]
CPU Count              -l ppn=[count]               --ntasks-per-node=[count]
CPUs Per Task          (none listed)                --cpus-per-task=[count]
Memory Size            -l mem=[MB]                  --mem=[MB] OR --mem-per-cpu=[MB]
Wall Clock Limit       -l walltime=[hh:mm:ss]       --time=[min] OR --time=[days-hh:mm:ss]
Node Properties        -l nodes=4:ppn=8:[property]  --constraint=[list]
Standard Output File   -o [file_name]               --output=[file_name] OR -o [file_name]
Standard Error File    -e [file_name]               --error=[file_name] OR -e [file_name]
Combine stdout/stderr  -j oe (both to stdout)       (Default if you don't specify --error)
Job Arrays             -t [array_spec]              --array=[array_spec] OR -a [array_spec]
Delay Job Start        -a [time]                    --begin=[time]
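
As a rough aid when porting scripts, a few of the one-to-one directive mappings can be applied mechanically with sed. The function name pbs_to_slurm is made up for this sketch, and it deliberately covers only five simple cases (job name, wall clock limit, memory given as a plain number of MB, and the output and error files); anything else, such as -l nodes=...:ppn=..., is left unchanged and still needs to be converted by hand.

```shell
#!/bin/sh
# Hypothetical helper: rewrite a handful of #PBS directives read from
# standard input into their #SBATCH equivalents. Other lines pass through.
pbs_to_slurm() {
    sed -E \
        -e 's/^#PBS -N (.+)$/#SBATCH --job-name=\1/' \
        -e 's/^#PBS -l walltime=(.+)$/#SBATCH --time=\1/' \
        -e 's/^#PBS -l mem=([0-9]+)$/#SBATCH --mem=\1/' \
        -e 's/^#PBS -o (.+)$/#SBATCH --output=\1/' \
        -e 's/^#PBS -e (.+)$/#SBATCH --error=\1/'
}

# Example: convert the header of a small, made-up PBS script.
pbs_to_slurm <<'EOF'
#!/bin/bash
#PBS -N demo
#PBS -l walltime=01:00:00
#PBS -l mem=2048
#PBS -o demo.out
echo "hello"
EOF
```

Always review the converted script before submitting it, since options that look alike do not necessarily behave identically in both schedulers.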