Torque
Note: refer to the Slurm wiki instead of this page for more current information.
Getting Started
Torque is a resource manager that works with a separate program, Maui, which provides the scheduling for the cluster. To get started, make sure that your SSH keys are set up for password-less SSH. In our Torque environments this is critical for delivering the error and output files of your jobs back to the host the job was submitted from. Please note that Fair Share is enabled in this setup, with a historical scope of 12 hours.
You can submit jobs from any CBCB workstation and from the cbcbsub00.umiacs.umd.edu and cbcbsub01.umiacs.umd.edu nodes.
Once that is set up, the following queues are available to users. Use qstat -Q -f to see the resource limits for each queue.
Red Hat 7 Queues
- default - default memory is 3GB (max 4GB), default walltime is 1 hour (max 1 hour), allows up to 16 jobs per user concurrently
- shell - interactive jobs only - default memory is 2GB (max 4GB), default walltime is 12 hours (max 2 weeks), allows up to 4 jobs per user concurrently - restricted to nodes with shell* property
- workstation - default memory is 4GB (max 47GB), default walltime is 8 hours (max 1 week), allows up to 4 jobs per user concurrently
- throughput - no interactive jobs - default memory is 4GB (max 36GB), default walltime is 4 hours (max 18 hours), allows up to 125 jobs per user concurrently - restricted to nodes with ibis* property
- high_throughput - no interactive jobs - default memory is 4GB (max 8GB), default walltime is 3 hours (max 6 hours), allows up to 300 jobs per user concurrently - restricted to nodes with ibis* property
- long - no interactive jobs - default memory is 12GB (max 12GB), default walltime is 8 hours (max 1 week), allows up to 16 jobs per user concurrently
- large - no interactive jobs - default memory is 32GB (max 120GB), default walltime is 24 hours (max 11 days), allows up to 3 jobs per user concurrently
- xlarge - default memory is 100GB (max is unlimited), default walltime is 1 week (max 3 weeks), allows 1 job per user at a time
- The xlarge queue is restricted to members of the group cbcbtorque. If you need to run large jobs, please send mail to staff@umiacs.umd.edu.
*You can list the nodes that have a specific property by running "pbsnodes :property", where property is the name of the property you want to see.
qsub
qsub is how you submit jobs to a Torque cluster. A job is a shell script given either on STDIN or as a file on the command line. The -l (lower case L) option lets you specify the resources your job needs. Although jobs are not always penalized for using more or fewer resources than they request, it is very important to request resources as accurately as possible so that Torque knows how many resources each machine has available when new jobs are scheduled. If your job uses more resources than you requested, another job may be scheduled on the same machine, which could run the machine out of resources, cause segfaults, and eventually bring the machine down. Likewise, if you request more resources than you need, you will slow down other users' jobs, because Torque may think a machine is at capacity when it is not.
To specify the queue that you would like to submit to, use the -q option,
qsub -q workstation
Use these options with the -l (lower case L) option to request resources:
- ncpus=4
- mem=32GB
- walltime=12:00:00
For a full list of job submission arguments, see Torque Job Submission Arguments.
As an example, to run the Perl script myscript.pl on 4 CPUs with 100GB of memory for 12 hours you could run the following,
qsub -q large -l ncpus=4,mem=100GB,walltime=12:00:00 myscript.pl
Note that the large queue was used in the above example because 100GB is more memory than the maximum allowed in any of the other queues except the restricted xlarge queue, while still fitting within the 120GB maximum of the large queue. By default, all of the other queues reserve approximately the maximum memory allowed for that queue, but you may set a lower reservation if you know you will not need the full amount.
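The same kind of request can also be written into the job script itself as #PBS directive lines, which qsub reads from the top of the script, so that submission reduces to "qsub myjob.sh". The following is a minimal sketch rather than a site-recommended template: the script name myjob.sh, the job name example_job, and the choice of queue are illustrative assumptions, and the resource values simply reuse the ncpus, mem, and walltime examples listed above (they fit within the workstation queue limits).

```shell
#!/bin/bash
# myjob.sh (hypothetical name): resource requests embedded as #PBS directives.
#PBS -N example_job
#PBS -q workstation
#PBS -l ncpus=4,mem=32GB,walltime=12:00:00

# Torque sets PBS_O_WORKDIR to the directory where qsub was run.
# Fall back to the current directory so the script also runs outside Torque.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

echo "Job started in $workdir on $(uname -n)"
```

Options given on the qsub command line take precedence over the directives in the script, so the file can hold sensible defaults that you override for individual runs.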
Once you have submitted your job for execution you will get something back in the form of,
<JOBID>.<PBSSERVER>
You can use that <JOBID> to delete or find your job later if there is a problem.
When a job finishes, Torque/PBS deposits the standard output and standard error as
- <jobname>.o<number>
- <jobname>.e<number>
Here <jobname> is the name of the script you submitted (or STDIN if the script came from qsub's standard input), and <number> is the leading number in the job id.
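As a small illustration of this naming rule, the sketch below derives the two file names from a job id. The job id and script name are hypothetical sample values; on a submit host you would normally capture the real id from qsub instead.

```shell
# Hypothetical values: in practice, capture the id with job_id=$(qsub myscript.pl)
job_id="135.cbcbtorque"
job_name="myscript.pl"

# Strip everything from the first dot onward to get the leading number.
job_number="${job_id%%.*}"

stdout_file="${job_name}.o${job_number}"
stderr_file="${job_name}.e${job_number}"

echo "stdout: $stdout_file"
echo "stderr: $stderr_file"
```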
Interactive Jobs
Interactive jobs allow you to schedule interactive shell access on Torque-scheduled compute nodes. You can get an interactive session with the -I (upper case i) option,
qsub -I
Please note that only the "workstation", "shell", and "default" queues allow interactive jobs. If you require larger resource allocations than the queue defaults, the -l (lower case L) flag still applies.
Array Jobs
Array jobs let you submit the same script multiple times, each with a different setting for the environment variable PBS_ARRAYID:
qsub -q throughput -t 0-999 my_script.sh
Torque will run 1000 instances of my_script.sh with the environment variable PBS_ARRAYID set to the range of values specified by the -t argument. In this case my_script.sh will be executed once with PBS_ARRAYID=0, again with PBS_ARRAYID=1, etc.
You can also specify comma-separated values for PBS_ARRAYID:
qsub -q throughput -t 0,3,9 my_script.sh
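Inside the script, PBS_ARRAYID is typically used to select the piece of work for that instance. The sketch below shows one possible my_script.sh; the input_NNN.txt naming scheme and the helper input_for_task are hypothetical and only illustrate the idea.

```shell
#!/bin/bash
# my_script.sh: each array task derives its own input file from PBS_ARRAYID.

# Map a task id to an input file name (the input_NNN.txt scheme is hypothetical).
input_for_task() {
    printf 'input_%03d.txt' "$1"
}

# Torque sets PBS_ARRAYID for array jobs; default to 0 when run by hand.
task_id="${PBS_ARRAYID:-0}"
input_file="$(input_for_task "$task_id")"

echo "Task $task_id processing $input_file"
```

Defaulting task_id to 0 lets you test the script directly on a workstation before submitting the full array.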
qstat
qstat shows the jobs that are currently in the queues of your Torque cluster. It is normally run without any arguments; if it returns nothing, no jobs are queued or running in the Torque cluster.
Here is an example of what qstat will look like,
$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
135.cbcbtorque            STDIN            tgray26         00:00:00 R workstation
136.cbcbtorque            STDIN            tgray26         00:00:00 R default
137.cbcbtorque            STDIN            tgray26         00:00:00 R default
138.cbcbtorque            STDIN            tgray26                0 Q workstation
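The S column holds the job state (R for running, Q for queued). If you want a quick summary instead of reading the table, a sketch along these lines counts jobs per state. It works on a copy of the sample output above so that it runs anywhere; the helper name count_state is an assumption, and on the cluster you would fill qstat_output from the real command.

```shell
# Sample output copied from the example above; on the cluster use: qstat_output=$(qstat)
qstat_output='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
135.cbcbtorque            STDIN            tgray26         00:00:00 R workstation
136.cbcbtorque            STDIN            tgray26         00:00:00 R default
137.cbcbtorque            STDIN            tgray26         00:00:00 R default
138.cbcbtorque            STDIN            tgray26                0 Q workstation'

# Count jobs in a given state. Skips the two header lines and reads the
# second-to-last column, which holds the state letter.
count_state() {
    printf '%s\n' "$qstat_output" |
        awk -v state="$1" 'NR > 2 && $(NF - 1) == state { n++ } END { print n + 0 }'
}

running=$(count_state R)
queued=$(count_state Q)
echo "running=$running queued=$queued"
```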
For full information about the default settings and maximum resource limits for a queue, use qstat -Q -f:
$ qstat -Q -f throughput
Queue: throughput
    queue_type = Execution
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Complete:0
    resources_max.mem = 36gb
    resources_max.nodect = 1
    resources_max.walltime = 18:00:00
    resources_default.mem = 4gb
    resources_default.walltime = 04:00:00
    mtime = 1424395185
    disallowed_types = interactive
    resources_assigned.mem = 0b
    resources_assigned.ncpus = 0
    resources_assigned.nodect = 0
    max_user_run = 125
    enabled = True
    started = True
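Because each attribute is printed as "name = value", individual limits are easy to pull out before you choose a queue. The sketch below is one way to do that; it uses a few lines copied from the sample output above so that it is self-contained, and the helper name queue_attr is an assumption.

```shell
# Sample lines copied from the output above; on the cluster use:
#   queue_info=$(qstat -Q -f throughput)
queue_info='Queue: throughput
    resources_max.mem = 36gb
    resources_max.walltime = 18:00:00
    resources_default.mem = 4gb
    resources_default.walltime = 04:00:00
    max_user_run = 125'

# Print the value of one attribute, for example resources_max.walltime.
# Leading whitespace is stripped from the name before comparing.
queue_attr() {
    printf '%s\n' "$queue_info" |
        awk -F ' = ' -v key="$1" '{ name = $1; gsub(/^[ \t]+/, "", name) } name == key { print $2 }'
}

max_mem=$(queue_attr resources_max.mem)
max_walltime=$(queue_attr resources_max.walltime)
echo "throughput: max mem $max_mem, max walltime $max_walltime"
```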
qdel
You can remove a running or stalled job with the qdel command. It requires that you give it a <JOBID> that can be found by running qstat.
pbsnodes
To find out what resources and nodes are available in the Torque cluster you can run pbsnodes. It will give you back a detailed list of the nodes and their current status.
For example,
$ pbsnodes
redbud.umiacs.umd.edu
     state = free
     np = 32
     ntype = cluster
     status = rectime=1343744604,varattr=,jobs=,state=free,netload=384208495,gres=,loadave=0.02,ncpus=64,physmem=528633432kb,availmem=530365780kb,totmem=530730576kb,idletime=489899,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux redbud.umiacs.umd.edu 2.6.18-308.11.1.el5 #1 SMP Fri Jun 15 15:41:53 EDT 2012 x86_64,opsys=linux
     gpus = 0

beech.umiacs.umd.edu
     state = free
     np = 2
     ntype = cluster
     status = rectime=1343744577,varattr=,jobs=,state=free,netload=425438230,gres=,loadave=0.00,ncpus=2,physmem=7154944kb,availmem=8960412kb,totmem=9252088kb,idletime=49,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux beech.umiacs.umd.edu 2.6.18-308.11.1.el5 #1 SMP Fri Jun 15 15:41:53 EDT 2012 x86_64,opsys=linux
     gpus = 0
Using CBCB Modules with Torque
To use CBCB Software Modules with Torque, you will need to add these lines to your ~/.bashrc:
. /usr/share/Modules/init/bash
. /etc/profile.d/ummodules.sh
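If your ~/.bashrc is also read on hosts where these files might not be present (an assumption about your setup, not something this page requires), a more defensive variant sources each file only when it is readable. This is a sketch of that idea using the same two paths:

```shell
# Source each module init script only if it exists and is readable, so the
# same ~/.bashrc does not print errors on hosts without CBCB modules.
for init_script in /usr/share/Modules/init/bash /etc/profile.d/ummodules.sh; do
    if [ -r "$init_script" ]; then
        . "$init_script"
    fi
done
# Do not leave the loop variable behind in the interactive shell.
unset init_script
```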
Host Monitoring
http://ganglia.umiacs.umd.edu/ganglia/?c=cbcb_compute&m=load_one&r=hour&s=by%20name&hc=4&mc=2