Torque

From Cbcb
Revision as of 17:33, 3 March 2015 by Pknut777 (talk | contribs)

Getting Started

Torque is a resource manager that works with a separate program, Maui, which provides the scheduling for the cluster. To get started, make sure that your SSH keys are set up for password-less SSH. In our Torque environments this is critical for delivering the error and output of your jobs back to where the job was submitted from. Please note that Fair Share is enabled in this setup, with a historical scope of 12 hours.
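A minimal sketch of one common way to set up password-less SSH. It assumes your home directory (and so ~/.ssh) is shared between the submit hosts and the compute nodes, which this page does not state. The function name setup_passwordless_ssh and the choice of an RSA key are illustrative only. Note that the key is created with an empty passphrase.

```shell
# Hypothetical helper: create a passphrase-less key pair in the given
# directory (default ~/.ssh) and authorize it for logins to this account.
setup_passwordless_ssh() {
    ssh_dir="${1:-$HOME/.ssh}"
    key="$ssh_dir/id_rsa"

    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1

    # Generate a key only if one does not already exist.
    if [ ! -f "$key" ]; then
        ssh-keygen -q -t rsa -b 4096 -N "" -f "$key" || return 1
    fi

    # Append the public key to authorized_keys unless it is already there.
    touch "$ssh_dir/authorized_keys"
    if ! grep -qxF "$(cat "$key.pub")" "$ssh_dir/authorized_keys"; then
        cat "$key.pub" >> "$ssh_dir/authorized_keys"
    fi
    chmod 600 "$ssh_dir/authorized_keys"
}
```

Running setup_passwordless_ssh with no arguments uses ~/.ssh. Afterwards, logging in to another node with ssh should no longer prompt for a password.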

The hosts that you can submit from are any CBCB Workstation and the ibissub00.umiacs.umd.edu and ibissub01.umiacs.umd.edu nodes.

Once that is set up, the following queues are available to users:

  • default - default memory is 3GB (max 4GB), default walltime is 1 hour (max 1 hour), allows up to 16 jobs per user concurrently
  • shell - interactive jobs only - default memory is 2GB (max 4GB), default walltime is 12 hours (max 2 weeks), allows up to 4 jobs per user concurrently - restricted to nodes with shell* property
  • workstation - default memory is 4GB (max 47GB), default walltime is 8 hours (max 1 week), allows up to 4 jobs per user concurrently
  • throughput - no interactive jobs - default memory is 4GB (max 36GB), default walltime is 4 hours (max 18 hours), allows up to 125 jobs per user concurrently - restricted to nodes with ibis* property
  • high_throughput - no interactive jobs - default memory is 4GB (max 8GB), default walltime is 3 hours (max 6 hours), allows up to 300 jobs per user concurrently - restricted to nodes with ibis* property
  • long - no interactive jobs - default memory is 12GB (max 12GB), default walltime is 8 hours (max 1 week), allows up to 16 jobs per user concurrently
  • large - no interactive jobs - default memory is 32GB (max 120GB), default walltime is 24 hours (max 11 days), allows up to 3 jobs per user concurrently
  • xlarge - default memory is 100GB (max is unlimited), default walltime is 1 week (max 3 weeks), allows 1 job per user at a time
    • The xlarge queue is restricted to members of the group cbcbtorque. If you need to run large jobs, please send mail to staff@umiacs.umd.edu.

*You can list nodes with a specific property by running "pbsnodes :property", where property is the property you want to see.
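As a small sketch of the footnote above, the following wraps the "pbsnodes :property" form in a helper. The function name nodes_with_property is hypothetical, and the property name ibis is an assumption: it reads the asterisk in "ibis*" in the queue list as a footnote marker rather than as part of the property name.

```shell
# Hypothetical helper: list the nodes that carry a given property.
nodes_with_property() {
    pbsnodes ":$1"
}

# Only call pbsnodes where the Torque client tools are installed.
if command -v pbsnodes >/dev/null 2>&1; then
    nodes_with_property ibis    # assumed property name, see above
fi
```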

qsub

qsub is how you submit jobs to a Torque cluster. A job is a shell script that is given either on STDIN or as a file on the command line. The -l (lower case L) option lets you request resources for your job.

Your jobs will not always be penalized for using more or fewer resources than you request, but it is very important to request resources as accurately as possible so that Torque knows how many resources each machine has available when new jobs are scheduled. If your job uses more resources than you request, another job may be scheduled on the same machine, which could run the machine out of resources, cause segfaults, and eventually bring down the machine. Likewise, if you request more resources than you need, you will slow down the execution of other users' jobs, because Torque may think a machine is at capacity when it actually is not.
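The two ways of handing a job to qsub can be sketched as follows. The script name hello_job.sh is hypothetical, and the qsub calls are guarded so the snippet only submits where the Torque client tools are installed.

```shell
# Write a minimal job script; hello_job.sh is a hypothetical name.
cat > hello_job.sh <<'EOF'
#!/bin/bash
# Print the name of the compute node the job landed on.
echo "Running on $(uname -n)"
EOF

if command -v qsub >/dev/null 2>&1; then
    qsub hello_job.sh      # job script given as a file
    qsub < hello_job.sh    # same script given on STDIN
else
    echo "qsub not found; run this on a submit host" >&2
fi
```

Both submissions run the same commands. They differ only in the job name, which is the script name in the first case and STDIN in the second.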


To specify the queue that you would like to submit to, use the -q option:

 qsub -q workstation


Use these options with the -l (lower case L) option to request resources:

  • ncpus=4
  • mem=32GB
  • walltime=12:00:00

For a full list of job submission arguments, see Torque Job Submission Arguments.
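The queue and resource requests can also be embedded in the job script itself as #PBS directive lines, so that they do not have to be repeated on every qsub command line. The sketch below uses the example values from the list above. The script name analysis_job.sh and the job name analysis are hypothetical.

```shell
# Write a job script whose header carries the qsub options.
cat > analysis_job.sh <<'EOF'
#!/bin/bash
#PBS -N analysis
#PBS -q workstation
#PBS -l ncpus=4,mem=32GB,walltime=12:00:00

# Torque sets PBS_O_WORKDIR to the directory qsub was run from;
# fall back to the current directory when run outside of Torque.
cd "${PBS_O_WORKDIR:-.}"
echo "Started on $(uname -n) at $(date)"
EOF

if command -v qsub >/dev/null 2>&1; then
    qsub analysis_job.sh
else
    echo "qsub not found; run this on a submit host" >&2
fi
```

Options given on the qsub command line take precedence over the directives in the script, so a single run can still override them.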


As an example, to run the perl script myscript.pl with 64GB of memory for 12 hours you could run the following:

 qsub -q large -l mem=64GB,walltime=12:00:00 myscript.pl

Note that the large queue was used in the above example because 64GB is more memory than the max allowed in any of the other queues except xlarge, which is restricted. By default, all of the other queues reserve approximately the maximum memory allowed for that queue, but you may set a lower reservation if you know you will not need the full amount.

Once you have submitted your job for execution, you will get back an identifier of the form:

 <JOBID>.<PBSSERVER>

You can use that <JOBID> to delete or find your job later if there is a problem.

When a job finishes, Torque/PBS deposits the standard output and standard error as

  • <jobname>.o<number>
  • <jobname>.e<number>

where <jobname> is the name of the script you submitted (or STDIN if it came from qsub's standard input), and <number> is the leading number in the job ID.
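The naming rule above can be sketched as a small helper that derives both file names from a job name and the ID that qsub printed. The function name job_output_files is hypothetical, and the job ID 135.cbcbtorque is only an illustrative value.

```shell
# Hypothetical helper: print the stdout and stderr file names for a job.
job_output_files() {
    jobname="$1"
    jobid="$2"
    number="${jobid%%.*}"    # keep only the part before the first dot
    echo "${jobname}.o${number}"
    echo "${jobname}.e${number}"
}

job_output_files myscript.pl 135.cbcbtorque
```

For this input the helper prints myscript.pl.o135 and myscript.pl.e135.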

Interactive Jobs

Interactive jobs allow you to schedule interactive shell access on Torque-scheduled compute nodes. You can get an interactive session with the -I (upper case i) option:

 qsub -I

Please note that only the "workstation", "shell", and "default" queues allow interactive jobs. If you require larger resource allocations than the queue defaults, the -l (lower case L) flag still applies.
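As a sketch of combining -I with a queue and a resource request, the following asks for an interactive session in the shell queue. The memory and walltime values are illustrative and were chosen to stay within that queue's listed limits.

```shell
QUEUE=shell
RESOURCES="mem=4GB,walltime=02:00:00"    # illustrative request

if command -v qsub >/dev/null 2>&1; then
    # Blocks until the scheduler starts the session, then opens a shell
    # on the allocated node.
    qsub -I -q "$QUEUE" -l "$RESOURCES"
else
    echo "qsub not found; run this on a submit host" >&2
fi
```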

qstat

qstat displays the jobs that are currently in the queues of your Torque cluster. It is normally run without any arguments; if it returns nothing, then no jobs are queued or running in the cluster.

Here is an example of what qstat output looks like:

$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
135.cbcbtorque             STDIN            tgray26         00:00:00 R workstation
136.cbcbtorque             STDIN            tgray26         00:00:00 R default        
137.cbcbtorque             STDIN            tgray26         00:00:00 R default        
138.cbcbtorque             STDIN            tgray26                0 Q workstation
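For a quick overview of a long listing, the output can be summarized per job state. This is a minimal sketch that assumes the default column layout shown above, with the state (S) as the second-to-last column. The function name summarize_qstat is hypothetical.

```shell
# Hypothetical helper: count jobs per state from default qstat output
# read on standard input, skipping the two header lines.
summarize_qstat() {
    awk 'NR > 2 && NF >= 6 { count[$(NF-1)]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Only call qstat where the Torque client tools are installed.
if command -v qstat >/dev/null 2>&1; then
    qstat | summarize_qstat
fi
```

For the example listing above this prints "Q 1" and "R 3": one queued job and three running jobs.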

qdel

You can remove a running or stalled job with the qdel command. It requires that you give it a <JOBID> that can be found by running qstat.
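A brief sketch of typical qdel usage. Both function names are hypothetical. The second helper relies on Torque's qselect command to find your own jobs that are still queued (state Q), and nothing is deleted until you call one of the functions.

```shell
# Hypothetical helper: delete one job by the <JOBID> that qstat shows,
# for example 138 or 138.cbcbtorque.
delete_job() {
    qdel "$1"
}

# Hypothetical helper: delete all of your own jobs that are still queued.
delete_my_queued_jobs() {
    qselect -u "$USER" -s Q | while read -r jobid; do
        qdel "$jobid"
    done
}
```

For example, delete_job 138 would remove the queued job from the qstat listing shown earlier on this page.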

pbsnodes

To find out what resources and nodes are available in the Torque cluster you can run pbsnodes. It will give you back a detailed list of the nodes and their current status.

For example,

$ pbsnodes
redbud.umiacs.umd.edu
     state = free
     np = 32
     ntype = cluster
     status = rectime=1343744604,varattr=,jobs=,state=free,netload=384208495,gres=,loadave=0.02,ncpus=64,
              physmem=528633432kb,availmem=530365780kb,totmem=530730576kb,idletime=489899,nusers=0,nsessions=? 0,
              sessions=? 0,uname=Linux redbud.umiacs.umd.edu 2.6.18-308.11.1.el5 #1 SMP Fri Jun 15 15:41:53 EDT 2012 x86_64,opsys=linux
     gpus = 0

beech.umiacs.umd.edu
     state = free
     np = 2
     ntype = cluster
     status = rectime=1343744577,varattr=,jobs=,state=free,netload=425438230,gres=,loadave=0.00,ncpus=2,
              physmem=7154944kb,availmem=8960412kb,totmem=9252088kb,idletime=49,nusers=0,nsessions=? 0,
              sessions=? 0,uname=Linux beech.umiacs.umd.edu 2.6.18-308.11.1.el5 #1 SMP Fri Jun 15 15:41:53 EDT 2012 x86_64,opsys=linux
     gpus = 0
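When the full listing is too verbose, it can be reduced to the free nodes and their processor counts. This is a minimal sketch that assumes the output layout shown above: an unindented node name followed by indented "key = value" lines. The function name free_nodes is hypothetical.

```shell
# Hypothetical helper: print "<node> <np>" for every node whose state is
# exactly "free", reading pbsnodes output on standard input.
free_nodes() {
    awk '
        /^[^[:space:]]/                            { node = $1 }
        $1 == "state" && $2 == "="                 { state = $3 }
        $1 == "np" && $2 == "=" && state == "free" { print node, $3 }
    '
}

# Only call pbsnodes where the Torque client tools are installed.
if command -v pbsnodes >/dev/null 2>&1; then
    pbsnodes | free_nodes
fi
```

For the example output above this prints "redbud.umiacs.umd.edu 32" and "beech.umiacs.umd.edu 2".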

Host Monitoring

http://ganglia.umiacs.umd.edu/ganglia/?c=cbcb_compute&m=load_one&r=hour&s=by%20name&hc=4&mc=2