Torque: Difference between revisions

==Getting Started==
'''Refer to the [https://wiki.umiacs.umd.edu/umiacs/index.php/Nexus/CBCB UMIACS SLURM wiki] instead of this page for current information.'''
 
Torque is a resource manager that works together with Maui, a separate program that provides the scheduling for the cluster. To get started, make sure that your [https://wiki.umiacs.umd.edu/umiacs/index.php/SSH#SSH_Keys_.28and_Passwordless_SSH.29 SSH keys] are set up for password-less SSH. In our Torque environments this is critical for delivering the error and output of your jobs back to the host the job was submitted from. Please note that [https://wiki.umiacs.umd.edu/umiacs/index.php/Fairshare Fair Share] is enabled in this setup, with a historical scope of 12 hours.
 
You can submit jobs from any CBCB workstation and from the <tt>ibissub00.umiacs.umd.edu</tt>, <tt>ibissub01.umiacs.umd.edu</tt>, and <tt>heronsub00.umiacs.umd.edu</tt> nodes.
 
Once your SSH keys are set up, the following queues are available to users. Use <code>qstat -Q -f</code> to see the resource limits for each queue.
 
=== Red Hat 5 Queues ===
 
* default - default memory is 3GB (max 4GB), default walltime is 1 hour (max 1 hour), allows up to 16 jobs per user concurrently
* shell - interactive jobs only - default memory is 2GB (max 4GB), default walltime is 12 hours (max 2 weeks), allows up to 4 jobs per user concurrently - restricted to nodes with '''shell'''* property
* workstation - default memory is 4GB (max 47GB), default walltime is 8 hours (max 1 week), allows up to 4 jobs per user concurrently
* throughput - no interactive jobs - default memory is 4GB (max 36GB), default walltime is 4 hours (max 18 hours), allows up to 125 jobs per user concurrently - restricted to nodes with '''ibis'''* property
* high_throughput - no interactive jobs - default memory is 4GB (max 8GB), default walltime is 3 hours (max 6 hours), allows up to 300 jobs per user concurrently - restricted to nodes with '''ibis'''* property
* long - no interactive jobs - default memory is 12GB (max 12GB), default walltime is 8 hours (max 1 week), allows up to 16 jobs per user concurrently
* large - no interactive jobs - default memory is 32GB (max 120GB), default walltime is 24 hours (max 11 days), allows up to 3 jobs per user concurrently
* xlarge - default memory is 100GB (max is unlimited), default walltime is 1 week (max 3 weeks), allows 1 job per user at a time
** The xlarge queue is restricted to members of the group cbcbtorque. If you need to run jobs in this queue, please send mail to staff@umiacs.umd.edu.
 
=== Red Hat 7 Queues ===
 
These queues submit jobs to the RHEL 7 heron* nodes.
 
* new_workstation - default memory is 4GB (max 47GB), default walltime is 8 hours (max 1 week), allows up to 4 jobs per user concurrently
* new_throughput - no interactive jobs - default memory is 4GB (max 36GB), default walltime is 4 hours (max 18 hours), allows up to 125 jobs per user concurrently
* new_hthroughput - no interactive jobs - default memory is 4GB (max 8GB), default walltime is 3 hours (max 6 hours), allows up to 300 jobs per user concurrently
* new_large - no interactive jobs - default memory is 32GB (max 120GB), default walltime is 24 hours (max 11 days), allows up to 3 jobs per user concurrently
<nowiki>*</nowiki> You can list the nodes that have a specific property by running <code>pbsnodes :property</code>, where <code>property</code> is the property you want to see (for example, <code>pbsnodes :ibis</code>).
 
===qsub===
 
qsub submits jobs to a Torque cluster. A job is a shell script, given either on STDIN or as a file on the command line. The -l (lower case L) option specifies the resources your job needs. Although a job is not always penalized for using more or fewer resources than it requested, it is very important to request resources as accurately as possible so that Torque knows how much capacity each machine has left when it schedules new jobs. If your job uses more resources than it requested, another job may be scheduled on the same machine; together they can run the machine out of resources, cause segfaults, and eventually bring the machine down. Likewise, if you request more resources than you need, other users' jobs are delayed because Torque may treat a machine as being at capacity when it actually is not.
 
 
 
To specify the queue that you would like to submit to, use the -q option:
 
  qsub -q workstation
 
 
Use these options with the -l (lower case L) option to request resources:
 
* ncpus=4
* mem=32GB
* walltime=12:00:00
 
For a full list of job submission arguments, see [http://docs.adaptivecomputing.com/torque/4-2-6/help.htm#topics/2-jobs/jobSubmission.htm Torque Job Submission Arguments].
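
The same options can also be written into the job script itself as <code>#PBS</code> directive lines, which qsub reads from the top of the script. The following is a minimal sketch, not a site-endorsed template: the script name <code>myjob.sh</code>, the job name, and the resource values are illustrative assumptions. <code>PBS_O_WORKDIR</code> and <code>PBS_JOBID</code> are set by Torque when the job runs; the fallbacks are there only so the script also runs outside the cluster.

```shell
#!/bin/bash
# Hypothetical job script (myjob.sh); queue and resource values are illustrative.
#PBS -q workstation
#PBS -l ncpus=4,mem=32GB,walltime=12:00:00
#PBS -N myjob

# Torque starts a job in your home directory, so change to the directory
# qsub was run from (falls back to the current directory outside Torque).
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

job_banner() {
    # Record the job id and execution host to make later debugging easier.
    echo "job=${PBS_JOBID:-none} host=$(uname -n)"
}

job_banner
```

With the directives in place, the script would be submitted with just <code>qsub myjob.sh</code>; options given on the command line take precedence over the directives in the script.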
 
 
As an example, to run the perl script myscript.pl on 4 CPUs with 120GB of memory for 12 hours you could run the following:
 
  qsub -q large -l ncpus=4,mem=120GB,walltime=12:00:00 myscript.pl
 
Note that the large queue was used in the above example because 120GB is more memory than the max allowed in any of the other unrestricted queues. By default, all of the other queues reserve approximately the maximum memory allowed for that queue, but you may set a lower reservation if you know you will not need the full amount.
 
Once you have submitted your job for execution, qsub prints a job identifier of the form:
 
  <JOBID>.<PBSSERVER>
 
You can use that <JOBID> to delete or find your job later if there is a problem.
 
When a job finishes, Torque/PBS deposits the standard output and standard error as
 
* <jobname>.o<number>
* <jobname>.e<number>
 
Here <jobname> is the name of the script you submitted (or STDIN if the script came from qsub's standard input), and <number> is the leading number in the job id.
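
The naming rule can be sketched as a small shell helper. The function name <code>output_files</code> is hypothetical and not part of Torque; it simply applies the rule described here to a job id.

```shell
#!/bin/bash
# Sketch: predict the output file names for a job.
# Usage: output_files JOBID [JOBNAME]   (JOBNAME defaults to STDIN)
output_files() {
    local jobid="$1" jobname="${2:-STDIN}"
    local number="${jobid%%.*}"   # drop ".<PBSSERVER>", keeping the leading number
    echo "${jobname}.o${number}"  # standard output
    echo "${jobname}.e${number}"  # standard error
}

output_files 135.cbcbtorque myscript.pl
# prints:
# myscript.pl.o135
# myscript.pl.e135
```

Because qsub prints the job id on standard output, it can be captured at submission time, for example with <code>jobid=$(qsub -q workstation myscript.pl)</code>, and passed to the helper.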
 
====Interactive Jobs====
 
Interactive jobs give you interactive shell access on Torque-scheduled compute nodes. You can get an interactive session with the -I (upper case i) option:
 
  qsub -I
 
Please note that only the "workstation", "shell", and "default" queues allow interactive jobs. If you require larger resource allocations than the queue defaults, the -l (lower case L) flag still applies.
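
As an illustration of combining -I with -q and -l, the sketch below only prints the command line rather than running it, so it can be tried anywhere. The helper name <code>interactive_cmd</code> and the default values are assumptions chosen to stay within the workstation queue limits listed on this page.

```shell
#!/bin/bash
# Sketch: build (but do not run) an interactive qsub command line.
# Usage: interactive_cmd [QUEUE] [MEM] [WALLTIME]
interactive_cmd() {
    local queue="${1:-workstation}" mem="${2:-8GB}" walltime="${3:-02:00:00}"
    echo "qsub -I -q ${queue} -l mem=${mem},walltime=${walltime}"
}

interactive_cmd workstation 8GB 02:00:00
# prints: qsub -I -q workstation -l mem=8GB,walltime=02:00:00
```

Running the printed line on a submit host would request a two-hour interactive session with 8GB of memory.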
 
===qstat===
 
qstat displays the jobs that are currently in the queues of your Torque cluster. It is normally run without any arguments; if it returns nothing, there are no jobs queued or running in the Torque cluster.
 
Here is an example of qstat output:
 
<blockquote><pre>
$ qstat
Job id                    Name            User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
135.cbcbtorque            STDIN            tgray26        00:00:00 R workstation
136.cbcbtorque            STDIN            tgray26        00:00:00 R default       
137.cbcbtorque            STDIN            tgray26        00:00:00 R default       
138.cbcbtorque            STDIN            tgray26                0 Q workstation
</pre></blockquote>
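
When the listing is long, it can be summarized by job state (R for running, Q for queued). The following is a sketch that assumes the default column layout shown above, with the state in the fifth column; <code>count_states</code> is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: tally qstat output by job state.
count_states() {
    # Skip the two header lines, then count occurrences of the state column.
    awk 'NR > 2 && NF >= 6 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input copied from the listing above.
# On a submit host you would run: qstat | count_states
count_states <<'EOF'
Job id                    Name            User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
135.cbcbtorque            STDIN            tgray26        00:00:00 R workstation
136.cbcbtorque            STDIN            tgray26        00:00:00 R default
137.cbcbtorque            STDIN            tgray26        00:00:00 R default
138.cbcbtorque            STDIN            tgray26                0 Q workstation
EOF
# prints:
# Q 1
# R 3
```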
 
For full information about the default settings and maximum resource limits for a queue, use <code>qstat -Q -f</code>:
 
<blockquote><pre>
$ qstat -Q -f throughput
Queue: throughput
    queue_type = Execution
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Complete:0
    resources_max.mem = 36gb
    resources_max.nodect = 1
    resources_max.walltime = 18:00:00
    resources_default.mem = 4gb
    resources_default.walltime = 04:00:00
    mtime = 1424395185
    disallowed_types = interactive
    resources_assigned.mem = 0b
    resources_assigned.ncpus = 0
    resources_assigned.nodect = 0
    max_user_run = 125
    enabled = True
    started = True
</pre></blockquote>
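
A single limit can be pulled out of this output with a short filter, which is convenient in scripts that check a request before submitting. The sketch below assumes the <code>name = value</code> layout shown above; <code>queue_attr</code> is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print the value of one attribute from `qstat -Q -f` output on stdin.
# Usage: qstat -Q -f throughput | queue_attr resources_max.walltime
queue_attr() {
    awk -v key="$1" '$1 == key && $2 == "=" { print $3 }'
}

# Sample lines copied from the listing above.
queue_attr resources_max.walltime <<'EOF'
Queue: throughput
    resources_max.mem = 36gb
    resources_max.walltime = 18:00:00
    resources_default.mem = 4gb
EOF
# prints: 18:00:00
```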
 
===qdel===
 
You can remove a running or stalled job with the qdel command.  It requires that you give it a <JOBID> that can be found by running qstat.
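
To remove several jobs at once, the ids can be extracted from qstat output first. This sketch assumes the default qstat column layout (job id in the first column, user in the third) and that the user name is not truncated in the listing; <code>my_job_ids</code> is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: read qstat output on stdin and print the job ids owned by USER.
# Usage: qstat | my_job_ids "$USER"
my_job_ids() {
    awk -v user="$1" 'NR > 2 && $3 == user { print $1 }'
}

# Sample input; the second job belongs to a made-up user for illustration.
my_job_ids tgray26 <<'EOF'
Job id                    Name            User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
135.cbcbtorque            STDIN            tgray26        00:00:00 R workstation
136.cbcbtorque            STDIN            otheruser      00:00:00 R default
EOF
# prints: 135.cbcbtorque
```

On a submit host, something like <code>qstat | my_job_ids "$USER" | xargs qdel</code> would then delete all of your jobs. Review the list of ids before piping it to qdel.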
 
===pbsnodes===
 
To find out what resources and nodes are available in the Torque cluster you can run pbsnodes.  It will give you back a detailed list of the nodes and their current status.
 
For example,
 
<blockquote><pre>
$ pbsnodes
redbud.umiacs.umd.edu
    state = free
    np = 32
    ntype = cluster
    status = rectime=1343744604,varattr=,jobs=,state=free,netload=384208495,gres=,loadave=0.02,ncpus=64,
              physmem=528633432kb,availmem=530365780kb,totmem=530730576kb,idletime=489899,nusers=0,nsessions=? 0,
              sessions=? 0,uname=Linux redbud.umiacs.umd.edu 2.6.18-308.11.1.el5 #1 SMP Fri Jun 15 15:41:53 EDT 2012 x86_64,opsys=linux
    gpus = 0
 
beech.umiacs.umd.edu
    state = free
    np = 2
    ntype = cluster
    status = rectime=1343744577,varattr=,jobs=,state=free,netload=425438230,gres=,loadave=0.00,ncpus=2,
              physmem=7154944kb,availmem=8960412kb,totmem=9252088kb,idletime=49,nusers=0,nsessions=? 0,
              sessions=? 0,uname=Linux beech.umiacs.umd.edu 2.6.18-308.11.1.el5 #1 SMP Fri Jun 15 15:41:53 EDT 2012 x86_64,opsys=linux
    gpus = 0
</pre></blockquote>
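
Because the full listing is verbose, it can help to reduce it to the free nodes and their processor counts. The sketch below assumes the layout shown above (an unindented node name followed by indented <code>name = value</code> lines); <code>free_nodes</code> is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: read pbsnodes output on stdin and print "<node> <np>" for each node
# whose state is exactly "free".
free_nodes() {
    awk '
        function emit() { if (node != "" && state == "free") print node, np }
        /^[^ \t]/                  { emit(); node = $1; state = ""; np = "" }  # unindented line starts a new node
        $1 == "state" && $2 == "=" { state = $3 }
        $1 == "np" && $2 == "="    { np = $3 }
        END                        { emit() }
    '
}

# Abbreviated sample based on the listing above.
# On a submit host you would run: pbsnodes | free_nodes
free_nodes <<'EOF'
redbud.umiacs.umd.edu
    state = free
    np = 32
    ntype = cluster

beech.umiacs.umd.edu
    state = free
    np = 2
    ntype = cluster
EOF
# prints:
# redbud.umiacs.umd.edu 32
# beech.umiacs.umd.edu 2
```
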
===Host Monitoring===
http://ganglia.umiacs.umd.edu/ganglia/?c=cbcb_compute&m=load_one&r=hour&s=by%20name&hc=4&mc=2

Latest revision as of 13:15, 12 June 2024
