Torque

==Getting Started==
'''Refer to the [https://wiki.umiacs.umd.edu/cbcb-private/index.php/Slurm private Slurm wiki] or the [https://wiki.umiacs.umd.edu/umiacs/index.php/SLURM UMIACS SLURM wiki] instead of this page for current information.'''
 
Torque is a resource manager that works with another program, Maui, which provides the scheduling for the cluster.  To get started, make sure that your [https://wiki.umiacs.umd.edu/umiacs/index.php/SSH#SSH_Keys_.28and_Passwordless_SSH.29 SSH keys] are set up for password-less SSH.  In our Torque environments this is critical for delivering the error and output files of your jobs back to the host the job was submitted from. Please note that [https://wiki.umiacs.umd.edu/umiacs/index.php/Fairshare Fair Share] is enabled in this setup, with a historical scope of 12 hours.
 
You can submit jobs from any CBCB workstation and from the <tt>ibissub00.umiacs.umd.edu</tt> and <tt>ibissub01.umiacs.umd.edu</tt> nodes.
 
Once your SSH keys are set up, the following queues are available to users:
 
* shell - default memory is 2GB (max 4GB), default walltime is 12 hours (max 2 weeks)
* workstation - default memory is 4GB (max 47GB), default walltime is 8 hours (max 10 days)
* throughput - default memory is 4GB (max 36GB), default walltime is 4 hours (max 18 hours)
* high_throughput - default memory is 4GB (max 8GB), default walltime is 3 hours (max 6 hours)
* large - default memory is 32GB (max 120GB), default walltime is 24 hours (max 10 days)
* xlarge - default memory is unlimited, default walltime is 1 week (max 3 weeks)
** The xlarge queue is restricted to members of the group cbcbtorque. If you need to run large jobs, please send mail to staff@umiacs.umd.edu or speak to Dan Sommer.
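The limits above can be turned into a small lookup. The following is only a sketch: <tt>pick_queue</tt> is a hypothetical helper, not part of Torque, and the numbers are copied from the list above (with weeks and days converted to hours), so they must be updated by hand if the queue limits change. It prints the queue with the smallest memory limit that still fits the requested memory (in whole GB) and walltime (in whole hours).

```shell
# pick_queue MEM_GB HOURS
# Hypothetical helper: print the first queue whose limits fit the request.
pick_queue() {
    mem_gb=$1
    hours=$2
    # Entries are name:max_mem_gb:max_walltime_hours, smallest memory first.
    for entry in shell:4:336 high_throughput:8:6 throughput:36:18 \
                 workstation:47:240 large:120:240; do
        name=${entry%%:*}
        rest=${entry#*:}
        max_mem=${rest%%:*}
        max_hours=${rest#*:}
        if [ "$mem_gb" -le "$max_mem" ] && [ "$hours" -le "$max_hours" ]; then
            echo "$name"
            return 0
        fi
    done
    # xlarge has no memory limit but is restricted to the cbcbtorque group.
    if [ "$hours" -le 504 ]; then
        echo xlarge
        return 0
    fi
    echo "no queue allows $hours hours of walltime" >&2
    return 1
}

pick_queue 100 12   # prints: large
```

The result can be passed to qsub with the -q option. Remember that a job sent to xlarge will only be accepted if you are a member of the cbcbtorque group.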
 
===qsub===
 
qsub is how you submit jobs to a Torque cluster.  A job is a shell script that is given on STDIN or as a file on the command line.  The -l (lower case L) option lets you specify the resources your job needs. Your jobs will not always be penalized for using more or fewer resources than you request, but it is very important to request resources as accurately as possible so that Torque knows how much capacity each machine has left when new jobs are scheduled. If your job uses more resources than you requested, another job may be scheduled on the same machine, which can run the machine out of resources, cause segfaults, and eventually bring the machine down. Likewise, if you request more resources than you need, you will slow down other users' jobs because Torque may think a machine is at capacity when it is not.
 
 
 
To specify the queue that you would like to submit to, use the -q option:
 
  qsub -q workstation
 
 
Use these options with the -l (lower case L) option to request resources:
 
* mem=32GB
* walltime=12:00:00
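The walltime value is written as hours:minutes:seconds, and the hours field may exceed 24, so a limit stated in days has to be converted first. A minimal sketch of that arithmetic follows; <tt>walltime_from</tt> is a hypothetical helper name, and it expects plain integers without leading zeros.

```shell
# walltime_from DAYS HOURS
# Hypothetical helper: print a walltime string in HH:MM:SS form.
walltime_from() {
    printf '%02d:00:00\n' $(( $1 * 24 + $2 ))
}

walltime_from 0 12    # prints: 12:00:00
walltime_from 10 0    # prints: 240:00:00
```

The output can be used directly in a resource request, for example <tt>-l mem=4GB,walltime=$(walltime_from 2 0)</tt>.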
 
For a full list of job submission arguments, see [http://www.clusterresources.com/torquedocs21/2.1jobsubmission.shtml Torque Job Submission Arguments].
 
 
As an example, to run the Perl script myscript.pl with 100GB of memory for 12 hours you could run the following:

  qsub -q large -l mem=100GB,walltime=12:00:00 myscript.pl

Note that the large queue was used in the above example because 100GB is more memory than the maximum allowed in any of the smaller queues. By default, all of the other queues reserve approximately the maximum memory allowed for that queue, but you may set a lower reservation if you know you will not need the full amount.
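The same options can also be stored in the job script itself as <tt>#PBS</tt> directive lines, so they do not have to be retyped on every submission. The script below is a sketch under assumptions: the job name <tt>example_job</tt>, the queue, and the resource values are placeholders, and <tt>myscript.pl</tt> stands in for your own workload.

```shell
#!/bin/bash
#PBS -N example_job
#PBS -q workstation
#PBS -l mem=8GB,walltime=04:00:00

# Torque starts the job in your home directory; PBS_O_WORKDIR holds the
# directory qsub was run from. Fall back to the current directory so the
# script can also be run by hand for testing.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# PBS_JOBID is set by Torque inside a running job.
job_label=${PBS_JOBID:-not-in-torque}
echo "job $job_label started on $(uname -n)"

# Placeholder workload; replace with your own command.
if [ -f myscript.pl ]; then
    perl myscript.pl
fi
```

Saved under a name of your choice, the script is submitted by passing that file name to qsub. Options given on the qsub command line take precedence over the <tt>#PBS</tt> lines.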
 
 
You can get an interactive session with the -I (upper case i) option:
 
  qsub -I
 
Once you have submitted your job for execution, you will get back a job identifier of the form:
 
  <JOBID>.<PBSSERVER>
 
You can use that <JOBID> to delete or find your job later if there is a problem.
 
When a job finishes, Torque/PBS deposits the standard output and standard error as
 
* <jobname>.o<number>
* <jobname>.e<number>
 
Here <jobname> is the name of the script you submitted (or STDIN if the script came from qsub's standard input), and <number> is the leading number in the job ID.
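The naming rule can be written down as a short helper that predicts where the output of a job will land. This is a sketch: <tt>output_files</tt> is a hypothetical function, and it assumes the default job name (the script's file name without its directory), which will not hold if you set a different name with the -N option.

```shell
# output_files SCRIPT JOBID
# Hypothetical helper: print the expected stdout and stderr file names.
output_files() {
    jobname=$(basename "$1")   # default job name: script name without directories
    number=${2%%.*}            # leading number of <JOBID>.<PBSSERVER>
    echo "$jobname.o$number"
    echo "$jobname.e$number"
}

output_files myscript.pl 135.cbcbtorque
# prints:
#   myscript.pl.o135
#   myscript.pl.e135
```

Because qsub prints the job ID on standard output, the ID can be captured at submission time with something like <tt>jobid=$(qsub myscript.pl)</tt> and then passed to the helper.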
 
===qstat===
 
qstat displays the jobs that are currently in the queue of your Torque cluster.  It is normally run without any arguments; if it returns nothing, there are no jobs queued or running in the Torque cluster.
 
Here is an example of qstat output:
 
<blockquote><pre>
$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
135.cbcbtorque            STDIN            tgray26         00:00:00 R workstation
136.cbcbtorque            STDIN            tgray26         00:00:00 R default
137.cbcbtorque            STDIN            tgray26         00:00:00 R default
138.cbcbtorque            STDIN            tgray26                0 Q workstation
</pre></blockquote>
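When the queue is long, it can help to summarize the S (state) column, where R is a running job and Q is a queued one. The following sketch counts jobs per state with awk. <tt>count_states</tt> is a hypothetical helper, and it assumes the default qstat layout shown above: two header lines, with the state in the second-to-last column. The sample text is used here in place of live qstat output.

```shell
# count_states: read qstat output on stdin, print "STATE COUNT" per state.
count_states() {
    awk 'NR > 2 && NF >= 6 { count[$(NF-1)]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Sample taken from the example above; on the cluster use: qstat | count_states
sample='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
135.cbcbtorque            STDIN            tgray26         00:00:00 R workstation
136.cbcbtorque            STDIN            tgray26         00:00:00 R default
137.cbcbtorque            STDIN            tgray26         00:00:00 R default
138.cbcbtorque            STDIN            tgray26                0 Q workstation'

printf '%s\n' "$sample" | count_states
# prints:
#   Q 1
#   R 3
```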
 
===qdel===
 
You can remove a running or stalled job with the qdel command.  It requires that you give it a <JOBID> that can be found by running qstat.
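To clear out several jobs at once, the job IDs can be pulled from qstat output first and reviewed before anything is deleted. The sketch below only prints IDs and never calls qdel itself. <tt>queued_job_ids</tt> is a hypothetical helper that assumes the default qstat layout (user in the third column, state in the second-to-last column), and the sample text stands in for live qstat output.

```shell
# queued_job_ids USER: read qstat output on stdin, print the IDs of that
# user's jobs that are still waiting in the queue (state Q).
queued_job_ids() {
    awk -v user="$1" 'NR > 2 && $3 == user && $(NF-1) == "Q" { print $1 }'
}

# Sample input; on the cluster use: qstat | queued_job_ids "$USER"
sample='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
137.cbcbtorque            STDIN            tgray26         00:00:00 R default
138.cbcbtorque            STDIN            tgray26                0 Q workstation'

printf '%s\n' "$sample" | queued_job_ids tgray26   # prints: 138.cbcbtorque
```

After checking the list, pass the IDs to qdel, which accepts one or more job IDs on its command line.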
 
===pbsnodes===
 
To find out what resources and nodes are available in the Torque cluster you can run pbsnodes.  It will give you back a detailed list of the nodes and their current status.
 
For example,
 
<blockquote><pre>
$ pbsnodes
redbud.umiacs.umd.edu
    state = free
    np = 32
    ntype = cluster
    status = rectime=1343744604,varattr=,jobs=,state=free,netload=384208495,gres=,loadave=0.02,ncpus=64,
              physmem=528633432kb,availmem=530365780kb,totmem=530730576kb,idletime=489899,nusers=0,nsessions=? 0,
              sessions=? 0,uname=Linux redbud.umiacs.umd.edu 2.6.18-308.11.1.el5 #1 SMP Fri Jun 15 15:41:53 EDT 2012 x86_64,opsys=linux
    gpus = 0
 
beech.umiacs.umd.edu
    state = free
    np = 2
    ntype = cluster
    status = rectime=1343744577,varattr=,jobs=,state=free,netload=425438230,gres=,loadave=0.00,ncpus=2,
              physmem=7154944kb,availmem=8960412kb,totmem=9252088kb,idletime=49,nusers=0,nsessions=? 0,
              sessions=? 0,uname=Linux beech.umiacs.umd.edu 2.6.18-308.11.1.el5 #1 SMP Fri Jun 15 15:41:53 EDT 2012 x86_64,opsys=linux
    gpus = 0
</pre></blockquote>
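The full pbsnodes listing is verbose, so a short filter can reduce it to the nodes that are free and their np value. This is a sketch that assumes the layout shown above, where each node name sits alone on an unindented line and is followed by indented <tt>key = value</tt> lines with state before np. <tt>free_nodes</tt> is a hypothetical helper, and the sample is an abbreviated copy of the output above with the status lines left out.

```shell
# free_nodes: read pbsnodes output on stdin, print "NODE NP" for free nodes.
free_nodes() {
    awk '
        NF == 1 && /^[^ ]/ { node = $1; state = "" }   # node name line
        $1 == "state"      { state = $3 }
        $1 == "np"         { if (state == "free") print node, $3 }
    '
}

# Abbreviated sample; on the cluster use: pbsnodes | free_nodes
sample='redbud.umiacs.umd.edu
    state = free
    np = 32
    ntype = cluster
    gpus = 0

beech.umiacs.umd.edu
    state = free
    np = 2
    ntype = cluster
    gpus = 0'

printf '%s\n' "$sample" | free_nodes
# prints:
#   redbud.umiacs.umd.edu 32
#   beech.umiacs.umd.edu 2
```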
===Host Monitoring===
http://ganglia.umiacs.umd.edu/ganglia/?c=cbcb_compute&m=load_one&r=hour&s=by%20name&hc=4&mc=2

Latest revision as of 20:35, 16 November 2018
