CML: Difference between revisions

m (Added CelebA, COCO, and ShapeNetCore.v2 to the table of datasets (as per UMIACS-91198).)
(87 intermediate revisions by 8 users not shown)
The Center for Machine Learning ([https://ml.umd.edu CML]) at the University of Maryland is located within the Institute for Advanced Computer Studies.  The CML has a cluster of computational (CPU/GPU) resources that are available to be scheduled.


=Compute Infrastructure=
<span style="font-size:150%">'''As of the [[MonthlyMaintenanceWindow | August 2023 maintenance window]], all compute nodes have moved into the [[Nexus]] cluster.''' Please see [[Nexus/CML]] for more details.</span>


Each of the UMIACS cluster computational infrastructures is accessed through its submission node.  Once logged into the submission node, users submit jobs through the [[SLURM]] resource manager.  Each cluster in UMIACS has its own set of quality of service (QoS) levels, and selecting one is '''required''' when submitting a job.  Many clusters, including this one, also have other resources, such as GPUs, that must be requested explicitly for a job.

The current submission node(s) for '''CML''' are:
* <code>cmlsub00.umiacs.umd.edu</code>

=Getting Started=
* [[SLURM/JobSubmission | Submitting Jobs]]
* [[SLURM/JobStatus | Checking Job Status]]
* [[Nexus/CML#Storage | Data Storage]]
 
The Center for Machine Learning GPU resources come from a small investment from the base Center funds and a number of investments by individual faculty members.  Going forward, the scheduler's resources are therefore modeled around this concept.  This means there are additional Slurm accounts that users need to be aware of if they are submitting to the non-scavenger partition.
 
== Partitions ==
There are two partitions in the CML [[SLURM]] computational infrastructure.  If you do not specify a partition when submitting your job, it will run in the '''dpart''' partition.
 
* '''dpart''' - This is the default partition and job allocations are guaranteed.
* '''scavenger''' - This is the alternate partition.  It allows jobs longer run times and more resources, but its jobs are preemptable whenever jobs in '''dpart''' are ready to be scheduled.
 
== Accounts ==
The Center has a base account <code>cml</code> with a modest amount of resources available in it (currently 16 GPUs in total).  Faculty who have invested have an additional account that provides the guaranteed GPU resources they invested in.  If you do not specify an account when submitting your job, it will run under the <code>cml</code> account.
 
<pre>
$ sacctmgr show accounts
   Account                Descr                  Org
---------- -------------------- --------------------
       cml                  cml                  cml
   furongh         furong huang                  cml
      john       john dickerson                  cml
      root default root account                 root
 scavenger            scavenger            scavenger
    sfeizi         soheil feizi                  cml
      tomg        tom goldstein                  cml
</pre>
 
You can check your account associations by running the '''show_assoc''' command, which lists the accounts you are associated with.  Please contact staff@umiacs.umd.edu and CC your faculty member if you do not see the appropriate association.
 
<pre>
$ show_assoc
      User    Account  Def Acct  Def QOS                                  QOS
---------- ---------- ---------- --------- ------------------------------------
      tomg      tomg                                      default,high,medium
      tomg        cml                                            default,medium
      tomg  scavenger                                                scavenger
</pre>
 
You can also see the total number of trackable resources (TRES) allowed for each account by running the following command.  Make sure you specify the account that you are looking for.
 
<pre>
$ sacctmgr show assoc account=tomg format=user,account,qos,grptres
      User    Account                  QOS      GrpTRES
---------- ---------- -------------------- -------------
                tomg                        gres/gpu=24
...
 
</pre>
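
As a rough sketch, the GPU cap can be pulled out of a TRES string with a small shell helper.  The function name <code>gpu_limit</code> is hypothetical and is not a UMIACS-provided command, and the commented <code>sacctmgr</code> pipeline assumes that the <code>-n</code> (no header) and <code>-P</code> (parsable) options behave on the submission node as they do in stock Slurm.

<pre>
#!/bin/sh
# gpu_limit (hypothetical helper): read TRES strings such as
# "cpu=8,mem=64G,gres/gpu=2" on stdin and print the first GPU count found.
gpu_limit() {
    sed -n 's/.*gres\/gpu=\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Sample input copied from the GrpTRES column shown above.
echo "gres/gpu=24" | gpu_limit

# On the submission node the input could come from sacctmgr instead:
#   sacctmgr -n -P show assoc account=tomg format=grptres | gpu_limit
</pre>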
 
== QoS ==
CML currently has 3 QoS for the '''dpart''' partition and 1 QoS for the '''scavenger''' partition.  You are '''required''' to specify a QoS when submitting your job.  The QoS differ in their maximum wall time, the total number of jobs you can have running at once, and the maximum trackable resources (TRES) per job.  The scavenger QoS has one additional constraint: a limit on the total TRES per user (across multiple jobs).
 
<pre>
# show_qos
      Name    MaxWall MaxJobs                        MaxTRES    MaxTRESPU
---------- ----------- ------- ------------------------------ -------------
    medium  3-00:00:00      1      cpu=8,mem=64G,gres/gpu=2
  default  7-00:00:00      2      cpu=4,mem=32G,gres/gpu=1
      high  1-12:00:00      1    cpu=16,mem=128G,gres/gpu=4
scavenger  3-00:00:00                                          gres/gpu=36
</pre>
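
The table above can be read as a simple lookup.  The following is a minimal sketch, with the per-job GPU limits hard-coded from the <code>show_qos</code> output shown here (they may change, and your account must also be associated with the QoS).  The function name <code>pick_qos</code> is hypothetical.  It picks the smallest QoS that fits, since in this table the QoS with fewer GPUs also have the longer maximum wall time.

<pre>
#!/bin/sh
# pick_qos (hypothetical helper): print the smallest dpart QoS whose
# per-job GPU limit covers the requested number of GPUs.
# Limits are copied from the show_qos table: default=1, medium=2, high=4.
pick_qos() {
    if [ "$1" -le 1 ]; then
        echo default
    elif [ "$1" -le 2 ]; then
        echo medium
    elif [ "$1" -le 4 ]; then
        echo high
    else
        echo "no dpart QoS allows $1 GPUs in a single job" >&2
        return 1
    fi
}

pick_qos 2
</pre>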
 
== GPUs ==
Jobs that require GPU resources need to explicitly request them in their job submission.  This is done through Generic Resource Scheduling (GRES).  Currently all nodes in the cluster are homogeneous; however, this may not be the case in the future.  Users may use the most generic identifier, in this case '''gpu''', followed by a colon and a number, to select GPUs without explicitly naming the type of GPU (e.g. <code>--gres=gpu:4</code> for 4 GPUs).
 
<pre>
$ sinfo -o "%15N %10c %10m  %25f %25G"
NODELIST        CPUS      MEMORY      AVAIL_FEATURES            GRES
cml[00-09]      32        1+          (null)                    gpu:rtx2080ti:8
</pre>
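
If the cluster does become heterogeneous, Slurm also accepts a typed request of the form <code>gpu:TYPE:COUNT</code>.  The sketch below only builds the flag text for both forms.  The helper name <code>gres_flag</code> is hypothetical, and the type name <code>rtx2080ti</code> is taken from the <code>sinfo</code> output above.

<pre>
#!/bin/sh
# gres_flag (hypothetical helper): print a --gres flag.
# Usage: gres_flag COUNT [TYPE]
gres_flag() {
    if [ -n "${2:-}" ]; then
        echo "--gres=gpu:$2:$1"   # typed request, e.g. gpu:rtx2080ti:2
    else
        echo "--gres=gpu:$1"      # generic request, any GPU type
    fi
}

gres_flag 4
gres_flag 2 rtx2080ti
</pre>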
 
== Job Submission and Management ==
Users should review our [[SLURM]] [[SLURM/JobSubmission | job submission]] and [[SLURM/JobStatus | job management]] documentation. 
 
A very quick way to get an interactive shell is to run the following on the submission node.  This will allocate 1 GPU and 16GB of memory (system RAM) in the <code>default</code> QoS for a maximum of 4 hours.  If the job exceeds either of these limits (the memory allocation or the maximum time), it will be terminated immediately.
 
<pre>
srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash
</pre>
 
<pre>
[derek@cmlsub00:~ ] $ srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash
[derek@cml00:~ ] $ nvidia-smi -L
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-20846848-e66d-866c-ecbe-89f2623f3b9a)
</pre>
 
If you are going to run in a faculty account instead of the default <code>cml</code> account you will need to specify the <code>--account=</code> flag.
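
As an illustration, the wrapper below assembles such a command for a given account, QoS and GPU count.  It is a sketch only: <code>cml_shell</code> and <code>DRY_RUN</code> are hypothetical names, the <code>tomg</code>/<code>high</code> pairing is taken from the <code>show_assoc</code> output above, and using the <code>scavenger</code> account and QoS together with the <code>scavenger</code> partition is an assumption based on that same output.

<pre>
#!/bin/sh
# cml_shell (hypothetical helper): start an interactive shell.
# Usage: cml_shell ACCOUNT QOS GPUS [PARTITION]
# With DRY_RUN=1 the command is printed instead of being run.
cml_shell() {
    # Rebuild "$@" as the full srun command line; $1..$4 are expanded
    # before "set" replaces them.
    set -- srun --pty --partition="${4:-dpart}" --account="$1" \
        --qos="$2" --gres="gpu:$3" --time=04:00:00 bash
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Print the command for 4 GPUs in a faculty account.
DRY_RUN=1 cml_shell tomg high 4
# Print the command for a preemptable job in the scavenger partition.
DRY_RUN=1 cml_shell scavenger scavenger 1 scavenger
</pre>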
 
=Data Storage=
 
Until the final storage investment arrives, we have made a temporary allocation of storage available.  This section is subject to change.  There are 3 types of storage available to users in the CML: home directories, project directories, and scratch directories.
 
== Home Directories ==
Home directories in the CML computational infrastructure are available from the Institute's [[NFShomes]] as <code>/nfshomes/USERNAME</code> where USERNAME is your username.  These home directories have very limited storage and are intended for your personal files, configuration and source code.  Your home directory is '''not''' intended for data sets or other large scale data holdings.  Users are encouraged to utilize our [[GitLab]] infrastructure to host their code repositories.
 
'''NOTE''': To check your quota on this directory you will need to use the <code>quota -s</code> command.
 
Your home directory data is fully protected: it has snapshots and is backed up nightly.
 
== Project Directories ==
 
Users within the CML compute infrastructure can request project-based allocations of up to 2TB for up to 120 days from [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu] with approval from a CML faculty member and the director.  These allocations will be available from '''/fs/cml-projects''' under a name that you provide when you request the allocation.  Once the allocation period is over, the user will be contacted and given a window of opportunity to clean and secure their data before staff remove the allocation.
 
This data is backed up nightly.
 
== Scratch Directories ==
 
There are two types of scratch directories in the CML compute infrastructure: network and local scratch directories.  Scratch data has no data protection: there are no snapshots and the data is not backed up.
 
=== Network Scratch Directory===
Users granted access to the CML compute infrastructure are each allocated '''400GB''' of network attached scratch.  This is available as <code>/cmlscratch/USERNAME</code> where USERNAME is your username.  This directory is '''automounted''', so you will need to <code>cd</code> into the directory or specify a fully qualified file path to access it.
 
Users may request an additional allocation of scratch space up to '''800GB''' by contacting [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].
 
=== Local Scratch Directory===
Each computational node that a user can schedule compute jobs on has one or more local scratch directories.  These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc.  These are almost always more performant than any other storage available to the job.  However, users must stage their data in within the confines of their job and stage it out before the end of their job.
 
These local scratch directories will have a tmpwatch job which will '''delete unmodified data after 120 days'''.  Please make sure you secure any data you write to these directories at the end of your job.
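
The stage-in and stage-out pattern described above might look like the following sketch inside a job.  Everything here is an assumption made for illustration: the helper name <code>run_staged</code>, the <code>SCRATCH_ROOT</code> override (which defaults to <code>/scratch0</code>), and the <code>in</code>/<code>out</code> layout of the working directory.

<pre>
#!/bin/sh
# run_staged (hypothetical helper): copy input data to node-local scratch,
# run a command there, then copy its results back out.
# Usage: run_staged SRC_DIR DEST_DIR COMMAND [ARGS...]
# The command runs in a working directory containing in/ (a copy of SRC_DIR)
# and out/ (copied to DEST_DIR afterwards).
run_staged() {
    src="$1"; dest="$2"; shift 2
    work="${SCRATCH_ROOT:-/scratch0}/${USER:-$(id -un)}/job_${SLURM_JOB_ID:-$$}"
    mkdir -p "$work/in" "$work/out" || return 1
    cp -r "$src/." "$work/in/" || return 1        # stage in
    ( cd "$work" && "$@" )                         # run the workload
    status=$?
    if mkdir -p "$dest" && cp -r "$work/out/." "$dest/"; then
        rm -rf "$work"                             # stage out succeeded, clean up
    else
        echo "stage-out failed; results left in $work" >&2
        status=1
    fi
    return "$status"
}

# Example (paths are placeholders):
#   run_staged /cmlscratch/$USER/data /cmlscratch/$USER/results \
#       sh -c 'ls in > out/listing.txt'
</pre>

Removing the working directory only after a successful copy keeps results recoverable if stage-out fails, at the cost of leaving data on local scratch until it is cleaned up by hand or by tmpwatch.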
 
== Data sets ==
 
The following data sets are available read only for the Center.  If there are other data sets that you would like to see curated and available please contact [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].
 
{| class="wikitable"
! Data set
! Path
|-
| CelebA
| /fs/cml-datasets/CelebA
|-
| COCO
| /fs/cml-datasets/coco
|-
| ImageNet ILSVRC2012
| /fs/cml-datasets/ImageNet/ILSVRC2012
|-
| roberta
| /fs/cml-datasets/roberta
|-
| ShapeNetCore.v2
| /fs/cml-datasets/ShapeNetCore.v2
|}

Latest revision as of 22:31, 17 August 2023
