CML: Difference between revisions

From UMIACS
 
(7 intermediate revisions by the same user not shown)
Line 1: Line 1:
The Center for Machine Learning ([https://ml.umd.edu CML]) at the University of Maryland is located within the Institute for Advanced Computer Studies.  The CML has a cluster of computational (CPU/GPU) resources that are available to be scheduled.


<span style="font-size:150%">'''Please note that during the [[MonthlyMaintenanceWindow | August 2023 maintenance window]], all compute nodes will move into the [[Nexus]] cluster.''' Please see [[Nexus/CML]] for more details.</span>
<span style="font-size:150%">'''As of the [[MonthlyMaintenanceWindow | August 2023 maintenance window]], all compute nodes have moved into the [[Nexus]] cluster.''' Please see [[Nexus/CML]] for more details.</span>


=Compute Infrastructure=
=Getting Started=
Each of UMIACS' cluster computational infrastructures is accessed through a submission node.  Once logged into the submission node, users submit jobs through the [[SLURM]] resource manager.  Each cluster in UMIACS defines its own set of Quality of Service (QoS) levels, and one '''must''' be selected when submitting a job.  Many clusters, including this one, also have other resources, such as GPUs, that must be requested explicitly for a job.
* [[SLURM/JobSubmission | Submitting Jobs]]
 
* [[SLURM/JobStatus | Checking Job Status]]
The current submission node(s) for '''CML''' are:
* [[Nexus/CML#Storage | Data Storage]]
* <code>cmlsub00.umiacs.umd.edu</code>
 
The Center for Machine Learning GPU resources were funded by a small investment from the base Center funds and a number of investments by individual faculty members.  The scheduler's resources are modeled around this concept, so users submitting outside the scavenger partition need to be aware of additional Slurm accounts.
 
==Partitions==
There are three partitions to the CML [[SLURM]] computational infrastructure.  If you do not specify a partition when submitting your job, you will receive the '''dpart''' partition.
 
* '''dpart''' - This is the default partition. Job allocations are guaranteed.
* '''scavenger''' - This is the alternate partition. It allows jobs longer run times and more resources, but its jobs are preemptable when jobs in other partitions are ready to be scheduled.
* '''cpu''' - This partition is for CPU focused jobs. Job allocations are guaranteed.
 
==Accounts==
The Center has a base SLURM account <code>cml</code>, which provides a modest number of guaranteed billing resources to all cluster users at any given time.  Faculty who have invested in the cluster have an additional account, available to the users they sponsor, which provides guaranteed billing resources corresponding to the amount that they invested.  If you do not specify an account when submitting your job, the '''cml''' account will be used.
 
<pre>
$ sacctmgr show accounts
   Account                Descr                  Org
---------- -------------------- --------------------
   abhinav  abhinav shrivastava                  cml
       cml                  cml                  cml
   furongh         furong huang                  cml
  hajiagha  mohammad hajiaghayi                  cml
      john       john dickerson                  cml
    ramani    ramani duraiswami                  cml
      root default root account                 root
 scavenger            scavenger            scavenger
    sfeizi         soheil feizi                  cml
   tokekar       pratap tokekar                  cml
      tomg        tom goldstein                  cml
</pre>
 
You can check which accounts you are associated with by running the '''show_assoc''' command.  Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association.
 
<pre>
$ show_assoc
      User    Account   Def Acct   Def QOS                                  QOS
---------- ---------- ---------- --------- ------------------------------------
      tomg       tomg                                       default,high,medium
      tomg        cml                                        cpu,default,medium
      tomg  scavenger                                                 scavenger
</pre>
 
You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command, substituting the account you want to inspect.
 
<pre>
$ sacctmgr show assoc account=tomg format=user,account,qos,grptres
      User    Account                  QOS       GrpTRES
---------- ---------- -------------------- -------------
                 tomg                       billing=8107
</pre>
 
==QoS==
CML currently has 5 QoS for the '''dpart''' partition (though <code>high_long</code> and <code>very_high</code> may not be available to all faculty accounts), 1 QoS for the '''scavenger''' partition, and 1 QoS for the '''cpu''' partition.  If you do not specify a QoS when submitting your job, you will receive the '''default''' QoS.  Each QoS sets its own maximum wall time, maximum number of jobs running at once, and maximum trackable resources (TRES) per job.  The scavenger QoS adds one more constraint: a limit on the total TRES per user (across multiple jobs).
 
<pre>
$ show_qos
        Name     MaxWall MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES
------------ ----------- ------- ------------------------------ ------------------------------ --------------------
      medium  3-00:00:00       2       cpu=8,gres/gpu=2,mem=64G
     default  7-00:00:00       2       cpu=4,gres/gpu=1,mem=32G
        high  1-12:00:00       2     cpu=16,gres/gpu=4,mem=128G
   scavenger  3-00:00:00                                                           gres/gpu=24
      normal
         cpu  7-00:00:00       8
   very_high  1-12:00:00       8     cpu=32,gres/gpu=8,mem=256G                    gres/gpu=12
   high_long 14-00:00:00       8              cpu=32,gres/gpu=8                     gres/gpu=8
</pre>
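
As a rough illustration of how these limits interact, the sketch below picks the smallest '''dpart''' QoS whose per-job GPU limit covers a request.  The helper name <code>pick_qos</code> is hypothetical, and the thresholds are copied from the <code>show_qos</code> output above, so treat them as an assumption that may not match the live cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the smallest dpart QoS whose per-job GPU limit
# (MaxTRES gres/gpu in the show_qos listing above) covers the request.
# The thresholds are copied from that listing and may be out of date.
pick_qos() {
  local gpus="$1"
  if   [ "$gpus" -le 1 ]; then echo default
  elif [ "$gpus" -le 2 ]; then echo medium
  elif [ "$gpus" -le 4 ]; then echo high
  elif [ "$gpus" -le 8 ]; then echo very_high   # may not be available to every account
  else
    echo "no QoS in the listing allows $gpus GPUs in one job" >&2
    return 1
  fi
}

pick_qos 3   # prints: high
```

Note that a larger QoS trades wall time for resources: in the listing above, '''high''' allows 4 GPUs but only 1-12:00:00 of wall time, versus 7-00:00:00 for '''default'''.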
 
==GPUs==
Jobs that require GPU resources need to explicitly request the resources within their job submission.  This is done through Generic Resource Scheduling (GRES).  Users may use the most generic identifier (in this case '''gpu'''), a colon, and a number to request GPUs without naming a specific type (e.g. <code>--gres=gpu:4</code> for 4 GPUs of any type).
 
<pre>
$ sinfo -o "%20N %10c %10m %25f %40G"
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES
cmlgrad[02,05]       32         385421     Xeon,4216                 gpu:rtx2080ti:7,gpu:rtx3070:1
cml[00-11,13-16],cml 32         353924+    Xeon,4216                 gpu:rtx2080ti:8
cmlcpu[01-04]        20         386675     Xeon,E5-2660              (null)
cmlcpu[00,06-07]     24         386675+    Xeon,E5-2680              (null)
cml12                32         385429     Xeon,4216                 gpu:rtx2080ti:7,gpu:rtxa4000:1
cml[17-29]           32         257654     Zen,EPYC-7282             gpu:rtxa4000:8
</pre>
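
The GRES column above also shows typed GPU resources (for example <code>rtxa4000</code>).  In addition to the untyped form, Slurm accepts a typed request of the form <code>gpu:TYPE:COUNT</code>.  The following is a minimal sketch that only builds the flag string; <code>gres_flag</code> is a hypothetical helper, not a UMIACS or Slurm command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a --gres flag for COUNT GPUs, optionally of a
# specific TYPE as listed in the GRES column of the sinfo output above.
gres_flag() {
  local count="$1" type="${2:-}"
  if [ -n "$type" ]; then
    printf '%s\n' "--gres=gpu:${type}:${count}"
  else
    printf '%s\n' "--gres=gpu:${count}"
  fi
}

gres_flag 4            # prints: --gres=gpu:4
gres_flag 2 rtxa4000   # prints: --gres=gpu:rtxa4000:2
```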
 
==Job Submission and Management==
Users should review our [[SLURM]] [[SLURM/JobSubmission | job submission]] and [[SLURM/JobStatus | job management]] documentation. 
 
For a very quick start, run the following on the submission node to get an interactive shell.  This allocates 1 GPU and 16GB of memory (system RAM) in the '''default''' QoS for a maximum of 4 hours.  If the job exceeds either the memory allocation or the maximum time, it will be terminated immediately.
 
<pre>
srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash
</pre>
 
<pre>
[username@cmlsub00:~ ] $ srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash
[username@cml00:~ ] $ nvidia-smi -L
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-20846848-e66d-866c-ecbe-89f2623f3b9a)
</pre>
 
If you are going to run in a faculty account instead of the default <code>cml</code> account you will need to specify the <code>--account=</code> flag.
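
The same options can also be placed in a batch script as <code>#SBATCH</code> directives.  The sketch below only writes such a script to a temporary location; it does not submit anything.  The file name, and the choice of the <code>cml</code> account with the '''dpart''' partition, are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: write a batch script that mirrors the interactive srun example.
# The output path is a hypothetical name chosen for this example.
job_script="${TMPDIR:-/tmp}/cml_gpu_job.sh"

cat > "$job_script" <<'EOF'
#!/bin/bash
# Replace cml with a faculty account if you are associated with one.
#SBATCH --account=cml
#SBATCH --partition=dpart
#SBATCH --qos=default
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=04:00:00

# List the GPU(s) allocated to the job.
nvidia-smi -L
EOF

echo "wrote $job_script"
```

On the submission node, the resulting file could then be submitted with <code>sbatch</code>.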
 
The following is a quick example of running an interactive job in the cpu partition, which uses the default account <code>cml</code>.
<pre>
-bash-4.2$ srun --partition=cpu --qos=cpu bash -c 'echo "Hello World from" `hostname`'
</pre>
 
=Data Storage=
There are 3 types of user storage available to users in the CML:
* Home directories
* Project directories
* Scratch directories
 
There are also 2 types of read-only storage available for common use among users in the CML:
* Dataset directories
* Model directories
 
==Home Directories==
Home directories in the CML computational infrastructure are available from the Institute's [[NFShomes]] as <code>/nfshomes/USERNAME</code>, where USERNAME is your username.  These home directories have very limited storage (20GB, cannot be increased) and are intended for your personal files, configuration, and source code.  Your home directory is '''not''' intended for data sets or other large-scale data holdings.  You are encouraged to use our [[GitLab]] infrastructure to host your code repositories.
 
'''NOTE''': To check your quota on this directory you will need to use the <code>quota -s</code> command.
 
Your home directory data is fully protected: it has [[Snapshots | snapshots]] and is [[NightlyBackups | backed up nightly]].
 
==Project Directories==
You can request project-based allocations of up to 6TB for up to 120 days with approval from a CML faculty member and the director of CML.
 
To request an allocation, please [[HelpDesk | contact staff]] with your account sponsor involved in the conversation.  Please include the following details:
* Project Name (short)
* Description
* Size (1TB, 2TB, etc.)
* Length in days (30 days, 90 days, etc.)
* Other user(s) that need to access the allocation, if any
 
These allocations will be available from '''/fs/cml-projects''' under a name that you provide when you request the allocation.  Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 120 days (requires re-approval from a CML faculty member and the director of CML).  If you do not want to renew or do not get approval for renewal, you will need to relocate all desired data within 14 days of the end of the allocation period.  Staff will then remove the allocation.
 
This data is backed up nightly.
 
==Scratch Directories==
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the CML compute infrastructure:
* Network scratch directory
* Local scratch directories
 
===Network Scratch Directory===
You are allocated 400GB of scratch space via NFS from <code>/cmlscratch/$username</code>.  '''It is not backed up or protected in any way.'''  This directory is '''automounted''', so you will need to <code>cd</code> into the directory or specify a fully qualified file path to access it.
 
You may request a permanent increase of up to 800GB total space without any faculty approval by [[HelpDesk | contacting staff]].  If you need space beyond 800GB, you will need faculty approval and/or a project directory. Space increases beyond 800GB also have a maximum request period of 120 days (as with project directories), after which they will need to be renewed with re-approval from a CML faculty member and the director of CML.
 
This file system is available on all submission, data management, and computational nodes within the cluster.
 
===Local Scratch Directories===
Each computational node that you can schedule compute jobs on has one or more local scratch directories.  These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc.  These are almost always more performant than any other storage available to the job.  However, you must stage data to these directories within the confines of your jobs and stage the data out before the end of your jobs.
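
One way to follow this stage-in and stage-out pattern is sketched below.  The function name <code>run_staged</code>, the default scratch root <code>/scratch0</code>, and the placeholder computation are assumptions; substitute your own paths and workload.

```shell
#!/usr/bin/env bash
# Hypothetical helper: stage data onto local scratch, compute, stage results out.
# Usage: run_staged SRC_DIR DEST_DIR [SCRATCH_ROOT]
run_staged() {
  local src="$1" dest="$2" scratch_root="${3:-/scratch0}"
  local work

  # Create a private working directory under the local scratch root.
  work="$(mktemp -d "$scratch_root/${USER:-job}.XXXXXX")" || return 1
  mkdir -p "$work/in" "$work/out"

  # Stage in: copy the input data onto the node-local disk.
  if ! cp -a "$src/." "$work/in/"; then
    rm -rf "$work"
    return 1
  fi

  # Placeholder for the real computation: read from in/, write to out/.
  ls "$work/in" > "$work/out/manifest.txt"

  # Stage out: copy results to persistent storage before the job ends,
  # and only remove the local copy once that has succeeded.
  if mkdir -p "$dest" && cp -a "$work/out/." "$dest/"; then
    rm -rf "$work"
  else
    echo "stage-out failed; results left in $work" >&2
    return 1
  fi
}
```

Inside a job, this could be called with an input directory and a results directory on persistent storage as its two arguments.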
 
These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month during our monthly maintenance windows.  Again, please make sure you secure any data you write to these directories at the end of your job.
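
To see which files may be at risk, you can list files by last access time.  This sketch assumes the cleanup is based on access time (atime), as the wording above suggests; <code>stale_files</code> is a hypothetical helper, and the exact cutoff used by the maintenance job may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list regular files under DIR whose last access
# was more than DAYS days ago (default 90).
stale_files() {
  local dir="$1" days="${2:-90}"
  find "$dir" -type f -atime +"$days" -print
}
```

For example, <code>stale_files /scratch0</code> would list candidate files on the first local scratch directory of the node you are on.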
 
==Datasets==
We have read-only dataset storage available at <code>/fs/cml-datasets</code>.  If there are datasets that you would like to see curated and available, please see [[Datasets | this page]].
 
The following is the list of datasets available:
{| class="wikitable"
! Dataset
! Path
|-
| CelebA
| /fs/cml-datasets/CelebA
|-
| CelebA-HQ
| /fs/cml-datasets/CelebA-HQ
|-
| CelebAMask-HQ
| /fs/cml-datasets/CelebAMask-HQ
|-
| Charades
| /fs/cml-datasets/Charades
|-
| Cityscapes
| /fs/cml-datasets/cityscapes
|-
| COCO
| /fs/cml-datasets/coco
|-
| Diversity in Faces [1]
| /fs/cml-datasets/diversity_in_faces
|-
| FFHQ
| /fs/cml-datasets/FFHQ
|-
| ImageNet ILSVRC2012
| /fs/cml-datasets/ImageNet/ILSVRC2012
|-
| LFW
| /fs/cml-datasets/facial_test_data
|-
| LibriSpeech
| /fs/cml-datasets/LibriSpeech
|-
| LSUN
| /fs/cml-datasets/LSUN
|-
| MAG240M
| /fs/cml-datasets/OGB/MAG240M
|-
| MegaFace
| /fs/cml-datasets/megaface
|-
| MS-Celeb-1M
| /fs/cml-datasets/MS_Celeb_aligned_112
|-
| OC20
| /fs/cml-datasets/OC20
|-
| ogbn-papers100M
| /fs/cml-datasets/OGB/ogbn-papers100M
|-
| roberta
| /fs/cml-datasets/roberta
|-
| Salient ImageNet
| /fs/cml-datasets/Salient-ImageNet
|-
| ShapeNetCore.v2
| /fs/cml-datasets/ShapeNetCore.v2
|-
| Tiny ImageNet
| /fs/cml-datasets/tiny_imagenet
|-
| WikiKG90M
| /fs/cml-datasets/OGB/WikiKG90M
|}
 
[1] - This dataset has restricted access. Please [[HelpDesk | contact staff]] if you are looking to use this dataset.
 
==Models==
We have read-only model storage available at <code>/fs/cml-models</code>.  If there are models that you would like to see downloaded and made available, please see [[Datasets | this page]].

Latest revision as of 22:31, 17 August 2023

The Center for Machine Learning (CML) at the University of Maryland is located within the Institute for Advanced Computer Studies. The CML has a cluster of computational (CPU/GPU) resources that are available to be scheduled.

As of the August 2023 maintenance window, all compute nodes have moved into the Nexus cluster. Please see Nexus/CML for more details.

Getting Started