CML

The Center for Machine Learning (CML) at the University of Maryland is located within the Institute for Advanced Computer Studies. The CML has a cluster of computational (CPU/GPU) resources that are available to be scheduled.

Compute Infrastructure

Each of UMIACS' computational cluster infrastructures is accessed through a submission node. Users will need to submit jobs through the SLURM resource manager once they have logged into the submission node. Each cluster in UMIACS has different quality of service (QoS) options, one of which is required to be selected upon submission of a job. Many clusters, including this one, also have other resources such as GPUs that need to be requested for a job.

The current submission node(s) for CML are:

  • cmlsub00.umiacs.umd.edu
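
You can log into the submission node with SSH, for example (username is your UMIACS username):

$ ssh username@cmlsub00.umiacs.umd.edu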

The Center for Machine Learning's GPU resources come from a small investment of base Center funds and a number of investments by individual faculty members. The scheduler's resources are modeled around this funding structure. This means there are additional Slurm accounts that users will need to be aware of if they are submitting to a non-scavenger partition.

Partitions

There are three partitions in the CML SLURM computational infrastructure. If you do not specify a partition when submitting your job, you will receive the dpart partition.

  • dpart - This is the default partition and job allocations are guaranteed.
  • scavenger - This is the alternate partition, which allows jobs longer run times and more resources but is preemptible: jobs in it are preempted when jobs in the dpart partition are ready to be scheduled (see the sketch after this list).
  • cpu - This partition is for CPU focused jobs and the job allocations are guaranteed.
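
For example, a sketch of an interactive scavenger submission, using the scavenger account and QoS covered in the sections below (resource values are illustrative):

$ srun --pty --partition=scavenger --account=scavenger --qos=scavenger --gres=gpu:1 --time=1-00:00:00 bash

Because scavenger jobs can be preempted at any time, long-running work should checkpoint its progress.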

Accounts

The Center has a base account cml, which has a modest amount of resources (currently 16 GPUs in total) available in it. Other faculty who have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed GPU resources corresponding to the amount that they invested. If you do not specify an account when submitting your job, you will receive the cml account.

# sacctmgr show accounts
   Account                Descr                  Org
---------- -------------------- --------------------
   abhinav  abhinav shrivastava                  cml
       cml                  cml                  cml
   furongh         furong huang                  cml
      john       john dickerson                  cml
      root default root account                 root
 scavenger            scavenger            scavenger
    sfeizi         soheil feizi                  cml
      tomg        tom goldstein                  cml

You can check your account associations by running the show_assoc command to see the accounts you are associated with. Please contact staff@umiacs.umd.edu and CC your faculty member if you do not see the appropriate association.

$ show_assoc
      User    Account   Def Acct   Def QOS                                  QOS
---------- ---------- ---------- --------- ------------------------------------
      tomg       tomg                                       default,high,medium
      tomg        cml                                        cpu,default,medium
      tomg  scavenger                                                 scavenger

You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command. Make sure to substitute the account you are interested in.

$ sacctmgr show assoc account=tomg format=user,account,qos,grptres
      User    Account                  QOS       GrpTRES
---------- ---------- -------------------- -------------
                 tomg                        gres/gpu=48

QoS

CML currently has 4 QoS for the dpart partition (though very_high is only available on a single faculty member's account), 1 QoS for the scavenger partition, and 1 QoS for the cpu partition. You are required to specify a QoS when submitting your job. The important differences between QoS are the maximum wall time, the total number of jobs that can run at once, and the maximum number of trackable resources (TRES) per job. The scavenger QoS imposes one additional constraint: a limit on the total TRES per user (across multiple jobs).

# show_qos
      Name     MaxWall MaxJobs                        MaxTRES     MaxTRESPU   Priority
---------- ----------- ------- ------------------------------ ------------- ----------
    medium  3-00:00:00       1       cpu=8,gres/gpu=2,mem=64G                        0
   default  7-00:00:00       2       cpu=4,gres/gpu=1,mem=32G                        0
      high  1-12:00:00       2     cpu=16,gres/gpu=4,mem=128G                        0
 scavenger  3-00:00:00                                          gres/gpu=24          0
    normal                                                                           0
       cpu  1-00:00:00       1                                                       0
 very_high  1-12:00:00       8     cpu=32,gres/gpu=8,mem=256G   gres/gpu=12          0
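
For example, based on the limits above, a job that needs 4 GPUs for up to a day could use the high QoS (a sketch; values must stay within the QoS limits shown):

$ srun --pty --qos=high --gres=gpu:4 --mem=128G --time=1-00:00:00 bash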

GPUs

Jobs that require GPU resources need to explicitly request the resources within their job submission. This is done through Generic Resource Scheduling (GRES). Currently all nodes in the cluster are homogeneous; however, in the future this may not be the case. Users may use the most generic identifier (in this case gpu), a colon, and a number to select GPUs without explicitly naming the type (i.e. --gres=gpu:4 for 4 GPUs).

$ sinfo -o "%15N %10c %10m  %25f %25G"
NODELIST        CPUS       MEMORY      AVAIL_FEATURES            GRES
cml[00-09]      32         1+          (null)                    gpu:rtx2080ti:8
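
Since all current nodes carry rtx2080ti GPUs, you can also request GPUs by the type shown above (a sketch; the medium QoS allows up to 2 GPUs):

$ srun --pty --qos=medium --gres=gpu:rtx2080ti:2 --mem=64G --time=08:00:00 bash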

Job Submission and Management

Users should review our SLURM job submission and job management documentation.

A very quick way to get an interactive shell is as follows, run on the submission node. This will allocate 1 GPU with 16GB of memory (system RAM) under the default QoS for a maximum of 4 hours. If the job exceeds either limit (the memory allocation or the maximum time), it will be terminated immediately.

[username@cmlsub00:~ ] $ srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash
[username@cml00:~ ] $ nvidia-smi -L
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-20846848-e66d-866c-ecbe-89f2623f3b9a)

If you are going to run in a faculty account instead of the default cml account, you will need to specify the --account= flag.
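
For example, a sketch using the tomg account from the listings above (substitute the faculty account you are associated with):

$ srun --pty --account=tomg --qos=medium --gres=gpu:2 --mem=32G --time=04:00:00 bash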


A quick example of running a job in the cpu partition. The cpu partition uses the default account cml.

-bash-4.2$ srun --partition=cpu --qos=cpu bash -c 'echo "Hello World from" `hostname`'
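
The same job can also be submitted non-interactively with sbatch. A minimal sketch of a submit script (the file name hello.sh is illustrative):

$ cat hello.sh
#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --qos=cpu
#SBATCH --time=00:10:00

echo "Hello World from" `hostname`

$ sbatch hello.sh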

Data Storage

Until the final storage investment arrives, we have made a temporary allocation of storage available. This section is subject to change. There are 3 types of storage available to users in the CML:

  • Home directories
  • Project directories
  • Scratch directories

Home Directories

Home directories in the CML computational infrastructure are available from the Institute's NFShomes as /nfshomes/USERNAME, where USERNAME is your username. These home directories have very limited storage and are intended for your personal files, configuration, and source code. Your home directory is not intended for data sets or other large-scale data holdings. Users are encouraged to utilize our GitLab infrastructure to host their code repositories.

NOTE: To check your quota on this directory you will need to use the quota -s command.

Your home directory data is fully protected: it has snapshots and is backed up nightly.

Project Directories

Users within the CML compute infrastructure can request project-based allocations of up to 2TB for up to 120 days from staff@umiacs.umd.edu, with approval from a CML faculty member and the director of CML. These allocations will be available under /fs/cml-projects with a name that you provide when you request the allocation. Once the allocation period is over, you will be contacted and given a 14-day window to clean up and secure your data before staff remove the allocation.

This data is backed up nightly.

Scratch Directories

Scratch data has no data protection: it is not snapshotted and is not backed up. There are two types of scratch directories in the CML compute infrastructure:

  • Network scratch directory
  • Local scratch directories

Network Scratch Directory

Users granted access to the CML compute infrastructure are each allocated 400GB of network-attached scratch space. This is available as /cmlscratch/USERNAME, where USERNAME is your username. This directory is automounted, so you will need to cd into it or specify its fully qualified path in order to access it.
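
For example, to change into your scratch directory (triggering the automount):

$ cd /cmlscratch/$USER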

Users may request an additional allocation of scratch space up to a total of 800GB by contacting staff@umiacs.umd.edu.

Local Scratch Directories

Each computational node that a user can schedule compute jobs on has one or more local scratch directories. These are always named /scratch0, /scratch1, etc. They are almost always more performant than any other storage available to the job. However, users must stage their data in within the confines of their job and stage it back out before the job ends.

These local scratch directories have a tmpwatch job that deletes unmodified data after 120 days. Please make sure you secure any data written to these directories by the end of your job.
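
A minimal sketch of a batch script that stages data through local scratch (all paths, file names, and resource requests are illustrative):

$ cat stage_job.sh
#!/bin/bash
#SBATCH --qos=default
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=04:00:00

# stage data in to fast local scratch
mkdir -p /scratch0/$USER
cp -r /cmlscratch/$USER/my_dataset /scratch0/$USER/

# ... run your computation against /scratch0/$USER/my_dataset ...

# stage results back out before the job ends, then clean up
cp -r /scratch0/$USER/results /cmlscratch/$USER/
rm -rf /scratch0/$USER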

Datasets

We have read-only dataset storage available for the Center at /fs/cml-datasets. If there are other datasets that you would like to see curated and available, please see this page.

The following is the list of datasets available:

Dataset                  Path
CelebA                   /fs/cml-datasets/CelebA
CelebA-HQ                /fs/cml-datasets/CelebA-HQ
CelebAMask-HQ            /fs/cml-datasets/CelebAMask-HQ
Charades                 /fs/cml-datasets/Charades
Cityscapes               /fs/cml-datasets/cityscapes
COCO                     /fs/cml-datasets/coco
Diversity in Faces [1]   /fs/cml-datasets/diversity_in_faces
FFHQ                     /fs/cml-datasets/FFHQ
ImageNet ILSVRC2012      /fs/cml-datasets/ImageNet/ILSVRC2012
LFW                      /fs/cml-datasets/facial_test_data
LibriSpeech              /fs/cml-datasets/LibriSpeech
LSUN                     /fs/cml-datasets/LSUN
MAG240M                  /fs/cml-datasets/OGB/MAG240M
MegaFace                 /fs/cml-datasets/megaface
MS-Celeb-1M              /fs/cml-datasets/MS_Celeb_aligned_112
OC20                     /fs/cml-datasets/OC20
ogbn-papers100M          /fs/cml-datasets/OGB/ogbn-papers100M
roberta                  /fs/cml-datasets/roberta
ShapeNetCore.v2          /fs/cml-datasets/ShapeNetCore.v2
Tiny ImageNet            /fs/cml-datasets/tiny_imagenet
WikiKG90M                /fs/cml-datasets/OGB/WikiKG90M

[1] - This dataset has restricted access. Please contact staff if you are looking to use this dataset.