Difference between revisions of "CML"

From UMIACS

Revision as of 18:04, 7 August 2019

The Center for Machine Learning (CML) at the University of Maryland is located within the Institute for Advanced Computer Studies. The CML has a cluster of computational (CPU/GPU) resources that are available to be scheduled.

Compute Infrastructure

Each UMIACS cluster is accessed through its submission node. Once logged into the submission node, users submit jobs through the SLURM resource manager. Each cluster in UMIACS has its own set of quality of service (QoS) levels, one of which needs to be selected when a job is submitted, and many clusters, like this one, have other specific resources such as GPUs that need to be requested for a job.

The current submission node(s) for CML are:

  • cmlsub00.umiacs.umd.edu

Partition

There are two partitions in the CML SLURM computational infrastructure. If you do not specify a partition when submitting your job, it will be placed in the dpart partition.

  • dpart - This is the default partition and job allocations are guaranteed.
  • scavenger - This is the alternate partition that allows jobs longer run times and more resources, but its jobs can be preempted when jobs in the dpart partition are ready to be scheduled.

QoS

CML currently has 3 QoS for the dpart partition and 1 QoS for the scavenger partition. The important point is that each QoS sets a different maximum wall time, a maximum number of jobs running at once, and a maximum amount of trackable resources (TRES) per job. The scavenger QoS adds one more constraint: a limit on the total TRES per user (across multiple jobs).

# show_qos
      Name     MaxWall MaxJobs                        MaxTRES     MaxTRESPU
---------- ----------- ------- ------------------------------ -------------
    medium  3-00:00:00               cpu=8,mem=64G,gres/gpu=2
   default  7-00:00:00               cpu=4,mem=32G,gres/gpu=1
      high  1-12:00:00             cpu=16,mem=128G,gres/gpu=4
 scavenger  3-00:00:00                                          gres/gpu=36
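
As an illustration of reading the table above, the following is a minimal sketch of a helper that picks the smallest dpart QoS whose MaxTRES covers a request. The function name pick_qos is hypothetical and not part of the cluster tooling, and the limits are copied by hand from the show_qos output above, so they would need updating if the cluster configuration changes.

```shell
#!/bin/bash
# Sketch only: pick_qos is a hypothetical helper, and the limits below are
# copied from the show_qos table above (they may be out of date).

# pick_qos CPUS MEM_GB GPUS
# Prints the first dpart QoS (smallest limits first) whose MaxTRES covers
# the request, or returns 1 if no dpart QoS is large enough.
pick_qos() {
    local cpus=$1 mem_gb=$2 gpus=$3
    local name max_cpu max_mem max_gpu
    # Columns: name, max CPUs, max memory in GB, max GPUs.
    while read -r name max_cpu max_mem max_gpu; do
        if [ "$cpus" -le "$max_cpu" ] && [ "$mem_gb" -le "$max_mem" ] \
            && [ "$gpus" -le "$max_gpu" ]; then
            echo "$name"
            return 0
        fi
    done <<'EOF'
default 4 32 1
medium 8 64 2
high 16 128 4
EOF
    echo "no dpart QoS fits cpu=$cpus mem=${mem_gb}G gpu=$gpus" >&2
    return 1
}

# Example: 6 CPUs, 48 GB of memory and 2 GPUs.
pick_qos 6 48 2    # prints: medium
```

The result could then be passed to the standard Slurm --qos option when submitting. Note that in the table the QoS with larger resource limits have shorter maximum wall times (high allows 1-12:00:00, default allows 7-00:00:00), so the smallest QoS that fits also gives the longest run time.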

GPUs

Jobs that require GPU resources need to explicitly request them in their job submission. This is done through Generic Resource Scheduling (GRES). Currently all nodes in the cluster are homogeneous; however, this may not be the case in the future. To request GPUs without naming a specific type, users can use the most generic identifier: gpu, followed by a colon and a number (e.g. --gres gpu:4 for 4 GPUs).

$ sinfo -o "%15N %10c %10m  %25f %25G"
NODELIST        CPUS       MEMORY      AVAIL_FEATURES            GRES
cml[00-09]      32         1+          (null)                    gpu:rtx2080ti:8
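
A GPU request can also be written into a batch script. The following is a minimal sketch, not an official template: the file name gpu_job.sh and the job name are hypothetical, the resource values are chosen to fit the medium QoS limits listed earlier, and it assumes nvidia-smi is installed on the compute nodes. The snippet only writes the script to disk; it does not submit anything.

```shell
#!/bin/bash
# Write a hypothetical batch script to the current directory.
# Nothing is submitted by this snippet.
cat > gpu_job.sh <<'EOF'
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=gpu-example
# Default partition, stated explicitly.
#SBATCH --partition=dpart
# The medium QoS allows up to cpu=8, mem=64G, gres/gpu=2 per job.
#SBATCH --qos=medium
# Generic GPU request: 2 GPUs, without naming the GPU type.
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
# One day, within the 3-day MaxWall of the medium QoS.
#SBATCH --time=1-00:00:00

# List the GPUs visible to the job (assumes nvidia-smi is available).
nvidia-smi
EOF

# On the submission node the script would then be submitted with:
#   sbatch gpu_job.sh
```

If the cluster later gains more than one GPU type, Slurm's name:type:count form (for example gpu:rtx2080ti:2, using the type shown in the sinfo output above) should select a specific type.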

Data Storage

Until the final storage investment arrives, we have made a temporary allocation of storage available. There are 3 types of storage available to users in the CML: home directories, project directories and scratch directories.

Home Directories

Home directories in the CML computational infrastructure are available from the Institute's NFShomes as /nfshomes/USERNAME, where USERNAME is your username. These home directories have very limited storage and are intended for your personal files, configuration and source code. Your home directory is not intended for data sets or other large scale data holdings. Users are encouraged to use our GitLab infrastructure to host their code repositories.

NOTE: To check your quota on this directory you will need to use the quota -s command.

Your home directory data is fully protected: it has snapshots and is backed up nightly.

Project Directories

Users within the CML compute infrastructure can request project-based allocations of up to 1TB for up to 120 days from staff@umiacs.umd.edu, with approval from a CML faculty member and the director. These allocations will be available from /fs/cml-projects under a name that you provide when you request the allocation. Once the allocation period is over, the user will be contacted and given a window of opportunity to clean and secure their data before staff remove the allocation.

This data is backed up nightly.

Scratch Directories

There are two types of scratch directories in the CML compute infrastructure: network and local scratch directories. Scratch data has no data protection: there are no snapshots and the data is not backed up.

Network Scratch Directory

Users granted access to the CML compute infrastructure are each allocated 200GB of network attached scratch. This is available as /cmlscratch/USERNAME, where USERNAME is your username. This directory is automounted, so you will need to cd into the directory or specify a fully qualified file path to access it.

Local Scratch Directory

Each computational node that a user can schedule compute jobs on has one or more local scratch directories. These are always named /scratch0, /scratch1, etc. These are almost always more performant than any other storage available to the job. However, users must stage their data in within the confines of their job and stage the data out before the end of their job.
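
The stage-in and stage-out pattern described above can be sketched as a small shell helper. This is an illustration under stated assumptions, not a supported tool: the function name stage_and_run is hypothetical, and so is the convention that the command reads from an input directory and writes to an output directory inside its working directory.

```shell
#!/bin/bash
# Sketch only: stage_and_run is a hypothetical helper.
#
# stage_and_run SRC_DIR SCRATCH_ROOT DEST_DIR COMMAND [ARGS...]
# Copies SRC_DIR into a private directory under SCRATCH_ROOT, runs COMMAND
# there, copies anything written to ./output into DEST_DIR, then removes
# the private scratch directory.
stage_and_run() {
    local src=$1 scratch_root=$2 dest=$3
    shift 3
    local work status=0

    # Private working directory on local scratch.
    work=$(mktemp -d "$scratch_root/job.XXXXXX") || return 1
    mkdir -p "$work/input" "$work/output" "$dest" || { rm -rf "$work"; return 1; }

    # Stage in: copy the input data onto local scratch.
    cp -R "$src/." "$work/input/" || { rm -rf "$work"; return 1; }

    # Run the command from the scratch working directory.
    ( cd "$work" && "$@" ) || status=$?

    # Stage out: copy results back even if the command failed.
    cp -R "$work/output/." "$dest/" || status=1

    # Clean up local scratch before the job ends.
    rm -rf "$work"
    return "$status"
}

# Hypothetical use inside a job (the paths under /cmlscratch and train.py
# are examples only):
#   stage_and_run "/cmlscratch/$USER/dataset" /scratch0 \
#       "/cmlscratch/$USER/results" python train.py
```

Copying results out before removing the working directory, even when the command fails, keeps partial output available for debugging while still leaving local scratch clean for the next job.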