Nexus/CML: Difference between revisions

From UMIACS
No edit summary
 
(74 intermediate revisions by 5 users not shown)
Line 1: Line 1:
The [[CML]] standalone cluster's compute nodes will fold into [[Nexus]] on Thursday, August 17th, 2023 during the scheduled [[MonthlyMaintenanceWindow | maintenance window]] for August (5-8pm).
The compute nodes from [[CML]]'s previous standalone cluster have folded into [[Nexus]] as of the scheduled [[MonthlyMaintenanceWindow | maintenance window]] for August 2023 (Thursday 08/17/2023, 5-8pm).


The Nexus cluster already has a large pool of compute resources made possible through leftover funding for the [[Iribe | Brendan Iribe Center]]. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].
The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].
 
In addition, the CML cluster's standalone submission node <code>cmlsub00.umiacs.umd.edu</code> will be retired on Thursday, September 21st, 2023 during that month's maintenance window (5-8pm), as it will no longer be able to submit jobs to CML compute nodes after the August maintenance window. Please use <code>nexuscml00.umiacs.umd.edu</code> and <code>nexuscml01.umiacs.umd.edu</code> for any general purpose CML compute needs after this time.
 
Please see the [[#Timeline | Timeline]] section below for concrete dates in chronological order.


Please [[HelpDesk | contact staff]] with any questions or concerns.
Please [[HelpDesk | contact staff]] with any questions or concerns.


==Usage==
==Usage==
The Nexus cluster submission nodes that are allocated to CML are <code>nexuscml00.umiacs.umd.edu</code> and <code>nexuscml01.umiacs.umd.edu</code>. '''You must use these nodes to submit jobs to CML compute nodes after the August maintenance window.''' Submission from <code>cmlsub00.umiacs.umd.edu</code> will no longer work.
You can [[SSH]] to <code>nexuscml.umiacs.umd.edu</code> to log in to a submission node.
 
If you store something in a local directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexuscml00.umiacs.umd.edu</code>
* <code>nexuscml01.umiacs.umd.edu</code>


All partitions, QoSes, and account names from the standalone CML cluster are being moved over to Nexus when the compute nodes move. However, please note that <code>cml-</code> will be prepended to all of the values that were present in the standalone CML cluster to distinguish them from existing values in Nexus. The lone exception is the base account currently named <code>cml</code> in the standalone cluster (will retain same name).
All partitions, QoSes, and account names from the standalone CML cluster have been moved over to Nexus. However, please note that <code>cml-</code> is prepended to all of the values that were present in the standalone CML cluster to distinguish them from existing values in Nexus. The lone exception is the base account that was named <code>cml</code> in the standalone cluster (it is also named just <code>cml</code> in Nexus).


Here are some before/after examples of job submission with various parameters:
Here are some before/after examples of job submission with various parameters:
Line 20: Line 20:
! Nexus cluster submission command
! Nexus cluster submission command
|-
|-
|<code>srun --partition=dpart --qos=medium --account=tomg --gres=gpu:rtxa4000:2 --pty bash</code>
|<code>srun --partition=dpart --qos=medium --account=tomg --gres=gpu:rtx2080ti:2 --pty bash</code>
|<code>srun --partition=cml-dpart --qos=cml-medium --account=cml-tomg --gres=gpu:rtxa4000:2 --pty bash</code>
|<code>srun --partition=cml-dpart --qos=cml-medium --account=cml-tomg --gres=gpu:rtx2080ti:2 --pty bash</code>
|-
|-
|<code>srun --partition=cpu --qos=cpu --pty bash</code>
|<code>srun --partition=cpu --qos=cpu --pty bash</code>
|<code>srun --partition=cml-cpu --qos=cml-cpu --pty bash</code>
|<code>srun --partition=cml-cpu --qos=cml-cpu --account=cml --pty bash</code>
|-
|-
|<code>srun --partition=scavenger --qos=scavenger --account=scavenger --gres=gpu:4 --pty bash</code>
|<code>srun --partition=scavenger --qos=scavenger --account=scavenger --gres=gpu:4 --pty bash</code>
Line 30: Line 30:
|}
|}


CML users (exclusively) can schedule non-interruptible jobs on the moved nodes with these job parameters. Please note that the <code>cml-dpart</code> partition will have a <code>GrpTRES</code> limit of 100% of the available cores/RAM on each set of cml## nodes plus 50% of the available cores/RAM on legacy## nodes, so your job may need to wait if all available cores/RAM (or GPUs) are in use.
CML users (exclusively) can schedule non-interruptible jobs on CML nodes with any non-scavenger job parameters. Please note that the <code>cml-dpart</code> partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on all cml## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a max submission limit of 500 jobs per user simultaneously so as to not overload the cluster. This is codified by the partition QoS named '''cml'''.


Please note that the CML compute nodes will also be added to the institute-wide <code>scavenger</code> partition in Nexus. CML users will still have scavenging priority over these nodes via the <code>cml-scavenger</code> partition (i.e., all <code>cml-</code> queue jobs (other than <code>cml-scavenger</code>) can preempt both <code>cml-scavenger</code> and <code>scavenger</code> queue jobs, and <code>cml-scavenger</code> queue jobs can preempt <code>scavenger</code> queue jobs).
Please note that the CML compute nodes are also in the institute-wide <code>scavenger</code> partition in Nexus. CML users still have scavenging priority over these nodes via the <code>cml-scavenger</code> partition (i.e., all <code>cml-</code> partition jobs (other than <code>cml-scavenger</code>) can preempt both <code>cml-scavenger</code> and <code>scavenger</code> partition jobs, and <code>cml-scavenger</code> partition jobs can preempt <code>scavenger</code> partition jobs).


==Timeline==
==Network==
Each event will be completed within the timeframe specified.
The network infrastructure supporting the CML partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* cml[17-28,30-32]: Two 100GbE links per node, one to each switch in the pair (redundancy).
# One pair of network switches connected to the above pair of network switches via two 100GbE links, one between the first two switches in each pair and one between the second two switches in each pair for redundancy, and to each other via dual 25GbE links for redundancy.
#* cml[00-09]: Two 25GbE links per node, one to each switch in the pair (redundancy).
#* cml[10-16],cmlcpu[00-04,06-07]: Two 10GbE links per node, one to each switch in the pair (redundancy).


{| class="wikitable"
The fileserver hosting all CML [[Nexus/CML#Project_Directories | project]], [[Nexus/CML#Scratch_Directories | scratch]], [[Nexus/CML#Datasets | dataset]], and [[Nexus/CML#Models | model]] allocations also connects to the same pair of switches supporting cml[17-28,30-32] via fourteen 25GbE links, seven to each switch in the pair for redundancy and increased bandwidth.
! Date
! Event
|-
| July 14th 2023
| A single compute node <code>cml03</code> is moved into Nexus so submission can be tested
|-
| [[MonthlyMaintenanceWindow | August 17th 2023, 5-8pm]]
| All standalone CML cluster compute nodes are moved into Nexus in corresponding <code>cml-</code> named partitions
|-
| [[MonthlyMaintenanceWindow | September 21st 2023, 5-8pm]]
| <code>cmlsub00.umiacs.umd.edu</code> is taken offline
|-
|}


==Post-Migration==
For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].
'''The below information will become relevant AFTER 8pm on Thursday, August 17th, 2023.'''


===Partitions===
==Partitions==
There are three partitions available to general CML [[SLURM]] users.  You must specify a partition when submitting your job.
There are three partitions available to general CML [[SLURM]] users.  You must specify a partition when submitting your job.


Line 62: Line 53:
* '''cml-cpu''' - This partition is for CPU focused jobs. Job allocations are guaranteed.
* '''cml-cpu''' - This partition is for CPU focused jobs. Job allocations are guaranteed.


There is one additional partition available solely to Furong's sponsored accounts.
There are two additional partitions available solely to specific faculty members and their sponsored accounts.
 
* '''cml-furongh''' - This partition is for exclusive priority access to Dr. Furong Huang's purchased A6000 node. Job allocations are guaranteed.
* '''cml-zhou''' - This partition is for exclusive priority access to Dr. Tianyi Zhou's purchased nodes. Job allocations are guaranteed.
 
==Accounts==
The Center has a base SLURM account <code>cml</code> which has a modest number of guaranteed billing resources available to all cluster users at any given time.  Other faculty that have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed billing resources corresponding to the amount that they invested.
 
If you do not specify an account when submitting your job, you will receive the '''cml''' account, which only has access to the '''cml-cpu''', '''cml-default''', and '''cml-medium''' QoSes (see below section).


* '''cml-furongh''' - This partition is for exclusive priority access to Furong's purchased A6000 node.
If you need access to a different QoS, or if the '''cml''' account is at its billing limit (see below in this section), please use your faculty sponsor's account if they have one available. However, keep in mind that if your faculty sponsor has their own named partition (see previous section), using the faculty-specific account in the '''cml-dpart''' partition may block access to resources in the faculty-specific partition, since the billing limit for the account is charged regardless of which partition is being used.


===Accounts===
The current faculty accounts are:
The Center has a base SLURM account <code>cml</code> which has a modest number of guaranteed billing resources available to all cluster users at any given time.  Other faculty that have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed billing resources corresponding to the amount that they invested.  If you do not specify an account when submitting your job, you will receive the '''cml''' account.
* cml-abhinav
* cml-cameron
* cml-furongh
* cml-hajiagha
* cml-john
* cml-ramani
* cml-sfeizi
* cml-tokekar
* cml-tomg
* cml-zhou


<pre>
<pre>
$ sacctmgr show accounts
$ sacctmgr show account format=account%20,description%30,organization%10
    Account                 Descr                 Org
            Account                         Descr       Org
------------- -------------------- --------------------
-------------------- ------------------------------ ----------
  cml-abhinav abhinav shrivastava                 cml
                ...                            ...        ...
          cml                 cml                 cml
                cml                            cml        cml
  cml-furongh         furong huang                 cml
        cml-abhinav     cml - abhinav shrivastava       cml
cml-hajiagha mohammad hajiaghayi                 cml
        cml-cameron            cml - maria cameron        cml
    cml-john       john dickerson                 cml
        cml-furongh             cml - furong huang       cml
  cml-ramani   ramani duraiswami                 cml
        cml-hajiagha     cml - mohammad hajiaghayi       cml
        root default root account                root
            cml-john           cml - john dickerson       cml
cml-scavenger           scavenger            scavenger
          cml-ramani       cml - ramani duraiswami       cml
  cml-sfeizi         soheil feizi                 cml
      cml-scavenger               cml - scavenger       cml
  cml-tokekar       pratap tokekar                 cml
          cml-sfeizi             cml - soheil feizi       cml
    cml-tomg       tom goldstein                  cml
        cml-tokekar           cml - pratap tokekar       cml
            cml-tomg           cml - tom goldstein       cml
            cml-zhou              cml - tianyi zhou        cml
                 ...                            ...        ...
</pre>
</pre>


You can check your account associations by running the '''show_assoc''' to see the accounts you are associated with.  Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association.  
Faculty can manage this list of users via our [https://intranet.umiacs.umd.edu/directory/secgroup/ Directory application] in the Security Groups section.  The security group that controls access has the prefix <code>cml_</code> and then the faculty username.  It will also list <code>slurm://nexusctl.umiacs.umd.edu</code> as the associated URI.
 
You can check your account associations by running the '''show_assoc''' command to see the accounts you are associated with.  Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association.  


<pre>
<pre>
$ show_assoc
$ show_assoc
       User       Account   Def Acct  Def QOS                                  QOS
       User         Account MaxJobs      GrpTRES                                                QOS
---------- -------------- ---------- --------- ------------------------------------
---------- ---------------- ------- ------------- --------------------------------------------------
      tomg      cml-tomg                          cml-default,cml-high,cml-medium
      ...              ...                                                                      ...
       tomg           cml                           cml-cpu,cml-default,cml-medium
       tomg             cml                                           cml-cpu,cml-default,cml-medium
       tomg cml-scavenger                                             cml-scavenger
       tomg   cml-scavenger                                                           cml-scavenger
      tomg        cml-tomg                                          cml-default,cml-high,cml-medium
      ...              ...                                                                      ...
</pre>
</pre>


You can also see the total number of Track-able Resources (TRES) allowed for each account by running the following command. Please make sure you give the appropriate account that you are looking for. The billing number displayed here is the sum of [[SLURM/Priority#Modern | resource weightings]] for all nodes appropriated to that account.
You can also see the total number of Track-able Resources (TRES) allowed for each account by running the following command. Please make sure you give the appropriate account that you are looking for. The billing number displayed here is the sum of [[SLURM/Priority#Fair-share | resource weightings]] for all nodes appropriated to that account.


<pre>
<pre>
$ sacctmgr show assoc account=cml-tomg format=user,account,qos,grptres
$ sacctmgr show assoc account=cml format=user,account,qos,grptres
       User    Account                  QOS      GrpTRES
       User    Account                  QOS      GrpTRES
---------- ---------- -------------------- -------------
---------- ---------- -------------------- -------------
            cml-tomg                       billing=8107
                  cml                      billing=6481
                  ...                                ...
</pre>
</pre>


===QoS===
==QoS==
CML currently has 5 QoS for the '''cml-dpart''' partition (though <code>high_long</code> and <code>very_high</code> may not be available to all faculty accounts), 1 QoS for the '''cml-scavenger''' partition, and 1 QoS for the '''cml-cpu''' partition.  If you do not specify a QoS when submitting your job, you will receive the '''cml-default''' QoS assuming you are using a CML account. The important part here is that in different QoS you can have a shorter/longer maximum wall time, a different total number of jobs running at once, and a different maximum number of track-able resources (TRES) for the job.  In the cml-scavenger QoS, one more constraint that you are restricted by is the total number of TRES per user (over multiple jobs).  
CML currently has 5 QoS for the '''cml-dpart''' partition (though <code>cml-high_long</code> and <code>cml-very_high</code> may not be available to all faculty accounts), 1 QoS for the '''cml-scavenger''' partition, and 1 QoS for the '''cml-cpu''' partition.  If you do not specify a QoS when submitting your job using the <code>--qos</code> parameter, you will receive the <code>cml-default</code> QoS assuming you are using a CML account.
 
If your faculty member's Slurm account does not have one or both of the <code>cml-high_long</code> or <code>cml-very_high</code> QoS available to it, we can add it to their account provided they approve. Please [[HelpDesk | contact staff]] if this is desired.
 
The important part here is that in different QoS you can have a shorter/longer maximum wall time, a different total number of jobs running at once, and a different maximum number of track-able resources (TRES) for the job.  In the cml-scavenger QoS, one more constraint that you are restricted by is the total number of TRES per user (over multiple jobs).  
 
<pre>
$ show_qos --all | grep cml
                Name    MaxWall                        MaxTRES MaxJobsPU                      MaxTRESPU                                                                           
-------------------- ----------- ------------------------------ --------- ------------------------------     
...                                                                     
            cml-cpu  7-00:00:00                                        8
        cml-default  7-00:00:00      cpu=4,gres/gpu=1,mem=32G        2
            cml-high  1-12:00:00    cpu=16,gres/gpu=4,mem=128G        2
      cml-high_long 14-00:00:00              cpu=32,gres/gpu=8        8                    gres/gpu=8
          cml-medium  3-00:00:00      cpu=8,gres/gpu=2,mem=64G        2
      cml-scavenger  3-00:00:00                                                            gres/gpu=24
      cml-very_high  1-12:00:00    cpu=32,gres/gpu=8,mem=256G        8                    gres/gpu=12
...
</pre>


<pre>
<pre>
$ show_qos
$ show_partition_qos --all | grep cml
        Name     MaxWall  MaxJobs                        MaxTRES                     MaxTRESPU              GrpTRES
                Name MaxSubmitPU                     MaxTRESPU              GrpTRES  
------------- ----------- ------- ------------------------------ ------------------------------ --------------------
-------------------- ----------- ------------------------------ --------------------  
  cml-medium  3-00:00:00      2      cpu=8,gres/gpu=2,mem=64G
...
  cml-default  7-00:00:00      2      cpu=4,gres/gpu=1,mem=32G
                cml         500                                    cpu=1128,mem=11T
    cml-high  1-12:00:00      2    cpu=16,gres/gpu=4,mem=128G
            cml-cpu         500
cml-scavenger 3-00:00:00                                                          gres/gpu=24
        cml-furongh        500
      normal
      cml-scavenger         500                    gres/gpu=24
      cml-cpu  7-00:00:00      8
          cml-wriva        500
cml-very_high  1-12:00:00      8    cpu=32,gres/gpu=8,mem=256G                    gres/gpu=12
            cml-zhou        500
cml-high_long 14-00:00:00      8              cpu=32,gres/gpu=8                    gres/gpu=8
...
</pre>
</pre>


===Data Storage===
==Storage==
[[CML#Data_Storage | All data storage that was available on the standalone CML cluster]] will continue to be available in Nexus.
There are 3 types of user storage available to users in the CML:
* Home directories
* Project directories
* Scratch directories
 
There are also 2 types of read-only storage available for common use among users in the CML:
* Dataset directories
* Model directories
 
CML users can also request [[Nexus#Project_Allocations | Nexus project allocations]].
 
===Home Directories===
{{Nfshomes}}
 
===Project Directories===
You can request project based allocations for up to 6TB for up to 120 days with one or more approvals:
* Allocations up to and including 3TB require approval from a CML faculty member
* Allocations above 3TB (up to 6TB) require approval from both a CML faculty member and the [https://ml.umd.edu/#team director of CML]
 
To request an allocation, please [[HelpDesk | contact staff]] and include the faculty member(s) overseeing the project in the conversation.  Please include the following details:
* Project Name (short)
* Description
* Size (1TB, 2TB, etc.)
* Length in days (30 days, 90 days, etc.)
* Other user(s) that need to access the allocation, if any
 
These allocations will be available from '''/fs/cml-projects''' under a name that you provide when you request the allocation.  Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 120 days (requires re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML). 
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period.  Staff will then remove the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but the original faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and the faculty approver responding, staff will remove the allocation.
 
This data is backed up nightly.
 
===Scratch Directories===
Scratch data has no data protection including no snapshots and the data is not backed up. There are two types of scratch directories in the CML compute infrastructure:
* Network scratch directory
* Local scratch directories
 
====Network Scratch Directory====
You have 200GB of scratch storage available at <code>/cmlscratch/<username></code>.  '''It is not backed up or protected in any way.'''  This directory is '''automounted''' so you will need to <code>cd</code> into the directory or request/specify a fully qualified file path to access this.
 
You may request a permanent increase of up to 800GB total space without any faculty approval by [[HelpDesk | contacting staff]].  If you need space beyond 800GB, you will need faculty approval and/or a project directory. Space increases beyond 800GB also have a maximum request period of 120 days (as with project directories), after which they will need to be renewed with re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML.
* As with project directories, allocations over 3TB total space require approval from the [https://ml.umd.edu/#team director of CML] in addition to your faculty member.
 
This file system is available on all submission and computational nodes within the cluster.
 
====Local Scratch Directories====
Each computational node that you can schedule compute jobs on has one or more local scratch directories.  These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc.  These are almost always more performant than any other storage available to the job.  However, you must stage data to these directories within the confines of your jobs and stage the data out before the end of your jobs.
 
These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month during our monthly maintenance windows.  Again, please make sure you secure any data you write to these directories at the end of your job.
 
===Datasets===
We have read-only dataset storage available at <code>/fs/cml-datasets</code>.  If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].
 
The list of CML datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=CML here].
 
===Models===
We have read-only model storage available at <code>/fs/cml-models</code>.  If there are models that you would like to see downloaded and made available, please see [[Datasets | this page]].

Latest revision as of 20:22, 4 December 2024

The compute nodes from CML's previous standalone cluster have folded into Nexus as of the scheduled maintenance window for August 2023 (Thursday 08/17/2023, 5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found here.

Please contact staff with any questions or concerns.

Usage

You can SSH to nexuscml.umiacs.umd.edu to log in to a submission node.

If you store something in a local directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:

  • nexuscml00.umiacs.umd.edu
  • nexuscml01.umiacs.umd.edu

All partitions, QoSes, and account names from the standalone CML cluster have been moved over to Nexus. However, please note that cml- is prepended to all of the values that were present in the standalone CML cluster to distinguish them from existing values in Nexus. The lone exception is the base account that was named cml in the standalone cluster (it is also named just cml in Nexus).

Here are some before/after examples of job submission with various parameters:

Standalone CML cluster submission command | Nexus cluster submission command
srun --partition=dpart --qos=medium --account=tomg --gres=gpu:rtx2080ti:2 --pty bash | srun --partition=cml-dpart --qos=cml-medium --account=cml-tomg --gres=gpu:rtx2080ti:2 --pty bash
srun --partition=cpu --qos=cpu --pty bash | srun --partition=cml-cpu --qos=cml-cpu --account=cml --pty bash
srun --partition=scavenger --qos=scavenger --account=scavenger --gres=gpu:4 --pty bash | srun --partition=cml-scavenger --qos=cml-scavenger --account=cml-scavenger --gres=gpu:4 --pty bash
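
The renaming rule described above can be sketched as a small shell helper. This is only an illustration: the function name to_nexus_name is hypothetical and is not a tool installed on the cluster. It prepends cml- to a name from the standalone cluster and leaves the base account cml unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an installed cluster tool): map a partition, QoS,
# or account name from the standalone CML cluster to its Nexus name.
to_nexus_name() {
    local old="$1"
    case "$old" in
        cml) printf '%s\n' "cml" ;;        # base account keeps its name
        *)   printf '%s\n' "cml-$old" ;;   # everything else gains the cml- prefix
    esac
}

to_nexus_name dpart     # prints cml-dpart
to_nexus_name medium    # prints cml-medium
to_nexus_name tomg      # prints cml-tomg
to_nexus_name cml       # prints cml
```

The helper only rewrites names. Generic resource requests such as --gres=gpu:4 are unchanged between the two clusters, as the examples above show.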

CML users (exclusively) can schedule non-interruptible jobs on CML nodes with any non-scavenger job parameters. Please note that the cml-dpart partition has a GrpTRES limit of 100% of the available cores/RAM on all cml## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a max submission limit of 500 jobs per user simultaneously so as to not overload the cluster. This is codified by the partition QoS named cml.

Please note that the CML compute nodes are also in the institute-wide scavenger partition in Nexus. CML users still have scavenging priority over these nodes via the cml-scavenger partition (i.e., all cml- partition jobs (other than cml-scavenger) can preempt both cml-scavenger and scavenger partition jobs, and cml-scavenger partition jobs can preempt scavenger partition jobs).
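
The preemption ordering above can be restated as a short shell sketch. Slurm itself enforces preemption; the functions preempt_tier and can_preempt below are hypothetical names used only to illustrate the ordering, and they cover only the partitions mentioned in the paragraph above.

```shell
#!/usr/bin/env bash
# Illustrative only: restates the ordering described above.
# preempt_tier and can_preempt are hypothetical names, not cluster tools.
preempt_tier() {
    case "$1" in
        scavenger)     echo 0 ;;   # institute-wide scavenger partition
        cml-scavenger) echo 1 ;;   # CML scavenger partition
        cml-*)         echo 2 ;;   # all other cml- partitions
        *)             return 1 ;; # not covered by the description above
    esac
}

# Exit status 0 if a job in partition $1 can preempt a job in partition $2,
# 1 if it cannot, 2 if either partition is outside the described set.
can_preempt() {
    local a b
    a=$(preempt_tier "$1") || return 2
    b=$(preempt_tier "$2") || return 2
    [ "$a" -gt "$b" ]
}

can_preempt cml-dpart cml-scavenger && echo "cml-dpart can preempt cml-scavenger"
can_preempt cml-scavenger scavenger && echo "cml-scavenger can preempt scavenger"
can_preempt scavenger cml-scavenger || echo "scavenger cannot preempt cml-scavenger"
```

Jobs in the same tier (for example cml-dpart and cml-cpu) do not preempt each other in this sketch, matching the guaranteed allocations of those partitions.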

Network

The network infrastructure supporting the CML partition consists of:

  1. One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
    • cml[17-28,30-32]: Two 100GbE links per node, one to each switch in the pair (redundancy).
  2. One pair of network switches connected to the above pair of network switches via two 100GbE links, one between the first two switches in each pair and one between the second two switches in each pair for redundancy, and to each other via dual 25GbE links for redundancy.
    • cml[00-09]: Two 25GbE links per node, one to each switch in the pair (redundancy).
    • cml[10-16],cmlcpu[00-04,06-07]: Two 10GbE links per node, one to each switch in the pair (redundancy).

The fileserver hosting all CML project, scratch, dataset, and model allocations also connects to the same pair of switches supporting cml[17-28,30-32] via fourteen 25GbE links, seven to each switch in the pair for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see Nexus/Network.

Partitions

There are three partitions available to general CML SLURM users. You must specify a partition when submitting your job.

  • cml-dpart - This is the default partition. Job allocations are guaranteed.
  • cml-scavenger - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other cml- partitions are ready to be scheduled.
  • cml-cpu - This partition is for CPU focused jobs. Job allocations are guaranteed.

There are two additional partitions available solely to specific faculty members and their sponsored accounts.

  • cml-furongh - This partition is for exclusive priority access to Dr. Furong Huang's purchased A6000 node. Job allocations are guaranteed.
  • cml-zhou - This partition is for exclusive priority access to Dr. Tianyi Zhou's purchased nodes. Job allocations are guaranteed.

Accounts

The Center has a base SLURM account cml which has a modest number of guaranteed billing resources available to all cluster users at any given time. Other faculty that have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed billing resources corresponding to the amount that they invested.

If you do not specify an account when submitting your job, you will receive the cml account, which only has access to the cml-cpu, cml-default, and cml-medium QoSes (see below section).

If you need access to a different QoS, or if the cml account is at its billing limit (see below in this section), please use your faculty sponsor's account if they have one available. However, keep in mind that if your faculty sponsor has their own named partition (see previous section), using the faculty-specific account in the cml-dpart partition may block access to resources in the faculty-specific partition, since the billing limit for the account is charged regardless of which partition is being used.

The current faculty accounts are:

  • cml-abhinav
  • cml-cameron
  • cml-furongh
  • cml-hajiagha
  • cml-john
  • cml-ramani
  • cml-sfeizi
  • cml-tokekar
  • cml-tomg
  • cml-zhou
$ sacctmgr show account format=account%20,description%30,organization%10
             Account                          Descr        Org
-------------------- ------------------------------ ----------
                 ...                            ...        ...
                 cml                            cml        cml
         cml-abhinav      cml - abhinav shrivastava        cml
         cml-cameron            cml - maria cameron        cml
         cml-furongh             cml - furong huang        cml
        cml-hajiagha      cml - mohammad hajiaghayi        cml
            cml-john           cml - john dickerson        cml
          cml-ramani        cml - ramani duraiswami        cml
       cml-scavenger                cml - scavenger        cml
          cml-sfeizi             cml - soheil feizi        cml
         cml-tokekar           cml - pratap tokekar        cml
            cml-tomg            cml - tom goldstein        cml
            cml-zhou              cml - tianyi zhou        cml
                 ...                            ...        ...

Faculty can manage this list of users via our Directory application in the Security Groups section. The security group that controls access has the prefix cml_ and then the faculty username. It will also list slurm://nexusctl.umiacs.umd.edu as the associated URI.

You can check which accounts you are associated with by running the show_assoc command. If you do not see the appropriate association, please contact staff and include your faculty member in the conversation.

$ show_assoc
      User          Account MaxJobs       GrpTRES                                                QOS
---------- ---------------- ------- ------------- --------------------------------------------------
       ...              ...                                                                      ...
      tomg              cml                                           cml-cpu,cml-default,cml-medium
      tomg    cml-scavenger                                                            cml-scavenger
      tomg         cml-tomg                                          cml-default,cml-high,cml-medium
       ...              ...                                                                      ...

You can also see the total number of trackable resources (TRES) allowed for each account by running the following command, substituting the account you are interested in. The billing number displayed here is the sum of resource weightings for all nodes appropriated to that account.

$ sacctmgr show assoc account=cml format=user,account,qos,grptres
      User    Account                  QOS       GrpTRES
---------- ---------- -------------------- -------------
                  cml                       billing=6481
                  ...                                ...

QoS

CML currently has 5 QoS for the cml-dpart partition (though cml-high_long and cml-very_high may not be available to all faculty accounts), 1 QoS for the cml-scavenger partition, and 1 QoS for the cml-cpu partition. If you do not specify a QoS when submitting your job using the --qos parameter, you will receive the cml-default QoS assuming you are using a CML account.

If your faculty member's Slurm account does not have one or both of the cml-high_long or cml-very_high QoS available to it, we can add it to their account provided they approve. Please contact staff if this is desired.

Each QoS allows a different maximum wall time, a different total number of jobs running at once, and a different maximum number of trackable resources (TRES) per job. In the cml-scavenger QoS, you are additionally restricted by the total number of TRES per user (across multiple jobs).

$ show_qos --all | grep cml
                Name     MaxWall                        MaxTRES MaxJobsPU                      MaxTRESPU                                                                             
-------------------- ----------- ------------------------------ --------- ------------------------------      
...                                                                       
             cml-cpu  7-00:00:00                                        8
         cml-default  7-00:00:00       cpu=4,gres/gpu=1,mem=32G         2
            cml-high  1-12:00:00     cpu=16,gres/gpu=4,mem=128G         2
       cml-high_long 14-00:00:00              cpu=32,gres/gpu=8         8                     gres/gpu=8
          cml-medium  3-00:00:00       cpu=8,gres/gpu=2,mem=64G         2
       cml-scavenger  3-00:00:00                                                             gres/gpu=24
       cml-very_high  1-12:00:00     cpu=32,gres/gpu=8,mem=256G         8                    gres/gpu=12
...

$ show_partition_qos --all | grep cml
                Name MaxSubmitPU                      MaxTRESPU              GrpTRES 
-------------------- ----------- ------------------------------ -------------------- 
...
                 cml         500                                    cpu=1128,mem=11T
             cml-cpu         500
         cml-furongh         500
       cml-scavenger         500                    gres/gpu=24
           cml-wriva         500
            cml-zhou         500
...
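
As an illustration of how the limits above translate into a job request, the sketch below writes a batch script whose requests sit exactly at the cml-medium limits shown in the show_qos output (cpu=8, gres/gpu=2, mem=64G, 3 days of wall time). The file name cml_medium_job.sh and the hostname payload are placeholders, and the script assumes the base cml account in the cml-dpart partition.

```shell
# Write a batch script that requests no more than the cml-medium
# limits. The file name and the payload command are placeholders.
cat > cml_medium_job.sh <<'EOF'
#!/bin/bash
#SBATCH --account=cml
#SBATCH --partition=cml-dpart
#SBATCH --qos=cml-medium
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:2
#SBATCH --mem=64G
#SBATCH --time=3-00:00:00

# Replace with your actual workload.
hostname
EOF

# Submit from a CML submission node with:
#   sbatch cml_medium_job.sh
```

A request that exceeds any of these values would need a QoS with higher limits, such as cml-high, and an account that has access to it.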

Storage

There are 3 types of user storage available to users in the CML:

  • Home directories
  • Project directories
  • Scratch directories

There are also 2 types of read-only storage available for common use among users in the CML:

  • Dataset directories
  • Model directories

CML users can also request Nexus project allocations.

Home Directories

You have 30GB of home directory storage available at /nfshomes/<username>. It has both Snapshots and Backups enabled.

Home directories are intended to store personal or configuration files only. We encourage you to not share any data in your home directory. You are encouraged to utilize our GitLab infrastructure to host your code repositories.

NOTE: To check your quota on this directory, use the command df -h ~.

Project Directories

You can request project-based allocations of up to 6TB for up to 120 days, with one or more approvals:

  • Allocations up to and including 3TB require approval from a CML faculty member
  • Allocations above 3TB (up to 6TB) require approval from both a CML faculty member and the director of CML

To request an allocation, please contact staff and include the faculty member(s) that the project is under in the conversation. Please include the following details:

  • Project Name (short)
  • Description
  • Size (1TB, 2TB, etc.)
  • Length in days (30 days, 90 days, etc.)
  • Other user(s) that need to access the allocation, if any

These allocations will be available from /fs/cml-projects under a name that you provide when you request the allocation. Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 120 days (requires re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML).

  • If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then remove the allocation.
  • If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
    • If you do respond asking for renewal but the original faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
    • If one month from the end of the allocation period is reached without both you and the faculty approver responding, staff will remove the allocation.

This data is backed up nightly.

Scratch Directories

Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the CML compute infrastructure:

  • Network scratch directory
  • Local scratch directories

Network Scratch Directory

You have 200GB of scratch storage available at /cmlscratch/<username>. It is not backed up or protected in any way. This directory is automounted, so you will need to cd into the directory or specify a fully qualified file path to access it.
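
Because the directory is automounted, it may not be visible until its full path is accessed. A minimal sketch, assuming a hypothetical helper named cml_scratch_path that builds the path from a username:

```shell
# Build the network scratch path for a user. Defaults to the
# current user when no argument is given. The function name is
# hypothetical.
cml_scratch_path() {
    printf '/cmlscratch/%s\n' "${1:-$(id -un)}"
}

# Show the path for the current user.
cml_scratch_path

# On a CML node, accessing the full path triggers the automount:
#   cd "$(cml_scratch_path)" && pwd
```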

You may request a permanent increase of up to 800GB total space without any faculty approval by contacting staff. If you need space beyond 800GB, you will need faculty approval and/or a project directory. Space increases beyond 800GB also have a maximum request period of 120 days (as with project directories), after which they will need to be renewed with re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML.

  • As with project directories, allocations over 3TB total space require approval from the director of CML in addition to your faculty member.

This file system is available on all submission and computational nodes within the cluster.

Local Scratch Directories

Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named /scratch0, /scratch1, etc. These are almost always more performant than any other storage available to the job. However, you must stage data to these directories within the confines of your jobs and stage the data out before the end of your jobs.

These local scratch directories have a tmpwatch job which will delete unaccessed data after 90 days, scheduled via maintenance jobs to run once a month during our monthly maintenance windows. Again, please make sure you secure any data you write to these directories at the end of your job.
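
The stage-in and stage-out pattern described above can be sketched as a small shell function. This is one possible approach rather than a staff-provided tool: the function name stage_and_run and the input/output directory layout are assumptions, and the example paths in the trailing comment are hypothetical.

```shell
# stage_and_run SRC SCRATCH_ROOT DEST COMMAND [ARGS...]
# Copies SRC into a private working directory under SCRATCH_ROOT
# (for example /scratch0), runs COMMAND from that directory, then
# copies everything the command wrote to ./output back to DEST.
stage_and_run() {
    src=$1
    scratch_root=$2
    dest=$3
    shift 3

    # Private working directory on local scratch.
    work=$(mktemp -d "$scratch_root/job.XXXXXX") || return 1

    # Stage in: the data appears as ./input inside the working directory.
    cp -R "$src" "$work/input" || { rm -rf "$work"; return 1; }
    mkdir -p "$work/output" "$dest" || { rm -rf "$work"; return 1; }

    # Run the workload from the working directory and keep its exit status.
    ( cd "$work" && "$@" )
    status=$?

    # Stage out: only clean up local scratch once the copy succeeded.
    if cp -R "$work/output/." "$dest/"; then
        rm -rf "$work"
    else
        echo "stage-out failed; data left in $work" >&2
        status=1
    fi
    return "$status"
}

# Example call inside a job (all paths and the command are hypothetical):
#   stage_and_run /fs/cml-projects/my_project/data /scratch0 \
#       "/cmlscratch/$(id -un)/results" python train.py
```

Running the stage-out inside the same job, as here, avoids leaving results on a node where they may later be removed by the cleanup job.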

Datasets

We have read-only dataset storage available at /fs/cml-datasets. If there are datasets that you would like to see curated and made available, please see this page.

The list of CML datasets we currently host can be viewed here.

Models

We have read-only model storage available at /fs/cml-models. If there are models that you would like to see downloaded and made available, please see this page.