Nexus/CML: Difference between revisions
Revision as of 15:50, 22 September 2023
The CML standalone cluster's compute nodes were folded into Nexus during the scheduled maintenance window for August 2023 (Thursday 08/17/2023, 5-8pm).
The Nexus cluster already has a large pool of compute resources made possible through leftover funding for the Brendan Iribe Center. Details on common nodes already in the cluster (Tron partition) can be found here.
The CML cluster's standalone submission node cmlsub00.umiacs.umd.edu will be retired on Thursday, September 21st, 2023 during that month's maintenance window (5-8pm), as it is no longer able to submit jobs to CML compute nodes. Please use nexuscml00.umiacs.umd.edu and nexuscml01.umiacs.umd.edu for any general purpose CML compute needs after this time.
Please see the Timeline section below for concrete dates in chronological order.
Please contact staff with any questions or concerns.
Usage
The Nexus cluster submission nodes that are allocated to CML are nexuscml00.umiacs.umd.edu and nexuscml01.umiacs.umd.edu. You must use these nodes to submit jobs to CML compute nodes. Submission from cmlsub00.umiacs.umd.edu is no longer available.
All partitions, QoSes, and account names from the standalone CML cluster have been moved over to Nexus. However, please note that cml- is prepended to all of the values that were present in the standalone CML cluster to distinguish them from existing values in Nexus. The lone exception is the base account that was named cml in the standalone cluster (it is also named just cml in Nexus).
Here are some before/after examples of job submission with various parameters:
Standalone CML cluster submission command | Nexus cluster submission command |
---|---|
srun --partition=dpart --qos=medium --account=tomg --gres=gpu:rtx2080ti:2 --pty bash | srun --partition=cml-dpart --qos=cml-medium --account=cml-tomg --gres=gpu:rtx2080ti:2 --pty bash |
srun --partition=cpu --qos=cpu --pty bash | srun --partition=cml-cpu --qos=cml-cpu --account=cml --pty bash |
srun --partition=scavenger --qos=scavenger --account=scavenger --gres=gpu:4 --pty bash | srun --partition=cml-scavenger --qos=cml-scavenger --account=cml-scavenger --gres=gpu:4 --pty bash |
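The renaming rule can be sketched as a small shell helper. This is only an illustration: `nexus_name` and `nexus_srun_cmd` are hypothetical names, not commands provided on the cluster, and the second function only prints a command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a site-provided command): map a standalone CML
# partition, QoS, or account name to its Nexus equivalent.
nexus_name() {
    case "$1" in
        cml|cml-*) printf '%s\n' "$1" ;;      # base account, or already prefixed
        *)         printf 'cml-%s\n' "$1" ;;  # everything else gains the cml- prefix
    esac
}

# Print (but do not run) the Nexus srun command for a standalone
# partition/QoS/account triple; remaining arguments are passed through unchanged.
nexus_srun_cmd() {
    local partition qos account
    partition=$(nexus_name "$1")
    qos=$(nexus_name "$2")
    account=$(nexus_name "$3")
    shift 3
    printf 'srun --partition=%s --qos=%s --account=%s %s\n' \
        "$partition" "$qos" "$account" "$*"
}
```

For example, `nexus_srun_cmd dpart medium tomg --gres=gpu:rtx2080ti:2 --pty bash` prints the Nexus command from the first row of the table.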
CML users (exclusively) can schedule non-interruptible jobs on CML nodes with any non-scavenger job parameters. Please note that the cml-dpart partition has a GrpTRES limit of 100% of the available cores/RAM on all cml## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a max submission limit of 500 jobs per user simultaneously so as to not overload the cluster. This is codified by the partition QoS named cml.
Please note that the CML compute nodes are also in the institute-wide scavenger partition in Nexus. CML users still have scavenging priority over these nodes via the cml-scavenger partition: jobs in any cml- partition other than cml-scavenger can preempt both cml-scavenger and scavenger partition jobs, and cml-scavenger partition jobs can preempt scavenger partition jobs.
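The preemption ordering described above can be modeled as three tiers. The following is a minimal sketch of that reading, not the scheduler's actual configuration; `preempt_tier` and `can_preempt` are hypothetical names, and partitions outside the ones named in this section are deliberately left unmodeled.

```shell
#!/usr/bin/env bash
# Hypothetical model of the ordering stated above:
#   scavenger < cml-scavenger < every other cml- partition
preempt_tier() {
    case "$1" in
        scavenger)     echo 0 ;;
        cml-scavenger) echo 1 ;;
        cml-*)         echo 2 ;;
        *)             echo -1 ;;   # not covered by this sketch
    esac
}

# can_preempt PREEMPTOR VICTIM: succeeds only when the first partition sits in
# a strictly higher tier than the second and both are covered by the sketch.
can_preempt() {
    local a b
    a=$(preempt_tier "$1")
    b=$(preempt_tier "$2")
    [ "$a" -ge 0 ] && [ "$b" -ge 0 ] && [ "$a" -gt "$b" ]
}
```

Under this model `can_preempt cml-dpart cml-scavenger` succeeds, while `can_preempt cml-scavenger cml-dpart` fails.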
Partitions
There are three partitions available to general CML SLURM users. You must specify a partition when submitting your job.
- cml-dpart - This is the default partition. Job allocations are guaranteed.
- cml-scavenger - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other cml- partitions are ready to be scheduled.
- cml-cpu - This partition is for CPU focused jobs. Job allocations are guaranteed.
There is one additional partition available solely to Dr. Furong Huang's sponsored accounts.
- cml-furongh - This partition is for exclusive priority access to Dr. Huang's purchased A6000 node.
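As an illustration of specifying a partition for a batch job, here is a sketch that generates a job script for cml-dpart. The function name `make_job_script`, the one-day time limit, and the `srun hostname` placeholder workload are assumptions for the example; the resource requests are chosen to stay within the cml-default QoS limits listed in the QoS section.

```shell
#!/usr/bin/env bash
# Hypothetical generator: prints a batch script for the cml-dpart partition.
# $1 is the account to charge (defaults to the base cml account).
make_job_script() {
    local account="${1:-cml}"
    cat <<EOF
#!/bin/bash
#SBATCH --partition=cml-dpart
#SBATCH --qos=cml-default
#SBATCH --account=${account}
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=1-00:00:00

# Placeholder workload; replace with the real command.
srun hostname
EOF
}
```

From a submission node this could be used as `make_job_script cml-tomg > job.sh && sbatch job.sh`.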
Accounts
The Center has a base SLURM account cml which has a modest number of guaranteed billing resources available to all cluster users at any given time. Other faculty that have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed billing resources corresponding to the amount that they invested.
If you do not specify an account when submitting your job, you will receive the cml account. If your faculty sponsor has their own account, it is recommended to use that account for job submission.
The current faculty accounts are:
- cml-abhinav
- cml-cameron
- cml-furongh
- cml-hajiagha
- cml-john
- cml-ramani
- cml-sfeizi
- cml-tokekar
- cml-tomg
- cml-zhou
<pre>
$ sacctmgr show account format=account%20,description%30,organization%10
Account              Descr                          Org
-------------------- ------------------------------ ----------
...                  ...                            ...
cml                  cml                            cml
cml-abhinav          cml - abhinav shrivastava      cml
cml-cameron          cml - maria cameron            cml
cml-furongh          cml - furong huang             cml
cml-hajiagha         cml - mohammad hajiaghayi      cml
cml-john             cml - john dickerson           cml
cml-ramani           cml - ramani duraiswami        cml
cml-scavenger        cml - scavenger                cml
cml-sfeizi           cml - soheil feizi             cml
cml-tokekar          cml - pratap tokekar           cml
cml-tomg             cml - tom goldstein            cml
cml-zhou             cml - tianyi zhou              cml
...                  ...                            ...
</pre>
Faculty can manage this list of users via our Directory application in the Security Groups section. The security group that controls access has the prefix cml_ and then the faculty username. It will also list slurm://nexusctl.umiacs.umd.edu as the associated URI.
You can check your account associations by running the show_assoc command to see the accounts you are associated with. Please contact staff and include your faculty member in the conversation if you do not see the appropriate association.
<pre>
$ show_assoc
User       Account          MaxJobs GrpTRES       QOS
---------- ---------------- ------- ------------- --------------------------------------------------
...        ...                                    ...
tomg       cml                                    cml-cpu,cml-default,cml-medium
tomg       cml-scavenger                          cml-scavenger
tomg       cml-tomg                               cml-default,cml-high,cml-medium
...        ...                                    ...
</pre>
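To pull the QoS list for one association out of the show_assoc output, a short awk filter is enough. This is a sketch: `qos_for_account` is a hypothetical helper, and it assumes the QoS list is the last whitespace-separated field of each row, as in the sample output above (a row with no QoS at all would be misread).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read show_assoc-style output on stdin and print the QoS
# names for the given user ($1) and account ($2), one per line.
# Assumption: the comma-separated QoS list is the last field of the row.
qos_for_account() {
    awk -v user="$1" -v account="$2" '
        $1 == user && $2 == account {
            n = split($NF, q, ",")
            for (i = 1; i <= n; i++) print q[i]
        }'
}
```

On a submission node this could be run as `show_assoc | qos_for_account "$USER" cml-tomg`.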
You can also see the total number of Track-able Resources (TRES) allowed for each account by running the following command. Please make sure you give the appropriate account that you are looking for. The billing number displayed here is the sum of resource weightings for all nodes appropriated to that account.
<pre>
$ sacctmgr show assoc account=cml format=user,account,qos,grptres
User       Account    QOS                  GrpTRES
---------- ---------- -------------------- -------------
           cml                             billing=7732
...        ...
</pre>
QoS
CML currently has 5 QoS for the cml-dpart partition (though high_long and very_high may not be available to all faculty accounts), 1 QoS for the cml-scavenger partition, and 1 QoS for the cml-cpu partition. If you do not specify a QoS when submitting your job using the --qos parameter, you will receive the cml-default QoS assuming you are using a CML account.
The important part here is that different QoS can have a shorter/longer maximum wall time, a different total number of jobs running at once, and a different maximum number of trackable resources (TRES) for the job. The cml-scavenger QoS additionally restricts the total number of TRES per user (over multiple jobs).
<pre>
$ show_qos
Name                 MaxWall     MaxTRES                        MaxJobsPU MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
...
cml-cpu              7-00:00:00                                 8
cml-default          7-00:00:00  cpu=4,gres/gpu=1,mem=32G       2
cml-high             1-12:00:00  cpu=16,gres/gpu=4,mem=128G     2
cml-high_long        14-00:00:00 cpu=32,gres/gpu=8              8         gres/gpu=8
cml-medium           3-00:00:00  cpu=8,gres/gpu=2,mem=64G       2
cml-scavenger        3-00:00:00                                           gres/gpu=24
cml-very_high        1-12:00:00  cpu=32,gres/gpu=8,mem=256G     8         gres/gpu=12
...

$ show_partition_qos
Name                 MaxSubmitPU MaxTRESPU                      GrpTRES
-------------------- ----------- ------------------------------ --------------------
...
cml                  500                                        cpu=1128,mem=11381G
cml-scavenger        500         gres/gpu=24
...
</pre>
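Before submitting, a request can be checked against the per-job MaxTRES limits shown above. The sketch below hard-codes those limits as they appear in this revision of the page, so it will go stale if the QoS definitions change; `qos_limits` and `fits_qos` are hypothetical helpers, and only the cml-dpart QoS with a MaxTRES entry are covered.

```shell
#!/usr/bin/env bash
# Per-job limits (MaxTRES) copied from the show_qos output above, printed as
# "cpus gpus mem_in_GB"; "-" marks a limit the table leaves empty.
qos_limits() {
    case "$1" in
        cml-default)   echo "4 1 32" ;;
        cml-medium)    echo "8 2 64" ;;
        cml-high)      echo "16 4 128" ;;
        cml-very_high) echo "32 8 256" ;;
        cml-high_long) echo "32 8 -" ;;
        *)             return 1 ;;
    esac
}

# Hypothetical check: fits_qos QOS CPUS GPUS MEM_GB
# Returns 0 if the request fits, 1 if it exceeds a limit, 2 for an unknown QoS.
fits_qos() {
    local limits rest max_cpu max_gpu max_mem
    limits=$(qos_limits "$1") || return 2
    max_cpu=${limits%% *}
    rest=${limits#* }
    max_gpu=${rest%% *}
    max_mem=${rest#* }
    [ "$2" -le "$max_cpu" ] || return 1
    [ "$3" -le "$max_gpu" ] || return 1
    [ "$max_mem" = "-" ] || [ "$4" -le "$max_mem" ] || return 1
    return 0
}
```

For instance, `fits_qos cml-default 8 1 32` fails because cml-default allows at most 4 CPUs per job, while the same request passes for cml-medium.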
Storage
There are 3 types of user storage available to users in the CML:
- Home directories
- Project directories
- Scratch directories
There are also 2 types of read-only storage available for common use among users in the CML:
- Dataset directories
- Model directories
CML users can also request Nexus project allocations.
Home Directories
Home directories in the CML computational infrastructure are available from the Institute's NFShomes as /nfshomes/USERNAME where USERNAME is your username. These home directories have very limited storage (30GB, cannot be increased) and are intended for your personal files, configuration and source code. Your home directory is not intended for data sets or other large scale data holdings. You are encouraged to use our GitLab infrastructure to host your code repositories.
NOTE: To check your quota on this directory you will need to use the quota -s command.
Your home directory data is fully protected: it has snapshots and is backed up nightly.
Project Directories
You can request project based allocations for up to 6TB for up to 120 days with approval from a CML faculty member and the director of CML.
To request an allocation, please contact staff and include the faculty member(s) that the project is under in the conversation. Please include the following details:
- Project Name (short)
- Description
- Size (1TB, 2TB, etc.)
- Length in days (30 days, 90 days, etc.)
- Other user(s) that need to access the allocation, if any
These allocations will be available from /fs/cml-projects under a name that you provide when you request the allocation. Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 120 days (requires re-approval from a CML faculty member and the director of CML). If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then remove the allocation. If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible. If you do respond asking for renewal but the original faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible. If one month from the end of the allocation period is reached without both you and the faculty approver responding, staff will remove the allocation.
This data is backed up nightly.
Scratch Directories
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the CML compute infrastructure:
- Network scratch directory
- Local scratch directories
Network Scratch Directory
You are allocated 400GB of scratch space via NFS from /cmlscratch/$username. It is not backed up or protected in any way. This directory is automounted so you will need to cd into the directory or request/specify a fully qualified file path to access this.
You may request a permanent increase of up to 800GB total space without any faculty approval by contacting staff. If you need space beyond 800GB, you will need faculty approval and/or a project directory. Space increases beyond 800GB also have a maximum request period of 120 days (as with project directories), after which they will need to be renewed with re-approval from a CML faculty member and the director of CML.
This file system is available on all submission, data management, and computational nodes within the cluster.
Local Scratch Directories
Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named /scratch0, /scratch1, etc. These are almost always more performant than any other storage available to the job. However, you must stage data to these directories within the confines of your jobs and stage the data out before the end of your jobs.
These local scratch directories have a tmpwatch job which will delete unaccessed data after 90 days, scheduled via maintenance jobs to run once a month during our monthly maintenance windows. Again, please make sure you secure any data you write to these directories at the end of your job.
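One way to follow the stage-in/stage-out pattern inside a job is sketched below. `run_staged` and the `WORK_IN`/`WORK_OUT` variables are hypothetical names chosen for this example, and the wrapper only covers the normal exit path: a job that is preempted or killed at its time limit would skip the stage-out step, so long jobs may want to copy results out periodically as well.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run_staged INPUT_DIR OUTPUT_DIR SCRATCH_ROOT CMD [ARGS...]
# Copies INPUT_DIR into a per-job directory under SCRATCH_ROOT (e.g. /scratch0),
# runs CMD with WORK_IN and WORK_OUT set in its environment, copies whatever CMD
# wrote to WORK_OUT into OUTPUT_DIR, then removes the per-job directory.
run_staged() {
    local input_dir="$1" output_dir="$2" scratch_root="$3"
    shift 3
    local work status=0
    work=$(mktemp -d "${scratch_root}/job.XXXXXX") || return 1
    mkdir -p "$work/in" "$work/out" "$output_dir"
    cp -a "$input_dir/." "$work/in/"       # stage in
    WORK_IN="$work/in" WORK_OUT="$work/out" "$@" || status=$?
    cp -a "$work/out/." "$output_dir/"     # stage out, even if CMD failed
    rm -rf "$work"                         # leave local scratch clean
    return "$status"
}
```

Inside a batch script this might be called as `run_staged /cmlscratch/$USER/data /cmlscratch/$USER/results /scratch0 python train.py`, where the paths and `train.py` are placeholders and the program is assumed to read from `$WORK_IN` and write to `$WORK_OUT`.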
Datasets
We have read-only dataset storage available at /fs/cml-datasets. If there are datasets that you would like to see curated and available, please see this page.
The following is the list of datasets available:
Dataset | Path |
---|---|
CelebA | /fs/cml-datasets/CelebA |
CelebA-HQ | /fs/cml-datasets/CelebA-HQ |
CelebAMask-HQ | /fs/cml-datasets/CelebAMask-HQ |
Charades | /fs/cml-datasets/Charades |
Cityscapes | /fs/cml-datasets/cityscapes |
COCO | /fs/cml-datasets/coco |
Diversity in Faces [1] | /fs/cml-datasets/diversity_in_faces |
FFHQ | /fs/cml-datasets/FFHQ |
ImageNet ILSVRC2012 | /fs/cml-datasets/ImageNet/ILSVRC2012 |
LFW | /fs/cml-datasets/facial_test_data |
LibriSpeech | /fs/cml-datasets/LibriSpeech |
LSUN | /fs/cml-datasets/LSUN |
MAG240M | /fs/cml-datasets/OGB/MAG240M |
MegaFace | /fs/cml-datasets/megaface |
MS-Celeb-1M | /fs/cml-datasets/MS_Celeb_aligned_112 |
OC20 | /fs/cml-datasets/OC20 |
ogbn-papers100M | /fs/cml-datasets/OGB/ogbn-papers100M |
roberta | /fs/cml-datasets/roberta |
Salient ImageNet | /fs/cml-datasets/Salient-ImageNet |
ShapeNetCore.v2 | /fs/cml-datasets/ShapeNetCore.v2 |
Tiny ImageNet | /fs/cml-datasets/tiny_imagenet |
WikiKG90M | /fs/cml-datasets/OGB/WikiKG90M |
[1] - This dataset has restricted access. Please contact staff if you are looking to use this dataset.
Models
We have read-only model storage available at /fs/cml-models. If there are models that you would like to see downloaded and made available, please see this page.
Timeline
Each event will be completed within the timeframe specified.
Date | Event |
---|---|
July 14th 2023 | A single compute node cml03 is moved into Nexus so submission can be tested |
August 17th 2023, 5-8pm | All other standalone CML cluster compute nodes are moved into Nexus in corresponding cml- named partitions |
September 21st 2023, 5-8pm | cmlsub00.umiacs.umd.edu is taken offline |