Nexus/CML
Revision as of 19:09, 3 July 2023
The CML standalone cluster's compute nodes will fold into Nexus on Thursday, August 17th, 2023 during the scheduled maintenance window for August (5-8pm).
The Nexus cluster already has a large pool of compute resources made possible through leftover funding for the Brendan Iribe Center. Details on common nodes already in the cluster (Tron partition) can be found here.
In addition, the CML cluster's standalone submission node `cmlsub00.umiacs.umd.edu` will be retired on Thursday, September 21st, 2023 during that month's maintenance window (5-8pm), as it will no longer be able to submit jobs to CML compute nodes after the August maintenance window. Please use `nexuscml00.umiacs.umd.edu` and `nexuscml01.umiacs.umd.edu` for any general purpose CML compute needs after this time.
Please see the Timeline section below for concrete dates in chronological order.
Please contact staff with any questions or concerns.
Usage
The Nexus cluster submission nodes that are allocated to CML are `nexuscml00.umiacs.umd.edu` and `nexuscml01.umiacs.umd.edu`. You must use these nodes to submit jobs to CML compute nodes after the August maintenance window. Submission from `cmlsub00.umiacs.umd.edu` will no longer work.
All partitions, QoSes, and account names from the standalone CML cluster are being moved over to Nexus when the compute nodes move. However, please note that `cml-` will be prepended to all of the values that were present in the standalone CML cluster to distinguish them from existing values in Nexus.
Here are some before/after examples of job submission with various parameters:
Standalone CML cluster submission command | Nexus cluster submission command |
---|---|
`srun --partition=dpart --qos=medium --account=tomg --gres=gpu:rtxa4000:2 --pty bash` | `srun --partition=cml-dpart --qos=cml-medium --account=cml-tomg --gres=gpu:rtxa4000:2 --pty bash` |
`srun --partition=cpu --qos=cpu --pty bash` | `srun --partition=cml-cpu --qos=cml-cpu --pty bash` |
`srun --partition=scavenger --qos=scavenger --account=scavenger --gres=gpu:4 --pty bash` | `srun --partition=cml-scavenger --qos=cml-scavenger --account=cml-scavenger --gres=gpu:4 --pty bash` |
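Because the renaming is mechanical, an existing submission command can be converted by prefixing the values of three flags. The following is a minimal sketch, not a staff-provided tool: the function name `to_nexus_args` is hypothetical, it assumes flags are written in the `--flag=value` form used in the table, and it assumes the base `cml` account keeps its name, since the account list on this page shows it without a prefix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepend "cml-" to the values of --partition, --qos,
# and --account, leaving every other argument untouched.
# Assumes the --flag=value form; "--partition dpart" (space-separated) is not handled.
to_nexus_args() {
    local arg out=()
    for arg in "$@"; do
        case "$arg" in
            --partition=cml-*|--qos=cml-*|--account=cml-*|--account=cml)
                out+=("$arg") ;;                      # already in Nexus form
            --partition=*|--qos=*|--account=*)
                out+=("${arg%%=*}=cml-${arg#*=}") ;;  # add the prefix
            *)
                out+=("$arg") ;;
        esac
    done
    printf '%s\n' "${out[*]}"
}

# Prints the converted argument list for the first row of the table.
to_nexus_args --partition=dpart --qos=medium --account=tomg --gres=gpu:rtxa4000:2 --pty bash
```

The output can be pasted after `srun` on one of the Nexus submission nodes. Running the function on already-converted arguments leaves them unchanged.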
CML users (exclusively) can schedule non-interruptible jobs on the moved nodes with these job parameters. Please note that the `cml-dpart` partition will have a `GrpTRES` limit of 100% of the available cores/RAM on each set of cml## nodes plus 50% of the available cores/RAM on legacy## nodes, so your job may need to wait if all available cores/RAM (or GPUs) are in use.
Please note that the CML compute nodes will also be added to the institute-wide `scavenger` partition in Nexus. CML users will still have scavenging priority over these nodes via the `cml-scavenger` partition: jobs in any `cml-` partition other than `cml-scavenger` can preempt jobs in both `cml-scavenger` and `scavenger`, and jobs in `cml-scavenger` can preempt jobs in `scavenger`.
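The preemption order can be written down as a small predicate. This is only a model of the rule as stated in this section, not the scheduler's actual configuration; the function name `can_preempt` is hypothetical, and partitions that this section does not mention are treated as out of scope.

```shell
#!/usr/bin/env bash
# Hypothetical model of the preemption order described in this section.
# Returns 0 (true) if a job in partition $1 can preempt a job in partition $2.
can_preempt() {
    local preemptor=$1 victim=$2
    case "$preemptor" in
        cml-scavenger)
            [[ $victim == scavenger ]] ;;
        cml-*)
            [[ $victim == cml-scavenger || $victim == scavenger ]] ;;
        *)
            # Not covered by this section; reported as "no" here, which says
            # nothing about how Nexus treats other partitions.
            return 1 ;;
    esac
}

for pair in "cml-dpart cml-scavenger" "cml-scavenger scavenger" "scavenger cml-scavenger"; do
    # $pair is intentionally unquoted so it splits into two arguments.
    if can_preempt $pair; then echo "$pair: yes"; else echo "$pair: no"; fi
done
```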
Timeline
Each event will be completed within the timeframe specified.
Date | Event |
---|---|
August 17th 2023, 5-8pm | All standalone CML cluster compute nodes are moved into Nexus in corresponding `cml-` named partitions |
September 21st 2023, 5-8pm | `cmlsub00.umiacs.umd.edu` is taken offline |
Post-Migration
The below information will become relevant AFTER 8pm on Thursday, August 17th, 2023.
Partitions
The CML SLURM computational infrastructure has three partitions. If you do not specify a partition when submitting your job, you will receive the cml-dpart partition.
- cml-dpart - This is the default partition. Job allocations are guaranteed.
- cml-scavenger - This is the alternate partition. It allows jobs longer run times and more resources, but its jobs are preemptable when jobs in other partitions are ready to be scheduled.
- cml-cpu - This partition is for CPU-focused jobs. Job allocations are guaranteed.
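For batch jobs, the partition is selected with Slurm's `--partition` option. The script below is a minimal sketch of requesting the default partition explicitly rather than relying on the default. The job name, time limit, resource amounts, and the `report_partition` function are illustrative assumptions; the resource amounts were chosen to stay within the cml-default QoS limits shown in the QoS section of this page.

```shell
#!/bin/bash
# Illustrative batch script; submit with: sbatch example.sh
#SBATCH --job-name=example
#SBATCH --partition=cml-dpart
#SBATCH --qos=cml-default
#SBATCH --account=cml
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=1-00:00:00

# SLURM_JOB_PARTITION is set by Slurm inside a job. The fallback text keeps
# the script runnable outside the scheduler for a quick syntax check.
report_partition() {
    echo "Running in partition: ${SLURM_JOB_PARTITION:-not under Slurm}"
}

report_partition
```

To target the preemptable or CPU partition instead, change the `--partition`, `--qos`, and `--account` values together, as in the before/after table in the Usage section.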
Accounts
The Center has a base SLURM account, `cml`, which has a modest number of guaranteed billing resources available to all cluster users at any given time. Faculty who have invested in the cluster have an additional account that is made available to the users they sponsor and that guarantees billing resources in proportion to the amount they invested. If you do not specify an account when submitting your job, you will receive the `cml` account.
<pre>
$ sacctmgr show accounts
      Account                Descr                  Org
------------- -------------------- --------------------
  cml-abhinav  abhinav shrivastava                  cml
          cml                  cml                  cml
  cml-furongh         furong huang                  cml
 cml-hajiagha  mohammad hajiaghayi                  cml
     cml-john       john dickerson                  cml
   cml-ramani    ramani duraiswami                  cml
         root default root account                 root
cml-scavenger            scavenger            scavenger
   cml-sfeizi         soheil feizi                  cml
  cml-tokekar       pratap tokekar                  cml
     cml-tomg        tom goldstein                  cml
</pre>
You can check which accounts you are associated with by running the `show_assoc` command. Please contact staff and include your faculty member in the conversation if you do not see the appropriate association.
<pre>
$ show_assoc
      User        Account   Def Acct   Def QOS                                  QOS
---------- -------------- ---------- --------- ------------------------------------
      tomg       cml-tomg                           cml-default,cml-high,cml-medium
      tomg            cml                            cml-cpu,cml-default,cml-medium
      tomg  cml-scavenger                                             cml-scavenger
</pre>
You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command. Make sure to specify the account you are interested in.
<pre>
$ sacctmgr show assoc account=cml-tomg format=user,account,qos,grptres
      User    Account                  QOS       GrpTRES
---------- ---------- -------------------- -------------
             cml-tomg                       billing=8107
</pre>
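When the billing limit is needed in a script, the column-aligned output is awkward to parse. A minimal sketch, assuming the installed `sacctmgr` supports the `--noheader` and `--parsable2` options (which produce `|`-separated fields without a header row); the function name `billing_limit` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "account|grptres" lines (sacctmgr --parsable2
# format) on stdin and print "account limit" for each line whose GrpTRES
# field contains a billing entry. Lines with an empty GrpTRES print nothing.
billing_limit() {
    awk -F'|' '
        {
            n = split($2, tres, ",")            # GrpTRES may list several TRES
            for (i = 1; i <= n; i++)
                if (tres[i] ~ /^billing=/) {
                    sub(/^billing=/, "", tres[i])
                    print $1, tres[i]
                }
        }'
}

# On a submission node this would be fed by something like:
#   sacctmgr --noheader --parsable2 show assoc account=cml-tomg \
#       format=account,grptres | billing_limit
# Here the sample value from the sacctmgr output on this page is used instead.
printf 'cml-tomg|billing=8107\n' | billing_limit
```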
QoS
CML currently has 5 QoS for the cml-dpart partition (though `cml-high_long` and `cml-very_high` may not be available to all faculty accounts), 1 QoS for the cml-scavenger partition, and 1 QoS for the cml-cpu partition. If you do not specify a QoS when submitting your job, you will receive the cml-default QoS, assuming you are using a CML account. The QoS differ in maximum wall time, in the number of jobs you can have running at once, and in the maximum trackable resources (TRES) per job. In the cml-scavenger QoS, you are additionally restricted by the total TRES per user, summed over all of your jobs.
<pre>
$ show_qos
         Name     MaxWall MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES
------------- ----------- ------- ------------------------------ ------------------------------ --------------------
   cml-medium  3-00:00:00       2       cpu=8,gres/gpu=2,mem=64G
  cml-default  7-00:00:00       2       cpu=4,gres/gpu=1,mem=32G
     cml-high  1-12:00:00       2     cpu=16,gres/gpu=4,mem=128G
cml-scavenger  3-00:00:00                                                           gres/gpu=24
       normal
      cml-cpu  7-00:00:00       8
cml-very_high  1-12:00:00       8     cpu=32,gres/gpu=8,mem=256G                    gres/gpu=12
cml-high_long 14-00:00:00       8              cpu=32,gres/gpu=8                     gres/gpu=8
</pre>
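A request that exceeds a QoS's per-job MaxTRES cannot run under that QoS, so it can help to check a request before submitting. The sketch below hard-codes the MaxTRES values of three QoS from the `show_qos` output on this page purely for illustration; the function name `fits_qos` is hypothetical, and the numbers will go stale if staff change the limits, so the live `show_qos` output remains authoritative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check a per-job request against MaxTRES values copied
# from the show_qos output on this page.
# Usage: fits_qos <qos> <cpus> <gpus> <mem_in_GB>
# Returns 0 if the request fits, 1 if it does not, 2 for a QoS not listed here.
fits_qos() {
    local qos=$1 cpu=$2 gpu=$3 mem=$4 max_cpu max_gpu max_mem
    case "$qos" in
        cml-default) max_cpu=4;  max_gpu=1; max_mem=32  ;;
        cml-medium)  max_cpu=8;  max_gpu=2; max_mem=64  ;;
        cml-high)    max_cpu=16; max_gpu=4; max_mem=128 ;;
        *) echo "unknown QoS: $qos" >&2; return 2 ;;
    esac
    (( cpu <= max_cpu && gpu <= max_gpu && mem <= max_mem ))
}

# An 8-CPU, 2-GPU, 64 GB request exceeds cml-default but fits cml-medium.
for q in cml-default cml-medium cml-high; do
    if fits_qos "$q" 8 2 64; then echo "$q: fits"; else echo "$q: too large"; fi
done
```

Note the trade-off visible in the table: the QoS with larger per-job limits have shorter maximum wall times, so the smallest QoS that fits a job also allows it to run longest.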