Nexus/CML

From UMIACS
Revision as of 13:25, 21 July 2023

The CML standalone cluster's compute nodes will fold into Nexus on Thursday, August 17th, 2023 during the scheduled maintenance window for August (5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through leftover funding for the Brendan Iribe Center. Details on common nodes already in the cluster (Tron partition) can be found here.

In addition, the CML cluster's standalone submission node cmlsub00.umiacs.umd.edu will be retired on Thursday, September 21st, 2023 during that month's maintenance window (5-8pm), as it will no longer be able to submit jobs to CML compute nodes after the August maintenance window. Please use nexuscml00.umiacs.umd.edu and nexuscml01.umiacs.umd.edu for any general purpose CML compute needs after this time.

As of Friday, July 14th, 2023, a single 2080 Ti compute node (cml03) has been moved into Nexus to give you a chance to test your new submission scripts now if you would like. Please continue to run your normal CML workloads on the standalone CML cluster for now, as a single compute node will not be able to handle jobs from multiple users simultaneously. Only the cml-dpart and cml-scavenger partitions are available to test with.

Please see the Timeline section below for concrete dates in chronological order.

Please contact staff with any questions or concerns.

Usage

The Nexus cluster submission nodes that are allocated to CML are nexuscml00.umiacs.umd.edu and nexuscml01.umiacs.umd.edu. You must use these nodes to submit jobs to CML compute nodes after the August maintenance window. Submission from cmlsub00.umiacs.umd.edu will no longer work.

All partitions, QoSes, and account names from the standalone CML cluster have been moved over to Nexus in advance of the compute node move. Please note that cml- is prepended to all of the values that are present in the standalone CML cluster to distinguish them from existing values in Nexus. The lone exception is the base account currently named cml in the standalone cluster (it is also named just cml in Nexus).

Here are some before/after examples of job submission with various parameters:

Standalone CML cluster submission command, followed by the equivalent Nexus cluster submission command:

Standalone: srun --partition=dpart --qos=medium --account=tomg --gres=gpu:rtx2080ti:2 --pty bash
Nexus:      srun --partition=cml-dpart --qos=cml-medium --account=cml-tomg --gres=gpu:rtx2080ti:2 --pty bash

Standalone: srun --partition=cpu --qos=cpu --pty bash
Nexus:      srun --partition=cml-cpu --qos=cml-cpu --pty bash

Standalone: srun --partition=scavenger --qos=scavenger --account=scavenger --gres=gpu:4 --pty bash
Nexus:      srun --partition=cml-scavenger --qos=cml-scavenger --account=cml-scavenger --gres=gpu:4 --pty bash
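The renaming rule can also be written as a small shell helper, which may be convenient when updating several submission scripts at once. This is only a sketch: to_nexus_name is a hypothetical function made up for illustration, not a tool provided on the cluster, and it assumes the rule is exactly "prepend cml- to everything except the base account cml".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a UMIACS-provided tool): map a standalone CML
# partition, QoS, or account name to its Nexus equivalent.
# Usage: to_nexus_name <kind> <value>   where kind is partition, qos, or account
to_nexus_name() {
    local kind="$1" value="$2"
    # The base account "cml" is the lone exception and keeps its name.
    if [ "$kind" = "account" ] && [ "$value" = "cml" ]; then
        printf '%s\n' "$value"
    else
        printf 'cml-%s\n' "$value"
    fi
}

# Rebuild the Nexus form of the first before/after example.
echo "srun --partition=$(to_nexus_name partition dpart)" \
     "--qos=$(to_nexus_name qos medium)" \
     "--account=$(to_nexus_name account tomg)" \
     "--gres=gpu:rtx2080ti:2 --pty bash"
```

The --gres value is passed through unchanged, since GPU type names are the same in both clusters in the examples above.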

Only CML users can schedule non-interruptible jobs on the moved nodes, using any non-scavenger job parameters. Note that the cml-dpart partition has a GrpTRES limit of 100% of the available cores/RAM on each set of cml## nodes plus 50% of the available cores/RAM on legacy## nodes, so your job may have to wait if all available cores/RAM (or GPUs) are in use.

The CML compute nodes will also be added to the institute-wide scavenger partition in Nexus. CML users still have scavenging priority over these nodes via the cml-scavenger partition: jobs in any cml- partition other than cml-scavenger can preempt both cml-scavenger and scavenger jobs, and cml-scavenger jobs can preempt scavenger jobs.
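The preemption ordering on the CML nodes can be summarized as three tiers. The sketch below encodes only what this page states; preempt_tier and can_preempt are hypothetical helper names, and the actual behavior is determined by the cluster's Slurm configuration, which may include rules for other Nexus partitions that are not covered here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rank a partition into the preemption tiers described above.
preempt_tier() {
    case "$1" in
        scavenger)     printf '%s\n' 0 ;;   # institute-wide scavenger: lowest tier
        cml-scavenger) printf '%s\n' 1 ;;   # CML scavenger: middle tier
        cml-*)         printf '%s\n' 2 ;;   # every other cml- partition: highest tier
        *)             printf '%s\n' -1 ;;  # partition not covered by this page
    esac
}

# can_preempt A B succeeds when a job in partition A can preempt a job in partition B.
can_preempt() {
    local a b
    a=$(preempt_tier "$1")
    b=$(preempt_tier "$2")
    [ "$a" -ge 0 ] && [ "$b" -ge 0 ] && [ "$a" -gt "$b" ]
}

if can_preempt cml-dpart cml-scavenger; then
    echo "cml-dpart jobs can preempt cml-scavenger jobs"
fi
```

Two partitions in the same tier (for example cml-dpart and cml-cpu) never preempt each other in this sketch.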

Timeline

Each event will be completed within the timeframe specified.

Date | Event
July 14th 2023 | A single compute node (cml03) is moved into Nexus so submission can be tested
August 17th 2023, 5-8pm | All other standalone CML cluster compute nodes are moved into Nexus in corresponding cml- named partitions
September 21st 2023, 5-8pm | cmlsub00.umiacs.umd.edu is taken offline

Post-Migration

Partitions

There are three partitions available to general CML SLURM users. You must specify a partition when submitting your job.

  • cml-dpart - This is the default partition. Job allocations are guaranteed.
  • cml-scavenger - This is the alternate partition. It allows longer run times and more resources per job, but jobs in it can be preempted when jobs in other cml- partitions are ready to be scheduled.
  • cml-cpu - This partition is for CPU-focused jobs. Job allocations are guaranteed.

There is one additional partition available solely to Furong's sponsored accounts.

  • cml-furongh - This partition is for exclusive priority access to Furong's purchased A6000 node.
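For batch jobs, the partition can be set in the script header rather than on the command line. The following is a minimal sketch of a batch script, assuming a user who is associated with the base cml account and the cml-default QoS; the resource values and the one-day time limit are illustrative choices, not recommendations from this page.

```shell
#!/bin/bash
# Minimal batch script sketch; submit it with sbatch from nexuscml00 or nexuscml01.
#SBATCH --partition=cml-dpart     # one of the partitions listed above
#SBATCH --account=cml             # base account (assumed association)
#SBATCH --qos=cml-default         # assumed QoS; pick one you are associated with
#SBATCH --gres=gpu:1              # illustrative request
#SBATCH --cpus-per-task=4         # illustrative request
#SBATCH --mem=32G                 # illustrative request
#SBATCH --time=1-00:00:00         # illustrative one-day limit

# SLURM_JOB_ID is set by Slurm inside a job; fall back to "unknown" elsewhere.
msg="Job ${SLURM_JOB_ID:-unknown} running on $(uname -n)"
echo "$msg"
```

Because the #SBATCH lines are ordinary comments to the shell, the script can also be run directly for a quick syntax check before submitting it.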

Accounts

The Center has a base SLURM account, cml, which has a modest number of guaranteed billing resources available to all cluster users at any given time. Faculty who have invested in the cluster each have an additional account, available to the users they sponsor, which provides guaranteed billing resources corresponding to the amount that they invested. If you do not specify an account when submitting your job, it will run under the cml account.

$ sacctmgr show accounts
    Account                  Descr                  Org
------------- -------------------- --------------------
  cml-abhinav  abhinav shrivastava                  cml
          cml                  cml                  cml
  cml-furongh         furong huang                  cml
 cml-hajiagha  mohammad hajiaghayi                  cml
     cml-john       john dickerson                  cml
   cml-ramani    ramani duraiswami                  cml
         root default root account                 root
cml-scavenger            scavenger            scavenger
   cml-sfeizi         soheil feizi                  cml
  cml-tokekar       pratap tokekar                  cml
     cml-tomg        tom goldstein                  cml

You can check which accounts you are associated with by running the show_assoc command. If you do not see the appropriate association, please contact staff and include your faculty member in the conversation.

$ show_assoc
      User        Account   Def Acct   Def QOS                                  QOS
---------- -------------- ---------- --------- ------------------------------------
      tomg       cml-tomg                           cml-default,cml-high,cml-medium
      tomg            cml                            cml-cpu,cml-default,cml-medium
      tomg  cml-scavenger                                             cml-scavenger

You can also see the total number of trackable resources (TRES) allowed for each account by running the following command. Make sure you specify the account you are interested in. The billing number displayed here is the sum of the resource weightings of all nodes appropriated to that account.

$ sacctmgr show assoc account=cml-tomg format=user,account,qos,grptres
      User    Account                  QOS       GrpTRES
---------- ---------- -------------------- -------------
             cml-tomg                       billing=8107
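If the billing limit is needed in a script, it can be pulled out of the GrpTRES string with standard text tools. This is a sketch only: billing_limit is a hypothetical helper, and the commented sacctmgr invocation assumes that the --noheader (-n) and --parsable2 (-P) options produce one GrpTRES value per line on this cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read GrpTRES strings on stdin (for example
# "billing=8107" or "cpu=4,billing=8107") and print the first billing value.
billing_limit() {
    tr ',' '\n' | sed -n 's/^billing=\([0-9][0-9]*\)$/\1/p' | head -n 1
}

# Sample value taken from the output above.
echo "billing=8107" | billing_limit

# On a submission node, the same helper could be fed from sacctmgr (assumed usage):
# sacctmgr -n -P show assoc account=cml-tomg format=grptres | billing_limit
```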

QoS

CML currently has 5 QoSes for the cml-dpart partition (though cml-high_long and cml-very_high may not be available to all faculty accounts), 1 QoS for the cml-scavenger partition, and 1 QoS for the cml-cpu partition. If you do not specify a QoS when submitting your job, you will receive the cml-default QoS, assuming you are using a CML account. QoSes differ in their maximum wall time, the number of jobs you can run at once, and the maximum trackable resources (TRES) per job. The cml-scavenger QoS adds one more constraint: a cap on the total TRES per user, counted across all of that user's jobs.

$ show_qos
        Name     MaxWall  MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES
------------- ----------- ------- ------------------------------ ------------------------------ --------------------
   cml-medium  3-00:00:00       2       cpu=8,gres/gpu=2,mem=64G
  cml-default  7-00:00:00       2       cpu=4,gres/gpu=1,mem=32G
     cml-high  1-12:00:00       2     cpu=16,gres/gpu=4,mem=128G
cml-scavenger  3-00:00:00                                                           gres/gpu=24
       normal
      cml-cpu  7-00:00:00       8
cml-very_high  1-12:00:00       8     cpu=32,gres/gpu=8,mem=256G                    gres/gpu=12
cml-high_long 14-00:00:00       8              cpu=32,gres/gpu=8                     gres/gpu=8
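Before submitting, it can help to check a planned request against the per-job MaxTRES column. The sketch below hard-codes the values from the show_qos output above as of this revision, so it will go stale if the limits change; fits_qos is a hypothetical helper, and it deliberately ignores MaxJobs, MaxTRESPU, and the QoSes that have no complete cpu/gpu/mem row.

```shell
#!/usr/bin/env bash
# Hypothetical pre-submission check against the MaxTRES values shown above.
# Usage: fits_qos <qos> <cpus> <gpus> <mem_gb>
# Returns 0 if the request fits, 1 if it exceeds a limit, 2 if the QoS is not modeled.
fits_qos() {
    local qos="$1" cpus="$2" gpus="$3" mem="$4"
    local max_cpu max_gpu max_mem
    case "$qos" in
        cml-default)   max_cpu=4;  max_gpu=1; max_mem=32  ;;
        cml-medium)    max_cpu=8;  max_gpu=2; max_mem=64  ;;
        cml-high)      max_cpu=16; max_gpu=4; max_mem=128 ;;
        cml-very_high) max_cpu=32; max_gpu=8; max_mem=256 ;;
        *) return 2 ;;  # e.g. cml-high_long, cml-cpu, cml-scavenger: not modeled here
    esac
    [ "$cpus" -le "$max_cpu" ] && [ "$gpus" -le "$max_gpu" ] && [ "$mem" -le "$max_mem" ]
}

if fits_qos cml-medium 8 2 64; then
    echo "request fits within cml-medium"
fi
```

The scheduler remains the authority on what is accepted; this check only catches obvious mismatches early.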

Data Storage

All data storage that was available on the standalone CML cluster will continue to be available in Nexus.