Nexus/Vulcan

From UMIACS
Revision as of 15:08, 3 August 2023 by Mbaney (talk | contribs)

The Vulcan standalone cluster's compute nodes will fold into Nexus on Thursday, August 17th, 2023 during the scheduled maintenance window for August (5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through leftover funding for the Brendan Iribe Center. Details on common nodes already in the cluster (Tron partition) can be found here.

In addition, the Vulcan cluster's standalone submission nodes vulcansub00.umiacs.umd.edu and vulcansub01.umiacs.umd.edu will be retired on Thursday, September 21st, 2023 during that month's maintenance window (5-8pm), as they will no longer be able to submit jobs to Vulcan compute nodes after the August maintenance window. Please use nexusvulcan00.umiacs.umd.edu and nexusvulcan01.umiacs.umd.edu for any general purpose Vulcan compute needs after this time.

As of Friday, July 21st, 2023, two 1080 Ti compute nodes (vulcan11 and vulcan12) have been moved into Nexus to give you a chance to test your new submission scripts now if you would like. Please continue to run your normal Vulcan workloads on the standalone Vulcan cluster for now, as two compute nodes will not be able to handle jobs from multiple users simultaneously. Only the vulcan-dpart and vulcan-scavenger partitions are available to test with.

Please see the Timeline section below for concrete dates in chronological order.

Please contact staff with any questions or concerns.

Usage

The Nexus cluster submission nodes that are allocated to Vulcan are nexusvulcan00.umiacs.umd.edu and nexusvulcan01.umiacs.umd.edu. You must use these nodes to submit jobs to Vulcan compute nodes after the August maintenance window. Submission from vulcansub00.umiacs.umd.edu or vulcansub01.umiacs.umd.edu will no longer work.

All partitions, QoSes, and account names from the standalone Vulcan cluster are being moved over to Nexus when the compute nodes move. However, please note that vulcan- will be prepended to all of the values that were present in the standalone Vulcan cluster to distinguish them from existing values in Nexus. The lone exception is the base account currently named vulcan in the standalone cluster, which will retain the same name.

Here are some before/after examples of job submission with various parameters:

Example 1
  Standalone Vulcan cluster submission command:
    srun --partition=dpart --qos=medium --account=abhinav --gres=gpu:rtxa4000:2 --pty bash
  Nexus cluster submission command:
    srun --partition=vulcan-dpart --qos=vulcan-medium --account=vulcan-abhinav --gres=gpu:rtxa4000:2 --pty bash

Example 2
  Standalone Vulcan cluster submission command:
    srun --partition=cpu --qos=cpu --pty bash
  Nexus cluster submission command:
    srun --partition=vulcan-cpu --qos=vulcan-cpu --pty bash

Example 3
  Standalone Vulcan cluster submission command:
    srun --partition=scavenger --qos=scavenger --account=vulcan --gres=gpu:4 --pty bash
  Nexus cluster submission command:
    srun --partition=vulcan-scavenger --qos=vulcan-scavenger --account=vulcan --gres=gpu:4 --pty bash
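The renaming rule described above can be sketched as a small shell helper for updating old submission scripts. This is only an illustration: to_nexus_name is a hypothetical function name, not a tool provided on the cluster, and it assumes you pass it a name exactly as it was used on the standalone Vulcan cluster.

```shell
# Sketch: map a standalone Vulcan partition/QoS/account name to its Nexus name.
# to_nexus_name is a hypothetical helper, not a site-provided command.
# Usage: to_nexus_name KIND NAME   (KIND is partition, qos, or account)
to_nexus_name() {
    local kind="$1" name="$2"
    if [ "$kind" = "account" ] && [ "$name" = "vulcan" ]; then
        # The base account "vulcan" is the lone exception and keeps its name.
        printf '%s\n' "vulcan"
    else
        # Every other value gains the "vulcan-" prefix.
        printf '%s\n' "vulcan-$name"
    fi
}

to_nexus_name partition dpart
to_nexus_name qos medium
to_nexus_name account abhinav
to_nexus_name account vulcan
```

The four calls correspond to values used in the before/after examples. GRES strings such as gpu:rtxa4000:2 are unchanged between the two clusters in those examples, so the helper does not touch them.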

Vulcan users (exclusively) can schedule non-interruptible jobs on the moved nodes with any non-scavenger job parameters. Please note that the vulcan-dpart partition will have a GrpTRES limit of 100% of the available cores/RAM on vulcan## nodes plus 50% of the available cores/RAM on legacy## nodes, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a max submission limit of 500 jobs simultaneously so as to not overload the cluster. This is codified by the partition QoS named vulcan.

Please note that the Vulcan compute nodes will also be added to the institute-wide scavenger partition in Nexus. Vulcan users will still have scavenging priority over these nodes via the vulcan-scavenger partition. That is, jobs in any vulcan- partition other than vulcan-scavenger can preempt both vulcan-scavenger and scavenger jobs, and vulcan-scavenger jobs can preempt scavenger jobs.
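To make the preemption ordering on Vulcan nodes easier to reason about, here is a minimal sketch that models it as three tiers. The function names partition_tier and can_preempt are hypothetical, and the tier numbers are only an illustration of the ordering stated above; the actual behavior is determined by the Slurm configuration, not by this script.

```shell
# Sketch: model the preemption ordering on Vulcan nodes as numeric tiers.
# partition_tier and can_preempt are hypothetical helpers for illustration.
partition_tier() {
    case "$1" in
        scavenger)        echo 0 ;;   # institute-wide scavenger partition
        vulcan-scavenger) echo 1 ;;   # Vulcan scavenger partition
        vulcan-*)         echo 2 ;;   # all other vulcan- partitions
        *)                return 1 ;; # partition not covered by this section
    esac
}

# can_preempt A B: succeeds (exit 0) if a job in partition A can preempt
# a job in partition B, fails with 1 if not, and with 2 for unknown names.
can_preempt() {
    local a b
    a="$(partition_tier "$1")" || return 2
    b="$(partition_tier "$2")" || return 2
    [ "$a" -gt "$b" ]
}

if can_preempt vulcan-dpart vulcan-scavenger; then
    echo "vulcan-dpart jobs can preempt vulcan-scavenger jobs"
fi
```

Jobs in the same tier (for example vulcan-dpart and vulcan-cpu) do not preempt each other in this model.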

Timeline

Each event will be completed within the timeframe specified.

Date                         Event
July 21st 2023               Two compute nodes (vulcan11 and vulcan12) are moved into Nexus so submission can be tested
August 17th 2023, 5-8pm      All other standalone Vulcan cluster compute nodes are moved into Nexus in corresponding vulcan- named partitions
September 21st 2023, 5-8pm   vulcansub00.umiacs.umd.edu and vulcansub01.umiacs.umd.edu are taken offline

Migration

Home Directories

Nexus uses NFShomes home directories. If your UMIACS account was created before February 22nd, 2023, you have been using /cfarhomes/<username> as your home directory on the standalone Vulcan cluster. While /cfarhomes is available on Nexus, your shell initialization scripts from it will not load automatically. Please copy anything you need over to your /nfshomes/<username> directory at your earliest convenience, as /cfarhomes may be retired in the coming year.
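One possible way to carry over shell initialization scripts is sketched below. The function name copy_init_files and the list of file names are assumptions for illustration; adjust the list to the files you actually use. The function never overwrites a file that already exists in the destination.

```shell
# Sketch: copy shell init files from an old home directory to a new one.
# copy_init_files is a hypothetical helper; the file list is an assumption.
# Usage: copy_init_files SRC_DIR DST_DIR
copy_init_files() {
    local src="$1" dst="$2" f
    for f in .bashrc .bash_profile .profile .zshrc; do
        # Copy only regular files that exist in SRC and are absent in DST.
        if [ -f "$src/$f" ] && [ ! -e "$dst/$f" ]; then
            cp -p "$src/$f" "$dst/$f" && echo "copied $f"
        fi
    done
}

# Example call, assuming $USER matches your UMIACS username:
# copy_init_files "/cfarhomes/$USER" "/nfshomes/$USER"
```

Because existing files are skipped, you may want to compare any skipped file by hand and merge the settings you still need.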

Post-Migration

Partitions

There are three partitions available to general Vulcan SLURM users. You must specify a partition when submitting your job.

  • vulcan-dpart - This is the default partition. Job allocations are guaranteed.
  • vulcan-scavenger - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other vulcan- partitions are ready to be scheduled.
  • vulcan-cpu - This partition is for CPU focused jobs. Job allocations are guaranteed.

There are a few additional partitions available to subsets of Vulcan users based on specific requirements.
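The examples on this page use srun, but the same --partition, --qos, --account, and --gres options are also accepted by sbatch. As a minimal sketch, the hypothetical helper below prints a batch script for a given combination. The one-hour --time value and the hostname payload are placeholders chosen for illustration, not site recommendations.

```shell
# Sketch: print a minimal sbatch script for a given partition/QoS/account/GRES.
# print_batch_script is a hypothetical helper; --time and the job body
# (hostname) are placeholder assumptions.
# Usage: print_batch_script PARTITION QOS ACCOUNT GRES
print_batch_script() {
    local partition="$1" qos="$2" account="$3" gres="$4"
    cat <<EOF
#!/bin/bash
#SBATCH --partition=$partition
#SBATCH --qos=$qos
#SBATCH --account=$account
#SBATCH --gres=$gres
#SBATCH --time=01:00:00

hostname
EOF
}

print_batch_script vulcan-dpart vulcan-default vulcan gpu:1
```

Redirecting the output to a file (for example job.sh) and passing that file to sbatch from one of the nexusvulcan submission nodes would submit it.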

Accounts

Vulcan has a base SLURM account, vulcan, which has a modest number of guaranteed billing resources available to all cluster users at any given time. Faculty who have invested in the cluster have an additional account available to the users they sponsor on the cluster, which provides a number of guaranteed billing resources corresponding to the amount that they invested. If you do not specify an account when submitting your job, you will receive the vulcan account.

$ sacctmgr show account format=account%20,description%30,organization%10
             Account                          Descr        Org
-------------------- ------------------------------ ----------
                 ...
              vulcan                         vulcan     vulcan
      vulcan-abhinav   vulcan - abhinav shrivastava     vulcan
      vulcan-djacobs          vulcan - david jacobs     vulcan
        vulcan-janus                 vulcan - janus     vulcan
      vulcan-jbhuang         vulcan - jia-bin huang     vulcan
          vulcan-lsd           vulcan - larry davis     vulcan
      vulcan-metzler         vulcan - chris metzler     vulcan
         vulcan-rama        vulcan - rama chellappa     vulcan
       vulcan-ramani     vulcan - ramani duraiswami     vulcan
        vulcan-yaser          vulcan - yaser yacoob     vulcan
      vulcan-zwicker      vulcan - matthias zwicker     vulcan
                 ...

You can check your account associations by running the show_assoc command to see the accounts you are associated with. Please contact staff and include your faculty member in the conversation if you do not see the appropriate association.

$ show_assoc
      User          Account MaxJobs       GrpTRES                                                                              QOS
---------- ---------------- ------- ------------- --------------------------------------------------------------------------------
       ...
   abhinav          abhinav      48                           vulcan-cpu,vulcan-default,vulcan-high,vulcan-medium,vulcan-scavenger
   abhinav           vulcan      48                                       vulcan-cpu,vulcan-default,vulcan-medium,vulcan-scavenger
       ...

You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command. Make sure to specify the account you are interested in.

$ sacctmgr show assoc account=vulcan format=user,account,qos,grptres
      User    Account                  QOS       GrpTRES
---------- ---------- -------------------- -------------
               vulcan                        gres/gpu=64
                  ...

QoS

You need to choose a QoS to submit with; it sets certain limits on your job. If you do not specify a QoS when submitting your job, you will receive the vulcan-default QoS, assuming you are using a Vulcan account.

One of the vulcan-default, vulcan-medium, or vulcan-high QoSes is required for the vulcan-dpart partition. Pass your choice to your submission commands with --qos.

The show_qos command below lists the current QoSes and the limits each one has.

$ show_qos
                Name     MaxWall                        MaxTRES MaxJobsPU MaxSubmitPU                      MaxTRESPU              GrpTRES
-------------------- ----------- ------------------------------ --------- ----------- ------------------------------ --------------------
                 ...         ...                            ...       ...         ...                            ...                  ...
       vulcan-medium  3-00:00:00       cpu=8,gres/gpu=2,mem=64G         2
         vulcan-high  1-12:00:00     cpu=16,gres/gpu=4,mem=128G         2
      vulcan-default  7-00:00:00       cpu=4,gres/gpu=1,mem=32G         2
    vulcan-scavenger  3-00:00:00     cpu=32,gres/gpu=8,mem=256G
        vulcan-janus  3-00:00:00    cpu=32,gres/gpu=10,mem=256G
       vulcan-exempt  7-00:00:00     cpu=32,gres/gpu=8,mem=256G         2
          vulcan-cpu  2-00:00:00                cpu=1024,mem=4T         4
    vulcan-exclusive 30-00:00:00
       vulcan-sailon  3-00:00:00     cpu=32,gres/gpu=8,mem=256G                                          gres/gpu=48
                 ...         ...                            ...       ...         ...                            ...                  ...
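As a rough aid for choosing among the three QoSes accepted by vulcan-dpart, the sketch below checks a request against the MaxTRES values shown in the listing above. The function names fits_qos and smallest_dpart_qos are hypothetical, the limits are hard-coded from this listing and may change, and the check ignores MaxWall and per-user job limits.

```shell
# Sketch: check a request against the MaxTRES limits of the vulcan-dpart QoSes.
# fits_qos and smallest_dpart_qos are hypothetical helpers; the numbers are
# copied from the show_qos listing above and may go out of date.
# Usage: fits_qos QOS CPUS GPUS MEM_GB
fits_qos() {
    local qos="$1" cpus="$2" gpus="$3" mem_gb="$4" max_cpu max_gpu max_mem
    case "$qos" in
        vulcan-default) max_cpu=4;  max_gpu=1; max_mem=32 ;;
        vulcan-medium)  max_cpu=8;  max_gpu=2; max_mem=64 ;;
        vulcan-high)    max_cpu=16; max_gpu=4; max_mem=128 ;;
        *)              return 2 ;;   # QoS not covered by this sketch
    esac
    [ "$cpus" -le "$max_cpu" ] && [ "$gpus" -le "$max_gpu" ] && [ "$mem_gb" -le "$max_mem" ]
}

# smallest_dpart_qos CPUS GPUS MEM_GB: print the first QoS whose limits fit.
smallest_dpart_qos() {
    local q
    for q in vulcan-default vulcan-medium vulcan-high; do
        if fits_qos "$q" "$@"; then
            echo "$q"
            return 0
        fi
    done
    return 1   # request exceeds every vulcan-dpart QoS
}

smallest_dpart_qos 8 2 64
```

Note that in the listing the larger QoSes come with shorter MaxWall values (7 days for vulcan-default, 3 days for vulcan-medium, 1 day 12 hours for vulcan-high), so the smallest QoS that fits also allows the longest run time.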

Data Storage

All data storage that was available on the standalone Vulcan cluster will continue to be available in Nexus.

Vulcan users can also request Nexus project allocations.