Nexus/CML

From UMIACS
Revision as of 18:24, 3 July 2023 by Mbaney

The CML standalone cluster's compute nodes will be folded into Nexus on Thursday, August 17th, 2023, during that month's scheduled maintenance window.

The Nexus cluster already has a large pool of compute resources made possible through leftover funding for the Brendan Iribe Center. Details on common nodes already in the cluster (Tron partition) can be found here.

In addition, the CML cluster's standalone submission node, cmlsub00.umiacs.umd.edu, will be retired on Thursday, September 21st, 2023, during that month's maintenance window, because it will no longer be able to submit jobs to CML compute nodes after the August maintenance window. Once it is retired, please use nexuscml00.umiacs.umd.edu and nexuscml01.umiacs.umd.edu for any general purpose CML compute needs.

Please see the Timeline section below for concrete dates in chronological order.

Please contact staff with any questions or concerns.

Usage

The Nexus cluster submission nodes that are allocated to CML are nexuscml00.umiacs.umd.edu and nexuscml01.umiacs.umd.edu. You must use these nodes to submit jobs to CML compute nodes after the August maintenance window. Submission from cmlsub00.umiacs.umd.edu will no longer work.

All partitions, QoSes, and account names from the standalone CML cluster are being moved over to Nexus when the compute nodes move. However, please note that the prefix cml- will be prepended to every partition, QoS, and account name carried over from the standalone CML cluster, to distinguish them from existing names in Nexus.

Here are some before/after examples of job submission with various parameters:

Standalone CML cluster submission command | Nexus cluster submission command
srun --partition=dpart --qos=medium --account=tomg --gres=gpu:rtxa4000:2 --pty bash | srun --partition=cml-dpart --qos=cml-medium --account=cml-tomg --gres=gpu:rtxa4000:2 --pty bash
srun --partition=cpu --qos=cpu --pty bash | srun --partition=cml-cpu --qos=cml-cpu --pty bash
srun --partition=scavenger --qos=scavenger --account=scavenger --gres=gpu:4 --pty bash | srun --partition=cml-scavenger --qos=cml-scavenger --account=cml-scavenger --gres=gpu:4 --pty bash
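
The before/after pairs above all apply the same rule, so existing commands can be converted mechanically. The following is a minimal sketch rather than an official UMIACS tool: the function name to_nexus_cmd is hypothetical, and it assumes options are written in the long --option=value form used in the examples (short forms such as -p are not handled).

```shell
# Hypothetical helper (not part of Slurm or any UMIACS tooling): rewrite the
# --partition, --qos, and --account values in text read from stdin so that
# each carries the cml- prefix. Values that already have it are left alone.
to_nexus_cmd() {
    sed -E 's/--(partition|qos|account)=(cml-)?([^ ]+)/--\1=cml-\3/g'
}

# Print the converted command for review; nothing is submitted here.
echo 'srun --partition=dpart --qos=medium --account=tomg --gres=gpu:rtxa4000:2 --pty bash' | to_nexus_cmd
```

Because the rewrite works line by line, the same function could also be applied to a batch script (for example, to_nexus_cmd < old_job.sh > new_job.sh, where both filenames are placeholders), provided its #SBATCH lines use the same --option=value form.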

CML users (exclusively) can schedule non-interruptible jobs on the moved nodes with these job parameters. Please note that each partition has a GrpTRES limit of 100% of the available cores/RAM on each set of clip## nodes plus 50% of the available cores/RAM on legacy## nodes, so your job may need to wait if all available cores/RAM (or GPUs) are in use.

Please note that the CML compute nodes will also be added to the institute-wide scavenger partition in Nexus. CML users will still have scavenging priority over these nodes via the cml-scavenger partition. Specifically, jobs in any cml- partition other than cml-scavenger can preempt both cml-scavenger and scavenger jobs, and cml-scavenger jobs can preempt scavenger jobs.
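
The preemption order described above can be read as three tiers. The sketch below only models that description for illustration; it does not query Slurm, the function names preempt_tier and can_preempt are hypothetical, and partitions outside the three categories named here are deliberately treated as unknown.

```shell
# Hypothetical model of the preemption tiers described in this section.
# Higher tier numbers can preempt lower ones.
preempt_tier() {
    case "$1" in
        scavenger)     echo 0 ;;   # institute-wide scavenger partition
        cml-scavenger) echo 1 ;;   # CML scavenger partition
        cml-*)         echo 2 ;;   # any other cml- partition
        *)             return 1 ;; # not covered by this section
    esac
}

# Succeeds (exit 0) if jobs in partition $1 can preempt jobs in partition $2.
# Returns 2 if either partition is outside the tiers modeled above.
can_preempt() {
    local a b
    a=$(preempt_tier "$1") || return 2
    b=$(preempt_tier "$2") || return 2
    [ "$a" -gt "$b" ]
}

if can_preempt cml-dpart scavenger; then
    echo "cml-dpart jobs can preempt scavenger jobs"
fi
```

Under this model, two partitions in the same tier (for example cml-dpart and cml-cpu) do not preempt each other, which is an assumption consistent with, but not stated by, the text above.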

Timeline

Unless a specific time is listed, events may begin as early as 9am US Eastern time on the date indicated. Each event will be completed by the end of that business week (Friday at 5pm) or within the timeframe specified.

Date | Event
August 17th 2023, 5-8pm | All standalone CML cluster compute nodes are moved into Nexus in corresponding cml- named partitions
September 21st 2023, 5-8pm | cmlsub00.umiacs.umd.edu is taken offline
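
As a rough illustration of how the timeline affects job submission, the sketch below maps a date to the submission node(s) that can reach the CML compute nodes. The function name cml_submit_hosts is hypothetical, and treating all of August 17th as still belonging to the standalone cluster is a simplifying assumption, since the move happens during that evening's 5-8pm window.

```shell
# Hypothetical helper: given a date as YYYY-MM-DD, print the submission
# node(s) that can submit jobs to the CML compute nodes on that date.
cml_submit_hosts() {
    local ymd
    ymd=$(printf '%s' "$1" | tr -d '-')   # e.g. 2023-08-18 -> 20230818
    if [ "$ymd" -le 20230817 ]; then
        # Assumption: the nodes count as standalone through August 17th.
        echo "cmlsub00.umiacs.umd.edu"
    else
        # After the move, only the Nexus CML submission nodes can reach them.
        echo "nexuscml00.umiacs.umd.edu nexuscml01.umiacs.umd.edu"
    fi
}

cml_submit_hosts 2023-08-18
```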