Nexus/CML
Revision as of 18:23, 3 July 2023
The CML standalone cluster's compute nodes will fold into Nexus on Thursday, August 17th, 2023 during the scheduled maintenance window for August.
The Nexus cluster already has a large pool of compute resources made possible through leftover funding for the Brendan Iribe Center. Details on common nodes already in the cluster (Tron partition) can be found on the Nexus/Tron page.
In addition, the CML cluster's standalone submission node cmlsub00.umiacs.umd.edu will be retired on Thursday, September 21st, 2023 during that month's maintenance window, as it will no longer be able to submit jobs to CML compute nodes after the August maintenance window. Please use nexuscml00.umiacs.umd.edu and nexuscml01.umiacs.umd.edu for any general-purpose CML compute needs after this time.
Please see the Timeline section below for concrete dates in chronological order.
Please contact staff with any questions or concerns.
Usage
The Nexus cluster submission nodes that are allocated to CML are nexuscml00.umiacs.umd.edu and nexuscml01.umiacs.umd.edu. You must use these nodes to submit jobs to CML compute nodes after the August maintenance window. Submission from cmlsub00.umiacs.umd.edu will no longer work.
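As a small illustration of the rule above, a job wrapper script could check which host it is running on before calling Slurm. This is only a sketch: the function name `is_cml_submit_host` is hypothetical, and the host list is taken directly from the paragraph above.

```shell
# Hypothetical helper: succeeds only for the Nexus submission nodes allocated to CML.
is_cml_submit_host() {
  case "$1" in
    nexuscml00.umiacs.umd.edu|nexuscml01.umiacs.umd.edu) return 0 ;;
    *) return 1 ;;
  esac
}

# The retired standalone node is rejected, so a warning is printed.
is_cml_submit_host cmlsub00.umiacs.umd.edu \
  || echo "Submit from nexuscml00 or nexuscml01 instead."
```

In a real wrapper you would likely pass the output of `hostname -f` instead of a literal name, assuming that command reports the fully qualified name on these nodes.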
All partitions, QoSes, and account names from the standalone CML cluster are being moved over to Nexus when the compute nodes move. However, please note that cml- will be prepended to all of the values that were present in the standalone CML cluster to distinguish them from existing values in Nexus.
Here are some before/after examples of job submission with various parameters:
| Standalone CML cluster submission command | Nexus cluster submission command | 
|---|---|
| srun --partition=dpart --qos=medium --account=tomg --gres=gpu:rtxa4000:2 --pty bash | srun --partition=cml-dpart --qos=cml-medium --account=cml-tomg --gres=gpu:rtxa4000:2 --pty bash | 
| srun --partition=cpu --qos=cpu --pty bash | srun --partition=cml-cpu --qos=cml-cpu --pty bash | 
| srun --partition=scavenger --account=scavenger --qos=scavenger --gres=gpu:4 --pty bash | srun --partition=cml-scavenger --account=cml-scavenger --qos=cml-scavenger --gres=gpu:4 --pty bash | 
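If you have many saved commands from the standalone cluster, the renaming shown in the table can be applied mechanically. The following is a minimal sketch, not an official tool: the function name `to_nexus_cml` is hypothetical, and it only handles the `--flag=value` form used in the table (not short options such as `-p`, or values separated from the flag by a space).

```shell
# Hypothetical helper: prepend "cml-" to the partition, QoS, and account values
# of a command line. Values that already start with "cml-" are left unchanged.
to_nexus_cml() {
  printf '%s\n' "$1" | sed -E 's/--(partition|qos|account)=(cml-)?/--\1=cml-/g'
}

to_nexus_cml 'srun --partition=cpu --qos=cpu --pty bash'
# prints: srun --partition=cml-cpu --qos=cml-cpu --pty bash
```

Other options, such as `--gres`, are passed through untouched, which matches the before/after rows above.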
CML users (exclusively) can schedule non-interruptible jobs on the moved nodes with these job parameters. Please note that each partition has a GrpTRES limit of 100% of the available cores/RAM on each set of clip## nodes plus 50% of the available cores/RAM on legacy## nodes, so your job may need to wait if all available cores/RAM (or GPUs) are in use.
Please note that the CML compute nodes will also be added to the institute-wide scavenger partition in Nexus. CML users will still have scavenging priority over these nodes via the cml-scavenger partition. That is, all cml- queue jobs other than cml-scavenger can preempt both cml-scavenger and scavenger queue jobs, and cml-scavenger queue jobs can preempt scavenger queue jobs.
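The preemption order described above can be restated as a simple ranking. This sketch only illustrates that ordering for jobs on the CML nodes. It is not the cluster's actual Slurm configuration, and the function names `preempt_rank` and `can_preempt` are hypothetical.

```shell
# Rank a partition by the order described above (a higher rank preempts a lower one).
preempt_rank() {
  case "$1" in
    scavenger)     echo 0 ;;  # institute-wide scavenger partition
    cml-scavenger) echo 1 ;;  # CML scavenger partition
    cml-*)         echo 2 ;;  # every other cml- partition
    *)             echo "unknown partition: $1" >&2; return 1 ;;
  esac
}

# Succeeds if a job in partition $1 can preempt a job in partition $2.
can_preempt() {
  local a b
  a=$(preempt_rank "$1") || return 2
  b=$(preempt_rank "$2") || return 2
  [ "$a" -gt "$b" ]
}

can_preempt cml-dpart cml-scavenger && echo "cml-dpart can preempt cml-scavenger"
```

Jobs in partitions of equal rank do not preempt each other in this sketch, which is consistent with the text above but is an assumption about cases it does not spell out.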
Timeline
Unless otherwise noted, each event may begin as early as 9am US Eastern time on the date indicated. Each event will be completed within that business week (i.e., by Friday at 5pm) or within the timeframe specified.
| Date | Event | 
|---|---|
| August 17th 2023, 5-8pm | All standalone CML cluster compute nodes are moved into Nexus in corresponding cml- named partitions |
| September 21st 2023, 5-8pm | cmlsub00.umiacs.umd.edu is taken offline |