Latest revision as of 18:54, 3 December 2024

The compute nodes of CLIP's previous standalone cluster have been folded into Nexus as of late 2022.

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on the common nodes already in the cluster (Tron partition) can be found on the Nexus/Tron page.

Please contact staff with any questions or concerns.

Submission Nodes

You can SSH to nexusclip.umiacs.umd.edu to log in to a submission node.

If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:

nexusclip00.umiacs.umd.edu
nexusclip01.umiacs.umd.edu
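
Because /tmp and /scratch0 are local to each submission node, it can help to connect to a named node rather than the nexusclip alias, which may land on either one. The following is a minimal sketch: the helper name clip_submit_host is hypothetical (not part of any UMIACS tooling) and username is a placeholder.

```shell
# Hypothetical helper: print the fully qualified name of one submission node.
# $1 is the node index; only 0 and 1 are listed on this page.
clip_submit_host() {
    printf 'nexusclip%02d.umiacs.umd.edu\n' "$1"
}

# Print both names so the right one can be copied into an ssh command.
clip_submit_host 0
clip_submit_host 1

# Example (not executed here; "username" is a placeholder):
#   ssh "username@$(clip_submit_host 0)"
```

Connecting to the same index each time keeps files written under /tmp or /scratch0 reachable in later sessions.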

Compute Nodes

The CLIP partition has nodes brought over from the previous standalone CLIP Slurm scheduler as well as some more recent purchases. The compute nodes are named clip##.

Network

The network infrastructure supporting the CLIP partition consists of:

One pair of network switches connected to each other via dual 25GbE links for redundancy, serving the following compute nodes:
- clip[00-03,05,07-08,10]: Two 10GbE links per node, one to each switch in the pair (redundancy).
- clip04: Two 40GbE links per node, one to each switch in the pair (redundancy).
- clip06: Two 25GbE links per node, one to each switch in the pair (redundancy).
- clip09: Two 1GbE links per node, one to each switch in the pair (redundancy).
- clip[11-13]: Two 100GbE links per node, one to each switch in the pair (redundancy).
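
The bracketed names in the list above use Slurm's hostlist notation. On a node with Slurm installed, scontrol show hostnames 'clip[00-03,05,07-08,10]' expands such an expression. As an illustration of what the notation means, here is a small bash sketch that does the same for a single bracket group; the function name expand_hostlist is hypothetical and it handles only the simple forms used on this page.

```shell
# Hypothetical helper: expand a hostlist such as clip[00-03,05] into one
# host name per line. Handles a single bracket group only (bash required).
expand_hostlist() {
    local spec=$1
    local prefix=${spec%%\[*}          # text before the first "["
    if [ "$prefix" = "$spec" ]; then   # no brackets: already a plain name
        printf '%s\n' "$spec"
        return
    fi
    local body=${spec#*\[}             # text after the "["
    body=${body%\]}                    # drop the trailing "]"
    local part lo hi i
    local IFS=,
    for part in $body; do
        lo=${part%-*}                  # range start (or the single value)
        hi=${part#*-}                  # range end (same as lo if no "-")
        # 10# forces base 10 so values like 08 are not read as octal.
        for ((i = 10#$lo; i <= 10#$hi; i++)); do
            # Zero-pad to the same width as the original number.
            printf '%s%0'"${#lo}"'d\n' "$prefix" "$i"
        done
    done
}

expand_hostlist 'clip[00-03,05,07-08,10]'
```

The call at the end prints the eight node names covered by the first entry in the list.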

For a broader overview of the network infrastructure supporting the Nexus cluster, please see Nexus/Network.

QoS

CLIP users have access to all of the standard job QoSes in the clip partition using the clip account.

The CLIP partition also has the following additional job QoS:

huge-long: Allows for longer jobs using higher overall resources.

Please note that the partition has a GrpTRES limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.
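
The arithmetic behind the limit described above can be sketched as follows. The core totals used here are made-up numbers for illustration only, not the real capacity of the clip## or legacy## nodes, and the function name grptres_core_limit is hypothetical.

```shell
# Hypothetical helper: aggregate core limit as described above, i.e.
# 100% of the partition-specific cores plus 50% of the legacy## cores.
# $1 = total cores on partition-specific nodes, $2 = total cores on legacy## nodes.
grptres_core_limit() {
    echo $(( $1 + $2 / 2 ))   # integer division; exact for even totals
}

# Made-up totals purely for illustration.
grptres_core_limit 400 200
```

With these illustrative inputs the sketch prints 500. The same calculation applies to RAM; once running jobs in the partition reach the aggregate figure, further jobs wait in the queue.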

Jobs

You will need to specify --partition=clip and --account=clip to be able to submit jobs to the CLIP partition.

[username@nexusclip00:~ ] $ srun --pty --ntasks=4 --mem=8G --qos=default --partition=clip --account=clip --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@clip00:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
   UserId=username(1000) GroupId=username(21000) MCS_label=N/A
   Priority=897 Nice=0 Account=clip QOS=default
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
   AccrueTime=2022-11-18T11:13:56
   StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
   PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
   Partition=clip AllocNode:Sid=nexusclip00:25443
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=clip00
   BatchHost=clip00
   NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=8G,node=1,billing=2266
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=8G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/nfshomes/username
   Power=
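
The same resource request can also be made non-interactively. The following is a sketch of a batch script that mirrors the srun options above, assuming it is saved to a file and submitted with sbatch. The job name, output file name, and job_summary helper are hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=clip-example          # hypothetical job name
#SBATCH --partition=clip                 # required for the CLIP partition
#SBATCH --account=clip                   # required for the CLIP partition
#SBATCH --qos=default
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --time=1-00:00:00
#SBATCH --output=clip-example-%j.out     # %j is replaced by the job ID

# SLURM_JOB_ID and SLURM_NTASKS are set by Slurm inside a job; the
# fallbacks let the script also run outside the scheduler for a dry run.
job_summary() {
    echo "job=${SLURM_JOB_ID:-none} tasks=${SLURM_NTASKS:-0} host=$(hostname)"
}

job_summary
```

The #SBATCH lines are read by sbatch and are ordinary comments to bash, so the script can be dry-run on a submission node before it is submitted.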

Storage

All data filesystems that were available in the standalone CLIP cluster are also available in Nexus.

CLIP users can also request Nexus project allocations.

@@ Line 1: / Line 1: @@
-==Overview==
+The previous standalone cluster for [https://wiki.umiacs.umd.edu/clip/index.php/Main_Page CLIP]'s compute nodes have folded into [[Nexus]] as of late 2022.
-The [https://wiki.umiacs.umd.edu/clip/index.php/Main_Page CLIP] lab's cluster compute nodes will be gradually folded into UMIACS' new [[Nexus]] cluster beginning on Monday, July 25th, 2022 at 9am in order to further the goal of consolidating all compute nodes in UMIACS onto one common [[SLURM]] scheduler.
-The Nexus cluster already has a large pool of compute resources made possible through leftover funding for the [[Iribe | Brendan Iribe Center]]. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].
+The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].
-As part of the transition, compute nodes will be reinstalled with Red Hat Enterprise Linux 8 (RHEL8) as their operating system. The nodes are currently installed with Red Hat Enterprise Linux 7 (RHEL7) as is. Their names will also change to be just <code>clip##</code> for consistency with Nexus' naming scheme.
+Please [[HelpDesk | contact staff]] with any questions or concerns.
-Data stored on the local scratch drives of compute nodes (/scratch0, /scratch1, etc.) will not persist through the reinstalls. Please secure all data in these local scratch drives to a network attached storage location prior to each nodes' move date as listed below.
+= Submission Nodes =
+You can [[SSH]] to <code>nexusclip.umiacs.umd.edu</code> to log in to a submission node.
-You may need to re-compile or re-link your applications due to the changes to the underlying operating system libraries. We have tried to maintain a similar set of software in our GNU [[Modules]] software trees for both operating systems. However, you may need to let us know if there is something missing after the upgrades.
+If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
+* <code>nexusclip00.umiacs.umd.edu</code>
+* <code>nexusclip01.umiacs.umd.edu</code>
-In addition, the general purpose nodes <code>context00.umiacs.umd.edu</code> and <code>context01.umiacs.umd.edu</code> will be retired on Monday, September 5th, 2022 at 9am. Please use <code>clipsub00.umiacs.umd.edu</code> and <code>clipsub01.umiacs.umd.edu</code> for any general purpose CLIP compute needs after this time.
+= Compute Nodes =
+The CLIP partition has nodes brought over from the previous standalone CLIP Slurm scheduler as well as some more recent purchases. The compute nodes are named <code>clip##</code>.
-Please see the [[#Timeline | Timeline]] section below for concrete dates in chronological order.
+= Network =
+The network infrastructure supporting the CLIP partition consists of:
+# One pair of network switches connected to each other via dual 25GbE links for redundancy, serving the following compute nodes:
+#* clip[00-03,05,07-08,10]: Two 10GbE links per node, one to each switch in the pair (redundancy).
+#* clip04: Two 40GbE links per node, one to each switch in the pair (redundancy).
+#* clip06: Two 25GbE links per node, one to each switch in the pair (redundancy).
+#* clip09: Two 1GbE links per node, one to each switch in the pair (redundancy).
+#* clip[11-13]: Two 100GbE links per node, one to each switch in the pair (redundancy).
-Please [[HelpDesk | contact staff]] with any questions or concerns.
+For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].
+= QoS =
+CLIP users have access to all of the [[Nexus#Quality_of_Service_.28QoS.29 | standard job QoSes]] in the <code>clip</code> partition using the <code>clip</code> account.
+The additional job QoSes for the CLIP partition specifically are:
+* <code>huge-long</code>: Allows for longer jobs using higher overall resources.
+Please note that the partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.
-==Usage==
+= Jobs =
-As compute nodes are folded into the Nexus cluster, CLIP users (exclusively) will be able to schedule non-interruptible jobs on the moved nodes by including the <code>--partition=clip</code> and <code>--account=clip</code> submission arguments.
+You will need to specify <code>--partition=clip</code> and <code>--account=clip</code> to be able to submit jobs to the CLIP partition.
-The Quality of Service (QoS) options present on the CLIP SLURM scheduler as is will not be migrated into Nexus' SLURM scheduler by default. The <code>huge-long</code> QoS can be used to request resources beyond those available in the universal Nexus QoSes listed [[Nexus#Quality_of_Service_.28QoS.29 | here]]. If you are interested in migrating any other QoS from the CLIP scheduler to the Nexus scheduler, please [[HelpDesk | contact staff]] and we will evaluate the request.
+<pre>
+[username@nexusclip00:~ ] $ srun --pty --ntasks=4 --mem=8G --qos=default --partition=clip --account=clip --time 1-00:00:00 bash
+srun: job 218874 queued and waiting for resources
+srun: job 218874 has been allocated resources
+[username@clip00:~ ] $ scontrol show job 218874
+JobId=218874 JobName=bash
+   UserId=username(1000) GroupId=username(21000) MCS_label=N/A
+   Priority=897 Nice=0 Account=clip QOS=default
+   JobState=RUNNING Reason=None Dependency=(null)
+   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
+   RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
+   SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
+   AccrueTime=2022-11-18T11:13:56
+   StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
+   PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
+   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
+   Partition=clip AllocNode:Sid=nexusclip00:25443
+   ReqNodeList=(null) ExcNodeList=(null)
+   NodeList=clip00
+   BatchHost=clip00
+   NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
+   TRES=cpu=4,mem=8G,node=1,billing=2266
+   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
+   MinCPUsNode=1 MinMemoryNode=8G MinTmpDiskNode=0
+   Features=(null) DelayBoot=00:00:00
+   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
+   Command=bash
+   WorkDir=/nfshomes/username
+   Power=
+</pre>
-==Timeline==
+= Storage =
-All events are liable to begin as early as 9am US Eastern time on the dates indicated. Each event will be completed by no later than 5pm on the same date.
+All data filesystems that were available in the standalone CLIP cluster are also available in Nexus.
-* July 25th 2022: <code>clipgpu00</code> and <code>clipgpu01</code> are moved into Nexus as <code>clip00</code> and <code>clip01</code>
+CLIP users can also request [[Nexus#Project_Allocations | Nexus project allocations]].
-* August 1st 2022: <code>clipgpu02</code> and <code>clipgpu03</code> are moved into Nexus as <code>clip02</code> and <code>clip03</code>
-* August 8th 2022: <code>clipgpu04</code> and <code>materialgpu00</code> are moved into Nexus as <code>clip04</code> and <code>clip05</code>
-* August 15th 2022: <code>materialgpu01</code> and <code>materialgpu02</code> are moved into Nexus as <code>clip06</code> and <code>clip07</code>
-* September 5th 2022: <code>context00</code> and <code>context01</code> are taken offline
-* September 2022: Announcement is made about remaining compute nodes moving into Nexus
-* Fall 2022: Announcement is made about the deprecation of <code>/fs/cliphomes</code> directories

Nexus/CLIP: Difference between revisions
