Nexus/CLIP

==Overview==
The [[Nexus]] scheduler houses [https://wiki.umiacs.umd.edu/clip/index.php/Main_Page CLIP]'s new computational partition.
The [https://wiki.umiacs.umd.edu/clip/index.php/Main_Page CLIP] lab's cluster compute nodes were gradually folded into UMIACS' [[Nexus]] cluster beginning on Monday, July 25th, 2022 at 9am, in order to consolidate all UMIACS compute nodes onto one common [[SLURM]] scheduler. Please see the [[#Timeline | Timeline]] section below for the concrete dates in chronological order.


The Nexus cluster already has a large pool of compute resources made possible through leftover funding for the [[Iribe | Brendan Iribe Center]]. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

==Submission Nodes==
There are two submission nodes for Nexus exclusively available to CLIP users:
* <code>nexusclip00.umiacs.umd.edu</code>
* <code>nexusclip01.umiacs.umd.edu</code>

The general purpose nodes <code>context00.umiacs.umd.edu</code> and <code>context01.umiacs.umd.edu</code> were retired on Tuesday, September 6th, 2022 at 9am, and the previous CLIP cluster submission nodes <code>clipsub00.umiacs.umd.edu</code> and <code>clipsub01.umiacs.umd.edu</code> were retired on Wednesday, March 1st, 2023 at 9am. Please use the <code>nexusclip</code> submission nodes for any general purpose CLIP compute needs.

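To get started, connect to either submission node over SSH with your UMIACS account; a minimal example is below (<code>username</code> is a placeholder for your own username). The examples on the rest of this page are run from one of these nodes.

<pre>
# Log in to a CLIP submission node (replace "username" with your UMIACS username).
ssh username@nexusclip00.umiacs.umd.edu
</pre>
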
==Resources==
The CLIP partition has nodes brought over from the previous standalone CLIP Slurm scheduler as well as some more recent purchases. The compute nodes are named <code>clip##</code>.

As part of the transition into Nexus, compute nodes were reinstalled with Red Hat Enterprise Linux 8 (RHEL8) as their operating system; they had previously run Red Hat Enterprise Linux 7 (RHEL7). GPU compute nodes were renamed to just <code>clip##</code> for consistency with Nexus' naming scheme, while CPU-only compute nodes folded into the <tt>legacy</tt> partition (shared with other departments/labs/centers) and are named <code>legacy##</code>. Data stored on the local scratch drives of compute nodes (<code>/scratch0</code>, <code>/scratch1</code>, etc.) did not persist through the reinstalls, so any needed data had to be secured to a network attached storage location prior to each node's move date as listed in the [[#Timeline | Timeline]] below.

You may need to re-compile or re-link your applications due to the changes to the underlying operating system libraries. We have tried to maintain a similar set of software in our GNU [[Modules]] software trees for both operating systems; please let us know if something is missing after the upgrades.

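To see which nodes are currently in the partition and what resources they offer, you can query Slurm from a submission node; the output format string below is just one example.

<pre>
# One line per node: node name, state, CPU count, memory (MB), and any GPUs (GRES).
sinfo -p clip -N -o "%N %T %c %m %G"

# Full details for a single node, including features and configured resources.
scontrol show node clip00
</pre>
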
==QoS==
CLIP users have access to all of the [[Nexus#Quality_of_Service_.28QoS.29 | standard QoSes]] in the <code>clip</code> partition using the <code>clip</code> account. There is one additional QoS, <code>huge-long</code>, that allows for longer jobs using higher overall resources.

The Quality of Service (QoS) options present on the old CLIP SLURM scheduler were not otherwise migrated into the Nexus SLURM scheduler; the <code>huge-long</code> QoS can be used to request resources beyond those available in the universal Nexus QoSes. If you have a request for an additional QoS for CLIP jobs, please [[HelpDesk | contact staff]] and we will evaluate the request.

Please note that the partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on <code>clip##</code> nodes plus 50% of the available cores/RAM on <code>legacy##</code> nodes, so your job may need to wait if all available cores/RAM (or GPUs) are in use.

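To check which QoSes your association can actually use and what limits (including <code>GrpTRES</code>) a given QoS carries, the standard Slurm accounting tools work from any submission node; the format fields and widths below are illustrative.

<pre>
# QoSes available to you under the clip account.
sacctmgr show assoc user=$USER account=clip format=Account,User,Partition,QOS%40

# Limits attached to a specific QoS, e.g. huge-long.
sacctmgr show qos huge-long format=Name%20,MaxWall,MaxTRESPU%40,GrpTRES%40

# Current queue for the partition, to gauge how busy it is.
squeue -p clip
</pre>
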
==Jobs==
You will need to specify <code>--partition=clip</code>, <code>--account=clip</code>, and a specific <code>--qos</code> to be able to submit jobs to the CLIP partition; jobs scheduled this way are non-interruptible. For example, to start an interactive shell on a CLIP compute node:


<pre>
[username@nexusclip00:~ ] $ srun --pty --ntasks=4 --mem=8G --qos=default --partition=clip --account=clip --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@clip00:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
  UserId=username(1000) GroupId=username(21000) MCS_label=N/A
  Priority=897 Nice=0 Account=clip QOS=default
  JobState=RUNNING Reason=None Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
  RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
  SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
  AccrueTime=2022-11-18T11:13:56
  StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
  PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
  Partition=clip AllocNode:Sid=nexuscbcb00:25443
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=clip00
  BatchHost=clip00
  NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  TRES=cpu=16,mem=2000G,node=1,billing=2266
  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
  MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0
  Features=(null) DelayBoot=00:00:00
  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=bash
  WorkDir=/nfshomes/username
  Power=
</pre>
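
For non-interactive work, the same arguments can be supplied as <code>#SBATCH</code> directives in a batch script. Below is a minimal sketch; the job name, resource values, time limit, output path, and the optional GPU request are illustrative only.

<pre>
#!/bin/bash
# Required arguments for the CLIP partition; the QoS can be any you have access to.
#SBATCH --partition=clip
#SBATCH --account=clip
#SBATCH --qos=default
# Illustrative job name, resources, time limit, and log file location.
#SBATCH --job-name=clip-example
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --time=1-00:00:00
#SBATCH --output=%x-%j.out
# Uncomment the next line to also request a GPU.
##SBATCH --gres=gpu:1

# Replace with your actual workload; hostname is only a placeholder.
hostname
</pre>

Submit the script with <code>sbatch</code> from a submission node and check its status with <code>squeue -u $USER</code>.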


Please [[HelpDesk | contact staff]] with any questions or concerns.
==Storage==
All data filesystems that were available in the standalone CLIP cluster are also available in Nexus.


The <code>/cliphomes</code> home directories from the standalone CLIP cluster have been retired; the Nexus cluster uses [[NFShomes | /nfshomes]] directories for home directory storage space. On Thursday, April 20th, 2023 between 5-8pm, /cliphomes directories were replaced by /nfshomes directories in all computational resources (CRs) that still used /cliphomes, and the /cliphomes directories were made read-only. On Thursday, May 18th, 2023 between 5-8pm, /cliphomes directories were taken offline completely.

CLIP users can also request [[Nexus#Project_Allocations | Nexus project allocations]].
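
To confirm which filesystem backs a given path and how much space is free, standard tools are sufficient; the second path below is just one example of checking a mounted data filesystem.

<pre>
# Show the filesystem backing your home directory and its free space.
df -h /nfshomes/$USER

# Any other data filesystem can be checked the same way by its mount path.
df -h /fs/nfshomes
</pre>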
 
==Timeline==
All events were liable to begin as early as 9am US Eastern time on the dates indicated, unless otherwise noted, and each event was completed within that business week (i.e., by Friday at 5pm) or within the timeframe specified.
 
{| class="wikitable"
! Date
! Event
|-
| July 25th 2022
| <code>clipgpu00</code> and <code>clipgpu01</code> are moved into Nexus as <code>clip00</code> and <code>clip01</code>
|-
| August 1st 2022
| <code>clipgpu02</code> and <code>clipgpu03</code> are moved into Nexus as <code>clip02</code> and <code>clip03</code>
|-
| August 8th 2022
| <code>clipgpu04</code> and <code>clipgpu05</code> are moved into Nexus as <code>clip04</code> and <code>clip05</code>
|-
| August 15th 2022
| <code>clipgpu06</code> and <code>materialgpu00</code> are moved into Nexus as <code>clip06</code> and <code>clip07</code>
|-
| August 22nd 2022
| <code>materialgpu01</code> and <code>materialgpu02</code> are moved into Nexus as <code>clip08</code> and <code>clip09</code>
|-
| September 6th 2022
| <code>context00</code> and <code>context01</code> are taken offline
|-
| January 3rd 2023
| <code>chroneme[04,06-07], d[41-46], phoneme[00-09]</code> are moved into Nexus as legacy nodes
|-
| February 1st 2023
| Remaining CLIP cluster compute nodes are taken offline (not moving into Nexus)
|-
| March 1st 2023
| <code>clipsub00</code> and <code>clipsub01</code> (CLIP cluster submission nodes) are taken offline
|-
| [[MonthlyMaintenanceWindow | April 20th 2023, 5-8pm]]
| All CLIP computational resources are changed to use <code>/fs/nfshomes</code> directories, and <code>/fs/cliphomes</code> directories are made read-only
|-
| [[MonthlyMaintenanceWindow | May 18th 2023, 5-8pm]]
| <code>/fs/cliphomes</code> directories are taken offline
|}
