Nexus/CLIP

= Overview =
The [[Nexus]] scheduler houses [https://wiki.umiacs.umd.edu/clip/index.php/Main_Page CLIP]'s new computational partition. Only CLIP lab members are able to run non-interruptible jobs on these nodes.

The [https://wiki.umiacs.umd.edu/clip/index.php/Main_Page CLIP] lab's standalone cluster compute nodes were gradually folded into UMIACS' [[Nexus]] cluster beginning on Monday, July 25th, 2022, in order to consolidate all compute nodes in UMIACS onto one common [[SLURM]] scheduler; see the [[#Timeline | Timeline]] section below for the schedule of moves.


The Nexus cluster also has a large pool of common compute resources made possible through leftover funding for the [[Iribe | Brendan Iribe Center]]. Details on the common nodes in the cluster (the Tron partition) can be found [[Nexus/Tron | here]].
= Submission Nodes =
You can [[SSH]] to <code>nexusclip.umiacs.umd.edu</code> to log in to a submission host.


If you store something in a local directory (/tmp, /scratch0) on one of the two submission hosts, you will need to connect to that same submission host to access it later. The actual submission hosts are:
* <code>nexusclip00.umiacs.umd.edu</code>
* <code>nexusclip01.umiacs.umd.edu</code>


You will need to log on to one of these submission nodes to reach the CLIP partition; submission from <code>clipsub00.umiacs.umd.edu</code> or <code>clipsub01.umiacs.umd.edu</code> will not work.
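
As a minimal sketch (with <code>username</code> as a placeholder for your UMIACS username), logging in might look like this:
<pre>
# Log in via the general alias, which should place you on one of the two submission hosts.
ssh username@nexusclip.umiacs.umd.edu

# Or connect to a specific submission host, e.g. to get back to files you left in its /tmp or /scratch0.
ssh username@nexusclip00.umiacs.umd.edu
</pre>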
= Resources =
The CLIP partition has nodes brought over from the previous standalone CLIP Slurm scheduler as well as some more recent purchases. The compute nodes are named <code>clip##</code>.
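
To see which <code>clip##</code> nodes are currently in the partition and what CPUs, memory, and GPUs they offer, a query along these lines should work (the field selection below is just one reasonable choice):
<pre>
# Node-oriented listing of the clip partition: node name, CPU count, memory (MB), and GRES (GPUs).
sinfo --partition=clip -N -o "%N %c %m %G"
</pre>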


= QoS =
CLIP users have access to all of the [[Nexus#Quality_of_Service_.28QoS.29 | standard job QoSes]] in the <code>clip</code> partition using the <code>clip</code> account.


The additional job QoSes for the CLIP partition specifically are:
* <code>huge-long</code>: Allows for longer jobs using higher overall resources.


The Quality of Service (QoS) options present on the standalone CLIP SLURM scheduler were not migrated into the Nexus SLURM scheduler by default. The <code>huge-long</code> QoS can be used to request resources beyond those available in the universal Nexus QoSes listed [[Nexus#Quality_of_Service_.28QoS.29 | here]]. If you are interested in migrating a QoS from the CLIP scheduler to the Nexus scheduler, please [[HelpDesk | contact staff]] and we will evaluate the request.

Please note that the partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.
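
If you want to confirm which accounts and QoSes your user can submit with, or inspect the limits behind a QoS such as <code>huge-long</code>, commands along these lines should work (the exact fields and values shown will vary):
<pre>
# List the accounts, partitions, and QoSes associated with your user.
sacctmgr show associations user=$USER format=Account,Partition,QOS%40

# Show the limits attached to the huge-long QoS (wall time, per-user and group TRES caps).
sacctmgr show qos huge-long format=Name,MaxWall,MaxTRESPU%40,GrpTRES%40

# If your job is pending because the aggregate GrpTRES limit is in use, the reason
# column from squeue will show something like AssocGrpCpuLimit or a QOSGrp* reason.
squeue -u $USER -o "%.10i %.9P %.8q %.10T %R"
</pre>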


= Jobs =
You will need to specify <code>--partition=clip</code>, <code>--account=clip</code>, and a specific <code>--qos</code> to be able to submit jobs to the CLIP partition. Only CLIP users can schedule non-interruptible jobs on the GPU-capable <code>clip##</code> nodes with these arguments; to use the moved CPU-only nodes (now named <code>legacy##</code>), use the <code>--partition=scavenger</code> and <code>--account=scavenger</code> submission arguments instead.


<pre>
[username@nexusclip00:~ ] $ srun --pty --ntasks=4 --mem=8G --qos=default --partition=clip --account=clip --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@clip00:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
  UserId=username(1000) GroupId=username(21000) MCS_label=N/A
  Priority=897 Nice=0 Account=clip QOS=default
  JobState=RUNNING Reason=None Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
  RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
  SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
  AccrueTime=2022-11-18T11:13:56
  StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
  PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
  Partition=clip AllocNode:Sid=nexusclip00:25443
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=clip00
  BatchHost=clip00
  NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  TRES=cpu=4,mem=8G,node=1,billing=2266
  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
  MinCPUsNode=1 MinMemoryNode=8G MinTmpDiskNode=0
  Features=(null) DelayBoot=00:00:00
  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=bash
  WorkDir=/nfshomes/username
  Power=
</pre>
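
The same options work for batch jobs. A minimal submission script might look like the following (the job name, resource requests, and output filename are placeholders, not site requirements):
<pre>
#!/bin/bash
# For the moved CPU-only (legacy##) nodes, use --partition=scavenger and
# --account=scavenger instead of the clip partition/account below.
#SBATCH --job-name=clip-example
#SBATCH --partition=clip
#SBATCH --account=clip
#SBATCH --qos=default
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --time=1-00:00:00
#SBATCH --output=%x-%j.out

# Replace this with your actual workload.
hostname
</pre>
Submit it from one of the <code>nexusclip</code> submission hosts, e.g. <code>sbatch clip-example.sh</code>.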


= Storage =
All data filesystems that were available in the standalone CLIP cluster are also available in Nexus.


CLIP users can also request [[Nexus#Project_Allocations | Nexus project allocations]].
 
 
= Timeline =
As part of the move from the standalone CLIP cluster into Nexus, compute nodes are reinstalled with Red Hat Enterprise Linux 8 (RHEL8) as their operating system (they previously ran Red Hat Enterprise Linux 7). GPU compute nodes' names also change to just <code>clip##</code> for consistency with Nexus' naming scheme, while CPU-only compute nodes fold into the <tt>legacy</tt> partition and are thus named <code>legacy##</code>.

Data stored on the local scratch drives of compute nodes (/scratch0, /scratch1, etc.) does not persist through the reinstalls. Please secure all data on these local scratch drives to a network attached storage location prior to each node's move date as listed below.

You may need to re-compile or re-link your applications due to the changes to the underlying operating system libraries. We have tried to maintain a similar set of software in our GNU [[Modules]] software trees for both operating systems; please let us know if anything is missing after the upgrades.

In addition, the general purpose nodes <code>context00.umiacs.umd.edu</code> and <code>context01.umiacs.umd.edu</code> were retired on Tuesday, September 6th, 2022 at 9am. Please use <code>clipsub00.umiacs.umd.edu</code> and <code>clipsub01.umiacs.umd.edu</code> (or the <code>nexusclip</code> submission nodes) for any general purpose CLIP compute needs.

Lastly, /cliphomes directories will be deprecated; the Nexus cluster uses [[NFShomes | /nfshomes]] directories for home directory storage space. There will be an announcement with a concrete deprecation date once the cluster node moves are done or nearly done, and /cliphomes will be made read-only once the moves are done.

All events are liable to begin as early as 9am US Eastern time on the dates indicated, with each event completed within the business week (i.e. by Friday at 5pm).
 
{| class="wikitable"
! Date
! Event
|-
| July 25th 2022
| <code>clipgpu00</code> and <code>clipgpu01</code> are moved into Nexus as <code>clip00</code> and <code>clip01</code>
|-
| August 1st 2022
| <code>clipgpu02</code> and <code>clipgpu03</code> are moved into Nexus as <code>clip02</code> and <code>clip03</code>
|-
| August 8th 2022
| <code>clipgpu04</code> and <code>clipgpu05</code> are moved into Nexus as <code>clip04</code> and <code>clip05</code>
|-
| August 15th 2022
| <code>clipgpu06</code> and <code>materialgpu00</code> are moved into Nexus as <code>clip06</code> and <code>clip07</code>
|-
| August 22nd 2022
| <code>materialgpu01</code> and <code>materialgpu02</code> are moved into Nexus as <code>clip08</code> and <code>clip09</code>
|-
| September 6th 2022
| <code>context00</code> and <code>context01</code> are taken offline
|-
| Ongoing
| CPU-only compute nodes are moved into Nexus as <code>legacy##</code>
|-
| Fall 2022
| Announcement is made about the deprecation of <code>/fs/cliphomes</code> directories
|}

Please [[HelpDesk | contact staff]] with any questions or concerns.
