Nexus/CLIP
Latest revision as of 18:54, 3 December 2024
The compute nodes from CLIP's previous standalone cluster were folded into Nexus in late 2022.
The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found here.
Please contact staff with any questions or concerns.
Submission Nodes
You can SSH to nexusclip.umiacs.umd.edu to log in to a submission node.
If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
- nexusclip00.umiacs.umd.edu
- nexusclip01.umiacs.umd.edu
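Because node-local directories are only visible on the node where they were written, it can help to connect to a specific submission node rather than the generic nexusclip name. The following is a minimal sketch; the function names clip_host and clip_login are hypothetical helpers introduced here for illustration, not tools provided by the site, and the sketch assumes your local username matches your UMIACS username.

```shell
# Build the fully qualified name of a specific CLIP submission node.
# clip_host and clip_login are hypothetical helpers, not site-provided tools.
clip_host() {
    local index="${1:-00}"   # "00" or "01"
    printf 'nexusclip%s.umiacs.umd.edu\n' "$index"
}

# Open an SSH session on that exact node so its local /tmp and /scratch0 are visible.
clip_login() {
    ssh "$(clip_host "$1")"
}

# Example: return to the node where the data was written.
# clip_login 01
```

Calling clip_login without an argument falls back to nexusclip00.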
Compute Nodes
The CLIP partition has nodes brought over from the previous standalone CLIP Slurm scheduler as well as some more recent purchases. The compute nodes are named clip##.
Network
The network infrastructure supporting the CLIP partition consists of:
- One pair of network switches connected to each other via dual 25GbE links for redundancy, serving the following compute nodes:
  - clip[00-03,05,07-08,10]: Two 10GbE links per node, one to each switch in the pair (redundancy).
  - clip04: Two 40GbE links per node, one to each switch in the pair (redundancy).
  - clip06: Two 25GbE links per node, one to each switch in the pair (redundancy).
  - clip09: Two 1GbE links per node, one to each switch in the pair (redundancy).
  - clip[11-13]: Two 100GbE links per node, one to each switch in the pair (redundancy).
For a broader overview of the network infrastructure supporting the Nexus cluster, please see Nexus/Network.
QoS
CLIP users have access to all of the standard job QoSes in the clip partition using the clip account.
The additional job QoSes for the CLIP partition specifically are:
- huge-long: Allows for longer jobs using higher overall resources.
Please note that the partition has a GrpTRES limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.
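If a job is waiting, the standard Slurm client commands can show how busy the partition is. The sketch below assumes it is run on a submission node with sinfo and squeue on the PATH; the function name clip_status is a hypothetical helper, and the exact columns shown depend on the site's Slurm configuration.

```shell
# Hypothetical helper: show why jobs in the clip partition may be waiting.
clip_status() {
    # State of the nodes in the clip partition (idle, mixed, allocated, ...).
    sinfo --partition=clip
    # Your own jobs in the partition; with the default output format,
    # pending jobs list a reason in the last column.
    squeue --partition=clip --user="$USER"
}

# Run only where the Slurm client tools are installed.
if command -v sinfo >/dev/null 2>&1 && command -v squeue >/dev/null 2>&1; then
    clip_status
else
    echo "Slurm client tools not found; run this on a submission node"
fi
```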
Jobs
You will need to specify --partition=clip and --account=clip to be able to submit jobs to the CLIP partition.
[username@nexusclip00:~ ] $ srun --pty --ntasks=4 --mem=8G --qos=default --partition=clip --account=clip --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@clip00:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
UserId=username(1000) GroupId=username(21000) MCS_label=N/A
Priority=897 Nice=0 Account=clip QOS=default
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
AccrueTime=2022-11-18T11:13:56
StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
Partition=clip AllocNode:Sid=nexusclip00:25443
ReqNodeList=(null) ExcNodeList=(null)
NodeList=clip00
BatchHost=clip00
NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=8G,node=1,billing=2266
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=8G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/nfshomes/username
Power=
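The same options can also be given to a non-interactive batch job. The following is a minimal sketch, not an official template: it reuses the resource values from the interactive session above, the file name clip_job.sh is a hypothetical choice, and the job body simply prints the hostname of the node it ran on.

```shell
#!/bin/bash
# Write a minimal batch script for the CLIP partition.
# The file name clip_job.sh is a hypothetical choice for this sketch.
cat > clip_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=clip
#SBATCH --account=clip
#SBATCH --qos=default
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --time=1-00:00:00

# Report which compute node the job landed on.
hostname
EOF

# Submit only where Slurm is available; otherwise just report what was written.
if command -v sbatch >/dev/null 2>&1; then
    sbatch clip_job.sh
else
    echo "sbatch not found; wrote clip_job.sh without submitting"
fi
```

Keeping the partition and account in the #SBATCH header means they do not need to be repeated on the command line for each submission.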
Storage
All data filesystems that were available in the standalone CLIP cluster are also available in Nexus.
CLIP users can also request Nexus project allocations.