Nexus/CBCB
Latest revision as of 20:38, 26 November 2024
The compute nodes from CBCB's previous standalone cluster were folded into Nexus in mid-2023.
The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found on the Nexus/Tron page.
Please contact staff with any questions or concerns.
Submission Nodes
You can SSH to nexuscbcb.umiacs.umd.edu to log in to a submission node.
If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
- nexuscbcb00.umiacs.umd.edu
- nexuscbcb01.umiacs.umd.edu
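Because files in /tmp or /scratch0 stay on the node where they were written, it can help to connect to a specific submission node by name instead of the generic alias. A minimal sketch, assuming your files are on nexuscbcb00 and that `username` is a placeholder for your own account:

```shell
# Node where the local files were written (assumption for this example).
node="nexuscbcb00"
target="${node}.umiacs.umd.edu"

# Print the command to run; replace "username" with your own account name.
echo "ssh username@${target}"
```

Running `hostname` right after logging in through the generic alias tells you which of the two submission nodes you landed on.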
Compute Nodes
All compute nodes owned by CBCB faculty are in the CBCB partitions (see the Partitions section below) and are named in the format cbcb##. The sets of nodes are:
- 22 nodes that were purchased in October 2022 with center-wide funding. They are cbcb[00-21].
- 4 nodes from the previous standalone CBCB cluster that moved in as of Summer 2023. They are cbcb[22-25].
- A few additional nodes purchased by Dr. Heng Huang since then. They are all of the remaining cbcb## nodes (currently cbcb[26-29]).
Nodenames | Quantity | CPU cores per node (CPUs) | Memory per node (type) | Filesystem storage per node (type/location) | GPUs per node (type) |
---|---|---|---|---|---|
cbcb[00-21] | 22 | 32 (Dual AMD EPYC 7313) | ~2TB (DDR4 3200MHz) | ~350GB (SATA SSD /scratch0), ~2TB (NVMe SSD /scratch1) | 0 |
cbcb22 | 1 | 28 (Dual Intel Xeon E5-2680 v4) | ~768GB (DDR4 2400MHz) | ~650GB (SATA SSD /scratch0) | 0 |
cbcb[23-24] | 2 | 24 (Dual Intel Xeon E5-2650 v4) | ~256GB (DDR4 2400MHz) | ~800GB (SATA SSD /scratch0) | 0 |
cbcb25 | 1 | 24 (Dual Intel Xeon E5-2650 v4) | ~256GB (DDR4 2400MHz) | ~1.4TB (SATA SSD /scratch0) | 2 (1x NVIDIA GeForce GTX 1080 Ti, 1x NVIDIA GeForce RTX 2080 Ti) |
cbcb26 | 1 | 128 (Dual AMD EPYC 7763) | ~512GB (DDR4 3200MHz) | ~3.4TB (NVMe SSD /scratch0), ~14TB (NVMe SSD /scratch1) | 7 (NVIDIA RTX A5000) |
cbcb27 | 1 | 64 (Dual AMD EPYC 7513) | ~256GB (DDR4 3200MHz) | ~3.4TB (SATA SSD /scratch0), ~3.5TB (NVMe SSD /scratch1) | 8 (NVIDIA RTX A6000) |
cbcb[28-29] | 2 | 32 (Dual AMD EPYC 9124) | ~768GB (DDR5 4800MHz) | ~350GB (SATA SSD /scratch0), ~7TB (NVMe SSD /scratch1) | 8 (NVIDIA RTX 6000 Ada Generation) |
Total | 30 | 1060 (various) | ~49TB (various) | ~94TB (various) | 33 (various) |
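The node, core, and GPU totals in the last row can be cross-checked from the per-node figures in the table. The sketch below only does the arithmetic; the numbers are copied from the rows above (quantity times per-node value).

```shell
# Quantity per row: cbcb[00-21], cbcb22, cbcb[23-24], cbcb25, cbcb26, cbcb27, cbcb[28-29]
total_nodes=$(( 22 + 1 + 2 + 1 + 1 + 1 + 2 ))

# Quantity x CPU cores per node, in the same row order.
total_cores=$(( 22*32 + 1*28 + 2*24 + 1*24 + 1*128 + 1*64 + 2*32 ))

# Quantity x GPUs per node, for the GPU rows only (cbcb25, cbcb26, cbcb27, cbcb[28-29]).
total_gpus=$(( 1*2 + 1*7 + 1*8 + 2*8 ))

echo "nodes=${total_nodes} cores=${total_cores} gpus=${total_gpus}"
# prints: nodes=30 cores=1060 gpus=33
```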
Here is the listing of nodes as shown by the Slurm alias show_nodes (again, all nodes are named in the format cbcb##):
    [root@nexusctl00 ~]# show_nodes | grep cbcb
    NODELIST  CPUS  MEMORY   AVAIL_FEATURES                           GRES                             STATE
    cbcb00    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb01    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb02    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb03    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb04    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb05    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb06    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb07    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb08    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb09    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb10    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb11    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb12    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb13    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb14    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb15    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb16    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb17    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb18    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb19    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb20    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb21    32    2061175  rhel8,x86_64,Zen,EPYC-7313               (null)                           idle
    cbcb22    28    771245   rhel8,x86_64,Xeon,E5-2680                (null)                           idle
    cbcb23    24    255150   rhel8,x86_64,Xeon,E5-2650                (null)                           idle
    cbcb24    24    255150   rhel8,x86_64,Xeon,E5-2650                (null)                           idle
    cbcb25    24    255278   rhel8,x86_64,Xeon,E5-2650,Pascal,Turing  gpu:rtx2080ti:1,gpu:gtx1080ti:1  idle
    cbcb26    128   513243   rhel8,x86_64,Zen,EPYC-7763,Ampere        gpu:rtxa5000:7                   idle
    cbcb27    64    255167   rhel8,x86_64,Zen,EPYC-7513,Ampere        gpu:rtxa6000:8                   idle
    cbcb28    32    771166   rhel8,x86_64,Zen,EPYC-9124,Ada           gpu:rtx6000ada:8                 idle
    cbcb29    32    771166   rhel8,x86_64,Zen,EPYC-9124,Ada           gpu:rtx6000ada:8                 idle
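A listing in this column layout is easy to filter with standard tools. The sketch below picks out the GPU nodes by checking the GRES column; it runs against three sample rows copied from the listing so it works off-cluster. On a submission node you could pipe the real `show_nodes` output through the same `awk` filter, assuming the columns stay whitespace-separated as shown.

```shell
# Sample rows copied from the listing above.
listing='cbcb24 24 255150 rhel8,x86_64,Xeon,E5-2650 (null) idle
cbcb25 24 255278 rhel8,x86_64,Xeon,E5-2650,Pascal,Turing gpu:rtx2080ti:1,gpu:gtx1080ti:1 idle
cbcb26 128 513243 rhel8,x86_64,Zen,EPYC-7763,Ampere gpu:rtxa5000:7 idle'

# Column 5 is GRES; "(null)" means the node has no GPUs.
gpu_nodes=$(printf '%s\n' "$listing" | awk '$5 != "(null)" { print $1, $5 }')
echo "$gpu_nodes"
```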
Partitions
There are two partitions available to general CBCB SLURM users. You must specify one of these two partitions when submitting your job.
- cbcb - This is the default partition. Job allocations on all nodes except those also in the cbcb-heng partition are guaranteed.
- cbcb-interactive - This is a partition that only allows interactive jobs; you cannot submit jobs via sbatch to this partition. Job allocations are guaranteed.
There is one additional partition available solely to Dr. Heng Huang's sponsored accounts.
- cbcb-heng - This partition is for exclusive priority access to Dr. Huang's purchased GPU nodes. Job allocations are guaranteed.
QoS
CBCB users have access to all of the standard job QoSes in the cbcb and cbcb-heng partitions using the cbcb account.
The additional job QoSes for the cbcb and cbcb-heng partitions specifically are:
- highmem: Allows for significantly increased memory to be allocated.
- huge-long: Allows for longer jobs using higher overall resources.
Please note that the partition has a GrpTRES limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.
The only allowed job QoS for the cbcb-interactive partition is:
- interactive: Allows for 4 CPU / 128G mem jobs up to 12 hours in length; can only be used via srun or salloc.
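As an illustration of the interactive QoS, the sketch below assembles an `srun` request at the stated limits (4 CPUs, 128G of memory, 12 hours). It assumes the `cbcb` account is also the one to use with the cbcb-interactive partition, which this page does not state explicitly. The command is printed rather than executed so the example can be read off-cluster.

```shell
# Assumption: --account=cbcb also applies to the cbcb-interactive partition.
cmd="srun --pty --partition=cbcb-interactive --account=cbcb --qos=interactive --cpus-per-task=4 --mem=128G --time=12:00:00 bash"

# Print the command; run it yourself from a submission node.
echo "$cmd"
```

Requesting less than the maximum (fewer CPUs, less memory, or a shorter time) is equally valid under the same QoS.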
Jobs
You will need to specify --partition=cbcb and --account=cbcb to be able to submit jobs to the CBCB partition.
    [username@nexuscbcb00:~ ] $ srun --pty --ntasks=16 --mem=2000G --qos=highmem --partition=cbcb --account=cbcb --time 1-00:00:00 bash
    srun: job 218874 queued and waiting for resources
    srun: job 218874 has been allocated resources
    [username@cbcb00:~ ] $ scontrol show job 218874
    JobId=218874 JobName=bash
    UserId=username(1000) GroupId=username(21000) MCS_label=N/A
    Priority=897 Nice=0 Account=cbcb QOS=highmem
    JobState=RUNNING Reason=None Dependency=(null)
    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
    RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
    SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
    AccrueTime=2022-11-18T11:13:56
    StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
    PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
    Partition=cbcb AllocNode:Sid=nexuscbcb00:25443
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=cbcb00
    BatchHost=cbcb00
    NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    TRES=cpu=16,mem=2000G,node=1,billing=2266
    Socks/Node=* NtasksPerN:B:S:C=0:0:*:*
    CoreSpec=*
    MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0
    Features=(null) DelayBoot=00:00:00
    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
    Command=bash
    WorkDir=/nfshomes/username
    Power=
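The same request can be made as a batch job. The sketch below writes a batch script that mirrors the resource flags of the interactive example above; the file name `cbcb_job.sh`, the output file pattern, and the `srun hostname` workload are placeholders chosen for illustration, not site conventions.

```shell
# Write a batch script with the same partition, account, QoS, and resources
# as the interactive srun example.
cat > cbcb_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=cbcb
#SBATCH --account=cbcb
#SBATCH --qos=highmem
#SBATCH --ntasks=16
#SBATCH --mem=2000G
#SBATCH --time=1-00:00:00
#SBATCH --output=cbcb_job_%j.out

# Placeholder workload: each of the 16 tasks prints the node it runs on.
srun hostname
EOF

# Submit from a submission node with: sbatch cbcb_job.sh
```

Batch submission applies to the cbcb partition only, since cbcb-interactive does not accept jobs via `sbatch`.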
Storage
CBCB still has its current storage allocation in place. All data filesystems that were available in the standalone CBCB cluster are also available in Nexus. Please note the change to your home directory described in the Migration section below.
CBCB users can also request Nexus project allocations.
Migration
Operating System / Software
CBCB's standalone cluster submission and compute nodes were running RHEL7. Nexus is exclusively running RHEL8, so any software you compiled on the standalone cluster may need to be recompiled to work correctly in this new environment. The CBCB module tree for RHEL8 may not yet be fully populated. If you do not see the modules you need, please reach out to the CBCB software maintainers.
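When recompiling, one way to avoid mixing old RHEL7 binaries with new RHEL8 ones is to keep a separate build directory per operating system release. The sketch below is only a suggestion, not a site convention: it derives a tag from /etc/os-release (expected to yield something like `rhel8` on a RHEL8 node) and falls back to `unknown` if the file is missing. The `build-` prefix is an arbitrary choice.

```shell
os_tag="unknown"
if [ -r /etc/os-release ]; then
    # Source the file in a subshell so its variables do not leak into this shell.
    # ID is the distribution name; VERSION_ID is trimmed to its major version.
    os_tag=$( . /etc/os-release && printf '%s%s' "${ID:-unknown}" "${VERSION_ID%%.*}" )
fi

build_dir="build-${os_tag}"
echo "Use a separate build directory per OS release: ${build_dir}"
```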