UMIACS - User contributions [en]

Nexus/ClusterOSUpgrade

2026-07-24T16:51:46Z

Mbaney:

Nexus/ClusterOSUpgrade

2026-07-24T16:36:53Z

Mbaney:

MonthlyMaintenanceWindow

2026-07-24T16:25:40Z

Mbaney:

[[HelpDesk | UMIACS staff]] takes a monthly maintenance window to patch and reboot all UMIACS-supported hosts and services. This provides a way for staff to ensure security updates are installed and applied on the numerous different platforms and appliances that UMIACS runs.

The window for each month is calculated by adding 9 days to [https://en.wikipedia.org/wiki/Patch_Tuesday Microsoft's Patch Tuesday] (the second Tuesday of the month). This leaves enough time to marshal the patches released that month by Microsoft, Red Hat, Apple, and other OS and application vendors and to prepare systems for reboot. As a result, the window always falls on the '''Thursday that occurs between the 17th and the 23rd (inclusive)''' of each month. The window lasts from '''5pm-8pm'''.
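The date rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (second Tuesday of the month plus 9 days), not a tool that staff use; the function names are made up for this example, and it does not account for windows that are moved for breaks, outages, or change freezes.

```python
from datetime import date, timedelta


def patch_tuesday(year: int, month: int) -> date:
    """Return the second Tuesday of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0, so Tuesday is 1.
    days_to_first_tuesday = (1 - first.weekday()) % 7
    return first + timedelta(days=days_to_first_tuesday + 7)


def maintenance_window(year: int, month: int) -> date:
    """Return Patch Tuesday plus 9 days, which always lands on a Thursday."""
    return patch_tuesday(year, month) + timedelta(days=9)


# Print the unadjusted window dates for August through December 2026.
for month in range(8, 13):
    print(maintenance_window(2026, month).strftime("%A %B %d %Y"))
```

Because the second Tuesday always falls between the 8th and the 14th, adding 9 days always gives a Thursday between the 17th and the 23rd.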

[[Nexus]] will always have a reservation in place from 4:45pm-8pm on the day of the upcoming window to prevent jobs from being scheduled on compute nodes. The 15-minute addition before the start of the window is to allow jobs to fully end. Any job submitted before the reservation begins that has a time limit that would run into the reservation will be held until at least the end of the reservation - 8pm on the day of the window. This is to prevent issues with jobs failing to end properly causing delays in work we have scheduled during the window.
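As a rough illustration of how the reservation affects a job, the sketch below estimates the earliest time a job could start, given its time limit. It is a simplification under stated assumptions: it ignores queue wait and resource availability, it assumes a job whose time limit ends exactly at 4:45pm is still allowed to run, and the function name <code>earliest_start</code> is hypothetical rather than part of SLURM.

```python
from datetime import date, datetime, time, timedelta

RESERVATION_START = time(16, 45)  # 4:45pm on the day of the window
RESERVATION_END = time(20, 0)     # 8pm on the day of the window


def earliest_start(now: datetime, time_limit: timedelta, window_day: date) -> datetime:
    """Estimate when a job could start, considering only the reservation."""
    res_start = datetime.combine(window_day, RESERVATION_START)
    res_end = datetime.combine(window_day, RESERVATION_END)
    # The job can start now if the reservation is already over, or if the
    # job's time limit ends before the reservation begins.
    if now >= res_end or now + time_limit <= res_start:
        return now
    # Otherwise the job is held until at least the end of the reservation.
    return res_end


window = date(2026, 8, 20)
noon = datetime(2026, 8, 20, 12, 0)
print(earliest_start(noon, timedelta(hours=4), window))
print(earliest_start(noon, timedelta(hours=6), window))
```

In practice, requesting a shorter time limit is the main way to let a job start before the window instead of after it.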

A list of upcoming maintenance windows is as follows, with the next one in bold. Again, the window is on the '''Thursday that occurs between the 17th and the 23rd (inclusive)''' of each month, and lasts from '''5pm-8pm'''.

* '''August 20th 2026'''
* September 17th 2026
* October 22nd 2026
* November 19th 2026
* December 17th 2026

==Archives==
* January 17th 2013 - BEGIN time of 8pm-12am for this window through February 20th 2020
* February 21st 2013
* March 21st 2013
* April 18th 2013
* May 23rd 2013
* June 20th 2013
* July 18th 2013
* August 22nd 2013
* September 19th 2013
* October 17th 2013
* December 19th 2013
* January 23rd 2014
* February 20th 2014
* March 20th 2014
* April 17th 2014
* May 22nd 2014
* June 19th 2014
* July 17th 2014
* August 21st 2014
* September 18th 2014
* October 23rd 2014
* November 20th 2014
* December 18th 2014
* January 22nd 2015
* February 19th 2015
* March 19th 2015
* May 21st 2015
* June 18th 2015
* July 23rd 2015
* August 20th 2015
* September 17th 2015
* October 22nd 2015
* November 19th 2015
* December 17th 2015
* January 21st 2016
* February 18th 2016
* March 12th 2016 (Adjusted date for AVW power outage)
* April 21st 2016
* May 19th 2016
* June 23rd 2016
* July 21st 2016
* August 18th 2016
* September 22nd 2016
* October 20th 2016
* November 17th 2016
* December 22nd 2016
* January 19th 2017
* February 23rd 2017
* March 23rd 2017
* April 20th 2017
* May 18th 2017
* June 22nd 2017
* July 20th 2017
* August 17th 2017
* September 21st 2017
* October 19th 2017
* December 21st 2017
* January 18th 2018
* February 22nd 2018
* March 22nd 2018
* April 19th 2018
* May 17th 2018
* June 21st 2018
* July 19th 2018
* August 23rd 2018
* September 20th 2018
* October 18th 2018
* December 20th 2018
* January 24th 2019
* February 21st 2019
* April 18th 2019
* May 23rd 2019
* June 20th 2019
* July 18th 2019
* August 22nd 2019
* September 19th 2019
* October 17th 2019
* November 21st 2019
* December 19th 2019
* January 23rd 2020
* February 20th 2020
* April 23rd 2020 - BEGIN time of 5pm-7pm for this window through August 19th 2021
* June 18th 2020
* July 23rd 2020
* August 20th 2020
* September 17th 2020
* October 22nd 2020
* November 19th 2020
* December 17th 2020
* January 21st 2021
* February 18th 2021
* March 25th 2021 (Adjusted date for extended Spring Break)
* April 22nd 2021
* May 20th 2021
* June 17th 2021
* July 22nd 2021
* August 19th 2021
* September 23rd 2021 - BEGIN time of 5pm-8pm for this window and all others below
* October 21st 2021
* November 18th 2021
* January 20th 2022
* February 17th 2022
* March 24th 2022 (Adjusted date for Spring Break)
* April 21st 2022
* May 19th 2022
* June 23rd 2022
* July 21st 2022
* August 18th 2022
* September 22nd 2022
* October 20th 2022
* November 17th 2022
* January 19th 2023
* February 23rd 2023
* April 20th 2023
* May 18th 2023
* June 22nd 2023
* July 20th 2023
* August 17th 2023
* September 21st 2023
* October 19th 2023
* December 20th 2023 (Adjusted date for early Winter Break)
* January 18th 2024
* February 22nd 2024
* March 21st 2024
* April 18th 2024
* May 23rd 2024
* June 20th 2024
* July 18th 2024
* August 22nd 2024
* September 19th 2024
* October 17th 2024
* November 21st 2024
* December 19th 2024
* January 23rd 2025
* February 20th 2025
* March 20th 2025
* April 17th 2025
* May 22nd 2025
* June 19th 2025
* July 17th 2025
* August 21st 2025
* September 18th 2025
* October 23rd 2025
* November 20th 2025
* December 18th 2025
* January 22nd 2026
* February 19th 2026
* March 19th 2026
* April 23rd 2026
* May 28th 2026 (Adjusted date for CIO-imposed network change freeze)
* June 18th 2026
* July 23rd 2026

Nexus/ClusterOSUpgrade

2026-07-24T15:41:19Z

Mbaney:

Nexus/ClusterOSUpgrade

2026-07-24T14:53:12Z

Mbaney:

Nexus/ClusterOSUpgrade

2026-07-24T14:29:10Z

Mbaney:

Nexus/ClusterOSUpgrade

2026-07-23T18:20:58Z

Mbaney:

==Overview==
UMIACS Technical Staff has begun the process of upgrading the operating system version on all [[Nexus]] cluster nodes from [[RHEL | Red Hat Enterprise Linux (RHEL)]] 8 to 9 as of 9am on Monday 06/01/2026.

RHEL8 is in the Maintenance Support phase of its life cycle and is transitioning to the Extended Life phase in 2029. More information on Red Hat's lifecycle policy for its operating systems can be found [https://access.redhat.com/support/policy/updates/errata here]. We are staying well ahead of the Extended Life phase date for our cluster nodes by performing these upgrades now.

RHEL9 is still in the Full Support phase of its life cycle and introduces a newer major Linux kernel version and newer [https://www.gnu.org/software/libc glibc] version, improving compatibility with many newer software applications.

==Scheduling==
'''Upgrades for all cluster nodes have begun.''' We expect to be finished with all cluster node upgrades no later than 5pm on Friday 08/21/2026.

===[[SLURM/JobSubmission | Submission Nodes]]===
'''Submission nodes with the number '01' in their hostnames have been upgraded as of 06/01/2026.'''

Submission nodes with the number '00' in their hostnames will be scheduled for upgrade individually, once all of the compute nodes associated with the same lab/center have been upgraded. Staff will send a notification to each lab's/center's cluster users to schedule the relevant '00' node's upgrade when applicable. The actual date of each upgrade will be no less than one week after the corresponding notification has been sent.

Data in [[FilesystemDataStorage#UNIX_Filesystem_Storage | UNIX filesystem storage]] spaces on each submission node, i.e., /tmp and /scratch0, will not be preserved during upgrade. If you have any data in any such space on the '00' submission node in a pairing that you want to keep, please ensure you copy it to the '01' submission node or a [[FilesystemDataStorage#Network-Attached_Filesystem_Storage | network-attached filesystem storage]] space prior to the '00' node's upgrade date. Data in network-attached filesystem storage spaces, such as /nfshomes or /fs/nexus-scratch, will not be affected.

===Compute Nodes===
Due to the large number of compute nodes and the desire to not interrupt running jobs, we are not generally able to schedule each specific compute node upgrade on a specific date. If you find that a specific node is unavailable to schedule jobs on, you can run the command <code>sinfo --list-reasons --long</code> on a submission node and look to see if the node is in the list with the text "RHEL9 upgrade" - if this is present, the upgrade for that node is underway.
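If you save the output of the command above to a file or variable, a short script can pull out just the nodes being upgraded. The sketch below is illustrative only: the helper name is made up, the sample text is a hypothetical stand-in rather than real cluster output, and it assumes the node list is the last whitespace-separated column on each line.

```python
def nodes_being_upgraded(sinfo_output: str, marker: str = "RHEL9 upgrade") -> list[str]:
    """Return the node list field of every line whose reason contains the marker."""
    nodes = []
    for line in sinfo_output.splitlines():
        if marker in line:
            # Assumption: the node list is the last column of the line.
            nodes.append(line.split()[-1])
    return nodes


# Hypothetical sample text; node names and timestamps are invented.
sample = """\
REASON               USER      TIMESTAMP           STATE  NODELIST
RHEL9 upgrade        root      2026-07-20T09:00:00 drain  examplenode00
Not responding       slurm     2026-07-21T13:30:00 down*  examplenode01
"""
print(nodes_being_upgraded(sample))
```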

We will generally be prioritizing upgrades for nodes based on how available they are across various partitions; nodes that are only available in partitions that contain large numbers of users for a lab/center, e.g., cbcb, clip, cml-dpart, gamma, vulcan-ampere, vulcan-dpart, etc., and corresponding "scavenger" named partitions, will be prioritized over nodes that are only or are also available in faculty-specific / limited-node partitions. All nodes in the tron partition will also generally be prioritized.

If you are a faculty member authoritative for your own partition or a small group's limited-node partition and have scheduling concerns for the nodes in these partitions, please [[HelpDesk | contact staff]] ASAP to let us know about these concerns and we will make our best effort to accommodate them.

==Interoperability==
===Software and Modules===
Please begin transitioning your [[PythonVirtualEnv | virtual environments]], workflows, etc. to work with RHEL9 as soon as possible. You can use the '01' submission node that you have access to for transitioning and light testing - as always, [[Nexus/Submission_Node_Policy | please do not run any computationally intensive processes on this node]]. It is intended to be a host for configuring environments/workflows and submitting jobs only.

The [[Modules | module tree]] for RHEL9 has already been populated with a large number of the same modules that are available in the RHEL8 module tree, although specific modules may have different versions available in the RHEL9 tree as compared to the RHEL8 tree. If you have a dependency on a specific version of a module that is not available in the RHEL9 tree, please [[HelpDesk | contact staff]] and we can get one created.

===SLURM Scheduling===
If you want or need to schedule a job on only nodes running RHEL8 (or RHEL9, once you have validated whatever is relevant), you can use the submission arguments <code>--prefer=rhel#</code> or <code>--constraint=rhel#</code> in your job arguments to specify this, where # is replaced by the OS version number. The <code>--prefer</code> argument is a soft limitation on which nodes the job can be scheduled on and the <code>--constraint</code> argument is a hard limitation. For example, if you use the argument <code>--prefer=rhel8</code> but there are no RHEL8 nodes available at present (with your other submission arguments also satisfied) in the partition you are submitting to, the job will be scheduled on an appropriate RHEL9 node if that would result in an earlier (or immediate) start time. With <code>--constraint=rhel8</code>, the job will instead wait until a suitable RHEL8 node is available.
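For workflows that build submission commands programmatically, the choice between the two arguments can be sketched as below. The helper name <code>os_feature_args</code> and the script name <code>job.sh</code> are hypothetical; only the <code>--prefer</code> and <code>--constraint</code> arguments and the <code>rhel8</code>/<code>rhel9</code> feature names come from this page.

```python
def os_feature_args(version: int, hard: bool) -> list[str]:
    """Build the submission argument that selects nodes by RHEL version.

    hard=True gives a hard limitation (--constraint);
    hard=False gives a soft preference (--prefer).
    """
    flag = "--constraint" if hard else "--prefer"
    return [f"{flag}=rhel{version}"]


# job.sh is a placeholder for your own batch script.
cmd = ["sbatch", *os_feature_args(9, hard=False), "job.sh"]
print(" ".join(cmd))
```

A soft preference is usually the safer default while nodes are still being upgraded, since the job can still start on the other OS version if that is all that is free.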

Nexus/GPUs

2026-07-17T18:01:26Z

Mbaney:

There are several different types of [https://www.nvidia.com/en-us/ NVIDIA] GPUs in the [[Nexus]] cluster that are available to be scheduled. They are listed below in order of newest to oldest architecture, and then alphanumerically by name.

The exact quantities of GPUs per type are not listed here since these numbers may change frequently due to additions to or removals from the cluster or during compute node troubleshooting. To see which compute nodes have which GPUs and in what quantities, use the <code>show_nodes</code> command on a submission or compute node. The quantities are listed under the <tt>GRES</tt> column.

{| class="wikitable sortable"
! Name
! GRES string ([[SLURM]])
! [https://www.nvidia.com/en-us/technologies Architecture]
! [https://developer.nvidia.com/cuda-toolkit CUDA] Cores
! Memory Amount and Type
! Memory Bandwidth
! FP32 Performance (TFLOPS)
! [https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/ TF32] Performance ([https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt Dense / Sparse TFLOPS])
|-
| RTX PRO 6000 Blackwell Max-Q Workstation Edition
| <code>rtx6000bw-mq</code>
| Blackwell
| 24064
| 96GB GDDR7
| 1.79TB/s
| 109.7
| 219.5/438.9
|-
| L40S
| <code>l40s</code>
| Ada Lovelace
| 18176
| 48GB GDDR6
| 864GB/s
| 91.6
| 183/366
|-
| RTX 6000 Ada Generation
| <code>rtx6000ada</code>
| Ada Lovelace
| 18176
| 48GB GDDR6
| 960GB/s
| 91.1
| 182.1/364.2
|-
| H100 NVLink [0]
| <code>h100-nvl</code>
| Hopper
| 33792
| 188GB HBM3
| 7.87TB/s
| 133.8
| not officially published/1671
|-
| H100 SXM
| <code>h100-sxm</code>
| Hopper
| 16896
| 80GB HBM3
| 3.35TB/s
| 66.9
| not officially published/989
|-
| H200 SXM
| <code>h200-sxm</code>
| Hopper
| 16896
| 141GB HBM3e
| 4.89TB/s
| 66.9
| not officially published/989
|-
| A100 PCIe 80GB
| <code>a100</code>
| Ampere
| 6912
| 80GB HBM2e
| 1.94TB/s
| 19.5
| 156/312
|-
| A100 SXM 80GB
| <code>a100</code>
| Ampere
| 6912
| 80GB HBM2e
| 2.04TB/s
| 19.5
| 156/312
|-
| GeForce RTX 3070
| <code>rtx3070</code>
| Ampere
| 5888
| 8GB GDDR6
| 448GB/s
| 20.3
| 20.3/40.6
|-
| GeForce RTX 3090
| <code>rtx3090</code>
| Ampere
| 10496
| 24GB GDDR6X
| 936GB/s
| 35.6
| 35.6/71
|-
| RTX A4000
| <code>rtxa4000</code>
| Ampere
| 6144
| 16GB GDDR6
| 448GB/s
| 19.2
| not officially published/not officially published
|-
| RTX A5000
| <code>rtxa5000</code>
| Ampere
| 8192
| 24GB GDDR6
| 768GB/s
| 27.8
| not officially published/not officially published
|-
| RTX A6000
| <code>rtxa6000</code>
| Ampere
| 10752
| 48GB GDDR6
| 768GB/s
| 38.7
| 77.4/154.8
|-
| GeForce RTX 2080 Ti
| <code>rtx2080ti</code>
| Turing
| 4352
| 11GB GDDR6
| 616GB/s
| 13.4
| n/a
|-
| GeForce GTX 1080 Ti
| <code>gtx1080ti</code>
| Pascal
| 3584
| 11GB GDDR5X
| 484GB/s
| 11.3
| n/a
|-
| Quadro P6000
| <code>p6000</code>
| Pascal
| 3840
| 24GB GDDR5X
| 432GB/s
| 12.6
| n/a
|-
| Tesla P100
| <code>p100</code>
| Pascal
| 3584
| 16GB CoWoS HBM2
| 732GB/s
| 9.3
| n/a
|-
| TITAN X (Pascal)
| <code>titanxpascal</code>
| Pascal
| 3584
| 12GB GDDR5X
| 480GB/s
| 11.0
| n/a
|-
| TITAN Xp
| <code>titanxp</code>
| Pascal
| 3840
| 12GB GDDR5X
| 548GB/s
| 12.1
| n/a
|-
| GeForce GTX TITAN X
| <code>gtxtitanx</code>
| Maxwell
| 3072
| 12GB GDDR5
| 336GB/s
| 6.7
| n/a
|-
|}

[0] - This GPU type is actually a pair of physical cards connected over [https://www.nvidia.com/en-us/data-center/nvlink NVLink] bridges. NVIDIA's published specifications for this GPU type are for one physical card, so the values listed here are double NVIDIA's published values.
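The GRES strings in the table can be used to pick a GPU type by its specifications. The sketch below copies the memory sizes of a few rows from the table and builds a request of the form <code>--gres=gpu:TYPE:COUNT</code>, which is SLURM's general syntax for typed GPU requests; that this exact form is how GPUs should be requested on Nexus is an assumption here, and the helper names are made up for illustration.

```python
# (GRES string, memory in GB), copied from a few rows of the table above.
GPUS = [
    ("rtx6000bw-mq", 96),
    ("l40s", 48),
    ("h200-sxm", 141),
    ("rtxa5000", 24),
    ("rtx2080ti", 11),
]


def gres_with_memory(min_gb: int) -> list[str]:
    """Return the GRES strings of listed GPUs with at least min_gb of memory."""
    return [name for name, gb in GPUS if gb >= min_gb]


def gres_arg(gres: str, count: int = 1) -> str:
    """Build a typed GPU request argument for a job submission."""
    return f"--gres=gpu:{gres}:{count}"


for name in gres_with_memory(48):
    print(gres_arg(name))
```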

CML

2026-07-17T17:55:06Z

Mbaney:

The Center for Machine Learning ([https://ml.umd.edu CML]) at the University of Maryland is located within the Institute for Advanced Computer Studies. The CML has a cluster of computational (CPU/GPU) resources that are available to be scheduled.

=Getting Started=
* [[SLURM/JobSubmission | Submitting Jobs]]
* [[SLURM/JobStatus | Checking Job Status]]
* [[Nexus/CML#Storage | Data Storage]]

Nexus/Vulcan

2026-07-17T17:52:38Z

Mbaney: /* Partitions */

The compute nodes from Vulcan's previous standalone cluster were folded into [[Nexus]] as of the scheduled [[MonthlyMaintenanceWindow | maintenance window]] for August 2023 (Thursday 08/17/2023, 5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

==Usage==
You can [[SSH]] to <code>nexusvulcan.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexusvulcan00.umiacs.umd.edu</code>
* <code>nexusvulcan01.umiacs.umd.edu</code>

Vulcan users (exclusively) can schedule non-interruptible jobs on Vulcan nodes with any non-scavenger job parameters. Please note that the <code>vulcan-dpart</code> partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on all vulcan## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a maximum submission limit of 500 simultaneous jobs per user so as to not overload the cluster. These limits are codified by the partition QoS named '''vulcan'''.

Please note that the Vulcan compute nodes are also in the institute-wide <code>scavenger</code> partition in Nexus. Vulcan users still have scavenging priority over these nodes via the <code>vulcan-scavenger</code> partition: jobs in any <code>vulcan-</code> partition other than <code>vulcan-scavenger</code> can preempt both <code>vulcan-scavenger</code> and <code>scavenger</code> partition jobs, and <code>vulcan-scavenger</code> partition jobs can preempt <code>scavenger</code> partition jobs.
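The preemption ordering described above can be written down as three tiers. The sketch below is only a model of that description, not how SLURM is configured on the cluster; the function names are hypothetical, and partitions that the paragraph does not cover (for example <code>vulcan-scavenger-multi</code>) are deliberately rejected rather than guessed at.

```python
def tier(partition: str) -> int:
    """Map a partition name to its preemption tier (higher preempts lower)."""
    if partition == "scavenger":
        return 0
    if partition == "vulcan-scavenger":
        return 1
    if partition.startswith("vulcan-") and "scavenger" not in partition:
        return 2
    # Anything else is outside what the paragraph above describes.
    raise ValueError(f"partition not covered by this sketch: {partition}")


def can_preempt(job_partition: str, other_partition: str) -> bool:
    """Return True if jobs in job_partition can preempt jobs in other_partition."""
    return tier(job_partition) > tier(other_partition)


print(can_preempt("vulcan-dpart", "scavenger"))
print(can_preempt("vulcan-scavenger", "vulcan-dpart"))
```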

==Compute Nodes==
There are currently 22 [[Nexus/Vulcan/GPUs | GPU nodes]] available, named vulcan[23-24,27-46], running a mixture of NVIDIA H200, NVIDIA RTX A6000, NVIDIA RTX A5000, NVIDIA RTX A4000, and NVIDIA GeForce RTX 2080 Ti cards. There are also 2 CPU-only nodes available, named brigid[16-17].

All nodes are scheduled with the [[SLURM]] resource manager.

==Network==
The network infrastructure supporting the Vulcan partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* brigid[16-17],vulcan[29-45]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* vulcan46: Four 100GbE links, two to each switch in the pair (redundancy and increased bandwidth).
# One pair of network switches connected to the above pair of switches through several sets of intermediary switches, and to each other via dual 10GbE links for redundancy. The immediate connection to these sets of intermediary switches is via two 40GbE links to a pair of them, one between the first two switches in each pair and one between the second two switches in each pair for redundancy. This pair serves the following compute nodes:
#* vulcan[23-24,27-28]: Two 10GbE links per node, one to each switch in the pair (redundancy).

The fileserver hosting all Vulcan [[Nexus/Vulcan#Scratch_Directories | scratch]], [[Nexus/Vulcan#Project_Storage | project]], and [[Nexus/Vulcan#Datasets | dataset]] allocations first connects to a pair of intermediary switches and then the first pair of switches mentioned [[Nexus/Tron#Network | here (Tron page's network section)]]. It then connects to the first pair of switches mentioned on this page through a set of four (different) intermediary switches. The last hop from the four intermediary switches to the first pair of switches mentioned on this page is via 32 100GbE links, four from each switch in the set to each switch in the first pair mentioned on this page for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

==Partitions==
There are three partitions available to general Vulcan [[SLURM]] users. You must specify a partition when submitting your job.

* '''vulcan-dpart''' - This is the default partition. Job allocations are guaranteed. Only nodes with GPUs from architectures older than NVIDIA's [https://www.nvidia.com/en-us/data-center/ampere-architecture/ Ampere architecture] are included in this partition.
* '''vulcan-scavenger''' - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other non-scavenger-named <code>vulcan-</code> partitions are ready to be scheduled.
* '''vulcan-cpu''' - This partition is for CPU-focused jobs. Job allocations are guaranteed.

There are a few additional partitions available to subsets of Vulcan users based on specific requirements.

* '''vulcan-ampere''' - This partition contains nodes with GPUs from NVIDIA's [https://www.nvidia.com/en-us/data-center/ampere-architecture/ Ampere architecture] or a newer architecture. Job allocations are guaranteed. Please be aware of the following restrictions on this partition:
*: ''Time limit'': there is a 4 hour time limit on interactive jobs in this partition. If you need to run longer jobs, you will need to modify your workflow into a job that can be submitted as a batch script.
*: ''CPU/memory per GPU limit'': there is a limit of 4 CPUs and 48G memory maximum per non-H200 GPU requested by a job, and 16 CPUs and 256G memory maximum per H200 GPU requested by a job. If you need to run jobs with more CPUs/memory, you will either need to request more GPUs in the job or use a different partition.

: Submission is restricted to the Slurm [[#Accounts | accounts]] of the faculty who invested in these nodes:
:* Abhinav Shrivastava (vulcan-abhinav)
:* Jia-Bin Huang (vulcan-jbhuang)
:* Christopher Metzler (vulcan-metzler)
:* Ruoshi Liu (vulcan-ruoshi)
:* Matthias Zwicker (vulcan-zwicker)

* '''vulcan-scavenger-multi''' - This partition allows multi-node jobs (up to 9 total nodes per job) and allows jobs more resources than the vulcan-scavenger partition, but only contains nodes with RTX 2080 Ti GPUs in them. As with vulcan-scavenger, it is preemptable when jobs in other non-scavenger-named <code>vulcan-</code> partitions are ready to be scheduled.
*: Access to this partition is on a per-use basis. Please contact Abhinav Shrivastava if you would like to be granted access to this partition.
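The per-GPU CPU/memory limits on the <code>vulcan-ampere</code> partition lend themselves to a quick calculation before submitting. The sketch below uses the figures stated above (4 CPUs and 48G per non-H200 GPU, 16 CPUs and 256G per H200 GPU); the function names are made up, and it assumes a job requests only one kind of GPU, since how the limits combine for mixed requests is not described here.

```python
from math import ceil

# (max CPUs, max memory in GB) per requested GPU, from the limits above.
PER_GPU_LIMITS = {True: (16, 256), False: (4, 48)}  # keyed by "is H200?"


def max_request(gpus: int, h200: bool = False) -> tuple[int, int]:
    """Return the maximum (CPUs, memory in GB) a job with this many GPUs may request."""
    cpus, mem_gb = PER_GPU_LIMITS[h200]
    return gpus * cpus, gpus * mem_gb


def min_gpus(cpus: int, mem_gb: int, h200: bool = False) -> int:
    """Return the fewest GPUs a job must request to be allowed this much CPU and memory."""
    per_cpus, per_mem = PER_GPU_LIMITS[h200]
    return max(1, ceil(cpus / per_cpus), ceil(mem_gb / per_mem))


print(max_request(2))        # limits for a job requesting 2 non-H200 GPUs
print(min_gpus(10, 64))      # GPUs needed for 10 CPUs and 64G on non-H200 nodes
```

If the resulting GPU count is more than the job actually needs, a different partition is likely the better fit.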

There is one additional partition available solely to Dr. Ramani Duraiswami's sponsored accounts.

* '''vulcan-ramani''' - This partition is for exclusive priority access to Dr. Duraiswami's purchased GPU nodes. Job allocations are guaranteed.

==Accounts==
Vulcan has a base SLURM account <code>vulcan</code> which has a modest number of guaranteed billing resources available to all cluster users at any given time. Other faculty that have invested in Vulcan compute infrastructure have an additional account provided to their sponsored accounts on the cluster.

If you do not specify an account when submitting your job, you will receive the '''vulcan''' account. If your faculty sponsor has their own account, it is recommended to use that account for job submission.

The current faculty accounts are:
* vulcan-abhinav
* vulcan-djacobs
* vulcan-jbhuang
* vulcan-metzler
* vulcan-rama
* vulcan-ramani
* vulcan-ruoshi
* vulcan-yaser
* vulcan-zwicker

<pre>
$ sacctmgr show account format=account%20,description%30,organization%10
             Account                          Descr        Org
-------------------- ------------------------------ ----------
                 ...                            ...        ...
              vulcan                         vulcan     vulcan
      vulcan-abhinav   vulcan - abhinav shrivastava     vulcan
      vulcan-djacobs          vulcan - david jacobs     vulcan
      vulcan-jbhuang         vulcan - jia-bin huang     vulcan
      vulcan-metzler         vulcan - chris metzler     vulcan
         vulcan-rama        vulcan - rama chellappa     vulcan
       vulcan-ramani     vulcan - ramani duraiswami     vulcan
       vulcan-ruoshi            vulcan - ruoshi liu     vulcan
        vulcan-yaser          vulcan - yaser yacoob     vulcan
      vulcan-zwicker      vulcan - matthias zwicker     vulcan
                 ...                            ...        ...
</pre>

Faculty can manage this list of users via our [https://intranet.umiacs.umd.edu/directory/secgroup/ Directory application] in the Security Groups section. The security group that controls access has the prefix <code>vulcan_</code> and then the faculty username. It will also list <code>slurm://nexusctl.umiacs.umd.edu</code> as the associated URI.

You can check your account associations by running the '''show_assoc''' command to see the accounts you are associated with. Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association.

<pre>
$ show_assoc
      User          Account MaxJobs       GrpTRES                                                                              QOS
---------- ---------------- ------- ------------- --------------------------------------------------------------------------------
       ...              ...     ...                                                                                            ...
   abhinav           vulcan      48                                       vulcan-cpu,vulcan-default,vulcan-medium,vulcan-scavenger
   abhinav   vulcan-abhinav      48                           vulcan-cpu,vulcan-default,vulcan-high,vulcan-medium,vulcan-scavenger
       ...              ...     ...                                                                                            ...
</pre>

You can also see the total number of trackable resources (TRES) allowed for each account by running the following command. Make sure to specify the account you are interested in. As shown below, there is a concurrent limit of 64 total GPUs for all users not in a contributing faculty group.

<pre>
$ sacctmgr show assoc account=vulcan format=user,account,qos,grptres
      User    Account                  QOS       GrpTRES
---------- ---------- -------------------- -------------
               vulcan                        gres/gpu=64
       ...        ...
</pre>

==QoS==
Vulcan currently has 3 QoS for the '''vulcan-dpart''' partition, 1 QoS for the '''vulcan-scavenger''' partition, and 1 QoS for the '''vulcan-cpu''' partition. If you do not specify a QoS when submitting your job using the <code>--qos</code> parameter, you will receive the <code>vulcan-default</code> QoS assuming you are using a Vulcan account.

The important point is that each QoS can have a different maximum wall time, a different total number of jobs running at once, and a different maximum number of trackable resources (TRES) per job. Some QoS are additionally restricted by the total number of TRES per user across multiple jobs (the <code>MaxTRESPU</code> column below).

<pre>
$ show_qos --all | grep vulcan
                     Name     MaxWall                        MaxTRES MaxJobsPU                      MaxTRESPU
------------------------- ----------- ------------------------------ --------- ------------------------------
                      ...
               vulcan-cpu  2-00:00:00                cpu=1024,mem=4T         4
           vulcan-default  7-00:00:00       cpu=4,gres/gpu=1,mem=32G         2
            vulcan-exempt  7-00:00:00     cpu=32,gres/gpu=8,mem=256G         2
              vulcan-high  1-12:00:00     cpu=16,gres/gpu=4,mem=128G         2
         vulcan-high_long 14-00:00:00              cpu=32,gres/gpu=8         8
            vulcan-medium  3-00:00:00       cpu=8,gres/gpu=2,mem=64G         2
            vulcan-sailon  3-00:00:00     cpu=32,gres/gpu=8,mem=256G                              gres/gpu=48
         vulcan-scavenger  3-00:00:00     cpu=32,gres/gpu=8,mem=256G
   vulcan-scavenger-multi  3-00:00:00  cpu=288,gres/gpu=72,mem=1152G
                      ...
</pre>

<pre>
$ show_partition_qos --all | grep vulcan
                     Name MaxSubmitPU                      MaxTRESPU              GrpTRES
------------------------- ----------- ------------------------------ --------------------
                      ...
                   vulcan         500                                 cpu=1760,mem=15824G
            vulcan-ampere         500
               vulcan-cpu         500
            vulcan-ramani         500
         vulcan-scavenger         500
   vulcan-scavenger-multi         500
                      ...
</pre>
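
To connect the limits in the tables above to an actual request, here is a minimal sketch that only prints (it does not run) an interactive <code>srun</code> command line. The helper name <code>vulcan_srun_cmd</code> is made up for this example, and the resource values are illustrative: they are chosen to sit within the <code>vulcan-medium</code> row above (cpu=8, gres/gpu=2, mem=64G, 3-day wall time) and assume the base <code>vulcan</code> account on the <code>vulcan-dpart</code> partition.

```shell
# Hypothetical helper: print an interactive srun command for a given QoS.
# Arguments: QoS name, GPU count, CPU count, memory, wall time.
vulcan_srun_cmd() {
    local qos=$1 gpus=$2 cpus=$3 mem=$4 walltime=$5
    echo "srun --pty --partition=vulcan-dpart --account=vulcan" \
         "--qos=$qos --gres=gpu:$gpus --cpus-per-task=$cpus" \
         "--mem=$mem --time=$walltime bash"
}

# A request at the upper bound of vulcan-medium's per-job TRES, for one day.
vulcan_srun_cmd vulcan-medium 2 8 64G 1-00:00:00
```

Asking for more than the selected QoS allows (for example, 3 GPUs under <code>vulcan-medium</code>) would exceed the <code>MaxTRES</code> value shown for that QoS.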

==Storage==
Vulcan has the following storage available. Please also review UMIACS [[FilesystemDataStorage | Filesystem Data Storage]] policies including any volume that is labeled as scratch.

Vulcan users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

===Home Directories===
{{Nfshomes}}

===Scratch Directories===
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the Vulcan compute infrastructure:
* Network scratch directory
* Local scratch directories

====Network Scratch Directory====
You have 300GB of scratch storage available at <code>/vulcanscratch/<username></code>. '''It is not backed up or protected in any way.''' This directory is '''automounted''' so you will need to <code>cd</code> into the directory or request/specify a fully qualified file path to access this.

You may request a temporary increase of up to 500GB total space for a maximum of 120 days without any faculty approval by [[HelpDesk | contacting staff]]. Once the temporary increase period is over, you will be contacted and given a one-week window of opportunity to clean and secure your data before staff will forcibly remove data to get your space back under 300GB. If you need space beyond 500GB or for longer than 120 days, you will need faculty approval and/or a project directory.

This file system is available on all submission and computational nodes within the cluster.

====Local Scratch Directories====
Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. These are almost always more performant than any other storage available to the job. However, you must stage your data in within the confines of your job and stage it out before the end of your job.

These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month at 1am. Different nodes will run the maintenance jobs on different days of the month to ensure the cluster is still highly available at all times. Please make sure you secure any data you write to these directories at the end of your job.
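
The stage-in/stage-out pattern described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the name <code>run_with_local_scratch</code> and its argument order are hypothetical, and the paths you pass in are your own choice.

```shell
# Hypothetical helper: copy inputs to node-local scratch, run a command
# there, then copy everything back and free the local space.
# Usage: run_with_local_scratch <src_dir> <local_dir> <dest_dir> <command...>
run_with_local_scratch() {
    local src=$1 local_dir=$2 dest=$3
    shift 3
    mkdir -p "$local_dir" "$dest" || return 1
    cp -R "$src"/. "$local_dir"/ || return 1    # stage in
    ( cd "$local_dir" && "$@" )                 # run the job's command
    local rc=$?
    cp -R "$local_dir"/. "$dest"/ || return 1   # stage out (inputs and outputs)
    rm -rf "$local_dir"                         # clean up local scratch
    return $rc
}
```

Inside a batch script this might be called as <code>run_with_local_scratch /vulcanscratch/$USER/input /scratch0/$USER/$SLURM_JOB_ID /vulcanscratch/$USER/output ./my_program</code>, where the directory layout and <code>./my_program</code> are placeholders. Staging out before the job ends matters because data left behind on a node is only reachable from that node and is subject to the cleanup described above.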

===Datasets===
We have read-only dataset storage available at <code>/fs/vulcan-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of Vulcan datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=Vulcan here].

===Project Storage===
Users within the Vulcan compute infrastructure can request project based allocations for up to 10TB for up to 180 days by [[HelpDesk | contacting staff]] with approval from the Vulcan faculty manager (Dr. Shrivastava). These allocations will be available from <code>/fs/vulcan-projects</code> under a name that you provide when you request the allocation. Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 180 days (requires re-approval from Dr. Shrivastava).
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then remove the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but the original faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and the faculty approver responding, staff will remove the allocation.

Project storage is fully protected. It has [[Snapshots | snapshots]] enabled and is [[NightlyBackups | backed up nightly]].

===Object Storage===
All Vulcan users can request project allocations in the [https://obj.umiacs.umd.edu/obj/help UMIACS Object Store]. Please [[HelpDesk | contact staff]] with a short project name and the amount of storage you will need to get started.

To access this storage, you'll need to use a [[S3Clients | S3 client]] or our [[UMobj]] command line utilities.

An example of how to use the umobj command line utilities can be found [[UMobj/Example | here]]. A full set of documentation for the utilities can be found on the [https://gitlab.umiacs.umd.edu/staff/umobj/blob/master/README.md#umobj umobj GitLab page].

Nexus/Vulcan

2026-07-17T17:51:58Z

Mbaney:

The compute nodes from Vulcan's previous standalone cluster have folded into [[Nexus]] as of the scheduled [[MonthlyMaintenanceWindow | maintenance window]] for August 2023 (Thursday 08/17/2023, 5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

==Usage==
You can [[SSH]] to <code>nexusvulcan.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexusvulcan00.umiacs.umd.edu</code>
* <code>nexusvulcan01.umiacs.umd.edu</code>

Vulcan users (exclusively) can schedule non-interruptible jobs on Vulcan nodes with any non-scavenger job parameters. Please note that the <code>vulcan-dpart</code> partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on all vulcan## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a maximum submission limit of 500 jobs per user at any one time so as not to overload the cluster. These limits are codified by the partition QoS named '''vulcan'''.

Please note that the Vulcan compute nodes are also in the institute-wide <code>scavenger</code> partition in Nexus. Vulcan users still have scavenging priority over these nodes via the <code>vulcan-scavenger</code> partition (i.e., all <code>vulcan-</code> partition jobs (other than <code>vulcan-scavenger</code>) can preempt both <code>vulcan-scavenger</code> and <code>scavenger</code> partition jobs, and <code>vulcan-scavenger</code> partition jobs can preempt <code>scavenger</code> partition jobs).
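
The preemption ordering described above can be modeled as three tiers. The sketch below is purely illustrative and is not how Slurm itself is configured; the function names <code>preempt_tier</code> and <code>can_preempt</code> are hypothetical, and <code>vulcan-scavenger-multi</code> is deliberately treated as unknown because the paragraph above does not say where it falls.

```shell
# Hypothetical model of the ordering described above:
#   tier 2: non-scavenger vulcan- partitions
#   tier 1: vulcan-scavenger
#   tier 0: institute-wide scavenger
preempt_tier() {
    case "$1" in
        scavenger)          echo 0 ;;
        vulcan-scavenger)   echo 1 ;;
        vulcan-scavenger-*) echo unknown ;;   # not covered by this model
        vulcan-*)           echo 2 ;;
        *)                  echo unknown ;;
    esac
}

# Succeeds (exit status 0) if a job in partition $1 can preempt one in $2.
can_preempt() {
    local a b
    a=$(preempt_tier "$1")
    b=$(preempt_tier "$2")
    [ "$a" != unknown ] && [ "$b" != unknown ] && [ "$a" -gt "$b" ]
}

can_preempt vulcan-dpart scavenger && echo "vulcan-dpart can preempt scavenger"
```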

==Compute Nodes==
There are currently 22 [[Nexus/Vulcan/GPUs | GPU nodes]] available, named vulcan[23-24,27-46], running a mixture of NVIDIA H200, NVIDIA RTX A6000, NVIDIA RTX A5000, NVIDIA RTX A4000, and NVIDIA GeForce RTX 2080 Ti cards. There are also 2 CPU-only nodes available, named brigid[16-17].

All nodes are scheduled with the [[SLURM]] resource manager.

==Network==
The network infrastructure supporting the Vulcan partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* brigid[16-17],vulcan[29-45]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* vulcan46: Four 100GbE links, two to each switch in the pair (redundancy and increased bandwidth).
# One pair of network switches connected to the above pair of switches through several sets of intermediary switches, and to each other via dual 10GbE links for redundancy. The immediate connection to these sets of intermediary switches is via two 40GbE links to a pair of them, one between the first two switches in each pair and one between the second two switches in each pair for redundancy. This pair serves the following compute nodes:
#* vulcan[23-24,27-28]: Two 10GbE links per node, one to each switch in the pair (redundancy).

The fileserver hosting all Vulcan [[Nexus/Vulcan#Scratch_Directories | scratch]], [[Nexus/Vulcan#Project_Storage | project]], and [[Nexus/Vulcan#Datasets | dataset]] allocations first connects to a pair of intermediary switches and then the first pair of switches mentioned [[Nexus/Tron#Network | here (Tron page's network section)]]. It then connects to the first pair of switches mentioned on this page through a set of four (different) intermediary switches. The last hop from the four intermediary switches to the first pair of switches mentioned on this page is via 32 100GbE links, four from each switch in the set to each switch in the first pair mentioned on this page for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

==Partitions==
There are three partitions available to general Vulcan [[SLURM]] users. You must specify a partition when submitting your job.

* '''vulcan-dpart''' - This is the default partition. Job allocations are guaranteed. Only nodes with GPUs from architectures older than NVIDIA's [https://www.nvidia.com/en-us/data-center/ampere-architecture/ Ampere architecture] are included in this partition.
* '''vulcan-scavenger''' - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other non-scavenger-named <code>vulcan-</code> partitions are ready to be scheduled.
* '''vulcan-cpu''' - This partition is for CPU focused jobs. Job allocations are guaranteed.

There are a few additional partitions available to subsets of Vulcan users based on specific requirements.

* '''vulcan-ampere''' - This partition contains nodes with GPUs from NVIDIA's [https://www.nvidia.com/en-us/data-center/ampere-architecture/ Ampere architecture] or a newer architecture. Job allocations are guaranteed. Please be aware of the following restrictions on this partition:
*: ''Time limit'': there is a 4 hour time limit on interactive jobs in this partition. If you need to run longer jobs, you will need to modify your workflow into a job that can be submitted as a batch script.
*: ''CPU/memory per GPU limit'': there is a limit of 4 CPUs and 48G memory maximum per non-H200 GPU requested by a job, and 16 CPUs and 256G memory maximum per H200 GPU requested by a job. If you need to run jobs with more CPUs/memory, you will either need to request more GPUs in the job or use a different partition.

: Submission is restricted to the Slurm [[#Accounts | accounts]] of the faculty who invested in these nodes:
:* Abhinav Shrivastava (vulcan-abhinav)
:* Jia-Bin Huang (vulcan-jbhuang)
:* Christopher Metzler (vulcan-metzler)
:* Ruoshi Liu (vulcan-ruoshi)
:* Matthias Zwicker (vulcan-zwicker)

* '''vulcan-scavenger-multi''' - This partition allows multi-node jobs (up to 9 total nodes per job) and allows jobs more resources than the vulcan-scavenger partition, but only contains nodes with GTX 1080 Ti, TITAN Xp, and/or RTX 2080 Ti GPUs in them. As with vulcan-scavenger, it is preemptable when jobs in other non-scavenger-named <code>vulcan-</code> partitions are ready to be scheduled.
*: Access to this partition is on a per-use basis. Please contact Abhinav Shrivastava if you would like to be granted access to this partition.

There is one additional partition available solely to Dr. Ramani Duraiswami's sponsored accounts.

* '''vulcan-ramani''' - This partition is for exclusive priority access to Dr. Duraiswami's purchased GPU nodes. Job allocations are guaranteed.

==Accounts==
Vulcan has a base SLURM account <code>vulcan</code> which has a modest number of guaranteed billing resources available to all cluster users at any given time. Other faculty that have invested in Vulcan compute infrastructure have an additional account provided to their sponsored accounts on the cluster.

If you do not specify an account when submitting your job, you will receive the '''vulcan''' account. If your faculty sponsor has their own account, it is recommended to use that account for job submission.

The current faculty accounts are:
* vulcan-abhinav
* vulcan-djacobs
* vulcan-jbhuang
* vulcan-metzler
* vulcan-rama
* vulcan-ramani
* vulcan-ruoshi
* vulcan-yaser
* vulcan-zwicker

<pre>
$ sacctmgr show account format=account%20,description%30,organization%10
             Account                          Descr        Org
-------------------- ------------------------------ ----------
                 ...                            ...        ...
              vulcan                         vulcan     vulcan
      vulcan-abhinav   vulcan - abhinav shrivastava     vulcan
      vulcan-djacobs          vulcan - david jacobs     vulcan
      vulcan-jbhuang         vulcan - jia-bin huang     vulcan
      vulcan-metzler         vulcan - chris metzler     vulcan
         vulcan-rama        vulcan - rama chellappa     vulcan
       vulcan-ramani     vulcan - ramani duraiswami     vulcan
       vulcan-ruoshi            vulcan - ruoshi liu     vulcan
        vulcan-yaser          vulcan - yaser yacoob     vulcan
      vulcan-zwicker      vulcan - matthias zwicker     vulcan
                 ...                            ...        ...
</pre>

Faculty can manage this list of users via our [https://intranet.umiacs.umd.edu/directory/secgroup/ Directory application] in the Security Groups section. The security group that controls access has the prefix <code>vulcan_</code> and then the faculty username. It will also list <code>slurm://nexusctl.umiacs.umd.edu</code> as the associated URI.

You can check your account associations by running the '''show_assoc''' command to see the accounts you are associated with. Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association.

<pre>
$ show_assoc
      User          Account MaxJobs       GrpTRES                                                                              QOS
---------- ---------------- ------- ------------- --------------------------------------------------------------------------------
       ...              ...     ...                                                                                            ...
   abhinav           vulcan      48                                       vulcan-cpu,vulcan-default,vulcan-medium,vulcan-scavenger
   abhinav   vulcan-abhinav      48                           vulcan-cpu,vulcan-default,vulcan-high,vulcan-medium,vulcan-scavenger
       ...              ...     ...                                                                                            ...
</pre>

You can also see the total number of trackable resources (TRES) allowed for each account by running the following command. Make sure to specify the account you are interested in. As shown below, there is a concurrent limit of 64 total GPUs for all users not in a contributing faculty group.

<pre>
$ sacctmgr show assoc account=vulcan format=user,account,qos,grptres
      User    Account                  QOS       GrpTRES
---------- ---------- -------------------- -------------
               vulcan                        gres/gpu=64
       ...        ...
</pre>

==QoS==
Vulcan currently has 3 QoS for the '''vulcan-dpart''' partition, 1 QoS for the '''vulcan-scavenger''' partition, and 1 QoS for the '''vulcan-cpu''' partition. If you do not specify a QoS when submitting your job using the <code>--qos</code> parameter, you will receive the <code>vulcan-default</code> QoS assuming you are using a Vulcan account.

The important point is that each QoS can have a different maximum wall time, a different total number of jobs running at once, and a different maximum number of trackable resources (TRES) per job. Some QoS are additionally restricted by the total number of TRES per user across multiple jobs (the <code>MaxTRESPU</code> column below).

<pre>
$ show_qos --all | grep vulcan
                     Name     MaxWall                        MaxTRES MaxJobsPU                      MaxTRESPU
------------------------- ----------- ------------------------------ --------- ------------------------------
                      ...
               vulcan-cpu  2-00:00:00                cpu=1024,mem=4T         4
           vulcan-default  7-00:00:00       cpu=4,gres/gpu=1,mem=32G         2
            vulcan-exempt  7-00:00:00     cpu=32,gres/gpu=8,mem=256G         2
              vulcan-high  1-12:00:00     cpu=16,gres/gpu=4,mem=128G         2
         vulcan-high_long 14-00:00:00              cpu=32,gres/gpu=8         8
            vulcan-medium  3-00:00:00       cpu=8,gres/gpu=2,mem=64G         2
            vulcan-sailon  3-00:00:00     cpu=32,gres/gpu=8,mem=256G                              gres/gpu=48
         vulcan-scavenger  3-00:00:00     cpu=32,gres/gpu=8,mem=256G
   vulcan-scavenger-multi  3-00:00:00  cpu=288,gres/gpu=72,mem=1152G
                      ...
</pre>

<pre>
$ show_partition_qos --all | grep vulcan
                     Name MaxSubmitPU                      MaxTRESPU              GrpTRES
------------------------- ----------- ------------------------------ --------------------
                      ...
                   vulcan         500                                 cpu=1760,mem=15824G
            vulcan-ampere         500
               vulcan-cpu         500
            vulcan-ramani         500
         vulcan-scavenger         500
   vulcan-scavenger-multi         500
                      ...
</pre>

==Storage==
Vulcan has the following storage available. Please also review UMIACS [[FilesystemDataStorage | Filesystem Data Storage]] policies including any volume that is labeled as scratch.

Vulcan users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

===Home Directories===
{{Nfshomes}}

===Scratch Directories===
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the Vulcan compute infrastructure:
* Network scratch directory
* Local scratch directories

====Network Scratch Directory====
You have 300GB of scratch storage available at <code>/vulcanscratch/<username></code>. '''It is not backed up or protected in any way.''' This directory is '''automounted''' so you will need to <code>cd</code> into the directory or request/specify a fully qualified file path to access this.

You may request a temporary increase of up to 500GB total space for a maximum of 120 days without any faculty approval by [[HelpDesk | contacting staff]]. Once the temporary increase period is over, you will be contacted and given a one-week window of opportunity to clean and secure your data before staff will forcibly remove data to get your space back under 300GB. If you need space beyond 500GB or for longer than 120 days, you will need faculty approval and/or a project directory.

This file system is available on all submission and computational nodes within the cluster.

====Local Scratch Directories====
Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. These are almost always more performant than any other storage available to the job. However, you must stage your data in within the confines of your job and stage it out before the end of your job.

These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month at 1am. Different nodes will run the maintenance jobs on different days of the month to ensure the cluster is still highly available at all times. Please make sure you secure any data you write to these directories at the end of your job.

===Datasets===
We have read-only dataset storage available at <code>/fs/vulcan-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of Vulcan datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=Vulcan here].

===Project Storage===
Users within the Vulcan compute infrastructure can request project based allocations for up to 10TB for up to 180 days by [[HelpDesk | contacting staff]] with approval from the Vulcan faculty manager (Dr. Shrivastava). These allocations will be available from <code>/fs/vulcan-projects</code> under a name that you provide when you request the allocation. Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 180 days (requires re-approval from Dr. Shrivastava).
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then remove the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but the original faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and the faculty approver responding, staff will remove the allocation.

Project storage is fully protected. It has [[Snapshots | snapshots]] enabled and is [[NightlyBackups | backed up nightly]].

===Object Storage===
All Vulcan users can request project allocations in the [https://obj.umiacs.umd.edu/obj/help UMIACS Object Store]. Please [[HelpDesk | contact staff]] with a short project name and the amount of storage you will need to get started.

To access this storage, you'll need to use a [[S3Clients | S3 client]] or our [[UMobj]] command line utilities.

An example of how to use the umobj command line utilities can be found [[UMobj/Example | here]]. A full set of documentation for the utilities can be found on the [https://gitlab.umiacs.umd.edu/staff/umobj/blob/master/README.md#umobj umobj GitLab page].

Nexus/CLIP

2026-07-17T17:48:57Z

Mbaney:

The compute nodes from [https://wiki.umiacs.umd.edu/clip/index.php/Main_Page CLIP]'s previous standalone cluster have folded into [[Nexus]] as of late 2022.

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

= Submission Nodes =
You can [[SSH]] to <code>nexusclip.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexusclip00.umiacs.umd.edu</code>
* <code>nexusclip01.umiacs.umd.edu</code>

= Compute Nodes =
The CLIP partition has nodes brought over from the previous standalone CLIP Slurm scheduler as well as some more recent purchases. The compute nodes are named <code>clip##</code>.

= Network =
The network infrastructure supporting the CLIP partition consists of:
# One pair of network switches connected to each other via dual 25GbE links for redundancy, serving the following compute nodes:
#* clip04: Two 40GbE links, one to each switch in the pair (redundancy).
#* clip[05,14]: Two 10GbE links per node, one to each switch in the pair (redundancy).
#* clip06: Two 25GbE links, one to each switch in the pair (redundancy).
#* clip[11-13]: Two 100GbE links per node, one to each switch in the pair (redundancy).

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

= QoS =
CLIP users have access to all of the [[Nexus#Quality_of_Service_.28QoS.29 | standard job QoSes]] in the <code>clip</code> partition using the <code>clip</code> account.

The additional job QoSes for the CLIP partition specifically are:
* <code>huge-long</code>: Allows for longer jobs using higher overall resources.

Please note that the partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.

= Jobs =
You will need to specify <code>--partition=clip</code> and <code>--account=clip</code> to be able to submit jobs to the CLIP partition.

<pre>
[username@nexusclip00:~ ] $ srun --pty --ntasks=4 --mem=8G --qos=default --partition=clip --account=clip --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@clip04:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
UserId=username(1000) GroupId=username(21000) MCS_label=N/A
Priority=897 Nice=0 Account=clip QOS=default
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
AccrueTime=2022-11-18T11:13:56
StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
Partition=clip AllocNode:Sid=nexusclip00:25443
ReqNodeList=(null) ExcNodeList=(null)
NodeList=clip04
BatchHost=clip04
NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=8G,node=1,billing=2266
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=8G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/nfshomes/username
Power=
</pre>

= Storage =
All data filesystems that were available in the standalone CLIP cluster are also available in Nexus.

CLIP users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

Nexus/CBCB

2026-07-17T17:46:02Z

Mbaney: /* Compute Nodes */

The compute nodes from [[CBCB]]'s previous standalone cluster have folded into [[Nexus]] as of mid 2023.

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

= Submission Nodes =
You can [[SSH]] to <code>nexuscbcb.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexuscbcb00.umiacs.umd.edu</code>
* <code>nexuscbcb01.umiacs.umd.edu</code>

= Compute Nodes =
All compute nodes in partitions owned by CBCB faculty (see the section below) are named in the format <code>cbcb##</code>. The sets of nodes are:
* 22 nodes that were purchased in October 2022 with center-wide funding: cbcb[00-21]
* 1 node from the previous standalone CBCB cluster that moved in as of Summer 2023: cbcb25
* 4 additional nodes purchased by Dr. Heng Huang: cbcb[26-29]
* 1 additional node purchased by Dr. Mihai Pop: cbcb30

{| class="wikitable sortable"
! Nodenames
! Quantity
! CPU cores per node (CPUs)
! Memory per node (type)
! Filesystem storage per node (type/location)
! GPUs per node (type)
|-
|cbcb[00-21]
|22
|32 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7313.html AMD EPYC 7313])
|~2TB (DDR4 3200MHz)
|~350GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~2TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|0
|-
|cbcb25
|1
|24 (Dual [https://www.intel.com/content/www/us/en/products/sku/91767/intel-xeon-processor-e52650-v4-30m-cache-2-20-ghz/specifications.html Intel Xeon E5-2650 v4])
|~256GB (DDR4 2400MHz)
|~1.4TB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]])
|2 (1x [https://www.nvidia.com/en-gb/geforce/graphics-cards/geforce-gtx-1080-ti/specifications/ NVIDIA GeForce GTX 1080 Ti], 1x [https://www.nvidia.com/en-us/geforce/graphics-cards/compare/?section=compare-20 NVIDIA GeForce RTX 2080 Ti])
|-
|cbcb26
|1
|128 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7763.html AMD EPYC 7763])
|~512GB (DDR4 3200MHz)
|~3.4TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~14TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|7 ([https://www.nvidia.com/en-us/design-visualization/rtx-a5000 NVIDIA RTX A5000])
|-
|cbcb27
|1
|64 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7513.html AMD EPYC 7513])
|~256GB (DDR4 3200MHz)
|~3.4TB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~3.5TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|8 ([https://www.nvidia.com/en-us/design-visualization/rtx-a6000 NVIDIA RTX A6000])
|-
|cbcb[28-29]
|2
|32 (Dual [https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9124.html AMD EPYC 9124])
|~768GB (DDR5 4800MHz)
|~350GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~7TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|8 ([https://www.nvidia.com/en-us/design-visualization/rtx-6000 NVIDIA RTX 6000 Ada Generation])
|-
|cbcb30
|1
|48 (Single [https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9475f.html AMD EPYC 9475F])
|~1.15TB (DDR5 6400MHz)
|~350GB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~10.5TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|0
|- class="sortbottom"
!Total
|28
|1032 (various)
|~49TB (various)
|~103TB (various)
|33 (various)
|}

Here is the listing of nodes as shown by the Slurm alias <code>show_nodes</code> (again, all nodes are named in the format <code>cbcb##</code>):
<pre>
[root@nexusctl00 ~]# show_nodes | grep cbcb
NODELIST CPUS MEMORY AVAIL_FEATURES GRES STATE
cbcb00 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb01 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb02 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb03 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb04 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb05 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb06 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb07 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb08 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb09 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb10 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb11 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb12 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb13 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb14 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb15 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb16 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb17 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb18 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb19 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb20 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb21 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb25 24 255278 rhel8,x86_64,Xeon,E5-2650,Pascal,Turing gpu:rtx2080ti:1,gpu:gtx1080ti:1 idle
cbcb26 128 513243 rhel8,x86_64,Zen,EPYC-7763,Ampere gpu:rtxa5000:7 idle
cbcb27 64 255167 rhel8,x86_64,Zen,EPYC-7513,Ampere gpu:rtxa6000:8 idle
cbcb28 32 771166 rhel8,x86_64,Zen,EPYC-9124,Ada gpu:rtx6000ada:8 idle
cbcb29 32 771166 rhel8,x86_64,Zen,EPYC-9124,Ada gpu:rtx6000ada:8 idle
cbcb30 48 1157583 rhel8,x86_64,EPYC,EPYC-9475F (null) idle
</pre>
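If you want to total up resources from a listing like the one above, the following Python sketch shows one possible approach. It assumes the six whitespace-separated columns shown in the header and assumes the MEMORY column is in megabytes (which is consistent with the table above, but is not stated here). The function name <code>summarize</code> is hypothetical, and the sample contains only four of the lines from the listing.

```python
# Four lines copied from the listing above, plus the header row.
SAMPLE = """\
NODELIST CPUS MEMORY AVAIL_FEATURES GRES STATE
cbcb21 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb25 24 255278 rhel8,x86_64,Xeon,E5-2650,Pascal,Turing gpu:rtx2080ti:1,gpu:gtx1080ti:1 idle
cbcb26 128 513243 rhel8,x86_64,Zen,EPYC-7763,Ampere gpu:rtxa5000:7 idle
cbcb30 48 1157583 rhel8,x86_64,EPYC,EPYC-9475F (null) idle
"""


def summarize(listing: str) -> dict[str, int]:
    """Sum CPUs, memory, and GPUs over every node line in the listing."""
    cpus = memory_mb = gpus = 0
    for line in listing.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        _name, cpu, mem, _features, gres, _state = line.split()
        cpus += int(cpu)
        memory_mb += int(mem)  # assumption: MEMORY is reported in MB
        if gres != "(null)":
            # Each GRES entry looks like 'gpu:<type>:<count>'.
            gpus += sum(int(entry.rsplit(":", 1)[1]) for entry in gres.split(","))
    return {"cpus": cpus, "memory_mb": memory_mb, "gpus": gpus}


print(summarize(SAMPLE))
```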

= Network =
The network infrastructure supporting the CBCB partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* cbcb[00-21,26-30]: Two 100GbE links per node, one to each switch in the pair (redundancy).
# One pair of network switches connected to each other via dual 10GbE links for redundancy and to the above pair of network switches via four 40GbE links (one between every combination of switches across the two pairs, for redundancy), serving the following compute node:
#* cbcb25: Two 10GbE links, one to each switch in the pair (redundancy).

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

= Partitions =
There are two partitions available to general CBCB [[SLURM]] users. You must specify one of these two partitions when submitting your job.

* '''cbcb''' - This is the default partition. Job allocations on all nodes except those also in the '''cbcb-heng''' partition are guaranteed.
* '''cbcb-interactive''' - This is a partition that only allows interactive jobs; you cannot submit jobs via <code>sbatch</code> to this partition. Job allocations are guaranteed.

There is one additional partition available solely to Dr. Heng Huang's sponsored accounts.

* '''cbcb-heng''' - This partition is for exclusive priority access to Dr. Huang's purchased GPU nodes. Job allocations are guaranteed.

= QoS =
CBCB users have access to all of the [[Nexus#Quality_of_Service_.28QoS.29 | standard job QoSes]] in the '''cbcb''' and '''cbcb-heng''' partitions using the <code>cbcb</code> account.

The additional job QoSes for the '''cbcb''' and '''cbcb-heng''' partitions specifically are:
* <code>highmem</code>: Allows for significantly increased memory to be allocated.
* <code>huge-long</code>: Allows for longer jobs using higher overall resources.

Please note that the partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.

The ''only'' allowed job QoS for the '''cbcb-interactive''' partition is:
* <code>interactive</code>: Allows for 4 CPU / 128G mem jobs up to 12 hours in length - can only be used via <code>srun</code> or <code>salloc</code>.
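As an illustration of the <code>interactive</code> QoS limits, here is a minimal Python sketch that builds an <code>srun</code> command line and rejects requests above 4 CPUs, 128G of memory, or 12 hours. The helper name <code>interactive_srun</code> and its defaults are hypothetical. Using the <code>cbcb</code> account with this partition and requesting CPUs via <code>--cpus-per-task</code> are assumptions; the scheduler itself remains the authority on what is accepted.

```python
import shlex

# Limits quoted above for the interactive QoS.
MAX_CPUS = 4
MAX_MEM_GB = 128
MAX_HOURS = 12


def interactive_srun(cpus: int = 1, mem_gb: int = 4, hours: int = 1,
                     command: str = "bash") -> str:
    """Return an srun command line for the cbcb-interactive partition."""
    if not 1 <= cpus <= MAX_CPUS:
        raise ValueError(f"cpus must be between 1 and {MAX_CPUS}")
    if not 1 <= mem_gb <= MAX_MEM_GB:
        raise ValueError(f"mem_gb must be between 1 and {MAX_MEM_GB}")
    if not 1 <= hours <= MAX_HOURS:
        raise ValueError(f"hours must be between 1 and {MAX_HOURS}")
    args = [
        "srun",
        "--pty",
        "--partition=cbcb-interactive",
        "--account=cbcb",  # assumption: same account as the cbcb partition
        "--qos=interactive",
        f"--cpus-per-task={cpus}",
        f"--mem={mem_gb}G",
        f"--time={hours:02d}:00:00",
        command,
    ]
    return shlex.join(args)


print(interactive_srun(cpus=4, mem_gb=128, hours=12))
```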

= Jobs =
You will need to specify <code>--partition=cbcb</code> and <code>--account=cbcb</code> to be able to submit jobs to the CBCB partition.

<pre>
[username@nexuscbcb00:~ ] $ srun --pty --ntasks=16 --mem=2000G --qos=highmem --partition=cbcb --account=cbcb --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@cbcb00:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
UserId=username(1000) GroupId=username(21000) MCS_label=N/A
Priority=897 Nice=0 Account=cbcb QOS=highmem
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
AccrueTime=2022-11-18T11:13:56
StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
Partition=cbcb AllocNode:Sid=nexuscbcb00:25443
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cbcb00
BatchHost=cbcb00
NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,mem=2000G,node=1,billing=2266
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/nfshomes/username
Power=
</pre>
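The <code>scontrol show job</code> output above is a series of <code>Key=Value</code> tokens, which makes it easy to pick out individual fields in a script. The following Python sketch shows one way to do that. The function names are hypothetical, the sample is a subset of the lines above, and the approach assumes that no value contains whitespace (which would not hold for, say, a command with arguments).

```python
# A subset of the lines from the scontrol output above.
SAMPLE = """\
JobId=218874 JobName=bash
Priority=897 Nice=0 Account=cbcb QOS=highmem
JobState=RUNNING Reason=None Dependency=(null)
Partition=cbcb AllocNode:Sid=nexuscbcb00:25443
NodeList=cbcb00
TRES=cpu=16,mem=2000G,node=1,billing=2266
"""


def parse_scontrol(text: str) -> dict[str, str]:
    """Split whitespace-separated Key=Value tokens into a dictionary."""
    fields = {}
    for token in text.split():
        # Split on the first '=' only, so 'TRES=cpu=16,...' keeps its value whole.
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def parse_tres(value: str) -> dict[str, str]:
    """Split a TRES value such as 'cpu=16,mem=2000G' into its parts."""
    return dict(item.split("=", 1) for item in value.split(","))


job = parse_scontrol(SAMPLE)
print(job["JobState"], job["NodeList"], parse_tres(job["TRES"])["mem"])
```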

= Storage =
CBCB still has its current [https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage storage] allocation in place. All data filesystems that were available in the standalone CBCB cluster are also available in Nexus. Please note the change to your home directory described in the migration section below.

CBCB users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

= Operating System / Software =
CBCB's standalone cluster submission and compute nodes were running RHEL7. [[Nexus]] is running a mixture of RHEL8 and RHEL9, so any software you compiled on the standalone cluster may need to be re-compiled to work correctly in this new environment. The [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules CBCB module tree] for RHEL8+ may not yet be fully populated with RHEL8+ software. If you do not see the modules you need, please reach out to the [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules#Contact CBCB software maintainers].

Nexus/CBCB

2026-07-17T17:45:45Z

Mbaney: /* Compute Nodes */

The compute nodes from [[CBCB]]'s previous standalone cluster were folded into [[Nexus]] in mid-2023.

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

= Submission Nodes =
You can [[SSH]] to <code>nexuscbcb.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexuscbcb00.umiacs.umd.edu</code>
* <code>nexuscbcb01.umiacs.umd.edu</code>

= Compute Nodes =
All compute nodes in CBCB-owned partitions (see the Partitions section below) are named in the format <code>cbcb##</code>. The sets of nodes are:
* 22 nodes that were purchased in October 2022 with center-wide funding: cbcb[00-21]
* 1 node from the previous standalone CBCB cluster that moved in as of Summer 2023: cbcb25
* 4 additional nodes purchased by Dr. Heng Huang: cbcb[26-29]
* 1 additional node purchased by Dr. Mihai Pop: cbcb30

{| class="wikitable sortable"
! Nodenames
! Quantity
! CPU cores per node (CPUs)
! Memory per node (type)
! Filesystem storage per node (type/location)
! GPUs per node (type)
|-
|cbcb[00-21]
|22
|32 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7313.html AMD EPYC 7313])
|~2TB (DDR4 3200MHz)
|~350GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~2TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|0
|-
|cbcb25
|1
|24 (Dual [https://www.intel.com/content/www/us/en/products/sku/91767/intel-xeon-processor-e52650-v4-30m-cache-2-20-ghz/specifications.html Intel Xeon E5-2650 v4])
|~256GB (DDR4 2400MHz)
|~1.4TB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]])
|2 (1x [https://www.nvidia.com/en-gb/geforce/graphics-cards/geforce-gtx-1080-ti/specifications/ NVIDIA GeForce GTX 1080 Ti], 1x [https://www.nvidia.com/en-us/geforce/graphics-cards/compare/?section=compare-20 NVIDIA GeForce RTX 2080 Ti])
|-
|cbcb26
|1
|128 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7763.html AMD EPYC 7763])
|~512GB (DDR4 3200MHz)
|~3.4TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~14TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|7 ([https://www.nvidia.com/en-us/design-visualization/rtx-a5000 NVIDIA RTX A5000])
|-
|cbcb27
|1
|64 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7513.html AMD EPYC 7513])
|~256GB (DDR4 3200MHz)
|~3.4TB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~3.5TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|8 ([https://www.nvidia.com/en-us/design-visualization/rtx-a6000 NVIDIA RTX A6000])
|-
|cbcb[28-29]
|2
|32 (Dual [https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9124.html AMD EPYC 9124])
|~768GB (DDR5 4800MHz)
|~350GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~7TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|8 ([https://www.nvidia.com/en-us/design-visualization/rtx-6000 NVIDIA RTX 6000 Ada Generation])
|-
|cbcb30
|1
|48 (Single [https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9475f.html AMD EPYC 9475F])
|~1.15TB (DDR5 6400MHz)
|~350GB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~10.5TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|0
|- class="sortbottom"
!Total
|28
|1032 (various)
|~49TB (various)
|~103TB (various)
|33 (various)
|}

Here is the listing of nodes as shown by the Slurm alias <code>show_nodes</code> (again, all nodes are named in the format <code>cbcb##</code>):
<pre>
[root@nexusctl00 ~]# show_nodes | grep cbcb
NODELIST CPUS MEMORY AVAIL_FEATURES GRES STATE
cbcb00 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb01 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb02 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb03 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb04 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb05 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb06 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb07 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb08 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb09 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb10 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb11 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb12 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb13 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb14 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb15 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb16 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb17 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb18 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb19 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb20 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb21 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb22 28 771245 rhel8,x86_64,Xeon,E5-2680 (null) idle
cbcb23 24 255150 rhel8,x86_64,Xeon,E5-2650 (null) idle
cbcb24 24 255150 rhel8,x86_64,Xeon,E5-2650 (null) idle
cbcb25 24 255278 rhel8,x86_64,Xeon,E5-2650,Pascal,Turing gpu:rtx2080ti:1,gpu:gtx1080ti:1 idle
cbcb26 128 513243 rhel8,x86_64,Zen,EPYC-7763,Ampere gpu:rtxa5000:7 idle
cbcb27 64 255167 rhel8,x86_64,Zen,EPYC-7513,Ampere gpu:rtxa6000:8 idle
cbcb28 32 771166 rhel8,x86_64,Zen,EPYC-9124,Ada gpu:rtx6000ada:8 idle
cbcb29 32 771166 rhel8,x86_64,Zen,EPYC-9124,Ada gpu:rtx6000ada:8 idle
cbcb30 48 1157583 rhel8,x86_64,EPYC,EPYC-9475F (null) idle
</pre>

= Network =
The network infrastructure supporting the CBCB partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* cbcb[00-21,26-30]: Two 100GbE links per node, one to each switch in the pair (redundancy).
# One pair of network switches connected to each other via dual 10GbE links for redundancy and to the above pair of network switches via four 40GbE links (one between every combination of switches across the two pairs, for redundancy), serving the following compute node:
#* cbcb25: Two 10GbE links, one to each switch in the pair (redundancy).

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

= Partitions =
There are two partitions available to general CBCB [[SLURM]] users. You must specify one of these two partitions when submitting your job.

* '''cbcb''' - This is the default partition. Job allocations on all nodes except those also in the '''cbcb-heng''' partition are guaranteed.
* '''cbcb-interactive''' - This is a partition that only allows interactive jobs; you cannot submit jobs via <code>sbatch</code> to this partition. Job allocations are guaranteed.

There is one additional partition available solely to Dr. Heng Huang's sponsored accounts.

* '''cbcb-heng''' - This partition is for exclusive priority access to Dr. Huang's purchased GPU nodes. Job allocations are guaranteed.

= QoS =
CBCB users have access to all of the [[Nexus#Quality_of_Service_.28QoS.29 | standard job QoSes]] in the '''cbcb''' and '''cbcb-heng''' partitions using the <code>cbcb</code> account.

The additional job QoSes for the '''cbcb''' and '''cbcb-heng''' partitions specifically are:
* <code>highmem</code>: Allows for significantly increased memory to be allocated.
* <code>huge-long</code>: Allows for longer jobs using higher overall resources.

Please note that the partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.

The ''only'' allowed job QoS for the '''cbcb-interactive''' partition is:
* <code>interactive</code>: Allows for 4 CPU / 128G mem jobs up to 12 hours in length - can only be used via <code>srun</code> or <code>salloc</code>.

= Jobs =
You will need to specify <code>--partition=cbcb</code> and <code>--account=cbcb</code> to be able to submit jobs to the CBCB partition.

<pre>
[username@nexuscbcb00:~ ] $ srun --pty --ntasks=16 --mem=2000G --qos=highmem --partition=cbcb --account=cbcb --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@cbcb00:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
UserId=username(1000) GroupId=username(21000) MCS_label=N/A
Priority=897 Nice=0 Account=cbcb QOS=highmem
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
AccrueTime=2022-11-18T11:13:56
StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
Partition=cbcb AllocNode:Sid=nexuscbcb00:25443
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cbcb00
BatchHost=cbcb00
NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,mem=2000G,node=1,billing=2266
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/nfshomes/username
Power=
</pre>

= Storage =
CBCB still has its current [https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage storage] allocation in place. All data filesystems that were available in the standalone CBCB cluster are also available in Nexus. Please note the change to your home directory described in the migration section below.

CBCB users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

= Operating System / Software =
CBCB's standalone cluster submission and compute nodes were running RHEL7. [[Nexus]] is running a mixture of RHEL8 and RHEL9, so any software you compiled on the standalone cluster may need to be re-compiled to work correctly in this new environment. The [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules CBCB module tree] for RHEL8+ may not yet be fully populated with RHEL8+ software. If you do not see the modules you need, please reach out to the [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules#Contact CBCB software maintainers].

Nexus/CBCB

2026-07-17T17:44:59Z

Mbaney:

The compute nodes from [[CBCB]]'s previous standalone cluster were folded into [[Nexus]] in mid-2023.

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

= Submission Nodes =
You can [[SSH]] to <code>nexuscbcb.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexuscbcb00.umiacs.umd.edu</code>
* <code>nexuscbcb01.umiacs.umd.edu</code>

= Compute Nodes =
All compute nodes in CBCB-owned partitions (see the Partitions section below) are named in the format <code>cbcb##</code>. The sets of nodes are:
* 22 nodes that were purchased in October 2022 with center-wide funding. They are cbcb[00-21].
* 4 nodes from the previous standalone CBCB cluster that moved in as of Summer 2023. They are cbcb[22-25].
* 4 additional nodes purchased by Dr. Heng Huang. They are cbcb[26-29].
* 1 additional node purchased by Dr. Mihai Pop. It is cbcb30.

{| class="wikitable sortable"
! Nodenames
! Quantity
! CPU cores per node (CPUs)
! Memory per node (type)
! Filesystem storage per node (type/location)
! GPUs per node (type)
|-
|cbcb[00-21]
|22
|32 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7313.html AMD EPYC 7313])
|~2TB (DDR4 3200MHz)
|~350GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~2TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|0
|-
|cbcb25
|1
|24 (Dual [https://www.intel.com/content/www/us/en/products/sku/91767/intel-xeon-processor-e52650-v4-30m-cache-2-20-ghz/specifications.html Intel Xeon E5-2650 v4])
|~256GB (DDR4 2400MHz)
|~1.4TB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]])
|2 (1x [https://www.nvidia.com/en-gb/geforce/graphics-cards/geforce-gtx-1080-ti/specifications/ NVIDIA GeForce GTX 1080 Ti], 1x [https://www.nvidia.com/en-us/geforce/graphics-cards/compare/?section=compare-20 NVIDIA GeForce RTX 2080 Ti])
|-
|cbcb26
|1
|128 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7763.html AMD EPYC 7763])
|~512GB (DDR4 3200MHz)
|~3.4TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~14TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|7 ([https://www.nvidia.com/en-us/design-visualization/rtx-a5000 NVIDIA RTX A5000])
|-
|cbcb27
|1
|64 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7513.html AMD EPYC 7513])
|~256GB (DDR4 3200MHz)
|~3.4TB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~3.5TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|8 ([https://www.nvidia.com/en-us/design-visualization/rtx-a6000 NVIDIA RTX A6000])
|-
|cbcb[28-29]
|2
|32 (Dual [https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9124.html AMD EPYC 9124])
|~768GB (DDR5 4800MHz)
|~350GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~7TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|8 ([https://www.nvidia.com/en-us/design-visualization/rtx-6000 NVIDIA RTX 6000 Ada Generation])
|-
|cbcb30
|1
|48 (Single [https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9475f.html AMD EPYC 9475F])
|~1.15TB (DDR5 6400MHz)
|~350GB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~10.5TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|0
|- class="sortbottom"
!Total
|28
|1032 (various)
|~49TB (various)
|~103TB (various)
|33 (various)
|}

Here is the listing of nodes as shown by the Slurm alias <code>show_nodes</code> (again, all nodes are named in the format <code>cbcb##</code>):
<pre>
[root@nexusctl00 ~]# show_nodes | grep cbcb
NODELIST CPUS MEMORY AVAIL_FEATURES GRES STATE
cbcb00 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb01 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb02 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb03 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb04 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb05 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb06 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb07 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb08 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb09 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb10 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb11 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb12 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb13 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb14 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb15 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb16 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb17 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb18 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb19 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb20 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb21 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb22 28 771245 rhel8,x86_64,Xeon,E5-2680 (null) idle
cbcb23 24 255150 rhel8,x86_64,Xeon,E5-2650 (null) idle
cbcb24 24 255150 rhel8,x86_64,Xeon,E5-2650 (null) idle
cbcb25 24 255278 rhel8,x86_64,Xeon,E5-2650,Pascal,Turing gpu:rtx2080ti:1,gpu:gtx1080ti:1 idle
cbcb26 128 513243 rhel8,x86_64,Zen,EPYC-7763,Ampere gpu:rtxa5000:7 idle
cbcb27 64 255167 rhel8,x86_64,Zen,EPYC-7513,Ampere gpu:rtxa6000:8 idle
cbcb28 32 771166 rhel8,x86_64,Zen,EPYC-9124,Ada gpu:rtx6000ada:8 idle
cbcb29 32 771166 rhel8,x86_64,Zen,EPYC-9124,Ada gpu:rtx6000ada:8 idle
cbcb30 48 1157583 rhel8,x86_64,EPYC,EPYC-9475F (null) idle
</pre>

= Network =
The network infrastructure supporting the CBCB partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* cbcb[00-21,26-30]: Two 100GbE links per node, one to each switch in the pair (redundancy).
# One pair of network switches connected to each other via dual 10GbE links for redundancy and to the above pair of network switches via four 40GbE links (one between every combination of switches across the two pairs, for redundancy), serving the following compute node:
#* cbcb25: Two 10GbE links, one to each switch in the pair (redundancy).

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

= Partitions =
There are two partitions available to general CBCB [[SLURM]] users. You must specify one of these two partitions when submitting your job.

* '''cbcb''' - This is the default partition. Job allocations on all nodes except those also in the '''cbcb-heng''' partition are guaranteed.
* '''cbcb-interactive''' - This is a partition that only allows interactive jobs; you cannot submit jobs via <code>sbatch</code> to this partition. Job allocations are guaranteed.

There is one additional partition available solely to Dr. Heng Huang's sponsored accounts.

* '''cbcb-heng''' - This partition is for exclusive priority access to Dr. Huang's purchased GPU nodes. Job allocations are guaranteed.

= QoS =
CBCB users have access to all of the [[Nexus#Quality_of_Service_.28QoS.29 | standard job QoSes]] in the '''cbcb''' and '''cbcb-heng''' partitions using the <code>cbcb</code> account.

The additional job QoSes for the '''cbcb''' and '''cbcb-heng''' partitions specifically are:
* <code>highmem</code>: Allows for significantly increased memory to be allocated.
* <code>huge-long</code>: Allows for longer jobs using higher overall resources.

Please note that the partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.

The ''only'' allowed job QoS for the '''cbcb-interactive''' partition is:
* <code>interactive</code>: Allows for 4 CPU / 128G mem jobs up to 12 hours in length - can only be used via <code>srun</code> or <code>salloc</code>.

= Jobs =
You will need to specify <code>--partition=cbcb</code> and <code>--account=cbcb</code> to be able to submit jobs to the CBCB partition.

<pre>
[username@nexuscbcb00:~ ] $ srun --pty --ntasks=16 --mem=2000G --qos=highmem --partition=cbcb --account=cbcb --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@cbcb00:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
UserId=username(1000) GroupId=username(21000) MCS_label=N/A
Priority=897 Nice=0 Account=cbcb QOS=highmem
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
AccrueTime=2022-11-18T11:13:56
StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
Partition=cbcb AllocNode:Sid=nexuscbcb00:25443
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cbcb00
BatchHost=cbcb00
NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,mem=2000G,node=1,billing=2266
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/nfshomes/username
Power=
</pre>

= Storage =
CBCB still has its current [https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage storage] allocation in place. All data filesystems that were available in the standalone CBCB cluster are also available in Nexus. Please note the change to your home directory described in the migration section below.

CBCB users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

= Operating System / Software =
CBCB's standalone cluster submission and compute nodes were running RHEL7. [[Nexus]] is running a mixture of RHEL8 and RHEL9, so any software you compiled on the standalone cluster may need to be re-compiled to work correctly in this new environment. The [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules CBCB module tree] for RHEL8+ may not yet be fully populated with RHEL8+ software. If you do not see the modules you need, please reach out to the [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules#Contact CBCB software maintainers].

Nexus/Vulcan

2026-07-17T16:11:05Z

Mbaney: /* Network */

The compute nodes from Vulcan's previous standalone cluster have folded into [[Nexus]] as of the scheduled [[MonthlyMaintenanceWindow | maintenance window]] for August 2023 (Thursday 08/17/2023, 5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

==Usage==
You can [[SSH]] to <code>nexusvulcan.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexusvulcan00.umiacs.umd.edu</code>
* <code>nexusvulcan01.umiacs.umd.edu</code>

All partitions, QoSes, and account names from the standalone Vulcan cluster have been moved over to Nexus. However, please note that <code>vulcan-</code> is prepended to all of the values that were present in the standalone Vulcan cluster to distinguish them from existing values in Nexus. The lone exception is the base account that was named <code>vulcan</code> in the standalone cluster (it is also named just <code>vulcan</code> in Nexus).

Here are some before/after examples of job submission with various parameters:

{| class="wikitable"
! Standalone Vulcan cluster submission command
! Nexus cluster submission command
|-
|<code>srun --partition=dpart --qos=medium --account=abhinav --gres=gpu:gtx1080ti:2 --pty bash</code>
|<code>srun --partition=vulcan-dpart --qos=vulcan-medium --account=vulcan-abhinav --gres=gpu:gtx1080ti:2 --pty bash</code>
|-
|<code>srun --partition=cpu --qos=cpu --pty bash</code>
|<code>srun --partition=vulcan-cpu --qos=vulcan-cpu --account=vulcan --pty bash</code>
|-
|<code>srun --partition=scavenger --qos=scavenger --account=vulcan --gres=gpu:4 --pty bash</code>
|<code>srun --partition=vulcan-scavenger --qos=vulcan-scavenger --account=vulcan --gres=gpu:4 --pty bash</code>
|}
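
The renaming rule above can be sketched as a small shell function. This is an illustrative helper only, not a UMIACS-provided tool: the name <code>to_nexus_name</code> is hypothetical, and leaving already-prefixed names unchanged is an added convenience that the rule itself does not mention.

```shell
#!/bin/sh
# Translate a standalone-Vulcan partition, QoS, or account name to its Nexus name.
# Rule from the text above: prepend "vulcan-", except for the base account "vulcan".
to_nexus_name() {
    case "$1" in
        vulcan)   printf '%s\n' "vulcan" ;;     # base account keeps its name
        vulcan-*) printf '%s\n' "$1" ;;         # assumption: already prefixed, leave as-is
        *)        printf '%s\n' "vulcan-$1" ;;  # everything else gains the prefix
    esac
}

to_nexus_name dpart     # prints vulcan-dpart
to_nexus_name medium    # prints vulcan-medium
to_nexus_name abhinav   # prints vulcan-abhinav
to_nexus_name vulcan    # prints vulcan
```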

Vulcan users (exclusively) can schedule non-interruptible jobs on Vulcan nodes with any non-scavenger job parameters. Please note that the <code>vulcan-dpart</code> partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on all vulcan## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a max submission limit of 500 jobs per user simultaneously so as to not overload the cluster. This is codified by the partition QoS named '''vulcan'''.

Please note that the Vulcan compute nodes are also in the institute-wide <code>scavenger</code> partition in Nexus. Vulcan users still have scavenging priority over these nodes via the <code>vulcan-scavenger</code> partition (i.e., all <code>vulcan-</code> partition jobs (other than <code>vulcan-scavenger</code>) can preempt both <code>vulcan-scavenger</code> and <code>scavenger</code> partition jobs, and <code>vulcan-scavenger</code> partition jobs can preempt <code>scavenger</code> partition jobs).
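
The preemption order described above can be expressed as three tiers. The sketch below is illustrative only: <code>tier</code> and <code>can_preempt</code> are hypothetical helper names, it models just the partitions named in this paragraph as they apply to Vulcan nodes, and it deliberately leaves <code>vulcan-scavenger-multi</code> and non-Vulcan partitions unranked.

```shell
#!/bin/sh
# Rank a partition by preemption tier on Vulcan nodes (higher preempts lower).
# Prints -1 for partitions this sketch does not model.
tier() {
    case "$1" in
        scavenger)          echo 0 ;;   # institute-wide scavenger partition
        vulcan-scavenger)   echo 1 ;;   # Vulcan users' scavenger partition
        vulcan-scavenger-*) echo -1 ;;  # e.g. vulcan-scavenger-multi: not modeled here
        vulcan-*)           echo 2 ;;   # all other vulcan- partitions
        *)                  echo -1 ;;  # anything else: not modeled here
    esac
}

# Succeeds (exit status 0) if a job in partition $1 can preempt a job in partition $2.
can_preempt() {
    a=$(tier "$1")
    b=$(tier "$2")
    [ "$a" -ge 0 ] && [ "$b" -ge 0 ] && [ "$a" -gt "$b" ]
}

can_preempt vulcan-dpart scavenger && echo "vulcan-dpart preempts scavenger"
can_preempt scavenger vulcan-scavenger || echo "scavenger cannot preempt vulcan-scavenger"
```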

==Compute Nodes==
There are currently 46 [[Nexus/Vulcan/GPUs | GPU nodes]] available, named vulcan[00-45], running a mixture of NVIDIA RTX A6000, NVIDIA RTX A5000, NVIDIA RTX A4000, and a number of different older generation cards. There are also 4 CPU-only nodes available, named brigid[16-19].

All nodes are scheduled with the [[SLURM]] resource manager.

==Network==
The network infrastructure supporting the Vulcan partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* brigid[16-17],vulcan[29-45]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* vulcan46: Four 100GbE links, two to each switch in the pair (redundancy and increased bandwidth).
# One pair of network switches connected to the above pair of switches through several sets of intermediary switches, and to each other via dual 10GbE links for redundancy. The immediate connection to these sets of intermediary switches is via two 40GbE links to a pair of them, one between the first two switches in each pair and one between the second two switches in each pair for redundancy. This pair serves the following compute nodes:
#* vulcan[23-24,27-28]: Two 10GbE links per node, one to each switch in the pair (redundancy).

The fileserver hosting all Vulcan [[Nexus/Vulcan#Scratch_Directories | scratch]], [[Nexus/Vulcan#Project_Storage | project]], and [[Nexus/Vulcan#Datasets | dataset]] allocations first connects to a pair of intermediary switches and then the first pair of switches mentioned [[Nexus/Tron#Network | here (Tron page's network section)]]. It then connects to the first pair of switches mentioned on this page through a set of four (different) intermediary switches. The last hop from the four intermediary switches to the first pair of switches mentioned on this page is via 32 100GbE links, four from each switch in the set to each switch in the first pair mentioned on this page for redundancy and increased bandwidth.
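
As a quick arithmetic check of the link count in the last hop described above, the following shell snippet multiplies out the figures given in the paragraph. The variable names are illustrative, and the aggregate figure is only the raw sum of link speeds, not a measured or guaranteed throughput.

```shell
#!/bin/sh
intermediary_switches=4   # switches in the intermediary set
pair_switches=2           # switches in the first pair mentioned on this page
links_each=4              # 100GbE links between each intermediary switch and each pair switch

total_links=$((intermediary_switches * pair_switches * links_each))
echo "total links: $total_links"                   # prints: total links: 32
echo "raw aggregate: $((total_links * 100)) Gb/s"  # prints: raw aggregate: 3200 Gb/s
```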

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

==Partitions==
There are three partitions available to general Vulcan [[SLURM]] users. You must specify a partition when submitting your job.

* '''vulcan-dpart''' - This is the default partition. Job allocations are guaranteed. Only nodes with GPUs from architectures older than NVIDIA's [https://www.nvidia.com/en-us/data-center/ampere-architecture/ Ampere architecture] are included in this partition.
* '''vulcan-scavenger''' - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other non-scavenger-named <code>vulcan-</code> partitions are ready to be scheduled.
* '''vulcan-cpu''' - This partition is for CPU focused jobs. Job allocations are guaranteed.

There are a few additional partitions available to subsets of Vulcan users based on specific requirements.

* '''vulcan-ampere''' - This partition contains nodes with GPUs from NVIDIA's [https://www.nvidia.com/en-us/data-center/ampere-architecture/ Ampere architecture] or a newer architecture. Job allocations are guaranteed. Please be aware of the following restrictions on this partition:
*: ''Time limit'': there is a 4 hour time limit on interactive jobs in this partition. If you need to run longer jobs, you will need to modify your workflow into a job that can be submitted as a batch script.
*: ''CPU/memory per GPU limit'': there is a limit of 4 CPUs and 48G memory maximum per non-H200 GPU requested by a job, and 16 CPUs and 256G memory maximum per H200 GPU requested by a job. If you need to run jobs with more CPUs/memory, you will either need to request more GPUs in the job or use a different partition.

: Submission is restricted to the Slurm [[#Accounts | accounts]] of the faculty who invested in these nodes:
:* Abhinav Shrivastava (vulcan-abhinav)
:* Jia-Bin Huang (vulcan-jbhuang)
:* Christopher Metzler (vulcan-metzler)
:* Ruoshi Liu (vulcan-ruoshi)
:* Matthias Zwicker (vulcan-zwicker)
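
The per-GPU CPU and memory ceilings on <code>vulcan-ampere</code> scale with the number of GPUs requested, which the sketch below multiplies out. <code>ampere_limits</code> is a hypothetical helper, not an installed command; it assumes a job requests a single GPU type and ignores any additional QoS limits that may apply.

```shell
#!/bin/sh
# Usage: ampere_limits <gpu_count> [h200]
# Prints the maximum CPUs and memory a vulcan-ampere job may request for that many GPUs.
ampere_limits() {
    gpus=$1
    if [ "${2:-}" = "h200" ]; then
        cpus_per_gpu=16; mem_per_gpu=256   # per H200 GPU
    else
        cpus_per_gpu=4;  mem_per_gpu=48    # per non-H200 GPU
    fi
    echo "cpus=$((gpus * cpus_per_gpu)) mem=$((gpus * mem_per_gpu))G"
}

ampere_limits 2        # prints cpus=8 mem=96G
ampere_limits 1 h200   # prints cpus=16 mem=256G
```

For example, under these limits a job that needs 16 CPUs on non-H200 GPUs would have to request at least 4 GPUs in this partition.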

* '''vulcan-scavenger-multi''' - This partition allows multi-node jobs (up to 9 total nodes per job) and allows jobs more resources than the vulcan-scavenger partition, but only contains nodes with GTX 1080 Ti, TITAN Xp, and/or RTX 2080 Ti GPUs in them. As with vulcan-scavenger, it is preemptable when jobs in other non-scavenger-named <code>vulcan-</code> partitions are ready to be scheduled.
*: Access to this partition is on a per-use basis. Please contact Abhinav Shrivastava if you would like to be granted access to this partition.

There is one additional partition available solely to Dr. Ramani Duraiswami's sponsored accounts.

* '''vulcan-ramani''' - This partition is for exclusive priority access to Dr. Duraiswami's purchased GPU nodes. Job allocations are guaranteed.

==Accounts==
Vulcan has a base SLURM account <code>vulcan</code> which has a modest number of guaranteed billing resources available to all cluster users at any given time. Other faculty that have invested in Vulcan compute infrastructure have an additional account provided to their sponsored accounts on the cluster.

If you do not specify an account when submitting your job, you will receive the '''vulcan''' account. If your faculty sponsor has their own account, it is recommended to use that account for job submission.

The current faculty accounts are:
* vulcan-abhinav
* vulcan-djacobs
* vulcan-jbhuang
* vulcan-metzler
* vulcan-rama
* vulcan-ramani
* vulcan-ruoshi
* vulcan-yaser
* vulcan-zwicker

<pre>
$ sacctmgr show account format=account%20,description%30,organization%10
             Account                          Descr        Org
-------------------- ------------------------------ ----------
                 ...                            ...        ...
              vulcan                         vulcan     vulcan
      vulcan-abhinav   vulcan - abhinav shrivastava     vulcan
      vulcan-djacobs          vulcan - david jacobs     vulcan
      vulcan-jbhuang         vulcan - jia-bin huang     vulcan
      vulcan-metzler         vulcan - chris metzler     vulcan
         vulcan-rama        vulcan - rama chellappa     vulcan
       vulcan-ramani     vulcan - ramani duraiswami     vulcan
       vulcan-ruoshi            vulcan - ruoshi liu     vulcan
        vulcan-yaser          vulcan - yaser yacoob     vulcan
      vulcan-zwicker      vulcan - matthias zwicker     vulcan
                 ...                            ...        ...
</pre>

Faculty can manage this list of users via our [https://intranet.umiacs.umd.edu/directory/secgroup/ Directory application] in the Security Groups section. The security group that controls access has the prefix <code>vulcan_</code> and then the faculty username. It will also list <code>slurm://nexusctl.umiacs.umd.edu</code> as the associated URI.

You can check your account associations by running the '''show_assoc''' command to see the accounts you are associated with. Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association.

<pre>
$ show_assoc
      User          Account MaxJobs       GrpTRES                                                                              QOS
---------- ---------------- ------- ------------- --------------------------------------------------------------------------------
       ...              ...     ...                                                                                            ...
   abhinav           vulcan      48                                        vulcan-cpu,vulcan-default,vulcan-medium,vulcan-scavenger
   abhinav   vulcan-abhinav      48                            vulcan-cpu,vulcan-default,vulcan-high,vulcan-medium,vulcan-scavenger
       ...              ...     ...                                                                                            ...
</pre>

You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command. Make sure you specify the account you are interested in. As shown below, there is a concurrent limit of 64 total GPUs for all users not in a contributing faculty group.

<pre>
$ sacctmgr show assoc account=vulcan format=user,account,qos,grptres
      User    Account                  QOS       GrpTRES
---------- ---------- -------------------- -------------
               vulcan                        gres/gpu=64
                  ...                                ...
</pre>

==QoS==
Vulcan currently has 3 QoS for the '''vulcan-dpart''' partition, 1 QoS for the '''vulcan-scavenger''' partition, and 1 QoS for the '''vulcan-cpu''' partition. If you do not specify a QoS when submitting your job using the <code>--qos</code> parameter, you will receive the <code>vulcan-default</code> QoS assuming you are using a Vulcan account.

Each QoS sets its own maximum wall time, total number of jobs you can run at once, and maximum amount of trackable resources (TRES) per job. In the cml-scavenger QoS, you are additionally restricted by the total number of TRES per user (across multiple jobs).

<pre>
$ show_qos --all | grep vulcan
                     Name     MaxWall                        MaxTRES MaxJobsPU                      MaxTRESPU
------------------------- ----------- ------------------------------ --------- ------------------------------
                      ...
               vulcan-cpu  2-00:00:00                cpu=1024,mem=4T         4
           vulcan-default  7-00:00:00       cpu=4,gres/gpu=1,mem=32G         2
            vulcan-exempt  7-00:00:00     cpu=32,gres/gpu=8,mem=256G         2
              vulcan-high  1-12:00:00     cpu=16,gres/gpu=4,mem=128G         2
         vulcan-high_long 14-00:00:00              cpu=32,gres/gpu=8         8
            vulcan-medium  3-00:00:00       cpu=8,gres/gpu=2,mem=64G         2
            vulcan-sailon  3-00:00:00     cpu=32,gres/gpu=8,mem=256G                              gres/gpu=48
         vulcan-scavenger  3-00:00:00     cpu=32,gres/gpu=8,mem=256G
   vulcan-scavenger-multi  3-00:00:00  cpu=288,gres/gpu=72,mem=1152G
                      ...
</pre>

<pre>
$ show_partition_qos --all | grep vulcan
                     Name MaxSubmitPU                      MaxTRESPU              GrpTRES
------------------------- ----------- ------------------------------ --------------------
                      ...
                   vulcan         500                                 cpu=1760,mem=15824G
            vulcan-ampere         500
               vulcan-cpu         500
            vulcan-ramani         500
         vulcan-scavenger         500
   vulcan-scavenger-multi         500
                      ...
</pre>
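
To see how the per-job <code>MaxTRES</code> values constrain a request, the sketch below checks a CPU/GPU/memory request against three of the QoS rows listed above. The limits are hard-coded from that listing purely for illustration and may be out of date (run <code>show_qos</code> for current values); <code>fits_qos</code> is a hypothetical helper, not a cluster command.

```shell
#!/bin/sh
# Usage: fits_qos <qos> <cpus> <gpus> <mem_in_G>
# Exit status: 0 if the request fits, 1 if it exceeds a limit, 2 if the QoS is not listed here.
fits_qos() {
    case "$1" in
        vulcan-default) max_cpu=4;  max_gpu=1; max_mem=32 ;;
        vulcan-medium)  max_cpu=8;  max_gpu=2; max_mem=64 ;;
        vulcan-high)    max_cpu=16; max_gpu=4; max_mem=128 ;;
        *) return 2 ;;
    esac
    [ "$2" -le "$max_cpu" ] && [ "$3" -le "$max_gpu" ] && [ "$4" -le "$max_mem" ]
}

if fits_qos vulcan-medium 8 2 64; then echo "fits vulcan-medium"; fi
if ! fits_qos vulcan-default 8 2 64; then echo "too large for vulcan-default"; fi
```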

==Storage==
Vulcan has the following storage available. Please also review UMIACS [[FilesystemDataStorage | Filesystem Data Storage]] policies including any volume that is labeled as scratch.

Vulcan users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

===Home Directories===
{{Nfshomes}}

===Scratch Directories===
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the Vulcan compute infrastructure:
* Network scratch directory
* Local scratch directories

====Network Scratch Directory====
You have 300GB of scratch storage available at <code>/vulcanscratch/<username></code>. '''It is not backed up or protected in any way.''' This directory is '''automounted''' so you will need to <code>cd</code> into the directory or request/specify a fully qualified file path to access this.

You may request a temporary increase of up to 500GB total space for a maximum of 120 days without any faculty approval by [[HelpDesk | contacting staff]]. Once the temporary increase period is over, you will be contacted and given a one-week window of opportunity to clean and secure your data before staff will forcibly remove data to get your space back under 300GB. If you need space beyond 500GB or for longer than 120 days, you will need faculty approval and/or a project directory.

This file system is available on all submission and computational nodes within the cluster.

====Local Scratch Directories====
Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. These are almost always more performant than any other storage available to the job. However, you must stage your data in within the confines of your job and stage the data out before the end of your job.
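
The stage-in, compute, stage-out pattern for local scratch can be sketched as a shell function for use inside a job script. <code>run_staged</code>, its argument order, and the <code>in</code>/<code>out</code> subdirectory layout are all assumptions made for illustration; adapt paths such as <code>/scratch0</code> to the node your job runs on.

```shell
#!/bin/sh
# Usage: run_staged <input_dir> <results_dir> <local_scratch_root> <command> [args...]
# Copies input to local scratch, runs the command there, then copies results back out.
run_staged() {
    src=$1; dest=$2; scratch=$3; shift 3
    work=$(mktemp -d "$scratch/job.XXXXXX") || return 1   # private working directory
    mkdir -p "$work/in" "$work/out" "$dest"
    cp -R "$src/." "$work/in/"                # stage data in
    status=0
    ( cd "$work" && "$@" ) || status=$?       # command reads ./in and writes ./out
    cp -R "$work/out/." "$dest/"              # stage results out before the job ends
    rm -rf "$work"                            # leave local scratch clean
    return "$status"
}

# Example (inside a batch job; all paths and the script name are placeholders):
#   run_staged "/vulcanscratch/$USER/input" "/vulcanscratch/$USER/results" /scratch0 \
#       sh "$HOME/process.sh"
```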

These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month at 1am. Different nodes will run the maintenance jobs on different days of the month to ensure the cluster is still highly available at all times. Please make sure you secure any data you write to these directories at the end of your job.
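
To see which of your files on a node's local scratch are approaching the cleanup threshold, you can list files by last access time with the standard <code>find</code> utility. This is a sketch that assumes the cleanup keys on access time (as "unaccessed" suggests); <code>stale_files</code> is a hypothetical helper and the example path is a placeholder.

```shell
#!/bin/sh
# Usage: stale_files <directory> [days]
# Lists regular files whose last access is more than <days> days old (default 90).
stale_files() {
    find "$1" -type f -atime +"${2:-90}" -print
}

# Example (placeholder path): files worth securing before they reach 90 days
#   stale_files "/scratch0/$USER" 80
```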

===Datasets===
We have read-only dataset storage available at <code>/fs/vulcan-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of Vulcan datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=Vulcan here].

===Project Storage===
Users within the Vulcan compute infrastructure can request project based allocations for up to 10TB for up to 180 days by [[HelpDesk | contacting staff]] with approval from the Vulcan faculty manager (Dr. Shrivastava). These allocations will be available from <code>/fs/vulcan-projects</code> under a name that you provide when you request the allocation. Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 180 days (requires re-approval from Dr. Shrivastava).
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then remove the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but the original faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and the faculty approver responding, staff will remove the allocation.

Project storage is fully protected. It has [[Snapshots | snapshots]] enabled and is [[NightlyBackups | backed up nightly]].

===Object Storage===
All Vulcan users can request project allocations in the [https://obj.umiacs.umd.edu/obj/help UMIACS Object Store]. Please [[HelpDesk | contact staff]] with a short project name and the amount of storage you will need to get started.

To access this storage, you'll need to use an [[S3Clients | S3 client]] or our [[UMobj]] command line utilities.

An example of how to use the umobj command line utilities can be found [[UMobj/Example | here]]. A full set of documentation for the utilities can be found on the [https://gitlab.umiacs.umd.edu/staff/umobj/blob/master/README.md#umobj umobj GitLab page].

Nexus/Vulcan

2026-07-17T16:10:55Z

Mbaney:

The compute nodes from Vulcan's previous standalone cluster have folded into [[Nexus]] as of the scheduled [[MonthlyMaintenanceWindow | maintenance window]] for August 2023 (Thursday 08/17/2023, 5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

==Usage==
You can [[SSH]] to <code>nexusvulcan.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexusvulcan00.umiacs.umd.edu</code>
* <code>nexusvulcan01.umiacs.umd.edu</code>

All partitions, QoSes, and account names from the standalone Vulcan cluster have been moved over to Nexus. However, please note that <code>vulcan-</code> is prepended to all of the values that were present in the standalone Vulcan cluster to distinguish them from existing values in Nexus. The lone exception is the base account that was named <code>vulcan</code> in the standalone cluster (it is also named just <code>vulcan</code> in Nexus).

Here are some before/after examples of job submission with various parameters:

{| class="wikitable"
! Standalone Vulcan cluster submission command
! Nexus cluster submission command
|-
|<code>srun --partition=dpart --qos=medium --account=abhinav --gres=gpu:gtx1080ti:2 --pty bash</code>
|<code>srun --partition=vulcan-dpart --qos=vulcan-medium --account=vulcan-abhinav --gres=gpu:gtx1080ti:2 --pty bash</code>
|-
|<code>srun --partition=cpu --qos=cpu --pty bash</code>
|<code>srun --partition=vulcan-cpu --qos=vulcan-cpu --account=vulcan --pty bash</code>
|-
|<code>srun --partition=scavenger --qos=scavenger --account=vulcan --gres=gpu:4 --pty bash</code>
|<code>srun --partition=vulcan-scavenger --qos=vulcan-scavenger --account=vulcan --gres=gpu:4 --pty bash</code>
|}

Vulcan users (exclusively) can schedule non-interruptible jobs on Vulcan nodes with any non-scavenger job parameters. Please note that the <code>vulcan-dpart</code> partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on all vulcan## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a max submission limit of 500 jobs per user simultaneously so as to not overload the cluster. This is codified by the partition QoS named '''vulcan'''.

Please note that the Vulcan compute nodes are also in the institute-wide <code>scavenger</code> partition in Nexus. Vulcan users still have scavenging priority over these nodes via the <code>vulcan-scavenger</code> partition (i.e., all <code>vulcan-</code> partition jobs (other than <code>vulcan-scavenger</code>) can preempt both <code>vulcan-scavenger</code> and <code>scavenger</code> partition jobs, and <code>vulcan-scavenger</code> partition jobs can preempt <code>scavenger</code> partition jobs).

==Compute Nodes==
There are currently 46 [[Nexus/Vulcan/GPUs | GPU nodes]] available, named vulcan[00-45], running a mixture of NVIDIA RTX A6000, NVIDIA RTX A5000, NVIDIA RTX A4000, and a number of different older generation cards. There are also 4 CPU-only nodes available, named brigid[16-19].

All nodes are scheduled with the [[SLURM]] resource manager.

==Network==
The network infrastructure supporting the Vulcan partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* brigid[16-17],vulcan[29-45]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* vulcan46: Four 100GbE links, two to each switch in the pair (redundancy and increased bandwidth).
# One pair of network switches connected to the above pair of switches through several sets of intermediary switches, and to each other via dual 10GbE links for redundancy. The immediate connection to these sets of intermediary switches is via two 40GbE links to a pair of them, one between the first two switches in each pair and one between the second two switches in each pair for redundancy. This pair serves the following compute nodes:
#* vulcan[23,27-28]: Two 10GbE links per node, one to each switch in the pair (redundancy).

The fileserver hosting all Vulcan [[Nexus/Vulcan#Scratch_Directories | scratch]], [[Nexus/Vulcan#Project_Storage | project]], and [[Nexus/Vulcan#Datasets | dataset]] allocations first connects to a pair of intermediary switches and then the first pair of switches mentioned [[Nexus/Tron#Network | here (Tron page's network section)]]. It then connects to the first pair of switches mentioned on this page through a set of four (different) intermediary switches. The last hop from the four intermediary switches to the first pair of switches mentioned on this page is via 32 100GbE links, four from each switch in the set to each switch in the first pair mentioned on this page for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

==Partitions==
There are three partitions available to general Vulcan [[SLURM]] users. You must specify a partition when submitting your job.

* '''vulcan-dpart''' - This is the default partition. Job allocations are guaranteed. Only nodes with GPUs from architectures older than NVIDIA's [https://www.nvidia.com/en-us/data-center/ampere-architecture/ Ampere architecture] are included in this partition.
* '''vulcan-scavenger''' - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other non-scavenger-named <code>vulcan-</code> partitions are ready to be scheduled.
* '''vulcan-cpu''' - This partition is for CPU focused jobs. Job allocations are guaranteed.

There are a few additional partitions available to subsets of Vulcan users based on specific requirements.

* '''vulcan-ampere''' - This partition contains nodes with GPUs from NVIDIA's [https://www.nvidia.com/en-us/data-center/ampere-architecture/ Ampere architecture] or a newer architecture. Job allocations are guaranteed. Please be aware of the following restrictions on this partition:
*: ''Time limit'': there is a 4 hour time limit on interactive jobs in this partition. If you need to run longer jobs, you will need to modify your workflow into a job that can be submitted as a batch script.
*: ''CPU/memory per GPU limit'': there is a limit of 4 CPUs and 48G memory maximum per non-H200 GPU requested by a job, and 16 CPUs and 256G memory maximum per H200 GPU requested by a job. If you need to run jobs with more CPUs/memory, you will either need to request more GPUs in the job or use a different partition.

: Submission is restricted to the Slurm [[#Accounts | accounts]] of the faculty who invested in these nodes:
:* Abhinav Shrivastava (vulcan-abhinav)
:* Jia-Bin Huang (vulcan-jbhuang)
:* Christopher Metzler (vulcan-metzler)
:* Ruoshi Liu (vulcan-ruoshi)
:* Matthias Zwicker (vulcan-zwicker)

* '''vulcan-scavenger-multi''' - This partition allows multi-node jobs (up to 9 total nodes per job) and allows jobs more resources than the vulcan-scavenger partition, but only contains nodes with GTX 1080 Ti, TITAN Xp, and/or RTX 2080 Ti GPUs in them. As with vulcan-scavenger, it is preemptable when jobs in other non-scavenger-named <code>vulcan-</code> partitions are ready to be scheduled.
*: Access to this partition is on a per-use basis. Please contact Abhinav Shrivastava if you would like to be granted access to this partition.

There is one additional partition available solely to Dr. Ramani Duraiswami's sponsored accounts.

* '''vulcan-ramani''' - This partition is for exclusive priority access to Dr. Duraiswami's purchased GPU nodes. Job allocations are guaranteed.

==Accounts==
Vulcan has a base SLURM account <code>vulcan</code> which has a modest number of guaranteed billing resources available to all cluster users at any given time. Other faculty that have invested in Vulcan compute infrastructure have an additional account provided to their sponsored accounts on the cluster.

If you do not specify an account when submitting your job, you will receive the '''vulcan''' account. If your faculty sponsor has their own account, it is recommended to use that account for job submission.

The current faculty accounts are:
* vulcan-abhinav
* vulcan-djacobs
* vulcan-jbhuang
* vulcan-metzler
* vulcan-rama
* vulcan-ramani
* vulcan-ruoshi
* vulcan-yaser
* vulcan-zwicker

<pre>
$ sacctmgr show account format=account%20,description%30,organization%10
             Account                          Descr        Org
-------------------- ------------------------------ ----------
                 ...                            ...        ...
              vulcan                         vulcan     vulcan
      vulcan-abhinav   vulcan - abhinav shrivastava     vulcan
      vulcan-djacobs          vulcan - david jacobs     vulcan
      vulcan-jbhuang         vulcan - jia-bin huang     vulcan
      vulcan-metzler         vulcan - chris metzler     vulcan
         vulcan-rama        vulcan - rama chellappa     vulcan
       vulcan-ramani     vulcan - ramani duraiswami     vulcan
       vulcan-ruoshi            vulcan - ruoshi liu     vulcan
        vulcan-yaser          vulcan - yaser yacoob     vulcan
      vulcan-zwicker      vulcan - matthias zwicker     vulcan
                 ...                            ...        ...
</pre>

Faculty can manage this list of users via our [https://intranet.umiacs.umd.edu/directory/secgroup/ Directory application] in the Security Groups section. The security group that controls access has the prefix <code>vulcan_</code> and then the faculty username. It will also list <code>slurm://nexusctl.umiacs.umd.edu</code> as the associated URI.

You can check your account associations by running the '''show_assoc''' command to see the accounts you are associated with. Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association.

<pre>
$ show_assoc
      User          Account MaxJobs       GrpTRES                                                                              QOS
---------- ---------------- ------- ------------- --------------------------------------------------------------------------------
       ...              ...     ...                                                                                            ...
   abhinav           vulcan      48                                        vulcan-cpu,vulcan-default,vulcan-medium,vulcan-scavenger
   abhinav   vulcan-abhinav      48                            vulcan-cpu,vulcan-default,vulcan-high,vulcan-medium,vulcan-scavenger
       ...              ...     ...                                                                                            ...
</pre>

You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command. Make sure you specify the account you are interested in. As shown below, there is a concurrent limit of 64 total GPUs for all users not in a contributing faculty group.

<pre>
$ sacctmgr show assoc account=vulcan format=user,account,qos,grptres
      User    Account                  QOS       GrpTRES
---------- ---------- -------------------- -------------
               vulcan                        gres/gpu=64
                  ...                                ...
</pre>

==QoS==
Vulcan currently has 3 QoS for the '''vulcan-dpart''' partition, 1 QoS for the '''vulcan-scavenger''' partition, and 1 QoS for the '''vulcan-cpu''' partition. If you do not specify a QoS when submitting your job using the <code>--qos</code> parameter, you will receive the <code>vulcan-default</code> QoS assuming you are using a Vulcan account.

Each QoS sets its own maximum wall time, total number of jobs you can run at once, and maximum amount of trackable resources (TRES) per job. In the cml-scavenger QoS, you are additionally restricted by the total number of TRES per user (across multiple jobs).

<pre>
$ show_qos --all | grep vulcan
                     Name     MaxWall                        MaxTRES MaxJobsPU                      MaxTRESPU
------------------------- ----------- ------------------------------ --------- ------------------------------
                      ...
               vulcan-cpu  2-00:00:00                cpu=1024,mem=4T         4
           vulcan-default  7-00:00:00       cpu=4,gres/gpu=1,mem=32G         2
            vulcan-exempt  7-00:00:00     cpu=32,gres/gpu=8,mem=256G         2
              vulcan-high  1-12:00:00     cpu=16,gres/gpu=4,mem=128G         2
         vulcan-high_long 14-00:00:00              cpu=32,gres/gpu=8         8
            vulcan-medium  3-00:00:00       cpu=8,gres/gpu=2,mem=64G         2
            vulcan-sailon  3-00:00:00     cpu=32,gres/gpu=8,mem=256G                              gres/gpu=48
         vulcan-scavenger  3-00:00:00     cpu=32,gres/gpu=8,mem=256G
   vulcan-scavenger-multi  3-00:00:00  cpu=288,gres/gpu=72,mem=1152G
                      ...
</pre>

<pre>
$ show_partition_qos --all | grep vulcan
                     Name MaxSubmitPU                      MaxTRESPU              GrpTRES
------------------------- ----------- ------------------------------ --------------------
                      ...
                   vulcan         500                                 cpu=1760,mem=15824G
            vulcan-ampere         500
               vulcan-cpu         500
            vulcan-ramani         500
         vulcan-scavenger         500
   vulcan-scavenger-multi         500
                      ...
</pre>

==Storage==
Vulcan has the following storage available. Please also review UMIACS [[FilesystemDataStorage | Filesystem Data Storage]] policies including any volume that is labeled as scratch.

Vulcan users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

===Home Directories===
{{Nfshomes}}

===Scratch Directories===
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the Vulcan compute infrastructure:
* Network scratch directory
* Local scratch directories

====Network Scratch Directory====
You have 300GB of scratch storage available at <code>/vulcanscratch/<username></code>. '''It is not backed up or protected in any way.''' This directory is '''automounted''' so you will need to <code>cd</code> into the directory or request/specify a fully qualified file path to access this.

You may request a temporary increase of up to 500GB total space for a maximum of 120 days without any faculty approval by [[HelpDesk | contacting staff]]. Once the temporary increase period is over, you will be contacted and given a one-week window of opportunity to clean and secure your data before staff will forcibly remove data to get your space back under 300GB. If you need space beyond 500GB or for longer than 120 days, you will need faculty approval and/or a project directory.

This file system is available on all submission and computational nodes within the cluster.

====Local Scratch Directories====
Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. These are almost always more performant than any other storage available to the job. However, you must stage your data in within the confines of your job and stage the data out before the end of your job.

These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month at 1am. Different nodes will run the maintenance jobs on different days of the month to ensure the cluster is still highly available at all times. Please make sure you secure any data you write to these directories at the end of your job.

===Datasets===
We have read-only dataset storage available at <code>/fs/vulcan-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of Vulcan datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=Vulcan here].

===Project Storage===
Users within the Vulcan compute infrastructure can request project based allocations for up to 10TB for up to 180 days by [[HelpDesk | contacting staff]] with approval from the Vulcan faculty manager (Dr. Shrivastava). These allocations will be available from <code>/fs/vulcan-projects</code> under a name that you provide when you request the allocation. Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 180 days (requires re-approval from Dr. Shrivastava).
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then remove the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but the original faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and the faculty approver responding, staff will remove the allocation.

Project storage is fully protected. It has [[Snapshots | snapshots]] enabled and is [[NightlyBackups | backed up nightly]].

===Object Storage===
All Vulcan users can request project allocations in the [https://obj.umiacs.umd.edu/obj/help UMIACS Object Store]. Please [[HelpDesk | contact staff]] with a short project name and the amount of storage you will need to get started.

To access this storage, you'll need to use an [[S3Clients | S3 client]] or our [[UMobj]] command line utilities.

An example of how to use the umobj command line utilities can be found [[UMobj/Example | here]]. A full set of documentation for the utilities can be found on the [https://gitlab.umiacs.umd.edu/staff/umobj/blob/master/README.md#umobj umobj GitLab page].

Nexus/CML

2026-07-17T16:10:08Z

Mbaney: /* Network */

The compute nodes from [[CML]]'s previous standalone cluster were folded into [[Nexus]] as of the scheduled [[MonthlyMaintenanceWindow | maintenance window]] for August 2023 (Thursday 08/17/2023, 5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

==Usage==
You can [[SSH]] to <code>nexuscml.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexuscml00.umiacs.umd.edu</code>
* <code>nexuscml01.umiacs.umd.edu</code>

CML users (exclusively) can schedule non-interruptible jobs on CML nodes with any non-scavenger job parameters. Please note that the <code>cml-dpart</code> partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on all cml## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a max submission limit of 500 jobs per user simultaneously so as to not overload the cluster. This is codified by the partition QoS named '''cml'''.

Please note that the CML compute nodes are also in the institute-wide <code>scavenger</code> partition in Nexus. CML users still have scavenging priority over these nodes via the <code>cml-scavenger</code> partition, i.e., all <code>cml-*</code> partition jobs (other than <code>cml-scavenger</code>) can preempt both <code>cml-scavenger</code> and <code>scavenger</code> partition jobs, and <code>cml-scavenger</code> partition jobs can preempt <code>scavenger</code> partition jobs.
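
The preemption ordering described above can be modeled as three tiers. The sketch below is only an illustration of that ordering for the partitions named in this section; the function names are hypothetical, and it does not query or reflect the actual Slurm configuration.

```python
def preemption_tier(partition: str) -> int:
    """Rank a partition: higher tiers can preempt lower tiers."""
    if partition == "scavenger":
        return 0  # institute-wide scavenger partition
    if partition == "cml-scavenger":
        return 1  # CML scavenger partition
    if partition.startswith("cml-"):
        return 2  # all other cml-* partitions
    raise ValueError(f"partition not covered by this sketch: {partition}")


def can_preempt(incoming: str, running: str) -> bool:
    """Return True if a job in `incoming` can preempt a job in `running`."""
    return preemption_tier(incoming) > preemption_tier(running)
```

For example, <code>can_preempt("cml-dpart", "cml-scavenger")</code> and <code>can_preempt("cml-scavenger", "scavenger")</code> are both true, while two non-scavenger <code>cml-*</code> partitions cannot preempt each other in this model.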

==Network==
The network infrastructure supporting the CML partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* cml[17-28,30-32,34,37]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* cml[35-36,38]: Four 100GbE links per node, two to each switch in the pair (redundancy and increased bandwidth).
# One pair of network switches connected to the above pair of network switches via two 100GbE links, one between the first two switches in each pair and one between the second two switches in each pair for redundancy, and to each other via dual 25GbE links for redundancy.
#* cml[00,02-09]: Two 25GbE links per node, one to each switch in the pair (redundancy).
#* cml[10-16]: Two 10GbE links per node, one to each switch in the pair (redundancy).

The fileserver hosting all CML [[Nexus/CML#Project_Directories | project]], [[Nexus/CML#Scratch_Directories | scratch]], [[Nexus/CML#Datasets | dataset]], and [[Nexus/CML#Models | model]] allocations also connects to the same pair of switches supporting cml[17-28,30-32] via fourteen 25GbE links, seven to each switch in the pair for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].
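
The bracketed node ranges used in this section, such as <code>cml[17-28,30-32,34,37]</code>, can be expanded into individual hostnames. On a node with Slurm installed, <code>scontrol show hostnames</code> does this; the Python sketch below is a simplified stand-in that handles only a single bracket group, and the function name <code>expand_hostlist</code> is an assumption.

```python
import re
from typing import List


def expand_hostlist(expr: str) -> List[str]:
    """Expand 'prefix[a-b,c]' into a list of hostnames.

    Handles only one bracket group; anything else is returned unchanged.
    Zero padding follows the width of each range's starting number.
    """
    match = re.fullmatch(r"(\w+)\[([\d,-]+)\]", expr)
    if match is None:
        return [expr]
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-")
            width = len(start)
            for number in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{number:0{width}d}")
        else:
            hosts.append(prefix + part)
    return hosts
```

For instance, <code>expand_hostlist("cml[35-36,38]")</code> yields <code>cml35</code>, <code>cml36</code>, and <code>cml38</code>.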

==Partitions==
There are two partitions available to general CML [[SLURM]] users. You must specify a partition when submitting your job.

* '''cml-dpart''' - This is the default partition. Job allocations are guaranteed.
* '''cml-scavenger''' - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other <code>cml-</code> partitions are ready to be scheduled.

There are a few additional partitions available solely to specific faculty members and their sponsored user accounts.

* '''cml-furongh''' - This partition is for exclusive priority access to Dr. Furong Huang's purchased nodes. Job allocations are guaranteed.
* '''cml-ramani''' - This partition is for exclusive priority access to Dr. Ramani Duraiswami's purchased nodes. Job allocations are guaranteed.
* '''cml-sfeizi''' - This partition is for exclusive priority access to Dr. Soheil Feizi's purchased nodes. Job allocations are guaranteed.

There is also one additional partition available to user accounts named by CML's director.

* '''cml-director''' - This partition is for exclusive priority access to designated CML-purchased nodes. Job allocations are guaranteed.

==Accounts==
The Center has a base SLURM account <code>cml</code> which has a modest number of guaranteed billing resources available to all cluster users at any given time. Other faculty that have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed billing resources corresponding to the amount that they invested.

If you do not specify an account when submitting your job, you will receive the '''cml''' account, which only has access to the '''cml-default''' and '''cml-medium''' QoSes (see below section).

If you need access to a different QoS, or if the '''cml''' account is at its billing limit (see below in this section), please use your faculty sponsor's account if they have one available. However, keep in mind that if your faculty sponsor has their own named partition (see previous section), using the faculty-specific account in the '''cml-dpart''' partition may block access to resources in the faculty-specific partition, since the billing limit for the account is charged regardless of what partition is being used.

The current faculty accounts are:
* cml-abhinav
* cml-furongh
* cml-hajiagha
* cml-ramani
* cml-sfeizi
* cml-tokekar
* cml-tomg

<pre>
$ sacctmgr show account format=account%20,description%30,organization%10
             Account                          Descr        Org
-------------------- ------------------------------ ----------
                 ...                            ...        ...
                 cml                            cml        cml
         cml-abhinav      cml - abhinav shrivastava        cml
         cml-furongh             cml - furong huang        cml
        cml-hajiagha      cml - mohammad hajiaghayi        cml
          cml-ramani        cml - ramani duraiswami        cml
       cml-scavenger                cml - scavenger        cml
          cml-sfeizi             cml - soheil feizi        cml
         cml-tokekar           cml - pratap tokekar        cml
            cml-tomg            cml - tom goldstein        cml
                 ...                            ...        ...
</pre>

Faculty can manage the list of users that have access to their Slurm account via our [https://intranet.umiacs.umd.edu/directory/secgroup Directory application] in the Security Groups section. The security group that controls access has the prefix <code>cml_</code> prepended to their UMD directory ID. It will also list <code>slurm://nexusctl.umiacs.umd.edu</code> as the associated URI.

You can check your account associations by running the '''show_assoc''' command. Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association(s).

<pre>
$ show_assoc
User Account MaxJobs GrpTRES QOS
---------- ---------------- ------- ------------- --------------------------------------------------
... ... ...
tomg cml cml-default,cml-medium
tomg cml-scavenger cml-scavenger
tomg cml-tomg cml-default,cml-high,cml-medium
... ... ...
</pre>

You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command. Please make sure you specify the account that you are looking for. The billing number displayed here is the sum of [[SLURM/Priority#Fair-share | resource weightings]] for all nodes appropriated to that account.

<pre>
$ sacctmgr show assoc account=cml format=user,account,qos,grptres
User Account QOS GrpTRES
---------- ---------- -------------------- -------------
cml billing=6481
... ...
</pre>

==QoS==
CML currently has 5 QoS for the '''cml-dpart''' partition (though <code>cml-high_long</code> and <code>cml-very_high</code> may not be available to all faculty accounts) and 1 QoS for the '''cml-scavenger''' partition. If you do not specify a QoS when submitting your job using the <code>--qos</code> parameter, you will receive the <code>cml-default</code> QoS assuming you are using a CML account.

If your faculty member's Slurm account does not have one or both of the <code>cml-high_long</code> or <code>cml-very_high</code> QoS available to it, we can add it to their account provided they approve. Please [[HelpDesk | contact staff]] if this is desired.

Each QoS differs in its maximum wall time, the total number of jobs you can have running at once, and the maximum number of trackable resources (TRES) per job. In the <code>cml-scavenger</code> QoS, you are additionally restricted by the total number of TRES per user (across multiple jobs).

<pre>
$ show_qos --all | grep cml
                Name     MaxWall                        MaxTRES MaxJobsPU                      MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
                 ...
         cml-default  7-00:00:00       cpu=4,gres/gpu=1,mem=32G         2
            cml-high  1-12:00:00     cpu=16,gres/gpu=4,mem=128G         2
       cml-high_long 14-00:00:00              cpu=32,gres/gpu=8         8                     gres/gpu=8
          cml-medium  3-00:00:00       cpu=8,gres/gpu=2,mem=64G         2
       cml-scavenger  3-00:00:00                                                             gres/gpu=24
       cml-very_high  1-12:00:00     cpu=32,gres/gpu=8,mem=256G         8                    gres/gpu=12
                 ...
</pre>
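
To see which '''cml-dpart''' QoS a given job request would fit under, the per-job limits from the <code>show_qos</code> output above can be encoded and checked. This is a rough sketch: the limits are copied from the listing as shown here and may change, <code>cml-scavenger</code> is left out because it belongs to a different partition, and the names <code>QosLimit</code>, <code>QOS_LIMITS</code>, and <code>eligible_qos</code> are hypothetical. It also does not check whether your account actually has access to a QoS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QosLimit:
    max_wall_hours: int
    max_cpus: Optional[int]    # None means no per-job limit is listed
    max_gpus: Optional[int]
    max_mem_gb: Optional[int]


# Values transcribed from the show_qos listing (MaxWall and MaxTRES columns).
QOS_LIMITS = {
    "cml-default": QosLimit(7 * 24, 4, 1, 32),
    "cml-medium": QosLimit(3 * 24, 8, 2, 64),
    "cml-high": QosLimit(36, 16, 4, 128),
    "cml-high_long": QosLimit(14 * 24, 32, 8, None),
    "cml-very_high": QosLimit(36, 32, 8, 256),
}


def fits(limit: QosLimit, hours: float, cpus: int, gpus: int, mem_gb: int) -> bool:
    """Return True if the request is within every listed per-job limit."""
    def within(value, cap):
        return cap is None or value <= cap

    return (
        hours <= limit.max_wall_hours
        and within(cpus, limit.max_cpus)
        and within(gpus, limit.max_gpus)
        and within(mem_gb, limit.max_mem_gb)
    )


def eligible_qos(hours: float, cpus: int, gpus: int, mem_gb: int) -> List[str]:
    """List the QoS names whose per-job limits accommodate the request."""
    return [
        name
        for name, limit in QOS_LIMITS.items()
        if fits(limit, hours, cpus, gpus, mem_gb)
    ]
```

For example, a 48-hour job with 8 CPUs, 2 GPUs, and 64GB of RAM fits <code>cml-medium</code> and <code>cml-high_long</code>, but exceeds the CPU limit of <code>cml-default</code> and the wall time of <code>cml-high</code> and <code>cml-very_high</code>.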

<pre>
$ show_partition_qos --all | grep cml
Name MaxSubmitPU MaxTRESPU GrpTRES
-------------------- ----------- ------------------------------ --------------------
...
cml 500 cpu=1128,mem=11T
cml-director 500
cml-furongh 500
cml-scavenger 500 gres/gpu=24
cml-sfeizi 500
cml-wriva 500
...
</pre>

==Storage==
There are 3 types of user storage available to users in the CML:
* Home directories
* Project directories
* Scratch directories

There are also 2 types of read-only storage available for common use among users in the CML:
* Dataset directories
* Model directories

CML users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

===Home Directories===
{{Nfshomes}}

===Project Directories===
You can request project-based allocations of up to 6TB for up to 120 days with one or more approvals:
* Allocations up to and including 3TB require approval from a CML faculty member
* Allocations above 3TB (up to 6TB) require approval from both a CML faculty member and the [https://ml.umd.edu/#team director of CML]
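
The approval rules above can be written as a small check. This is an illustrative sketch only; the function name <code>required_approvers</code> and the constant names are assumptions, and staff make the actual determination.

```python
from typing import List

MAX_SIZE_TB = 6            # largest project allocation that can be requested
MAX_DAYS = 120             # longest allocation period per request
DIRECTOR_THRESHOLD_TB = 3  # above this, the director of CML must also approve


def required_approvers(size_tb: float, days: int) -> List[str]:
    """Return who must approve a CML project allocation request."""
    if not 0 < size_tb <= MAX_SIZE_TB:
        raise ValueError(f"size must be between 0 and {MAX_SIZE_TB}TB")
    if not 0 < days <= MAX_DAYS:
        raise ValueError(f"length must be between 1 and {MAX_DAYS} days")
    approvers = ["CML faculty member"]
    if size_tb > DIRECTOR_THRESHOLD_TB:
        approvers.append("director of CML")
    return approvers
```

A 3TB request needs only a faculty member's approval, while a 4TB request needs both approvals.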

To request an allocation, please [[HelpDesk | contact staff]] and include the faculty member(s) that the project falls under in the conversation. Please include the following details:
* Project Name (short)
* Description
* Size (1TB, 2TB, etc.)
* Length in days (30 days, 90 days, etc.)
* Other user(s) that need to access the allocation, if any

These allocations will be available from '''/fs/cml-projects''' under a name that you provide when you request the allocation.

This data is backed up nightly.

====Renewal or Retirement====
Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 120 days (requires re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML).
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then retire the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but a faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and a faculty approver responding, staff will retire the allocation.
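
The key dates in this process can be estimated from the allocation end date. The sketch below makes two assumptions that are not confirmed here: "within two weeks of the end" is read as 14 days after the end date, and "one month" is read as one calendar month (clamped to the last day of shorter months). The function names are hypothetical, and staff's own scheduling takes precedence.

```python
import calendar
from datetime import date, timedelta


def add_one_month(day: date) -> date:
    """Return the same day in the next month, clamped to that month's end."""
    year = day.year + day.month // 12
    month = day.month % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def retirement_timeline(allocation_end: date) -> dict:
    """Estimate the milestone dates that follow an allocation's end date."""
    return {
        "allocation_end": allocation_end,
        # Assumed reading: two weeks after the end of the period.
        "two_week_mark": allocation_end + timedelta(days=14),
        # Assumed reading: one calendar month after the end of the period.
        "one_month_mark": add_one_month(allocation_end),
    }
```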

===Scratch Directories===
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the CML compute infrastructure:
* Network scratch directory
* Local scratch directories

====Network Scratch Directory====
You have 200GB of scratch storage available at <code>/cmlscratch/<username></code>. '''It is not backed up or protected in any way.''' This directory is '''automounted''', so you will need to <code>cd</code> into the directory or specify a fully qualified file path to access it.

You may request a permanent increase of up to 800GB total space without any faculty approval by [[HelpDesk | contacting staff]]. If you need space beyond 800GB, you will need faculty approval and/or a project directory. Space increases beyond 800GB also have a maximum request period of 120 days (as with project directories), after which they will need to be renewed with re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML.
* As with project directories, allocations over 3TB total space require approval from the [https://ml.umd.edu/#team director of CML] in addition to your faculty member.

This file system is available on all submission and computational nodes within the cluster.

====Local Scratch Directories====
Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. These are almost always more performant than any other storage available to the job. However, you must stage data to these directories within the confines of your jobs and stage the data out before the end of your jobs.

These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month during our monthly maintenance windows. Again, please make sure you secure any data you write to these directories at the end of your job.
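
To get a sense of which files might be at risk, you can list files whose last access time is older than 90 days. This is a rough sketch assuming the cleanup is keyed on access time, as "unaccessed" suggests; the exact tmpwatch options staff use are not stated here, and the function name <code>unaccessed_files</code> is hypothetical.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def unaccessed_files(
    root: Path, max_age_days: int = 90, now: Optional[float] = None
) -> List[Path]:
    """Return files under root whose last access is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # lstat() reads metadata without following symlinks.
            if path.lstat().st_atime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling <code>unaccessed_files(Path("/scratch0") / "your_directory")</code> on a compute node would list candidates under that (hypothetical) directory; anything you still need should be copied to protected storage regardless.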

===Datasets===
We have read-only dataset storage available at <code>/fs/cml-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of CML datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=CML here].

===Models===
We have read-only model storage available at <code>/fs/cml-models</code>. If there are models that you would like to see downloaded and made available, please see [[Datasets | this page]].

Nexus/CML

2026-07-17T16:09:57Z

Mbaney:

Nexus

2026-07-17T16:09:16Z

Mbaney:

{{Note|UMIACS Technical Staff has begun the process of upgrading the operating system version on all Nexus cluster nodes as of 06/01/2026. Please see [[Nexus/ClusterOSUpgrade]] for more information.}}

The Nexus is the combined scheduler of resources in UMIACS. The resource manager for Nexus is [[SLURM]]. Resources are arranged into partitions where users are able to schedule computational jobs. Users are arranged into a number of SLURM accounts based on faculty, lab, or center investments.

= Getting Started =
All accounts in UMIACS are sponsored. If you don't already have a UMIACS account, please see [[Accounts]] for information on getting one. You need a full UMIACS account - not a [[Accounts/Collaborator | collaborator account]] - in order to access Nexus.

== Access ==
Your access to submission nodes (alternatively called login nodes) for Nexus computational resources is determined by your account sponsor's department, center, or lab affiliation. You can log into the [https://intranet.umiacs.umd.edu/directory/cr/ UMIACS Directory CR application] and select the Computational Resource (CR) in the list that has the prefix <code>nexus</code>. The Hosts section lists your available submission nodes - generally a pair of nodes with hostnames of the format <tt>nexus<department, lab, or center abbreviation>[00,01]</tt>, e.g., <tt>nexusgroup00</tt> and <tt>nexusgroup01</tt>.

Once you have identified your submission nodes, you can [[SSH]] into them [https://itsupport.umd.edu/itsupport?id=kb_article_view&sysparm_article=KB0016076 after connecting to UMD's GlobalProtect VPN]. From there, you are able to submit to the cluster via our [[SLURM]] workload manager. You need to make sure that your submitted jobs have the correct account, partition, and QoS.

Please read our [[Nexus/Submission_Node_Policy|Submission Node Policy]] for guidance on appropriate usage of a submission node. If a submission node becomes unresponsive due to disregarding this policy, we may kill user processes on these nodes to resolve the issue. We reserve the right to take action on users who repeatedly cause issues on submission nodes.

== Jobs ==
[[SLURM]] jobs are [[SLURM/JobSubmission | submitted]] with either <code>srun</code> or <code>sbatch</code>, depending on whether you are running an interactive job or a batch job, respectively. You need to provide the who/where/how to run the job and specify the resources it needs.

For the who/where/how, you may be required to specify <code>--account</code>, <code>--partition</code>, and/or <code>--qos</code> (respectively) to be able to adequately submit jobs to the Nexus.

For resources, you may need to specify <code>--time</code> for time, <code>--cpus-per-task</code> for CPUs, <code>--mem</code> for RAM, and <code>--gres=gpu</code> for GPUs in your submission arguments to meet your requirements. There are defaults for all four; if you don't specify something, you will get the default value for that resource, which is minimal (e.g., by default, NO GPUs are included if you do not specify <code>--gres=gpu</code>). For more information about submission flags for GPU resources, see [[SLURM/JobSubmission#Requesting_GPUs | here]]. You may also use <code>--ntasks</code> to specify the number of parallel processes to run, with each task having its own set of the resources specified above. You can run <code>man srun</code> on your submission node for a complete list of available submission arguments.
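
To make the mapping between these flags concrete, here is a minimal sketch that assembles the who/where/how and resource arguments into a single <code>srun</code> command line and prints it rather than running it. The function name <code>build_srun</code> is hypothetical; the account, partition, and QoS values are the defaults shown in the interactive example on this page, and the resource values are illustrative.

```shell
#!/bin/bash
# Hypothetical helper: print an srun command line for an interactive shell.
# Arguments: account partition qos time cpus mem gpus
build_srun() {
    local account="$1" partition="$2" qos="$3"
    local time="$4" cpus="$5" mem="$6" gpus="$7"
    local cmd="srun --pty --account=${account} --partition=${partition} --qos=${qos}"
    cmd="${cmd} --time=${time} --cpus-per-task=${cpus} --mem=${mem}"
    # No GPUs are allocated unless --gres=gpu is requested explicitly.
    if [ "${gpus}" -gt 0 ]; then
        cmd="${cmd} --gres=gpu:${gpus}"
    fi
    echo "${cmd} bash"
}

# Print (do not run) a request for 4 CPUs, 16GB RAM, and 1 GPU for one hour.
build_srun nexus tron default 01:00:00 4 16gb 1
```

Printing the command first is a cheap way to check the flags before actually submitting.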

For a list of available GPU types on Nexus and their specs, please see [[Nexus/GPUs]].

For details on how the network for Nexus is architected, please see [[Nexus/Network]]. This can be important if you wish to optimize performance of your jobs.

=== Interactive ===
Once logged into a submission node, you can run simple interactive jobs. If your session on the submission node is interrupted, the job will be killed. As such, we encourage use of a terminal multiplexer such as [[Tmux]].

<pre>
$ srun --pty --cpus-per-task=4 --mem=2gb --gres=gpu:1 bash
srun: Job account was unset; set to user default of 'nexus'
srun: Job partition was unset; set to cluster default of 'tron'
srun: Job QoS was unset; set to association default of 'default'
srun: Job time limit was unset; set to partition default of 60 minutes
srun: job 1 queued and waiting for resources
srun: job 1 has been allocated resources
$ hostname
tron62.umiacs.umd.edu
$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-daad6a04-a2ce-1183-ce53-b267048f750a)
</pre>

=== Batch ===
Batch jobs are submitted as a script file, which can optionally embed job scheduling parameters in <code>#SBATCH</code> lines at the top of the file. You can find some examples in our [[SLURM/JobSubmission]] documentation.
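
For orientation, a minimal batch script might look like the sketch below. The job name, output file name, and resource values are illustrative assumptions; <code>SLURM_JOB_ID</code> is set by SLURM inside a job, and the script falls back to a placeholder when it is run outside of one.

```shell
#!/bin/bash
#SBATCH --job-name=example          # illustrative job name
#SBATCH --output=example.out.%j     # %j expands to the job ID
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1gb

# SLURM sets SLURM_JOB_ID inside a job; use a placeholder otherwise.
job_id="${SLURM_JOB_ID:-not-in-a-job}"
message="Job ${job_id} running on $(hostname)"
echo "${message}"
```

Assuming the file is saved as <code>example.sh</code>, it would be submitted with <code>sbatch example.sh</code>.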

= Partitions =
The SLURM resource manager uses partitions to act as job queues which can restrict size, time and user limits. The Nexus has a number of different partitions of resources. Different Centers, Labs, and Faculty are able to invest in computational resources that are restricted to approved users through these partitions.

'''Partitions usable by all non-[[ClassAccounts |class account]] users:'''
* [[Nexus/Tron]] - Pool of resources available to all non-class accounts sponsored by either UMIACS or CSD faculty.
* Scavenger - [https://slurm.schedmd.com/preempt.html Preemption] partition that contains [https://en.wikipedia.org/wiki/X86-64 x86_64] architecture nodes from multiple other partitions. More resources are available to schedule simultaneously than in other partitions; however, jobs are subject to preemption rules. You are responsible for ensuring your jobs handle this preemption correctly. The SLURM scheduler will simply restart a preempted job with the same submission arguments when it is available to run again. For an overview of things you can check within scripts to determine if your job was preempted/resumed, see [[SLURM/Preemption]].
* Scavenger (aarch64) - Preemption partition identical in design to <tt>scavenger</tt>, but only contains [https://en.wikipedia.org/wiki/AArch64 aarch64] architecture nodes.
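
One way a job script in a preemption partition can tell a fresh start from a restart is the <code>SLURM_RESTART_COUNT</code> environment variable, which SLURM sets to the number of times a job has been restarted or requeued. The sketch below is a minimal illustration only; the function name <code>was_restarted</code> and the checkpoint messages are hypothetical, and [[SLURM/Preemption]] remains the reference for what to check.

```shell
#!/bin/bash
# Hypothetical helper: succeed if this job has been restarted at least once.
# SLURM_RESTART_COUNT is unset on a job's first run, so default it to 0.
was_restarted() {
    [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]
}

if was_restarted; then
    echo "Restart detected: resume from the last checkpoint."
else
    echo "First run: start from the beginning."
fi
```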

'''Partitions usable by [[ClassAccounts]]:'''
* [[ClassAccounts#Cluster_Usage | Class]] - Pool of resources available to class accounts sponsored by either UMIACS or CSD faculty.

'''Partitions usable by specific lab/center users:'''
* [[Nexus/CBCB]] - CBCB lab pool available for CBCB lab members.
* [[Nexus/CLIP]] - CLIP lab pool available for CLIP lab members.
* [[Nexus/CML]] - CML lab pool available for CML lab members.
* [[Nexus/GAMMA]] - GAMMA lab pool available for GAMMA lab members.
* [[Nexus/MBRC]] - MBRC lab pool available for MBRC lab members.
* [[Nexus/MC2]] - MC2 lab pool available for MC2 lab members.
* [[Nexus/QuICS]] - QuICS lab pool available for QuICS lab members.
* [[Nexus/Vulcan]] - Vulcan lab pool available for Vulcan lab members.

You can view the partitions that you have access to by using the <code>show_partitions</code> command. By default, the command will show only the partitions that are available to you.

<pre>
$ show_partitions
Name                     AllowAccounts           AllowQos                       MaxNodes    Nodes
------------------------ ----------------------- ------------------------------ ----------- ----------------------------
scavenger                scavenger               scavenger                      UNLIMITED   brigid[16-19]
                                                                                            cbcb[00-29]
                                                                                            clip[00-13]
                                                                                            cml[00,02-13,15-28,30-33]
                                                                                            gammagpu[00-21]
                                                                                            legacy[00-11,13-28,30-36]
                                                                                            legacygpu[00-07]
                                                                                            quics00
                                                                                            tron[00-69]
                                                                                            vulcan[00-45]
------------------------------------------------------------------------------------------------------------------------
scavenger-aarch64        scavenger               scavenger-aarch64              UNLIMITED   oasis[00-39]
------------------------------------------------------------------------------------------------------------------------
tron                     nexus                   default                        UNLIMITED   tron[00-69]
                                                 high
                                                 medium
</pre>

If you want to see information for all of the partitions, including those that you do not have access to, you can use the <code>show_partitions --all</code> command.

<pre>
$ show_partitions --all
Name                     AllowAccounts           AllowQos                       MaxNodes    Nodes
------------------------ ----------------------- ------------------------------ ----------- ----------------------------
cbcb                     cbcb                    default                        UNLIMITED   cbcb[00-20,22-29]
                                                 medium                                     legacy[00-11,13-28,30-36]
                                                 high
                                                 huge-long
                                                 highmem
------------------------------------------------------------------------------------------------------------------------
cbcb-heng                cbcb-heng               default                        UNLIMITED   cbcb[26-29]
                                                 medium
                                                 high
                                                 huge-long
                                                 highmem
------------------------------------------------------------------------------------------------------------------------
cbcb-interactive         cbcb                    interactive                    UNLIMITED   cbcb21
...
</pre>

= Quality of Service (QoS) =
SLURM uses Quality of Service (QoS) both to provide limits on job sizes (termed by us as "job QoS") as well as to limit resources used by all jobs running in a partition, either per user or per group (termed by us as "partition QoS").

=== Job QoS ===
Job QoS are used to provide limits on the size of job that you can run. You should try to allocate only the resources your job actually needs, as resources that each of your jobs schedules are counted against your [[SLURM/Priority#Fair-share | fair-share priority]] in the future.
* default - Default job QoS. Limited to 4 CPU cores, 1 GPU, and 32GB RAM per job. The maximum wall time per job is 3 days.
* medium - Limited to 8 CPU cores, 2 GPUs, and 64GB RAM per job. The maximum wall time per job is 2 days.
* high - Limited to 16 CPU cores, 4 GPUs, and 128GB RAM per job. The maximum wall time per job is 1 day.
* scavenger - No resource limits per job, only a maximum wall time per job of 3 days. You are responsible for ensuring your job requests multiple nodes if it requests resources beyond what any one node is capable of. 11% of the total resources available for each trackable resource type in the partition (CPUs/GPUs/RAM) is permitted simultaneously across all of your jobs running with this job QoS, enforced via the corresponding partition QoS (below) for the scavenger partition. This job QoS is paired one-to-one with the scavenger partition. To use this job QoS, include <code>--partition=scavenger</code> and <code>--account=scavenger</code> in your submission arguments. Do not include any job QoS argument other than <code>--qos=scavenger</code> (optional) or submission will fail.
* scavenger-aarch64 - No resource limits per job, only a maximum wall time per job of 3 days. You are responsible for ensuring your job requests multiple nodes if it requests resources beyond what any one node is capable of. This job QoS is paired one-to-one with the scavenger-aarch64 partition. To use this job QoS, include <code>--partition=scavenger-aarch64</code>, <code>--account=scavenger</code>, and <code>--qos=scavenger-aarch64</code> in your submission arguments.

You can display these job QoS from the command line using the <code>show_qos</code> command. By default, the command will only show job QoS that you can access. The above five job QoS are the ones that everyone can access.

<pre>
$ show_qos
Name MaxWall MaxTRES MaxJobsPU MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
default 3-00:00:00 cpu=4,gres/gpu=1,mem=32G
high 1-00:00:00 cpu=16,gres/gpu=4,mem=128G
medium 2-00:00:00 cpu=8,gres/gpu=2,mem=64G
scavenger 3-00:00:00
scavenger-aarch64 3-00:00:00
</pre>
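
The per-job limits above can be turned into a simple lookup. The following sketch picks the smallest of the <tt>default</tt>, <tt>medium</tt>, and <tt>high</tt> job QoS that fits a request; the function name <code>pick_qos</code> is hypothetical, and the limits are hard-coded from the values listed on this page, so they would need updating if the limits change.

```shell
#!/bin/bash
# Hypothetical helper: print the smallest job QoS whose per-job limits fit
# the request. Arguments: CPU cores, GPUs, memory in GB.
pick_qos() {
    local cpus="$1" gpus="$2" mem_gb="$3"
    if [ "$cpus" -le 4 ] && [ "$gpus" -le 1 ] && [ "$mem_gb" -le 32 ]; then
        echo default
    elif [ "$cpus" -le 8 ] && [ "$gpus" -le 2 ] && [ "$mem_gb" -le 64 ]; then
        echo medium
    elif [ "$cpus" -le 16 ] && [ "$gpus" -le 4 ] && [ "$mem_gb" -le 128 ]; then
        echo high
    else
        # Larger than any of the three QoS allows.
        echo none
        return 1
    fi
}

pick_qos 6 1 48    # prints "medium": 6 cores exceed the default limit of 4
```

Note that the larger job QoS have shorter maximum wall times (3 days for <tt>default</tt>, 2 for <tt>medium</tt>, 1 for <tt>high</tt>), so the smallest QoS that fits is usually also the one that allows the longest run.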

If you want to see all job QoS, including those that you do not have access to, you can use the <code>show_qos --all</code> command.

<pre>
$ show_qos --all
Name MaxWall MaxTRES MaxJobsPU MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
cml-cpu 7-00:00:00 8
cml-default 7-00:00:00 cpu=4,gres/gpu=1,mem=32G 2
cml-high 1-12:00:00 cpu=16,gres/gpu=4,mem=128G 2
cml-high_long 14-00:00:00 cpu=32,gres/gpu=8 8 gres/gpu=8
cml-medium 3-00:00:00 cpu=8,gres/gpu=2,mem=64G 2
cml-scavenger 3-00:00:00 gres/gpu=24
cml-very_high 1-12:00:00 cpu=32,gres/gpu=8,mem=256G 8 gres/gpu=12
default 3-00:00:00 cpu=4,gres/gpu=1,mem=32G
gamma-huge-long 10-00:00:00 cpu=32,gres/gpu=16,mem=256G
high 1-00:00:00 cpu=16,gres/gpu=4,mem=128G
highmem 21-00:00:00 cpu=128,mem=2T
huge-long 10-00:00:00 cpu=32,gres/gpu=8,mem=256G
interactive 12:00:00 cpu=4,mem=128G
medium 2-00:00:00 cpu=8,gres/gpu=2,mem=64G
oasis-exempt 10-00:00:00 cpu=160,mem=28114M
scavenger 3-00:00:00
scavenger-aarch64 3-00:00:00
vulcan-cpu 2-00:00:00 cpu=1024,mem=4T 4
vulcan-default 7-00:00:00 cpu=4,gres/gpu=1,mem=32G 2
vulcan-exempt 7-00:00:00 cpu=32,gres/gpu=8,mem=256G 2
vulcan-high 1-12:00:00 cpu=16,gres/gpu=4,mem=128G 2
vulcan-high_long 14-00:00:00 cpu=32,gres/gpu=8 8 gres/gpu=8
vulcan-medium 3-00:00:00 cpu=8,gres/gpu=2,mem=64G 2
vulcan-sailon 3-00:00:00 cpu=32,gres/gpu=8,mem=256G gres/gpu=48
vulcan-scavenger 3-00:00:00 cpu=32,gres/gpu=8,mem=256G
vulcan-scavenger-mu+ 3-00:00:00 cpu=288,gres/gpu=72,mem=1152G
</pre>

You are able to submit to any partition that is listed in the <code>show_partitions</code> command. If you need to use an account other than the default account <tt>nexus</tt>, you will need to specify it via the <code>--account</code> submission argument.

=== Partition QoS ===
Partition QoS are used to limit resources used by all jobs running in a partition, either per user (MaxTRESPU) or per group (GrpTRES).

To view partition QoS, use the <code>show_partition_qos</code> command.

<pre>
$ show_partition_qos
Name MaxSubmitPU MaxTRESPU GrpTRES
------------------------- ----------- -------------------------------- --------------------
scavenger-aarch64_part 500
scavenger_part 500 cpu=11%,gres/gpu=11%,mem=11%
tron 500 cpu=32,gres/gpu=4,mem=262144M
</pre>

The scavenger_part partition QoS has relative TRES limits based on the current hardware in a given partition, represented with percentages. To see the current actual TRES limits of this partition QoS, you can use the <code>-r/--real</code> argument.

<pre>
$ show_partition_qos -r
Name MaxSubmitPU MaxTRESPU GrpTRES
------------------------- ----------- -------------------------------- --------------------
scavenger-aarch64_part 500
scavenger_part 500 cpu=888,gres/gpu=140,mem=12574G
tron 500 cpu=32,gres/gpu=4,mem=262144M
</pre>

If you want to see all partition QoS, including those that you do not have access to, you can use the <code>show_partition_qos --all</code> command.

<pre>
$ show_partition_qos --all
Name MaxSubmitPU MaxTRESPU GrpTRES
------------------------- ----------- -------------------------------- --------------------
cbcb 500 cpu=1406,mem=50359G
cbcb-heng 500
cbcb-interactive 500
class 500 cpu=32,gres/gpu=4,mem=262144M
clip 500 cpu=726,mem=6939G
cml 500 cpu=1226,mem=12116G
cml-cpu 500
cml-director 500
cml-furongh 500
cml-scavenger 500 gres/gpu=24
cml-sfeizi 500
cml-wriva 500
cml-wriva-high 500
csd-h200 500
gamma 500 cpu=906,mem=7675G
mbrc 500 cpu=370,mem=3571G
mc2 500 cpu=330,mem=3201G
oasis 500
quics 500 cpu=458,mem=4710G
scavenger-aarch64_part 500
scavenger_part 500 cpu=11%,gres/gpu=11%,mem=11%
tron 500 cpu=32,gres/gpu=4,mem=262144M
vulcan 500 cpu=1402,mem=12936G
vulcan-ampere 500
vulcan-cpu 500
vulcan-ramani 500
vulcan-scavenger 500
vulcan-scavenger-multi 500
</pre>

'''NOTE''': These QoS cannot be used directly when submitting jobs. Partition QoS limits apply to all jobs running on a given partition, regardless of what job QoS is used.

For example, in the default non-preemption partition (<tt>tron</tt>), you are restricted to 32 total CPU cores, 4 total GPUs, and 256GB total RAM at once across all jobs you have running in the partition.
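
As a worked example of how these per-user limits interact with job sizes, the sketch below estimates how many identically sized jobs can run at once in <tt>tron</tt>. The function name <code>max_concurrent</code> is hypothetical, and the limits are hard-coded from the figures quoted above.

```shell
#!/bin/bash
# Per-user limits of the tron partition QoS, taken from the text above.
TRON_CPUS=32
TRON_GPUS=4
TRON_MEM_GB=256

# Hypothetical helper: number of identical jobs that fit under the limits.
# Arguments: CPU cores, GPUs, and memory in GB requested by each job.
max_concurrent() {
    local cpus="$1" gpus="$2" mem_gb="$3"
    local n=$(( TRON_CPUS / cpus ))
    local by_mem=$(( TRON_MEM_GB / mem_gb ))
    if [ "$by_mem" -lt "$n" ]; then n="$by_mem"; fi
    # GPUs only constrain the count when the job requests any.
    if [ "$gpus" -gt 0 ]; then
        local by_gpu=$(( TRON_GPUS / gpus ))
        if [ "$by_gpu" -lt "$n" ]; then n="$by_gpu"; fi
    fi
    echo "$n"
}

echo "4 CPUs, 1 GPU, 32GB each:    $(max_concurrent 4 1 32) jobs"
echo "8 CPUs, 2 GPUs, 64GB each:   $(max_concurrent 8 2 64) jobs"
echo "16 CPUs, 4 GPUs, 128GB each: $(max_concurrent 16 4 128) jobs"
```

With these numbers the GPU limit is the binding constraint for GPU jobs: four jobs at the <tt>default</tt> maximum, two at the <tt>medium</tt> maximum, or one at the <tt>high</tt> maximum.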

Lab/group-specific partitions may also have their own user limits, and/or may also have group limits on the total number of resources consumed simultaneously by all users that are using their partition, codified by the line in the output above that matches their lab/group name. Note that the values listed above in the two "TRES" columns are not fixed and may fluctuate per-partition as more resources are added to or removed from each partition.

'''All partitions also only allow a maximum of 500 submitted (running (R) or pending (PD)) jobs per user in the partition simultaneously.''' This is to prevent excess pending jobs causing [https://slurm.schedmd.com/sched_config.html#backfill backfill] issues with the SLURM scheduler.
* If you need to submit more than 500 jobs in batch at once, you can develop and run an "outer submission script" that repeatedly attempts to run an "inner submission script" (your original submission script) to submit jobs in the batch periodically, until all job submissions are successful. The outer submission script should use looping logic to check if you are at the max job limit and should then retry submission after waiting for some time interval.
: An example outer submission script is as follows. In this example, <code>example_inner.sh</code> is your inner submission script and is not an [[SLURM/ArrayJobs | array job]], and you want to run 1000 jobs. If your inner submission script is an array job, adjust the number of jobs accordingly. Array jobs must be of size 500 or less.
<pre>
#!/bin/bash
# Total number of jobs to submit.
numjobs=1000
i=0
while [ $i -lt $numjobs ]
do
    # Try to submit; retry for as long as sbatch reports that the
    # per-user submission limit of the partition QoS has been reached.
    while [[ "$(sbatch example_inner.sh 2>&1)" =~ "QOSMaxSubmitJobPerUserLimit" ]]
    do
        echo "Currently at maximum job submissions allowed by the partition's QoS."
        echo "Waiting for 5 minutes before trying to submit more jobs."
        sleep 300
    done
    i=$(( $i + 1 ))
    echo "Submitted job $i of $numjobs"
done
</pre>

It is suggested that you run the outer submission script in a [[Tmux]] session to keep the terminal window executing it from being interrupted.

= Storage =
All network storage available in Nexus is currently [[NFS]] based, and comes in a few different flavors. Compute nodes also have local scratch storage that can be used.

== Home Directories ==
{{Nfshomes}}

== Scratch Directories ==
Scratch data has no data protection: there are no snapshots, and the data is not backed up. There are two types of scratch directories in the Nexus compute infrastructure:
* Network scratch directories
* Local scratch directories

Please note that [[ClassAccounts | class accounts]] do not have network scratch directories.

=== Network Scratch Directories ===
You are allocated 200GB of scratch space via NFS from <code>/fs/nexus-scratch/<USERNAME></code> where <USERNAME> is your UMIACS username. '''It is not backed up or protected in any way.''' This directory is '''[[Automounter | automounted]]'''; you will need to <code>cd</code> into the directory or request/specify a fully qualified file path to access it.

You can view your quota usage by running <code>df -h /fs/nexus-scratch/<USERNAME></code>.

You may request a permanent increase of up to 400GB total space without any faculty approval by [[HelpDesk | contacting staff]]. If you need space beyond 400GB, you will need faculty approval and/or a [[#Project_Allocations | project allocation]]. If you increase your scratch space beyond 400GB, the space beyond 400GB is subject to the 270 TB days limit described in the project allocation section, after which we check back in for renewal. For example, if you request 1.4TB of total space, you are 1TB beyond the 400GB permanent maximum and may keep this space for 270 days. The amount beyond 400GB also counts against your faculty member's 20TB total storage limit described below.
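
The arithmetic in this paragraph can be sketched as follows. The function names are hypothetical, 1TB is assumed to equal 1000GB, and any cap on the length of an increase is not modeled here, so treat the output as an estimate rather than policy.

```shell
#!/bin/bash
# Assumption: 1TB = 1000GB.
PERMANENT_MAX_GB=400
TB_DAYS_IN_GB_DAYS=270000   # 270 TB days expressed in GB days

# Hypothetical helper: GB beyond the 400GB permanent maximum (0 if none).
scratch_overage_gb() {
    local total_gb="$1"
    if [ "$total_gb" -gt "$PERMANENT_MAX_GB" ]; then
        echo $(( total_gb - PERMANENT_MAX_GB ))
    else
        echo 0
    fi
}

# Hypothetical helper: days the overage can be held before renewal,
# or "permanent" when the request is within the permanent maximum.
scratch_days() {
    local overage
    overage="$(scratch_overage_gb "$1")"
    if [ "$overage" -eq 0 ]; then
        echo permanent
    else
        echo $(( TB_DAYS_IN_GB_DAYS / overage ))
    fi
}

echo "1400GB total: $(scratch_overage_gb 1400)GB over, $(scratch_days 1400) days"
```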

This file system is available on all submission, data management, and computational nodes within the cluster.

=== Local Scratch Directories ===
Each computational node that you can schedule compute jobs on also has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. and '''are not backed up or protected in any way.''' These directories are almost always more performant than any other storage available to the job as they are mounted from disks directly attached to the compute node. However, you must stage your data within the confines of your job and extract the relevant resultant data elsewhere before the end of your job.

These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month during our [[MonthlyMaintenanceWindow | monthly maintenance windows]]. Please make sure you secure any resultant data you wish to keep from these directories at the end of your job.
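
The stage-in, compute, stage-out pattern described above might be sketched as follows. The function name <code>stage_and_run</code>, the byte-counting placeholder for the real computation, and the file names in the usage comment are all hypothetical; only the scratch root (e.g., <code>/scratch0</code>) comes from this page.

```shell
#!/bin/bash
# Hypothetical helper: stage input onto local scratch, do the work there,
# and copy results elsewhere before removing the scratch copy.
# Arguments: scratch root (e.g. /scratch0), input file, results directory
stage_and_run() {
    local scratch_root="$1" input="$2" results_dir="$3"
    local workdir
    # Create a private working directory under the scratch root.
    workdir="$(mktemp -d "${scratch_root}/job.XXXXXX")" || return 1
    # Stage in: copy the input from network storage to local disk.
    cp "$input" "$workdir/input" || return 1
    # Placeholder for the real computation: count the bytes in the input.
    wc -c < "$workdir/input" | tr -d ' ' > "$workdir/result.txt"
    # Stage out: copy results off the node before the job ends.
    mkdir -p "$results_dir" || return 1
    cp "$workdir/result.txt" "$results_dir/" || return 1
    # Clean up the scratch copy.
    rm -rf "$workdir"
}

# Example call inside a job (paths are illustrative):
# stage_and_run /scratch0 "/fs/nexus-scratch/$USER/data.bin" "/fs/nexus-scratch/$USER/results"
```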

== Faculty Allocations ==
Each faculty member can be allocated 1TB of permanent lab space upon request. We can also support grouping these individual allocations together into larger center, lab, or research group allocations if desired by the faculty. Please [[HelpDesk | contact staff]] to inquire.

Lab space storage is fully protected. It has [[Snapshots | snapshots]] enabled and is [[NightlyBackups | backed up nightly]].

== Project Allocations ==
Project allocations are available per user for 270 TB days; for example, you can have a 1TB allocation for up to 270 days, a 3TB allocation for up to 90 days, etc.

A single faculty member cannot have more than 20TB of project allocations active simultaneously across all of their sponsored accounts. Network scratch increases beyond the 400GB permanent maximum also count against this limit (e.g., a 1TB network scratch allocation would have 600GB counted towards this limit).

Project storage is fully protected. It has [[Snapshots | snapshots]] enabled and is [[NightlyBackups | backed up nightly]].

The maximum allocation length you can request is 540 days (500GB space) and the maximum storage space you can request is 9TB (30 day length).
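
The trade-off between size and length can be computed directly from the 270 TB days budget. The sketch below is one way to do so; the function name <code>allocation_days</code> is hypothetical, 1TB is assumed to equal 1000GB (which is consistent with 500GB pairing with 540 days), and results are rounded down to whole days.

```shell
#!/bin/bash
# Hypothetical helper: maximum length in days for a project allocation of
# the given size in GB, assuming 1TB = 1000GB.
allocation_days() {
    local size_gb="$1"
    # Sizes above the 9TB maximum cannot be requested.
    if [ "$size_gb" -le 0 ] || [ "$size_gb" -gt 9000 ]; then
        echo "size must be between 1GB and 9000GB (9TB)" >&2
        return 1
    fi
    # 270 TB days = 270000 GB days; integer division rounds down.
    local days=$(( 270000 / size_gb ))
    # The length is capped at 540 days regardless of how small the size is.
    if [ "$days" -gt 540 ]; then
        days=540
    fi
    echo "$days"
}

for size in 500 1000 3000 9000; do
    echo "${size}GB -> $(allocation_days "$size") days"
done
```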

To request an allocation, please [[HelpDesk | contact staff]] and include the faculty member(s) sponsoring the project in the conversation. Please include the following details:
* Project Name (short)
* Description
* Size (1TB, 2TB, etc.)
* Length in days (270 days, 135 days, etc.)
* Other user(s) that need to access the allocation, if any

These allocations are available via <code>/fs/nexus-projects/<project name></code>. '''Renewal is not guaranteed to be available due to limits on the amount of total storage.''' Near the end of the allocation period, staff will contact you and ask if you are still in need of the storage allocation. If renewal is available, you can renew for up to another 270 TB days with reapproval from the original faculty approver.
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then remove the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but the original faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and the faculty approver responding, staff will remove the allocation.

== Datasets ==
We have read-only dataset storage available at <code>/fs/nexus-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of Nexus datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=Nexus here].

SLURM/JobSubmission

2026-07-17T16:07:43Z

Mbaney:

=Job Submission=
SLURM offers a variety of ways to run jobs. It is important to understand the different options available and how to request the resources required for a job in order for it to run successfully. All job submission should be done from submit nodes; any computational code should be run in a job allocation on compute nodes. The following commands outline how to allocate resources on the compute nodes and submit processes to be run on the allocated nodes.

The cluster that everyone with a [[Accounts#UMIACS_Account | UMIACS account]] has access to is [[Nexus]]. Please visit the Nexus page for instructions on how to connect to your assigned submit nodes.

'''Computationally intensive processes run on submission nodes will be terminated. Please submit jobs to be scheduled on compute nodes for this purpose.'''

For details on how SLURM decides how to schedule jobs when multiple jobs are waiting in a scheduler's queue, please see [[SLURM/Priority]].

==srun==
The <code>srun</code> command is used to run a process on the compute nodes in the cluster. If you pass it a normal shell command (or command that executes a script), it will submit a job to run that shell command/script on a compute node and then return. <code>srun</code> accepts many command line options to specify the resources required by the command passed to it. Some common command line arguments are listed below and full documentation of all available options is available in the man page for <code>srun</code>, which can be accessed by running <code>man srun</code>.

<pre>
$ srun --qos=default --mem=100mb --time=1:00:00 bash -c 'echo "Hello World from" `hostname`'
Hello World from tron33.umiacs.umd.edu
</pre>

It is important to understand that <code>srun</code> is an interactive command. By default, input to <code>srun</code> is broadcast to all compute nodes running your process, and output from the compute nodes is redirected to <code>srun</code>. This behavior can be changed; however, '''srun will always wait for the command passed to it to finish before exiting, so if you start a long-running process and end your terminal session, your process will stop running on the compute nodes and your job will end'''. To run a non-interactive submission that will remain running after you log out, you will need to wrap your <code>srun</code> commands in a batch script and submit it with [[#sbatch | sbatch]].

===Common srun Arguments===
* <code>--job-name=<JOBNAME></code> ''Requests your job be named <JOBNAME>''
* <code>--mem=1g</code> ''Requests 1GB of memory for your job, if no unit is given MB is assumed''
* <code>--ntasks=2</code> ''Requests 2 "tasks" which map to cores on a CPU for your job; if passed to srun, runs the given command concurrently on each core''
* <code>--cpus-per-task=2</code> ''Requests 2 CPU cores be allocated per task for your job''
* <code>--nodes=2</code> ''Requests 2 nodes be allocated to your job; if passed to srun, runs the given command concurrently on each node''
* <code>--nodelist=<NODENAME></code> ''Requests to run your job on the <NODENAME> node''
* <code>--time=dd-hh:mm:ss</code> ''Requests your job run for dd days, hh hours, mm minutes, and ss seconds''
* <code>--error=<ERRNAME></code> ''Redirects stderr for your job to the <ERRNAME> file''
* <code>--partition=<PARTITIONNAME></code> ''Requests your job run in the <PARTITIONNAME> partition''
* <code>--qos=<QOSNAME></code> ''Requests your job run with the <QOSNAME> QOS; to see the available QOS options on a cluster, run'' <code>show_qos</code>
* <code>--account=<ACCOUNTNAME></code> ''Requests your job run under the <ACCOUNTNAME> Slurm account; different accounts have different available partitions/QOS''
* <code>--output=<OUTNAME></code> ''Redirects stdout for your job to the <OUTNAME> file''
* <code>--requeue</code> ''Requests your job be automatically requeued if it is preempted''
* <code>--exclusive</code> ''Requests your job be the only one running on the node(s) it is assigned to. This requires that your job be allocated all of the resources on the node(s). The scheduler '''does not''' automatically give your job all of the node's/nodes' resources, however, so if you need more than the default, you still need to request these with'' <code>--ntasks</code> ''and'' <code>--mem</code>
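
Since <code>--time</code> expects the <code>dd-hh:mm:ss</code> form, it can be convenient to generate that string from a number of seconds. The sketch below shows one way to do this; the function name <code>slurm_time</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: format a duration in seconds as d-hh:mm:ss,
# the layout accepted by --time.
slurm_time() {
    local total="$1"
    printf '%d-%02d:%02d:%02d\n' \
        $(( total / 86400 )) \
        $(( total % 86400 / 3600 )) \
        $(( total % 3600 / 60 )) \
        $(( total % 60 ))
}

slurm_time 5400      # 0-01:30:00 (90 minutes)
slurm_time 259200    # 3-00:00:00 (3 days)
```

The result can then be passed along, e.g., <code>--time="$(slurm_time 5400)"</code>.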

===Interactive Shell Sessions===
An interactive shell session on a compute node can be useful for debugging or developing code that isn't ready to be run as a batch job. To get an interactive shell on a node, use <code>srun</code> with the <code>--pty</code> argument to invoke a shell:
<pre>
$ srun --pty --qos=default --mem=1g --time=01:00:00 bash
$ hostname
tron33.umiacs.umd.edu
</pre>
'''Please do not leave interactive shells running for long periods of time when you are not working. This blocks resources from being used by everyone else.'''

==salloc==
The <code>salloc</code> command can also be used to request resources be allocated without needing a batch script. Running salloc with a list of resources will allocate the resources you requested, create a job, and drop you into a subshell with the environment variables necessary to run commands in the newly created job allocation. When your time is up or you exit the subshell, your job allocation will be relinquished.

<pre>
$ salloc --qos=default -N 1 --mem=2g --time=01:00:00
salloc: Granted job allocation 159
$ srun /usr/bin/hostname
tron33.umiacs.umd.edu
$ exit
exit
salloc: Relinquishing job allocation 159
</pre>

'''Please note that any commands not invoked with srun will be run locally on the submit node. Please be careful when using salloc.'''

==sbatch==
The <code>sbatch</code> command allows you to write a batch script to be submitted and run non-interactively on the compute nodes. To run a simple Hello World command on the compute nodes, you could write a file, <code>helloWorld.sh</code>, with the following contents:

<pre>
#!/bin/bash

srun bash -c 'echo Hello World from `hostname`'
</pre>

Then you need to submit the script with sbatch and request resources:

<pre>
$ sbatch --qos=default --mem=1g --time=1:00:00 helloWorld.sh
Submitted batch job 121
</pre>

SLURM will return a job number that you can use to check the status of your job with squeue:

<pre>
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
121 tron helloWor username R 0:01 1 tron32
</pre>

====Advanced Batch Scripts====
You can also write a batch script with all of your resources/options defined in the script itself. This is useful for jobs that need to be run tens/hundreds/thousands of times. You can then handle any necessary environment setup and run commands on the resources you requested by invoking commands with srun. The srun commands can also be more complex and be told to only use portions of your entire job allocation; each of these distinct srun commands makes up one "job step". The batch script will be run on the first node allocated as part of your job allocation, and each job step will be run on whatever resources you tell it to use.

In the following example, we have a batch job that will request 2 nodes in the cluster. We then load a specific version of [[Python]] into our environment and submit two job steps, each one using one node. Since srun blocks until the command finishes by default, we use the '&' operator to background the process so that both job steps can run at once; however, this means that we then need to use the wait command to block processing until all background processes have finished.

<pre>
#!/bin/bash

# Lines that begin with #SBATCH specify commands to be used by SLURM for scheduling. These MUST be above any non-#SBATCH lines in the script to take effect.

#SBATCH --job-name=helloWorld # set job name
#SBATCH --output=helloWorld.out.%j # indicates a file to redirect STDOUT to; %j is the jobid. If set, must be set to a file instead of a directory or else submission will fail.
#SBATCH --error=helloWorld.err.%j # indicates a file to redirect STDERR to; %j is the jobid. If set, must be set to a file instead of a directory or else submission will fail.
#SBATCH --time=00:05:00 # how long you would like your job to run; format=dd-hh:mm:ss
#SBATCH --account=nexus # set account, this determines which partitions and QOSes are available for your job
#SBATCH --partition=tron # set partition, this determines what nodes are available for your job
#SBATCH --qos=default # set QOS, this determines how many resources can be requested within your job, and for how long
#SBATCH --nodes=2 # number of nodes to allocate for your job
#SBATCH --ntasks=4 # request 4 cpu cores be reserved for your job total
#SBATCH --ntasks-per-node=2 # request 2 cpu cores be reserved per node
#SBATCH --mem=1g # memory required by job; if unit is not specified MB will be assumed. for multi-node jobs, this argument allocates this much memory *per node*

srun --nodes=1 --mem=512m bash -c "hostname; python3 --version" & # use srun to invoke commands within your job; using an '&'
srun --nodes=1 --mem=512m bash -c "hostname; python3 --version" & # will background the process allowing them to run concurrently
wait # wait for any background processes to complete

# Once the end of the batch script is reached, your job allocation will be revoked.
</pre>

Another useful thing to know is that you can pass additional arguments into your sbatch scripts on the command line and reference them as <code>${1}</code> for the first argument and so on.
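
A minimal sketch of a batch script that reads its command-line arguments is shown below. The script name <code>argDemo.sh</code>, the function name <code>describe_args</code>, and the default values are hypothetical; the <code>#SBATCH</code> values are illustrative.

```shell
#!/bin/bash
#SBATCH --job-name=argDemo
#SBATCH --time=00:05:00
#SBATCH --mem=1g

# Hypothetical helper: report the arguments this script was given,
# falling back to defaults when an argument is omitted.
describe_args() {
    local input="${1:-input.txt}"
    local repeats="${2:-1}"
    echo "input=${input} repeats=${repeats}"
}

describe_args "$@"
```

Submitting with <code>sbatch argDemo.sh data.csv 3</code> would make <code>${1}</code> equal to <code>data.csv</code> and <code>${2}</code> equal to <code>3</code> inside the script.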

====More Examples====
* [[SLURM/ArrayJobs]]

==scancel==
The <code>scancel</code> command can be used to cancel your own job allocations or job steps that are no longer needed. It can be passed individual job IDs or an option to delete all of your jobs or jobs that meet certain criteria.
*<code>scancel 255</code> ''cancel job 255''
*<code>scancel 255.3</code> ''cancel job step 3 of job 255''
*<code>scancel --user=username --partition=tron</code> ''cancel all jobs for username in the tron partition; username must be your UMD directory ID''

=Identifying Resources and Features=
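The <code>sinfo</code> command can show you additional features of nodes in the cluster, but you need to ask it to show some non-default fields using a format string, e.g., <code>sinfo -o "%40N %10c %10m %40f %32G"</code>. Here, <code>%N</code> is the node list, <code>%c</code> the CPUs per node, <code>%m</code> the memory per node in MB, <code>%f</code> the available features, and <code>%G</code> the generic resources such as GPUs; the number in each field sets its column width.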

<pre>
$ sinfo -o "%40N %10c %10m %40f %32G"
NODELIST CPUS MEMORY AVAIL_FEATURES GRES
cbcb[00-21] 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null)
cbcb22,legacy20 24+ 384270+ rhel8,x86_64,Xeon,E5-2680 (null)
cbcb26 128 513243 rhel8,x86_64,Zen,EPYC-7763,Ampere gpu:rtxa5000:7
cbcb27 64 255167 rhel8,x86_64,Zen,EPYC-7513,Ampere gpu:rtxa6000:8
cbcb[28-29] 32 771166 rhel8,x86_64,Zen,EPYC-9124,Ada gpu:rtx6000ada:8
legacy00 48 125940 rhel8,x86_64,Zen,EPYC-7402 (null)
cbcb[23-24],legacy[34-35],twist05 24 255150 rhel8,x86_64,Xeon,E5-2650 (null)
cbcb25 24 255278 rhel8,x86_64,Xeon,E5-2650,Pascal,Turing gpu:rtx2080ti:1,gpu:gtx1080ti:1
legacy[01-11,13-19,22-28,30] 12+ 61804+ rhel8,x86_64,Xeon,E5-2620 (null)
legacy21 8 61746 rhel8,x86_64,Xeon,E5-2623 (null)
legacy31 8 61727 rhel8,x86_64,Xeon,E5-1660 (null)
tron[46-61] 48 255232 rhel8,x86_64,Zen,EPYC-7352,Ampere gpu:rtxa5000:8
tron[06-09,12-15,21] 16 126214+ rhel8,x86_64,Zen,EPYC-7302P,Ampere gpu:rtxa4000:4
tron[10-11,16-20,34] 16 126217 rhel8,x86_64,Zen,EPYC-7313P,Ampere gpu:rtxa4000:4
tron[22-33,35-44] 16 126214+ rhel8,x86_64,Zen,EPYC-7302,Ampere gpu:rtxa4000:4
clip13,cml30,vulcan[29-30,32,45] 32 255218+ rhel8,x86_64,Zen,EPYC-7313,Ampere gpu:rtxa6000:8
cml[17-28],gammagpu05 32 255225+ rhel8,x86_64,Zen,EPYC-7282,Ampere gpu:rtxa4000:8
cml12 32 383038 rhel8,x86_64,Xeon,4216,Turing,Ampere gpu:rtx2080ti:7,gpu:rtxa4000:1
cml31 32 384094 rhel8,x86_64,Zen,EPYC-9124,Ampere,Hopper gpu:a100:1,gpu:h100-nvl:1
cml32 64 512999 rhel8,x86_64,Zen,EPYC-7543,Ampere gpu:a100:4
cml33 64 1029019 rhel8,x86_64,Xeon,6448Y,Hopper gpu:h100-sxm:4
cml[00,15-16] 32 351530+ rhel8,x86_64,Xeon,4216,Turing gpu:rtx2080ti:7
cml01 32 383030 rhel8,x86_64,Xeon,4216,Turing gpu:rtx2080ti:6
cml[02-11,13],tron[62-63,65-66,68-69] 32 351770+ rhel8,x86_64,Xeon,4216,Turing gpu:rtx2080ti:8
clip12,gammagpu[10-17] 16 126203+ rhel8,x86_64,Zen,EPYC-7313,Ampere gpu:rtxa6000:4
gammagpu[18-21] 32 254883 rhel8,x86_64,Xeon,6526Y,Ada gpu:l40s:4
gammagpu00 32 255233 rhel8,x86_64,Zen,EPYC-7302,Ampere gpu:rtxa5000:8
legacygpu06 20 255249 rhel8,x86_64,Xeon,E5-2699,Maxwell gpu:gtxtitanx:4
legacygpu[01-02,07] 20 255249+ rhel8,x86_64,Xeon,E5-2650,Maxwell gpu:gtxtitanx:4
legacygpu05 44 513193 rhel8,x86_64,Xeon,E5-2699,Pascal gpu:gtx1080ti:4
legacygpu00 20 255249 rhel8,x86_64,Xeon,E5-2650,Pascal gpu:titanxp:4
legacygpu[03-04] 16 255268 rhel8,x86_64,Xeon,E5-2630,Maxwell gpu:gtxtitanx:2
mbrc[00-01] 20 189498 rhel8,x86_64,Xeon,4114,Turing gpu:rtx2080ti:8
clip03 20 126243 rhel8,x86_64,Xeon,E5-2630,Pascal,Turing gpu:rtx2080ti:1,gpu:gtx1080ti:2
clip04 32 255233 rhel8,x86_64,Zen,EPYC-7302,Ampere gpu:rtx3090:4
clip[05-06] 24 126216 rhel8,x86_64,Zen,EPYC-7352,Ampere gpu:rtxa6000:2
clip09 32 383043 rhel8,x86_64,Xeon,6130,Pascal,Turing gpu:rtx2080ti:5,gpu:gtx1080ti:3
clip10 44 1029404 rhel8,x86_64,Xeon,E5-2699 (null)
clip11 16 126217 rhel8,x86_64,Zen,EPYC-7313,Ampere gpu:rtxa4000:4
quics00 128 1545009 rhel8,x86_64,Zen,EPYC-9534 (null)
tron00 32 255233 rhel8,x86_64,Zen,EPYC-7302,Ampere gpu:rtxa6000:6
tron[01-03,05] 32 255233 rhel8,x86_64,Zen,EPYC-7302,Ampere gpu:rtxa6000:8
tron04 32 255233 rhel8,x86_64,Zen,EPYC-7302,Ampere gpu:rtxa6000:7
tron[64,67] 32 383028+ rhel8,x86_64,Xeon,4216,Turing,Ampere gpu:rtx2080ti:7,gpu:rtx3070:1
clip00 32 255276 rhel8,x86_64,Xeon,E5-2683,Pascal gpu:titanxpascal:3
clip01 32 255276 rhel8,x86_64,Xeon,E5-2683,Pascal gpu:titanxpascal:1,gpu:titanxp:2
clip02 20 126255 rhel8,x86_64,Xeon,E5-2630,Pascal gpu:gtx1080ti:3
clip07 8 255263 rhel8,x86_64,Xeon,E5-2623,Pascal gpu:gtx1080ti:3
oasis[00-39] 160 28114 rhel9,aarch64,Altra,Altra-80 (null)
vulcan24 16 126216 rhel8,x86_64,Zen,EPYC-7282,Ampere gpu:rtxa6000:4
gammagpu[01-04,06-07,09],vulcan[33-37] 32 255215+ rhel8,x86_64,Zen,EPYC-7313,Ampere gpu:rtxa5000:8
vulcan[38-44] 32 255215 rhel8,x86_64,Zen,EPYC-7313,Ampere gpu:rtxa4000:8
brigid[16-17] 48 512897 rhel8,x86_64,Zen,EPYC-7443 (null)
vulcan23 32 383030 rhel8,x86_64,Xeon,4612,Turing gpu:rtx2080ti:8
vulcan[27-28] 56 770093 rhel8,x86_64,Xeon,8280,Turing gpu:rtx2080ti:10
brigid[18-19] 20 61739 rhel8,x86_64,Xeon,E5-2640 (null)
vulcan00 32 255259 rhel8,x86_64,Xeon,E5-2683,Pascal gpu:p6000:7,gpu:p100:1
vulcan[01-02,04,06-07] 32 255259 rhel8,x86_64,Xeon,E5-2683,Pascal gpu:p6000:8
vulcan[03,05] 32 255259 rhel8,x86_64,Xeon,E5-2683,Pascal gpu:p6000:7
clip08,vulcan[08-16,18,20-22,25] 32 255258+ rhel8,x86_64,Xeon,E5-2683,Pascal gpu:gtx1080ti:8
vulcan[17,19] 32 255259 rhel8,x86_64,Xeon,E5-2683,Pascal gpu:gtx1080ti:7
vulcan26 24 770126 rhel8,x86_64,Xeon,6146,Pascal gpu:titanxp:10
</pre>

Note that not all of the nodes shown by this command are necessarily in a partition you are able to submit to.

You can identify further specific information about a node using [[SLURM/ClusterStatus#scontrol | scontrol]] with various flags.

There are also two command aliases developed by UMIACS staff to show various node information in aggregate. They are <code>show_nodes</code> and <code>show_available_nodes</code>.

==show_nodes==
The <code>show_nodes</code> command alias shows each node's name, number of CPUs, memory, features (OS, CPU architecture, CPU type, and, if the node has GPUs, GPU architecture; shown as AVAIL_FEATURES), GRES (GPUs), and state. It essentially wraps the <tt>sinfo</tt> command with some pre-determined output format options and shows each node on its own line, in alphabetical order.

To only view nodes in a specific partition, append <code>-p <partition name></code> to the command alias.

===Examples===
<pre>
$ show_nodes
NODELIST CPUS MEMORY AVAIL_FEATURES GRES STATE
brigid16 48 512897 rhel8,x86_64,Zen,EPYC-7443 (null) idle
brigid17 48 512897 rhel8,x86_64,Zen,EPYC-7443 (null) idle
... ... ... ... ... ...
vulcan45 32 513250 rhel8,x86_64,Zen,EPYC-7313,Ampere gpu:rtxa6000:8 idle
</pre>

(specific partition)
<pre>
$ show_nodes -p tron
NODELIST CPUS MEMORY AVAIL_FEATURES GRES STATE
tron00 32 255233 rhel8,x86_64,Zen,EPYC-7302,Ampere gpu:rtxa6000:8 idle
tron01 32 255233 rhel8,x86_64,Zen,EPYC-7302,Ampere gpu:rtxa6000:8 idle
... ... ... ... ... ...
tron69 32 383030 rhel8,x86_64,Xeon,4216,Turing gpu:rtx2080ti:8 idle
</pre>

==show_available_nodes==
The <code>show_available_nodes</code> command alias takes zero or more arguments that correspond to Slurm constructs, resources, or features that you are looking to request a job with and tells you what nodes could '''theoretically'''[0,1] run a job with these arguments immediately. It assumes your job is a single-node job.

These arguments are:
* <code>--partition</code>: Only include nodes in the specified partition(s).
* <code>--account</code>: Only include nodes from partitions that can use the specified account(s).
* <code>--qos</code>: Only include nodes from partitions that can use the specified QoS(es).
* <code>--cpus</code>: Only include nodes with at least this many CPUs free.
* <code>--mem</code>: Only include nodes with at least this much memory free. The default unit is MB if unspecified, but any of {K,M,G,T} can be suffixed to the number provided (will then be interpreted as KB, MB, GB, or TB, respectively).
* GRES-related arguments:
** <code>--gres</code>, <code>--and-gres</code>: Only include nodes whose list of GRES contains ''all'' of the specified GRES type/quantity pairings.
** <code>--or-gres</code>: Only include nodes whose list of GRES contains ''any'' of the specified GRES type/quantity pairings. Functionally identical to <tt>--and-gres</tt> if only one GRES type/quantity pairing is specified.
* GPU-related arguments:
** <code>--gpus</code>, <code>--and-gpus</code>: Only include nodes whose list of GPUs (a subset of GRES) contains ''all'' of the specified GPU type/quantity pairings.
** <code>--or-gpus</code>: Only include nodes whose list of GPUs (a subset of GRES) contains ''any'' of the specified GPU type/quantity pairings. Functionally identical to <tt>--and-gpus</tt> if only one GPU type/quantity pairing is specified.
* Feature-related arguments:
** <code>--feature</code>, <code>--and-feature</code>: Only include nodes whose list of features contains ''all'' of the specified feature(s).
** <code>--or-feature</code>: Only include nodes whose list of features contains ''any'' of the specified feature(s). Functionally identical to <tt>--and-feature</tt> if only one feature is specified.

These arguments are also viewable by running <code>show_available_nodes -h</code>.

If your passed argument set does not contain any resource-based arguments (CPUs/RAM/GRES or GPUs), a node is defined as available if it has at least 1 CPU and 1MB of RAM available.

If there are no nodes available that meet your passed argument set, you will receive the message <tt>There are no nodes that have currently free resources that meet this argument set.</tt>
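How <code>show_available_nodes</code> parses <code>--mem</code> internally is not documented here. The following is only a sketch of the unit interpretation described above, assuming binary multiples (1 GB = 1024 MB) and case-insensitive suffixes; the function name <code>mem_to_mb</code> is hypothetical.

```shell
#!/bin/bash

# Convert a --mem style value (e.g. 512, 48g, 1T) into whole megabytes.
# Assumption: binary multiples and case-insensitive suffixes.
mem_to_mb() {
    local value="${1}"
    # Strip one trailing unit letter, if present, to get the number.
    local number="${value%[KkMmGgTt]}"
    # Whatever was stripped is the suffix (empty if no unit was given).
    local suffix="${value#"${number}"}"
    case "${suffix}" in
        K|k)    echo $(( number / 1024 )) ;;        # KB, rounded down to MB
        M|m|"") echo "${number}" ;;                 # MB is the default unit
        G|g)    echo $(( number * 1024 )) ;;        # GB
        T|t)    echo $(( number * 1024 * 1024 )) ;; # TB
    esac
}

mem_to_mb 48g
```

Under these assumptions, <code>--mem=48g</code> would be compared against each node's free memory as 49152 MB.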

===Footnotes===
[0] - As of now, this command alias does not factor in resources occupied by jobs that could be preempted (based on the partition(s) passed to it, if present). This is something that we are working on implementing.

[1] - This command alias also does not factor in jobs with higher priority values requesting more resources, in the same partition(s), blocking execution of a job submitted with the resources / other arguments checked by the command alias. This is due to the infeasibility of calculating a job's priority value before it is actually submitted.

===Examples===
Show all available nodes:
<pre>
$ show_available_nodes
brigid17
cpus=16,mem=414593M
brigid18
cpus=8,mem=24875M
...
</pre>

Show nodes available in the <tt>tron</tt> partition:
<pre>
$ show_available_nodes --partition tron
tron00
cpus=14,mem=50433M,gres=gpu:rtxa6000:1
tron01
cpus=10,mem=17665M,gres=gpu:rtxa6000:2
...
</pre>

Show nodes with one or more RTX A5000 or RTX A6000 GPUs available to the <tt>vulcan</tt> account:
<pre>
$ show_available_nodes --account vulcan --or-gpus rtxa5000:1,rtxa6000:1
vulcan32
cpus=16,mem=193778M,gres=gpu:rtxa6000:4
vulcan33
cpus=15,mem=181499M,gres=gpu:rtxa5000:3
...
</pre>

Show nodes with 4 or more CPUs, 48G or more memory, and one or more RTX A6000 GPUs available in the <tt>scavenger</tt> partition:
<pre>
$ show_available_nodes --partition=scavenger --cpus=4 --mem=48g --or-gpus=rtxa6000:1
cbcb27
cpus=59,mem=218303M,gres=gpu:rtxa6000:6
clip06
cpus=20,mem=93448M,gres=gpu:rtxa6000:1
...
</pre>

Show nodes with [https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] or [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere] architecture GPUs available in the <tt>scavenger</tt> partition:
<pre>
$ show_available_nodes --partition=scavenger --or-feature=Ampere,Turing
cbcb25
cpus=24,mem=255278M,gres=gpu:rtx2080ti:1,gpu:gtx1080ti:1
cbcb26
cpus=127,mem=447707M,gres=gpu:rtxa5000:7
...
</pre>

Show nodes with [https://en.wikipedia.org/wiki/Zen_(microarchitecture) Zen] architecture CPUs and [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere] architecture GPUs available in the <tt>scavenger</tt> partition:
<pre>
$ show_available_nodes --partition=scavenger --and-feature=Zen,Ampere
cbcb26
cpus=127,mem=447707M,gres=gpu:rtxa5000:7
cbcb27
cpus=59,mem=218303M,gres=gpu:rtxa6000:6
...
</pre>

(bogus example) Attempt to show nodes available in the <tt>bogus</tt> partition:
<pre>
$ show_available_nodes --partition=bogus
There are no nodes that have currently free resources that meet this argument set.
</pre>

=Requesting GPUs=
If you need to do processing on a GPU, you will need to request that your job have access to GPUs, just as you need to request processors or CPU cores. In SLURM, GPUs are considered "generic resources", also known as GRES. To request that some number of GPUs be reserved for your job, use the flag <code>--gres=gpu:#</code> (replacing <code>#</code> with the number of GPUs you want). If there are multiple types of GPUs available in the cluster and you need a specific type, add the type to the flag, e.g. <code>--gres=gpu:rtxa5000:#</code>. If you do not request a specific type of GPU, you are likely to be scheduled on an older, lower-spec GPU.

Note that some QoSes may have limits on the number of GPUs you can request per job, so you may need to specify a different QoS to request more GPUs.

<pre>
$ srun --pty --qos=medium --gres=gpu:2 nvidia-smi
...
Wed Mar 6 16:59:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2080 Ti Off | 00000000:3D:00.0 Off | N/A |
| 32% 23C P8 1W / 250W | 0MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 2080 Ti Off | 00000000:40:00.0 Off | N/A |
| 32% 25C P8 1W / 250W | 0MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
</pre>

Please note that your job will only be able to see/access the GPUs you requested. If you only need 1 GPU, please only request 1 GPU. The others on the node (if any) will be left available for other users.

<pre>
$ srun --pty --gres=gpu:rtxa5000:1 nvidia-smi
Thu Aug 25 15:22:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 Off | 00000000:01:00.0 Off | Off |
| 30% 23C P8 20W / 230W | 0MiB / 24256MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
</pre>

As with all other flags, the <code>--gres</code> flag may also be passed to [[#sbatch | sbatch]] and [[#salloc | salloc]] rather than directly to [[#srun | srun]].
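As a sketch of what that could look like in a batch script: the job name, time limit, and the helper function <code>report_gpus</code> below are hypothetical, and the example assumes that Slurm exports <code>CUDA_VISIBLE_DEVICES</code> for jobs that request GPUs, which is typical but depends on the cluster's GRES configuration.

```shell
#!/bin/bash
#SBATCH --job-name=gpu_demo        # hypothetical job name
#SBATCH --qos=medium
#SBATCH --gres=gpu:rtxa5000:2      # two RTX A5000 GPUs
#SBATCH --time=00:10:00

# Print which GPUs the job can see.
# Assumption: Slurm sets CUDA_VISIBLE_DEVICES for GPU jobs.
report_gpus() {
    echo "GPUs visible to this job: ${CUDA_VISIBLE_DEVICES:-none}"
    # List the GPUs if nvidia-smi is installed on this host.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L || echo "nvidia-smi could not list any GPUs"
    fi
}

report_gpus
```

Run outside of a GPU job, the script simply reports <code>none</code>, which makes it easy to confirm that the <code>--gres</code> request took effect once it is submitted with sbatch.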

=MPI example=
To run [https://en.wikipedia.org/wiki/Message_Passing_Interface MPI] jobs, you will need to include the <code>--mpi=pmix</code> flag in your submission arguments.

<pre>
#!/usr/bin/bash
#SBATCH --job-name=mpi_test # Job name
#SBATCH --nodes=4 # Number of nodes
#SBATCH --ntasks=8 # Number of MPI ranks
#SBATCH --ntasks-per-node=2 # Number of MPI ranks per node
#SBATCH --ntasks-per-socket=1 # Number of tasks per processor socket on the node
#SBATCH --time=00:30:00 # Time limit hrs:min:sec

srun --mpi=pmix /nfshomes/username/testing/mpi/a.out
</pre>

Nexus/CML

2026-07-16T13:06:27Z

Mbaney:

The compute nodes from [[CML]]'s previous standalone cluster were folded into [[Nexus]] during the scheduled [[MonthlyMaintenanceWindow | maintenance window]] for August 2023 (Thursday 08/17/2023, 5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

==Usage==
You can [[SSH]] to <code>nexuscml.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexuscml00.umiacs.umd.edu</code>
* <code>nexuscml01.umiacs.umd.edu</code>

CML users (exclusively) can schedule non-interruptible jobs on CML nodes with any non-scavenger job parameters. Please note that the <code>cml-dpart</code> partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on all cml## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a max submission limit of 500 jobs per user simultaneously so as to not overload the cluster. This is codified by the partition QoS named '''cml'''.

Please note that the CML compute nodes are also in the institute-wide <code>scavenger</code> partition in Nexus. CML users still have scavenging priority over these nodes via the <code>cml-scavenger</code> partition, i.e., all <code>cml-*</code> partition jobs (other than <code>cml-scavenger</code>) can preempt both <code>cml-scavenger</code> and <code>scavenger</code> partition jobs, and <code>cml-scavenger</code> partition jobs can preempt <code>scavenger</code> partition jobs.

==Network==
The network infrastructure supporting the CML partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* cml[17-28,30-32,34,37]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* cml[35-36]: Four 100GbE links, two to each switch in the pair (redundancy and increased bandwidth).
# One pair of network switches connected to the above pair of network switches via two 100GbE links, one between the first two switches in each pair and one between the second two switches in each pair for redundancy, and to each other via dual 25GbE links for redundancy.
#* cml[00,02-09]: Two 25GbE links per node, one to each switch in the pair (redundancy).
#* cml[10-16]: Two 10GbE links per node, one to each switch in the pair (redundancy).

The fileserver hosting all CML [[Nexus/CML#Project_Directories | project]], [[Nexus/CML#Scratch_Directories | scratch]], [[Nexus/CML#Datasets | dataset]], and [[Nexus/CML#Models | model]] allocations also connects to the same pair of switches supporting cml[17-28,30-32] via fourteen 25GbE links, seven to each switch in the pair for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

==Partitions==
There are two partitions available to general CML [[SLURM]] users. You must specify a partition when submitting your job.

* '''cml-dpart''' - This is the default partition. Job allocations are guaranteed.
* '''cml-scavenger''' - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other <code>cml-</code> partitions are ready to be scheduled.

There are a few additional partitions available solely to specific faculty members and their sponsored user accounts.

* '''cml-furongh''' - This partition is for exclusive priority access to Dr. Furong Huang's purchased nodes. Job allocations are guaranteed.
* '''cml-ramani''' - This partition is for exclusive priority access to Dr. Ramani Duraiswami's purchased nodes. Job allocations are guaranteed.
* '''cml-sfeizi''' - This partition is for exclusive priority access to Dr. Soheil Feizi's purchased nodes. Job allocations are guaranteed.

There is also one additional partition available to user accounts named by CML's director.

* '''cml-director''' - This partition is for exclusive priority access to designated CML-purchased nodes. Job allocations are guaranteed.

==Accounts==
The Center has a base SLURM account <code>cml</code> which has a modest number of guaranteed billing resources available to all cluster users at any given time. Other faculty that have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed billing resources corresponding to the amount that they invested.

If you do not specify an account when submitting your job, you will receive the '''cml''' account, which only has access to the '''cml-default''' and '''cml-medium''' QoSes (see below section).

If you need access to a different QoS, or if the '''cml''' account is at its billing limit (see below in this section), please use your faculty sponsor's account if they have one available. However, keep in mind that if your faculty sponsor has their own named partition (see previous section), using the faculty-specific account in the '''cml-dpart''' partition may block access to resources in the faculty-specific partition, since the billing limit for the account is charged regardless of what partition is being used.

The current faculty accounts are:
* cml-abhinav
* cml-furongh
* cml-hajiagha
* cml-ramani
* cml-sfeizi
* cml-tokekar
* cml-tomg

<pre>
$ sacctmgr show account format=account%20,description%30,organization%10
Account Descr Org
-------------------- ------------------------------ ----------
... ... ...
cml cml cml
cml-abhinav cml - abhinav shrivastava cml
cml-furongh cml - furong huang cml
cml-hajiagha cml - mohammad hajiaghayi cml
cml-ramani cml - ramani duraiswami cml
cml-scavenger cml - scavenger cml
cml-sfeizi cml - soheil feizi cml
cml-tokekar cml - pratap tokekar cml
cml-tomg cml - tom goldstein cml
... ... ...
</pre>

Faculty can manage the list of users that have access to their Slurm account via our [https://intranet.umiacs.umd.edu/directory/secgroup Directory application] in the Security Groups section. The security group that controls access is named with the faculty member's UMD directory ID prefixed by <code>cml_</code>. It will also list <code>slurm://nexusctl.umiacs.umd.edu</code> as the associated URI.

You can check your account associations by running the '''show_assoc''' command. Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association(s).

<pre>
$ show_assoc
User Account MaxJobs GrpTRES QOS
---------- ---------------- ------- ------------- --------------------------------------------------
... ... ...
tomg cml cml-default,cml-medium
tomg cml-scavenger cml-scavenger
tomg cml-tomg cml-default,cml-high,cml-medium
... ... ...
</pre>

You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command, substituting the account you are interested in. The billing number displayed here is the sum of [[SLURM/Priority#Fair-share | resource weightings]] for all nodes appropriated to that account.

<pre>
$ sacctmgr show assoc account=cml format=user,account,qos,grptres
User Account QOS GrpTRES
---------- ---------- -------------------- -------------
cml billing=6481
... ...
</pre>

==QoS==
CML currently has 5 QoS for the '''cml-dpart''' partition (though <code>cml-high_long</code> and <code>cml-very_high</code> may not be available to all faculty accounts) and 1 QoS for the '''cml-scavenger''' partition. If you do not specify a QoS when submitting your job using the <code>--qos</code> parameter, you will receive the <code>cml-default</code> QoS assuming you are using a CML account.

If your faculty member's Slurm account does not have one or both of the <code>cml-high_long</code> or <code>cml-very_high</code> QoS available to it, we can add it to their account provided they approve. Please [[HelpDesk | contact staff]] if this is desired.

The QoSes differ in their maximum wall time, the total number of jobs you can have running at once, and the maximum number of trackable resources (TRES) per job. In the cml-scavenger QoS, you are additionally restricted by the total number of TRES per user (across multiple jobs).

<pre>
$ show_qos --all | grep cml
Name MaxWall MaxTRES MaxJobsPU MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
...
cml-default 7-00:00:00 cpu=4,gres/gpu=1,mem=32G 2
cml-high 1-12:00:00 cpu=16,gres/gpu=4,mem=128G 2
cml-high_long 14-00:00:00 cpu=32,gres/gpu=8 8 gres/gpu=8
cml-medium 3-00:00:00 cpu=8,gres/gpu=2,mem=64G 2
cml-scavenger 3-00:00:00 gres/gpu=24
cml-very_high 1-12:00:00 cpu=32,gres/gpu=8,mem=256G 8 gres/gpu=12
...
</pre>
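To illustrate how the per-job limits relate to one another, the sketch below picks the smallest QoS whose <code>MaxTRES</code> covers a request. The limits are copied from the output above and may change, so treat the numbers as a snapshot; the function name <code>smallest_cml_qos</code> is hypothetical. It ignores wall time, per-user limits, and <code>cml-high_long</code>, and does not check whether your account can actually use the QoS it prints.

```shell
#!/bin/bash

# Print the smallest cml-* QoS whose per-job limits cover the request.
# Arguments: CPUs, GPUs, memory in GB.
# Assumption: limits are hard-coded from the show_qos output above.
smallest_cml_qos() {
    local cpus="${1}"
    local gpus="${2}"
    local mem_gb="${3}"
    if [ "${cpus}" -le 4 ] && [ "${gpus}" -le 1 ] && [ "${mem_gb}" -le 32 ]; then
        echo "cml-default"
    elif [ "${cpus}" -le 8 ] && [ "${gpus}" -le 2 ] && [ "${mem_gb}" -le 64 ]; then
        echo "cml-medium"
    elif [ "${cpus}" -le 16 ] && [ "${gpus}" -le 4 ] && [ "${mem_gb}" -le 128 ]; then
        echo "cml-high"
    elif [ "${cpus}" -le 32 ] && [ "${gpus}" -le 8 ] && [ "${mem_gb}" -le 256 ]; then
        echo "cml-very_high"
    else
        # The request exceeds every per-job limit considered here.
        echo "none"
    fi
}

smallest_cml_qos 8 2 48
```

A request for 8 CPUs, 2 GPUs, and 48 GB of memory exceeds <code>cml-default</code> but fits within <code>cml-medium</code>, so that is the QoS you would pass to <code>--qos</code>.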

<pre>
$ show_partition_qos --all | grep cml
Name MaxSubmitPU MaxTRESPU GrpTRES
-------------------- ----------- ------------------------------ --------------------
...
cml 500 cpu=1128,mem=11T
cml-director 500
cml-furongh 500
cml-scavenger 500 gres/gpu=24
cml-sfeizi 500
cml-wriva 500
...
</pre>

==Storage==
There are 3 types of user storage available to users in the CML:
* Home directories
* Project directories
* Scratch directories

There are also 2 types of read-only storage available for common use among users in the CML:
* Dataset directories
* Model directories

CML users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

===Home Directories===
{{Nfshomes}}

===Project Directories===
You can request project based allocations for up to 6TB for up to 120 days with one or more approvals:
* Allocations up to and including 3TB require approval from a CML faculty member
* Allocations above 3TB (up to 6TB) require approval from both a CML faculty member and the [https://ml.umd.edu/#team director of CML]

To request an allocation, please [[HelpDesk | contact staff]] and include the faculty member(s) that the project is under in the conversation. Please include the following details:
* Project Name (short)
* Description
* Size (1TB, 2TB, etc.)
* Length in days (30 days, 90 days, etc.)
* Other user(s) that need to access the allocation, if any

These allocations will be available from '''/fs/cml-projects''' under a name that you provide when you request the allocation.

This data is backed up nightly.

====Renewal or Retirement====
Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 120 days (requires re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML).
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then retire the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but a faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and a faculty approver responding, staff will retire the allocation.

===Scratch Directories===
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the CML compute infrastructure:
* Network scratch directory
* Local scratch directories

====Network Scratch Directory====
You have 200GB of scratch storage available at <code>/cmlscratch/<username></code>. '''It is not backed up or protected in any way.''' This directory is '''automounted''' so you will need to <code>cd</code> into the directory or request/specify a fully qualified file path to access this.

You may request a permanent increase of up to 800GB total space without any faculty approval by [[HelpDesk | contacting staff]]. If you need space beyond 800GB, you will need faculty approval and/or a project directory. Space increases beyond 800GB also have a maximum request period of 120 days (as with project directories), after which they will need to be renewed with re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML.
* As with project directories, allocations over 3TB total space require approval from the [https://ml.umd.edu/#team director of CML] in addition to your faculty member.

This file system is available on all submission and computational nodes within the cluster.

====Local Scratch Directories====
Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. These are almost always more performant than any other storage available to the job. However, you must stage data to these directories within the confines of your jobs and stage the data out before the end of your jobs.

These local scratch directories have a tmpwatch job that will '''delete data that has not been accessed for 90 days'''. It is scheduled via maintenance jobs to run once a month during our monthly maintenance windows. Again, please make sure you secure any data you write to these directories at the end of your job.
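The staging pattern described above can be sketched with two small helper functions. The function names and all paths in the usage comments are hypothetical; adapt them to wherever your input data and results actually live.

```shell
#!/bin/bash

# Copy the contents of a source directory into a working directory
# on local scratch, creating the working directory if needed.
stage_in() {
    local source_dir="${1}"
    local work_dir="${2}"
    mkdir -p "${work_dir}"
    cp -a "${source_dir}/." "${work_dir}/"
}

# Copy results from local scratch to longer-lived storage, then remove
# the scratch copy so that it does not linger on the node.
stage_out() {
    local work_dir="${1}"
    local dest_dir="${2}"
    mkdir -p "${dest_dir}"
    cp -a "${work_dir}/." "${dest_dir}/"
    rm -rf "${work_dir}"
}

# Example use inside a job (hypothetical paths):
#   work_dir="/scratch0/${USER}/${SLURM_JOB_ID}"
#   stage_in "/cmlscratch/${USER}/inputs" "${work_dir}/inputs"
#   ... run your computation, writing to "${work_dir}/outputs" ...
#   stage_out "${work_dir}/outputs" "/cmlscratch/${USER}/results"
```

Keying the working directory on <code>SLURM_JOB_ID</code> keeps concurrent jobs on the same node from overwriting each other's files.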

===Datasets===
We have read-only dataset storage available at <code>/fs/cml-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of CML datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=CML here].

===Models===
We have read-only model storage available at <code>/fs/cml-models</code>. If there are models that you would like to see downloaded and made available, please see [[Datasets | this page]].

SLURM/Priority

2026-07-10T12:57:39Z

Mbaney: /* Age */

[[SLURM]] at UMIACS is configured to prioritize jobs based on a number of factors, termed [https://slurm.schedmd.com/priority_multifactor.html multifactor priority] in SLURM. Each job submitted to the scheduler is assigned a priority value, which can be viewed in the output of <code>scontrol show job <jobid></code>.

Example:
<pre>
$ scontrol show job 1
JobId=1 JobName=bash
UserId=username(13337) GroupId=username(13337) MCS_label=N/A
Priority=2000841 Nice=0 Account=nexus QOS=default
...
</pre>

==Pending Jobs==
If the partition that you submit your job to cannot start your job immediately because no compute node(s) in the partition have the resources free to run it, your job will remain in the Pending state with the listed reason <tt>(Resources)</tt>. If another job is already pending with this reason in the same partition and your job is assigned a lower priority value than that job, your job will instead remain in the Pending state with reason <tt>(Priority)</tt>. If there are multiple jobs pending and your job is not the highest priority job pending, the scheduler will only start your job if doing so would not push back the start time of any higher priority job in the same partition.

Lowering the resources you are requesting, the time limit, or both may allow submitted jobs to start sooner (or immediately) when a partition is under resource pressure. The command <code>squeue -j <jobid> --start</code> can be used to provide a time estimate for when your job will start, where <code><jobid></code> is the job ID you receive from either srun or sbatch. This time is subject to change depending on whether other users' jobs end sooner or more jobs get submitted.

You can use the command alias <code>[[SLURM/JobSubmission#show_available_nodes | show_available_nodes]]</code> with a variety of different submission arguments to get a better idea of what jobs may be able to start sooner, but the output of this command alias is not definitive, for reasons mentioned in the footnotes on the page linked to.

==Priority Factors==
The priority factors in use at UMIACS are, from most-heavily to least-heavily weighted:
* Partition job was submitted to
* Fair-share of resources within SLURM account
* Age of job, i.e., time spent waiting to run in the queue
* Association/SLURM account being used
* "Nice" value that job was submitted with

===Partition===
The partitions whose names are or are prefixed with <code>scavenger</code> on our clusters are always in a lower priority tier and always have lower priority factors for their jobs than all other partitions on that cluster. As mentioned in other UMIACS cluster-specific documentation, jobs submitted to these partitions are also [https://slurm.schedmd.com/preempt.html preemptable]. These two design choices give the partitions their names; jobs submitted to <code>scavenger</code> named or prefixed partitions "scavenge" for available resources on the cluster rather than consume dedicated resources, and are interrupted by jobs asking to consume dedicated resources.

On [[Nexus]], labs/centers may also have their own scavenger partitions, i.e., <code><labname>-scavenger</code>, if the faculty for the lab/center have decided upon some sort of limit on jobs, such as number of simultaneous jobs, number of actively consumed billing resources, etc., in their non-scavenger partitions. These lab/center scavenger partitions allow for more jobs to be run by members of that lab/center on that lab's/center's nodes only, but jobs on these partitions are preemptable by jobs in that lab's/center's non-scavenger partitions and/or account-specific partitions, if any account-specific partitions containing a given node exist. Jobs submitted to lab/center scavenger partitions will preempt jobs submitted to the institute-wide scavenger partitions (running on nodes that are also in those lab/center scavenger partitions).

In decreasing order of priority (highest first), our priority tiers for partitions are:
# Priority access account-specific partitions
# Account-specific partitions
# Lab/center-specific and institute-wide non-"scavenger" named partitions
# Lab/center-specific "scavenger" named partitions
# Institute-wide "scavenger" named partitions

A job in a specific priority tier will never have a higher priority value than any job in a higher priority tier. Corresponding to the above tiers, the priority values that you will see for jobs in each tier are:
# >= 4000000
# 3000000 to 3999999
# 2000000 to 2999999
# 1000000 to 1999999
# < 1000000

As such, '''jobs on specific nodes in some non-"scavenger" named partitions may also be subject to preemption''' based on these priority tiers. Generally speaking, though, most nodes are only in one partition in one of the first three (non-"scavenger") priority tiers, and then in an institute-wide "scavenger" named partition, and a lab/center-specific "scavenger" named partition, if one exists for the lab/center that a given node is a part of.
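
As a minimal sketch, the tier ranges listed above can be turned into a lookup that tells you which tier a job's priority value (as shown by <code>squeue</code> or <code>sprio</code>) falls into. The function name and the shortened tier labels are illustrative and not part of any UMIACS tooling.

```python
# Lower bounds copied from the tier list above, highest tier first.
# The labels are shortened, illustrative names for each tier.
TIERS = [
    (4_000_000, "priority access account-specific"),
    (3_000_000, "account-specific"),
    (2_000_000, "lab/center-specific or institute-wide non-scavenger"),
    (1_000_000, "lab/center-specific scavenger"),
    (0, "institute-wide scavenger"),
]


def priority_tier(priority: int) -> str:
    """Return the tier label for a job's priority value."""
    if priority < 0:
        raise ValueError("priority values are non-negative")
    for lower_bound, label in TIERS:
        if priority >= lower_bound:
            return label
    raise AssertionError("unreachable: the last tier starts at 0")


if __name__ == "__main__":
    for value in (4_120_000, 2_050_311, 999_999):
        print(value, "->", priority_tier(value))
```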

===Fair-share===
The more resources your jobs have already consumed within an account, the lower priority factor your future jobs will have when compared to other users' jobs in the same account who have used fewer resources (so as to "fair-share" with other users). Additionally, if there are multiple accounts that can submit to a partition, and the sum of resources used by all users' jobs within account A is greater than the sum of resources used by all users' jobs within account B, all future jobs from users in account A will have a lower priority factor when compared to all future jobs from users in account B. (In other words, fair-share is hierarchical.)

You can view the various fair-share statistics with the command <code>sshare -l</code>. It will show your specific FairShare values (always between 0.0 and 1.0) within accounts that you have access to. You can also view other accounts' Level Fairshare (LevelFS).
<pre>
Account User RawShares NormShares RawUsage NormUsage EffectvUsage FairShare LevelFS GrpTRESMins TRESRunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
root 0.000000 68444174744 1.000000 cpu=4797787,mem=70530109515,e+
cbcb 1 0.028571 4454658377 0.065046 0.065046 0.439246 cpu=452139,mem=22276633804,en+
class 1 0.028571 255617290 0.003733 0.003733 7.652841 cpu=7021,mem=74554606,energy=+
clip 1 0.028571 3057933838 0.044674 0.044674 0.639549 cpu=33214,mem=2744443460,ener+
cml 1 0.028571 66866114 0.000975 0.000975 29.299389 cpu=1796,mem=29426756,energy=+
gamma 1 0.028571 2609474948 0.038129 0.038129 0.749334 cpu=34089,mem=360373862,energ+
mbrc 1 0.028571 73411964 0.001073 0.001073 26.635560 cpu=1195,mem=4896358,energy=0+
mc2 1 0.028571 2682557 0.000039 0.000039 728.919551 cpu=0,mem=0,energy=0,node=0,b+
nexus 1 0.028571 5472794067 0.079964 0.079964 0.357302 cpu=278464,mem=3250599000,ene+
nexus username 1 0.000835 69666 0.000001 0.000021 0.457407 37.435501 cpu=0,mem=0,energy=0,node=0,b+
oasis 1 0.028571 330030 0.000005 0.000005 5.9248e+03 cpu=0,mem=0,energy=0,node=0,b+
quics 1 0.028571 4 0.000000 0.000000 4.1683e+08 cpu=0,mem=0,energy=0,node=0,b+
scavenger 1 0.028571 40888195964 0.597419 0.597419 0.047825 cpu=3142204,mem=29902903931,e+
scavenger username 1 0.000835 171 0.000000 0.000000 0.033975 9.8885e+04 cpu=0,mem=0,energy=0,node=0,b+
vulcan 1 0.028571 1247236491 0.018224 0.018224 1.567761 cpu=147273,mem=1161243818,ene+
</pre>

The actual resource billing weights for the three main resources (memory per GB, CPU cores, and number of GPUs if applicable) are per-partition and can be viewed in the <code>TRESBillingWeights</code> line in the output of <code>scontrol show partition</code>. The <code>billing</code> value for a job is the sum of all resource weightings for resources the job has requested. This value is then multiplied by the amount of time a job has run in seconds to get the amount it contributes to the RawUsage for the association within the account it is running under.
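
The billing arithmetic described above can be sketched as follows. The weights used here are hypothetical placeholders, not the weights of any real partition; substitute the values from the <code>TRESBillingWeights</code> line of the partition you are interested in.

```python
def job_billing(requested: dict[str, float], weights: dict[str, float]) -> float:
    """Sum of (weight * requested amount) over every requested resource."""
    return sum(weights[name] * amount for name, amount in requested.items())


def raw_usage_contribution(billing: float, runtime_seconds: float) -> float:
    """Billing value multiplied by the job's run time in seconds."""
    return billing * runtime_seconds


if __name__ == "__main__":
    # Hypothetical weights: per GB of memory, per CPU core, per GPU.
    weights = {"mem_gb": 1.0, "cpu": 4.0, "gpu": 32.0}
    # A job asking for 32GB of memory, 4 cores, and 1 GPU.
    requested = {"mem_gb": 32, "cpu": 4, "gpu": 1}

    billing = job_billing(requested, weights)
    print("billing:", billing)
    print("usage after one hour:", raw_usage_contribution(billing, 3600))
```

With these placeholder weights, each of the three resources contributes the same amount to the job's billing value, which is the kind of balance the weighting algorithm described on this page aims for across a GPU partition as a whole.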

====Algorithm====
The algorithm we use for resource weightings differs depending on if there are any GPUs in a partition or not, and is as follows:

=====GPU partitions=====
Each resource (memory/CPU/GPU) is given a weighting value such that their relative billings to each other within the partition are equal (33.33% each). Memory is typically the most abundant resource by unit, so it is given a weighting value of 1.0 per GB and the CPU/GPU values are adjusted accordingly.

Different GPU types may also be weighted differently within the GPU relative billing. A baseline GPU type is first chosen. All GPUs of that type and other types that have lower FP32 performance (in [https://en.wikipedia.org/wiki/FLOPS TFLOPS]) are given a weighting factor of 1.0. GPU types with higher FP32 performance than the baseline GPU are given a weighting factor calculated by dividing their FP32 performance by the baseline GPU's FP32 performance. The weighting values for each GPU type are then determined by normalizing the sum of all GPU cards' billing values, multiplied by their weighting factors, against the relative billing percentage for GPUs (33.33%).

The current baseline GPU is the [https://www.nvidia.com/en-us/design-visualization/rtx-a4000/ NVIDIA RTX A4000].
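
The following is a rough sketch of one way to read the GPU weighting steps above, not the exact procedure staff use. It assumes that "normalizing" means the total GPU billing in a partition should equal the total memory billing (memory at 1.0 per GB). The GPU type names, TFLOPS figures, counts, and memory total are all made-up illustrative values.

```python
def gpu_weighting_factors(fp32_tflops: dict[str, float], baseline: str) -> dict[str, float]:
    """Factor 1.0 for the baseline type and anything slower; ratio to baseline otherwise."""
    base = fp32_tflops[baseline]
    return {gpu: max(1.0, tflops / base) for gpu, tflops in fp32_tflops.items()}


def gpu_billing_weights(
    counts: dict[str, int],
    factors: dict[str, float],
    total_mem_gb: float,
) -> dict[str, float]:
    """Per-GPU weights chosen so total GPU billing equals total memory billing.

    Assumption: memory is weighted 1.0 per GB, so its total billing is total_mem_gb.
    """
    weighted_cards = sum(counts[gpu] * factors[gpu] for gpu in counts)
    unit = total_mem_gb / weighted_cards
    return {gpu: factors[gpu] * unit for gpu in counts}


if __name__ == "__main__":
    # Hypothetical GPU types and FP32 figures, for illustration only.
    tflops = {"gpu_slow": 10.0, "gpu_base": 20.0, "gpu_fast": 40.0}
    counts = {"gpu_slow": 2, "gpu_base": 4, "gpu_fast": 2}

    factors = gpu_weighting_factors(tflops, baseline="gpu_base")
    weights = gpu_billing_weights(counts, factors, total_mem_gb=1000.0)
    print(factors)
    print(weights)
```

In this sketch the slower type is billed the same as the baseline type, and the faster type is billed in proportion to its FP32 advantage.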

=====CPU-only partitions=====
Each resource (memory/CPU) is first given a weighting value such that their relative billings to each other within the partition are equal (50% each). Memory is typically the most abundant resource by unit, so it is given a weighting value of 1.0 per GB and the CPU value is adjusted accordingly. The final CPU weight value is then divided by 10, which translates to roughly 90.9% of the billing weight being for memory and 9.1% being for CPU. The CPU value is divided so that running jobs in CPU-only partitions does not affect accounts' fair-share priority factors as much, given the popularity of GPGPU computing.
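
The 90.9% / 9.1% split can be checked with a short calculation. This is a sketch under the assumption that "equal relative billing" means total memory billing equals total CPU billing across the partition before the division by 10; the partition size used here is a made-up example.

```python
def cpu_only_weights(
    total_mem_gb: float,
    total_cpu_cores: int,
    cpu_divisor: float = 10.0,
) -> tuple[float, float]:
    """Return (memory weight per GB, CPU weight per core) for a CPU-only partition."""
    mem_weight = 1.0
    # CPU weight at which total CPU billing would equal total memory billing.
    balanced_cpu_weight = (total_mem_gb * mem_weight) / total_cpu_cores
    return mem_weight, balanced_cpu_weight / cpu_divisor


def billing_shares(
    total_mem_gb: float,
    total_cpu_cores: int,
    mem_weight: float,
    cpu_weight: float,
) -> tuple[float, float]:
    """Return the (memory, CPU) fractions of the partition's total billing."""
    mem_total = total_mem_gb * mem_weight
    cpu_total = total_cpu_cores * cpu_weight
    total = mem_total + cpu_total
    return mem_total / total, cpu_total / total


if __name__ == "__main__":
    # Hypothetical partition: 1024GB of memory and 128 cores in total.
    mem_w, cpu_w = cpu_only_weights(1024, 128)
    mem_share, cpu_share = billing_shares(1024, 128, mem_w, cpu_w)
    print(f"CPU weight per core: {cpu_w}")
    print(f"memory share: {mem_share:.1%}, CPU share: {cpu_share:.1%}")
```

The shares come out to 10/11 and 1/11 regardless of the partition's size, because the division by 10 always leaves the CPU total at one tenth of the memory total.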

===Age===
The longer a job is eligible to run but cannot due to resources being unavailable or it having a lower priority value than one or more other jobs, the higher the job's priority becomes as it continues to wait in the queue. This is the only priority factor that can change a job's priority value after submission, and the priority modifier for this factor reaches its limit after 7 days.

Jobs' age priority factors on our clusters are recalculated every 5 minutes.
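
As an illustration, Slurm's multifactor plugin treats the age factor as the fraction of the maximum age that a job has been eligible for, capped at 1.0. The sketch below uses the 7 day limit mentioned above; the <code>weight_age</code> value is a hypothetical placeholder, since the actual weight configured on our clusters is not listed on this page.

```python
MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # the 7 day limit described above


def age_factor(eligible_seconds: float) -> float:
    """Fraction of the maximum age reached, capped at 1.0."""
    if eligible_seconds < 0:
        raise ValueError("eligible time cannot be negative")
    return min(eligible_seconds / MAX_AGE_SECONDS, 1.0)


def age_priority(eligible_seconds: float, weight_age: int) -> int:
    """Priority points contributed by age, for a given (hypothetical) weight."""
    return int(weight_age * age_factor(eligible_seconds))


if __name__ == "__main__":
    hypothetical_weight = 1000
    for days in (0, 1, 3.5, 7, 14):
        seconds = days * 24 * 60 * 60
        print(days, "days ->", age_priority(seconds, hypothetical_weight))
```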

===Association===
Some lab/center-specific SLURM accounts have priority values directly attached to them. Jobs run under these accounts gain this many extra points of priority.

===Nice value===
This is a submission argument that you as the user can include when submitting your jobs to deprioritize them. Larger values will deprioritize jobs more, e.g.,
<pre>srun --pty --nice=2 bash</pre>
will have lower priority than
<pre>srun --pty --nice=1 bash</pre>
which will have lower priority than
<pre>srun --pty bash</pre>
assuming all three jobs were submitted at the same time. You cannot use negative values for this argument.

Because this value is absolute, if you want to use it, we would recommend using small numbers - one or two digits - only. Larger numbers may impact your job's ability to run at all as a result of the other factors at play.

Nexus/Tron

2026-07-08T17:22:08Z

Mbaney:

The Tron partition is a subset of resources available in the [[Nexus]]. It was purchased using college-level funding for UMIACS and CSD faculty.

= Compute Nodes =
The partition contains 70 compute nodes with specs as detailed below.

{| class="wikitable sortable"
! Nodenames
! Type
! Quantity
! CPU cores per node
! Memory per node
! GPUs per node
|-
|tron[00-05]
|A6000 GPU Node
|6
|32
|256GB
|8
|-
|tron[06-45]
|A4000 GPU Node
|40
|16
|128GB
|4
|-
|tron[46-61]
|A5000 GPU Node
|16
|48
|256GB
|8
|-
|tron[62-69]
|RTX 2080 Ti GPU Node
|8
|32
|384GB
|8
|- class="sortbottom"
|tron[00-69]
!Total
|70
|1856
|13410GB
|400
|}

= Network =
The network infrastructure supporting the Tron partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* tron[00-05]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* tron[06-45]: Two 50GbE links per node, one to each switch in the pair (redundancy).
#* tron[46-61]: One 100GbE link per node. Half of the overall links for this set of nodes go to one switch in the pair, and the other half go to the other switch in the pair. These nodes do not have redundant links because the switches are currently at port capacity.
# One switch connected to the above pair of network switches via two 100GbE links, one to each switch in the pair for redundancy, serving the following compute nodes:
#* tron[62-69]: Two 10GbE links to the switch per node (increased bandwidth).

The fileserver hosting all Nexus [[Nexus#Scratch_Directories | scratch]], [[Nexus#Faculty_Allocations | faculty]], [[Nexus#Project_Allocations | project]], and [[Nexus#Datasets | dataset]] allocations first connects to a pair of intermediary switches before reaching the compute nodes. The last hop from the pair of intermediary switches to the first pair of switches mentioned on this page (same that nodes tron[00-61] are on) is via four 100GbE links, one for each combination of switches across each pairing, for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

Nexus/CML

2026-07-08T17:16:55Z

Mbaney: /* Partitions */

The compute nodes from [[CML]]'s previous standalone cluster have folded into [[Nexus]] as of the scheduled [[MonthlyMaintenanceWindow | maintenance window]] for August 2023 (Thursday 08/17/2023, 5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

==Usage==
You can [[SSH]] to <code>nexuscml.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexuscml00.umiacs.umd.edu</code>
* <code>nexuscml01.umiacs.umd.edu</code>

CML users (exclusively) can schedule non-interruptible jobs on CML nodes with any non-scavenger job parameters. Please note that the <code>cml-dpart</code> partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on all cml## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a max submission limit of 500 jobs per user simultaneously so as to not overload the cluster. This is codified by the partition QoS named '''cml'''.

Please note that the CML compute nodes are also in the institute-wide <code>scavenger</code> partition in Nexus. CML users still have scavenging priority over these nodes via the <code>cml-scavenger</code> partition, i.e., all <code>cml-*</code> partition jobs (other than <code>cml-scavenger</code>) can preempt both <code>cml-scavenger</code> and <code>scavenger</code> partition jobs, and <code>cml-scavenger</code> partition jobs can preempt <code>scavenger</code> partition jobs.
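
The preemption relationships in the paragraph above can be sketched as a small ranking function. It only models the three groups named there (<code>scavenger</code>, <code>cml-scavenger</code>, and every other <code>cml-*</code> partition) and does not attempt to model preemption between the non-scavenger <code>cml-*</code> partitions themselves. The function names are illustrative.

```python
def _rank(partition: str) -> int:
    """Higher rank can preempt lower rank; equal ranks cannot preempt each other here."""
    if partition == "scavenger":
        return 0
    if partition == "cml-scavenger":
        return 1
    if partition.startswith("cml-"):
        return 2
    raise ValueError(f"partition not covered by this sketch: {partition}")


def can_preempt(preemptor: str, victim: str) -> bool:
    """True if a job in `preemptor` can preempt a job in `victim` on CML nodes."""
    return _rank(preemptor) > _rank(victim)


if __name__ == "__main__":
    print(can_preempt("cml-dpart", "cml-scavenger"))
    print(can_preempt("cml-scavenger", "scavenger"))
    print(can_preempt("scavenger", "cml-scavenger"))
```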

==Network==
The network infrastructure supporting the CML partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* cml[17-28,30-32,34,37]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* cml[35-36]: Four 100GbE links, two to each switch in the pair (redundancy and increased bandwidth).
# One pair of network switches connected to the above pair of network switches via two 100GbE links, one between the first two switches in each pair and one between the second two switches in each pair for redundancy, and to each other via dual 25GbE links for redundancy.
#* cml[00,02-09]: Two 25GbE links per node, one to each switch in the pair (redundancy).
#* cml[10-16],cmlcpu[01-04,06-07]: Two 10GbE links per node, one to each switch in the pair (redundancy).

The fileserver hosting all CML [[Nexus/CML#Project_Directories | project]], [[Nexus/CML#Scratch_Directories | scratch]], [[Nexus/CML#Datasets | dataset]], and [[Nexus/CML#Models | model]] allocations also connects to the same pair of switches supporting cml[17-28,30-32] via fourteen 25GbE links, seven to each switch in the pair for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

==Partitions==
There are three partitions available to general CML [[SLURM]] users. You must specify a partition when submitting your job.

* '''cml-dpart''' - This is the default partition. Job allocations are guaranteed.
* '''cml-scavenger''' - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other <code>cml-</code> partitions are ready to be scheduled.
* '''cml-cpu''' - This partition is for CPU focused jobs. Job allocations are guaranteed. '''Please note that this partition is being permanently removed on 07/16/2026 at 9am.'''

There are a few additional partitions available solely to specific faculty members and their sponsored user accounts.

* '''cml-furongh''' - This partition is for exclusive priority access to Dr. Furong Huang's purchased nodes. Job allocations are guaranteed.
* '''cml-ramani''' - This partition is for exclusive priority access to Dr. Ramani Duraiswami's purchased nodes. Job allocations are guaranteed.
* '''cml-sfeizi''' - This partition is for exclusive priority access to Dr. Soheil Feizi's purchased nodes. Job allocations are guaranteed.

There is also one additional partition available to user accounts named by CML's director.

* '''cml-director''' - This partition is for exclusive priority access to designated CML-purchased nodes. Job allocations are guaranteed.

==Accounts==
The Center has a base SLURM account <code>cml</code> which has a modest number of guaranteed billing resources available to all cluster users at any given time. Other faculty that have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed billing resources corresponding to the amount that they invested.

If you do not specify an account when submitting your job, you will receive the '''cml''' account, which only has access to the '''cml-cpu''', '''cml-default''', and '''cml-medium''' QoSes (see below section).

If you need access to a different QoS, or if the '''cml''' account is at its billing limit (see below in this section), please use your faculty sponsor's account if they have one available. However, keep in mind that if your faculty sponsor has their own named partition (see previous section), using the faculty-specific account in the '''cml-dpart''' partition may block access to resources in the faculty-specific partition, since the billing limit for the account is charged regardless of which partition is being used.

The current faculty accounts are:
* cml-abhinav
* cml-furongh
* cml-hajiagha
* cml-ramani
* cml-sfeizi
* cml-tokekar
* cml-tomg

<pre>
$ sacctmgr show account format=account%20,description%30,organization%10
Account Descr Org
-------------------- ------------------------------ ----------
... ... ...
cml cml cml
cml-abhinav cml - abhinav shrivastava cml
cml-furongh cml - furong huang cml
cml-hajiagha cml - mohammad hajiaghayi cml
cml-ramani cml - ramani duraiswami cml
cml-scavenger cml - scavenger cml
cml-sfeizi cml - soheil feizi cml
cml-tokekar cml - pratap tokekar cml
cml-tomg cml - tom goldstein cml
... ... ...
</pre>

Faculty can manage the list of users that have access to their Slurm account via our [https://intranet.umiacs.umd.edu/directory/secgroup Directory application] in the Security Groups section. The security group that controls access has the prefix <code>cml_</code> prepended to their UMD directory ID. It will also list <code>slurm://nexusctl.umiacs.umd.edu</code> as the associated URI.

You can check your account associations by running the '''show_assoc''' command. Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association(s).

<pre>
$ show_assoc
User Account MaxJobs GrpTRES QOS
---------- ---------------- ------- ------------- --------------------------------------------------
... ... ...
tomg cml cml-cpu,cml-default,cml-medium
tomg cml-scavenger cml-scavenger
tomg cml-tomg cml-default,cml-high,cml-medium
... ... ...
</pre>

You can also see the total number of Track-able Resources (TRES) allowed for each account by running the following command. Please make sure you give the appropriate account that you are looking for. The billing number displayed here is the sum of [[SLURM/Priority#Fair-share | resource weightings]] for all nodes appropriated to that account.

<pre>
$ sacctmgr show assoc account=cml format=user,account,qos,grptres
User Account QOS GrpTRES
---------- ---------- -------------------- -------------
cml billing=6481
... ...
</pre>

==QoS==
CML currently has 5 QoS for the '''cml-dpart''' partition (though <code>cml-high_long</code> and <code>cml-very_high</code> may not be available to all faculty accounts), 1 QoS for the '''cml-scavenger''' partition, and 1 QoS for the '''cml-cpu''' partition. If you do not specify a QoS when submitting your job using the <code>--qos</code> parameter, you will receive the <code>cml-default</code> QoS assuming you are using a CML account.

If your faculty member's Slurm account does not have one or both of the <code>cml-high_long</code> or <code>cml-very_high</code> QoS available to it, we can add it to their account provided they approve. Please [[HelpDesk | contact staff]] if this is desired.

Each QoS can have a different maximum wall time, a different total number of jobs that can run at once, and a different maximum number of track-able resources (TRES) per job. The <code>cml-scavenger</code> QoS has one additional constraint: a limit on the total number of TRES per user (across multiple jobs).

<pre>
$ show_qos --all | grep cml
Name MaxWall MaxTRES MaxJobsPU MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
...
cml-cpu 7-00:00:00 8
cml-default 7-00:00:00 cpu=4,gres/gpu=1,mem=32G 2
cml-high 1-12:00:00 cpu=16,gres/gpu=4,mem=128G 2
cml-high_long 14-00:00:00 cpu=32,gres/gpu=8 8 gres/gpu=8
cml-medium 3-00:00:00 cpu=8,gres/gpu=2,mem=64G 2
cml-scavenger 3-00:00:00 gres/gpu=24
cml-very_high 1-12:00:00 cpu=32,gres/gpu=8,mem=256G 8 gres/gpu=12
...
</pre>

<pre>
$ show_partition_qos --all | grep cml
Name MaxSubmitPU MaxTRESPU GrpTRES
-------------------- ----------- ------------------------------ --------------------
...
cml 500 cpu=1128,mem=11T
cml-cpu 500
cml-director 500
cml-furongh 500
cml-scavenger 500 gres/gpu=24
cml-sfeizi 500
cml-wriva 500
...
</pre>

==Storage==
There are 3 types of user storage available to users in the CML:
* Home directories
* Project directories
* Scratch directories

There are also 2 types of read-only storage available for common use among users in the CML:
* Dataset directories
* Model directories

CML users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

===Home Directories===
{{Nfshomes}}

===Project Directories===
You can request project based allocations for up to 6TB for up to 120 days with one or more approvals:
* Allocations up to and including 3TB require approval from a CML faculty member
* Allocations above 3TB (up to 6TB) require approval from both a CML faculty member and the [https://ml.umd.edu/#team director of CML]

To request an allocation, please [[HelpDesk | contact staff]] and include the faculty member(s) that the project is under in the conversation. Please include the following details:
* Project Name (short)
* Description
* Size (1TB, 2TB, etc.)
* Length in days (30 days, 90 days, etc.)
* Other user(s) that need to access the allocation, if any

These allocations will be available from '''/fs/cml-projects''' under a name that you provide when you request the allocation.

This data is backed up nightly.

====Renewal or Retirement====
Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 120 days (requires re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML).
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then retire the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but a faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and a faculty approver responding, staff will retire the allocation.

===Scratch Directories===
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the CML compute infrastructure:
* Network scratch directory
* Local scratch directories

====Network Scratch Directory====
You have 200GB of scratch storage available at <code>/cmlscratch/<username></code>. '''It is not backed up or protected in any way.''' This directory is '''automounted''' so you will need to <code>cd</code> into the directory or request/specify a fully qualified file path to access this.

You may request a permanent increase of up to 800GB total space without any faculty approval by [[HelpDesk | contacting staff]]. If you need space beyond 800GB, you will need faculty approval and/or a project directory. Space increases beyond 800GB also have a maximum request period of 120 days (as with project directories), after which they will need to be renewed with re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML.
* As with project directories, allocations over 3TB total space require approval from the [https://ml.umd.edu/#team director of CML] in addition to your faculty member.

This file system is available on all submission and computational nodes within the cluster.

====Local Scratch Directories====
Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. These are almost always more performant than any other storage available to the job. However, you must stage data to these directories within the confines of your jobs and stage the data out before the end of your jobs.

These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month during our monthly maintenance windows. Again, please make sure you secure any data you write to these directories at the end of your job.

===Datasets===
We have read-only dataset storage available at <code>/fs/cml-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of CML datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=CML here].

===Models===
We have read-only model storage available at <code>/fs/cml-models</code>. If there are models that you would like to see downloaded and made available, please see [[Datasets | this page]].

Nexus/CML

2026-07-08T17:16:02Z

Mbaney: /* Usage */

The compute nodes from [[CML]]'s previous standalone cluster have folded into [[Nexus]] as of the scheduled [[MonthlyMaintenanceWindow | maintenance window]] for August 2023 (Thursday 08/17/2023, 5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

==Usage==
You can [[SSH]] to <code>nexuscml.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexuscml00.umiacs.umd.edu</code>
* <code>nexuscml01.umiacs.umd.edu</code>

CML users (exclusively) can schedule non-interruptible jobs on CML nodes with any non-scavenger job parameters. Please note that the <code>cml-dpart</code> partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on all cml## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a max submission limit of 500 jobs per user simultaneously so as to not overload the cluster. This is codified by the partition QoS named '''cml'''.

Please note that the CML compute nodes are also in the institute-wide <code>scavenger</code> partition in Nexus. CML users still have scavenging priority over these nodes via the <code>cml-scavenger</code> partition, i.e., all <code>cml-*</code> partition jobs (other than <code>cml-scavenger</code>) can preempt both <code>cml-scavenger</code> and <code>scavenger</code> partition jobs, and <code>cml-scavenger</code> partition jobs can preempt <code>scavenger</code> partition jobs.

==Network==
The network infrastructure supporting the CML partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* cml[17-28,30-32,34,37]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* cml[35-36]: Four 100GbE links, two to each switch in the pair (redundancy and increased bandwidth).
# One pair of network switches connected to the above pair of network switches via two 100GbE links, one between the first two switches in each pair and one between the second two switches in each pair for redundancy, and to each other via dual 25GbE links for redundancy.
#* cml[00,02-09]: Two 25GbE links per node, one to each switch in the pair (redundancy).
#* cml[10-16],cmlcpu[01-04,06-07]: Two 10GbE links per node, one to each switch in the pair (redundancy).

The fileserver hosting all CML [[Nexus/CML#Project_Directories | project]], [[Nexus/CML#Scratch_Directories | scratch]], [[Nexus/CML#Datasets | dataset]], and [[Nexus/CML#Models | model]] allocations also connects to the same pair of switches supporting cml[17-28,30-32] via fourteen 25GbE links, seven to each switch in the pair for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

==Partitions==
There are three partitions available to general CML [[SLURM]] users. You must specify a partition when submitting your job.

* '''cml-dpart''' - This is the default partition. Job allocations are guaranteed.
* '''cml-scavenger''' - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other <code>cml-</code> partitions are ready to be scheduled.
* '''cml-cpu''' - This partition is for CPU focused jobs. Job allocations are guaranteed. '''Please note that this partition is being permanently removed on 07/16/2026 at 9am.'''

There are a few additional partitions available solely to specific faculty members and their sponsored user accounts.

* '''cml-furongh''' - This partition is for exclusive priority access to Dr. Furong Huang's purchased nodes. Job allocations are guaranteed.
* '''cml-sfeizi''' - This partition is for exclusive priority access to Dr. Soheil Feizi's purchased nodes. Job allocations are guaranteed.

There is also one additional partition available to user accounts named by CML's director.

* '''cml-director''' - This partition is for exclusive priority access to designated CML-purchased nodes. Job allocations are guaranteed.

==Accounts==
The Center has a base SLURM account <code>cml</code> which has a modest number of guaranteed billing resources available to all cluster users at any given time. Other faculty that have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed billing resources corresponding to the amount that they invested.

If you do not specify an account when submitting your job, you will receive the '''cml''' account, which only has access to the '''cml-cpu''', '''cml-default''', and '''cml-medium''' QoSes (see below section).

If you need access to a different QoS, or if the '''cml''' account is at its billing limit (see below in this section), please use your faculty sponsor's account if they have one available. However, keep in mind that if your faculty sponsor has their own named partition (see previous section), using the faculty-specific account in the '''cml-dpart''' partition may block access to resources in the faculty-specific partition, since the billing limit for the account is charged regardless of which partition is being used.

The current faculty accounts are:
* cml-abhinav
* cml-furongh
* cml-hajiagha
* cml-ramani
* cml-sfeizi
* cml-tokekar
* cml-tomg

<pre>
$ sacctmgr show account format=account%20,description%30,organization%10
Account Descr Org
-------------------- ------------------------------ ----------
... ... ...
cml cml cml
cml-abhinav cml - abhinav shrivastava cml
cml-furongh cml - furong huang cml
cml-hajiagha cml - mohammad hajiaghayi cml
cml-ramani cml - ramani duraiswami cml
cml-scavenger cml - scavenger cml
cml-sfeizi cml - soheil feizi cml
cml-tokekar cml - pratap tokekar cml
cml-tomg cml - tom goldstein cml
... ... ...
</pre>

Faculty can manage the list of users that have access to their Slurm account via our [https://intranet.umiacs.umd.edu/directory/secgroup Directory application] in the Security Groups section. The security group that controls access has the prefix <code>cml_</code> prepended to their UMD directory ID. It will also list <code>slurm://nexusctl.umiacs.umd.edu</code> as the associated URI.

You can check your account associations by running the '''show_assoc''' command. Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association(s).

<pre>
$ show_assoc
User Account MaxJobs GrpTRES QOS
---------- ---------------- ------- ------------- --------------------------------------------------
... ... ...
tomg cml cml-cpu,cml-default,cml-medium
tomg cml-scavenger cml-scavenger
tomg cml-tomg cml-default,cml-high,cml-medium
... ... ...
</pre>

You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command. Make sure to specify the account you are interested in. The billing number displayed here is the sum of [[SLURM/Priority#Fair-share | resource weightings]] for all nodes appropriated to that account.

<pre>
$ sacctmgr show assoc account=cml format=user,account,qos,grptres
User Account QOS GrpTRES
---------- ---------- -------------------- -------------
cml billing=6481
... ...
</pre>

==QoS==
CML currently has 5 QoS for the '''cml-dpart''' partition (though <code>cml-high_long</code> and <code>cml-very_high</code> may not be available to all faculty accounts), 1 QoS for the '''cml-scavenger''' partition, and 1 QoS for the '''cml-cpu''' partition. If you do not specify a QoS when submitting your job using the <code>--qos</code> parameter, you will receive the <code>cml-default</code> QoS assuming you are using a CML account.

If your faculty member's Slurm account does not have one or both of the <code>cml-high_long</code> or <code>cml-very_high</code> QoS available to it, we can add it to their account provided they approve. Please [[HelpDesk | contact staff]] if this is desired.

QoSes differ in their maximum wall time, the number of jobs you can have running at once, and the maximum number of trackable resources (TRES) per job. The <code>cml-scavenger</code> QoS has one more constraint: a limit on the total number of TRES per user (across multiple jobs).

<pre>
$ show_qos --all | grep cml
Name MaxWall MaxTRES MaxJobsPU MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
...
cml-cpu 7-00:00:00 8
cml-default 7-00:00:00 cpu=4,gres/gpu=1,mem=32G 2
cml-high 1-12:00:00 cpu=16,gres/gpu=4,mem=128G 2
cml-high_long 14-00:00:00 cpu=32,gres/gpu=8 8 gres/gpu=8
cml-medium 3-00:00:00 cpu=8,gres/gpu=2,mem=64G 2
cml-scavenger 3-00:00:00 gres/gpu=24
cml-very_high 1-12:00:00 cpu=32,gres/gpu=8,mem=256G 8 gres/gpu=12
...
</pre>
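As a rough illustration of how the per-job <code>MaxTRES</code> limits above constrain a request, here is a minimal sketch that picks the smallest of <code>cml-default</code>, <code>cml-medium</code>, and <code>cml-high</code> whose limits fit a given number of CPUs, GPUs, and GB of memory. The function name <code>pick_cml_qos</code> is hypothetical, the limits are copied from the output above and may change, and the sketch ignores wall time and whether your account actually has access to the QoS.

```shell
# Hypothetical helper: print the smallest QoS whose per-job MaxTRES fits.
# Limits are hard-coded from the show_qos output shown above.
pick_cml_qos() {
    cpus=$1      # requested CPU cores
    gpus=$2      # requested GPUs
    mem_gb=$3    # requested memory in whole GB

    if [ "$cpus" -le 4 ] && [ "$gpus" -le 1 ] && [ "$mem_gb" -le 32 ]; then
        echo cml-default
    elif [ "$cpus" -le 8 ] && [ "$gpus" -le 2 ] && [ "$mem_gb" -le 64 ]; then
        echo cml-medium
    elif [ "$cpus" -le 16 ] && [ "$gpus" -le 4 ] && [ "$mem_gb" -le 128 ]; then
        echo cml-high
    else
        echo "request does not fit cml-default, cml-medium, or cml-high" >&2
        return 1
    fi
}

# Example: 6 cores, 2 GPUs, 48 GB exceeds cml-default but fits cml-medium.
pick_cml_qos 6 2 48
```

Note that a larger QoS is not always the better choice: in the output above, <code>cml-default</code> allows a longer wall time than <code>cml-medium</code> or <code>cml-high</code>.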

<pre>
$ show_partition_qos --all | grep cml
Name MaxSubmitPU MaxTRESPU GrpTRES
-------------------- ----------- ------------------------------ --------------------
...
cml 500 cpu=1128,mem=11T
cml-cpu 500
cml-director 500
cml-furongh 500
cml-scavenger 500 gres/gpu=24
cml-sfeizi 500
cml-wriva 500
...
</pre>

==Storage==
There are 3 types of user storage available to users in the CML:
* Home directories
* Project directories
* Scratch directories

There are also 2 types of read-only storage available for common use among users in the CML:
* Dataset directories
* Model directories

CML users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

===Home Directories===
{{Nfshomes}}

===Project Directories===
You can request project-based allocations for up to 6TB for up to 120 days with one or more approvals:
* Allocations up to and including 3TB require approval from a CML faculty member
* Allocations above 3TB (up to 6TB) require approval from both a CML faculty member and the [https://ml.umd.edu/#team director of CML]

To request an allocation, please [[HelpDesk | contact staff]] and include the faculty member(s) that the project is under in the conversation. Please include the following details:
* Project Name (short)
* Description
* Size (1TB, 2TB, etc.)
* Length in days (30 days, 90 days, etc.)
* Other user(s) that need to access the allocation, if any

These allocations will be available from '''/fs/cml-projects''' under a name that you provide when you request the allocation.

This data is backed up nightly.

====Renewal or Retirement====
Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 120 days (requires re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML).
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then retire the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but a faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and a faculty approver responding, staff will retire the allocation.

===Scratch Directories===
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the CML compute infrastructure:
* Network scratch directory
* Local scratch directories

====Network Scratch Directory====
You have 200GB of scratch storage available at <code>/cmlscratch/<username></code>. '''It is not backed up or protected in any way.''' This directory is '''automounted''', so you will need to <code>cd</code> into the directory or specify a fully qualified file path to access it.

You may request a permanent increase of up to 800GB total space without any faculty approval by [[HelpDesk | contacting staff]]. If you need space beyond 800GB, you will need faculty approval and/or a project directory. Space increases beyond 800GB also have a maximum request period of 120 days (as with project directories), after which they will need to be renewed with re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML.
* As with project directories, allocations over 3TB total space require approval from the [https://ml.umd.edu/#team director of CML] in addition to your faculty member.

This file system is available on all submission and computational nodes within the cluster.

====Local Scratch Directories====
Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. These are almost always more performant than any other storage available to the job. However, you must stage data to these directories within the confines of your jobs and stage the data out before the end of your jobs.

These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month during our monthly maintenance windows. Again, please make sure you secure any data you write to these directories at the end of your job.
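The stage-in/stage-out pattern described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name <code>stage_and_run</code> is hypothetical, the command being run is assumed to read from <code>in/</code> and write to <code>out/</code> relative to its working directory, and the paths you pass in are your own.

```shell
# Hypothetical helper: copy inputs to node-local scratch, run a command
# there, then copy results back out and clean up.
stage_and_run() {
    input_dir=$1      # source of input data (e.g. your network scratch)
    output_dir=$2     # where results should end up after the job
    scratch_root=$3   # a local scratch directory such as /scratch0
    shift 3           # remaining arguments are the command to run

    # Create a private working directory on local scratch.
    work_dir=$(mktemp -d "$scratch_root/job.XXXXXX") || return 1
    mkdir -p "$work_dir/in" "$work_dir/out" "$output_dir" || return 1

    # Stage in.
    cp -R "$input_dir/." "$work_dir/in/" || return 1

    # Run the command from inside the working directory.
    ( cd "$work_dir" && "$@" )
    status=$?

    # Stage out before the job ends, then remove the working directory.
    cp -R "$work_dir/out/." "$output_dir/"
    rm -rf "$work_dir"
    return "$status"
}
```

Inside a job script, this might be called as <code>stage_and_run /path/to/inputs /path/to/results /scratch0 ./my_program</code>, where all three paths and <code>my_program</code> are placeholders.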

===Datasets===
We have read-only dataset storage available at <code>/fs/cml-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of CML datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=CML here].

===Models===
We have read-only model storage available at <code>/fs/cml-models</code>. If there are models that you would like to see downloaded and made available, please see [[Datasets | this page]].

Nexus/CML

2026-07-08T17:15:40Z

Mbaney: /* Usage */

The compute nodes from [[CML]]'s previous standalone cluster were folded into [[Nexus]] as of the scheduled [[MonthlyMaintenanceWindow | maintenance window]] for August 2023 (Thursday 08/17/2023, 5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

==Usage==
You can [[SSH]] to <code>nexuscml.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexuscml00.umiacs.umd.edu</code>
* <code>nexuscml01.umiacs.umd.edu</code>

CML users (exclusively) can schedule non-interruptible jobs on CML nodes with any non-scavenger job parameters. Please note that the <code>cml-dpart</code> partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on all cml## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a max submission limit of 500 jobs per user simultaneously so as to not overload the cluster. This is codified by the partition QoS named '''cml'''.

Please note that the CML compute nodes are also in the institute-wide <code>scavenger</code> partition in Nexus. CML users still have scavenging priority over these nodes via the <code>cml-scavenger</code> partition (i.e., all <code>cml-*</code> partition jobs (other than <code>cml-scavenger</code>) can preempt both <code>cml-scavenger</code> and <code>scavenger</code> partition jobs, and <code>cml-scavenger</code> partition jobs can preempt <code>scavenger</code> partition jobs).
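The preemption ordering described above can be summarized as three tiers. The sketch below encodes only what that paragraph states; the function names are hypothetical, and partitions outside <code>scavenger</code> and <code>cml-*</code> are deliberately treated as unknown rather than guessed at.

```shell
# Hypothetical helper: map a partition name to a preemption tier.
# Higher tiers can preempt lower tiers, per the paragraph above.
preempt_tier() {
    case "$1" in
        scavenger)     echo 0 ;;   # institute-wide scavenger partition
        cml-scavenger) echo 1 ;;   # CML scavenger partition
        cml-*)         echo 2 ;;   # all other cml-* partitions
        *)             return 1 ;; # not covered by this sketch
    esac
}

# Succeeds (exit 0) if jobs in partition $1 can preempt jobs in partition $2.
can_preempt() {
    a=$(preempt_tier "$1") || return 2
    b=$(preempt_tier "$2") || return 2
    [ "$a" -gt "$b" ]
}

if can_preempt cml-scavenger scavenger; then
    echo "cml-scavenger jobs can preempt scavenger jobs"
fi
```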

==Network==
The network infrastructure supporting the CML partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* cml[17-28,30-32,34,37]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* cml[35-36]: Four 100GbE links, two to each switch in the pair (redundancy and increased bandwidth).
# One pair of network switches connected to the above pair of network switches via two 100GbE links, one between the first two switches in each pair and one between the second two switches in each pair for redundancy, and to each other via dual 25GbE links for redundancy.
#* cml[00,02-09]: Two 25GbE links per node, one to each switch in the pair (redundancy).
#* cml[10-16],cmlcpu[01-04,06-07]: Two 10GbE links per node, one to each switch in the pair (redundancy).

The fileserver hosting all CML [[Nexus/CML#Project_Directories | project]], [[Nexus/CML#Scratch_Directories | scratch]], [[Nexus/CML#Datasets | dataset]], and [[Nexus/CML#Models | model]] allocations also connects to the same pair of switches supporting cml[17-28,30-32] via fourteen 25GbE links, seven to each switch in the pair for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

==Partitions==
There are three partitions available to general CML [[SLURM]] users. You must specify a partition when submitting your job.

* '''cml-dpart''' - This is the default partition. Job allocations are guaranteed.
* '''cml-scavenger''' - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other <code>cml-</code> partitions are ready to be scheduled.
* '''cml-cpu''' - This partition is for CPU focused jobs. Job allocations are guaranteed. '''Please note that this partition is being permanently removed on 07/16/2026 at 9am.'''

There are a few additional partitions available solely to specific faculty members and their sponsored user accounts.

* '''cml-furongh''' - This partition is for exclusive priority access to Dr. Furong Huang's purchased nodes. Job allocations are guaranteed.
* '''cml-sfeizi''' - This partition is for exclusive priority access to Dr. Soheil Feizi's purchased nodes. Job allocations are guaranteed.

There is also one additional partition available to user accounts named by CML's director.

* '''cml-director''' - This partition is for exclusive priority access to designated CML-purchased nodes. Job allocations are guaranteed.

==Accounts==
The Center has a base SLURM account <code>cml</code> which has a modest number of guaranteed billing resources available to all cluster users at any given time. Other faculty that have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed billing resources corresponding to the amount that they invested.

If you do not specify an account when submitting your job, you will receive the '''cml''' account, which only has access to the '''cml-cpu''', '''cml-default''', and '''cml-medium''' QoSes (see below section).

If you need access to a different QoS, or if the '''cml''' account is at its billing limit (see below in this section), please use your faculty sponsor's account if they have one available. However, keep in mind that if your faculty sponsor has their own named partition (see previous section), using the faculty-specific account in the '''cml-dpart''' partition may block access to resources in the faculty-specific partition, since the billing limit for the account is charged regardless of which partition is being used.

The current faculty accounts are:
* cml-abhinav
* cml-furongh
* cml-hajiagha
* cml-ramani
* cml-sfeizi
* cml-tokekar
* cml-tomg

<pre>
$ sacctmgr show account format=account%20,description%30,organization%10
Account Descr Org
-------------------- ------------------------------ ----------
... ... ...
cml cml cml
cml-abhinav cml - abhinav shrivastava cml
cml-furongh cml - furong huang cml
cml-hajiagha cml - mohammad hajiaghayi cml
cml-ramani cml - ramani duraiswami cml
cml-scavenger cml - scavenger cml
cml-sfeizi cml - soheil feizi cml
cml-tokekar cml - pratap tokekar cml
cml-tomg cml - tom goldstein cml
... ... ...
</pre>

Faculty can manage the list of users that have access to their Slurm account via our [https://intranet.umiacs.umd.edu/directory/secgroup Directory application] in the Security Groups section. The security group that controls access has the prefix <code>cml_</code> prepended to their UMD directory ID. It will also list <code>slurm://nexusctl.umiacs.umd.edu</code> as the associated URI.

You can check your account associations by running the '''show_assoc''' command. Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association(s).

<pre>
$ show_assoc
User Account MaxJobs GrpTRES QOS
---------- ---------------- ------- ------------- --------------------------------------------------
... ... ...
tomg cml cml-cpu,cml-default,cml-medium
tomg cml-scavenger cml-scavenger
tomg cml-tomg cml-default,cml-high,cml-medium
... ... ...
</pre>
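If you want to check from a script whether a given account has a particular QoS, one possible approach is to filter the <code>show_assoc</code> output. The sketch below assumes the layout shown above, with the account in the second column and the comma-separated QoS list in the last column; the function name <code>account_has_qos</code> is hypothetical, and the real output format may differ.

```shell
# Hypothetical helper: read show_assoc-style text on stdin and succeed
# (exit 0) if account $1 lists QoS $2 in its last column.
account_has_qos() {
    awk -v acct="$1" -v qos="$2" '
        $2 == acct {
            n = split($NF, list, ",")
            for (i = 1; i <= n; i++)
                if (list[i] == qos) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Intended usage on a submission node (not run here):
#   show_assoc | account_has_qos cml-tomg cml-high && echo "available"
```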

You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command. Make sure to specify the account you are interested in. The billing number displayed here is the sum of [[SLURM/Priority#Fair-share | resource weightings]] for all nodes appropriated to that account.

<pre>
$ sacctmgr show assoc account=cml format=user,account,qos,grptres
User Account QOS GrpTRES
---------- ---------- -------------------- -------------
cml billing=6481
... ...
</pre>

==QoS==
CML currently has 5 QoS for the '''cml-dpart''' partition (though <code>cml-high_long</code> and <code>cml-very_high</code> may not be available to all faculty accounts), 1 QoS for the '''cml-scavenger''' partition, and 1 QoS for the '''cml-cpu''' partition. If you do not specify a QoS when submitting your job using the <code>--qos</code> parameter, you will receive the <code>cml-default</code> QoS assuming you are using a CML account.

If your faculty member's Slurm account does not have one or both of the <code>cml-high_long</code> or <code>cml-very_high</code> QoS available to it, we can add it to their account provided they approve. Please [[HelpDesk | contact staff]] if this is desired.

QoSes differ in their maximum wall time, the number of jobs you can have running at once, and the maximum number of trackable resources (TRES) per job. The <code>cml-scavenger</code> QoS has one more constraint: a limit on the total number of TRES per user (across multiple jobs).

<pre>
$ show_qos --all | grep cml
Name MaxWall MaxTRES MaxJobsPU MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
...
cml-cpu 7-00:00:00 8
cml-default 7-00:00:00 cpu=4,gres/gpu=1,mem=32G 2
cml-high 1-12:00:00 cpu=16,gres/gpu=4,mem=128G 2
cml-high_long 14-00:00:00 cpu=32,gres/gpu=8 8 gres/gpu=8
cml-medium 3-00:00:00 cpu=8,gres/gpu=2,mem=64G 2
cml-scavenger 3-00:00:00 gres/gpu=24
cml-very_high 1-12:00:00 cpu=32,gres/gpu=8,mem=256G 8 gres/gpu=12
...
</pre>

<pre>
$ show_partition_qos --all | grep cml
Name MaxSubmitPU MaxTRESPU GrpTRES
-------------------- ----------- ------------------------------ --------------------
...
cml 500 cpu=1128,mem=11T
cml-cpu 500
cml-director 500
cml-furongh 500
cml-scavenger 500 gres/gpu=24
cml-sfeizi 500
cml-wriva 500
...
</pre>

==Storage==
There are 3 types of user storage available to users in the CML:
* Home directories
* Project directories
* Scratch directories

There are also 2 types of read-only storage available for common use among users in the CML:
* Dataset directories
* Model directories

CML users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

===Home Directories===
{{Nfshomes}}

===Project Directories===
You can request project-based allocations for up to 6TB for up to 120 days with one or more approvals:
* Allocations up to and including 3TB require approval from a CML faculty member
* Allocations above 3TB (up to 6TB) require approval from both a CML faculty member and the [https://ml.umd.edu/#team director of CML]
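The approval thresholds above can be expressed as a small check. This is only a sketch of the stated rules, assuming the requested size is a whole number of TB; the function name <code>required_approvals</code> is hypothetical.

```shell
# Hypothetical helper: print who must approve a project allocation of
# the given size (whole TB), following the thresholds listed above.
required_approvals() {
    size_tb=$1
    if [ "$size_tb" -le 0 ] || [ "$size_tb" -gt 6 ]; then
        echo "size must be between 1TB and 6TB" >&2
        return 1
    elif [ "$size_tb" -le 3 ]; then
        echo "CML faculty member"
    else
        echo "CML faculty member and director of CML"
    fi
}

# Example: a 4TB request is above the 3TB threshold.
required_approvals 4
```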

To request an allocation, please [[HelpDesk | contact staff]] and include the faculty member(s) that the project is under in the conversation. Please include the following details:
* Project Name (short)
* Description
* Size (1TB, 2TB, etc.)
* Length in days (30 days, 90 days, etc.)
* Other user(s) that need to access the allocation, if any

These allocations will be available from '''/fs/cml-projects''' under a name that you provide when you request the allocation.

This data is backed up nightly.

====Renewal or Retirement====
Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 120 days (requires re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML).
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then retire the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but a faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and a faculty approver responding, staff will retire the allocation.

===Scratch Directories===
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the CML compute infrastructure:
* Network scratch directory
* Local scratch directories

====Network Scratch Directory====
You have 200GB of scratch storage available at <code>/cmlscratch/<username></code>. '''It is not backed up or protected in any way.''' This directory is '''automounted''', so you will need to <code>cd</code> into the directory or specify a fully qualified file path to access it.

You may request a permanent increase of up to 800GB total space without any faculty approval by [[HelpDesk | contacting staff]]. If you need space beyond 800GB, you will need faculty approval and/or a project directory. Space increases beyond 800GB also have a maximum request period of 120 days (as with project directories), after which they will need to be renewed with re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML.
* As with project directories, allocations over 3TB total space require approval from the [https://ml.umd.edu/#team director of CML] in addition to your faculty member.

This file system is available on all submission and computational nodes within the cluster.

====Local Scratch Directories====
Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. These are almost always more performant than any other storage available to the job. However, you must stage data to these directories within the confines of your jobs and stage the data out before the end of your jobs.

These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month during our monthly maintenance windows. Again, please make sure you secure any data you write to these directories at the end of your job.

===Datasets===
We have read-only dataset storage available at <code>/fs/cml-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of CML datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=CML here].

===Models===
We have read-only model storage available at <code>/fs/cml-models</code>. If there are models that you would like to see downloaded and made available, please see [[Datasets | this page]].

Nexus/CML

2026-07-08T17:14:34Z

Mbaney:

The compute nodes from [[CML]]'s previous standalone cluster were folded into [[Nexus]] as of the scheduled [[MonthlyMaintenanceWindow | maintenance window]] for August 2023 (Thursday 08/17/2023, 5-8pm).

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

==Usage==
You can [[SSH]] to <code>nexuscml.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexuscml00.umiacs.umd.edu</code>
* <code>nexuscml01.umiacs.umd.edu</code>

All partitions, QoSes, and account names from the standalone CML cluster have been moved over to Nexus. However, please note that <code>cml-</code> is prepended to all of the values that were present in the standalone CML cluster to distinguish them from existing values in Nexus. The lone exception is the base account that was named <code>cml</code> in the standalone cluster (it is also named just <code>cml</code> in Nexus).

Here are some before/after examples of job submission with various parameters:

{| class="wikitable"
! Standalone CML cluster submission command
! Nexus cluster submission command
|-
|<code>srun --partition=dpart --qos=medium --account=tomg --gres=gpu:rtx2080ti:2 --pty bash</code>
|<code>srun --partition=cml-dpart --qos=cml-medium --account=cml-tomg --gres=gpu:rtx2080ti:2 --pty bash</code>
|-
|<code>srun --partition=cpu --qos=cpu --pty bash</code>
|<code>srun --partition=cml-cpu --qos=cml-cpu --account=cml --pty bash</code>
|-
|<code>srun --partition=scavenger --qos=scavenger --account=scavenger --gres=gpu:4 --pty bash</code>
|<code>srun --partition=cml-scavenger --qos=cml-scavenger --account=cml-scavenger --gres=gpu:4 --pty bash</code>
|}
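The renaming rule behind the table above (prepend <code>cml-</code> to every value except the base account <code>cml</code>) can be sketched as a small helper. The function name <code>to_nexus_name</code> is hypothetical, and the example only prints the resulting command rather than running it.

```shell
# Hypothetical helper: map a standalone-cluster partition, QoS, or
# account name to its Nexus name. The base account "cml" is unchanged.
to_nexus_name() {
    case "$1" in
        cml) printf '%s\n' "cml" ;;
        *)   printf '%s\n' "cml-$1" ;;
    esac
}

# Values from the first row of the table above.
old_partition=dpart
old_qos=medium
old_account=tomg

# Print (but do not run) the equivalent Nexus submission command.
echo "srun --partition=$(to_nexus_name "$old_partition")" \
     "--qos=$(to_nexus_name "$old_qos")" \
     "--account=$(to_nexus_name "$old_account")" \
     "--gres=gpu:rtx2080ti:2 --pty bash"
```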

CML users (exclusively) can schedule non-interruptible jobs on CML nodes with any non-scavenger job parameters. Please note that the <code>cml-dpart</code> partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on all cml## nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a max submission limit of 500 jobs per user simultaneously so as to not overload the cluster. This is codified by the partition QoS named '''cml'''.

Please note that the CML compute nodes are also in the institute-wide <code>scavenger</code> partition in Nexus. CML users still have scavenging priority over these nodes via the <code>cml-scavenger</code> partition (i.e., all <code>cml-</code> partition jobs (other than <code>cml-scavenger</code>) can preempt both <code>cml-scavenger</code> and <code>scavenger</code> partition jobs, and <code>cml-scavenger</code> partition jobs can preempt <code>scavenger</code> partition jobs).

==Network==
The network infrastructure supporting the CML partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* cml[17-28,30-32,34,37]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* cml[35-36]: Four 100GbE links, two to each switch in the pair (redundancy and increased bandwidth).
# One pair of network switches connected to the above pair of network switches via two 100GbE links, one between the first two switches in each pair and one between the second two switches in each pair for redundancy, and to each other via dual 25GbE links for redundancy.
#* cml[00,02-09]: Two 25GbE links per node, one to each switch in the pair (redundancy).
#* cml[10-16],cmlcpu[01-04,06-07]: Two 10GbE links per node, one to each switch in the pair (redundancy).

The fileserver hosting all CML [[Nexus/CML#Project_Directories | project]], [[Nexus/CML#Scratch_Directories | scratch]], [[Nexus/CML#Datasets | dataset]], and [[Nexus/CML#Models | model]] allocations also connects to the same pair of switches supporting cml[17-28,30-32] via fourteen 25GbE links, seven to each switch in the pair for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

==Partitions==
There are three partitions available to general CML [[SLURM]] users. You must specify a partition when submitting your job.

* '''cml-dpart''' - This is the default partition. Job allocations are guaranteed.
* '''cml-scavenger''' - This is the alternate partition that allows jobs longer run times and more resources but is preemptable when jobs in other <code>cml-</code> partitions are ready to be scheduled.
* '''cml-cpu''' - This partition is for CPU focused jobs. Job allocations are guaranteed. '''Please note that this partition is being permanently removed on 07/16/2026 at 9am.'''

There are a few additional partitions available solely to specific faculty members and their sponsored user accounts.

* '''cml-furongh''' - This partition is for exclusive priority access to Dr. Furong Huang's purchased nodes. Job allocations are guaranteed.
* '''cml-sfeizi''' - This partition is for exclusive priority access to Dr. Soheil Feizi's purchased nodes. Job allocations are guaranteed.

There is also one additional partition available to user accounts named by CML's director.

* '''cml-director''' - This partition is for exclusive priority access to designated CML-purchased nodes. Job allocations are guaranteed.

==Accounts==
The Center has a base SLURM account <code>cml</code> which has a modest number of guaranteed billing resources available to all cluster users at any given time. Other faculty that have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed billing resources corresponding to the amount that they invested.

If you do not specify an account when submitting your job, you will receive the '''cml''' account, which only has access to the '''cml-cpu''', '''cml-default''', and '''cml-medium''' QoSes (see below section).

If you need access to a different QoS, or if the '''cml''' account is at its billing limit (see below in this section), please use your faculty sponsor's account if they have one available. However, keep in mind that if your faculty sponsor has their own named partition (see previous section), using the faculty-specific account in the '''cml-dpart''' partition may block access to resources in the faculty-specific partition, since the billing limit for the account is charged regardless of which partition is being used.

The current faculty accounts are:
* cml-abhinav
* cml-furongh
* cml-hajiagha
* cml-ramani
* cml-sfeizi
* cml-tokekar
* cml-tomg

<pre>
$ sacctmgr show account format=account%20,description%30,organization%10
Account Descr Org
-------------------- ------------------------------ ----------
... ... ...
cml cml cml
cml-abhinav cml - abhinav shrivastava cml
cml-furongh cml - furong huang cml
cml-hajiagha cml - mohammad hajiaghayi cml
cml-ramani cml - ramani duraiswami cml
cml-scavenger cml - scavenger cml
cml-sfeizi cml - soheil feizi cml
cml-tokekar cml - pratap tokekar cml
cml-tomg cml - tom goldstein cml
... ... ...
</pre>

Faculty can manage the list of users that have access to their Slurm account via our [https://intranet.umiacs.umd.edu/directory/secgroup Directory application] in the Security Groups section. The security group that controls access has the prefix <code>cml_</code> prepended to their UMD directory ID. It will also list <code>slurm://nexusctl.umiacs.umd.edu</code> as the associated URI.

You can check your account associations by running the '''show_assoc''' command. Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association(s).

<pre>
$ show_assoc
User Account MaxJobs GrpTRES QOS
---------- ---------------- ------- ------------- --------------------------------------------------
... ... ...
tomg cml cml-cpu,cml-default,cml-medium
tomg cml-scavenger cml-scavenger
tomg cml-tomg cml-default,cml-high,cml-medium
... ... ...
</pre>

You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command. Make sure to specify the account you are interested in. The billing number displayed here is the sum of [[SLURM/Priority#Fair-share | resource weightings]] for all nodes appropriated to that account.

<pre>
$ sacctmgr show assoc account=cml format=user,account,qos,grptres
User Account QOS GrpTRES
---------- ---------- -------------------- -------------
cml billing=6481
... ...
</pre>

==QoS==
CML currently has 5 QoS for the '''cml-dpart''' partition (though <code>cml-high_long</code> and <code>cml-very_high</code> may not be available to all faculty accounts), 1 QoS for the '''cml-scavenger''' partition, and 1 QoS for the '''cml-cpu''' partition. If you do not specify a QoS when submitting your job using the <code>--qos</code> parameter, you will receive the <code>cml-default</code> QoS assuming you are using a CML account.

If your faculty member's Slurm account does not have one or both of the <code>cml-high_long</code> or <code>cml-very_high</code> QoS available to it, we can add it to their account provided they approve. Please [[HelpDesk | contact staff]] if this is desired.

QoSes differ in their maximum wall time, the number of jobs you can have running at once, and the maximum number of trackable resources (TRES) per job. The <code>cml-scavenger</code> QoS has one more constraint: a limit on the total number of TRES per user (across multiple jobs).

<pre>
$ show_qos --all | grep cml
Name MaxWall MaxTRES MaxJobsPU MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
...
cml-cpu 7-00:00:00 8
cml-default 7-00:00:00 cpu=4,gres/gpu=1,mem=32G 2
cml-high 1-12:00:00 cpu=16,gres/gpu=4,mem=128G 2
cml-high_long 14-00:00:00 cpu=32,gres/gpu=8 8 gres/gpu=8
cml-medium 3-00:00:00 cpu=8,gres/gpu=2,mem=64G 2
cml-scavenger 3-00:00:00 gres/gpu=24
cml-very_high 1-12:00:00 cpu=32,gres/gpu=8,mem=256G 8 gres/gpu=12
...
</pre>

<pre>
$ show_partition_qos --all | grep cml
Name MaxSubmitPU MaxTRESPU GrpTRES
-------------------- ----------- ------------------------------ --------------------
...
cml 500 cpu=1128,mem=11T
cml-cpu 500
cml-director 500
cml-furongh 500
cml-scavenger 500 gres/gpu=24
cml-sfeizi 500
cml-wriva 500
...
</pre>
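
As a rough illustration of how these limits constrain a request, the sketch below hard-codes the per-job GPU limits (the <code>gres/gpu</code> part of MaxTRES) copied from the <code>show_qos</code> output on this page and checks a requested GPU count against them. The function names <code>cml_qos_max_gpus</code> and <code>cml_qos_gpu_ok</code> are hypothetical, and the live limits may differ, so treat <code>show_qos</code> as the source of truth.

```shell
#!/bin/bash
# Hypothetical helper: per-job GPU limit (MaxTRES gres/gpu) for a CML QoS.
# Values are copied from the show_qos output above and may be out of date.
cml_qos_max_gpus() {
    case "$1" in
        cml-default)   echo 1 ;;
        cml-medium)    echo 2 ;;
        cml-high)      echo 4 ;;
        cml-high_long) echo 8 ;;
        cml-very_high) echo 8 ;;
        *)             return 1 ;;  # unknown QoS, or no per-job GPU limit listed
    esac
}

# Succeeds if the requested GPU count fits within the QoS's per-job limit.
cml_qos_gpu_ok() {
    local qos="$1" gpus="$2" max
    max="$(cml_qos_max_gpus "$qos")" || return 1
    [ "$gpus" -le "$max" ]
}

if cml_qos_gpu_ok cml-medium 2; then
    echo "2 GPUs fit within cml-medium"
fi
```

If the request fits, the matching submission would presumably look something like <code>srun --pty --account=cml --partition=cml-dpart --qos=cml-medium --gres=gpu:2 bash</code>, assuming your association includes that QoS.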

==Storage==
There are 3 types of user storage available to users in the CML:
* Home directories
* Project directories
* Scratch directories

There are also 2 types of read-only storage available for common use among users in the CML:
* Dataset directories
* Model directories

CML users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

===Home Directories===
{{Nfshomes}}

===Project Directories===
You can request project-based allocations of up to 6TB for up to 120 days with one or more approvals:
* Allocations up to and including 3TB require approval from a CML faculty member
* Allocations above 3TB (up to 6TB) require approval from both a CML faculty member and the [https://ml.umd.edu/#team director of CML]
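
The approval rules above can be sketched as a small lookup. This is only an illustration: the function name <code>cml_project_approvers</code> is hypothetical, and it handles whole-terabyte sizes only.

```shell
#!/bin/bash
# Hypothetical helper: print who must approve a CML project allocation of
# SIZE_TB terabytes (whole numbers only), following the rules listed above.
cml_project_approvers() {
    local size_tb="$1"
    if [ "$size_tb" -lt 1 ] || [ "$size_tb" -gt 6 ]; then
        # Project allocations go up to 6TB; anything else is out of range here.
        echo "unsupported size"
        return 1
    elif [ "$size_tb" -le 3 ]; then
        echo "CML faculty member"
    else
        echo "CML faculty member and director of CML"
    fi
}

cml_project_approvers 4
```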

To request an allocation, please [[HelpDesk | contact staff]] and include the faculty member(s) sponsoring the project in the conversation. Please include the following details:
* Project Name (short)
* Description
* Size (1TB, 2TB, etc.)
* Length in days (30 days, 90 days, etc.)
* Other user(s) that need to access the allocation, if any

These allocations will be available from '''/fs/cml-projects''' under a name that you provide when you request the allocation.

This data is backed up nightly.

====Renewal or Retirement====
Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 120 days (requires re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML).
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then retire the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but a faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and a faculty approver responding, staff will retire the allocation.

===Scratch Directories===
Scratch data has no data protection: there are no snapshots, and the data is not backed up. There are two types of scratch directories in the CML compute infrastructure:
* Network scratch directory
* Local scratch directories

====Network Scratch Directory====
You have 200GB of scratch storage available at <code>/cmlscratch/<username></code>. '''It is not backed up or protected in any way.''' This directory is '''automounted''', so you will need to <code>cd</code> into the directory or specify a fully qualified file path to access it.

You may request a permanent increase of up to 800GB total space without any faculty approval by [[HelpDesk | contacting staff]]. If you need space beyond 800GB, you will need faculty approval and/or a project directory. Space increases beyond 800GB also have a maximum request period of 120 days (as with project directories), after which they will need to be renewed with re-approval from a CML faculty member and, if the allocation is over 3TB, the director of CML.
* As with project directories, allocations over 3TB total space require approval from the [https://ml.umd.edu/#team director of CML] in addition to your faculty member.

This file system is available on all submission and computational nodes within the cluster.

====Local Scratch Directories====
Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. These are almost always more performant than any other storage available to the job. However, you must stage data to these directories within the confines of your jobs and stage the data out before the end of your jobs.

These local scratch directories have a tmpwatch job which will '''delete data that has not been accessed for 90 days'''. It is scheduled via maintenance jobs to run once a month during our monthly maintenance windows. Again, please make sure you copy any data you write to these directories to other storage at the end of your job.
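
The stage-in, compute, stage-out pattern described above might look like the following inside a job. This is a minimal sketch rather than a recommended implementation: the function name <code>stage_run_collect</code>, the <code>job.XXXXXX</code> directory naming, and the line-counting placeholder workload are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: stage input onto node-local scratch, work there, copy results back.
# Arguments: INPUT_FILE RESULTS_DIR [SCRATCH_ROOT]
# SCRATCH_ROOT defaults to /scratch0.
stage_run_collect() {
    local input="$1" results_dir="$2" scratch_root="${3:-/scratch0}"
    local workdir
    # Private working directory on local scratch for this job.
    workdir="$(mktemp -d "${scratch_root}/job.XXXXXX")" || return 1

    # Stage in.
    cp "$input" "$workdir/input" || { rm -rf "$workdir"; return 1; }

    # Placeholder workload: count the lines of the input. Replace with real work.
    grep -c '' "$workdir/input" > "$workdir/output.txt"

    # Stage out before the job ends, then clean up local scratch.
    mkdir -p "$results_dir" && cp "$workdir/output.txt" "$results_dir/"
    local status=$?
    rm -rf "$workdir"
    return "$status"
}
```

Within a batch script, this could be called as, for example, <code>stage_run_collect "/cmlscratch/$USER/data.txt" "$HOME/results"</code>, where both paths are placeholders.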

===Datasets===
We have read-only dataset storage available at <code>/fs/cml-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of CML datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=CML here].

===Models===
We have read-only model storage available at <code>/fs/cml-models</code>. If there are models that you would like to see downloaded and made available, please see [[Datasets | this page]].

Nexus/CBCB

2026-07-01T15:15:25Z

Mbaney:

The compute nodes from [[CBCB]]'s previous standalone cluster were folded into [[Nexus]] in mid-2023.

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

= Submission Nodes =
You can [[SSH]] to <code>nexuscbcb.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexuscbcb00.umiacs.umd.edu</code>
* <code>nexuscbcb01.umiacs.umd.edu</code>

= Compute Nodes =
All compute nodes in CBCB-owned partitions (see the Partitions section below) are named in the format <code>cbcb##</code>. The sets of nodes are:
* 22 nodes that were purchased in October 2022 with center-wide funding. They are cbcb[00-21].
* 4 nodes from the previous standalone CBCB cluster that moved in as of Summer 2023. They are cbcb[22-25].
* 4 additional nodes purchased by Dr. Heng Huang. They are cbcb[26-29].
* 1 additional node purchased by Dr. Mihai Pop. It is cbcb30.

{| class="wikitable sortable"
! Nodenames
! Quantity
! CPU cores per node (CPUs)
! Memory per node (type)
! Filesystem storage per node (type/location)
! GPUs per node (type)
|-
|cbcb[00-21]
|22
|32 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7313.html AMD EPYC 7313])
|~2TB (DDR4 3200MHz)
|~350GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~2TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|0
|-
|cbcb22
|1
|28 (Dual [https://ark.intel.com/content/www/us/en/ark/products/91754/intel-xeon-processor-e5-2680-v4-35m-cache-2-40-ghz.html Intel Xeon E5-2680 v4])
|~768GB (DDR4 2400MHz)
|~650GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]])
|0
|-
|cbcb[23-24]
|2
|24 (Dual [https://www.intel.com/content/www/us/en/products/sku/91767/intel-xeon-processor-e52650-v4-30m-cache-2-20-ghz/specifications.html Intel Xeon E5-2650 v4])
|~256GB (DDR4 2400MHz)
|~800GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]])
|0
|-
|cbcb25
|1
|24 (Dual [https://www.intel.com/content/www/us/en/products/sku/91767/intel-xeon-processor-e52650-v4-30m-cache-2-20-ghz/specifications.html Intel Xeon E5-2650 v4])
|~256GB (DDR4 2400MHz)
|~1.4TB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]])
|2 (1x [https://www.nvidia.com/en-gb/geforce/graphics-cards/geforce-gtx-1080-ti/specifications/ NVIDIA GeForce GTX 1080 Ti], 1x [https://www.nvidia.com/en-us/geforce/graphics-cards/compare/?section=compare-20 NVIDIA GeForce RTX 2080 Ti])
|-
|cbcb26
|1
|128 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7763.html AMD EPYC 7763])
|~512GB (DDR4 3200MHz)
|~3.4TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~14TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|7 ([https://www.nvidia.com/en-us/design-visualization/rtx-a5000 NVIDIA RTX A5000])
|-
|cbcb27
|1
|64 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7513.html AMD EPYC 7513])
|~256GB (DDR4 3200MHz)
|~3.4TB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~3.5TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|8 ([https://www.nvidia.com/en-us/design-visualization/rtx-a6000 NVIDIA RTX A6000])
|-
|cbcb[28-29]
|2
|32 (Dual [https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9124.html AMD EPYC 9124])
|~768GB (DDR5 4800MHz)
|~350GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~7TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|8 ([https://www.nvidia.com/en-us/design-visualization/rtx-6000 NVIDIA RTX 6000 Ada Generation])
|-
|cbcb30
|1
|48 (Single [https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9475f.html AMD EPYC 9475F])
|~1.15TB (DDR5 6400MHz)
|~350GB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~10.5TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|0
|- class="sortbottom"
!Total
|31
|1108 (various)
|~50TB (various)
|~105TB (various)
|33 (various)
|}
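
As a quick consistency check, the node, core, and GPU figures in the Total row can be recomputed from the per-row quantities in the table. The sketch below simply multiplies each row's quantity by its per-node value; the memory and storage totals are approximate in the table and are left out.

```shell
#!/bin/bash
# Recompute the "Total" row from the per-row quantities in the table above.
# Row order: cbcb[00-21], cbcb22, cbcb[23-24], cbcb25, cbcb26, cbcb27,
# cbcb[28-29], cbcb30.
nodes=$(( 22 + 1 + 2 + 1 + 1 + 1 + 2 + 1 ))
cores=$(( 22*32 + 1*28 + 2*24 + 1*24 + 1*128 + 1*64 + 2*32 + 1*48 ))
gpus=$((  22*0  + 1*0  + 2*0  + 1*2  + 1*7   + 1*8  + 2*8  + 1*0  ))
echo "nodes=$nodes cores=$cores gpus=$gpus"
```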

Here is the listing of nodes as shown by the Slurm alias <code>show_nodes</code> (again, all nodes are named in the format <code>cbcb##</code>):
<pre>
[root@nexusctl00 ~]# show_nodes | grep cbcb
NODELIST CPUS MEMORY AVAIL_FEATURES GRES STATE
cbcb00 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb01 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb02 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb03 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb04 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb05 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb06 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb07 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb08 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb09 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb10 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb11 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb12 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb13 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb14 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb15 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb16 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb17 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb18 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb19 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb20 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb21 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb22 28 771245 rhel8,x86_64,Xeon,E5-2680 (null) idle
cbcb23 24 255150 rhel8,x86_64,Xeon,E5-2650 (null) idle
cbcb24 24 255150 rhel8,x86_64,Xeon,E5-2650 (null) idle
cbcb25 24 255278 rhel8,x86_64,Xeon,E5-2650,Pascal,Turing gpu:rtx2080ti:1,gpu:gtx1080ti:1 idle
cbcb26 128 513243 rhel8,x86_64,Zen,EPYC-7763,Ampere gpu:rtxa5000:7 idle
cbcb27 64 255167 rhel8,x86_64,Zen,EPYC-7513,Ampere gpu:rtxa6000:8 idle
cbcb28 32 771166 rhel8,x86_64,Zen,EPYC-9124,Ada gpu:rtx6000ada:8 idle
cbcb29 32 771166 rhel8,x86_64,Zen,EPYC-9124,Ada gpu:rtx6000ada:8 idle
cbcb30 48 1157583 rhel8,x86_64,EPYC,EPYC-9475F (null) idle
</pre>

= Network =
The network infrastructure supporting the CBCB partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* cbcb[00-21,26-30]: Two 100GbE links per node, one to each switch in the pair (redundancy).
# One pair of network switches connected to the above pair of network switches via four 40GbE links, one between every combination of switches across the two pairings for redundancy, and to each other via dual 10GbE links for redundancy, serving the following compute nodes:
#* cbcb[22-25]: Two 10GbE links per node, one to each switch in the pair (redundancy).

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

= Partitions =
There are two partitions available to general CBCB [[SLURM]] users. You must specify one of these two partitions when submitting your job.

* '''cbcb''' - This is the default partition. Job allocations on all nodes except those also in the '''cbcb-heng''' partition are guaranteed.
* '''cbcb-interactive''' - This is a partition that only allows interactive jobs; you cannot submit jobs via <code>sbatch</code> to this partition. Job allocations are guaranteed.

There is one additional partition available solely to Dr. Heng Huang's sponsored accounts.

* '''cbcb-heng''' - This partition is for exclusive priority access to Dr. Huang's purchased GPU nodes. Job allocations are guaranteed.

= QoS =
CBCB users have access to all of the [[Nexus#Quality_of_Service_.28QoS.29 | standard job QoSes]] in the '''cbcb''' and '''cbcb-heng''' partitions using the <code>cbcb</code> account.

The additional job QoSes for the '''cbcb''' and '''cbcb-heng''' partitions specifically are:
* <code>highmem</code>: Allows for significantly increased memory to be allocated.
* <code>huge-long</code>: Allows for longer jobs using higher overall resources.

Please note that the partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.

The ''only'' allowed job QoS for the '''cbcb-interactive''' partition is:
* <code>interactive</code>: Allows for 4 CPU / 128G mem jobs up to 12 hours in length - can only be used via <code>srun</code> or <code>salloc</code>.
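
To make the <code>interactive</code> limits concrete, the sketch below builds (but does not run) an <code>srun</code> command for the '''cbcb-interactive''' partition and rejects requests above the limits quoted above. The helper name <code>cbcb_interactive_cmd</code> is hypothetical, and it assumes that the <code>cbcb</code> account applies to this partition and that "4 CPU" maps to <code>--cpus-per-task</code>.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) an srun command for cbcb-interactive,
# refusing requests above the interactive QoS limits quoted above.
# Arguments: CPUS MEM_GB HOURS (whole numbers).
cbcb_interactive_cmd() {
    local cpus="$1" mem_gb="$2" hours="$3"
    if [ "$cpus" -gt 4 ] || [ "$mem_gb" -gt 128 ] || [ "$hours" -gt 12 ]; then
        echo "request exceeds interactive QoS limits (4 CPU, 128G, 12 hours)" >&2
        return 1
    fi
    printf 'srun --pty --partition=cbcb-interactive --account=cbcb --qos=interactive --cpus-per-task=%d --mem=%dG --time=%02d:00:00 bash\n' \
        "$cpus" "$mem_gb" "$hours"
}

cbcb_interactive_cmd 4 64 8
```

Printing the command first makes it easy to inspect before pasting it into a shell on a submission node.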

= Jobs =
You will need to specify <code>--partition=cbcb</code> and <code>--account=cbcb</code> to be able to submit jobs to the CBCB partition.

<pre>
[username@nexuscbcb00:~ ] $ srun --pty --ntasks=16 --mem=2000G --qos=highmem --partition=cbcb --account=cbcb --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@cbcb00:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
UserId=username(1000) GroupId=username(21000) MCS_label=N/A
Priority=897 Nice=0 Account=cbcb QOS=highmem
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
AccrueTime=2022-11-18T11:13:56
StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
Partition=cbcb AllocNode:Sid=nexuscbcb00:25443
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cbcb00
BatchHost=cbcb00
NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,mem=2000G,node=1,billing=2266
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/nfshomes/username
Power=
</pre>
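
The interactive example above has a batch counterpart. As a sketch only, the helper below writes a batch script that carries the same resource request as <code>#SBATCH</code> directives; the helper name <code>write_cbcb_batch_script</code>, the output file name, and the <code>srun hostname</code> workload are placeholders.

```shell
#!/bin/bash
# Sketch: write a batch script that mirrors the srun example above.
# The output path and the final command are placeholders.
write_cbcb_batch_script() {
    local path="$1"
    cat > "$path" <<'EOF'
#!/bin/bash
#SBATCH --partition=cbcb
#SBATCH --account=cbcb
#SBATCH --qos=highmem
#SBATCH --ntasks=16
#SBATCH --mem=2000G
#SBATCH --time=1-00:00:00

# Placeholder workload; replace with your own commands.
srun hostname
EOF
}

script="${TMPDIR:-/tmp}/cbcb_job.sh"
write_cbcb_batch_script "$script"
echo "Submit with: sbatch $script"
```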

= Storage =
CBCB still has its current [https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage storage] allocation in place. All data filesystems that were available in the standalone CBCB cluster are also available in Nexus. Please note the change in your home directory described in the migration section below.

CBCB users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

= Operating System / Software =
CBCB's standalone cluster submission and compute nodes were running RHEL7. [[Nexus]] is running a mixture of RHEL8 and RHEL9, so any software you compiled on the standalone cluster may need to be re-compiled to work correctly in this new environment. The [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules CBCB module tree] for RHEL8+ may not yet be fully populated with RHEL8+ software. If you do not see the modules you need, please reach out to the [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules#Contact CBCB software maintainers].

Nexus/CBCB

2026-06-30T16:48:58Z

Mbaney:

The compute nodes from [[CBCB]]'s previous standalone cluster were folded into [[Nexus]] in mid-2023.

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

= Submission Nodes =
You can [[SSH]] to <code>nexuscbcb.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexuscbcb00.umiacs.umd.edu</code>
* <code>nexuscbcb01.umiacs.umd.edu</code>

= Compute Nodes =
All compute nodes in CBCB-owned partitions (see the Partitions section below) are named in the format <code>cbcb##</code>. The sets of nodes are:
* 22 nodes that were purchased in October 2022 with center-wide funding. They are cbcb[00-21].
* 4 nodes from the previous standalone CBCB cluster that moved in as of Summer 2023. They are cbcb[22-25].
* 4 additional nodes purchased by Dr. Heng Huang. They are cbcb[26-29].
* 1 additional node purchased by Dr. Mihai Pop. It is cbcb30.

{| class="wikitable sortable"
! Nodenames
! Quantity
! CPU cores per node (CPUs)
! Memory per node (type)
! Filesystem storage per node (type/location)
! GPUs per node (type)
|-
|cbcb[00-21]
|22
|32 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7313.html AMD EPYC 7313])
|~2TB (DDR4 3200MHz)
|~350GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~2TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|0
|-
|cbcb22
|1
|28 (Dual [https://ark.intel.com/content/www/us/en/ark/products/91754/intel-xeon-processor-e5-2680-v4-35m-cache-2-40-ghz.html Intel Xeon E5-2680 v4])
|~768GB (DDR4 2400MHz)
|~650GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]])
|0
|-
|cbcb[23-24]
|2
|24 (Dual [https://www.intel.com/content/www/us/en/products/sku/91767/intel-xeon-processor-e52650-v4-30m-cache-2-20-ghz/specifications.html Intel Xeon E5-2650 v4])
|~256GB (DDR4 2400MHz)
|~800GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]])
|0
|-
|cbcb25
|1
|24 (Dual [https://www.intel.com/content/www/us/en/products/sku/91767/intel-xeon-processor-e52650-v4-30m-cache-2-20-ghz/specifications.html Intel Xeon E5-2650 v4])
|~256GB (DDR4 2400MHz)
|~1.4TB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]])
|2 (1x [https://www.nvidia.com/en-gb/geforce/graphics-cards/geforce-gtx-1080-ti/specifications/ NVIDIA GeForce GTX 1080 Ti], 1x [https://www.nvidia.com/en-us/geforce/graphics-cards/compare/?section=compare-20 NVIDIA GeForce RTX 2080 Ti])
|-
|cbcb26
|1
|128 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7763.html AMD EPYC 7763])
|~512GB (DDR4 3200MHz)
|~3.4TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~14TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|7 ([https://www.nvidia.com/en-us/design-visualization/rtx-a5000 NVIDIA RTX A5000])
|-
|cbcb27
|1
|64 (Dual [https://www.amd.com/en/products/processors/server/epyc/7003-series/amd-epyc-7513.html AMD EPYC 7513])
|~256GB (DDR4 3200MHz)
|~3.4TB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~3.5TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|8 ([https://www.nvidia.com/en-us/design-visualization/rtx-a6000 NVIDIA RTX A6000])
|-
|cbcb[28-29]
|2
|32 (Dual [https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9124.html AMD EPYC 9124])
|~768GB (DDR5 4800MHz)
|~350GB (SATA SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~7TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|8 ([https://www.nvidia.com/en-us/design-visualization/rtx-6000 NVIDIA RTX 6000 Ada Generation])
|-
|cbcb30
|1
|48 (Single [https://www.amd.com/en/products/processors/server/epyc/9005-series/amd-epyc-9475f.html AMD EPYC 9475F])
|~1.15TB (DDR5 6400MHz)
|~350GB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch0]]), ~10.5TB (NVMe SSD [[FilesystemDataStorage#UNIX_Filesystem_Storage | /scratch1]])
|0
|- class="sortbottom"
!Total
|31
|1108 (various)
|~50TB (various)
|~105TB (various)
|33 (various)
|}

Here is the listing of nodes as shown by the Slurm alias <code>show_nodes</code> (again, all nodes are named in the format <code>cbcb##</code>):
<pre>
[root@nexusctl00 ~]# show_nodes | grep cbcb
NODELIST CPUS MEMORY AVAIL_FEATURES GRES STATE
cbcb00 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb01 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb02 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb03 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb04 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb05 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb06 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb07 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb08 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb09 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb10 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb11 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb12 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb13 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb14 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb15 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb16 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb17 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb18 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb19 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb20 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb21 32 2061175 rhel8,x86_64,Zen,EPYC-7313 (null) idle
cbcb22 28 771245 rhel8,x86_64,Xeon,E5-2680 (null) idle
cbcb23 24 255150 rhel8,x86_64,Xeon,E5-2650 (null) idle
cbcb24 24 255150 rhel8,x86_64,Xeon,E5-2650 (null) idle
cbcb25 24 255278 rhel8,x86_64,Xeon,E5-2650,Pascal,Turing gpu:rtx2080ti:1,gpu:gtx1080ti:1 idle
cbcb26 128 513243 rhel8,x86_64,Zen,EPYC-7763,Ampere gpu:rtxa5000:7 idle
cbcb27 64 255167 rhel8,x86_64,Zen,EPYC-7513,Ampere gpu:rtxa6000:8 idle
cbcb28 32 771166 rhel8,x86_64,Zen,EPYC-9124,Ada gpu:rtx6000ada:8 idle
cbcb29 32 771166 rhel8,x86_64,Zen,EPYC-9124,Ada gpu:rtx6000ada:8 idle
cbcb30 48 1157583 rhel8,x86_64,EPYC,EPYC-9475F (null) idle
</pre>

= Network =
The network infrastructure supporting the CBCB partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* cbcb[00-21,26-30]: Two 100GbE links per node, one to each switch in the pair (redundancy).
# One pair of network switches connected to the above pair of network switches via four 40GbE links, one between every combination of switches across the two pairings for redundancy, and to each other via dual 10GbE links for redundancy, serving the following compute nodes:
#* cbcb[22-25]: Two 10GbE links per node, one to each switch in the pair (redundancy).

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

= Partitions =
There are two partitions available to general CBCB [[SLURM]] users. You must specify one of these two partitions when submitting your job.

* '''cbcb''' - This is the default partition. Job allocations on all nodes except those also in the '''cbcb-heng''' partition are guaranteed.
* '''cbcb-interactive''' - This is a partition that only allows interactive jobs; you cannot submit jobs via <code>sbatch</code> to this partition. Job allocations are guaranteed.

There is one additional partition available solely to Dr. Heng Huang's sponsored accounts.

* '''cbcb-heng''' - This partition is for exclusive priority access to Dr. Huang's purchased GPU nodes. Job allocations are guaranteed.

= QoS =
CBCB users have access to all of the [[Nexus#Quality_of_Service_.28QoS.29 | standard job QoSes]] in the '''cbcb''' and '''cbcb-heng''' partitions using the <code>cbcb</code> account.

The additional job QoSes for the '''cbcb''' and '''cbcb-heng''' partitions specifically are:
* <code>highmem</code>: Allows for significantly increased memory to be allocated.
* <code>huge-long</code>: Allows for longer jobs using higher overall resources.

Please note that the partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.

The ''only'' allowed job QoS for the '''cbcb-interactive''' partition is:
* <code>interactive</code>: Allows for 4 CPU / 128G mem jobs up to 12 hours in length - can only be used via <code>srun</code> or <code>salloc</code>.

= Jobs =
You will need to specify <code>--partition=cbcb</code> and <code>--account=cbcb</code> to be able to submit jobs to the CBCB partition.

<pre>
[username@nexuscbcb00:~ ] $ srun --pty --ntasks=16 --mem=2000G --qos=highmem --partition=cbcb --account=cbcb --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@cbcb00:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
UserId=username(1000) GroupId=username(21000) MCS_label=N/A
Priority=897 Nice=0 Account=cbcb QOS=highmem
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
AccrueTime=2022-11-18T11:13:56
StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
Partition=cbcb AllocNode:Sid=nexuscbcb00:25443
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cbcb00
BatchHost=cbcb00
NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,mem=2000G,node=1,billing=2266
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/nfshomes/username
Power=
</pre>

= Storage =
CBCB still has its current [https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage storage] allocation in place. All data filesystems that were available in the standalone CBCB cluster are also available in Nexus. Please note the change in your home directory described in the migration section below.

CBCB users can also request [[Nexus#Project_Allocations | Nexus project allocations]].

= Migration =

== Operating System / Software ==
CBCB's standalone cluster submission and compute nodes were running RHEL7. [[Nexus]] is exclusively running RHEL8, so any software you compiled on the standalone cluster may need to be re-compiled to work correctly in this new environment. The [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules CBCB module tree] for RHEL8 may not yet be fully populated with RHEL8 software. If you do not see the modules you need, please reach out to the [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules#Contact CBCB software maintainers].

Nexus/Tron

2026-06-29T17:19:33Z

Mbaney:

The Tron partition is a subset of resources available in the [[Nexus]]. It was purchased using college-level funding for UMIACS and CSD faculty.

= Compute Nodes =
The partition contains 70 compute nodes with specs as detailed below.

{| class="wikitable sortable"
! Nodenames
! Type
! Quantity
! CPU cores per node
! Memory per node
! GPUs per node
|-
|tron[00-05]
|A6000 GPU Node
|6
|32
|256GB
|8
|-
|tron[06-45]
|A4000 GPU Node
|40
|16
|128GB
|4
|-
|tron[46-61]
|A5000 GPU Node
|16
|48
|256GB
|8
|-
|tron[62-69]
|RTX 2080 Ti GPU Node
|8
|32
|384GB
|8
|- class="sortbottom"
|tron[00-69]
!Total
|70
|1856
|13410GB
|400
|}

= Network =
The network infrastructure supporting the Tron partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* tron[00-05]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* tron[06-45]: Two 50GbE links per node, one to each switch in the pair (redundancy).
#* tron[46-61]: One 100GbE link per node. Half of the overall links for this set of nodes go to one switch in the pair, and the other half go to the other switch in the pair. These nodes do not have redundant links because the switches are currently at port capacity.
# One switch connected to the above pair of network switches via two 100GbE links, one to each switch in the pair for redundancy, serving the following compute nodes:
#* tron[62-69]: Two 10GbE links to the switch per node (increased bandwidth).
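
For a sense of scale, the nominal bandwidth on each side of the single switch serving tron[62-69] can be worked out from the figures above. This is back-of-the-envelope arithmetic only: it uses nominal link rates, ignores protocol overhead and real traffic patterns, and assumes both uplinks carry traffic at once, which the text does not state.

```shell
#!/bin/bash
# Nominal aggregate bandwidth around the switch serving tron[62-69].
nodes=8            # tron62 through tron69
links_per_node=2   # two 10GbE links per node
link_gbps=10
uplinks=2          # two 100GbE links to the main switch pair
uplink_gbps=100

node_side=$(( nodes * links_per_node * link_gbps ))
uplink_side=$(( uplinks * uplink_gbps ))
echo "node-facing=${node_side}Gb/s uplink=${uplink_side}Gb/s"
```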

The fileserver hosting all Nexus [[Nexus#Scratch_Directories | scratch]], [[Nexus#Faculty_Allocations | faculty]], [[Nexus#Project_Allocations | project]], and [[Nexus#Datasets | dataset]] allocations first connects to a pair of intermediary switches before reaching the compute nodes. The last hop from the pair of intermediary switches to the first pair of switches mentioned on this page (the same pair that nodes tron[00-61] are on) is via four 100GbE links, one for each combination of switches across each pairing, for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

Nexus/Tron

2026-06-29T17:19:16Z

Mbaney:

The Tron partition is a subset of resources available in the [[Nexus]]. It was purchased using college-level funding for UMIACS and CSD faculty.

= Compute Nodes =
The partition contains 70 compute nodes with specs as detailed below.

{| class="wikitable sortable"
! Nodenames
! Type
! Quantity
! CPU cores per node
! Memory per node
! GPUs per node
|-
|tron[00-05]
|A6000 GPU Node
|6
|32
|256GB
|8
|-
|tron[06-45]
|A4000 GPU Node
|40
|16
|128GB
|4
|-
|tron[46-61]
|A5000 GPU Node
|16
|48
|256GB
|8
|-
|tron[62-69]
|RTX 2080 Ti GPU Node
|8
|32
|384GB
|8
|- class="sortbottom"
|tron[00-69]
!Total
|70
|1856
|13410GB
|400
|}

= Network =
The network infrastructure supporting the Tron partition consists of:
# One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
#* tron[00-05]: Two 100GbE links per node, one to each switch in the pair (redundancy).
#* tron[06-44]: Two 50GbE links per node, one to each switch in the pair (redundancy).
#* tron[46-61]: One 100GbE link per node. Half of the overall links for this set of nodes go to one switch in the pair, and the other half go to the other switch in the pair. These nodes do not have redundant links because the switches are currently at port capacity.
# One switch connected to the above pair of network switches via two 100GbE links, one to each switch in the pair for redundancy, serving the following compute nodes:
#* tron[62-69]: Two 10GbE links to the switch per node (increased bandwidth).

The fileserver hosting all Nexus [[Nexus#Scratch_Directories | scratch]], [[Nexus#Faculty_Allocations | faculty]], [[Nexus#Project_Allocations | project]], and [[Nexus#Datasets | dataset]] allocations first connects to a pair of intermediary switches before reaching the compute nodes. The last hop from the pair of intermediary switches to the first pair of switches mentioned on this page (the same pair that nodes tron[00-44,46-61] are on) is via four 100GbE links, one for each combination of switches across each pairing, for redundancy and increased bandwidth.

For a broader overview of the network infrastructure supporting the Nexus cluster, please see [[Nexus/Network]].

Nexus/MBRC

2026-06-26T15:52:51Z

Mbaney: /* Project Directories */

The compute nodes from [https://mbrc.umd.edu MBRC]'s previous standalone cluster were folded into [[Nexus]] in mid-2023.

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

= Submission Nodes =
You can [[SSH]] to <code>nexusmbrc.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexusmbrc00.umiacs.umd.edu</code>
* <code>nexusmbrc01.umiacs.umd.edu</code>

= Compute Nodes =
The MBRC partition only has older nodes purchased by other labs/centers. The compute nodes are named <code>legacy##</code>.

= QoS =
MBRC users have access to all of the [[Nexus#Quality_of_Service_.28QoS.29 | standard job QoSes]] in the <code>mbrc</code> partition using the <code>mbrc</code> account.

The additional job QoSes for the MBRC partition specifically are:
* <code>huge-long</code>: Allows for longer jobs using higher overall resources.

Please note that the partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.

= Jobs =
You will need to specify <code>--partition=mbrc</code> and <code>--account=mbrc</code> to be able to submit jobs to the MBRC partition.

<pre>
[username@nexusmbrc00:~ ] $ srun --pty --ntasks=4 --mem=8G --qos=default --partition=mbrc --account=mbrc --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@legacy00:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
UserId=username(1000) GroupId=username(21000) MCS_label=N/A
Priority=897 Nice=0 Account=mbrc QOS=default
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
AccrueTime=2022-11-18T11:13:56
StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
Partition=mbrc AllocNode:Sid=nexusmbrc00:25443
ReqNodeList=(null) ExcNodeList=(null)
NodeList=legacy00
BatchHost=legacy00
NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=8G,node=1,billing=2266
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=8G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/nfshomes/username
Power=
</pre>

= Storage =
In addition to [[Nexus#Storage | storage types available to all Nexus users]], MBRC users can also request MBRC project directories.

== Project Directories ==
For this cluster, we have decided to allocate network storage on a project-by-project basis. Jonathan Heagerty is the point of contact for allocating the requested/required storage for each project. The MBRC Cluster as a whole has limited network storage, so there are limits on how much network storage can be allocated and for how long.

If the requested storage size is large relative to the total allotted amount, Jonathan Heagerty will relay the request to the MBRC Cluster faculty for approval. Two other situations also require approval from the MBRC Cluster faculty: a request to increase a project's current storage allotment, and a request to extend the time for a project's storage.

When requesting storage, please provide the following information when [[HelpDesk | contacting staff]]:
* Name of user requesting storage:
** Example: jheager2
* Name of project:
** Example: Foveated Rendering
* Collaborators working on the project:
** Example: sidali
* Storage size:
** Example: 1TB
* Length of time for storage:
** Example: 6-8 months

Nexus/MBRC

2026-06-26T15:52:40Z

Mbaney:

The compute nodes from [https://mbrc.umd.edu MBRC]'s previous standalone cluster were folded into [[Nexus]] in mid-2023.

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

= Submission Nodes =
You can [[SSH]] to <code>nexusmbrc.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexusmbrc00.umiacs.umd.edu</code>
* <code>nexusmbrc01.umiacs.umd.edu</code>

= Compute Nodes =
The MBRC partition only has older nodes purchased by other labs/centers. The compute nodes are named <code>legacy##</code>.

= QoS =
MBRC users have access to all of the [[Nexus#Quality_of_Service_.28QoS.29 | standard job QoSes]] in the <code>mbrc</code> partition using the <code>mbrc</code> account.

The additional job QoSes for the MBRC partition specifically are:
* <code>huge-long</code>: Allows for longer jobs using higher overall resources.

Please note that the partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.

= Jobs =
You will need to specify <code>--partition=mbrc</code> and <code>--account=mbrc</code> to be able to submit jobs to the MBRC partition.

<pre>
[username@nexusmbrc00:~ ] $ srun --pty --ntasks=4 --mem=8G --qos=default --partition=mbrc --account=mbrc --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@legacy00:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
UserId=username(1000) GroupId=username(21000) MCS_label=N/A
Priority=897 Nice=0 Account=mbrc QOS=default
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
AccrueTime=2022-11-18T11:13:56
StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
Partition=mbrc AllocNode:Sid=nexusmbrc00:25443
ReqNodeList=(null) ExcNodeList=(null)
NodeList=legacy00
BatchHost=legacy00
NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=8G,node=1,billing=2266
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=8G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/nfshomes/username
Power=
</pre>

= Storage =
In addition to [[Nexus#Storage | storage types available to all Nexus users]], MBRC users can also request MBRC project directories.

== Project Directories ==
For this cluster, we have decided to allocate network storage on a project-by-project basis. Jonathan Heagerty is the point of contact for allocating the requested/required storage for each project. The MBRC Cluster as a whole has limited network storage, so there are limits on how much network storage can be allocated and for how long.

If the requested storage size is large relative to the total allotted amount, Jonathan Heagerty will relay the request to the MBRC Cluster faculty for approval. Two other situations also require approval from the MBRC Cluster faculty: a request to increase a project's current storage allotment, and a request to extend the time for a project's storage.

When requesting storage, please provide the following information when [[HelpDesk | contacting staff]]:
* Name of user requesting storage:
** Example: jheager2
* Name of project:
** Example: Foveated Rendering
* Collaborators working on the project:
** Example: Sida Li
* Storage size:
** Example: 1TB
* Length of time for storage:
** Example: 6-8 months

Nexus/MBRC

2026-06-26T14:14:03Z

Mbaney:

The compute nodes from [https://mbrc.umd.edu MBRC]'s previous standalone cluster were folded into [[Nexus]] in mid-2023.

The Nexus cluster already has a large pool of compute resources made possible through college-level funding for UMIACS and CSD faculty. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].

Please [[HelpDesk | contact staff]] with any questions or concerns.

= Submission Nodes =
You can [[SSH]] to <code>nexusmbrc.umiacs.umd.edu</code> to log in to a submission node.

If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission nodes, you will need to connect to that same submission node to access it later. The actual submission nodes are:
* <code>nexusmbrc00.umiacs.umd.edu</code>
* <code>nexusmbrc01.umiacs.umd.edu</code>

= Compute Nodes =
The MBRC partition only has older nodes purchased by other labs/centers. The compute nodes are named <code>legacy##</code>.

= QoS =
MBRC users have access to all of the [[Nexus#Quality_of_Service_.28QoS.29 | standard job QoSes]] in the <code>mbrc</code> partition using the <code>mbrc</code> account.

The additional job QoSes for the MBRC partition specifically are:
* <code>huge-long</code>: Allows for longer jobs using higher overall resources.

Please note that the partition has a <code>GrpTRES</code> limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.

= Jobs =
You will need to specify <code>--partition=mbrc</code> and <code>--account=mbrc</code> to be able to submit jobs to the MBRC partition.

<pre>
[username@nexusmbrc00:~ ] $ srun --pty --ntasks=4 --mem=8G --qos=default --partition=mbrc --account=mbrc --time 1-00:00:00 bash
srun: job 218874 queued and waiting for resources
srun: job 218874 has been allocated resources
[username@mbrc00:~ ] $ scontrol show job 218874
JobId=218874 JobName=bash
UserId=username(1000) GroupId=username(21000) MCS_label=N/A
Priority=897 Nice=0 Account=mbrc QOS=default
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56
AccrueTime=2022-11-18T11:13:56
StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A
PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main
Partition=mbrc AllocNode:Sid=nexusmbrc00:25443
ReqNodeList=(null) ExcNodeList=(null)
NodeList=mbrc00
BatchHost=mbrc00
NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=8G,node=1,billing=2266
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=8G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/nfshomes/username
Power=
</pre>

= Storage =
In addition to [[Nexus#Storage | storage types available to all Nexus users]], MBRC users can also request MBRC project directories.

== Project Directories ==
For this cluster, we have decided to allocate network storage on a project-by-project basis. Jonathan Heagerty is the point of contact for allocating the requested/required storage for each project. The MBRC Cluster as a whole has limited network storage, so there are limits on how much network storage can be allocated and for how long.

If the requested storage size is large relative to the total allotted amount, Jonathan Heagerty will relay the request to the MBRC Cluster faculty for approval. Two other situations also require approval from the MBRC Cluster faculty: a request to increase a project's current storage allotment, and a request to extend the time for a project's storage.

When requesting storage, please provide the following information when [[HelpDesk | contacting staff]]:
* Name of user requesting storage:
** Example: jheager2
* Name of project:
** Example: Foveated Rendering
* Collaborators working on the project:
** Example: Sida Li
* Storage size:
** Example: 1TB
* Length of time for storage:
** Example: 6-8 months

MonthlyMaintenanceWindow

2026-06-19T02:32:54Z

Mbaney:

[[HelpDesk | UMIACS staff]] takes a monthly maintenance window to patch and reboot all UMIACS-supported hosts and services. This provides a way for staff to ensure security updates are installed and applied on the numerous different platforms and appliances that UMIACS runs.

The window for each month is calculated by adding 9 days to [https://en.wikipedia.org/wiki/Patch_Tuesday Microsoft's Patch Tuesday] to allow for enough time to marshal patches released that month from Microsoft, Red Hat, Apple, and other OS and application vendors and have enough time to get systems prepared to reboot. This translates to the window being on the '''Thursday that occurs between the 17th and the 23rd (inclusive)''' of each month. The window lasts from '''5pm-8pm'''.
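
As a rough illustration of the calculation above, the following Python sketch derives the nominal window date for a month by finding the second Tuesday (Patch Tuesday) and adding 9 days. The function names are illustrative assumptions and are not part of any UMIACS tooling. Dates that staff adjust by hand (for example, for breaks or power outages) are not modeled.

```python
from datetime import date, timedelta


def patch_tuesday(year: int, month: int) -> date:
    """Return the second Tuesday of the given month."""
    first = date(year, month, 1)
    # date.weekday() counts Monday as 0, so Tuesday is 1.
    days_until_first_tuesday = (1 - first.weekday()) % 7
    return first + timedelta(days=days_until_first_tuesday + 7)


def maintenance_window(year: int, month: int) -> date:
    """Return the nominal window date: Patch Tuesday plus 9 days."""
    return patch_tuesday(year, month) + timedelta(days=9)


# Nominal window dates for the second half of 2026.
for month in range(7, 13):
    print(maintenance_window(2026, month).isoformat())
# First line printed: 2026-07-23
```

Because the second Tuesday always falls between the 8th and the 14th, adding 9 days always yields a Thursday between the 17th and the 23rd.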

[[Nexus]] will always have a reservation in place from 4:45pm-8pm on the day of the upcoming window to prevent jobs from being scheduled on compute nodes. The 15-minute addition before the start of the window is to allow jobs to fully end. Any job submitted before the reservation begins that has a time limit that would run into the reservation will be held until at least the end of the reservation - 8pm on the day of the window. This is to prevent issues with jobs failing to end properly causing delays in work we have scheduled during the window.
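
The hold behavior described above can be approximated with a simple interval-overlap check. This is a simplified model for illustration only, not SLURM's actual scheduling logic; the function name and the example date are assumptions.

```python
from datetime import date, datetime, time, timedelta

RESERVATION_START = time(16, 45)  # 4:45pm on the day of the window
RESERVATION_END = time(20, 0)     # 8pm on the day of the window


def runs_into_reservation(start: datetime, time_limit: timedelta, window_day: date) -> bool:
    """Return True if a job starting at `start` could still be running during the reservation."""
    res_start = datetime.combine(window_day, RESERVATION_START)
    res_end = datetime.combine(window_day, RESERVATION_END)
    # Two intervals overlap when each one starts before the other ends.
    return start < res_end and start + time_limit > res_start


window_day = date(2026, 7, 23)
morning = datetime(2026, 7, 23, 9, 0)
print(runs_into_reservation(morning, timedelta(hours=4), window_day))  # False
print(runs_into_reservation(morning, timedelta(days=1), window_day))   # True
```

In this model, a job that would return True is one that would be held until at least 8pm, so requesting a shorter time limit is one way to let a job start before the window.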

A list of upcoming maintenance windows is as follows, with the next one in bold. Again, the window is on the '''Thursday that occurs between the 17th and the 23rd (inclusive)''' of each month, and lasts from '''5pm-8pm'''.

* '''July 23rd 2026'''
* August 20th 2026
* September 17th 2026
* October 22nd 2026
* November 19th 2026
* December 17th 2026

==Archives==
* January 17th 2013 - BEGIN time of 8pm-12am for this window through February 20th 2020
* February 21st 2013
* March 21st 2013
* April 18th 2013
* May 23rd 2013
* June 20th 2013
* July 18th 2013
* August 22nd 2013
* September 19th 2013
* October 17th 2013
* December 19th 2013
* January 23rd 2014
* February 20th 2014
* March 20th 2014
* April 17th 2014
* May 22nd 2014
* June 19th 2014
* July 17th 2014
* August 21st 2014
* September 18th 2014
* October 23rd 2014
* November 20th 2014
* December 18th 2014
* January 22nd 2015
* February 19th 2015
* March 19th 2015
* May 21st 2015
* June 18th 2015
* July 23rd 2015
* August 20th 2015
* September 17th 2015
* October 22nd 2015
* November 19th 2015
* December 17th 2015
* January 21st 2016
* February 18th 2016
* March 12th 2016 (Adjusted date for AVW power outage)
* April 21st 2016
* May 19th 2016
* June 23rd 2016
* July 21st 2016
* August 18th 2016
* September 22nd 2016
* October 20th 2016
* November 17th 2016
* December 22nd 2016
* January 19th 2017
* February 23rd 2017
* March 23rd 2017
* April 20th 2017
* May 18th 2017
* June 22nd 2017
* July 20th 2017
* August 17th 2017
* September 21st 2017
* October 19th 2017
* December 21st 2017
* January 18th 2018
* February 22nd 2018
* March 22nd 2018
* April 19th 2018
* May 17th 2018
* June 21st 2018
* July 19th 2018
* August 23rd 2018
* September 20th 2018
* October 18th 2018
* December 20th 2018
* January 24th 2019
* February 21st 2019
* April 18th 2019
* May 23rd 2019
* June 20th 2019
* July 18th 2019
* August 22nd 2019
* September 19th 2019
* October 17th 2019
* November 21st 2019
* December 19th 2019
* January 23rd 2020
* February 20th 2020
* April 23rd 2020 - BEGIN time of 5pm-7pm for this window through August 19th 2021
* June 18th 2020
* July 23rd 2020
* August 20th 2020
* September 17th 2020
* October 22nd 2020
* November 19th 2020
* December 17th 2020
* January 21st 2021
* February 18th 2021
* March 25th 2021 (Adjusted date for extended Spring Break)
* April 22nd 2021
* May 20th 2021
* June 17th 2021
* July 22nd 2021
* August 19th 2021
* September 23rd 2021 - BEGIN time of 5pm-8pm for this window and all others below
* October 21st 2021
* November 18th 2021
* January 20th 2022
* February 17th 2022
* March 24th 2022 (Adjusted date for Spring Break)
* April 21st 2022
* May 19th 2022
* June 23rd 2022
* July 21st 2022
* August 18th 2022
* September 22nd 2022
* October 20th 2022
* November 17th 2022
* January 19th 2023
* February 23rd 2023
* April 20th 2023
* May 18th 2023
* June 22nd 2023
* July 20th 2023
* August 17th 2023
* September 21st 2023
* October 19th 2023
* December 20th 2023 (Adjusted date for early Winter Break)
* January 18th 2024
* February 22nd 2024
* March 21st 2024
* April 18th 2024
* May 23rd 2024
* June 20th 2024
* July 18th 2024
* August 22nd 2024
* September 19th 2024
* October 17th 2024
* November 21st 2024
* December 19th 2024
* January 23rd 2025
* February 20th 2025
* March 20th 2025
* April 17th 2025
* May 22nd 2025
* June 19th 2025
* July 17th 2025
* August 21st 2025
* September 18th 2025
* October 23rd 2025
* November 20th 2025
* December 18th 2025
* January 22nd 2026
* February 19th 2026
* March 19th 2026
* April 23rd 2026
* May 28th 2026 (Adjusted date for CIO-imposed network change freeze)
* June 18th 2026

Windows Patch Management

2026-06-12T15:35:18Z

Mbaney:

UMIACS uses Windows' built-in Windows Update mechanism to patch the Windows operating system, Windows device firmware and drivers, and other Microsoft products that receive updates through Windows Update. Third-party applications are patched in tandem by a software distribution tool called [https://umd-dit.atlassian.net/wiki/spaces/DMS/pages/45285463/Patch+My+PC Patch My PC] that operates through the Division of IT's managed [https://umd-dit.atlassian.net/wiki/spaces/DMS/pages/45285467/Intune Intune] service.

==Windows Update==
* '''Desktops''' will run updates available through Windows Update daily between 3am and 4am US Eastern.
* '''Laptops''' will run updates available through Windows Update at any time they are on and connected to the internet.

The only updates available through Windows Update that should require computer restarts are Windows operating system monthly rollups (released on [https://en.wikipedia.org/wiki/Patch_Tuesday Microsoft's Patch Tuesday]), new [[WindowsServicing | feature updates]] (pushed by us annually), and some device firmware or driver updates.

After an update that requires a reboot is installed on your computer, you will receive a notification stating that your machine needs to be restarted within the next 7 days. You can choose either to restart immediately or to schedule the restart. If you do not restart by the deadline, your computer will automatically restart no more than a day after the deadline has passed.

==Patch My PC==
Patches are deployed as they come out for the Patch My PC catalog. They should install automatically and require little to no input.

Nexus/ClusterOSUpgrade

2026-06-10T16:05:53Z

Mbaney:

==Overview==
UMIACS Technical Staff has begun the process of upgrading the operating system version on all [[Nexus]] cluster nodes from [[RHEL | Red Hat Enterprise Linux (RHEL)]] 8 to 9 as of 9am on 06/01/2026.

RHEL8 is in the Maintenance Support phase of its life cycle and is transitioning to the Extended Life phase in 2029. More information on Red Hat's lifecycle policy for its operating systems can be found [https://access.redhat.com/support/policy/updates/errata here]. We are staying well ahead of the Extended Life phase date for our cluster nodes by performing these upgrades now.

RHEL9 is still in the Full Support phase of its life cycle and introduces a newer major Linux kernel version and newer [https://www.gnu.org/software/libc glibc] version, improving compatibility with many newer software applications.

==Scheduling==
'''Upgrades for all cluster nodes have begun.''' We expect to be finished with all cluster node upgrades no later than Friday 08/21/2026 at 5pm.

===[[SLURM/JobSubmission | Submission Nodes]]===
'''Submission nodes with the number '01' in their hostnames have been upgraded as of 06/01/2026.'''

Submission nodes with the number '00' in their hostnames will be scheduled for upgrade individually, once all of the compute nodes associated with the same lab/center have been upgraded. Staff will send a notification to each lab's/center's cluster users to schedule the relevant '00' node's upgrade when applicable. The actual date of each upgrade will be no less than one week after the corresponding notification has been sent.

Data in [[FilesystemDataStorage#UNIX_Filesystem_Storage | UNIX filesystem storage]] spaces on each submission node, i.e., /tmp and /scratch0, will not be preserved during upgrade. If you have any data in any such space on the '00' submission node in a pairing that you want to keep, please ensure you copy it to the '01' submission node or a [[FilesystemDataStorage#Network-Attached_Filesystem_Storage | network-attached filesystem storage]] space prior to the '00' node's upgrade date. Data in network-attached filesystem storage spaces, such as /nfshomes or /fs/nexus-scratch, will not be affected.

===Compute Nodes===
Due to the large number of compute nodes and the desire to not interrupt running jobs, we are not generally able to schedule each specific compute node upgrade on a specific date. If you find that a specific node is unavailable to schedule jobs on, you can run the command <code>sinfo --list-reasons --long</code> on a submission node and look to see if the node is in the list with the text "RHEL9 upgrade" - if this is present, the upgrade for that node is underway.
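
If you check for this often, the lookup can be scripted. The Python sketch below filters text in the shape of <code>sinfo --list-reasons --long</code> output for the "RHEL9 upgrade" reason. The sample text is made up for illustration, and the helper assumes the node list is the last whitespace-separated column of each row, which may not hold for every sinfo output format.

```python
def nodes_under_upgrade(sinfo_output: str, marker: str = "RHEL9 upgrade") -> list[str]:
    """Return the node list entries from rows whose reason contains `marker`."""
    nodes = []
    for line in sinfo_output.splitlines():
        if marker in line:
            # Assumption: NODELIST is the final column of the row.
            nodes.append(line.split()[-1])
    return nodes


# Hypothetical sample output; real node names, users, and timestamps will differ.
SAMPLE = """\
REASON               USER      TIMESTAMP            STATE    NODELIST
RHEL9 upgrade        root      2026-06-09T10:15:00  drained  tron[06-09]
Not responding       slurm     2026-06-08T03:12:44  down*    legacy03
"""

print(nodes_under_upgrade(SAMPLE))  # ['tron[06-09]']
```

On a submission node, the real output could be captured with <code>subprocess.run(["sinfo", "--list-reasons", "--long"], capture_output=True, text=True).stdout</code> and passed to the same function.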

We will generally be prioritizing upgrades for nodes based on how available they are across various partitions; nodes that are only available in partitions that contain large numbers of users for a lab/center, e.g., cbcb, clip, cml-dpart, gamma, vulcan-ampere, vulcan-dpart, etc., and corresponding "scavenger" named partitions, will be prioritized over nodes that are only or are also available in faculty-specific / limited-node partitions. All nodes in the tron partition will also generally be prioritized.

If you are a faculty member authoritative for your own partition or a small group's limited-node partition and have scheduling concerns for the nodes in these partitions, please [[HelpDesk | contact staff]] ASAP to let us know about these concerns and we will make our best effort to accommodate them.

==Interoperability==
===Software and Modules===
Please begin transitioning your [[PythonVirtualEnv | virtual environments]], workflows, etc. to work with RHEL9 as soon as possible. You can use the '01' submission node that you have access to for transitioning and light testing - as always, [[Nexus/Submission_Node_Policy | please do not run any computationally intensive processes on this node]]. It is intended to be a host for configuring environments/workflows and submitting jobs only.

The [[Modules | module tree]] for RHEL9 has already been populated with a large number of the same modules that are available in the RHEL8 module tree, although specific modules may have different versions available in the RHEL9 tree as compared to the RHEL8 tree. If you have a dependency on a specific version of a module that is not available in the RHEL9 tree, please [[HelpDesk | contact staff]] and we can get one created.

===SLURM Scheduling===
If you want or need to schedule a job only on nodes running RHEL8 (or RHEL9, once you have validated whatever is relevant), you can add <code>--prefer=rhel#</code> or <code>--constraint=rhel#</code> to your job submission arguments, where # is replaced by the OS version number. The <code>--prefer</code> argument is a soft limitation on which nodes the job can be scheduled on, and the <code>--constraint</code> argument is a hard limitation. For example, if you use <code>--prefer=rhel8</code> but no RHEL8 nodes that satisfy your other submission arguments are currently available in the partition you are submitting to, the job will be scheduled on an appropriate RHEL9 node if that would result in an earlier (or immediate) start time.
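
For workflows that generate submission commands programmatically, the choice between the two arguments can be captured in a small helper. The Python sketch below only builds the argument strings; the helper name and the <code>job.sh</code> script name are hypothetical, and the partition, account, and QoS values are borrowed from the MBRC example purely as placeholders.

```python
import shlex


def os_feature_args(rhel_version: int, hard: bool) -> list[str]:
    """Build the OS selection argument: --constraint (hard) or --prefer (soft)."""
    flag = "--constraint" if hard else "--prefer"
    return [f"{flag}=rhel{rhel_version}"]


# Soft preference for RHEL8: the job may still land on a RHEL9 node.
soft = ["sbatch", *os_feature_args(8, hard=False),
        "--partition=mbrc", "--account=mbrc", "--qos=default", "job.sh"]

# Hard requirement for RHEL9: the job waits until a RHEL9 node is available.
hard = ["sbatch", *os_feature_args(9, hard=True),
        "--partition=mbrc", "--account=mbrc", "--qos=default", "job.sh"]

print(shlex.join(soft))
print(shlex.join(hard))
```

A soft preference is usually the safer default while nodes are still being upgraded, since a hard constraint on RHEL8 will match fewer and fewer nodes over time.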

Nexus/ClusterOSUpgrade

2026-06-10T15:54:38Z

Mbaney:

==Overview==
UMIACS Technical Staff has begun the process of upgrading the operating system version on all [[Nexus]] cluster nodes from [[RHEL | Red Hat Enterprise Linux (RHEL)]] 8 to 9 as of 9am on 06/01/2026.

RHEL8 is in the Maintenance Support phase of its life cycle and is transitioning to the Extended Life phase in 2029. More information on Red Hat's lifecycle policy for its operating systems can be found [https://access.redhat.com/support/policy/updates/errata here]. We are staying well ahead of the Extended Life phase date for our cluster nodes by performing these upgrades now.

RHEL9 is still in the Full Support phase of its life cycle and introduces a newer major Linux kernel version and newer [https://www.gnu.org/software/libc glibc] version, improving compatibility with many newer software applications.

==Scheduling==
'''Upgrades for all cluster nodes have begun.''' We expect to be finished with all cluster node upgrades no later than Friday 08/21/2026 at 5pm.

===[[SLURM/JobSubmission | Submission Nodes]]===
'''Submission nodes with the number '01' in their hostnames have been upgraded as of 06/01/2026.'''

Submission nodes with the number '00' in their hostnames will be scheduled for upgrade individually, once all of the compute nodes associated with the same lab/center have been upgraded. Staff will send a notification to each lab's/center's cluster users to schedule the relevant '00' node's upgrade when applicable. The actual date of each upgrade will be no less than one week after the corresponding notification has been sent.

Data in [[FilesystemDataStorage#UNIX_Filesystem_Storage | UNIX filesystem storage]] spaces on each submission node, i.e., /tmp and /scratch0, will not be preserved during upgrade. If you have any data in any such space on either submission node in a pairing that you want to keep, please ensure you copy it to the other submission node or a [[FilesystemDataStorage#Network-Attached_Filesystem_Storage | network-attached filesystem storage]] space prior to each node's upgrade date. Data in network-attached filesystem storage spaces, such as /nfshomes or /fs/nexus-scratch, will not be affected.

===Compute Nodes===
Due to the large number of compute nodes and the desire to not interrupt running jobs, we are not generally able to schedule each specific compute node upgrade on a specific date. If you find that a specific node is unavailable to schedule jobs on, you can run the command <code>sinfo --list-reasons --long</code> on a submission node and look to see if the node is in the list with the text "RHEL9 upgrade" - if this is present, the upgrade for that node is underway.

We will generally be prioritizing upgrades for nodes based on how available they are across various partitions; nodes that are only available in partitions that contain large numbers of users for a lab/center, e.g., cbcb, clip, cml-dpart, gamma, vulcan-ampere, vulcan-dpart, etc., and corresponding "scavenger" named partitions, will be prioritized over nodes that are only or are also available in faculty-specific / limited-node partitions. All nodes in the tron partition will also generally be prioritized.

If you are a faculty member authoritative for your own partition or a small group's limited-node partition and have scheduling concerns for the nodes in these partitions, please [[HelpDesk | contact staff]] ASAP to let us know about these concerns and we will make our best effort to accommodate them.

==Interoperability==
===Software and Modules===
Please begin transitioning your [[PythonVirtualEnv | virtual environments]], workflows, etc. to work with RHEL9 as soon as possible. You can use the '01' submission node that you have access to for transitioning and light testing - as always, [[Nexus/Submission_Node_Policy | please do not run any computationally intensive processes on this node]]. It is intended to be a host for configuring environments/workflows and submitting jobs only.

The [[Modules | module tree]] for RHEL9 has already been populated with a large number of the same modules that are available in the RHEL8 module tree, although specific modules may have different versions available in the RHEL9 tree as compared to the RHEL8 tree. If you have a dependency on a specific version of a module that is not available in the RHEL9 tree, please [[HelpDesk | contact staff]] and we can get one created.

===SLURM Scheduling===
If you want or need to schedule a job only on nodes running RHEL8 (or RHEL9, once you have validated whatever is relevant), you can add <code>--prefer=rhel#</code> or <code>--constraint=rhel#</code> to your job submission arguments, where # is replaced by the OS version number. The <code>--prefer</code> argument is a soft limitation on which nodes the job can be scheduled on, and the <code>--constraint</code> argument is a hard limitation. For example, if you use <code>--prefer=rhel8</code> but no RHEL8 nodes that satisfy your other submission arguments are currently available in the partition you are submitting to, the job will be scheduled on an appropriate RHEL9 node if that would result in an earlier (or immediate) start time.

Nexus

2026-06-01T17:21:55Z

Mbaney:

{{Note|UMIACS Technical Staff has begun the process of upgrading the operating system version on all Nexus cluster nodes as of 06/01/2026. Please see [[Nexus/ClusterOSUpgrade]] for more information.}}

The Nexus is the combined scheduler of resources in UMIACS. The resource manager for Nexus is [[SLURM]]. Resources are arranged into partitions where users are able to schedule computational jobs. Users are arranged into a number of SLURM accounts based on faculty, lab, or center investments.

= Getting Started =
All accounts in UMIACS are sponsored. If you don't already have a UMIACS account, please see [[Accounts]] for information on getting one. You need a full UMIACS account - not a [[Accounts/Collaborator | collaborator account]] - in order to access Nexus.

== Access ==
Your access to submission nodes (alternatively called login nodes) for Nexus computational resources is determined by your account sponsor's department, center, or lab affiliation. You can log into the [https://intranet.umiacs.umd.edu/directory/cr/ UMIACS Directory CR application] and select the Computational Resource (CR) in the list that has the prefix <code>nexus</code>. The Hosts section lists your available submission nodes - generally a pair of nodes with hostnames of the format <tt>nexus<department, lab, or center abbreviation>[00,01]</tt>, e.g., <tt>nexusgroup00</tt> and <tt>nexusgroup01</tt>.

Once you have identified your submission nodes, you can [[SSH]] into them [https://itsupport.umd.edu/itsupport?id=kb_article_view&sysparm_article=KB0016076 after connecting to UMD's GlobalProtect VPN]. From there, you are able to submit to the cluster via our [[SLURM]] workload manager. You need to make sure that your submitted jobs have the correct account, partition, and qos.

Please read our [[Nexus/Submission_Node_Policy|Submission Node Policy]] for guidance on appropriate usage of a submission node. If a submission node becomes unresponsive due to disregarding this policy, we may kill user processes on these nodes to resolve the issue. We reserve the right to take action on users who repeatedly cause issues on submission nodes.

== Jobs ==
[[SLURM]] jobs are [[SLURM/JobSubmission | submitted]] with either <code>srun</code> or <code>sbatch</code>, depending on whether you are running an interactive job or a batch job, respectively. You need to provide the who/where/how of running the job and specify the resources it needs.

For the who/where/how, you may be required to specify <code>--account</code>, <code>--partition</code>, and/or <code>--qos</code> (respectively) to be able to adequately submit jobs to the Nexus.

For resources, you may need to specify <code>--time</code> for time, <code>--cpus-per-task</code> for CPUs, <code>--mem</code> for RAM, and <code>--gres=gpu</code> for GPUs in your submission arguments to meet your requirements. There are defaults for all four; if you don't specify something, you will get the default value for that resource, which is minimal (e.g., by default, NO GPUs are included if you do not specify <code>--gres=gpu</code>). For more information about submission flags for GPU resources, see [[SLURM/JobSubmission#Requesting_GPUs | here]]. You may also use <code>--ntasks</code> to specify the number of parallel processes to run, with each task having its own set of the resources specified above. You can run <code>man srun</code> on your submission node for a complete list of available submission arguments.

For a list of available GPU types on Nexus and their specs, please see [[Nexus/GPUs]].

For details on how the network for Nexus is architected, please see [[Nexus/Network]]. This can be important if you wish to optimize performance of your jobs.

=== Interactive ===
Once logged into a submission node, you can run simple interactive jobs. If your connection to the submission node is interrupted, the job will be killed. As such, we encourage use of a terminal multiplexer such as [[Tmux]].

<pre>
$ srun --pty --cpus-per-task=4 --mem=2gb --gres=gpu:1 bash
srun: Job account was unset; set to user default of 'nexus'
srun: Job partition was unset; set to cluster default of 'tron'
srun: Job QoS was unset; set to association default of 'default'
srun: Job time limit was unset; set to partition default of 60 minutes
srun: job 1 queued and waiting for resources
srun: job 1 has been allocated resources
$ hostname
tron62.umiacs.umd.edu
$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-daad6a04-a2ce-1183-ce53-b267048f750a)
</pre>

=== Batch ===
Batch jobs are scheduled with a script file, which can optionally embed job scheduling parameters via <code>#SBATCH</code> lines at the top of the file. You can find some examples in our [[SLURM/JobSubmission]] documentation.
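As a minimal sketch of the <code>#SBATCH</code> form described above, the following batch script requests the default account, partition, and QoS shown in the interactive example, plus a small set of resources. The specific values (one hour, 4 CPUs, 8GB RAM, 1 GPU) and the variable name <code>threads</code> are illustrative assumptions rather than recommendations. <code>SLURM_CPUS_PER_TASK</code> and <code>SLURM_JOB_ID</code> are environment variables that SLURM sets inside a job.

```shell
#!/bin/bash
# Scheduling parameters: who (account), where (partition), how (QoS).
#SBATCH --account=nexus
#SBATCH --partition=tron
#SBATCH --qos=default
# Resources: wall time, CPUs, RAM, and GPUs (illustrative values).
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8gb
#SBATCH --gres=gpu:1

# SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# fall back to 1 so the script also runs outside of a job.
threads=${SLURM_CPUS_PER_TASK:-1}
echo "Job ${SLURM_JOB_ID:-none} using $threads CPU thread(s)"
```

Saved under a name of your choosing, e.g. <code>example_job.sh</code> (hypothetical), the script could be submitted with <code>sbatch example_job.sh</code>.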

= Partitions =
The SLURM resource manager uses partitions as job queues, each of which can impose its own size, time, and user limits. The Nexus has a number of different partitions of resources. Different Centers, Labs, and Faculty are able to invest in computational resources that are restricted to approved users through these partitions.

'''Partitions usable by all non-[[ClassAccounts |class account]] users:'''
* [[Nexus/Tron]] - Pool of resources available to all non-class accounts sponsored by either UMIACS or CSD faculty.
* Scavenger - [https://slurm.schedmd.com/preempt.html Preemption] partition that contains [https://en.wikipedia.org/wiki/X86-64 x86_64] architecture nodes from multiple other partitions. More resources are available to schedule simultaneously than in other partitions; however, jobs are subject to preemption rules. You are responsible for ensuring your jobs handle this preemption correctly. The SLURM scheduler will simply restart a preempted job with the same submission arguments when it is available to run again. For an overview of things you can check within scripts to determine if your job was preempted/resumed, see [[SLURM/Preemption]].
* Scavenger (aarch64) - Preemption partition identical in design to <tt>scavenger</tt>, but only contains [https://en.wikipedia.org/wiki/AArch64 aarch64] architecture nodes.
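
Because a preempted job is restarted with the same submission arguments, one common way to handle preemption is to record progress in a file on network storage and resume from it. The sketch below assumes your work can be split into numbered steps. The function name <code>run_steps</code> and the checkpoint file are hypothetical and are not part of any UMIACS tooling.

```shell
#!/bin/bash
# Hypothetical resume-aware loop: the number of completed steps is kept in a
# checkpoint file, so a job restarted after preemption continues where it left off.
run_steps() {
    local checkpoint=$1 total=$2 step
    # Read the last completed step; start from 0 if there is no checkpoint yet.
    if [ -s "$checkpoint" ]; then
        step=$(cat "$checkpoint")
    else
        step=0
    fi
    while [ "$step" -lt "$total" ]; do
        # ... one unit of real work would go here ...
        step=$(( step + 1 ))
        # Record progress only after the step has finished.
        echo "$step" > "$checkpoint"
    done
    echo "$step"
}
```

Inside a batch script this might be called as <code>run_steps /fs/nexus-scratch/<USERNAME>/progress 100</code>, where the path and step count are placeholders.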

'''Partitions usable by [[ClassAccounts]]:'''
* [[ClassAccounts#Cluster_Usage | Class]] - Pool of resources available to class accounts sponsored by either UMIACS or CSD faculty.

'''Partitions usable by specific lab/center users:'''
* [[Nexus/CBCB]] - CBCB lab pool available for CBCB lab members.
* [[Nexus/CLIP]] - CLIP lab pool available for CLIP lab members.
* [[Nexus/CML]] - CML lab pool available for CML lab members.
* [[Nexus/GAMMA]] - GAMMA lab pool available for GAMMA lab members.
* [[Nexus/MBRC]] - MBRC lab pool available for MBRC lab members.
* [[Nexus/MC2]] - MC2 lab pool available for MC2 lab members.
* [[Nexus/QuICS]] - QuICS lab pool available for QuICS lab members.
* [[Nexus/Vulcan]] - Vulcan lab pool available for Vulcan lab members.

You can view the partitions that you have access to by using the <code>show_partitions</code> command. By default, the command will show only the partitions that are available to you.

<pre>
$ show_partitions
Name AllowAccounts AllowQos MaxNodes Nodes
------------------------ ----------------------- ------------------------------ ----------- ----------------------------
scavenger scavenger scavenger UNLIMITED brigid[16-19]
cbcb[00-29]
clip[00-13]
cml[00,02-13,15-28,30-33]
cmlcpu[00-04,06-07]
gammagpu[00-21]
legacy[00-11,13-28,30-36]
legacygpu[00-07]
quics00
tron[00-44,46-69]
vulcan[00-45]
------------------------------------------------------------------------------------------------------------------------
scavenger-aarch64 scavenger scavenger-aarch64 UNLIMITED oasis[00-39]
------------------------------------------------------------------------------------------------------------------------
tron nexus default UNLIMITED tron[00-44,46-69]
high
medium
</pre>

If you want to see information for all of the partitions, including those that you do not have access to, you can use the <code>show_partitions --all</code> command.

<pre>
$ show_partitions --all
Name AllowAccounts AllowQos MaxNodes Nodes
------------------------ ----------------------- ------------------------------ ----------- ----------------------------
cbcb cbcb default UNLIMITED cbcb[00-20,22-29]
medium legacy[00-11,13-28,30-36]
high
huge-long
highmem
------------------------------------------------------------------------------------------------------------------------
cbcb-heng cbcb-heng default UNLIMITED cbcb[26-29]
medium
high
huge-long
highmem
------------------------------------------------------------------------------------------------------------------------
cbcb-interactive cbcb interactive UNLIMITED cbcb21
...
</pre>

= Quality of Service (QoS) =
SLURM uses Quality of Service (QoS) both to provide limits on job sizes (termed by us as "job QoS") as well as to limit resources used by all jobs running in a partition, either per user or per group (termed by us as "partition QoS").

=== Job QoS ===
Job QoS are used to provide limits on the size of job that you can run. You should try to allocate only the resources your job actually needs, as resources that each of your jobs schedules are counted against your [[SLURM/Priority#Fair-share | fair-share priority]] in the future.
* default - Default job QoS. Limited to 4 CPU cores, 1 GPU, and 32GB RAM per job. The maximum wall time per job is 3 days.
* medium - Limited to 8 CPU cores, 2 GPUs, and 64GB RAM per job. The maximum wall time per job is 2 days.
* high - Limited to 16 CPU cores, 4 GPUs, and 128GB RAM per job. The maximum wall time per job is 1 day.
* scavenger - No resource limits per job, only a maximum wall time per job of 3 days. You are responsible for ensuring your job requests multiple nodes if it requests resources beyond what any one node is capable of. 11% of the total resources available for each trackable resource type in the partition (CPUs/GPUs/RAM) is permitted simultaneously across all of your jobs running with this job QoS, enforced via the corresponding partition QoS (below) for the scavenger partition. This job QoS is paired one-to-one with the scavenger partition. To use this job QoS, include <code>--partition=scavenger</code> and <code>--account=scavenger</code> in your submission arguments. Do not include any job QoS argument other than <code>--qos=scavenger</code> (optional) or submission will fail.
* scavenger-aarch64 - No resource limits per job, only a maximum wall time per job of 3 days. You are responsible for ensuring your job requests multiple nodes if it requests resources beyond what any one node is capable of. This job QoS is paired one-to-one with the scavenger-aarch64 partition. To use this job QoS, include <code>--partition=scavenger-aarch64</code>, <code>--account=scavenger</code>, and <code>--qos=scavenger-aarch64</code> in your submission arguments.

You can display these job QoS from the command line using the <code>show_qos</code> command. By default, the command will only show job QoS that you can access. The above five job QoS are the ones that everyone can access.

<pre>
$ show_qos
Name MaxWall MaxTRES MaxJobsPU MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
default 3-00:00:00 cpu=4,gres/gpu=1,mem=32G
high 1-00:00:00 cpu=16,gres/gpu=4,mem=128G
medium 2-00:00:00 cpu=8,gres/gpu=2,mem=64G
scavenger 3-00:00:00
scavenger-aarch64 3-00:00:00
</pre>

If you want to see all job QoS, including those that you do not have access to, you can use the <code>show_qos --all</code> command.

<pre>
$ show_qos --all
Name MaxWall MaxTRES MaxJobsPU MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
cml-cpu 7-00:00:00 8
cml-default 7-00:00:00 cpu=4,gres/gpu=1,mem=32G 2
cml-high 1-12:00:00 cpu=16,gres/gpu=4,mem=128G 2
cml-high_long 14-00:00:00 cpu=32,gres/gpu=8 8 gres/gpu=8
cml-medium 3-00:00:00 cpu=8,gres/gpu=2,mem=64G 2
cml-scavenger 3-00:00:00 gres/gpu=24
cml-very_high 1-12:00:00 cpu=32,gres/gpu=8,mem=256G 8 gres/gpu=12
default 3-00:00:00 cpu=4,gres/gpu=1,mem=32G
gamma-huge-long 10-00:00:00 cpu=32,gres/gpu=16,mem=256G
high 1-00:00:00 cpu=16,gres/gpu=4,mem=128G
highmem 21-00:00:00 cpu=128,mem=2T
huge-long 10-00:00:00 cpu=32,gres/gpu=8,mem=256G
interactive 12:00:00 cpu=4,mem=128G
medium 2-00:00:00 cpu=8,gres/gpu=2,mem=64G
oasis-exempt 10-00:00:00 cpu=160,mem=28114M
scavenger 3-00:00:00
scavenger-aarch64 3-00:00:00
vulcan-cpu 2-00:00:00 cpu=1024,mem=4T 4
vulcan-default 7-00:00:00 cpu=4,gres/gpu=1,mem=32G 2
vulcan-exempt 7-00:00:00 cpu=32,gres/gpu=8,mem=256G 2
vulcan-high 1-12:00:00 cpu=16,gres/gpu=4,mem=128G 2
vulcan-high_long 14-00:00:00 cpu=32,gres/gpu=8 8 gres/gpu=8
vulcan-medium 3-00:00:00 cpu=8,gres/gpu=2,mem=64G 2
vulcan-sailon 3-00:00:00 cpu=32,gres/gpu=8,mem=256G gres/gpu=48
vulcan-scavenger 3-00:00:00 cpu=32,gres/gpu=8,mem=256G
vulcan-scavenger-mu+ 3-00:00:00 cpu=288,gres/gpu=72,mem=1152G
</pre>

You are able to submit to any partition that is listed in the output of the <code>show_partitions</code> command. If you need to use an account other than the default account <tt>nexus</tt>, you will need to specify it via the <code>--account</code> submission argument.

=== Partition QoS ===
Partition QoS are used to limit resources used by all jobs running in a partition, either per user (MaxTRESPU) or per group (GrpTRES).

To view partition QoS, use the <code>show_partition_qos</code> command.

<pre>
$ show_partition_qos
Name MaxSubmitPU MaxTRESPU GrpTRES
------------------------- ----------- -------------------------------- --------------------
scavenger-aarch64_part 500
scavenger_part 500 cpu=11%,gres/gpu=11%,mem=11%
tron 500 cpu=32,gres/gpu=4,mem=262144M
</pre>

The scavenger_part partition QoS has relative TRES limits based on the current hardware in a given partition, represented with percentages. To see the current actual TRES limits of this partition QoS, you can use the <code>-r/--real</code> argument.

<pre>
$ show_partition_qos -r
Name MaxSubmitPU MaxTRESPU GrpTRES
------------------------- ----------- -------------------------------- --------------------
scavenger-aarch64_part 500
scavenger_part 500 cpu=888,gres/gpu=140,mem=12574G
tron 500 cpu=32,gres/gpu=4,mem=262144M
</pre>

If you want to see all partition QoS, including those that you do not have access to, you can use the <code>show_partition_qos --all</code> command.

<pre>
$ show_partition_qos --all
Name MaxSubmitPU MaxTRESPU GrpTRES
------------------------- ----------- -------------------------------- --------------------
cbcb 500 cpu=1406,mem=50359G
cbcb-heng 500
cbcb-interactive 500
class 500 cpu=32,gres/gpu=4,mem=262144M
clip 500 cpu=726,mem=6939G
cml 500 cpu=1226,mem=12116G
cml-cpu 500
cml-director 500
cml-furongh 500
cml-scavenger 500 gres/gpu=24
cml-sfeizi 500
cml-wriva 500
cml-wriva-high 500
csd-h200 500
gamma 500 cpu=906,mem=7675G
mbrc 500 cpu=370,mem=3571G
mc2 500 cpu=330,mem=3201G
oasis 500
quics 500 cpu=458,mem=4710G
scavenger-aarch64_part 500
scavenger_part 500 cpu=11%,gres/gpu=11%,mem=11%
tron 500 cpu=32,gres/gpu=4,mem=262144M
vulcan 500 cpu=1402,mem=12936G
vulcan-ampere 500
vulcan-cpu 500
vulcan-ramani 500
vulcan-scavenger 500
vulcan-scavenger-multi 500
</pre>

'''NOTE''': These QoS cannot be used directly when submitting jobs. Partition QoS limits apply to all jobs running on a given partition, regardless of what job QoS is used.

For example, in the default non-preemption partition (<tt>tron</tt>), you are restricted to 32 total CPU cores, 4 total GPUs, and 256GB total RAM at once across all jobs you have running in the partition.

Lab/group-specific partitions may also have their own user limits, and/or may also have group limits on the total number of resources consumed simultaneously by all users that are using their partition, codified by the line in the output above that matches their lab/group name. Note that the values listed above in the two "TRES" columns are not fixed and may fluctuate per-partition as more resources are added to or removed from each partition.

'''All partitions also only allow a maximum of 500 submitted (running (R) or pending (PD)) jobs per user in the partition simultaneously.''' This is to prevent excess pending jobs causing [https://slurm.schedmd.com/sched_config.html#backfill backfill] issues with the SLURM scheduler.
* If you need to submit more than 500 jobs in batch at once, you can develop and run an "outer submission script" that repeatedly attempts to run an "inner submission script" (your original submission script) to submit jobs in the batch periodically, until all job submissions are successful. The outer submission script should use looping logic to check if you are at the max job limit and should then retry submission after waiting for some time interval.
: An example outer submission script is as follows. In this example, <code>example_inner.sh</code> is your inner submission script and is not an [[SLURM/ArrayJobs | array job]], and you want to run 1000 jobs. If your inner submission script is an array job, adjust the number of jobs accordingly. Array jobs must be of size 500 or less.
<pre>
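#!/bin/bash
# Submit example_inner.sh $numjobs times, waiting whenever the per-user submission limit is hit.
numjobs=1000
i=0
while [ "$i" -lt "$numjobs" ]
do
    # Retry the same submission for as long as sbatch reports the QoS submit limit.
    while [[ "$(sbatch example_inner.sh 2>&1)" =~ "QOSMaxSubmitJobPerUserLimit" ]]
    do
        echo "Currently at maximum job submissions allowed by the partition's QoS."
        echo "Waiting for 5 minutes before trying to submit more jobs."
        sleep 300
    done
    i=$(( i + 1 ))
    echo "Submitted job $i of $numjobs"
done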
</pre>

It is suggested that you run the outer submission script in a [[Tmux]] session to keep the terminal window executing it from being interrupted.

= Storage =
All network storage available in Nexus is currently [[NFS]] based, and comes in a few different flavors. Compute nodes also have local scratch storage that can be used.

== Home Directories ==
{{Nfshomes}}

== Scratch Directories ==
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the Nexus compute infrastructure:
* Network scratch directories
* Local scratch directories

Please note that [[ClassAccounts | class accounts]] do not have network scratch directories.

=== Network Scratch Directories ===
You are allocated 200GB of scratch space via NFS from <code>/fs/nexus-scratch/<USERNAME></code> where <USERNAME> is your UMIACS username. '''It is not backed up or protected in any way.''' This directory is '''[[Automounter | automounted]]'''; you will need to <code>cd</code> into the directory or request/specify a fully qualified file path to access it.

You can view your quota usage by running <code>df -h /fs/nexus-scratch/<USERNAME></code>.

You may request a permanent increase of up to 400GB total space without any faculty approval by [[HelpDesk | contacting staff]]. If you need space beyond 400GB, you will need faculty approval and/or a [[#Project_Allocations | project allocation]] for this. If you choose to increase your scratch space beyond 400GB, the increased space is also subject to the 270 TB days limit mentioned in the project allocation section before we check back in for renewal. For example, if you request 1.4TB total space, you may have this for 270 days (1TB beyond the 400GB permanent increase). The amount increased beyond 400GB will also count against your faculty member's 20TB total storage limit mentioned below.

This file system is available on all submission, data management, and computational nodes within the cluster.

=== Local Scratch Directories ===
Each computational node that you can schedule compute jobs on also has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. and '''are not backed up or protected in any way.''' These directories are almost always more performant than any other storage available to the job as they are mounted from disks directly attached to the compute node. However, you must stage your data within the confines of your job and extract the relevant resultant data elsewhere before the end of your job.

These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month during our [[MonthlyMaintenanceWindow | monthly maintenance windows]]. Please make sure you secure any resultant data you wish to keep from these directories at the end of your job.
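
The stage-in / stage-out pattern described above can be sketched as follows. This is a minimal example under stated assumptions: the function names <code>stage_in</code> and <code>stage_out</code> are hypothetical, and the scratch root is passed in explicitly (on a compute node it would be a local scratch directory such as <code>/scratch0</code>).

```shell
#!/bin/bash
# Hypothetical helpers for using node-local scratch within a job.

# Copy the contents of an input directory into a fresh per-job directory
# under the given scratch root, and print the path of that directory.
stage_in() {
    local src=$1 scratch_root=$2 workdir
    workdir=$(mktemp -d "$scratch_root/${USER:-job}.XXXXXX") || return 1
    cp -r "$src"/. "$workdir"/ || return 1
    echo "$workdir"
}

# Copy everything from the scratch work directory to a destination
# (e.g. network storage), then remove the work directory.
stage_out() {
    local workdir=$1 dest=$2
    mkdir -p "$dest" || return 1
    cp -r "$workdir"/. "$dest"/ || return 1
    rm -rf "$workdir"
}
```

In a job, you might call <code>workdir=$(stage_in "$input_dir" /scratch0)</code> before your computation and <code>stage_out "$workdir" "$results_dir"</code> at the end, where both variables are placeholders for paths of your own.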

== Faculty Allocations ==
Each faculty member can be allocated 1TB of permanent lab space upon request. We can also support grouping these individual allocations together into larger center, lab, or research group allocations if desired by the faculty. Please [[HelpDesk | contact staff]] to inquire.

Lab space storage is fully protected. It has [[Snapshots | snapshots]] enabled and is [[NightlyBackups | backed up nightly]].

== Project Allocations ==
Project allocations are available per user for 270 TB days; you can have a 1TB allocation for up to 270 days, a 3TB allocation for 90 days, etc.

A single faculty member cannot have more than 20TB of project allocations across all of their sponsored accounts active simultaneously. Network scratch allocation space increases beyond the 400GB permanent maximum also have the increase count against this limit (i.e., a 1TB network scratch allocation would have 600GB counted towards this limit).

Project storage is fully protected. It has [[Snapshots | snapshots]] enabled and is [[NightlyBackups | backed up nightly]].

The maximum allocation length you can request is 540 days (500GB space) and the maximum storage space you can request is 9TB (30 day length).
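
The TB-day arithmetic above can be checked with a small helper. This is a sketch only: the function name <code>max_allocation_days</code> is hypothetical, and it assumes 1TB = 1000GB, which is consistent with the 500GB / 540 day and 9TB / 30 day limits quoted above.

```shell
#!/bin/bash
# Hypothetical helper: longest allocation length (in days) for a given size
# in GB under the 270 TB-day budget. Assumes 1TB = 1000GB.
max_allocation_days() {
    local size_gb=$1 days
    if [ "$size_gb" -le 0 ] || [ "$size_gb" -gt 9000 ]; then
        echo "size must be between 1GB and 9000GB (9TB)" >&2
        return 1
    fi
    # 270 TB days = 270000 GB days; integer division rounds down.
    days=$(( 270 * 1000 / size_gb ))
    # The maximum allocation length is 540 days regardless of size.
    if [ "$days" -gt 540 ]; then
        days=540
    fi
    echo "$days"
}
```

For instance, <code>max_allocation_days 3000</code> corresponds to the 3TB for 90 days case mentioned above.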

To request an allocation, please [[HelpDesk | contact staff]] and include the faculty member(s) that the project is under in the conversation. Please include the following details:
* Project Name (short)
* Description
* Size (1TB, 2TB, etc.)
* Length in days (270 days, 135 days, etc.)
* Other user(s) that need to access the allocation, if any

These allocations are available via <code>/fs/nexus-projects/<project name></code>. '''Renewal is not guaranteed to be available due to limits on the amount of total storage.''' Near the end of the allocation period, staff will contact you and ask if you are still in need of the storage allocation. If renewal is available, you can renew for up to another 270 TB days with reapproval from the original faculty approver.
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then remove the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but the original faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and the faculty approver responding, staff will remove the allocation.

== Datasets ==
We have read-only dataset storage available at <code>/fs/nexus-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of Nexus datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=Nexus here].

Nexus/ClusterOSUpgrade

2026-06-01T17:10:26Z

Mbaney:

==Overview==
UMIACS Technical Staff has begun the process of upgrading the operating system version on all [[Nexus]] cluster nodes from [[RHEL | Red Hat Enterprise Linux (RHEL)]] 8 to 9 as of 9am on 06/01/2026.

RHEL8 is in the Maintenance Support phase of its life cycle and is transitioning to the Extended Life phase in 2029. More information on Red Hat's lifecycle policy for its operating systems can be found [https://access.redhat.com/support/policy/updates/errata here]. We are staying well ahead of the Extended Life phase date for our cluster nodes by performing these upgrades now.

RHEL9 is still in the Full Support phase of its life cycle and introduces a newer major Linux kernel version and newer [https://www.gnu.org/software/libc glibc] version, improving compatibility with many newer software applications.

==Scheduling==
'''Upgrades for all cluster nodes have begun.''' We expect to be finished with all cluster node upgrades no later than Friday 08/21/2026 at 5pm.

===[[SLURM/JobSubmission | Submission Nodes]]===
'''Submission nodes with the number '01' in their hostnames have been upgraded as of 06/01/2026.'''

Submission nodes with the number '00' in their hostnames will be scheduled for upgrade individually, when all of the compute nodes associated with the same lab/center have been upgraded. Staff will send a notification to individual lab's/center's cluster users to schedule the relevant '00' node's upgrade when applicable. The actual date of each upgrade will be no less than one week after the corresponding notification has been sent.

Data in [[FilesystemDataStorage#UNIX_Filesystem_Storage | UNIX filesystem storage]] spaces on each submission node, i.e., /tmp and /scratch0, will not be preserved during upgrade. If you have any data in any such space on either submission node in a pairing that you want to keep, please ensure you copy it to the other submission node or a [[FilesystemDataStorage#Network-Attached_Filesystem_Storage | network-attached filesystem storage]] space prior to each node's upgrade date. Data in network-attached filesystem storage spaces, such as /nfshomes or /fs/nexus-scratch, will not be affected.

===Compute Nodes===
Due to the large number of compute nodes and the desire to not interrupt running jobs, we are not generally able to schedule each specific compute node upgrade on a specific date. If you find that a specific node is unavailable to schedule jobs on, you can run the command <code>sinfo --list-reasons --long</code> on a submission node and look to see if the node is in the list with the text "RHEL9 upgrade" - if this is present, the upgrade for that node is underway.
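
As a small convenience, the check described above can be wrapped in a filter. The function name <code>filter_rhel9_upgrades</code> is hypothetical; it simply keeps the lines of the <code>sinfo</code> output that contain the text "RHEL9 upgrade".

```shell
#!/bin/bash
# Hypothetical filter: keep only the lines of `sinfo --list-reasons --long`
# output whose reason mentions the RHEL9 upgrade. Reads from standard input.
filter_rhel9_upgrades() {
    grep -F "RHEL9 upgrade"
}

# Only query the scheduler when sinfo is available (i.e. on a submission node).
if command -v sinfo >/dev/null 2>&1; then
    sinfo --list-reasons --long | filter_rhel9_upgrades \
        || echo "No nodes currently list an RHEL9 upgrade reason."
fi
```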

We will generally be prioritizing upgrades for nodes based on how available they are across various partitions; nodes that are only available in partitions that contain large numbers of users for a lab/center, e.g., cbcb, clip, cml-dpart, gamma, vulcan-ampere, vulcan-dpart, etc., and corresponding "scavenger" named partitions, will be prioritized over nodes that are only or are also available in faculty-specific / limited-node partitions. All nodes in the tron partition will also generally be prioritized.

If you are a faculty member authoritative for your own partition or a small group's limited-node partition and have scheduling concerns for the nodes in these partitions, please [[HelpDesk | contact staff]] ASAP to let us know about these concerns and we will make our best effort to accommodate them.

==Interoperability==
===Software and Modules===
Please begin transitioning your [[PythonVirtualEnv | virtual environments]], workflows, etc. to work with RHEL9 as soon as possible. You can use the '01' submission node that you have access to for transitioning and light testing - as always, [[Nexus/Submission_Node_Policy | please do not run any computationally intensive processes on this node]]. It is intended to be a host for configuring environments/workflows and submitting jobs only.

The [[Modules | module tree]] for RHEL9 has already been populated with a large number of the same modules that are available in the RHEL8 module tree, although specific modules may have different versions available in the RHEL9 tree as compared to the RHEL8 tree. If you have a dependency on a specific version of a module that is not available in the RHEL9 tree, please [[HelpDesk | contact staff]] and we can get one created.
* If you want to check to see if a specific version is available now but do not have access to any RHEL9 node yet, the current module tree for RHEL9 is located at /fs/UMos/RedHat-9/x86_64/local/stow/.modulefiles and can be viewed from any UMIACS-supported host.

===SLURM Scheduling===
If you want or need to schedule a job on only nodes running RHEL8 (or RHEL9, once you have validated whatever is relevant), you can use the submission arguments <code>--prefer=rhel#</code> or <code>--constraint=rhel#</code> in your job arguments to specify this, where # is replaced by the OS version number. The --prefer argument is a soft limitation on which nodes the job can be scheduled on and the --constraint argument is a hard limitation, i.e., if you use the argument <code>--prefer=rhel8</code> but there are no RHEL8 nodes available at present (with your other submission arguments also satisfied) in the partition you are submitting to, the job will be scheduled on an appropriate RHEL9 node if that would result in an earlier (or instantaneous) start time.
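
To make the soft/hard distinction concrete, the sketch below builds the appropriate submission argument. The helper name <code>rhel_arg</code> and its <code>soft</code>/<code>hard</code> keywords are hypothetical conveniences; only <code>--prefer=rhel#</code> and <code>--constraint=rhel#</code> come from the text above.

```shell
#!/bin/bash
# Hypothetical helper: print the OS-selection argument for srun/sbatch.
#   mode:    "soft" -> --prefer (fall back to the other OS if it starts sooner)
#            "hard" -> --constraint (only ever run on the requested OS)
#   version: 8 or 9
rhel_arg() {
    local mode=$1 version=$2
    case "$version" in
        8|9) ;;
        *) echo "unsupported RHEL version: $version" >&2; return 1 ;;
    esac
    case "$mode" in
        soft) echo "--prefer=rhel$version" ;;
        hard) echo "--constraint=rhel$version" ;;
        *) echo "mode must be 'soft' or 'hard'" >&2; return 1 ;;
    esac
}
```

A job that must run on RHEL9 could then be submitted as <code>sbatch "$(rhel_arg hard 9)" my_job.sh</code>, where <code>my_job.sh</code> is a placeholder for your own script.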

MonthlyMaintenanceWindow

2026-05-29T00:16:16Z

Mbaney:

[[HelpDesk | UMIACS staff]] takes a monthly maintenance window to patch and reboot all UMIACS-supported hosts and services. This provides a way for staff to ensure security updates are installed and applied on the numerous different platforms and appliances that UMIACS runs.

The window for each month is calculated by adding 9 days to [https://en.wikipedia.org/wiki/Patch_Tuesday Microsoft's Patch Tuesday] to allow for enough time to marshal patches released that month from Microsoft, Red Hat, Apple, and other OS and application vendors and have enough time to get systems prepared to reboot. This translates to the window being on the '''Thursday that occurs between the 17th and the 23rd (inclusive)''' of each month. The window lasts from '''5pm-8pm'''.
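
The date rule above can be computed directly. The following is a minimal sketch that assumes GNU <code>date</code> (for the <code>-d</code> option), as found on RHEL; the function name <code>maintenance_window</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the nominal maintenance window date (YYYY-MM-DD)
# for a given year and month, i.e. the Thursday falling on the 17th-23rd.
# Assumes GNU date; -u avoids any local time zone edge cases.
maintenance_window() {
    local year=$1 month=$2 day
    for day in 17 18 19 20 21 22 23; do
        # %u prints the ISO day of the week, where 4 is Thursday.
        if [ "$(date -u -d "$year-$month-$day" +%u)" -eq 4 ]; then
            date -u -d "$year-$month-$day" +%Y-%m-%d
            return 0
        fi
    done
    return 1
}
```

This only yields the nominal date; as the archives show, staff occasionally adjust a given month's window.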

[[Nexus]] will always have a reservation in place from 4:45pm-8pm on the day of the upcoming window to prevent jobs from being scheduled on compute nodes. The 15-minute addition before the start of the window is to allow jobs to fully end. Any job submitted before the reservation begins that has a time limit that would run into the reservation will be held until at least the end of the reservation - 8pm on the day of the window. This is to prevent issues with jobs failing to end properly causing delays in work we have scheduled during the window.
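
To estimate whether a job would be held by the reservation, you can compare its latest possible end time against the reservation start. This is an approximation of the rule described above, not how SLURM itself evaluates reservations; the function name <code>ends_before_reservation</code> is hypothetical.

```shell
#!/bin/bash
# Hypothetical check: would a job starting at start_epoch with a time limit of
# limit_minutes finish by the time the reservation begins at reservation_epoch?
# All times are in seconds since the Unix epoch. Succeeds (exit 0) if so.
ends_before_reservation() {
    local start_epoch=$1 limit_minutes=$2 reservation_epoch=$3
    [ $(( start_epoch + limit_minutes * 60 )) -le "$reservation_epoch" ]
}
```

With GNU <code>date</code>, for example, <code>ends_before_reservation "$(date +%s)" 60 "$(date -d '2026-08-20 16:45' +%s)"</code> succeeds only if a 60-minute job started now would end by 4:45pm on August 20th 2026.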

A list of upcoming maintenance windows is as follows, with the next one in bold. Again, the window is on the '''Thursday that occurs between the 17th and the 23rd (inclusive)''' of each month, and lasts from '''5pm-8pm'''.

* '''June 18th 2026'''
* July 23rd 2026
* August 20th 2026
* September 17th 2026
* October 22nd 2026
* November 19th 2026
* December 17th 2026

==Archives==
* January 17th 2013 - BEGIN time of 8pm-12am for this window through February 20th 2020
* February 21st 2013
* March 21st 2013
* April 18th 2013
* May 23rd 2013
* June 20th 2013
* July 18th 2013
* August 22nd 2013
* September 19th 2013
* October 17th 2013
* December 19th 2013
* January 23rd 2014
* February 20th 2014
* March 20th 2014
* April 17th 2014
* May 22nd 2014
* June 19th 2014
* July 17th 2014
* August 21st 2014
* September 18th 2014
* October 23rd 2014
* November 20th 2014
* December 18th 2014
* January 22nd 2015
* February 19th 2015
* March 19th 2015
* May 21st 2015
* June 18th 2015
* July 23rd 2015
* August 20th 2015
* September 17th 2015
* October 22nd 2015
* November 19th 2015
* December 17th 2015
* January 21st 2016
* February 18th 2016
* March 12th 2016 (Adjusted date for AVW power outage)
* April 21st 2016
* May 19th 2016
* June 23rd 2016
* July 21st 2016
* August 18th 2016
* September 22nd 2016
* October 20th 2016
* November 17th 2016
* December 22nd 2016
* January 19th 2017
* February 23rd 2017
* March 23rd 2017
* April 20th 2017
* May 18th 2017
* June 22nd 2017
* July 20th 2017
* August 17th 2017
* September 21st 2017
* October 19th 2017
* December 21st 2017
* January 18th 2018
* February 22nd 2018
* March 22nd 2018
* April 19th 2018
* May 17th 2018
* June 21st 2018
* July 19th 2018
* August 23rd 2018
* September 20th 2018
* October 18th 2018
* December 20th 2018
* January 24th 2019
* February 21st 2019
* April 18th 2019
* May 23rd 2019
* June 20th 2019
* July 18th 2019
* August 22nd 2019
* September 19th 2019
* October 17th 2019
* November 21st 2019
* December 19th 2019
* January 23rd 2020
* February 20th 2020
* April 23rd 2020 - BEGIN time of 5pm-7pm for this window through August 19th 2021
* June 18th 2020
* July 23rd 2020
* August 20th 2020
* September 17th 2020
* October 22nd 2020
* November 19th 2020
* December 17th 2020
* January 21st 2021
* February 18th 2021
* March 25th 2021 (Adjusted date for extended Spring Break)
* April 22nd 2021
* May 20th 2021
* June 17th 2021
* July 22nd 2021
* August 19th 2021
* September 23rd 2021 - BEGIN time of 5pm-8pm for this window and all others below
* October 21st 2021
* November 18th 2021
* January 20th 2022
* February 17th 2022
* March 24th 2022 (Adjusted date for Spring Break)
* April 21st 2022
* May 19th 2022
* June 23rd 2022
* July 21st 2022
* August 18th 2022
* September 22nd 2022
* October 20th 2022
* November 17th 2022
* January 19th 2023
* February 23rd 2023
* April 20th 2023
* May 18th 2023
* June 22nd 2023
* July 20th 2023
* August 17th 2023
* September 21st 2023
* October 19th 2023
* December 20th 2023 (Adjusted date for early Winter Break)
* January 18th 2024
* February 22nd 2024
* March 21st 2024
* April 18th 2024
* May 23rd 2024
* June 20th 2024
* July 18th 2024
* August 22nd 2024
* September 19th 2024
* October 17th 2024
* November 21st 2024
* December 19th 2024
* January 23rd 2025
* February 20th 2025
* March 20th 2025
* April 17th 2025
* May 22nd 2025
* June 19th 2025
* July 17th 2025
* August 21st 2025
* September 18th 2025
* October 23rd 2025
* November 20th 2025
* December 18th 2025
* January 22nd 2026
* February 19th 2026
* March 19th 2026
* April 23rd 2026
* May 28th 2026 (Adjusted date for CIO-imposed network change freeze)

Nexus

2026-05-19T15:28:45Z

Mbaney: /* Partition QoS */

{{Note|UMIACS Technical Staff will begin the process of upgrading the operating system version on all Nexus cluster nodes in Summer 2026. Please see [[Nexus/ClusterOSUpgrade]] for more information.}}

The Nexus is the combined scheduler of resources in UMIACS. The resource manager for Nexus is [[SLURM]]. Resources are arranged into partitions where users are able to schedule computational jobs. Users are arranged into a number of SLURM accounts based on faculty, lab, or center investments.

= Getting Started =
All accounts in UMIACS are sponsored. If you don't already have a UMIACS account, please see [[Accounts]] for information on getting one. You need a full UMIACS account - not a [[Accounts/Collaborator | collaborator account]] - in order to access Nexus.

== Access ==
Your access to submission nodes (alternatively called login nodes) for Nexus computational resources is determined by your account sponsor's department, center, or lab affiliation. You can log into the [https://intranet.umiacs.umd.edu/directory/cr/ UMIACS Directory CR application] and select the Computational Resource (CR) in the list that has the prefix <code>nexus</code>. The Hosts section lists your available submission nodes - generally a pair of nodes with hostnames of the format <tt>nexus<department, lab, or center abbreviation>[00,01]</tt>, e.g., <tt>nexusgroup00</tt> and <tt>nexusgroup01</tt>.

Once you have identified your submission nodes, you can [[SSH]] into them [https://itsupport.umd.edu/itsupport?id=kb_article_view&sysparm_article=KB0016076 after connecting to UMD's GlobalProtect VPN]. From there, you are able to submit to the cluster via our [[SLURM]] workload manager. You need to make sure that your submitted jobs have the correct account, partition, and qos.

Please read our [[Nexus/Submission_Node_Policy|Submission Node Policy]] for guidance on appropriate usage of a submission node. If a submission node becomes unresponsive due to disregarding this policy, we may kill user processes on these nodes to resolve the issue. We reserve the right to take action on users who repeatedly cause issues on submission nodes.

== Jobs ==
[[SLURM]] jobs are [[SLURM/JobSubmission | submitted]] with either <code>srun</code> or <code>sbatch</code>, depending on whether you are running an interactive job or a batch job, respectively. You need to specify who is running the job, where and how it should run, and the resources it needs.

For the who/where/how, you may be required to specify <code>--account</code>, <code>--partition</code>, and/or <code>--qos</code> (respectively) to be able to adequately submit jobs to the Nexus.

For resources, you may need to specify <code>--time</code> for time, <code>--cpus-per-task</code> for CPUs, <code>--mem</code> for RAM, and <code>--gres=gpu</code> for GPUs in your submission arguments to meet your requirements. There are defaults for all four; if you don't specify something, you will get the default value for that resource, which is minimal (e.g., by default, NO GPUs are included if you do not specify <code>--gres=gpu</code>). For more information about submission flags for GPU resources, see [[SLURM/JobSubmission#Requesting_GPUs | here]]. You may also use <code>--ntasks</code> to specify the number of parallel processes to run, with each task having its own set of the resources specified above. You can run <code>man srun</code> on your submission node for a complete list of available submission arguments.

For a list of available GPU types on Nexus and their specs, please see [[Nexus/GPUs]].

For details on how the network for Nexus is architected, please see [[Nexus/Network]]. This can be important if you wish to optimize performance of your jobs.

=== Interactive ===
Once logged into a submission node, you can run simple interactive jobs. If your session is interrupted from the submission node, the job will be killed. As such, we encourage use of a terminal multiplexer such as [[Tmux]].

<pre>
$ srun --pty --cpus-per-task=4 --mem=2gb --gres=gpu:1 bash
srun: Job account was unset; set to user default of 'nexus'
srun: Job partition was unset; set to cluster default of 'tron'
srun: Job QoS was unset; set to association default of 'default'
srun: Job time limit was unset; set to partition default of 60 minutes
srun: job 1 queued and waiting for resources
srun: job 1 has been allocated resources
$ hostname
tron62.umiacs.umd.edu
$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-daad6a04-a2ce-1183-ce53-b267048f750a)
</pre>

=== Batch ===
Batch jobs are scheduled with a script file with an optional ability to embed job scheduling parameters via variables that are defined by <code>#SBATCH</code> lines at the top of the file. You can find some examples in our [[SLURM/JobSubmission]] documentation.
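
As a minimal sketch of such a script, the example below embeds the submission arguments described in the Jobs section as <code>#SBATCH</code> lines. The account, partition, and QoS values are the defaults shown in the interactive example above; the job name, time limit, and output file name are hypothetical and should be adjusted to your own needs.

```shell
#!/bin/bash
# Hypothetical job name and output file; %j expands to the job ID.
#SBATCH --job-name=example
#SBATCH --output=example-%j.out
# Who, where, and how: the defaults shown in the interactive example.
#SBATCH --account=nexus
#SBATCH --partition=tron
#SBATCH --qos=default
# Resources: 1 hour of wall time, 4 CPUs, 2GB RAM, 1 GPU.
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=2gb
#SBATCH --gres=gpu:1

# SLURM_JOB_ID is set by SLURM inside a job; fall back to a label elsewhere.
job_label="${SLURM_JOB_ID:-not-under-slurm}"
echo "Job ${job_label} running on $(uname -n)"
```

Assuming the script is saved as <code>example.sh</code> (a hypothetical name), it would be submitted with <code>sbatch example.sh</code>.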

= Partitions =
The SLURM resource manager uses partitions to act as job queues which can restrict size, time and user limits. The Nexus has a number of different partitions of resources. Different Centers, Labs, and Faculty are able to invest in computational resources that are restricted to approved users through these partitions.

'''Partitions usable by all non-[[ClassAccounts |class account]] users:'''
* [[Nexus/Tron]] - Pool of resources available to all non-class accounts sponsored by either UMIACS or CSD faculty.
* Scavenger - [https://slurm.schedmd.com/preempt.html Preemption] partition that contains [https://en.wikipedia.org/wiki/X86-64 x86_64] architecture nodes from multiple other partitions. More resources are available to schedule simultaneously than in other partitions, however jobs are subject to preemption rules. You are responsible for ensuring your jobs handle this preemption correctly. The SLURM scheduler will simply restart a preempted job with the same submission arguments when it is available to run again. For an overview of things you can check within scripts to determine if your job was preempted/resumed, see [[SLURM/Preemption]].
* Scavenger (aarch64) - Preemption partition identical in design to <tt>scavenger</tt>, but only contains [https://en.wikipedia.org/wiki/AArch64 aarch64] architecture nodes.
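
A minimal sketch of one way a job script might notice that it has been restarted is shown below. It assumes that a preempted job on Nexus is requeued in a way that sets SLURM's <code>SLURM_RESTART_COUNT</code> environment variable; [[SLURM/Preemption]] remains the authoritative reference for what to check. The function name <code>was_requeued</code> is hypothetical.

```shell
#!/bin/bash
# Succeeds (exit status 0) if SLURM reports at least one restart of this job.
# Assumption: preemption requeues the job and sets SLURM_RESTART_COUNT.
was_requeued() {
    [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]
}

if was_requeued; then
    echo "Resuming after requeue (restart count: ${SLURM_RESTART_COUNT})"
else
    echo "First run of this job"
fi
```

A script could use this check to decide whether to start from scratch or to reload its own checkpoint files.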

'''Partitions usable by [[ClassAccounts]]:'''
* [[ClassAccounts#Cluster_Usage | Class]] - Pool of resources available to class accounts sponsored by either UMIACS or CSD faculty.

'''Partitions usable by specific lab/center users:'''
* [[Nexus/CBCB]] - CBCB lab pool available for CBCB lab members.
* [[Nexus/CLIP]] - CLIP lab pool available for CLIP lab members.
* [[Nexus/CML]] - CML lab pool available for CML lab members.
* [[Nexus/GAMMA]] - GAMMA lab pool available for GAMMA lab members.
* [[Nexus/MBRC]] - MBRC lab pool available for MBRC lab members.
* [[Nexus/MC2]] - MC2 lab pool available for MC2 lab members.
* [[Nexus/QuICS]] - QuICS lab pool available for QuICS lab members.
* [[Nexus/Vulcan]] - Vulcan lab pool available for Vulcan lab members.

You can view the partitions that you have access to by using the <code>show_partitions</code> command. By default, the command will show only the partitions that are available to you.

<pre>
$ show_partitions
Name                     AllowAccounts           AllowQos                       MaxNodes    Nodes
------------------------ ----------------------- ------------------------------ ----------- ----------------------------
scavenger                scavenger               scavenger                      UNLIMITED   brigid[16-19]
                                                                                            cbcb[00-29]
                                                                                            clip[00-13]
                                                                                            cml[00,02-13,15-28,30-33]
                                                                                            cmlcpu[00-04,06-07]
                                                                                            gammagpu[00-21]
                                                                                            legacy[00-11,13-28,30-36]
                                                                                            legacygpu[00-07]
                                                                                            quics00
                                                                                            tron[00-44,46-69]
                                                                                            vulcan[00-45]
------------------------------------------------------------------------------------------------------------------------
scavenger-aarch64        scavenger               scavenger-aarch64              UNLIMITED   oasis[00-39]
------------------------------------------------------------------------------------------------------------------------
tron                     nexus                   default                        UNLIMITED   tron[00-44,46-69]
                                                 high
                                                 medium
</pre>

If you want to see information for all of the partitions, including those that you do not have access to, you can use the <code>show_partitions --all</code> command.

<pre>
$ show_partitions --all
Name                     AllowAccounts           AllowQos                       MaxNodes    Nodes
------------------------ ----------------------- ------------------------------ ----------- ----------------------------
cbcb                     cbcb                    default                        UNLIMITED   cbcb[00-20,22-29]
                                                 medium                                     legacy[00-11,13-28,30-36]
                                                 high
                                                 huge-long
                                                 highmem
------------------------------------------------------------------------------------------------------------------------
cbcb-heng                cbcb-heng               default                        UNLIMITED   cbcb[26-29]
                                                 medium
                                                 high
                                                 huge-long
                                                 highmem
------------------------------------------------------------------------------------------------------------------------
cbcb-interactive         cbcb                    interactive                    UNLIMITED   cbcb21
...
</pre>

= Quality of Service (QoS) =
SLURM uses Quality of Service (QoS) both to provide limits on job sizes (termed by us as "job QoS") as well as to limit resources used by all jobs running in a partition, either per user or per group (termed by us as "partition QoS").

=== Job QoS ===
Job QoS are used to provide limits on the size of job that you can run. You should try to allocate only the resources your job actually needs, as resources that each of your jobs schedules are counted against your [[SLURM/Priority#Fair-share | fair-share priority]] in the future.
* default - Default job QoS. Limited to 4 CPU cores, 1 GPU, and 32GB RAM per job. The maximum wall time per job is 3 days.
* medium - Limited to 8 CPU cores, 2 GPUs, and 64GB RAM per job. The maximum wall time per job is 2 days.
* high - Limited to 16 CPU cores, 4 GPUs, and 128GB RAM per job. The maximum wall time per job is 1 day.
* scavenger - No resource limits per job, only a maximum wall time per job of 3 days. You are responsible for ensuring your job requests multiple nodes if it requests resources beyond what any one node is capable of. 11% of the total resources available for each trackable resource type in the partition (CPUs/GPUs/RAM) is permitted simultaneously across all of your jobs running with this job QoS, enforced via the corresponding partition QoS (below) for the scavenger partition. This job QoS is paired one-to-one with the scavenger partition. To use this job QoS, include <code>--partition=scavenger</code> and <code>--account=scavenger</code> in your submission arguments. Do not include any job QoS argument other than <code>--qos=scavenger</code> (optional) or submission will fail.
* scavenger-aarch64 - No resource limits per job, only a maximum wall time per job of 3 days. You are responsible for ensuring your job requests multiple nodes if it requests resources beyond what any one node is capable of. This job QoS is paired one-to-one with the scavenger-aarch64 partition. To use this job QoS, include <code>--partition=scavenger-aarch64</code>, <code>--account=scavenger</code>, and <code>--qos=scavenger-aarch64</code> in your submission arguments.
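
The submission arguments for the two scavenger job QoS described above can be sketched as a small helper. The function name <code>scavenger_args</code> and the script name <code>my_job.sh</code> are hypothetical; the argument values themselves are taken from the descriptions above.

```shell
#!/bin/bash
# scavenger_args [ARCH]: print the submission arguments for a scavenger job.
# ARCH is "x86_64" (default) or "aarch64".
scavenger_args() {
    case "${1:-x86_64}" in
        x86_64)
            # --qos=scavenger is optional here, so it is left out.
            echo "--partition=scavenger --account=scavenger"
            ;;
        aarch64)
            echo "--partition=scavenger-aarch64 --account=scavenger --qos=scavenger-aarch64"
            ;;
        *)
            echo "unknown architecture: $1" >&2
            return 1
            ;;
    esac
}

scavenger_args
scavenger_args aarch64
```

One possible use is <code>sbatch $(scavenger_args aarch64) my_job.sh</code>, leaving the expansion unquoted so that each argument is passed separately.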

You can display these job QoS from the command line using the <code>show_qos</code> command. By default, the command will only show job QoS that you can access. The above five job QoS are the ones that everyone can access.

<pre>
$ show_qos
Name MaxWall MaxTRES MaxJobsPU MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
default 3-00:00:00 cpu=4,gres/gpu=1,mem=32G
high 1-00:00:00 cpu=16,gres/gpu=4,mem=128G
medium 2-00:00:00 cpu=8,gres/gpu=2,mem=64G
scavenger 3-00:00:00
scavenger-aarch64 3-00:00:00
</pre>

If you want to see all job QoS, including those that you do not have access to, you can use the <code>show_qos --all</code> command.

<pre>
$ show_qos --all
Name MaxWall MaxTRES MaxJobsPU MaxTRESPU
-------------------- ----------- ------------------------------ --------- ------------------------------
cml-cpu 7-00:00:00 8
cml-default 7-00:00:00 cpu=4,gres/gpu=1,mem=32G 2
cml-high 1-12:00:00 cpu=16,gres/gpu=4,mem=128G 2
cml-high_long 14-00:00:00 cpu=32,gres/gpu=8 8 gres/gpu=8
cml-medium 3-00:00:00 cpu=8,gres/gpu=2,mem=64G 2
cml-scavenger 3-00:00:00 gres/gpu=24
cml-very_high 1-12:00:00 cpu=32,gres/gpu=8,mem=256G 8 gres/gpu=12
default 3-00:00:00 cpu=4,gres/gpu=1,mem=32G
gamma-huge-long 10-00:00:00 cpu=32,gres/gpu=16,mem=256G
high 1-00:00:00 cpu=16,gres/gpu=4,mem=128G
highmem 21-00:00:00 cpu=128,mem=2T
huge-long 10-00:00:00 cpu=32,gres/gpu=8,mem=256G
interactive 12:00:00 cpu=4,mem=128G
medium 2-00:00:00 cpu=8,gres/gpu=2,mem=64G
oasis-exempt 10-00:00:00 cpu=160,mem=28114M
scavenger 3-00:00:00
scavenger-aarch64 3-00:00:00
vulcan-cpu 2-00:00:00 cpu=1024,mem=4T 4
vulcan-default 7-00:00:00 cpu=4,gres/gpu=1,mem=32G 2
vulcan-exempt 7-00:00:00 cpu=32,gres/gpu=8,mem=256G 2
vulcan-high 1-12:00:00 cpu=16,gres/gpu=4,mem=128G 2
vulcan-high_long 14-00:00:00 cpu=32,gres/gpu=8 8 gres/gpu=8
vulcan-medium 3-00:00:00 cpu=8,gres/gpu=2,mem=64G 2
vulcan-sailon 3-00:00:00 cpu=32,gres/gpu=8,mem=256G gres/gpu=48
vulcan-scavenger 3-00:00:00 cpu=32,gres/gpu=8,mem=256G
vulcan-scavenger-mu+ 3-00:00:00 cpu=288,gres/gpu=72,mem=1152G
</pre>

You are able to submit to any partition that is listed in the <code>show_partitions</code> command. If you need to use an account other than the default account <tt>nexus</tt>, you will need to specify it via the <code>--account</code> submission argument.

=== Partition QoS ===
Partition QoS are used to limit resources used by all jobs running in a partition, either per user (MaxTRESPU) or per group (GrpTRES).

To view partition QoS, use the <code>show_partition_qos</code> command.

<pre>
$ show_partition_qos
Name MaxSubmitPU MaxTRESPU GrpTRES
------------------------- ----------- -------------------------------- --------------------
scavenger-aarch64_part 500
scavenger_part 500 cpu=11%,gres/gpu=11%,mem=11%
tron 500 cpu=32,gres/gpu=4,mem=262144M
</pre>

The scavenger_part partition QoS has relative TRES limits based on the current hardware in a given partition, represented with percentages. To see the current actual TRES limits of this partition QoS, you can use the <code>-r/--real</code> argument.

<pre>
$ show_partition_qos -r
Name MaxSubmitPU MaxTRESPU GrpTRES
------------------------- ----------- -------------------------------- --------------------
scavenger-aarch64_part 500
scavenger_part 500 cpu=888,gres/gpu=140,mem=12574G
tron 500 cpu=32,gres/gpu=4,mem=262144M
</pre>

If you want to see all partition QoS, including those that you do not have access to, you can use the <code>show_partition_qos --all</code> command.

<pre>
$ show_partition_qos --all
Name MaxSubmitPU MaxTRESPU GrpTRES
------------------------- ----------- -------------------------------- --------------------
cbcb 500 cpu=1406,mem=50359G
cbcb-heng 500
cbcb-interactive 500
class 500 cpu=32,gres/gpu=4,mem=262144M
clip 500 cpu=726,mem=6939G
cml 500 cpu=1226,mem=12116G
cml-cpu 500
cml-director 500
cml-furongh 500
cml-scavenger 500 gres/gpu=24
cml-sfeizi 500
cml-wriva 500
cml-wriva-high 500
csd-h200 500
gamma 500 cpu=906,mem=7675G
mbrc 500 cpu=370,mem=3571G
mc2 500 cpu=330,mem=3201G
oasis 500
quics 500 cpu=458,mem=4710G
scavenger-aarch64_part 500
scavenger_part 500 cpu=11%,gres/gpu=11%,mem=11%
tron 500 cpu=32,gres/gpu=4,mem=262144M
vulcan 500 cpu=1402,mem=12936G
vulcan-ampere 500
vulcan-cpu 500
vulcan-ramani 500
vulcan-scavenger 500
vulcan-scavenger-multi 500
</pre>

'''NOTE''': These QoS cannot be used directly when submitting jobs. Partition QoS limits apply to all jobs running on a given partition, regardless of what job QoS is used.

For example, in the default non-preemption partition (<tt>tron</tt>), you are restricted to 32 total CPU cores, 4 total GPUs, and 256GB total RAM at once across all jobs you have running in the partition.

Lab/group-specific partitions may also have their own user limits, and/or may also have group limits on the total number of resources consumed simultaneously by all users that are using their partition, codified by the line in the output above that matches their lab/group name. Note that the values listed above in the two "TRES" columns are not fixed and may fluctuate per-partition as more resources are added to or removed from each partition.

'''All partitions also only allow a maximum of 500 submitted (running (R) or pending (PD)) jobs per user in the partition simultaneously.''' This is to prevent excess pending jobs causing [https://slurm.schedmd.com/sched_config.html#backfill backfill] issues with the SLURM scheduler.
* If you need to submit more than 500 jobs in batch at once, you can develop and run an "outer submission script" that repeatedly attempts to run an "inner submission script" (your original submission script) to submit jobs in the batch periodically, until all job submissions are successful. The outer submission script should use looping logic to check if you are at the max job limit and should then retry submission after waiting for some time interval.
: An example outer submission script is as follows. In this example, <code>example_inner.sh</code> is your inner submission script and is not an [[SLURM/ArrayJobs | array job]], and you want to run 1000 jobs. If your inner submission script is an array job, adjust the number of jobs accordingly. Array jobs must be of size 500 or less.
<pre>
#!/bin/bash
# Total number of jobs to submit.
numjobs=1000
i=0
while [ "$i" -lt "$numjobs" ]
do
    # Retry while sbatch reports that the per-user submit limit has been reached.
    while [[ "$(sbatch example_inner.sh 2>&1)" =~ "QOSMaxSubmitJobPerUserLimit" ]]
    do
        echo "Currently at maximum job submissions allowed by the partition's QoS."
        echo "Waiting for 5 minutes before trying to submit more jobs."
        sleep 300
    done
    i=$(( i + 1 ))
    echo "Submitted job $i of $numjobs"
done
</pre>

It is suggested that you run the outer submission script in a [[Tmux]] session to keep the terminal window executing it from being interrupted.

= Storage =
All network storage available in Nexus is currently [[NFS]] based, and comes in a few different flavors. Compute nodes also have local scratch storage that can be used.

== Home Directories ==
{{Nfshomes}}

== Scratch Directories ==
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the Nexus compute infrastructure:
* Network scratch directories
* Local scratch directories

Please note that [[ClassAccounts | class accounts]] do not have network scratch directories.

=== Network Scratch Directories ===
You are allocated 200GB of scratch space via NFS from <code>/fs/nexus-scratch/<USERNAME></code> where <USERNAME> is your UMIACS username. '''It is not backed up or protected in any way.''' This directory is '''[[Automounter | automounted]]'''; you will need to <code>cd</code> into the directory or request/specify a fully qualified file path to access it.

You can view your quota usage by running <code>df -h /fs/nexus-scratch/<USERNAME></code>.

You may request a permanent increase of up to 400GB total space without any faculty approval by [[HelpDesk | contacting staff]]. If you need space beyond 400GB, you will need faculty approval and/or a [[#Project_Allocations | project allocation]] for this. If you choose to increase your scratch space beyond 400GB, the increased space is also subject to the 270 TB days limit mentioned in the project allocation section before we check back in for renewal. For example, if you request 1.4TB total space, you may have this for 270 days (1TB beyond the 400GB permanent increase). The amount increased beyond 400GB will also count against your faculty member's 20TB total storage limit mentioned below.

This file system is available on all submission, data management, and computational nodes within the cluster.

=== Local Scratch Directories ===
Each computational node that you can schedule compute jobs on also has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. and '''are not backed up or protected in any way.''' These directories are almost always more performant than any other storage available to the job as they are mounted from disks directly attached to the compute node. However, you must stage your data within the confines of your job and extract the relevant resultant data elsewhere before the end of your job.

These local scratch directories have a tmpwatch job which will '''delete unaccessed data after 90 days''', scheduled via maintenance jobs to run once a month during our [[MonthlyMaintenanceWindow | monthly maintenance windows]]. Please make sure you secure any resultant data you wish to keep from these directories at the end of your job.
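
The stage-in and stage-out pattern described above can be sketched with two small helper functions. The function names and the directory layout in the commented usage (a per-user, per-job directory under <code>/scratch0</code>) are assumptions made for illustration, not a layout required by Nexus.

```shell
#!/bin/bash
# stage_in SRC DEST: copy the contents of SRC into DEST (created if missing).
stage_in() {
    mkdir -p "$2" && cp -a "$1/." "$2/"
}

# stage_out SRC DEST: copy results from SRC into DEST, then remove SRC
# so that nothing is left behind on the local scratch disk.
stage_out() {
    mkdir -p "$2" && cp -a "$1/." "$2/" && rm -rf "$1"
}

# Hypothetical use inside a batch script (paths are illustrative only):
#   work="/scratch0/$USER/$SLURM_JOB_ID"
#   stage_in  "/fs/nexus-scratch/$USER/input" "$work/input"
#   ... run the computation, reading "$work/input" and writing "$work/output" ...
#   stage_out "$work/output" "/fs/nexus-scratch/$USER/results"
```

Because <code>stage_out</code> only removes the source after a successful copy, a failed copy leaves the results on local scratch, where they remain subject to the tmpwatch policy above.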

== Faculty Allocations ==
Each faculty member can be allocated 1TB of permanent lab space upon request. We can also support grouping these individual allocations together into larger center, lab, or research group allocations if desired by the faculty. Please [[HelpDesk | contact staff]] to inquire.

Lab space storage is fully protected. It has [[Snapshots | snapshots]] enabled and is [[NightlyBackups | backed up nightly]].

== Project Allocations ==
Project allocations are available per user for 270 TB days; for example, you can have a 1TB allocation for up to 270 days, a 3TB allocation for up to 90 days, etc.

A single faculty member cannot have more than 20TB of project allocations active simultaneously across all of their sponsored accounts. Network scratch space increases beyond the 400GB permanent maximum also count against this limit (i.e., a 1TB network scratch allocation would have 600GB counted towards this limit).

Project storage is fully protected. It has [[Snapshots | snapshots]] enabled and is [[NightlyBackups | backed up nightly]].

The maximum allocation length you can request is 540 days (500GB space) and the maximum storage space you can request is 9TB (30 day length).
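
The TB days arithmetic above can be sketched as follows. The function name <code>allocation_days</code> is hypothetical, and the sketch assumes 1TB = 1000GB, which is consistent with the examples on this page (e.g., 500GB for 540 days).

```shell
#!/bin/bash
TB_DAYS_BUDGET=270   # TB days available per project allocation
MAX_DAYS=540         # maximum allocation length in days
MAX_SIZE_GB=9000     # maximum allocation size (9TB)

# allocation_days SIZE_GB: print the whole number of days that an
# allocation of SIZE_GB can last within the 270 TB days budget.
allocation_days() {
    local size_gb=$1 days
    if [ "$size_gb" -le 0 ] || [ "$size_gb" -gt "$MAX_SIZE_GB" ]; then
        echo "size must be between 1GB and ${MAX_SIZE_GB}GB" >&2
        return 1
    fi
    # Integer division rounds down to whole days.
    days=$(( TB_DAYS_BUDGET * 1000 / size_gb ))
    if [ "$days" -gt "$MAX_DAYS" ]; then
        days=$MAX_DAYS
    fi
    echo "$days"
}

allocation_days 1000   # 1TB
allocation_days 3000   # 3TB
```

The two calls print 270 and 90, matching the examples above. For a network scratch increase, the same arithmetic would apply to the amount beyond 400GB, so a 1.4TB scratch directory corresponds to <code>allocation_days 1000</code>.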

To request an allocation, please [[HelpDesk | contact staff]] and include the faculty member(s) that the project falls under in the conversation. Please provide the following details:
* Project Name (short)
* Description
* Size (1TB, 2TB, etc.)
* Length in days (270 days, 135 days, etc.)
* Other user(s) that need to access the allocation, if any

These allocations are available via <code>/fs/nexus-projects/<project name></code>. '''Renewal is not guaranteed to be available due to limits on the amount of total storage.''' Near the end of the allocation period, staff will contact you and ask if you are still in need of the storage allocation. If renewal is available, you can renew for up to another 270 TB days with reapproval from the original faculty approver.
* If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then remove the allocation.
* If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
** If you do respond asking for renewal but the original faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
** If one month from the end of the allocation period is reached without both you and the faculty approver responding, staff will remove the allocation.

== Datasets ==
We have read-only dataset storage available at <code>/fs/nexus-datasets</code>. If there are datasets that you would like to see curated and made available, please see [[Datasets | this page]].

The list of Nexus datasets we currently host can be viewed [https://info.umiacs.umd.edu/datasets/list/?q=Nexus here].

MonthlyMaintenanceWindow

2026-05-18T14:34:22Z

Mbaney:

[[HelpDesk | UMIACS staff]] takes a monthly maintenance window to patch and reboot all UMIACS-supported hosts and services. This provides a way for staff to ensure security updates are installed and applied on the numerous different platforms and appliances that UMIACS runs.

The window for each month is calculated by adding 9 days to [https://en.wikipedia.org/wiki/Patch_Tuesday Microsoft's Patch Tuesday] to allow for enough time to marshal patches released that month from Microsoft, Red Hat, Apple, and other OS and application vendors and have enough time to get systems prepared to reboot. This translates to the window being on the '''Thursday that occurs between the 17th and the 23rd (inclusive)''' of each month. The window lasts from '''5pm-8pm'''.
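
The date calculation above can be sketched in a few lines of shell. This assumes GNU <code>date</code> (for the <code>-d</code> option) and takes Patch Tuesday to be the second Tuesday of the month; the function name <code>maintenance_window</code> is hypothetical. It only gives the nominal date and does not know about manually adjusted dates.

```shell
#!/bin/bash
# maintenance_window YEAR MONTH
# Prints the nominal window date (YYYY-MM-DD): Patch Tuesday plus 9 days.
maintenance_window() {
    local year=$1
    local month=$(( 10#$2 ))   # force base 10 so "08" and "09" are accepted
    local first dow to_first_tuesday
    first=$(printf '%04d-%02d-01' "$year" "$month")
    # Day of week of the 1st of the month: 1 = Monday ... 7 = Sunday
    dow=$(date -u -d "$first" +%u)
    # Days from the 1st to the first Tuesday (Tuesday = 2)
    to_first_tuesday=$(( (2 - dow + 7) % 7 ))
    # The second Tuesday is 7 days later; the window is 9 days after that.
    date -u -d "$first +$(( to_first_tuesday + 16 )) days" +%Y-%m-%d
}

maintenance_window 2026 8
maintenance_window 2026 9
```

The two calls print 2026-08-20 and 2026-09-17, which agree with the list of upcoming windows below.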

[[Nexus]] will always have a reservation in place from 4:45pm-8pm on the day of the upcoming window to prevent jobs from being scheduled on compute nodes. The 15-minute addition before the start of the window is to allow jobs to fully end. Any job submitted before the reservation begins that has a time limit that would run into the reservation will be held until at least the end of the reservation - 8pm on the day of the window. This is to prevent issues with jobs failing to end properly causing delays in work we have scheduled during the window.
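
To keep a job from being held, its time limit needs to end before the reservation begins. A minimal sketch of that arithmetic follows, assuming GNU <code>date</code>; the function name <code>minutes_until</code> is hypothetical, and the 4:45pm start time is the reservation start described above.

```shell
#!/bin/bash
# minutes_until START [NOW]
# Prints the whole minutes from NOW (default: the current time) to START,
# or 0 if START has already passed. Both are strings GNU date can parse.
minutes_until() {
    local start_epoch now_epoch minutes
    start_epoch=$(date -d "$1" +%s) || return 1
    now_epoch=$(date -d "${2:-now}" +%s) || return 1
    minutes=$(( (start_epoch - now_epoch) / 60 ))
    if [ "$minutes" -lt 0 ]; then
        minutes=0
    fi
    echo "$minutes"
}

# Example with fixed times: noon until the 4:45pm reservation start.
minutes_until "2026-08-20 16:45" "2026-08-20 12:00"
```

The example call prints 285. Since <code>--time</code> accepts a plain number of minutes, a value somewhat smaller than this result could be passed as the time limit of a job submitted on the day of the window.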

A list of upcoming maintenance windows is as follows, with the next one in bold. Again, the window is on the '''Thursday that occurs between the 17th and the 23rd (inclusive)''' of each month, and lasts from '''5pm-8pm'''.

* '''May 28th 2026''' (Adjusted date for CIO-imposed network change freeze)
* June 18th 2026
* July 23rd 2026
* August 20th 2026
* September 17th 2026
* October 22nd 2026
* November 19th 2026
* December 17th 2026

==Archives==
* January 17th 2013 - BEGIN time of 8pm-12am for this window through February 20th 2020
* February 21st 2013
* March 21st 2013
* April 18th 2013
* May 23rd 2013
* June 20th 2013
* July 18th 2013
* August 22nd 2013
* September 19th 2013
* October 17th 2013
* December 19th 2013
* January 23rd 2014
* February 20th 2014
* March 20th 2014
* April 17th 2014
* May 22nd 2014
* June 19th 2014
* July 17th 2014
* August 21st 2014
* September 18th 2014
* October 23rd 2014
* November 20th 2014
* December 18th 2014
* January 22nd 2015
* February 19th 2015
* March 19th 2015
* May 21st 2015
* June 18th 2015
* July 23rd 2015
* August 20th 2015
* September 17th 2015
* October 22nd 2015
* November 19th 2015
* December 17th 2015
* January 21st 2016
* February 18th 2016
* March 12th 2016 (Adjusted date for AVW power outage)
* April 21st 2016
* May 19th 2016
* June 23rd 2016
* July 21st 2016
* August 18th 2016
* September 22nd 2016
* October 20th 2016
* November 17th 2016
* December 22nd 2016
* January 19th 2017
* February 23rd 2017
* March 23rd 2017
* April 20th 2017
* May 18th 2017
* June 22nd 2017
* July 20th 2017
* August 17th 2017
* September 21st 2017
* October 19th 2017
* December 21st 2017
* January 18th 2018
* February 22nd 2018
* March 22nd 2018
* April 19th 2018
* May 17th 2018
* June 21st 2018
* July 19th 2018
* August 23rd 2018
* September 20th 2018
* October 18th 2018
* December 20th 2018
* January 24th 2019
* February 21st 2019
* April 18th 2019
* May 23rd 2019
* June 20th 2019
* July 18th 2019
* August 22nd 2019
* September 19th 2019
* October 17th 2019
* November 21st 2019
* December 19th 2019
* January 23rd 2020
* February 20th 2020
* April 23rd 2020 - BEGIN time of 5pm-7pm for this window through August 19th 2021
* June 18th 2020
* July 23rd 2020
* August 20th 2020
* September 17th 2020
* October 22nd 2020
* November 19th 2020
* December 17th 2020
* January 21st 2021
* February 18th 2021
* March 25th 2021 (Adjusted date for extended Spring Break)
* April 22nd 2021
* May 20th 2021
* June 17th 2021
* July 22nd 2021
* August 19th 2021
* September 23rd 2021 - BEGIN time of 5pm-8pm for this window and all others below
* October 21st 2021
* November 18th 2021
* January 20th 2022
* February 17th 2022
* March 24th 2022 (Adjusted date for Spring Break)
* April 21st 2022
* May 19th 2022
* June 23rd 2022
* July 21st 2022
* August 18th 2022
* September 22nd 2022
* October 20th 2022
* November 17th 2022
* January 19th 2023
* February 23rd 2023
* April 20th 2023
* May 18th 2023
* June 22nd 2023
* July 20th 2023
* August 17th 2023
* September 21st 2023
* October 19th 2023
* December 20th 2023 (Adjusted date for early Winter Break)
* January 18th 2024
* February 22nd 2024
* March 21st 2024
* April 18th 2024
* May 23rd 2024
* June 20th 2024
* July 18th 2024
* August 22nd 2024
* September 19th 2024
* October 17th 2024
* November 21st 2024
* December 19th 2024
* January 23rd 2025
* February 20th 2025
* March 20th 2025
* April 17th 2025
* May 22nd 2025
* June 19th 2025
* July 17th 2025
* August 21st 2025
* September 18th 2025
* October 23rd 2025
* November 20th 2025
* December 18th 2025
* January 22nd 2026
* February 19th 2026
* March 19th 2026
* April 23rd 2026

BarracudaSpamFirewall

2026-05-11T19:57:33Z

Mbaney:

===Introduction===
UMIACS has deployed a system with two Barracuda Networks spam firewalls. This allows for enterprise-level virus and spam scoring and filtering for our email architecture. You can log in to either of the two firewalls using your UMIACS email address (username@umiacs.umd.edu) and UMD directory passphrase:
*[https://bubs.umiacs.umd.edu bubs.umiacs.umd.edu]
*[https://pompom.umiacs.umd.edu pompom.umiacs.umd.edu]

===Mail Flow Through Barracudas===
*The first time your mail flows through one of the Barracudas, it will send you an email with a new username and password. Subsequently, you will receive an email from the Barracuda every day (unless you configure otherwise) at approximately 3:30pm US Eastern with your quarantine summary. '''Please note that the links provided in the Actions column in the summary no longer work due to security hardening on the Barracudas. Please use the links provided above to log in.'''

===Scoring===
The Barracudas will score every message that passes through them and inject varying message headers based on that score. You can then create email filters based on these headers to filter out messages that are tagged as spam by the Barracudas. See [[BarracudaSpamFirewall/Scoring]] for more details.

===Quarantine===
*Mail that has been deemed as spam will be kept on the Barracudas in quarantine. It will not be delivered to your mailbox unless you configure the Barracudas to do so.
*Your quarantine will be preserved for '''21 days'''. Individual mail messages are purged if they are still in the quarantine 22 days after they are first received.
*You can search your spam quarantine by following the steps [[BarracudaSpamFirewall/SearchingQuarantine | here]].

===Quarantine Passthrough===
*If you wish to have mail that would ordinarily be quarantined by the Barracudas delivered to your mailbox instead, you can configure this using the Barracuda web configuration.
*You can enable this functionality by following the steps [[BarracudaSpamFirewall/QuarantinePassthrough | here]].

===Allow Lists, Block Lists, and Bayesian Filtering===
*You may also set up allow lists, block lists, and Bayesian filtering options through the Preferences tab at the top of the Barracuda web portal.

===More Information===
*For more information on how to use the Barracudas, please download the user's guide:

https://wiki.umiacs.umd.edu/umiacs/images/5/5a/Barracuda_usersguide.pdf

BarracudaSpamFirewall

2026-05-11T19:56:37Z

Mbaney:

===Introduction===
UMIACS has deployed a system with two Barracuda Networks spam firewalls. This allows for enterprise-level virus and spam scoring and filtering for our email architecture. You can log in to either of the two firewalls using your UMIACS email address (username@umiacs.umd.edu) and UMD passphrase:
*[https://bubs.umiacs.umd.edu bubs.umiacs.umd.edu]
*[https://pompom.umiacs.umd.edu pompom.umiacs.umd.edu]

===Mail Flow Through Barracudas===
*The first time your mail flows through one of the Barracudas, it will send you an email with a new username and password. After that, you will receive a daily email (unless you configure otherwise) at approximately 3:30pm US Eastern from the Barracuda with your quarantine summary. '''Please note that the links provided in the Actions column of the summary no longer work due to security hardening on the Barracudas. Please use the links provided above to log in.'''

===Scoring===
The Barracudas will score every message that passes through them and inject varying message headers based on that score. You can then create email filters based on these headers to filter out messages that are tagged as spam by the Barracudas. See [[BarracudaSpamFirewall/Scoring]] for more details.

===Quarantine===
*Mail that has been deemed spam will be kept on the Barracudas in quarantine. It will not be delivered to your mailbox unless you configure the Barracudas to do so.
*Your quarantine will be preserved for '''21 days'''. Individual mail messages are purged if they are still in the quarantine 22 days after they are first received.
*You can search your spam quarantine by following the steps [[BarracudaSpamFirewall/SearchingQuarantine | here]].

===Quarantine Passthrough===
*If you wish to have mail that would ordinarily be quarantined by the Barracudas delivered to your mailbox instead, you can configure this using the Barracuda web configuration.
*You can enable this functionality by following the steps [[BarracudaSpamFirewall/QuarantinePassthrough | here]].

===Allow Lists, Block Lists, and Bayesian Filtering===
*You may also set up allow lists, block lists, and Bayesian filtering options through the Preferences tab at the top of the Barracuda web portal.

===More Information===
*For more information on how to use the Barracudas, please download the user's guide:

https://wiki.umiacs.umd.edu/umiacs/images/5/5a/Barracuda_usersguide.pdf

BarracudaSpamFirewall

2026-05-11T19:51:52Z

Mbaney:

===Introduction===
UMIACS has deployed a system with two Barracuda Networks spam firewalls. This allows for enterprise-level virus and spam scoring and filtering for our email architecture. You can log in to either of the two firewalls using your UMIACS email address (username@umiacs.umd.edu) and UMD passphrase:
*[https://bubs.umiacs.umd.edu bubs.umiacs.umd.edu]
*[https://pompom.umiacs.umd.edu pompom.umiacs.umd.edu]

===Mail Flow Through Barracudas===
*The first time your mail flows through one of the Barracudas, it will send you an email with a new username and password. After that, you will receive a daily email (unless you configure otherwise) at approximately 3:30pm US Eastern from the Barracuda with your quarantine summary. '''Please note that automatic login through the link provided in the summary no longer works due to security hardening on the Barracudas. Please use the links provided above to log in.'''

===Scoring===
The Barracudas will score every message that passes through them and inject varying message headers based on that score. You can then create email filters based on these headers to filter out messages that are tagged as spam by the Barracudas. See [[BarracudaSpamFirewall/Scoring]] for more details.

===Quarantine===
*Mail that has been deemed spam will be kept on the Barracudas in quarantine. It will not be delivered to your mailbox unless you configure the Barracudas to do so.
*Your quarantine will be preserved for '''21 days'''. Individual mail messages are purged if they are still in the quarantine 22 days after they are first received.
*You can search your spam quarantine by following the steps [[BarracudaSpamFirewall/SearchingQuarantine | here]].

===Quarantine Passthrough===
*If you wish to have mail that would ordinarily be quarantined by the Barracudas delivered to your mailbox instead, you can configure this using the Barracuda web configuration.
*You can enable this functionality by following the steps [[BarracudaSpamFirewall/QuarantinePassthrough | here]].

===Whitelists, Blacklists & Bayesian Filtering===
*You may also set up whitelists, blacklists, and Bayesian filtering options through the Preferences tab at the top of the Barracuda web portal.

===More Information===
*For more information on how to use the Barracudas, please download the user's guide:

https://wiki.umiacs.umd.edu/umiacs/images/5/5a/Barracuda_usersguide.pdf