Nexus/Tron

From UMIACS
Latest revision as of 21:34, 26 November 2024

The Tron partition is a subset of resources available in the Nexus. It was purchased using college-level funding for UMIACS and CSD faculty.

Compute Nodes

The partition contains 69 compute nodes with the specifications detailed below.

Nodenames         | Type                 | Quantity | CPU cores per node | Memory per node | GPUs per node
tron[00-05]       | A6000 GPU Node       | 6        | 32                 | 256GB           | 8
tron[06-44]       | A4000 GPU Node       | 39       | 16                 | 128GB           | 4
tron[46-61]       | A5000 GPU Node       | 16       | 48                 | 256GB           | 8
tron[62-69]       | RTX 2080 Ti GPU Node | 8        | 32                 | 384GB           | 8
tron[00-44,46-69] | Total                | 69       | 1840               | 13282GB         | 396
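
The bracketed node names in the table are ranges, and the Total row can be cross-checked from the per-type rows. The following is a minimal Python sketch of that arithmetic, not an official tool: `expand_hostlist`, `NODE_TYPES`, and `partition_totals` are hypothetical names introduced here, and the parser only handles the simple `prefix[a-b,c-d]` form used on this page.

```python
import re


def expand_hostlist(expr):
    """Expand a bracketed range such as 'tron[00-44,46-69]' into hostnames.

    Hypothetical helper: supports only a single 'prefix[ranges]' expression.
    """
    match = re.fullmatch(r"(\w+)\[([\d,\-]+)\]", expr)
    if match is None:
        return [expr]  # plain hostname, nothing to expand
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo  # a single number such as '07' is a range of one
        width = len(lo)  # keep the zero padding used in the expression
        hosts.extend(f"{prefix}{n:0{width}d}" for n in range(int(lo), int(hi) + 1))
    return hosts


# (nodenames, GPU type, CPU cores per node, memory per node in GB, GPUs per node),
# copied from the table above.
NODE_TYPES = [
    ("tron[00-05]", "A6000", 32, 256, 8),
    ("tron[06-44]", "A4000", 16, 128, 4),
    ("tron[46-61]", "A5000", 48, 256, 8),
    ("tron[62-69]", "RTX 2080 Ti", 32, 384, 8),
]


def partition_totals(node_types):
    """Sum node count, CPU cores, nominal memory, and GPUs over all node types."""
    totals = {"nodes": 0, "cpu_cores": 0, "memory_gb": 0, "gpus": 0}
    for hostlist, _gpu_type, cores, memory_gb, gpus in node_types:
        count = len(expand_hostlist(hostlist))
        totals["nodes"] += count
        totals["cpu_cores"] += count * cores
        totals["memory_gb"] += count * memory_gb
        totals["gpus"] += count * gpus
    return totals


if __name__ == "__main__":
    print(partition_totals(NODE_TYPES))
```

Under these assumptions the node, CPU core, and GPU sums agree with the Total row (69, 1840, and 396). The nominal per-node memory sums to 13696GB, a little above the 13282GB listed, so the listed figure is evidently not a plain sum of the nominal sizes.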

Network

The network infrastructure supporting the Tron partition consists of:

  1. One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following hosts:
    • tron[00-05]: Two 100GbE links per node, one to each switch in the pair (redundancy).
    • tron[06-44]: Two 50GbE links per node, one to each switch in the pair (redundancy).
    • tron[46-61]: One 100GbE link per node. Half of these links go to one switch in the pair, and the other half go to the other switch.
  2. One switch connected to the above pair of network switches via two 100GbE links, one to each switch in the pair, serving the following hosts:
    • tron[62-69]: Two 10GbE links to the switch per node (increased bandwidth).

The fileserver hosting all Nexus scratch, faculty, project, and dataset allocations also connects to the same pair of switches supporting tron[00-44,46-61] via four 100GbE links, two to each switch in the pair (redundancy and increased bandwidth).
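
The link counts above can be turned into nominal bandwidth figures. The following Python sketch does only that arithmetic; the names `NODE_LINKS`, `per_node_gbps`, and `group_gbps` are hypothetical, and the results are upper bounds that assume every link carries traffic at its full rated speed. Links described as redundant may not add usable bandwidth in practice.

```python
# nodenames -> (node count, links per node, Gb/s per link), from the list above.
NODE_LINKS = {
    "tron[00-05]": (6, 2, 100),
    "tron[06-44]": (39, 2, 50),
    "tron[46-61]": (16, 1, 100),
    "tron[62-69]": (8, 2, 10),
}

# Uplinks from the third switch to the switch pair, and the fileserver links.
EDGE_UPLINK_GBPS = 2 * 100
FILESERVER_GBPS = 4 * 100


def per_node_gbps(links, speed):
    """Nominal bandwidth of one node: link count times link speed."""
    return links * speed


def group_gbps(count, links, speed):
    """Nominal bandwidth of a whole node group."""
    return count * per_node_gbps(links, speed)


if __name__ == "__main__":
    for name, (count, links, speed) in NODE_LINKS.items():
        print(name, per_node_gbps(links, speed), "Gb/s per node,",
              group_gbps(count, links, speed), "Gb/s for the group")

    # tron[62-69] sit behind a single switch with two 100GbE uplinks.
    edge_demand = group_gbps(*NODE_LINKS["tron[62-69]"])
    print("tron[62-69] demand / uplink:", edge_demand / EDGE_UPLINK_GBPS)
    print("fileserver:", FILESERVER_GBPS, "Gb/s")
```

On these nominal figures, the eight tron[62-69] nodes can offer at most 160 Gb/s against 200 Gb/s of uplink, so that switch is not oversubscribed, while the fileserver's 400 Gb/s is far below the combined nominal capacity of the nodes on the switch pair.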