Nexus/Network
Revision as of 20:32, 26 November 2024
Overview
The Nexus cluster runs on a hierarchical Ethernet-based network with node-level speeds ranging from 1GbE to 100GbE. Generally (though not always), more recently purchased compute nodes come with hardware capable of faster speeds and are connected at those speeds. Faster speeds require more expensive network switches and cables, so some labs/centers have opted to stay with slower speeds.
If you are running multi-node jobs in SLURM, or simply want the best performance for a single-node job (which can depend on which filesystem paths your job uses), it helps to know the basics of the cluster's network architecture.
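As an illustration, a multi-node SLURM job is submitted to a single partition, so its node-to-node traffic stays behind that partition's switches rather than crossing the network core. The following is only a minimal sketch: the partition name, resource counts, and program are placeholders, not actual Nexus configuration.

```shell
#!/bin/bash
# Hypothetical sbatch script. The partition name "examplelab" and the
# program "./my_mpi_app" are placeholders, not real Nexus values.
#SBATCH --job-name=multinode-example
#SBATCH --partition=examplelab   # all allocated nodes come from one partition
#SBATCH --nodes=4                # multi-node job
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00

# srun launches the tasks across the allocated nodes
srun ./my_mpi_app
```

Submit with `sbatch script.sh`; `--nodes` and `--ntasks-per-node` should be sized to your application and the hardware in the chosen partition.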
In the future, SLURM's topology-aware resource allocation support may be implemented on the cluster, but it is not currently.
Network Core
The network core for Nexus is the same network core used by all UMIACS-supported systems. It consists of a pair of network switches, connected to each other via a 40GbE link, that together provide redundancy. Node-to-node communication between nodes in the same partition rarely needs to traverse the network core.
Network Access
Different labs and centers have invested differently in the network infrastructure supporting their purchased compute nodes. Generally (though not always), this infrastructure consists of one or more pairs of network switches, with the two switches in each pair connected to each other via one or more links for redundancy. Each purchased compute node is then connected to each switch in one of these pairs via a single link, again for redundancy.
For lab/center-specific documentation, see that lab's or center's partition page. (Documentation is still under active development.)