Nexus/ClusterOSUpgrade

Overview

UMIACS Technical Staff will begin the process of upgrading the operating system version on all Nexus cluster nodes from Red Hat Enterprise Linux (RHEL) 8 to 9 in Summer 2026.

RHEL8 is in the Maintenance Support phase of its life cycle and will transition to the Extended Life phase in 2029. More information on Red Hat's lifecycle policy for its operating systems is available on Red Hat's website. We are performing these upgrades now to stay well ahead of the Extended Life phase date for our cluster nodes.

RHEL9 is still in the Full Support phase of its life cycle and introduces a newer major Linux kernel version and a newer glibc version, improving compatibility with many newer software applications.

Scheduling

Upgrades for all cluster nodes will begin as early as Monday 06/01/2026 at 9am. We expect to be finished with all cluster node upgrades no later than Friday 08/21/2026 at 5pm.

Submission Nodes

Submission nodes with the number '01' in their hostnames will be taken offline for upgrade on 06/01/2026 at 9am. We expect that they will be back online no later than 5pm on that same day.

Submission nodes with the number '00' in their hostnames will be scheduled for upgrade individually, once all of the compute nodes associated with the same lab/center have been upgraded. Staff will send a notification to each lab's/center's cluster users when it is time to schedule the relevant '00' node's upgrade. The actual date of each upgrade will be at least one week after the corresponding notification has been sent.

Data in UNIX filesystem storage spaces on each submission node, i.e., /tmp and /scratch0, will not be preserved during the upgrade. If you have data you want to keep in any such space on either submission node in a pairing, please copy it to the other submission node or to a network-attached filesystem storage space before each node's upgrade date. Data in network-attached filesystem storage spaces, such as /nfshomes or /fs/nexus-scratch, will not be affected.
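For example, a transfer from a submission node's local scratch space to network-attached storage might look like the following. The directory names are placeholders for this sketch, and a per-user directory under /fs/nexus-scratch is assumed; adjust the paths to your own data.

  # Copy a directory (preserving permissions and timestamps) from local
  # /scratch0 to network-attached storage before the node's upgrade
  # ("myproject" and the destination are illustrative placeholders)
  cp -a /scratch0/myproject /fs/nexus-scratch/$USER/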

Compute Nodes

Due to the large number of compute nodes and our desire not to interrupt running jobs, we are generally unable to schedule each specific compute node's upgrade for a specific date. If you find that a specific node is unavailable to schedule jobs on, run the command sinfo --list-reasons --long on a submission node and check whether the node appears in the list with the reason "RHEL9 upgrade"; if it does, the upgrade for that node is underway.
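For example, a node that is down for its upgrade might appear as follows; the node name, user, timestamp, and exact column layout shown here are illustrative:

  $ sinfo --list-reasons --long
  REASON              USER      TIMESTAMP            STATE    NODELIST
  RHEL9 upgrade       root      2026-06-15T09:02:11  drained  tron15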

We will generally prioritize node upgrades based on how widely available the nodes are across partitions. Nodes that are only available in partitions serving large numbers of users for a lab/center (e.g., cbcb, clip, cml-dpart, gamma, vulcan-ampere, vulcan-dpart) and the corresponding "scavenger"-named partitions will be prioritized over nodes that are only (or also) available in faculty-specific / limited-node partitions. All nodes in the tron partition will also generally be prioritized.

If you are a faculty member authoritative for your own partition or for a small group's limited-node partition and have scheduling concerns for the nodes in these partitions, please contact staff as soon as possible, and we will make our best effort to accommodate them.

Interoperability

Software and Modules

Please begin transitioning your virtual environments, workflows, etc. to work with RHEL9 as soon as possible. You can use the '01' submission node you have access to for transitioning and light testing once it is back online on 06/01/2026. As always, please do not run any computationally intensive processes on this node; it is intended only as a host for configuring environments/workflows and submitting jobs.
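As a minimal sketch, recreating a Python virtual environment against the RHEL9 userland might look like the following; the environment and project paths are placeholders:

  # Recreate the environment on the upgraded node rather than reusing
  # one built against RHEL8's older glibc
  python3 -m venv ~/envs/myproject-rhel9
  source ~/envs/myproject-rhel9/bin/activate
  pip install -r ~/myproject/requirements.txt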

The module tree for RHEL9 has already been populated with a large number of the same modules that are available in the RHEL8 module tree, although a given module may have different versions available in the RHEL9 tree than in the RHEL8 tree. If you have a dependency on a specific version of a module that is not available in the RHEL9 tree, please contact staff and we can create it.

  • If you want to check to see if a specific version is available now but do not have access to any RHEL9 node yet, the current module tree for RHEL9 is located at /fs/UMos/RedHat-9/x86_64/local/stow/.modulefiles and can be viewed from any UMIACS-supported host.
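For example, to check from any UMIACS-supported host whether a particular package has an RHEL9 modulefile (the package name "gcc" here is just an example, and the tree's internal layout is assumed):

  # List the packages in the RHEL9 module tree
  ls /fs/UMos/RedHat-9/x86_64/local/stow/.modulefiles
  # Search the tree for modulefiles matching a package name
  find /fs/UMos/RedHat-9/x86_64/local/stow/.modulefiles -iname '*gcc*'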

SLURM Scheduling

If you want or need to schedule a job only on nodes running RHEL8 (or RHEL9, once you have validated whatever is relevant), you can add --prefer=rhel# or --constraint=rhel# to your job submission arguments, where # is replaced by the OS version number. The --prefer argument is a soft preference and the --constraint argument is a hard requirement for which nodes the job can be scheduled on. For example, if you use --prefer=rhel8 but no RHEL8 nodes are currently available in the partition you are submitting to (with your other submission arguments also satisfied), the job will be scheduled on an appropriate RHEL9 node if that would result in an earlier (or immediate) start time.
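For example, with a batch script (the script name myjob.sh is a placeholder):

  # Hard requirement: run only on RHEL8 nodes; the job pends if none qualify
  sbatch --constraint=rhel8 myjob.sh

  # Soft preference: favor RHEL9 nodes, but allow others if they start the job sooner
  sbatch --prefer=rhel9 myjob.sh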