Nexus

From UMIACS
Revision as of 20:12, 28 February 2022 by Derek

The Nexus is the combined scheduler of computational resources in UMIACS. Many of our existing standalone computational clusters will be folded into this scheduler. The resource manager is SLURM, and resources are arranged into partitions in which users can schedule computational jobs. Users are arranged into a number of SLURM accounts based on faculty, lab or center investments.

Getting Started

All accounts in UMIACS are sponsored. If you don't already have a UMIACS account, please see Nexus/Accounts for information on getting one.

Access

The submission nodes for the Nexus computational resources are determined by department, center or lab affiliation. Users can log into the UMIACS Directory application and select their Computational Resources (CR). There they will find a CR with the prefix nexus; selecting it shows the available login nodes in its Host section.

Note - UMIACS requires multi-factor authentication through our Duo instance. This is completely separate from both the UMD and CSD Duo instances, and users will need to enroll one or more devices to access resources in UMIACS. Users will be prompted to enroll the first time they log into the Directory application.

Once users have identified their submission nodes, they can SSH directly into them. From there, users can submit to the cluster via our SLURM workload manager. Users need to make sure that they submit jobs with the correct account, partition and qos.

Jobs

SLURM jobs are submitted with either srun or sbatch, depending on whether you are running an interactive or a batch job, respectively. You will need to specify where and as whom the job will run, and the resources it needs. There are defaults for both, so if you don't specify something you may be scheduled with a very minimal set of time and resources (including NO GPUs unless specifically requested).

For the where/who, you may be required to specify --account, --qos and/or --partition in order to submit jobs to the Nexus.

For resources, you may need to specify CPUs (--cpus-per-task), memory (--mem), and GPUs (--gres=gpu) in your submission arguments to meet your requirements. For more information about submitting for GPU resources see SLURM/JobSubmission#Requesting_GPUs.
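As a rough illustration of how these options combine, the sketch below assembles an interactive srun command. The account and partition values are placeholders assumed for the example (use the ones that apply to your own account), and the command is only printed so that it can be reviewed first. The CPU request uses --cpus-per-task, the standard Slurm option for CPUs per task.

```shell
#!/bin/bash
# Placeholder values, assumed for illustration; substitute your own.
account="nexus"
partition="tron"
qos="medium"

# Where/who options first, then the resource requests.
srun_args=(
  --pty
  --account="$account"
  --partition="$partition"
  --qos="$qos"
  --cpus-per-task=8
  --mem=64G
  --gres=gpu:2
)

# Print the full command for review; drop "echo" to launch the session.
echo srun "${srun_args[@]}" bash
```

The requested amounts here sit at the limits listed for the medium QoS; smaller requests are generally scheduled sooner.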

Interactive

Once logged into a submission node, you can run simple interactive jobs. If your session on the submission node is interrupted, the job will be killed, so we encourage users to use a terminal multiplexer such as Tmux.

$ srun --pty --gres=gpu:1 nvidia-smi -L
GPU 0: NVIDIA RTX A4000 (UUID: GPU-ae5dc1f5-c266-5b9f-58d5-7976e62b3ca1)
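One way to follow this advice is to start the interactive job from inside a tmux session, so that the job survives a dropped SSH connection. This is a minimal sketch, assuming tmux is installed on the submission node; the helper name nexus_session and the session name nexus are arbitrary choices made for the example.

```shell
#!/bin/bash
# Hypothetical helper: create a tmux session with the given name,
# or re-attach to it if it already exists (-A).
nexus_session() {
  local name="${1:-nexus}"
  tmux new-session -A -s "$name"
}

# Typical use on a submission node:
#   nexus_session                   # start or re-attach to the session
#   srun --pty --gres=gpu:1 bash    # inside tmux: interactive shell with one GPU
# Detach with Ctrl-b d; the job keeps running while you are disconnected.
```

To pick the session up again, log back into the same submission node and call the helper with the same name.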

Batch

Batch jobs are scheduled with a script file. Job scheduling parameters can optionally be embedded in the script via #SBATCH lines at the top of the file. You can find some examples in our SLURM/JobSubmission documentation.
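For illustration, a batch script with embedded parameters might look like the sketch below. The job name, account, partition and resource amounts are placeholders assumed for the example. The nvidia-smi call is guarded so that the script also runs on a machine without GPUs.

```shell
#!/bin/bash
# Placeholder job name, account and partition; replace with your own.
#SBATCH --job-name=nexus-example
#SBATCH --account=nexus
#SBATCH --partition=tron
#SBATCH --qos=medium
# Resource requests: 4 CPU cores, 16GB of memory, 1 GPU, 1 hour of wall time.
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# SLURM_JOB_ID is set by Slurm inside a job; fall back to "none" elsewhere.
job_id="${SLURM_JOB_ID:-none}"
echo "Job ${job_id} running on $(uname -n)"

# List the GPUs visible to the job, if the tool is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -L || echo "no GPUs visible"
fi
```

Saved under a name of your choice, for example example_job.sh, the script would be submitted with sbatch example_job.sh. Options given on the sbatch command line take precedence over the #SBATCH lines.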

Partitions

The SLURM resource manager uses partitions as job queues, which can restrict size, time and user limits. When fully operational, the Nexus will have a number of different partitions of resources. Different Centers, Labs, and Faculty will be able to invest in computational resources that will be restricted to approved users through these partitions.

  • Nexus/Tron - This is currently the pool of resources available to all UMIACS and CSD faculty and graduate students. It will provide access for undergraduate and graduate teaching resources.
  • Scavenger - This is a preemption partition that supports nodes from multiple other partitions. Jobs are subject to preemption rules, but more resources are available for users to schedule. Jobs must handle preemption correctly; otherwise, once they are re-queued and able to run again, they will simply restart from the beginning.
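One common way to handle preemption is to record progress in a checkpoint file and resume from it when the job is restarted. The sketch below illustrates the idea under stated assumptions: the checkpoint path, the step count and the function name run_from_checkpoint are hypothetical, the echo stands in for real work, and any --account option you may need is omitted. --requeue is the standard sbatch option that makes a job eligible for re-queueing.

```shell
#!/bin/bash
#SBATCH --partition=scavenger
#SBATCH --qos=scavenger
#SBATCH --requeue

# Hypothetical checkpoint location and step count, for illustration only.
checkpoint="${CHECKPOINT_FILE:-./progress.ckpt}"
total_steps="${TOTAL_STEPS:-5}"

# Resume after the last completed step recorded in the checkpoint (0 if none).
run_from_checkpoint() {
  local ckpt="$1" total="$2" last=0 step
  if [ -f "$ckpt" ]; then
    last="$(cat "$ckpt")"
  fi
  for ((step = last + 1; step <= total; step++)); do
    echo "running step $step"     # stand-in for the real work
    echo "$step" > "$ckpt"        # record progress once the step completes
  done
}

run_from_checkpoint "$checkpoint" "$total_steps"
```

If the job is preempted after step 3, the re-queued run reads 3 from the checkpoint and continues with step 4 instead of starting over.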

Quality of Service (QoS)

SLURM uses a QoS to provide limits on job sizes to users. Note that users should still try to allocate only the minimum resources their jobs need, as the resources your jobs are allocated count against your future FairShare priority.

  • normal - Default QoS which will limit users to 4 cores, 32GB RAM and 1 GPU. The maximum wall time will be 3 days and users will be able to run up to 4 jobs.
  • medium - Limited to 8 cores, 64GB RAM, 2 GPUs. The maximum wall time will be 2 days and users will be able to run 2 jobs.
  • high - Limited to 16 cores, 128GB RAM, 4 GPUs. The maximum wall time will be 1 day and users will only be able to run 1 job.
  • scavenger - Limited to 64 cores, 256GB RAM and 8 GPUs. The maximum wall time will be 2 days and users may only run 2 jobs. This QoS is only available in the scavenger partition.
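The limits above can be turned into a small lookup. The sketch below is a hypothetical helper (pick_qos is not a SLURM or UMIACS command) that prints the smallest QoS from the list whose core, memory and GPU limits cover a request. It uses only the figures listed above and ignores wall time and job count limits.

```shell
#!/bin/bash
# pick_qos CORES MEM_GB GPUS
# Print the smallest QoS whose limits cover the request, or fail if none does.
pick_qos() {
  local cores="$1" mem_gb="$2" gpus="$3"
  if [ "$cores" -le 4 ] && [ "$mem_gb" -le 32 ] && [ "$gpus" -le 1 ]; then
    echo default      # the default QoS
  elif [ "$cores" -le 8 ] && [ "$mem_gb" -le 64 ] && [ "$gpus" -le 2 ]; then
    echo medium
  elif [ "$cores" -le 16 ] && [ "$mem_gb" -le 128 ] && [ "$gpus" -le 4 ]; then
    echo high
  elif [ "$cores" -le 64 ] && [ "$mem_gb" -le 256 ] && [ "$gpus" -le 8 ]; then
    echo scavenger    # only available in the scavenger partition
  else
    echo "no listed QoS covers this request" >&2
    return 1
  fi
}

pick_qos 8 16 0
```

Keep in mind the trade-off in the list: the larger QoS levels come with shorter maximum wall times and fewer concurrent jobs.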

To find out which accounts and partitions you have access to, use the show_assoc command.

Storage

All storage available in Nexus is NFS based. These storage allocation procedures will be revised and approved by the launch of Phase 2 by a CSD and UMIACS faculty committee.

Home Directories

Each user account in UMIACS is allocated 20GB of storage in their home directory (/nfshomes/$username). This file system has snapshots and backups available. The quota is fixed, however, and cannot be increased.

In Phase 2, as other standalone compute clusters fold into partitions in Nexus, you will start to have the same home directory across all systems.

Scratch Directories

Each user will be allocated 200GB of scratch space under /fs/nexus-scratch/$username. Once it is filled, users may request an increase of up to 400GB. This space does not have snapshots and is not backed up. Please ensure that any data you keep in scratch is reproducible.
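To see how close a directory is to its allocation, something like the following sketch may help. The function name usage_percent is hypothetical, and du reports disk usage as seen by the file system, which may differ from how the quota is actually accounted.

```shell
#!/bin/bash
# usage_percent DIR QUOTA_GB
# Print DIR's disk usage as a whole percentage of QUOTA_GB.
usage_percent() {
  local dir="$1" quota_gb="$2" used_kb
  used_kb="$(du -sk "$dir" | cut -f1)"    # usage in kilobytes
  echo $(( used_kb * 100 / (quota_gb * 1024 * 1024) ))
}

# Example for the scratch allocation described above:
#   usage_percent "/fs/nexus-scratch/$USER" 200
```

Walking a large directory tree over NFS with du can take a while, so it is best run sparingly.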

Faculty Allocations

Each faculty member will have 1TB of lab space allocated to them when their account is installed. We can also support grouping these individual allocations together into larger center, lab or research group allocations if desired by the faculty. Please contact staff@umiacs.umd.edu to inquire.

By default, this lab space does not have snapshots (they are available if requested), but it is backed up.

Project Allocations

Project allocations are available per user for 270 TB days, which means that you can have a 1TB allocation for up to 270 days, or a 3TB allocation for 90 days. A single faculty member cannot have more than 20 TB of sponsored account project allocations active at any point.
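The TB-day arithmetic is a simple division of the 270 TB-day budget by the allocation size. A minimal sketch, with max_days as a hypothetical helper name and sizes given in whole terabytes:

```shell
#!/bin/bash
# max_days SIZE_TB
# Print the longest allocation, in whole days, that fits in the
# 270 TB-day budget for an allocation of SIZE_TB terabytes.
max_days() {
  local size_tb="$1" budget_tb_days=270
  if ! [ "$size_tb" -gt 0 ] 2>/dev/null; then
    echo "size must be a positive whole number of TB" >&2
    return 1
  fi
  echo $(( budget_tb_days / size_tb ))
}

max_days 1    # prints 270
max_days 3    # prints 90
```

By the same arithmetic, a 2TB allocation could run for up to 135 days.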

When requesting an allocation please CC your account sponsor when you send email to staff@umiacs.umd.edu. Please include the following details:

  • Project Name (short)
  • Description
  • Size (1TB, 2TB, etc.)
  • Length in days (e.g., 180 days)

These allocations will be available via /fs/nexus-projects/$project_name.

Data Sets

Data sets will be hosted in /fs/nexus-datasets. If you want to request a data set for consideration, please email staff@umiacs.umd.edu. We will have a more formal process to approve data sets by Phase 2 of Nexus. Please note that data sets that require accepting a license will need to be reviewed by ORA, which may take some time to process.