Nexus: Difference between revisions

From UMIACS

Revision as of 19:02, 22 September 2023

The Nexus is the combined scheduler of resources in UMIACS. Many of our existing computational clusters that have discrete schedulers will be folding into this scheduler in the future (see below). The resource manager for Nexus is SLURM. Resources are arranged into partitions where users are able to schedule computational jobs. Users are arranged into a number of SLURM accounts based on faculty, lab, or center investments.

Getting Started

All accounts in UMIACS are sponsored. If you don't already have a UMIACS account, please see Accounts for information on getting one. You need a full UMIACS account (not a collaborator account) in order to access Nexus.

Access

Your access to submission nodes for Nexus computational resources is determined by your account sponsor's department, center, or lab affiliation. You can log into the UMIACS Directory CR application and select the Computational Resource (CR) in the list that has the prefix nexus. The Hosts section lists your available submission nodes, generally a pair of nodes of the format nexus<department, lab, or center abbreviation>[00,01], e.g., nexuscfar00 and nexuscfar01.

Note - UMIACS requires multi-factor authentication through our Duo instance. This is completely discrete from both UMD's and CSD's Duo instances. You will need to enroll one or more devices to access resources in UMIACS, and will be prompted to enroll when you log into the Directory application for the first time.

Once you have identified your submission nodes, you can SSH directly into them. From there, you are able to submit to the cluster via our SLURM workload manager. You need to make sure that your submitted jobs have the correct account, partition, and qos.

Jobs

SLURM jobs are submitted with either srun or sbatch, depending on whether you are running an interactive job or a batch job, respectively. You need to provide the where/how/who to run the job and specify the resources the job needs.

For the who/where/how, you may be required to specify --account, --partition, and/or --qos (respectively) to be able to adequately submit jobs to the Nexus.

For resources, you may need to specify --time for time, --ntasks for CPUs, --mem for RAM, and --gres=gpu for GPUs in your submission arguments to meet your requirements. There are defaults for all four, so if you don't specify something, you may be scheduled with a very minimal set of time and resources (e.g., by default, NO GPUs are included if you do not specify --gres=gpu). For more information about submission flags for GPU resources, see SLURM/JobSubmission#Requesting_GPUs. You can also run man srun on your submission node for a complete list of available submission arguments.
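
As a rough illustration of how these flags combine, the sketch below assembles one interactive request that names the default account (nexus), the tron partition, and the default job QoS together with all four resource flags. The specific values are assumptions chosen for illustration, not recommendations, and the command is only printed so that it can be reviewed before use.

```shell
#!/bin/sh
# Illustrative values only; substitute the account, partition, and QoS you can use.
who_where_how="--account=nexus --partition=tron --qos=default"

# Time, CPUs, RAM, and GPUs. These fit within the default job QoS limits.
resources="--time=01:00:00 --ntasks=4 --mem=8gb --gres=gpu:1"

# Print the full command for review; run it yourself once it looks right.
submit_cmd="srun --pty $who_where_how $resources bash"
echo "$submit_cmd"
```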

Interactive

Once logged into a submission node, you can run simple interactive jobs. If your session is interrupted from the submission node, the job will be killed. As such, we encourage use of a terminal multiplexer such as Tmux.

$ srun --pty --ntasks 4 --mem=2gb --gres=gpu:1 nvidia-smi -L
GPU 0: NVIDIA RTX A4000 (UUID: GPU-ae5dc1f5-c266-5b9f-58d5-7976e62b3ca1)

Batch

Batch jobs are submitted as a script file. You can optionally embed job scheduling parameters in the script via #SBATCH lines at the top of the file. You can find some examples in our SLURM/JobSubmission documentation.
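
A minimal sketch of such a script might look like the following. The job name, output file name, and resource values are hypothetical placeholders; the account, partition, and QoS shown are the general-purpose ones named elsewhere on this page and may not be the right ones for your work.

```shell
#!/bin/bash
# Scheduling parameters: sbatch reads these lines, bash treats them as comments.
#SBATCH --job-name=example
#SBATCH --account=nexus
#SBATCH --partition=tron
#SBATCH --qos=default
#SBATCH --time=01:00:00
#SBATCH --ntasks=4
#SBATCH --mem=8gb
#SBATCH --gres=gpu:1
# In the output file name, %j expands to the job ID.
#SBATCH --output=example-%j.out

# SLURM_JOB_ID is set by SLURM inside a job; fall back to a label otherwise.
job_id="${SLURM_JOB_ID:-unscheduled}"
echo "Job ${job_id} starting"

# List the GPUs visible to the job if nvidia-smi is present on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -L
fi
```

Assuming the script is saved as example.sh (a hypothetical file name), it would be submitted with sbatch example.sh. Flags given on the sbatch command line take precedence over the matching #SBATCH lines.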

Partitions

The SLURM resource manager uses partitions to act as job queues which can restrict size, time and user limits. The Nexus has a number of different partitions of resources. Different Centers, Labs, and Faculty are able to invest in computational resources that are restricted to approved users through these partitions.

Partitions usable by all non-class account users:

  • Nexus/Tron - Pool of resources available to all UMIACS and CSD faculty and graduate students.
  • Scavenger - Preemption partition that supports nodes from multiple other partitions. More resources are available to schedule simultaneously than in other partitions; however, jobs are subject to preemption rules. You are responsible for ensuring your jobs handle this preemption correctly. The SLURM scheduler will simply restart a preempted job with the same submission arguments when it is available to run again.
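
Because a preempted job is restarted from the beginning with the same submission arguments, one common way to handle preemption is to record progress in a file and resume from it on restart. The sketch below shows one minimal pattern and is not a UMIACS-provided tool: run_steps and the checkpoint file are hypothetical names, and the unit of work is a placeholder.

```shell
#!/bin/sh
# run_steps CHECKPOINT_FILE TOTAL_STEPS
# Resumes from the step number stored in CHECKPOINT_FILE, if there is one,
# and prints the number of completed steps when finished.
run_steps() {
  ckpt="$1"
  total="$2"
  step=0
  # A non-empty checkpoint file means an earlier run was interrupted.
  if [ -s "$ckpt" ]; then
    step="$(cat "$ckpt")"
  fi
  while [ "$step" -lt "$total" ]; do
    # ... one unit of real work would go here ...
    step=$((step + 1))
    # Write to a temporary file first so that a kill in the middle of
    # the write cannot leave a truncated checkpoint behind.
    echo "$step" > "$ckpt.tmp" && mv "$ckpt.tmp" "$ckpt"
  done
  echo "$step"
}
```

The checkpoint file should live on storage that survives the restart (for example a network file system rather than node-local scratch), since the restarted job may land on a different node.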

Partitions usable by ClassAccounts:

  • Class - Pool available for UMIACS class accounts sponsored by either UMIACS or CSD faculty.

Partitions usable by specific lab/center users:

  • Nexus/CBCB - CBCB lab pool available for CBCB lab members.
  • Nexus/CLIP - CLIP lab pool available for CLIP lab members.
  • Nexus/CML - CML lab pool available for CML lab members. (all compute nodes from standalone cluster folded in as of 08/17/2023)
  • Nexus/Gamma - GAMMA lab pool available for GAMMA lab members.
  • Nexus/MBRC - MBRC lab pool available for MBRC lab members.
  • Nexus/MC2 - MC2 lab pool available for MC2 lab members.
  • Nexus/Vulcan - Vulcan lab pool available for Vulcan lab members. (all compute nodes from standalone cluster folded in as of 08/17/2023)

Quality of Service (QoS)

SLURM uses Quality of Service (QoS) both to provide limits on job sizes (termed by us as "job QoS") as well as to limit resources used by all jobs running in a partition, either per user or per group (termed by us as "partition QoS").

Job QoS

Job QoS are used to provide limits on job sizes to users. You should try to only allocate the minimum resources for your jobs, as resources that each of your jobs schedules are counted against your fair-share priority in the future.

  • default - Default job QoS. Limited to 4 cores, 32GB RAM, and 1 GPU per job. The maximum wall time per job is 3 days.
  • medium - Limited to 8 cores, 64GB RAM, and 2 GPUs per job. The maximum wall time per job is 2 days.
  • high - Limited to 16 cores, 128GB RAM, and 4 GPUs per job. The maximum wall time per job is 1 day.
  • scavenger - No resource limits per job, only a maximum wall time per job of 3 days. You are responsible for ensuring your job requests multiple nodes if it requests resources beyond what any one node is capable of. Only 576 total cores, 2304GB total RAM, and 72 total GPUs are permitted simultaneously across all of your jobs running with this job QoS. This job QoS is both only available in the scavenger partition and the only job QoS available in the scavenger partition. To use this job QoS, include --partition=scavenger and --account=scavenger in your submission arguments. Do not include any job QoS argument other than --qos=scavenger (optional) or submission will fail.
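
To make the trade-off between these job QoS concrete, the sketch below picks the smallest of default, medium, and high whose per-job limits cover a request. pick_qos is a hypothetical helper written for this page, and the limits are hard-coded from the list above, so they would need updating if the limits change.

```shell
#!/bin/sh
# pick_qos CORES MEM_GB GPUS
# Prints the smallest general job QoS whose per-job limits cover the
# request, using the limits listed above for default, medium, and high.
pick_qos() {
  cores="$1"; mem_gb="$2"; gpus="$3"
  if [ "$cores" -le 4 ] && [ "$mem_gb" -le 32 ] && [ "$gpus" -le 1 ]; then
    echo default
  elif [ "$cores" -le 8 ] && [ "$mem_gb" -le 64 ] && [ "$gpus" -le 2 ]; then
    echo medium
  elif [ "$cores" -le 16 ] && [ "$mem_gb" -le 128 ] && [ "$gpus" -le 4 ]; then
    echo high
  else
    # Larger than all three; consider scavenger or a lab-specific job QoS.
    echo none
    return 1
  fi
}

pick_qos 6 48 1   # prints "medium": 6 cores exceeds the 4-core default limit
```

Note that the larger job QoS come with shorter maximum wall times (3 days, 2 days, and 1 day respectively), so the smallest QoS that fits is also the one that allows the longest run.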

You can display these job QoS from the command line using the show_qos command. By default, the command will only show job QoS that your user can access. The above four job QoS are the ones that everyone can submit with.

$ show_qos
                Name     MaxWall                        MaxTRES MaxJobsPU                      MaxTRESPU 
-------------------- ----------- ------------------------------ --------- ------------------------------ 
             default  3-00:00:00       cpu=4,gres/gpu=1,mem=32G                                          
                high  1-00:00:00     cpu=16,gres/gpu=4,mem=128G                                                                        
              medium  2-00:00:00       cpu=8,gres/gpu=2,mem=64G                                          
           scavenger  3-00:00:00     cpu=64,gres/gpu=8,mem=256G            cpu=576,gres/gpu=72,mem=2304G 

If you want to see all job QoS, including those that you do not have access to, you can use the show_qos --all command.

$ show_qos --all
                Name     MaxWall                        MaxTRES MaxJobsPU                      MaxTRESPU 
-------------------- ----------- ------------------------------ --------- ------------------------------ 
             cml-cpu  7-00:00:00                                        8                                
         cml-default  7-00:00:00       cpu=4,gres/gpu=1,mem=32G         2                                
            cml-high  1-12:00:00     cpu=16,gres/gpu=4,mem=128G         2                                
       cml-high_long 14-00:00:00              cpu=32,gres/gpu=8         8                     gres/gpu=8 
          cml-medium  3-00:00:00       cpu=8,gres/gpu=2,mem=64G         2                                
       cml-scavenger  3-00:00:00                                                             gres/gpu=24 
       cml-very_high  1-12:00:00     cpu=32,gres/gpu=8,mem=256G         8                    gres/gpu=12 
             default  3-00:00:00       cpu=4,gres/gpu=1,mem=32G                                          
                high  1-00:00:00     cpu=16,gres/gpu=4,mem=128G                                          
             highmem 21-00:00:00               cpu=32,mem=2000G                                          
           huge-long 10-00:00:00     cpu=32,gres/gpu=8,mem=256G                                          
              medium  2-00:00:00       cpu=8,gres/gpu=2,mem=64G                                          
           scavenger  3-00:00:00     cpu=64,gres/gpu=8,mem=256G            cpu=576,gres/gpu=72,mem=2304G 
          vulcan-cpu  2-00:00:00                cpu=1024,mem=4T         4                                
      vulcan-default  7-00:00:00       cpu=4,gres/gpu=1,mem=32G         2                                
       vulcan-exempt  7-00:00:00     cpu=32,gres/gpu=8,mem=256G         2                                
         vulcan-high  1-12:00:00     cpu=16,gres/gpu=4,mem=128G         2                                
        vulcan-janus  3-00:00:00    cpu=32,gres/gpu=10,mem=256G                                          
       vulcan-medium  3-00:00:00       cpu=8,gres/gpu=2,mem=64G         2                                
       vulcan-sailon  3-00:00:00     cpu=32,gres/gpu=8,mem=256G                              gres/gpu=48 
    vulcan-scavenger  3-00:00:00     cpu=32,gres/gpu=8,mem=256G                                          

To find out what accounts and partitions you have access to, first use the show_assoc command to show your account/job QoS combinations. Then, use the scontrol show partition command and note the AllowAccounts entry for each listed partition. You are able to submit to any partition that allows an account that you have. If you need to use an account other than the default account nexus, you will need to specify an account via the --account submission argument.
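
The AllowAccounts comparison described above can be scripted. The sketch below assumes the one-record-per-line output of scontrol --oneliner show partition, in which each partition is a single line of Key=Value fields; partitions_for_account is a hypothetical helper, not a command installed on the cluster.

```shell
#!/bin/sh
# partitions_for_account ACCOUNT
# Reads one-line partition records on stdin and prints the name of each
# partition whose AllowAccounts field is ALL or contains ACCOUNT.
# Records without an AllowAccounts field are skipped.
partitions_for_account() {
  awk -v acct="$1" '
    {
      name = ""; allowed = ""
      for (i = 1; i <= NF; i++) {
        field = $i
        if (sub(/^PartitionName=/, "", field)) name = field
        else if (sub(/^AllowAccounts=/, "", field)) allowed = field
      }
      if (name == "" || allowed == "") next
      n = split(allowed, accounts, ",")
      for (j = 1; j <= n; j++) {
        if (accounts[j] == "ALL" || accounts[j] == acct) { print name; break }
      }
    }'
}

# Example use on a submission node:
#   scontrol --oneliner show partition | partitions_for_account nexus
```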

Partition QoS

Partition QoS are used to limit resources used by all jobs running in a partition, either per user or per group.

To view partition QoS, use the show_partition_qos command.

$ show_partition_qos
                Name MaxSubmitPU                      MaxTRESPU              GrpTRES 
-------------------- ----------- ------------------------------ -------------------- 
           scavenger         500  cpu=576,gres/gpu=72,mem=2304G                      
                tron         500     cpu=32,gres/gpu=4,mem=256G                                                          

If you want to see all partition QoS, including those that you do not have access to, you can use the show_partition_qos --all command.

$ show_partition_qos --all
                Name MaxSubmitPU                      MaxTRESPU              GrpTRES 
-------------------- ----------- ------------------------------ -------------------- 
                cbcb         500                                 cpu=1004,mem=47840G 
               class         500     cpu=32,gres/gpu=4,mem=256G                      
                clip         500                                   cpu=564,mem=5647G 
                 cml         500                                 cpu=1128,mem=11381G 
       cml-scavenger         500                    gres/gpu=24                      
               gamma         500                                   cpu=520,mem=4517G 
                mbrc         500                                   cpu=240,mem=2378G 
                 mc2         500                                   cpu=312,mem=3133G 
           scavenger         500  cpu=576,gres/gpu=72,mem=2304G                      
                tron         500     cpu=32,gres/gpu=4,mem=256G                      
              vulcan         500                                 cpu=1760,mem=15824G 
    vulcan-scavenger         500       

NOTE: With the exception of the scavenger QoS, these QoS cannot be used directly when submitting jobs (i.e., they are not in the AllowQos field for their respective partition). Partition QoS limits apply to all jobs running on a given partition, even when using multiple job QoS.

For example, in the default non-preemption partition (tron), you are restricted to 32 total cores, 4 total GPUs, and 256GB total RAM at once across all jobs you have running in the partition. You also can only have a maximum of 500 jobs in the partition in the running (R) or pending (PD) states simultaneously. The latter is to prevent excess pending jobs causing backfill issues.
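
As a worked example of how the per-user totals interact with job size, the sketch below computes how many identical jobs can run at once in tron. max_concurrent is a hypothetical helper, and the limits are hard-coded from the tron values quoted above, which may change over time.

```shell
#!/bin/sh
# max_concurrent CORES MEM_GB GPUS
# Number of identical jobs that fit under the tron per-user limits quoted
# above (32 cores, 256GB RAM, 4 GPUs). CORES and MEM_GB must be positive.
max_concurrent() {
  cores="$1"; mem_gb="$2"; gpus="$3"
  fit=$((32 / cores))
  by_mem=$((256 / mem_gb))
  [ "$by_mem" -lt "$fit" ] && fit="$by_mem"
  # Jobs that request no GPUs are not limited by the GPU cap.
  if [ "$gpus" -gt 0 ]; then
    by_gpu=$((4 / gpus))
    [ "$by_gpu" -lt "$fit" ] && fit="$by_gpu"
  fi
  echo "$fit"
}

max_concurrent 4 32 1   # prints 4: the GPU cap binds before cores or RAM
max_concurrent 4 32 0   # prints 8: without GPUs, cores and RAM allow 8 jobs
```

Jobs beyond this count can still be submitted, up to the 500 job limit, but they wait in the pending state until running jobs release resources.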

  • If you need to submit more than 500 jobs in batch at once, you can develop and run an "outer submission script" that repeatedly attempts to run the "inner submission script" (your original submission script) to submit jobs in the batch periodically, until all job submissions are successful. The outer submission script should use looping logic to check if you are at the max job limit and should then retry submission after waiting for some time interval. An example outer submission script is as follows. In this example, example_inner.sh is your inner submission script which is not an array job and you want to run 1000 jobs. If your inner submission script is an array job, adjust the number of jobs accordingly. Keep in mind that array jobs must be of size 500 or fewer.
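#!/bin/bash
# Total number of jobs to submit.
numjobs=1000
i=0
while [ "$i" -lt "$numjobs" ]
do
  # Retry the same submission for as long as sbatch reports that the
  # per-user submission limit of the partition QoS has been reached.
  while [[ "$(sbatch example_inner.sh 2>&1)" =~ "QOSMaxSubmitJobPerUserLimit" ]]
  do
    echo "Currently at maximum job submissions allowed."
    echo "Waiting for 5 minutes before trying to submit more jobs."
    sleep 300
  done
  # Any other sbatch output, including unrelated errors, is counted as submitted.
  i=$(( i + 1 ))
  echo "Submitted job $i of $numjobs"
done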

It is suggested that you run the outer submission script in a Tmux session to keep the terminal window executing it from being interrupted.

Lab/group-specific partitions may also have partition QoS intended to limit the total number of resources consumed by all users in that lab/group that are using the partition (codified by GrpTRES in the output above for the partition QoS name that matches the lab/group partition name). They also have the 500 running/pending job maximum. Note that the exact values above for TRES are not fixed and may fluctuate as more resources are added to various partitions.

Storage

All storage available in Nexus is currently NFS based. We will be introducing some changes for Phase 2 to support high performance GPUDirect Storage (GDS). These storage allocation procedures will be revised and approved by the launch of Phase 2 by a joint UMIACS and CSD faculty committee.

Home Directories

Home directories in the Nexus computational infrastructure are available from the Institute's NFShomes as /nfshomes/USERNAME where USERNAME is your username. These home directories have very limited storage (30GB, cannot be increased) and are intended for your personal files, configuration and source code. Your home directory is not intended for data sets or other large scale data holdings. You are encouraged to use our GitLab infrastructure to host your code repositories.

NOTE: To check your quota on this directory you will need to use the quota -s command.

Your home directory data is fully protected: it has snapshots and is backed up nightly.

Other standalone compute clusters have begun to fold into partitions in Nexus. The corresponding home directories used by these clusters (if not /nfshomes) will be gradually phased out in favor of the /nfshomes home directories.

Scratch Directories

Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the Nexus compute infrastructure:

  • Network scratch directories
  • Local scratch directories

Please note that class accounts do not have network scratch directories.

Network Scratch Directories

You are allocated 200GB of scratch space via NFS from /fs/nexus-scratch/$username. It is not backed up or protected in any way. This directory is automounted so you will need to cd into the directory or request/specify a fully qualified file path to access this.

You can view your quota usage by running df -h /fs/nexus-scratch/$username.

You may request a permanent increase of up to 400GB total space without any faculty approval by contacting staff. If you need space beyond 400GB, you will need faculty approval and/or a project allocation for this. If you choose to increase your scratch space beyond 400GB, the increased space is also subject to the 270 TB days limit mentioned in the project allocation section before we check back in for renewal. For example, if you request 1.4TB total space, you may have this for 270 days (1TB beyond the 400GB permanent increase).

This file system is available on all submission, data management, and computational nodes within the cluster.

Local Scratch Directories

Each computational node that you can schedule compute jobs on also has one or more local scratch directories. These are always named /scratch0, /scratch1, etc. These are almost always more performant than any other storage available to the job. However, you must stage your data in within the confines of your job and stage the data out before the end of your job.

These local scratch directories have a tmpwatch job which will delete unaccessed data after 90 days, scheduled via maintenance jobs to run once a month during our monthly maintenance windows. Please make sure you secure any data you write to these directories at the end of your job.
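
One way to structure the staging described above is a pair of small helpers called at the start and end of a job. This is a minimal sketch under assumptions: stage_in and stage_out are hypothetical names, and the input/output subdirectory layout and the example paths in the comments are illustrative only.

```shell
#!/bin/sh
# stage_in SOURCE_DIR JOB_DIR
# Copies the contents of SOURCE_DIR into JOB_DIR/input on local scratch
# and creates an empty JOB_DIR/output for results.
stage_in() {
  mkdir -p "$2/input" "$2/output" && cp -R "$1"/. "$2/input/"
}

# stage_out JOB_DIR DEST_DIR
# Copies JOB_DIR/output to DEST_DIR, then removes JOB_DIR from local
# scratch. JOB_DIR is only removed if the copy succeeded.
stage_out() {
  mkdir -p "$2" && cp -R "$1/output"/. "$2/" && rm -rf "$1"
}

# Example use inside a job script (paths are hypothetical):
#   job_dir="/scratch0/$USER/$SLURM_JOB_ID"
#   stage_in "/fs/nexus-scratch/$USER/dataset" "$job_dir"
#   ... run the computation, reading input/ and writing output/ ...
#   stage_out "$job_dir" "/fs/nexus-scratch/$USER/results"
```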

Faculty Allocations

Each faculty member can be allocated 1TB of lab space upon request. We can also support grouping these individual allocations together into larger center, lab, or research group allocations if desired by the faculty. Please contact staff to inquire.

This lab space is backed up but does not have snapshots by default (snapshots are available if requested).

Project Allocations

Project allocations are available per user for 270 TB days; you can have a 1TB allocation for up to 270 days, a 3TB allocation for 90 days, etc. A single faculty member cannot have more than 20TB of sponsored account project allocations active at any point.

The maximum allocation length you can request is 540 days (500GB space) and the maximum storage space you can request is 9TB (30 day length).
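
The TB-day arithmetic above can be sketched as a small calculation. max_allocation_days is a hypothetical helper; it assumes 1TB is counted as 1000GB, which is consistent with the 500GB for 540 days and 9TB for 30 days figures, and it rounds down to whole days.

```shell
#!/bin/sh
# max_allocation_days SIZE_GB
# Longest allocation length in days for a given size under the 270 TB-day
# budget, applying the 9TB size cap and the 540 day length cap.
max_allocation_days() {
  size_gb="$1"
  if [ "$size_gb" -le 0 ] || [ "$size_gb" -gt 9000 ]; then
    echo "size must be between 1GB and 9000GB" >&2
    return 1
  fi
  # 270 TB days expressed in GB days, divided by the requested size.
  days=$((270 * 1000 / size_gb))
  [ "$days" -gt 540 ] && days=540
  echo "$days"
}

max_allocation_days 1000   # prints 270
max_allocation_days 3000   # prints 90
max_allocation_days 500    # prints 540
```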

To request an allocation, please contact staff and include the faculty member(s) that the project is under in the conversation. Please include the following details:

  • Project Name (short)
  • Description
  • Size (1TB, 2TB, etc.)
  • Length in days (270 days, 135 days, etc.)
  • Other user(s) that need to access the allocation, if any

These allocations are available via /fs/nexus-projects/$project_name. Renewal is not guaranteed to be available due to limits on the amount of total storage. Near the end of the allocation period, staff will contact you and ask if you are still in need of the storage allocation. If renewal is available, you can renew for up to another 270 TB days with reapproval from the original faculty approver. If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then remove the allocation. If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible. If you do respond asking for renewal but the original faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible. If one month from the end of the allocation period is reached without both you and the faculty approver responding, staff will remove the allocation.

Datasets

We have read-only dataset storage available at /fs/nexus-datasets. If there are datasets that you would like to see curated and available, please see this page.

We will have a more formal process to approve datasets by Phase 2 of Nexus.

Migrations

If you are a user of an existing cluster that is in the process of being folded into Nexus now or in the near future, your cluster-specific migration information will be listed here.

  • Nexus/CML - all compute nodes folded in as of 08/17/2023
  • Nexus/Vulcan - all compute nodes folded in as of 08/17/2023