SLURM/Priority

From UMIACS
Revision as of 14:21, 31 July 2023 by Mbaney

SLURM at UMIACS is configured to prioritize jobs based on a number of factors, termed multifactor priority in SLURM.

The factors in use at UMIACS include:

  • Age of job, i.e., time spent waiting to run in the queue
  • Partition job was submitted to
  • Fair-share of resources
  • "Nice" value that job was submitted with
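As a rough illustration of how these factors combine, the sketch below follows the general shape of SLURM's multifactor formula: each factor is a value between 0.0 and 1.0, multiplied by a configured weight, and the nice value is subtracted from the total. The weights shown are hypothetical placeholders, not the values configured at UMIACS, and the class and function names are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Hypothetical weights for illustration only; the real values live in
    # the cluster's SLURM configuration.
    age: int = 1000
    partition: int = 10000
    fairshare: int = 5000


def job_priority(age_factor, partition_factor, fairshare_factor, nice=0,
                 weights=Weights()):
    """Weighted sum of the factors (each 0.0 to 1.0), minus the nice value."""
    for factor in (age_factor, partition_factor, fairshare_factor):
        if not 0.0 <= factor <= 1.0:
            raise ValueError("each factor must be between 0.0 and 1.0")
    total = (weights.age * age_factor
             + weights.partition * partition_factor
             + weights.fairshare * fairshare_factor)
    return int(total) - nice


# Two otherwise identical jobs: the one submitted with a nice value ranks lower.
print(job_priority(0.5, 1.0, 0.5))          # no nice value
print(job_priority(0.5, 1.0, 0.5, nice=2))  # deprioritized by 2
```

With weights like these, the partition factor dominates, which is consistent with scavenger jobs ranking below non-scavenger jobs regardless of age.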

Age

A job's priority increases the longer it is eligible to run but cannot start because all available resources are in use. The priority modifier for this factor reaches its limit after 7 days.
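A minimal sketch of the age factor, assuming it grows linearly with eligible wait time until it saturates at the 7-day limit. The linear shape is an assumption about the scheduler's behavior, and the function name is hypothetical.

```python
MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # the 7-day limit described above


def age_factor(eligible_wait_seconds):
    """Return a value in [0.0, 1.0] that stops growing after 7 days."""
    if eligible_wait_seconds < 0:
        raise ValueError("wait time cannot be negative")
    return min(eligible_wait_seconds / MAX_AGE_SECONDS, 1.0)


half_week = MAX_AGE_SECONDS // 2
print(age_factor(half_week))            # halfway to the limit
print(age_factor(2 * MAX_AGE_SECONDS))  # capped once past 7 days
```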

Partition

The partition named scavenger on each of our clusters always has a lower priority factor for its jobs than all other partitions on that cluster. As mentioned in other UMIACS cluster-specific documentation, jobs submitted to this partition are also preemptable. These two design choices give the partition its name; jobs submitted to the scavenger partition "scavenge" for available resources on the cluster rather than consume dedicated chunks of resources, and are interrupted by jobs seeking to consume dedicated chunks of resources.

On Nexus, labs/centers may also have their own scavenger partitions if the faculty for the lab/center have decided upon some sort of limit on jobs (number of simultaneous jobs, number of actively consumed billing resources, etc.) in their non-scavenger partitions. These lab/center scavenger partitions allow for more jobs to be run by members of that lab/center on that lab's/center's nodes only, but are preemptable by that lab's/center's non-scavenger partition jobs.

In decreasing order of priority (highest first), our job priorities for partitions are:

  1. Lab/center non-scavenger partitions
  2. Lab/center scavenger partitions
  3. Cluster-wide scavenger partitions

Fair-share

The more resources your jobs have already consumed within an account, the lower the priority factor of your future jobs compared to jobs from other users in the same account who have used fewer resources (so as to "fair-share" with other users). Additionally, if multiple accounts can submit to a partition, and the sum of resources consumed by all users' jobs within account A is greater than the sum of resources consumed by all users' jobs within account B, all future jobs from users in account A will have a lower priority factor than all future jobs from users in account B. (In other words, fair-share is hierarchical.)

You can view the various fair-share statistics with the command sshare -l. It will show your specific FairShare values (always between 0.0 and 1.0) within accounts that you have access to. You can also view other accounts' Level Fairshare (LevelFS).

Account                    User  RawShares  NormShares    RawUsage   NormUsage  EffectvUsage  FairShare    LevelFS                    GrpTRESMins                    TRESRunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
root                                          0.000000 69067378677                  1.000000                                                      cpu=362165,mem=2472431257,ene+
 cbcb                                    1    0.111111  2540508354    0.036784      0.036784              3.020599                                cpu=2972,mem=11216145,energy=+
 class                                   1    0.111111   111442160    0.001614      0.001614             68.857058                                cpu=0,mem=0,energy=0,node=0,b+
 clip                                    1    0.111111  5438953594    0.078753      0.078753              1.410887                                cpu=18883,mem=383518720,energ+
 gamma                                   1    0.111111 11308782204    0.163748      0.163748              0.678550                                cpu=0,mem=0,energy=0,node=0,b+
 mbrc                                    1    0.111111     3589694    0.000043      0.000043            2.6119e+03                                cpu=0,mem=0,energy=0,node=0,b+
 mc2                                     1    0.111111       33620    0.000000      0.000000            2.2824e+05                                cpu=0,mem=0,energy=0,node=0,b+
 nexus                                   1    0.111111  6449248844    0.093371      0.093371              1.189993                                cpu=4655,mem=43385582,energy=+
  nexus                  username        1    0.000781      166762    0.000002      0.000026   0.486680  30.186042                                cpu=0,mem=0,energy=0,node=0,b+
 scavenger                               1    0.111111 43214818947    0.625687      0.625687              0.177583                                cpu=335653,mem=2034310809,ene+
  scavenger              username        1    0.000781       54928    0.000001      0.000001   0.036062 614.120765                                cpu=0,mem=0,energy=0,node=0,b+
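The LevelFS column in the output above can be approximately reproduced from two other columns: it is the association's NormShares divided by its EffectvUsage. The sketch below checks this against two account rows from the sample output. The function name is hypothetical, and because sshare prints rounded values, recomputed numbers only match to a few decimal places (rows with very small usage, such as the user rows, match less closely).

```python
def level_fs(norm_shares, effective_usage):
    """LevelFS as shares divided by usage; unused associations rank highest."""
    if effective_usage == 0:
        return float("inf")
    return norm_shares / effective_usage


# Values copied from the cbcb and gamma rows of the sample output.
print(round(level_fs(0.111111, 0.036784), 3))  # sshare shows 3.020599
print(round(level_fs(0.111111, 0.163748), 3))  # sshare shows 0.678550
```

A LevelFS above 1.0 means the association has used less than its share relative to its siblings, and a value below 1.0 means it has used more.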

The actual resource weightings for the three main resources (memory per GB, CPU cores, and GPUs if applicable) are set per partition and can be viewed in the TRESBillingWeights line in the output of scontrol show partition. The billing value for a job is the sum, over every resource the job has requested, of that resource's weighting multiplied by the amount requested. This value is then multiplied by the time the job has run, in seconds, to get the amount it contributes to the RawUsage for the association within the account it is running under.
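The billing arithmetic can be sketched as follows. The weights in the example are hypothetical placeholders (check TRESBillingWeights for the real ones), the resource names and function names are made up for this sketch, and any decay SLURM applies to historical usage is ignored.

```python
def job_billing(requested, weights):
    """Sum of (weight x amount) over every resource the job requested."""
    return sum(weights.get(name, 0.0) * amount
               for name, amount in requested.items())


def raw_usage_contribution(billing, runtime_seconds):
    """Billing value multiplied by runtime in seconds."""
    return billing * runtime_seconds


# Hypothetical partition weights, for illustration only.
weights = {"mem_gb": 1.0, "cpu": 4.0, "gpu": 64.0}
request = {"mem_gb": 32, "cpu": 4, "gpu": 1}

billing = job_billing(request, weights)
print(billing)                                # billing value of the job
print(raw_usage_contribution(billing, 3600))  # after one hour of runtime
```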

There are two main algorithms we use for resource weightings, per cluster:

Modern

This weighting algorithm is in use on the following clusters:

Resources have algorithmically computed floating point billing values, adjusted once at the beginning of each academic semester based on additions to or removals from cluster partitions since the last adjustment.

GPU-capable partitions

Each resource (memory/CPU/GPU) is given a weighting value such that their relative billings to each other are equal (33.33% each). The values are then rounded to whole numbers. Memory is typically the most abundant resource by unit (weighting value of 1.0 per GB), and the CPU/GPU values are adjusted accordingly.
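One way to read the equal-thirds rule is that the total billable capacity of each resource class in a partition should come out the same. Under that assumption, with memory fixed at 1.0 per GB, the CPU and GPU weights are the memory total divided by the core and GPU counts. The sketch below uses hypothetical partition totals and a hypothetical function name; it is an interpretation of the description, not the actual script used to compute the weights.

```python
def balanced_weights(total_mem_gb, total_cores, total_gpus):
    """Weights that make each resource class bill the same partition-wide."""
    mem_weight = 1.0
    mem_total = total_mem_gb * mem_weight
    # Python's round() sends exact halves to the nearest even number; the
    # real procedure may round differently.
    return {
        "mem_gb": mem_weight,
        "cpu": round(mem_total / total_cores),
        "gpu": round(mem_total / total_gpus),
    }


# Hypothetical partition: 8192 GB of memory, 1024 cores, 64 GPUs.
print(balanced_weights(8192, 1024, 64))
```

In this example, 1024 cores at a weight of 8 and 64 GPUs at a weight of 128 each total 8192, matching the memory total.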

Different GPU types may also be weighted differently within the GPU relative billing. A baseline GPU type is first chosen for each cluster. All GPUs of that type, and of other types that have lower FP32 performance (in TFLOPS), are given a weighting factor of 1.0. GPU types with higher FP32 performance than the baseline GPU are given a weighting factor calculated by dividing their FP32 performance by the baseline GPU's FP32 performance (both FP32 values rounded to one decimal place), with the result rounded to two decimal places (so as to represent a percentage of performance relative to the baseline). The weighting values for each GPU type are then determined by normalizing the sum of all GPU cards of different types, multiplied by their weighting factors, against the relative billing percentage for GPUs (33.33%). The values are then rounded to whole numbers.
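The per-type calculation can be sketched as below. The GPU type names, card counts, and TFLOPS figures are invented for illustration, and the normalization step reflects one reading of the description: the GPU share of billing is spread across all cards in proportion to their weighting factors.

```python
def gpu_factor(fp32_tflops, baseline_tflops):
    """1.0 at or below the baseline, otherwise the performance ratio."""
    fp32 = round(fp32_tflops, 1)
    base = round(baseline_tflops, 1)
    if fp32 <= base:
        return 1.0
    return round(fp32 / base, 2)


def gpu_type_weights(card_counts, tflops, baseline_type, gpu_billing_total):
    """Split the total GPU billing across types in proportion to factors."""
    factors = {t: gpu_factor(tflops[t], tflops[baseline_type])
               for t in card_counts}
    weighted_cards = sum(card_counts[t] * factors[t] for t in card_counts)
    per_baseline_card = gpu_billing_total / weighted_cards
    return {t: round(per_baseline_card * factors[t]) for t in card_counts}


# Hypothetical cluster: gpu_a is the baseline, gpu_b is twice as fast,
# gpu_c is slower than the baseline and so is billed like the baseline.
counts = {"gpu_a": 40, "gpu_b": 20, "gpu_c": 20}
tflops = {"gpu_a": 10.0, "gpu_b": 20.0, "gpu_c": 5.0}
print(gpu_type_weights(counts, tflops, "gpu_a", gpu_billing_total=8000))
```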

The current baseline GPUs per cluster are:

CPU-only partitions

Each resource (memory/CPU) is first given a weighting value such that their relative billings to each other are equal (50% each). The values are then rounded to whole numbers. Memory is typically the most abundant resource by unit (weighting value of 1.0 per GB), and the CPU value is adjusted accordingly. The final CPU weight value is then divided by 10, which translates to roughly 90.9% of the billing weight being for memory and 9.1% being for CPU. This is done so that CPU-only jobs do not affect accounts' fair-share priority factors as much, given the popularity of GPU computing.
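The 90.9% / 9.1% figures follow from the arithmetic: once the CPU weight is divided by 10, memory and CPU bill in a ratio of 1 to 0.1, and 1 / 1.1 is about 0.909. The sketch below checks this for a hypothetical partition, assuming the 50/50 balance is taken over the partition's total memory and total cores; the function names are made up for this example.

```python
def cpu_only_weights(total_mem_gb, total_cores):
    """Balance memory and CPU 50/50, then divide the CPU weight by 10."""
    mem_weight = 1.0
    balanced_cpu = round(total_mem_gb * mem_weight / total_cores)
    return mem_weight, balanced_cpu / 10


def billing_split(total_mem_gb, total_cores):
    """Fraction of partition-wide billing from memory and from CPU."""
    mem_weight, cpu_weight = cpu_only_weights(total_mem_gb, total_cores)
    mem_total = total_mem_gb * mem_weight
    cpu_total = total_cores * cpu_weight
    grand_total = mem_total + cpu_total
    return mem_total / grand_total, cpu_total / grand_total


# Hypothetical partition: 4096 GB of memory and 512 cores.
mem_share, cpu_share = billing_split(4096, 512)
print(f"memory {mem_share:.1%}, CPU {cpu_share:.1%}")
```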

Legacy

This weighting algorithm is currently in use on all clusters not mentioned in the previous section. These clusters will eventually either fold into Nexus or have the modern algorithm introduced in the future.

Resources have fixed floating point billing values.

GPU-capable partitions

Memory is billed at 0.125 per GB, CPU is billed at 1.0 per core, and GPU is billed at 4.0 per card.

CPU-only partitions

Memory is billed at 0.125 per GB and CPU is billed at 0.1 per core. The lower CPU weighting is done so as to not affect accounts' fair-share priority factors as much when running CPU-only jobs given the popularity of GPU computing.
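To compare the two legacy rate tables, the sketch below computes the billing value for the same memory and CPU request on each kind of partition. The rates are the ones stated in this section; the dictionary keys and function name are hypothetical.

```python
# Legacy rates as stated above, keyed by a made-up partition kind.
LEGACY_WEIGHTS = {
    "gpu_capable": {"mem_gb": 0.125, "cpu": 1.0, "gpu": 4.0},
    "cpu_only": {"mem_gb": 0.125, "cpu": 0.1},
}


def legacy_billing(partition_kind, mem_gb=0, cpus=0, gpus=0):
    """Billing value of a request under the fixed legacy weights."""
    weights = LEGACY_WEIGHTS[partition_kind]
    if gpus and "gpu" not in weights:
        raise ValueError("this partition kind has no GPUs to bill")
    return (weights["mem_gb"] * mem_gb
            + weights["cpu"] * cpus
            + weights.get("gpu", 0.0) * gpus)


# 32 GB and 4 cores, with one GPU on the GPU-capable partition.
print(legacy_billing("gpu_capable", mem_gb=32, cpus=4, gpus=1))
print(legacy_billing("cpu_only", mem_gb=32, cpus=4))
```

Four cores contribute 4.0 to the billing value on a GPU-capable partition but only 0.4 on a CPU-only one, which is the intended effect of the lower CPU weighting.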

Nice value

This is a submission argument that you, as the user, can include when submitting your jobs to deprioritize them. Larger values deprioritize jobs more. For example,

srun --pty --qos=default --mem 1gb --time=01:00:00 --nice=2 bash

will have lower priority than

srun --pty --qos=default --mem 1gb --time=01:00:00 --nice=1 bash

which will have lower priority than

srun --pty --qos=default --mem 1gb --time=01:00:00 bash

assuming all three jobs were submitted at the same time. You cannot use negative values for this argument.