SLURM/Priority
Revision as of 19:05, 24 October 2024
SLURM at UMIACS is configured to prioritize jobs based on a number of factors, which SLURM terms multifactor priority (https://slurm.schedmd.com/priority_multifactor.html). Each job submitted to the scheduler is assigned a priority value, which can be viewed in the output of scontrol show job <jobid>.
Example:
$ scontrol show job 1
JobId=1 JobName=bash
   UserId=username(13337) GroupId=username(13337) MCS_label=N/A
   Priority=10841 Nice=0 Account=nexus QOS=default
   ...
Pending Jobs
If the partition that you submit your job to cannot start your job immediately because no compute node(s) have the resources free to run it, your job will remain in the Pending state with the listed reason (Resources). If another job is already pending with this reason, and you submit a job to the same partition that is assigned a lower priority value than that pending job, your job will instead remain in the Pending state with reason (Priority). If there are multiple jobs pending and your job is not the highest priority job pending, the scheduler will only start your job if doing so would not delay the start times of any higher priority jobs in the same partition.
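The rule in the last sentence can be illustrated with a deliberately simplified sketch. It assumes a single resource type (CPUs) and a single higher priority pending job; the function name and all numbers are hypothetical, and the real scheduler considers every resource type and every pending job.

```python
def can_start_now(job_cpus, job_limit_min, free_cpus_now,
                  blocker_start_min, blocker_cpus, free_cpus_at_blocker_start):
    """Return True if starting the job now would not delay the higher priority job.

    blocker_start_min: minutes until the higher priority job is expected to start.
    free_cpus_at_blocker_start: CPUs expected to be free at that time if this
    job is NOT started.
    """
    if job_cpus > free_cpus_now:
        return False  # not enough idle resources right now
    if job_limit_min <= blocker_start_min:
        return True  # the job must end before the higher priority job starts
    # The job could still be running when the higher priority job is due to
    # start, so enough resources must remain for that job anyway.
    return free_cpus_at_blocker_start - job_cpus >= blocker_cpus


# 8 CPUs are idle now. A higher priority job needing 32 CPUs is expected to
# start in 120 minutes, when exactly 32 CPUs will be free.
print(can_start_now(8, 240, 8, 120, 32, 32))  # False: a 4 hour limit overlaps
print(can_start_now(8, 90, 8, 120, 32, 32))   # True: a 90 minute limit fits
```

In this sketch, shortening the time limit is what lets the same job start right away.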
Lowering the resources you are requesting, the time limit, or both may allow submitted jobs to start sooner, or even immediately, when a partition is under resource pressure. The command squeue -j <jobid> --start can be used to provide a time estimate for when your job will start, where <jobid> is the job ID you receive from either srun or sbatch.
You can use the command alias show_available_nodes with a variety of different submission arguments to get a better idea of what jobs may be able to begin sooner.
Priority Factors
The priority factors in use at UMIACS include:
- Age of job, i.e., time spent waiting to run in the queue
- Association (SLURM account) being used
- Partition job was submitted to
- Fair-share of resources
- "Nice" value that job was submitted with
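As a rough illustration of how these factors might combine into one priority value, the sketch below treats age, fair-share, and partition as factors between 0.0 and 1.0 that are each scaled by a weight, adds association points directly, and subtracts the nice value. The weights shown are made up for the example and are not the values configured at UMIACS.

```python
# Hypothetical weights, chosen only for illustration.
EXAMPLE_WEIGHTS = {"age": 1000, "fairshare": 10000, "partition": 100000}


def job_priority(factors, weights=EXAMPLE_WEIGHTS, association_points=0, nice=0):
    """Combine per-factor values (each 0.0 to 1.0) into one priority value."""
    weighted = sum(weights[name] * factors[name] for name in weights)
    return int(weighted) + association_points - nice


factors = {"age": 0.5, "fairshare": 0.25, "partition": 1.0}
print(job_priority(factors))            # 103000
print(job_priority(factors, nice=100))  # 102900
```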
Age
The longer a job is eligible to run but cannot because resources are unavailable, the higher the job's priority becomes as it continues to wait in the queue. The priority modifier for this factor reaches its limit after 7 days.
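A minimal sketch of this behavior, assuming the modifier grows linearly with waiting time until the 7 day limit (the function name is illustrative):

```python
MAX_AGE_DAYS = 7  # the limit stated above


def age_factor(days_eligible, max_age_days=MAX_AGE_DAYS):
    """Return a value from 0.0 to 1.0 that stops growing at max_age_days."""
    return min(max(days_eligible, 0) / max_age_days, 1.0)


print(age_factor(3.5))  # 0.5
print(age_factor(10))   # 1.0, capped at the 7 day limit
```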
Association
Some lab/center-specific SLURM accounts may have priority values directly attached to them. Jobs run under these accounts gain this many extra points of priority.
Partition
The partitions whose names are or are prefixed with scavenger on each of our clusters always have lower priority factors for their jobs than all other partitions on that cluster. As mentioned in other UMIACS cluster-specific documentation, jobs submitted to these partitions are also preemptable (https://slurm.schedmd.com/preempt.html). These two design choices give the partitions their names; jobs submitted to scavenger named or prefixed partitions "scavenge" for available resources on the cluster rather than consume dedicated resources, and are interrupted by jobs asking to consume dedicated resources.
On Nexus, labs/centers may also have their own scavenger partitions (<labname>-scavenger) if the faculty for the lab/center have decided upon some sort of limit on jobs (number of simultaneous jobs, number of actively consumed billing resources, etc.) in their non-scavenger partitions. These lab/center scavenger partitions allow for more jobs to be run by members of that lab/center on that lab's/center's nodes only, but are preemptable by that lab's/center's non-scavenger partition jobs.
In decreasing order of priority (highest first), our priority tiers for partitions are:
- Account-specific non-preemptable partitions
- Lab/center-specific non-"scavenger" named partitions
- Lab/center-specific "scavenger" named partitions
- Institute-wide scavenger partitions
A job in a lower priority tier will never have a higher priority value than any job in any of the higher priority tiers.
Fair-share
The more resources your jobs have already consumed within an account, the lower the priority factor your future jobs will have compared to the jobs of other users in the same account who have used fewer resources (so as to "fair-share" with other users). Additionally, if multiple accounts can submit to a partition, and the sum of resources consumed by all users' jobs within account A is greater than the sum consumed by all users' jobs within account B, all future jobs from users in account A will have a lower priority factor than all future jobs from users in account B. (In other words, fair-share is hierarchical.)
You can view the various fair-share statistics with the command sshare -l. It will show your specific FairShare values (always between 0.0 and 1.0) within accounts that you have access to. You can also view other accounts' Level Fairshare (LevelFS).
Account              User        RawShares  NormShares    RawUsage   NormUsage  EffectvUsage  FairShare    LevelFS GrpTRESMins                    TRESRunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
root                                          0.000000 66034847484                  1.000000                                                      cpu=7746109,mem=69754856514,e+
cbcb                                     1    0.032258 14115111102    0.213757      0.213757              0.150910                                cpu=4969,mem=20355003,energy=+
class                                    1    0.032258           0    0.000000      0.000000                   inf                                cpu=0,mem=0,energy=0,node=0,b+
clip                                     1    0.032258  1568122041    0.023733      0.023733              1.359207                                cpu=70083,mem=1464478788,ener+
cml                                      1    0.032258    17338485    0.000263      0.000263            122.854754                                cpu=29958,mem=245415936,energ+
cml-abhinav                              1    0.032258      784250    0.000012      0.000012            2.7161e+03                                cpu=0,mem=0,energy=0,node=0,b+
cml-cameron                              1    0.032258           0    0.000000      0.000000                   inf                                cpu=0,mem=0,energy=0,node=0,b+
cml-furongh                              1    0.032258  2098793815    0.031784      0.031784              1.014924                                cpu=940758,mem=8995575569,ene+
cml-hajiagha                             1    0.032258           0    0.000000      0.000000                   inf                                cpu=0,mem=0,energy=0,node=0,b+
cml-john                                 1    0.032258   258872094    0.003920      0.003920              8.228447                                cpu=476993,mem=5494963200,ene+
cml-ramani                               1    0.032258           0    0.000000      0.000000                   inf                                cpu=0,mem=0,energy=0,node=0,b+
cml-scavenger                            1    0.032258  6734023027    0.101979      0.101979              0.316321                                cpu=1496736,mem=13036434773,e+
cml-sfeizi                               1    0.032258   185510632    0.002809      0.002809             11.482444                                cpu=70732,mem=579442005,energ+
cml-tokekar                              1    0.032258           0    0.000000      0.000000                   inf                                cpu=0,mem=0,energy=0,node=0,b+
cml-tomg                                 1    0.032258    99040108    0.001500      0.001500             21.507603                                cpu=0,mem=0,energy=0,node=0,b+
cml-zhou                                 1    0.031250           0    0.000000      0.000000                   inf                                cpu=0,mem=0,energy=0,node=0,b+
gamma                                    1    0.032258  8880343229    0.134482      0.134482              0.239869                                cpu=2532358,mem=23460226867,e+
mbrc                                     1    0.032258    27060567    0.000410      0.000410             78.716582                                cpu=0,mem=0,energy=0,node=0,b+
mc2                                      1    0.032258        9175    0.000000      0.000000            2.3215e+05                                cpu=0,mem=0,energy=0,node=0,b+
nexus                                    1    0.032258  3346084300    0.050672      0.050672              0.636599                                cpu=121941,mem=1468973003,ene+
nexus                username            1    0.000779       69666    0.000001      0.000021   0.457407  37.435501                                cpu=0,mem=0,energy=0,node=0,b+
scavenger                                1    0.032258 21762190063    0.329562      0.329562              0.097882                                cpu=1085904,mem=4775150199,en+
scavenger            username            1    0.000779         171    0.000000      0.000000   0.033975 9.8885e+04                                cpu=0,mem=0,energy=0,node=0,b+
vulcan                                   1    0.032258  1458631376    0.022089      0.022089              1.460352                                cpu=25968,mem=106368204,energ+
vulcan-abhinav                           1    0.032258  4441051354    0.067254      0.067254              0.479648                                cpu=850445,mem=9471827285,ene+
vulcan-djacobs                           1    0.032258   381503730    0.005777      0.005777              5.583472                                cpu=7656,mem=250882730,energy+
vulcan-janus                             1    0.032258           0    0.000000      0.000000                   inf                                cpu=0,mem=0,energy=0,node=0,b+
vulcan-jbhuang                           1    0.032258    15619477    0.000237      0.000237            136.375587                                cpu=0,mem=0,energy=0,node=0,b+
vulcan-lsd                               1    0.032258           0    0.000000      0.000000                   inf                                cpu=0,mem=0,energy=0,node=0,b+
vulcan-metzler                           1    0.032258   435471075    0.006595      0.006595              4.891520                                cpu=16235,mem=133000942,energ+
vulcan-rama                              1    0.032258           0    0.000000      0.000000                   inf                                cpu=0,mem=0,energy=0,node=0,b+
vulcan-ramani                            1    0.032258           0    0.000000      0.000000                   inf                                cpu=0,mem=0,energy=0,node=0,b+
vulcan-yaser                             1    0.032258   209285667    0.003166      0.003166             10.189036                                cpu=15366,mem=251762005,energ+
vulcan-zwicker                           1    0.032258           0    0.000000      0.000000                   inf                                cpu=0,mem=0,energy=0,node=0,b+
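The LevelFS values for the account rows in the sshare -l output above are consistent with dividing an account's NormShares by its EffectvUsage, so an account that has used less than its share has a LevelFS above 1.0. The sketch below reproduces two of those rows under that assumption; the function name is illustrative, and small differences come from the rounded inputs.

```python
import math


def level_fs(norm_shares, effective_usage):
    """Shares divided by usage; an account with no usage gets infinity."""
    if effective_usage == 0:
        return math.inf
    return norm_shares / effective_usage


# Values copied from the cbcb, clip, and class rows of the output above.
print(round(level_fs(0.032258, 0.213757), 4))  # 0.1509 (over-served account)
print(round(level_fs(0.032258, 0.023733), 3))  # 1.359 (under-served account)
print(level_fs(0.032258, 0.0))                 # inf (no recorded usage)
```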
The actual resource billing weights for the three main resources (memory in GB, CPU cores, and number of GPUs if applicable) are set per partition and can be viewed in the TRESBillingWeights line in the output of scontrol show partition. The billing value for a job is the sum, over all resources the job has requested, of each resource's weight multiplied by the amount requested. This value is then multiplied by the amount of time the job has run, in seconds, to get the amount it contributes to the RawUsage for the association within the account it is running under.
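As a worked example of this arithmetic, the sketch below uses made-up weights (not the values of any actual UMIACS partition) and illustrative resource names:

```python
# Hypothetical per-unit billing weights for one partition.
EXAMPLE_WEIGHTS = {"cpu": 2.0, "mem_gb": 1.0, "gpu": 16.0}


def billing(requested, weights=EXAMPLE_WEIGHTS):
    """Sum of weight * requested amount over every requested resource."""
    return sum(weights[res] * amount for res, amount in requested.items())


def raw_usage_contribution(billing_value, seconds_run):
    """Amount a job adds to RawUsage after running for seconds_run seconds."""
    return billing_value * seconds_run


job = {"cpu": 4, "mem_gb": 32, "gpu": 1}
b = billing(job)
print(b)                                # 56.0
print(raw_usage_contribution(b, 3600))  # 201600.0 after one hour
```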
The algorithm we use for resource weightings differs depending on whether a partition contains any GPUs, and is as follows:
GPU partitions
Each resource (memory/CPU/GPU) is given a weighting value such that their relative billings to each other within the partition are equal (33.33% each). Memory is typically the most abundant resource by unit (weighting value of 1.0 per GB) and the CPU/GPU values are adjusted accordingly.
Different GPU types may also be weighted differently within the GPU relative billing. A baseline GPU type is first chosen. All GPUs of that type and other types that have lower FP32 performance (in TFLOPS, https://en.wikipedia.org/wiki/FLOPS) are given a weighting factor of 1.0. GPU types with higher FP32 performance than the baseline GPU are given a weighting factor calculated by dividing their FP32 performance by the baseline GPU's FP32 performance. The weighting values for each GPU type are then determined by normalizing the sum of all GPU cards' billing values multiplied by their weighting factors against the relative billing percentage for GPUs (33.33%).
The current baseline GPU is the NVIDIA RTX A4000 (https://www.nvidia.com/en-us/design-visualization/rtx-a4000/).
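One way to read the GPU weighting procedure is sketched below. The GPU type names, TFLOPS figures, and card counts are invented for illustration, and the normalization target assumes that the GPU share of billing should equal the partition's total memory billing (memory weighted at 1.0 per GB).

```python
def gpu_weighting_factor(fp32_tflops, baseline_fp32_tflops):
    """1.0 for the baseline and slower GPUs, FP32 ratio for faster ones."""
    return max(1.0, fp32_tflops / baseline_fp32_tflops)


def gpu_weights(gpu_counts, factors, target_total):
    """Scale the factors so that all cards together bill target_total."""
    unit = target_total / sum(gpu_counts[t] * factors[t] for t in gpu_counts)
    return {t: factors[t] * unit for t in gpu_counts}


# Hypothetical partition: 2000 GB of memory and three GPU types.
baseline_tflops = 20.0
tflops = {"baseline": 20.0, "slower": 10.0, "faster": 40.0}
counts = {"baseline": 8, "slower": 4, "faster": 4}

factors = {t: gpu_weighting_factor(v, baseline_tflops) for t, v in tflops.items()}
weights = gpu_weights(counts, factors, target_total=2000.0)
print(factors)  # {'baseline': 1.0, 'slower': 1.0, 'faster': 2.0}
print(weights)  # {'baseline': 100.0, 'slower': 100.0, 'faster': 200.0}
```

With these numbers, all 16 cards together bill 2000, matching the memory total, and each faster card bills twice as much as a baseline card.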
CPU-only partitions
Each resource (memory/CPU) is first given a weighting value such that their relative billings to each other within the partition are equal (50% each). Memory is typically the most abundant resource by unit (weighting value of 1.0 per GB) and the CPU value is adjusted accordingly. The final CPU weight value is then divided by 10, which translates to roughly 90.9% of the billing weight being for memory and 9.1% being for CPU. The CPU value is reduced in this way so that running CPU-only jobs affects accounts' fair-share priority factors less, given the popularity of GPU computing.
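The 90.9% and 9.1% figures follow directly from the division by 10, as the sketch below shows for a hypothetical partition with 1024 GB of memory and 128 CPU cores (the split is the same for any totals):

```python
def cpu_only_weights(total_mem_gb, total_cpus, cpu_divisor=10):
    """Return (memory weight per GB, CPU weight per core)."""
    mem_weight = 1.0
    # Weight at which all CPUs together would bill the same as all memory.
    equal_cpu_weight = mem_weight * total_mem_gb / total_cpus
    return mem_weight, equal_cpu_weight / cpu_divisor


def billing_shares(total_mem_gb, total_cpus, mem_weight, cpu_weight):
    """Return the (memory, CPU) fractions of the partition's total billing."""
    mem_total = mem_weight * total_mem_gb
    cpu_total = cpu_weight * total_cpus
    total = mem_total + cpu_total
    return mem_total / total, cpu_total / total


mem_w, cpu_w = cpu_only_weights(1024, 128)
mem_share, cpu_share = billing_shares(1024, 128, mem_w, cpu_w)
print(f"{mem_share:.1%} memory, {cpu_share:.1%} CPU")  # 90.9% memory, 9.1% CPU
```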
Nice value
This is a submission argument that you as the user can include when submitting your jobs to deprioritize them. Larger values deprioritize jobs more. For example,
srun --pty --nice=2 bash
will have lower priority than
srun --pty --nice=1 bash
which will have lower priority than
srun --pty bash
assuming all three jobs were submitted at the same time. You cannot use negative values for this argument.