SLURM/Preemption

If you are submitting to a partition that is eligible for preemption (e.g., scavenger), you are responsible for making sure that your job can be interrupted and restarted gracefully. Below is some information that you can use to determine whether your job has been cancelled or preempted. You should be able to take different code paths during startup/shutdown of your jobs based on this information.

[Figure: Flowchart for job cancellation/preemption]

Cancellation

  1. Slurm controller sends an internal cancel signal to Slurm node(s) where the job is currently assigned.
  2. Slurm node(s) send SIGCONT and SIGTERM around the same time, and the following fields of note are set in the output of scontrol --json show job $SLURM_JOBID:
    • job_state = ['CANCELLED', 'COMPLETING']
  3. If the processes do not stop within a grace period, SIGKILL is sent. On the Nexus cluster, this grace period is 5 minutes.
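
As a quick check, the job's state list can be read back from the scontrol JSON output. This is a minimal sketch, assuming jq is available and that the state sits at .jobs[0].job_state in the v0.0.40 layout; the exact path may differ on other Slurm versions.

  # Read the job's state list; the .jobs[0].job_state path is assumed from
  # the v0.0.40 JSON layout shown by 'scontrol --json'.
  scontrol --json show job "$SLURM_JOBID" | jq -r '.jobs[0].job_state | join(" ")'
  # For a job that is being cancelled, this prints: CANCELLED COMPLETING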

Preemption/Requeue

  1. Slurm controller sends an internal preemption/requeue signal to Slurm node(s) where the job is currently assigned.
  2. Slurm node(s) send SIGCONT and SIGTERM around the same time, and the following fields of note are set in the output of scontrol --json show job $SLURM_JOBID:
    • job_state = ['PENDING', 'COMPLETING']
    • restart_cnt += 1
  3. If the processes do not stop within a grace period, SIGKILL is sent. On the Nexus cluster, this grace period is 5 minutes.
  4. Once the job is stopped, it enters the PENDING state for 2 minutes, and then is eligible to be run again.
  5. When the job runs again, an additional environment variable SLURM_RESTART_COUNT is defined, which reports the number of times the job has been preempted/requeued.
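
To observe steps 3-5 from outside the job, you can watch the job's state and restart counter. A rough sketch, with <job_id> as a placeholder and the .jobs[0].restart_cnt path assumed from the same JSON layout as above:

  # Watch the job's state and pending reason while it is requeued
  # (STATE goes from RUNNING to COMPLETING to PENDING and back to RUNNING).
  squeue -j <job_id> -o "%.10i %.12T %.20r"

  # Read the restart counter from the scontrol JSON output.
  scontrol --json show job <job_id> | jq '.jobs[0].restart_cnt'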

Key takeaways

On job cancellation/preemption

You can handle the SIGTERM signal and run scontrol --json show job $SLURM_JOBID within your script. Based on the value of job_state, your script can take a different code path depending on whether it was cancelled or preempted.

The output of scontrol --json show job $SLURM_JOBID is equivalent to the Slurm REST API endpoint "GET /slurm/v0.0.40/job/{job_id}". For more information on how to parse the output of this scontrol command, please refer to Slurm's REST API documentation: https://slurm.schedmd.com/rest_api.html#slurmV0040GetJob
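
For example, a batch script can trap SIGTERM, query its own state, and branch on the result. This is a minimal sketch rather than a complete job script: the time limit, the ./my_workload command, and the .jobs[0].job_state jq path are placeholders/assumptions.

  #!/bin/bash
  #SBATCH --partition=scavenger
  #SBATCH --time=1:00:00

  # Minimal sketch of a SIGTERM handler that distinguishes cancellation
  # from preemption/requeue by inspecting job_state.
  on_sigterm() {
      state=$(scontrol --json show job "$SLURM_JOBID" \
              | jq -r '.jobs[0].job_state | join(" ")')
      if [[ "$state" == *CANCELLED* ]]; then
          echo "Cancelled: cleaning up before exit"
          # e.g. remove scratch files (placeholder)
      else
          echo "Preempted/requeued: saving state before exit"
          # e.g. write a checkpoint the requeued run can find (placeholder)
      fi
      exit 0
  }
  trap on_sigterm SIGTERM

  # Run the real work in the background so the trap can fire while it runs
  # (bash does not deliver traps until a foreground command finishes).
  ./my_workload &     # placeholder for the actual workload
  wait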

On resuming

If your job needs to behave differently based on whether or not it was previously preempted, you can check if the environment variable SLURM_RESTART_COUNT is defined.
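 
For example, a rough sketch of a startup check (the workload and checkpoint handling are placeholders):

  # Decide at startup whether this is a fresh run or a requeued run.
  if [[ -n "${SLURM_RESTART_COUNT:-}" ]]; then
      echo "Job has been requeued ${SLURM_RESTART_COUNT} time(s): resuming"
      # e.g. ./my_workload --resume checkpoint_dir/   (placeholder)
  else
      echo "First run of this job: starting from scratch"
      # e.g. ./my_workload                            (placeholder)
  fi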

Testing

You can use the following commands to manually cancel and requeue your jobs (requeueing and preemption are handled similarly). With these tools, you can test your scripts to ensure that they gracefully handle these scenarios. Replace <job_id> with the actual job ID of the job you are testing.

Cancellation: scancel <job_id>

Preemption: scontrol requeue <job_id>
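
For example, a test sequence might look like the following (the script name and job ID 12345 are placeholders):

  sbatch my_test_job.sh          # submit the script under test
  scontrol requeue 12345         # simulate preemption of job 12345
  squeue -j 12345 -o "%T %r"     # watch it go PENDING, then start again
  scancel 12345                  # then exercise the cancellation path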