SLURM/Preemption

If you are submitting to a partition which is eligible for preemption (e.g., scavenger), you are responsible for making sure that your job can be interrupted/restarted gracefully. Below is some information that you can use to determine if your job has been cancelled or preempted. You should be able to take different code paths during startup/shutdown of your jobs based on this information.


=Flowchart for job cancellation/preemption=

==Cancellation==
# Slurm controller sends an internal cancel signal to Slurm node(s) where the job is currently assigned.
# Slurm node(s) send SIGCONT and SIGTERM around the same time, and the following fields of note are set in the output of <tt>scontrol --json show job $SLURM_JOBID</tt>:
#* <tt>job_state = ['CANCELLED', 'COMPLETING']</tt>
# If the processes don't stop within a certain amount of time, eventually SIGKILL is sent. On the [[Nexus]] cluster, this time is currently 5 minutes.
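
You can see these <tt>job_state</tt> values for yourself from inside a running job. A minimal sketch, assuming <tt>jq</tt> is available on the node and that the <tt>.jobs[0].job_state</tt> field path matches your Slurm version's JSON layout (both are assumptions to verify):
<pre>
# Print the current state flags for this job.
scontrol --json show job "$SLURM_JOBID" | jq '.jobs[0].job_state'
# During cancellation you would expect something like:
# [ "CANCELLED", "COMPLETING" ]
</pre>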


==Preemption/Requeue==
# Slurm controller sends an internal preemption/requeue signal to Slurm node(s) where the job is currently assigned.
# Slurm node(s) send SIGCONT and SIGTERM around the same time, and the following fields of note are set in the output of <tt>scontrol --json show job $SLURM_JOBID</tt>:
#* <tt>job_state = ['PENDING', 'COMPLETING']</tt>
#* <tt>restart_cnt += 1</tt>
# If the processes don't stop within a certain amount of time, eventually SIGKILL is sent. On the [[Nexus]] cluster, this time is currently 5 minutes.
# Once the job is stopped, it enters the PENDING state for two minutes, and then is eligible to be run again.
# When the job runs again, an additional environment variable <tt>SLURM_RESTART_COUNT</tt> will be defined, which reports the number of times the job has been preempted/requeued.
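
If you want the submission itself to be requeue-friendly, <tt>sbatch</tt> has options that control this behavior. Below is a minimal sketch; whether preempted jobs are requeued or cancelled by default is cluster configuration, so treat these flags as settings to verify rather than a definitive recipe, and note that <tt>./my_workload</tt> is a hypothetical payload:
<pre>
#!/bin/bash
#SBATCH --partition=scavenger
#SBATCH --requeue                 # allow the controller to requeue this job after preemption
#SBATCH --open-mode=append        # append to the output file on restart instead of truncating it

echo "Run started $(date); restart count so far: ${SLURM_RESTART_COUNT:-0}"
./my_workload                     # hypothetical payload
</pre>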


=Key takeaways=


==On job cancellation/preemption==
You can handle the SIGTERM signal and run <tt>scontrol --json show job $SLURM_JOBID</tt> within your script. Based on the value of <tt>job_state</tt>, your script can take a different code path depending on whether it was cancelled or preempted.


The output of <tt>scontrol --json show job $SLURM_JOBID</tt> is equivalent to the Slurm REST API endpoint <tt>GET /slurm/v0.0.40/job/{job_id}</tt>. For more information on how to parse the output of this scontrol command, please refer to Slurm's REST API documentation.
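
Putting this together, a SIGTERM handler in a batch script might look something like the sketch below. This is one possible pattern, not a prescribed one: it assumes <tt>jq</tt> is available, assumes the <tt>.jobs[0].job_state</tt> field path, and uses a hypothetical payload <tt>./my_long_running_program</tt>. The payload runs in the background and is <tt>wait</tt>ed on so that bash can run the trap as soon as the signal arrives.
<pre>
#!/bin/bash
#SBATCH --partition=scavenger
#SBATCH --requeue

# Decide between cleanup (cancelled) and checkpointing (preempted/requeued)
# by asking the controller for the job's current state.
handle_sigterm() {
    state=$(scontrol --json show job "$SLURM_JOBID" | jq -r '.jobs[0].job_state | join(",")')
    if [[ "$state" == *CANCELLED* ]]; then
        echo "Job was cancelled; cleaning up scratch space." >&2
        # rm -rf "$MY_SCRATCH_DIR"       # hypothetical cleanup
    else
        echo "Job was preempted/requeued; writing a checkpoint." >&2
        # touch "$MY_CHECKPOINT_FLAG"    # hypothetical checkpoint trigger
    fi
    exit 0
}
trap handle_sigterm SIGTERM

# Run the real work in the background so the shell stays free to handle signals.
./my_long_running_program &      # hypothetical payload
wait $!
</pre>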


==On resuming==
If your job needs to behave differently based on whether or not it was previously preempted, you can check if the environment variable <tt>SLURM_RESTART_COUNT</tt> is defined.
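
A minimal sketch of that check at the top of a batch script, assuming a hypothetical <tt>./train</tt> program that knows how to resume from its own checkpoints:
<pre>
# Fresh start vs. restart after preemption/requeue.
if [[ -n "${SLURM_RESTART_COUNT:-}" ]]; then
    echo "Restart number ${SLURM_RESTART_COUNT}; resuming from the latest checkpoint."
    ./train --resume                 # hypothetical resume invocation
else
    echo "First run; starting from scratch."
    ./train                          # hypothetical initial invocation
fi
</pre>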


=Testing=
You can use the following commands to manually cancel and requeue your jobs (requeueing and preemption are handled similarly). With these tools, you can test your scripts to ensure that they gracefully handle these scenarios.

* Cancellation: <tt>scancel <job id></tt>
* Preemption: <tt>scontrol requeue <job id></tt>
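
For example, a test session against a short-lived test job might look like the following (the job ID <tt>123456</tt> is illustrative only):
<pre>
sbatch my_job.sh                 # -> Submitted batch job 123456
scontrol requeue 123456          # simulate preemption: the job goes back to PENDING
squeue -j 123456 -o "%T %R"      # watch it become RUNNING again (after the ~2 minute hold)
scancel 123456                   # simulate cancellation once it is running
</pre>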