SLURM/Preemption

From UMIACS

If you submit jobs to a partition that is eligible for preemption (e.g. scavenger), you are responsible for making sure that your job can be interrupted and restarted gracefully. This page documents Slurm behaviors that you can use to determine whether your job has been cancelled or preempted, so that your job can take different code paths during startup and shutdown.

Flowchart for job cancellation/preemption

Cancellation

  1. The Slurm controller sends an internal cancel signal to the Slurm node(s) where the job is currently assigned.
  2. The Slurm node(s) send SIGCONT and SIGTERM to the job's processes at around the same time, and the following field is set in the output of scontrol --json show job $SLURM_JOBID:
    • job_state = ['CANCELLED', 'COMPLETING']
  3. If the processes do not stop within a site-configured amount of time, SIGKILL is sent.

Preemption/Requeue

  1. The Slurm controller sends an internal preemption/requeue signal to the Slurm node(s) where the job is currently assigned.
  2. The Slurm node(s) send SIGCONT and SIGTERM to the job's processes at around the same time, and the following fields are set in the output of scontrol --json show job $SLURM_JOBID:
    • job_state = ['PENDING', 'COMPLETING']
    • restart_cnt += 1
  3. If the processes do not stop within a site-configured amount of time, SIGKILL is sent.
  4. Once the job has stopped, it remains in the PENDING state for two minutes, after which it is eligible to run again.
  5. When the job runs again, an additional environment variable, SLURM_RESTART_COUNT, is defined. It reports the number of times the job has been preempted/requeued.

Key takeaways

On job cancellation/preemption

Your script can handle the SIGTERM signal and run scontrol --json show job $SLURM_JOBID from the handler. Based on the value of job_state, it can then take a different code path depending on whether the job was cancelled (job_state contains CANCELLED) or preempted (job_state contains PENDING).
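As a minimal sketch of this approach in Python: the function names (classify_job_state, query_job_state, make_sigterm_handler) and the two callbacks are hypothetical, and the top-level "jobs" list in the JSON output is an assumption based on the REST API response layout. Treating an undetermined state as a preemption is a design choice of this sketch, not Slurm behavior.

```python
import json
import os
import signal
import subprocess
import sys


def classify_job_state(job_state):
    """Map a job_state value to 'cancelled', 'preempted', or 'unknown'."""
    # Assumption: some Slurm versions may report a single string instead of a list.
    if isinstance(job_state, str):
        job_state = [job_state]
    states = set(job_state)
    if "CANCELLED" in states:
        return "cancelled"
    if "PENDING" in states:
        return "preempted"
    return "unknown"


def query_job_state(job_id):
    """Run scontrol and return the job_state field of the given job."""
    result = subprocess.run(
        ["scontrol", "--json", "show", "job", str(job_id)],
        check=True, capture_output=True, text=True, timeout=30,
    )
    # Assumption: the JSON document holds the job records under "jobs".
    return json.loads(result.stdout)["jobs"][0]["job_state"]


def make_sigterm_handler(on_cancelled, on_preempted):
    """Build a SIGTERM handler that dispatches to one of two hypothetical callbacks."""
    def handler(signum, frame):
        try:
            outcome = classify_job_state(query_job_state(os.environ["SLURM_JOBID"]))
        except (KeyError, IndexError, OSError, ValueError, subprocess.SubprocessError):
            outcome = "unknown"
        if outcome == "cancelled":
            on_cancelled()
        else:
            # Sketch choice: if the state cannot be determined, save state
            # as if the job had been preempted.
            on_preempted()
        sys.exit(0)
    return handler


def install_handler(on_cancelled, on_preempted):
    """Register the handler for SIGTERM in the current process."""
    signal.signal(signal.SIGTERM, make_sigterm_handler(on_cancelled, on_preempted))
```

A job would call install_handler(cleanup, save_checkpoint) early during startup, where cleanup and save_checkpoint are whatever functions the job uses to discard scratch data or write a checkpoint. The handler should finish quickly, since SIGKILL follows if the processes do not stop in time.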

The output of scontrol --json show job $SLURM_JOBID is equivalent to the Slurm REST API's endpoint "GET /slurm/v0.0.40/job/{job_id}". For more information on how to parse the output of this scontrol command, please refer to Slurm's REST API documentation. [1]
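The following sketch shows one way to pull the two fields discussed on this page out of that output. The summarize_job helper is hypothetical, and the SAMPLE string is a hand-written, heavily trimmed illustration rather than real scontrol output; the real document contains many more fields, and its exact layout should be checked against the REST API documentation for your Slurm version.

```python
import json


def summarize_job(scontrol_json):
    """Return job_state and restart_cnt for the first job in the JSON text."""
    payload = json.loads(scontrol_json)
    # Assumption: job records are listed under the top-level "jobs" key.
    jobs = payload.get("jobs", [])
    if not jobs:
        raise LookupError("no job found in scontrol output")
    job = jobs[0]
    state = job.get("job_state", [])
    # Assumption: tolerate a plain string in case a version does not use a list.
    if isinstance(state, str):
        state = [state]
    return {
        "job_state": list(state),
        "restart_cnt": int(job.get("restart_cnt", 0)),
    }


# Illustrative input only; not captured from a real cluster.
SAMPLE = '{"jobs": [{"job_state": ["PENDING", "COMPLETING"], "restart_cnt": 1}]}'

if __name__ == "__main__":
    print(summarize_job(SAMPLE))
```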

On resuming

If your job needs to behave differently based on whether or not it was previously preempted, you can check whether the environment variable SLURM_RESTART_COUNT is defined. It is only defined once the job has been preempted/requeued and is running again.
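A minimal sketch of this check in Python follows. The restart_count and start_mode helpers and the "fresh"/"resume" labels are hypothetical; only the SLURM_RESTART_COUNT variable comes from Slurm. An unset or unparsable value is treated as zero restarts.

```python
import os


def restart_count(environ=None):
    """Return the number of times the job was preempted/requeued (0 if unset)."""
    env = os.environ if environ is None else environ
    raw = env.get("SLURM_RESTART_COUNT")
    if raw is None:
        return 0
    try:
        return int(raw)
    except ValueError:
        # Sketch choice: fall back to a fresh start on an unexpected value.
        return 0


def start_mode(environ=None):
    """Return 'resume' if the job has been restarted before, else 'fresh'."""
    return "resume" if restart_count(environ) > 0 else "fresh"


if __name__ == "__main__":
    print(f"start mode: {start_mode()} (restarts: {restart_count()})")
```

On a "resume" start, a job would typically look for its latest checkpoint before doing any work; on a "fresh" start, it would initialize from scratch.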

Testing

You can use the following commands to manually cancel and requeue your jobs (requeueing and preemption are handled similarly). With these tools, you can test your scripts to ensure that they gracefully handle these scenarios.

Cancellation: scancel <job id>

Preemption (simulated by requeueing): scontrol requeue <job id>