SLURM

From UMIACS

Revision as of 17:04, 30 July 2015

Simple Linux Utility for Resource Management

UMIACS is transitioning from its Torque/Maui batch resource manager to Slurm. Slurm is now broadly used across the regional and national supercomputing communities.

The biggest differences when moving from Torque/Maui to Slurm are in terminology and command-line usage.

  • Torque queues are now called partitions in Slurm

Commands

sinfo

To view partitions and nodes, use the sinfo command. The following example has two partitions, but sinfo prints one row per partition and node state, so each partition appears on several lines. The * character in the PARTITION column marks the default partition for jobs.

# sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
debug*       up      30:00     2  down* adev[1-2]
debug*       up      30:00     3   idle adev[3-5]
batch        up      30:00     3  down* adev[6,13,15]
batch        up      30:00     3  alloc adev[7-8,14]
batch        up      30:00     4   idle adev[9-12]
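
Because sinfo prints one row per partition and state, totals per state have to be summed across rows. A minimal sketch, assuming the default column layout shown above (NODES in column 4, STATE in column 5); nodes_by_state is a hypothetical helper name, not a Slurm command:

```shell
# Sum the NODES column for each STATE in sinfo-style output read from stdin.
# Assumes the default sinfo layout: NODES is field 4, STATE is field 5.
nodes_by_state() {
  awk 'NR > 1 && NF >= 6 { count[$5] += $4 }
       END { for (s in count) print s, count[s] }' | sort
}

# Sample rows copied from the listing above.
sample_sinfo='PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
debug*       up      30:00     2  down* adev[1-2]
debug*       up      30:00     3   idle adev[3-5]
batch        up      30:00     3  down* adev[6,13,15]
batch        up      30:00     3  alloc adev[7-8,14]
batch        up      30:00     4   idle adev[9-12]'

printf '%s\n' "$sample_sinfo" | nodes_by_state
# prints:
# alloc 3
# down* 5
# idle 7
```

On a cluster, the same helper could be fed live data with sinfo | nodes_by_state.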

squeue

The squeue command shows the jobs submitted to partitions. By default it shows all jobs in all partitions. A number of filtering and output-format options are documented in the squeue man page.

# squeue
JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
65646     batch  chem  mike  R 24:19     2 adev[7-8]
65647     batch   bio  joan  R  0:09     1 adev14
65648     batch  math  phil PD  0:00     6 (Resources)
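
The ST column holds the job state (R for running, PD for pending), and for pending jobs the last column gives the reason instead of a node list. A minimal sketch that pulls pending jobs out of squeue-style output, assuming the default columns shown above; pending_jobs is a hypothetical helper name:

```shell
# Print job ID, user, and reason for every pending (PD) job in
# squeue-style output read from stdin.
# Assumes the default layout: JOBID=1, USER=4, ST=5, NODELIST(REASON)=8.
pending_jobs() {
  awk 'NR > 1 && $5 == "PD" { print $1, $4, $8 }'
}

# Sample rows copied from the listing above.
sample_squeue='JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
65646     batch  chem  mike  R 24:19     2 adev[7-8]
65647     batch   bio  joan  R  0:09     1 adev14
65648     batch  math  phil PD  0:00     6 (Resources)'

printf '%s\n' "$sample_squeue" | pending_jobs
# prints: 65648 phil (Resources)
```

squeue can also filter on its own: the -t option selects job states and -u selects users, as described in the man page.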

srun

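To run a simple command like hostname as 4 tasks, with each output line labeled by task number (-n sets the number of tasks; use -N to request a number of nodes): srun -n4 -l hostname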

To get an interactive bash shell with 4 GB of RAM (--mem is given in megabytes by default) for 8 hours: srun --pty --mem 4096 -t 8:00:00 bash
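
Since --mem takes megabytes, a request in gigabytes has to be multiplied by 1024. A minimal sketch that builds the interactive command from a memory size in GB and a time limit in whole hours; interactive_cmd is a hypothetical helper that only prints the command, it does not run it:

```shell
# Print an srun command line for an interactive bash session.
# $1 = memory in GB (converted to MB for --mem), $2 = time limit in hours.
interactive_cmd() {
  mem_gb=$1
  hours=$2
  printf 'srun --pty --mem %d -t %d:00:00 bash\n' "$((mem_gb * 1024))" "$hours"
}

interactive_cmd 4 8
# prints: srun --pty --mem 4096 -t 8:00:00 bash
```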

scancel

To cancel a job, you can call scancel with a job number.
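
For example, scancel 65646 would cancel the first job in the squeue listing above. When cancelling several jobs from a script, it can help to check the IDs first. A minimal dry-run sketch, assuming plain numeric job IDs; cancel_jobs is a hypothetical helper that prints the scancel commands instead of running them:

```shell
# Print one scancel command per valid job ID; skip anything non-numeric.
# This is a dry run: nothing is cancelled until the printed commands are run.
cancel_jobs() {
  for id in "$@"; do
    case $id in
      ''|*[!0-9]*)
        echo "skipping invalid job ID: $id" >&2
        continue
        ;;
    esac
    echo "scancel $id"
  done
}

cancel_jobs 65646 65647
# prints:
# scancel 65646
# scancel 65647
```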

scontrol

You can get more detailed information on both nodes and partitions with the scontrol command.

To show details about partitions, run scontrol show partition:

# scontrol show partition
PartitionName=test
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=shado[00-04]
   Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=10 TotalNodes=5 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=test2
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=2 MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=shado[00-02]
   Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=6 TotalNodes=3 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
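
The scontrol output is a series of Key=Value pairs, which makes it straightforward to reduce to the fields of interest. A minimal sketch that prints each partition's name, node count, and CPU count, assuming the multi-line layout shown above; partition_summary is a hypothetical helper name:

```shell
# Print "name TotalNodes TotalCPUs" for each partition found in
# scontrol-show-partition-style output read from stdin.
partition_summary() {
  awk '
    function emit() { if (name != "") print name, nodes, cpus }
    {
      for (i = 1; i <= NF; i++) {
        n = index($i, "=")
        if (n == 0) continue
        key = substr($i, 1, n - 1)
        val = substr($i, n + 1)
        if (key == "PartitionName") { emit(); name = val }  # new record starts
        else if (key == "TotalNodes") nodes = val
        else if (key == "TotalCPUs") cpus = val
      }
    }
    END { emit() }
  '
}

# Abridged sample taken from the listing above.
sample_partitions='PartitionName=test
   Nodes=shado[00-04]
   State=UP TotalCPUs=10 TotalNodes=5 SelectTypeParameters=N/A

PartitionName=test2
   Nodes=shado[00-02]
   State=UP TotalCPUs=6 TotalNodes=3 SelectTypeParameters=N/A'

printf '%s\n' "$sample_partitions" | partition_summary
# prints:
# test 5 10
# test2 3 6
```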

To show details about nodes, run scontrol show nodes:

# scontrol show nodes
NodeName=shado00 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=1.01 Features=(null)
   Gres=(null)
   NodeAddr=shado00 NodeHostName=shado00 Version=14.11
   OS=Linux RealMemory=7823 AllocMem=0 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=49975 Weight=1
   BootTime=2015-07-23T21:13:22 SlurmdStartTime=2015-07-30T11:21:49
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=shado01 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.94 Features=(null)
   Gres=(null)
   NodeAddr=shado01 NodeHostName=shado01 Version=14.11
   OS=Linux RealMemory=7823 AllocMem=0 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=49975 Weight=1
   BootTime=2015-07-23T21:13:22 SlurmdStartTime=2015-07-30T11:23:23
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=shado02 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.95 Features=(null)
   Gres=(null)
   NodeAddr=shado02 NodeHostName=shado02 Version=14.11
   OS=Linux RealMemory=7823 AllocMem=0 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=49975 Weight=1
   BootTime=2015-07-23T21:13:23 SlurmdStartTime=2015-07-30T11:23:50
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
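
The per-node fields can be totalled in the same way. A minimal sketch that counts nodes and sums CPUTot and RealMemory (reported in megabytes), assuming the Key=Value layout shown above; node_totals is a hypothetical helper name:

```shell
# Count nodes and sum CPUTot and RealMemory (MB) from
# scontrol-show-nodes-style output read from stdin.
node_totals() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        n = index($i, "=")
        if (n == 0) continue
        key = substr($i, 1, n - 1)
        val = substr($i, n + 1)
        if (key == "NodeName") nodes++
        else if (key == "CPUTot") cpus += val
        else if (key == "RealMemory") mem += val
      }
    }
    END { printf "nodes=%d cpus=%d mem_mb=%d\n", nodes, cpus, mem }
  '
}

# Abridged sample taken from the listing above.
sample_nodes='NodeName=shado00 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=1.01 Features=(null)
   OS=Linux RealMemory=7823 AllocMem=0 Sockets=2 Boards=1
NodeName=shado01 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.94 Features=(null)
   OS=Linux RealMemory=7823 AllocMem=0 Sockets=2 Boards=1
NodeName=shado02 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.95 Features=(null)
   OS=Linux RealMemory=7823 AllocMem=0 Sockets=2 Boards=1'

printf '%s\n' "$sample_nodes" | node_totals
# prints: nodes=3 cpus=6 mem_mb=23469
```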