=Cluster Status=
SLURM offers a variety of tools to check the general status of nodes/partitions in a cluster.
 
==sinfo==
The sinfo command will show you the status of partitions in the cluster. Passing the -N flag will show each node individually.
<pre>
username@opensub00:sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
dpart*       up   infinite      8   idle openlab[00-07]
gpu          up   infinite      2   idle openlab08
</pre>
<pre>
username@opensub00:sinfo -N
NODELIST   NODES PARTITION STATE
openlab00      1    dpart* idle
openlab01      1    dpart* idle
openlab02      1    dpart* idle
openlab03      1    dpart* idle
openlab04      1    dpart* idle
openlab05      1    dpart* idle
openlab06      1    dpart* idle
openlab07      1    dpart* idle
openlab08      1       gpu idle
</pre>
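In the output above, idle means a node is running no jobs; elsewhere you may also see states such as mix (some, but not all, CPUs allocated) and alloc (fully allocated). sinfo output can be filtered and reshaped with standard flags; the invocations below are a sketch (the flags are standard SLURM options, and dpart is just this cluster's example partition):
<pre>
# Long, node-oriented listing with extra columns (CPUs, memory, reason)
sinfo -N -l

# Restrict the report to a single partition
sinfo -p dpart

# Custom columns: node name, CPU counts (allocated/idle/other/total), compact state
sinfo -N -o "%N %C %t"
</pre>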
==scontrol==
The scontrol command can be used to view the status/configuration of the nodes in the cluster. If passed specific node name(s), only information about those node(s) will be displayed; otherwise all nodes will be listed. To specify multiple nodes, separate each node name with a comma (no spaces).
<pre>
$ scontrol show nodes openlab00,openlab08
NodeName=openlab00 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=8 CPUErr=0 CPUTot=8 CPULoad=7.10
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=openlab00 NodeHostName=openlab00 Version=16.05
   OS=Linux RealMemory=7822 AllocMem=7822 FreeMem=149 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=49975 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2017-01-17T14:46:59 SlurmdStartTime=2017-01-17T14:47:43
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=openlab08 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=1 CPUErr=0 CPUTot=16 CPULoad=1.19
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:3
   NodeAddr=openlab08 NodeHostName=openlab08 Version=16.05
   OS=Linux RealMemory=128722 AllocMem=1024 FreeMem=395 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=49975 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2016-12-22T20:26:52 SlurmdStartTime=2016-12-22T20:33:21
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
</pre>
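Beyond nodes, scontrol can display other read-only views of the cluster. Two common queries are sketched below; 12345 is a hypothetical job ID, not one from this cluster:
<pre>
# Show the configuration of every partition (add a name, e.g. dpart, to show just one)
scontrol show partition

# Show detailed information about a specific job
scontrol show job 12345
</pre>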
==sacctmgr==
The sacctmgr command shows cluster accounting information. One helpful use is listing the available QOSes (qualities of service), which translate roughly into queues in systems like PBS/Torque.
<pre>
$ sacctmgr list qos format=Name,Priority,MaxWall,MaxJobsPU
      Name   Priority     MaxWall MaxJobsPU
---------- ---------- ----------- ---------
    normal          0
     dpart          0  2-00:00:00         8
       gpu          0    08:00:00         2
</pre>
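sacctmgr can also report which accounts, partitions, and QOSes a given user may use. The sketch below assumes standard SLURM association fields; substitute your own login for username:
<pre>
# List the associations (account/partition/QOS) for one user
sacctmgr show associations user=username format=Account,User,Partition,QOS
</pre>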
