=Cluster Status=
SLURM offers a variety of tools to check the general status of nodes/partitions in a cluster.


==sinfo==
The sinfo command will show you the status of partitions in the cluster. Passing the -N flag will show each node individually.
<pre>
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gamma        up   infinite      3   idle gammagpu[01-03]
scavenger    up   infinite      2  drain tron[50-51]
scavenger    up   infinite     21    mix tron[00-01,03-15,46-49,52-53]
scavenger    up   infinite     31   idle tron[02,16-45]
tron*        up 3-00:00:00      2  drain tron[50-51]
tron*        up 3-00:00:00     21    mix tron[00-01,03-15,46-49,52-53]
tron*        up 3-00:00:00     31   idle tron[02,16-45]
</pre>
<pre>
$ sinfo -N
NODELIST    NODES PARTITION STATE
gammagpu01      1     gamma idle
gammagpu02      1     gamma idle
gammagpu03      1     gamma idle
tron00          1 scavenger mix
tron00          1     tron* mix
tron01          1 scavenger mix
tron01          1     tron* mix
tron02          1 scavenger idle
tron02          1     tron* idle
tron03          1 scavenger mix
tron03          1     tron* mix
tron04          1 scavenger mix
tron04          1     tron* mix
...
tron52          1 scavenger mix
tron52          1     tron* mix
tron53          1 scavenger mix
tron53          1     tron* mix
</pre>
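sinfo also accepts filters and a custom output format, which can be handy on a busy cluster. A minimal sketch follows; the tron partition name comes from the output above, and the column list is just one possible choice:
<pre>
# List only idle nodes in the tron partition, one line per node
$ sinfo -N -p tron --states=idle

# Print a custom set of columns: node name, CPUs, memory (MB), and long-form state
$ sinfo -N -o "%N %c %m %T"
</pre>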


==scontrol==
The scontrol command can be used to view the status/configuration of the nodes in the cluster. If passed specific node name(s) only information about those node(s) will be displayed, otherwise all nodes will be listed. To specify multiple nodes, separate each node name by a comma (no spaces).
<pre>
$ scontrol show nodes tron05,tron13
NodeName=tron05 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=28 CPUTot=32 CPULoad=47.32
   AvailableFeatures=rhel8,AMD,EPYC-7302
   ActiveFeatures=rhel8,AMD,EPYC-7302
   Gres=gpu:rtxa6000:8
   NodeAddr=tron05 NodeHostName=tron05 Version=21.08.5
   OS=Linux 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Tue Mar 8 12:56:54 EST 2022
   RealMemory=257538 AllocMem=157696 FreeMem=197620 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=scavenger,tron
   BootTime=2022-04-21T17:40:51 SlurmdStartTime=2022-04-21T18:00:56
   LastBusyTime=2022-04-22T11:21:16
   CfgTRES=cpu=32,mem=257538M,billing=346,gres/gpu=8,gres/gpu:rtxa6000=8
   AllocTRES=cpu=28,mem=154G,gres/gpu=7,gres/gpu:rtxa6000=7
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=tron13 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=1 CPUTot=16 CPULoad=8.41
   AvailableFeatures=rhel8,AMD,EPYC-7302P
   ActiveFeatures=rhel8,AMD,EPYC-7302P
   Gres=gpu:rtxa4000:4
   NodeAddr=tron13 NodeHostName=tron13 Version=21.08.5
   OS=Linux 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Tue Mar 8 12:56:54 EST 2022
   RealMemory=128525 AllocMem=65536 FreeMem=33463 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=scavenger,tron
   BootTime=2022-04-21T17:40:46 SlurmdStartTime=2022-04-21T17:54:51
   LastBusyTime=2022-04-22T13:04:57
   CfgTRES=cpu=16,mem=128525M,billing=173,gres/gpu=4,gres/gpu:rtxa4000=4
   AllocTRES=cpu=1,mem=64G,gres/gpu=4,gres/gpu:rtxa4000=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
</pre>
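scontrol can also report partition-level configuration (time limit, member nodes, defaults) rather than per-node details. A minimal sketch, reusing the tron partition from the examples above:
<pre>
# Show the configuration of a single partition
$ scontrol show partition tron

# With no node name given, list every node in the cluster
$ scontrol show nodes
</pre>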
==sacctmgr==
The sacctmgr command shows cluster accounting information. One helpful use is listing the available QoSes.
<pre>
$ sacctmgr list qos format=Name,Priority,MaxWall,MaxJobsPU
      Name   Priority     MaxWall MaxJobsPU
---------- ---------- ----------- ---------
    normal          0
     dpart          0  2-00:00:00         8
       gpu          0    08:00:00         2
</pre>
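sacctmgr can also show which accounts and QoSes your user is associated with, which determines what you can submit under. A minimal sketch; the format fields listed are just one possible selection, and $USER expands to your own username:
<pre>
# List your own account/QOS associations
$ sacctmgr show associations user=$USER format=Cluster,Account,User,QOS
</pre>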