SLURM/ClusterStatus

From UMIACS
Revision as of 23:11, 22 April 2022 by Jayid07

Cluster Status

SLURM provides several commands for checking the general status of nodes and partitions in a cluster.

sinfo

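The sinfo command shows the status of each partition in the cluster, grouping nodes that share a state onto one line. Passing the -N flag lists each node on its own line instead; a node that belongs to more than one partition appears once per partition.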

[username@nexuscml00 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gamma        up   infinite      3   idle gammagpu[01-03]
scavenger    up   infinite      2  drain tron[50-51]
scavenger    up   infinite     21    mix tron[00-01,03-15,46-49,52-53]
scavenger    up   infinite     31   idle tron[02,16-45]
tron*        up 3-00:00:00      2  drain tron[50-51]
tron*        up 3-00:00:00     21    mix tron[00-01,03-15,46-49,52-53]
tron*        up 3-00:00:00     31   idle tron[02,16-45]

[username@nexuscml00 ~]$ sinfo -N
NODELIST    NODES PARTITION STATE
gammagpu01      1     gamma idle
gammagpu02      1     gamma idle
gammagpu03      1     gamma idle
tron00          1 scavenger mix
tron00          1     tron* mix
tron01          1 scavenger mix
tron01          1     tron* mix
tron02          1 scavenger idle
tron02          1     tron* idle
tron03          1 scavenger mix
tron03          1     tron* mix
tron04          1 scavenger mix
tron04          1     tron* mix
...
tron52          1 scavenger mix
tron52          1     tron* mix
tron53          1 scavenger mix
tron53          1     tron* mix
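Because sinfo -N repeats a node once per partition, counting its lines overstates how many nodes are in each state. The following is a minimal sketch, not part of SLURM, of a shell helper that counts each node only once. The function name count_node_states is hypothetical, and the sketch assumes the four-column layout shown above (NODELIST, NODES, PARTITION, STATE).

```shell
# Hypothetical helper: count unique nodes per state.
# Reads `sinfo -N` style output on stdin and assumes the four-column
# layout NODELIST NODES PARTITION STATE with a single header line.
count_node_states() {
    awk '
        # Skip the header and any line that is not a four-column record;
        # count a node only the first time its name is seen.
        NR > 1 && NF == 4 && !seen[$1]++ { count[$4]++ }
        END { for (s in count) print s, count[s] }
    ' | sort
}

# On a cluster login node this would be used as:
#   sinfo -N | count_node_states
```

The helper prints one "state count" pair per line, sorted by state name.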

scontrol

The scontrol command can be used to view the status and configuration of the nodes in the cluster. If you pass specific node names, only information about those nodes is displayed; otherwise, all nodes are listed. To specify multiple nodes, separate the node names with commas (no spaces).

$ scontrol show nodes openlab00,openlab08
NodeName=openlab00 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=8 CPUErr=0 CPUTot=8 CPULoad=7.10
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=openlab00 NodeHostName=openlab00 Version=16.05
   OS=Linux RealMemory=7822 AllocMem=7822 FreeMem=149 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=49975 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2017-01-17T14:46:59 SlurmdStartTime=2017-01-17T14:47:43
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=openlab08 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=1 CPUErr=0 CPUTot=16 CPULoad=1.19
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:3
   NodeAddr=openlab08 NodeHostName=openlab08 Version=16.05
   OS=Linux RealMemory=128722 AllocMem=1024 FreeMem=395 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=49975 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2016-12-22T20:26:52 SlurmdStartTime=2016-12-22T20:33:21
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
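The full scontrol output is verbose when only a few fields matter. The sketch below shows one way to condense it to a line per node with the node name, state, and allocated/total CPUs. The function name summarize_nodes is hypothetical, and the sketch assumes the Key=Value record layout shown above, with each record starting on a line that begins with NodeName=.

```shell
# Hypothetical helper: condense `scontrol show nodes` output read on stdin
# to "name state allocated/total" per node.
# Assumes Key=Value tokens and that every record begins with "NodeName=".
summarize_nodes() {
    awk '
        function emit() {
            if (name != "") print name, state, alloc "/" total
        }
        # A new record starts: print the previous one and reset the fields.
        /^NodeName=/ { emit(); name = state = alloc = total = "" }
        {
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "NodeName")      name  = kv[2]
                else if (kv[1] == "State")    state = kv[2]
                else if (kv[1] == "CPUAlloc") alloc = kv[2]
                else if (kv[1] == "CPUTot")   total = kv[2]
            }
        }
        END { emit() }
    '
}

# On a cluster login node this would be used as:
#   scontrol show nodes openlab00,openlab08 | summarize_nodes
```

For the two nodes shown above, this would report openlab00 as ALLOCATED with 8/8 CPUs and openlab08 as MIXED with 1/16 CPUs.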

sacctmgr

The sacctmgr command shows cluster accounting information. One particularly helpful use is listing the available QoSes (Quality of Service levels) and their limits.

$ sacctmgr list qos format=Name,Priority,MaxWall,MaxJobsPU
      Name   Priority     MaxWall MaxJobsPU
---------- ---------- ----------- ---------
    normal          0
     dpart          0  2-00:00:00         8
       gpu          0    08:00:00         2
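When the QoS list is consumed by a script rather than read by eye, the column-aligned table is awkward to parse. As a sketch, sacctmgr's -n (no header) and -P (pipe-delimited) options can be combined with a small filter that looks up the wall-time limit of one QoS. The function name qos_max_wall is hypothetical, and it assumes the same field order as the format= list above (Name, Priority, MaxWall, MaxJobsPU).

```shell
# Hypothetical helper: print the MaxWall value for one QoS.
# Reads pipe-delimited records (Name|Priority|MaxWall|MaxJobsPU) on stdin
# and prints "unset" when the QoS itself defines no wall-time limit.
qos_max_wall() {
    awk -F'|' -v qos="$1" '$1 == qos { print ($3 == "" ? "unset" : $3) }'
}

# On a cluster login node this would be used as:
#   sacctmgr -n -P list qos format=Name,Priority,MaxWall,MaxJobsPU | qos_max_wall gpu
```

An "unset" result only means the QoS record carries no MaxWall value; a partition time limit, such as the one sinfo reports, may still apply.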