SLURM/ClusterStatus
Cluster Status
SLURM offers a variety of tools to check the general status of nodes/partitions in a cluster.
sinfo
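The sinfo command shows the status of the partitions in the cluster. An asterisk after a partition name marks the default partition. Passing the -N flag lists each node individually; a node that belongs to more than one partition appears once per partition.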
[username@nexuscml00 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gamma        up   infinite      3   idle gammagpu[01-03]
scavenger    up   infinite      2  drain tron[50-51]
scavenger    up   infinite     21    mix tron[00-01,03-15,46-49,52-53]
scavenger    up   infinite     31   idle tron[02,16-45]
tron*        up 3-00:00:00      2  drain tron[50-51]
tron*        up 3-00:00:00     21    mix tron[00-01,03-15,46-49,52-53]
tron*        up 3-00:00:00     31   idle tron[02,16-45]
[username@nexuscml00 ~]$ sinfo -N
NODELIST    NODES PARTITION STATE
gammagpu01      1 gamma     idle
gammagpu02      1 gamma     idle
gammagpu03      1 gamma     idle
tron00          1 scavenger mix
tron00          1 tron*     mix
tron01          1 scavenger mix
tron01          1 tron*     mix
tron02          1 scavenger idle
tron02          1 tron*     idle
tron03          1 scavenger mix
tron03          1 tron*     mix
tron04          1 scavenger mix
tron04          1 tron*     mix
...
tron52          1 scavenger mix
tron52          1 tron*     mix
tron53          1 scavenger mix
tron53          1 tron*     mix
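When a partition is split across several state lines, as tron is above, it can help to total the node counts per state. The following is a minimal sketch, not an official tool: nodes_by_state is a hypothetical helper name, and it assumes the default six-column sinfo layout shown above (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST).

```shell
#!/bin/sh
# Hypothetical helper: sum the NODES column per STATE for one partition.
# Reads default-format `sinfo` output on stdin.
# Assumes column 1 = PARTITION, column 4 = NODES, column 5 = STATE.
nodes_by_state() {
    awk -v part="$1" '
        NR == 1 && $1 == "PARTITION" { next }      # skip the header row
        { name = $1; sub(/\*$/, "", name) }        # drop the default-partition "*"
        name == part { count[$5] += $4 }           # accumulate nodes per state
        END { for (s in count) print s, count[s] }
    ' | sort
}

# Usage on a cluster where sinfo is on PATH:
#   sinfo | nodes_by_state tron
```

Reading from stdin keeps the helper independent of the cluster, so it can also be run against saved sinfo output.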
scontrol
The scontrol command can be used to view the status and configuration of the nodes in the cluster. If you pass one or more node names, only information about those nodes is displayed; otherwise, all nodes are listed. To specify multiple nodes, separate the node names with commas (no spaces).
$ scontrol show nodes openlab00,openlab08
NodeName=openlab00 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=8 CPUErr=0 CPUTot=8 CPULoad=7.10
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=openlab00 NodeHostName=openlab00 Version=16.05
   OS=Linux RealMemory=7822 AllocMem=7822 FreeMem=149 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=49975 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2017-01-17T14:46:59 SlurmdStartTime=2017-01-17T14:47:43
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=openlab08 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=1 CPUErr=0 CPUTot=16 CPULoad=1.19
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:3
   NodeAddr=openlab08 NodeHostName=openlab08 Version=16.05
   OS=Linux RealMemory=128722 AllocMem=1024 FreeMem=395 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=49975 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2016-12-22T20:26:52 SlurmdStartTime=2016-12-22T20:33:21
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
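The full node record is verbose, so a short filter can reduce it to the fields most often needed. The sketch below is illustrative only: node_summary is a hypothetical helper name, and it assumes the input comes from scontrol's -o (--oneliner) option, which prints each record on a single line. It reads only the NodeName, State, CPUAlloc, and CPUTot fields that appear in the output above.

```shell
#!/bin/sh
# Hypothetical helper: print "name state allocated/total CPUs" per node.
# Expects one node record per line on stdin, e.g. from `scontrol -o show nodes`.
# Fields are split on whitespace, so values that contain spaces are not
# handled; the four fields read here contain none in the output above.
node_summary() {
    awk '
        {
            split("", kv)                          # portable way to clear the array
            for (i = 1; i <= NF; i++) {
                eq = index($i, "=")                # split each token at the first "="
                if (eq > 0) kv[substr($i, 1, eq - 1)] = substr($i, eq + 1)
            }
            if ("NodeName" in kv)
                printf "%s %s %s/%s\n", kv["NodeName"], kv["State"], kv["CPUAlloc"], kv["CPUTot"]
        }
    '
}

# Usage on a cluster where scontrol is on PATH:
#   scontrol -o show nodes openlab00,openlab08 | node_summary
```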
sacctmgr
The sacctmgr command shows cluster accounting information. One particularly useful query lists the available Quality of Service (QoS) levels along with their limits.
$ sacctmgr list qos format=Name,Priority,MaxWall,MaxJobsPU
      Name   Priority     MaxWall MaxJobsPU
---------- ---------- ----------- ---------
    normal          0
     dpart          0  2-00:00:00         8
       gpu          0    08:00:00         2
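To look up a single limit from a script, the same query can be run with sacctmgr's -n (no header) and -P (pipe-delimited) options. The sketch below is a hedged illustration: qos_maxwall is a hypothetical helper name, it assumes the field order Name,Priority,MaxWall,MaxJobsPU used above, and the word "unset" is this sketch's own label for an empty MaxWall field.

```shell
#!/bin/sh
# Hypothetical helper: print the MaxWall value for one QoS.
# Expects pipe-delimited rows on stdin in the field order
# Name|Priority|MaxWall|MaxJobsPU.
# An empty MaxWall is reported as "unset" (a label chosen for this sketch);
# limits set elsewhere, such as on the partition, may still apply.
qos_maxwall() {
    awk -F'|' -v qos="$1" '$1 == qos { print ($3 == "" ? "unset" : $3) }'
}

# Usage on a cluster where sacctmgr is on PATH:
#   sacctmgr -nP list qos format=Name,Priority,MaxWall,MaxJobsPU | qos_maxwall gpu
```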