SLURM/ClusterStatus
Cluster Status
SLURM offers a variety of tools to check the general status of nodes/partitions in a cluster.
sinfo
The sinfo command will show you the status of partitions in the cluster. Passing the -N flag will show each node individually.
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gamma        up   infinite      3   idle gammagpu[01-03]
scavenger    up   infinite      2  drain tron[50-51]
scavenger    up   infinite     21    mix tron[00-01,03-15,46-49,52-53]
scavenger    up   infinite     31   idle tron[02,16-45]
tron*        up 3-00:00:00      2  drain tron[50-51]
tron*        up 3-00:00:00     21    mix tron[00-01,03-15,46-49,52-53]
tron*        up 3-00:00:00     31   idle tron[02,16-45]
$ sinfo -N
NODELIST    NODES  PARTITION STATE
gammagpu01      1      gamma idle
gammagpu02      1      gamma idle
gammagpu03      1      gamma idle
tron00          1  scavenger mix
tron00          1      tron* mix
tron01          1  scavenger mix
tron01          1      tron* mix
tron02          1  scavenger idle
tron02          1      tron* idle
tron03          1  scavenger mix
tron03          1      tron* mix
tron04          1  scavenger mix
tron04          1      tron* mix
...
tron52          1  scavenger mix
tron52          1      tron* mix
tron53          1  scavenger mix
tron53          1      tron* mix
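If you only need a subset of this information, sinfo can filter and reformat its output. For example (a brief sketch using standard sinfo flags; the partition name is just the one from the output above, and the columns you want may differ):

# Show only idle nodes in the tron partition
$ sinfo -p tron -t idle

# Custom columns: partition, node count, state, GRES (e.g. GPUs), and memory per node
$ sinfo -N -o "%P %D %t %G %m"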
scontrol
The scontrol command can be used to view the status and configuration of the nodes in the cluster. If specific node name(s) are passed, only information about those node(s) is displayed; otherwise, all nodes are listed. To specify multiple nodes, separate the names with commas (no spaces).
$ scontrol show nodes tron05,tron13
NodeName=tron05 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=28 CPUTot=32 CPULoad=47.32
   AvailableFeatures=rhel8,AMD,EPYC-7302
   ActiveFeatures=rhel8,AMD,EPYC-7302
   Gres=gpu:rtxa6000:8
   NodeAddr=tron05 NodeHostName=tron05 Version=21.08.5
   OS=Linux 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Tue Mar 8 12:56:54 EST 2022
   RealMemory=257538 AllocMem=157696 FreeMem=197620 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=scavenger,tron
   BootTime=2022-04-21T17:40:51 SlurmdStartTime=2022-04-21T18:00:56
   LastBusyTime=2022-04-22T11:21:16
   CfgTRES=cpu=32,mem=257538M,billing=346,gres/gpu=8,gres/gpu:rtxa6000=8
   AllocTRES=cpu=28,mem=154G,gres/gpu=7,gres/gpu:rtxa6000=7
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=tron13 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=1 CPUTot=16 CPULoad=8.41
   AvailableFeatures=rhel8,AMD,EPYC-7302P
   ActiveFeatures=rhel8,AMD,EPYC-7302P
   Gres=gpu:rtxa4000:4
   NodeAddr=tron13 NodeHostName=tron13 Version=21.08.5
   OS=Linux 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Tue Mar 8 12:56:54 EST 2022
   RealMemory=128525 AllocMem=65536 FreeMem=33463 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=scavenger,tron
   BootTime=2022-04-21T17:40:46 SlurmdStartTime=2022-04-21T17:54:51
   LastBusyTime=2022-04-22T13:04:57
   CfgTRES=cpu=16,mem=128525M,billing=173,gres/gpu=4,gres/gpu:rtxa4000=4
   AllocTRES=cpu=1,mem=64G,gres/gpu=4,gres/gpu:rtxa4000=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
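scontrol can similarly show partition configuration, and its node output pipes cleanly into standard text tools. A short sketch using standard scontrol subcommands (output omitted here):

# Show the configuration of the tron partition (time limit, default memory, etc.)
$ scontrol show partition tron

# Show a single node and extract just the configured/allocated TRES lines
$ scontrol show node tron05 | grep TRES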
sacctmgr
The sacctmgr command shows cluster accounting information. One helpful use is listing the available QoSes (quality of service levels).
$ sacctmgr list qos format=Name,Priority,MaxWall,MaxJobsPU
      Name   Priority     MaxWall MaxJobsPU
---------- ---------- ----------- ---------
    normal          0
     dpart          0  2-00:00:00         8
       gpu          0    08:00:00         2
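To see which accounts and QoSes your own user is allowed to submit under, sacctmgr can also list associations. A brief sketch (standard sacctmgr usage; the values returned depend on your cluster's accounting configuration):

# List the account, partition, and QoS associations for your user
$ sacctmgr show associations user=$USER format=Account,Partition,QOS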