SLURM/ClusterStatus
Latest revision as of 17:36, 2 October 2023
Cluster Status
SLURM offers a variety of tools to check the general status of nodes/partitions in a cluster.
sinfo
The sinfo command will show you the status of partitions in the cluster. Passing the -N flag will show each node individually.
<pre>
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gamma        up   infinite      3   idle gammagpu[01-03]
scavenger    up   infinite      2  drain tron[50-51]
scavenger    up   infinite     21    mix tron[00-01,03-15,46-49,52-53]
scavenger    up   infinite     31   idle tron[02,16-45]
tron*        up 3-00:00:00      2  drain tron[50-51]
tron*        up 3-00:00:00     21    mix tron[00-01,03-15,46-49,52-53]
tron*        up 3-00:00:00     31   idle tron[02,16-45]
</pre>
<pre>
$ sinfo -N
NODELIST   NODES PARTITION STATE
gammagpu01     1     gamma idle
gammagpu02     1     gamma idle
gammagpu03     1     gamma idle
tron00         1 scavenger mix
tron00         1     tron* mix
tron01         1 scavenger mix
tron01         1     tron* mix
tron02         1 scavenger idle
tron02         1     tron* idle
tron03         1 scavenger mix
tron03         1     tron* mix
tron04         1 scavenger mix
tron04         1     tron* mix
...
tron52         1 scavenger mix
tron52         1     tron* mix
tron53         1 scavenger mix
tron53         1     tron* mix
</pre>
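On a busy cluster the full listing can be long, and it is often useful to tally idle capacity per partition. The sketch below is a hedged example: it runs awk over a copy of the first sinfo listing above, captured into a shell variable so it can be run anywhere. On a live cluster you would pipe `sinfo` in directly instead of `echo "$sinfo_output"`.

```shell
# Tally idle nodes per partition from sinfo output.
# The sample here is the `sinfo` listing shown above, stored in a
# variable so the pipeline is self-contained (an assumption for the
# example; on a cluster, pipe `sinfo` directly).
sinfo_output='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gamma up infinite 3 idle gammagpu[01-03]
scavenger up infinite 2 drain tron[50-51]
scavenger up infinite 21 mix tron[00-01,03-15,46-49,52-53]
scavenger up infinite 31 idle tron[02,16-45]
tron* up 3-00:00:00 2 drain tron[50-51]
tron* up 3-00:00:00 21 mix tron[00-01,03-15,46-49,52-53]
tron* up 3-00:00:00 31 idle tron[02,16-45]'

echo "$sinfo_output" |
  awk 'NR > 1 && $5 == "idle" { idle[$1] += $4 }   # column 4 = NODES, column 5 = STATE
       END { for (p in idle) print p, idle[p] }' |
  sort
```

For the sample above this prints `gamma 3`, `scavenger 31`, and `tron* 31`.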
scontrol
The scontrol command can be used to view the status/configuration of the nodes in the cluster. If passed specific node name(s), only information about those node(s) is displayed; otherwise all nodes are listed. To specify multiple nodes, separate the names with commas (no spaces).
<pre>
$ scontrol show nodes tron05,tron13
NodeName=tron05 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=28 CPUTot=32 CPULoad=47.32
   AvailableFeatures=rhel8,AMD,EPYC-7302
   ActiveFeatures=rhel8,AMD,EPYC-7302
   Gres=gpu:rtxa6000:8
   NodeAddr=tron05 NodeHostName=tron05 Version=21.08.5
   OS=Linux 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Tue Mar 8 12:56:54 EST 2022
   RealMemory=257538 AllocMem=157696 FreeMem=197620 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=scavenger,tron
   BootTime=2022-04-21T17:40:51 SlurmdStartTime=2022-04-21T18:00:56
   LastBusyTime=2022-04-22T11:21:16
   CfgTRES=cpu=32,mem=257538M,billing=346,gres/gpu=8,gres/gpu:rtxa6000=8
   AllocTRES=cpu=28,mem=154G,gres/gpu=7,gres/gpu:rtxa6000=7
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=tron13 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=1 CPUTot=16 CPULoad=8.41
   AvailableFeatures=rhel8,AMD,EPYC-7302P
   ActiveFeatures=rhel8,AMD,EPYC-7302P
   Gres=gpu:rtxa4000:4
   NodeAddr=tron13 NodeHostName=tron13 Version=21.08.5
   OS=Linux 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Tue Mar 8 12:56:54 EST 2022
   RealMemory=128525 AllocMem=65536 FreeMem=33463 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=scavenger,tron
   BootTime=2022-04-21T17:40:46 SlurmdStartTime=2022-04-21T17:54:51
   LastBusyTime=2022-04-22T13:04:57
   CfgTRES=cpu=16,mem=128525M,billing=173,gres/gpu=4,gres/gpu:rtxa4000=4
   AllocTRES=cpu=1,mem=64G,gres/gpu=4,gres/gpu:rtxa4000=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
</pre>
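The CfgTRES and AllocTRES lines are handy for spotting free GPUs: the difference between the configured and allocated gres/gpu counts is the number of unallocated GPUs on a node. A minimal sketch, run here over an abridged sample of the listing above (on a live cluster you would pipe `scontrol show nodes` in directly):

```shell
# Report free GPUs per node by comparing CfgTRES and AllocTRES.
# The sample is abridged from the scontrol listing above to the three
# relevant lines per node (an assumption for the example).
scontrol_sample='NodeName=tron05 Arch=x86_64 CoresPerSocket=16
CfgTRES=cpu=32,mem=257538M,billing=346,gres/gpu=8,gres/gpu:rtxa6000=8
AllocTRES=cpu=28,mem=154G,gres/gpu=7,gres/gpu:rtxa6000=7
NodeName=tron13 Arch=x86_64 CoresPerSocket=16
CfgTRES=cpu=16,mem=128525M,billing=173,gres/gpu=4,gres/gpu:rtxa4000=4
AllocTRES=cpu=1,mem=64G,gres/gpu=4,gres/gpu:rtxa4000=4'

echo "$scontrol_sample" | awk '
  match($0, /NodeName=[^ ]+/)                     { node  = substr($0, RSTART + 9, RLENGTH - 9) }
  /CfgTRES=/   && match($0, /gres\/gpu=[0-9]+/)   { cfg   = substr($0, RSTART + 9, RLENGTH - 9) }
  /AllocTRES=/ && match($0, /gres\/gpu=[0-9]+/)   { alloc = substr($0, RSTART + 9, RLENGTH - 9)
                                                    print node, cfg - alloc }'
```

For the sample above this reports 1 free GPU on tron05 and 0 on tron13.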
sacctmgr
The sacctmgr command shows cluster accounting information. One helpful use is listing the available QoSes.
<pre>
$ sacctmgr list qos format=Name,Priority,MaxWall,MaxJobsPU
      Name   Priority     MaxWall MaxJobsPU
---------- ---------- ----------- ---------
    normal          0
     dpart          0  2-00:00:00         8
       gpu          0    08:00:00         2
</pre>
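The MaxWall column mixes two formats: `2-00:00:00` is days-hours:minutes:seconds, while `08:00:00` is hours:minutes:seconds. A small sketch converting both to hours for easy comparison, using rows pasted from the listing above (on a live system you could pipe in `sacctmgr -n list qos format=Name,MaxWall` instead):

```shell
# Convert MaxWall limits (D-HH:MM:SS or HH:MM:SS) to hours.
# The two sample rows are taken from the QoS listing above.
qos_sample='dpart 2-00:00:00
gpu 08:00:00'

echo "$qos_sample" | awk '{
  n = split($2, t, "[-:]")                    # D-HH:MM:SS splits into 4 fields, HH:MM:SS into 3
  if (n == 4) h = t[1]*24 + t[2] + t[3]/60 + t[4]/3600
  else        h = t[1]      + t[2]/60 + t[3]/3600
  printf "%s %g hours\n", $1, h
}'
```

For the sample rows this prints `dpart 48 hours` and `gpu 8 hours`.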