ClassAccounts
Overview
UMIACS Class Accounts support classes for all of UMIACS/CSD via the Nexus cluster. Faculty may request that a class be supported by following the instructions on the ClassAccounts/Manage page.
Getting an account
Your TA or instructor will request an account for you. Once this is done, you will be notified by email that you have an account to redeem. If you have not received an email, please contact your TA or instructor. You must redeem the account within 7 days or else the redemption token will expire. If your redemption token does expire, please contact your TA or instructor to have it renewed.
Once you do redeem your account, you will need to wait until you get a confirmation email that your account has been installed. This is typically done once a day on days that the University is open for business.
Any questions or issues with your account, storage, or cluster use must first be raised with your TA or instructor.
Registering for Duo
UMIACS requires that all Class accounts register for MFA (multi-factor authentication) under our Duo instance (note that this is different from UMD's general Duo instance). You will not be able to log onto the class submission host until you register.
If you see the following error in your SSH client, you have not yet enrolled/registered in Duo.
Access is not allowed because you are not enrolled in Duo. Please contact your organization's IT help desk.
In order to register, visit our directory app at https://intranet.umiacs.umd.edu/directory and log in with your Class username and password. You will then receive a prompt to enroll in Duo. For assistance in enrollment, please visit our Duo help page.
Once notified that your account has been installed and you have registered in our Duo instance, you can SSH to nexusclass.umiacs.umd.edu with your assigned username and your chosen password to log in to a submission host.
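For example, from a terminal on your own machine (c999z000 here is only the illustrative username used in the storage examples below):

ssh c999z000@nexusclass.umiacs.umd.edu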
If you store something in a local filesystem directory (/tmp, /scratch0) on one of the two submission hosts, you will need to connect to that same submission host to access it later. The actual submission hosts are:
- nexusclass00.umiacs.umd.edu
- nexusclass01.umiacs.umd.edu
Cleaning up your account before the end of the semester
Class accounts for a given semester are liable to be archived and deleted after that semester's completion as early as the following:
- Winter semesters: February 1st of same year
- Spring semesters: June 1st of same year
- Summer semesters: September 1st of same year
- Fall semesters: January 1st of next year
It is your responsibility to ensure you have backed up anything you want to keep from your class account's personal or group storage (below sections) prior to the relevant date.
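One way to do this, shown only as a sketch (assuming the illustrative Spring 2021 account c999z000 used later on this page and a ~/class-backup directory on your own machine), is to pull your home directory down before the relevant date:

# run from your own machine, not from the cluster
rsync -av c999z000@nexusclass.umiacs.umd.edu:/fs/classhomes/spring2021/cmsc999z/c999z000/ ~/class-backup/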
Personal Storage
Your home directory has a quota of 30GB and is located at:
/fs/classhomes/<semester><year>/<coursecode>/<username>
where <semester> is either "spring", "summer", "fall", or "winter", <year> is the current year e.g., "2021", <coursecode> is the class' course code as listed in UMD's Schedule of Classes in all lowercase e.g., "cmsc999z", and <username> is the username mentioned in the email you received to redeem the account e.g., "c999z000".
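Putting those example values together, the home directory for the illustrative Spring 2021 account would be:

/fs/classhomes/spring2021/cmsc999z/c999z000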
You can request up to another 100GB of personal storage if you would like by having your TA or instructor contact staff. This storage will be located at
/fs/class-projects/<semester><year>/<coursecode>/<username>
Group Storage
You can also request group storage by having your TA or instructor contact staff to specify the usernames of the accounts that should be in the group. Only other class accounts in the same class can be added to the group. The quota will be 100GB multiplied by the number of accounts in the group and will be located at
/fs/class-projects/<semester><year>/<coursecode>/<groupname>
where <groupname> is composed of:
- the abbreviated course code as used in the username e.g., "c999z"
- the character "g"
- the number of the group (starting at 0 for the first group for the class requested to us) prepended with 0s to make the total group name 8 characters long
e.g., "c999zg00".
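Continuing the same illustrative example, the first group requested for cmsc999z in Spring 2021 would be located at:

/fs/class-projects/spring2021/cmsc999z/c999zg00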
Cluster Usage
You may not run computational jobs on any submission host. You must schedule your jobs with the SLURM workload manager. You can also find out more in the public documentation for the SLURM Workload Manager.
Class accounts only have access to the following submission parameters in SLURM:
- --partition - class
- --account - class
- --qos - default, medium, and high
You must specify at least the partition parameter manually in any submission command you run. If you do not specify any QoS parameter, you will receive the QoS default.
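As an illustration only (not taken from the cluster documentation), the same parameters can be placed in a batch script; the time limit, memory request, and script name below are placeholder values, using standard SLURM #SBATCH directives:

#!/bin/bash
#SBATCH --partition=class     # required: class accounts may only use the class partition
#SBATCH --account=class
#SBATCH --qos=default
#SBATCH --gres=gpu:1          # example: request a single GPU
#SBATCH --time=01:00:00       # placeholder; keep within the QoS limits described below
#SBATCH --mem=16G             # placeholder memory request

hostname
nvidia-smi -L

You would then submit it with something like sbatch myjob.sh, where myjob.sh is just an example filename.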
You can view the resource limits for each QoS by running the command show_qos. The value in the MaxWall column is the maximum wall time a single job can run under each QoS, and the values in the MaxTRES column are the maximum amount of CPU cores/GPUs/memory you can request for a single job under each QoS.

Please note that you will be restricted to 32 total CPU cores, 4 total GPUs, and 256GB total RAM across all jobs you have running at once. This can be viewed with the command show_partition_qos.
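To see what your running and pending jobs are consuming against these aggregate limits, standard SLURM commands (not cluster-specific tools) are enough:

# list your own jobs in the queue
squeue -u $USER
# show the full resource allocation of a single job (replace 1333337 with your own job ID)
scontrol show job 1333337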
Example
Here is a basic example of scheduling an interactive job running bash with a single GPU in the partition class, with the account class, running with the QoS of default and the default CPU/memory allocation/time limit for the partition.
bash-4.4$ hostname
nexusclass00.umiacs.umd.edu
bash-4.4$ srun --partition=class --account=class --qos=default --gres=gpu:1 --pty bash
srun: Job time limit was unset; set to partition default of 60 minutes
srun: job 1333337 queued and waiting for resources
srun: job 1333337 has been allocated resources
bash-4.4$ hostname
tron14.umiacs.umd.edu
bash-4.4$ nvidia-smi -L
GPU 0: NVIDIA RTX A4000 (UUID: GPU-55f2d3b7-9162-8b02-50de-476a012c626c)
Available Nodes
You can list the available nodes and their current state with the show_nodes -p class command. This list of nodes is not completely static, as nodes may be pulled out of service to troubleshoot GPUs or other components.
$ show_nodes -p class
NODELIST  CPUS  MEMORY  AVAIL_FEATURES               GRES            STATE
tron06    16    128520  rhel8,AMD,EPYC-7302P,Ampere  gpu:rtxa4000:4  idle
tron07    16    128520  rhel8,AMD,EPYC-7302P,Ampere  gpu:rtxa4000:4  idle
tron08    16    128520  rhel8,AMD,EPYC-7302P,Ampere  gpu:rtxa4000:4  idle
tron09    16    128520  rhel8,AMD,EPYC-7302P,Ampere  gpu:rtxa4000:4  idle
tron10    16    128524  rhel8,Zen,EPYC-7313P,Ampere  gpu:rtxa4000:4  idle
tron11    16    128524  rhel8,Zen,EPYC-7313P,Ampere  gpu:rtxa4000:4  idle
tron12    16    128525  rhel8,AMD,EPYC-7302P,Ampere  gpu:rtxa4000:4  idle
tron13    16    128520  rhel8,AMD,EPYC-7302P,Ampere  gpu:rtxa4000:4  idle
tron14    16    128520  rhel8,AMD,EPYC-7302P,Ampere  gpu:rtxa4000:4  idle
tron15    16    128520  rhel8,AMD,EPYC-7302P,Ampere  gpu:rtxa4000:4  idle
tron16    16    128524  rhel8,Zen,EPYC-7313P,Ampere  gpu:rtxa4000:4  idle
tron17    16    128524  rhel8,Zen,EPYC-7313P,Ampere  gpu:rtxa4000:4  idle
tron18    16    128524  rhel8,Zen,EPYC-7313P,Ampere  gpu:rtxa4000:4  idle
tron19    16    128524  rhel8,Zen,EPYC-7313P,Ampere  gpu:rtxa4000:4  idle
tron20    16    128524  rhel8,Zen,EPYC-7313P,Ampere  gpu:rtxa4000:4  idle
tron21    16    128525  rhel8,AMD,EPYC-7302P,Ampere  gpu:rtxa4000:4  idle
tron22    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron23    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron24    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron25    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron26    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron27    16    128521  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron28    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron29    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron30    16    128521  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron31    16    128521  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron32    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron33    16    128521  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron34    16    128524  rhel8,Zen,EPYC-7313P,Ampere  gpu:rtxa4000:4  idle
tron35    16    128521  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron36    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron37    16    128521  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron38    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron39    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron40    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron41    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron42    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron43    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron44    16    128525  rhel8,AMD,EPYC-7302,Ampere   gpu:rtxa4000:4  idle
tron46    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron47    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron48    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron49    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron50    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron51    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron52    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron53    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron54    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron55    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron56    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron57    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron58    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron59    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron60    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
tron61    48    255232  rhel8,Zen,EPYC-7352,Ampere   gpu:rtxa5000:8  idle
You can also find more granular information about an individual node with the scontrol show node command.
$ scontrol show node tron06
NodeName=tron06 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.08
   AvailableFeatures=rhel8,Zen,EPYC-7302P,Ampere
   ActiveFeatures=rhel8,Zen,EPYC-7302P,Ampere
   Gres=gpu:rtxa4000:4
   NodeAddr=tron06 NodeHostName=tron06 Version=23.02.6
   OS=Linux 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Thu Dec 7 03:06:13 EST 2023
   RealMemory=126214 AllocMem=0 FreeMem=107174 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=340 Owner=N/A MCS_label=N/A
   Partitions=class,scavenger,tron
   BootTime=2024-01-29T09:35:12 SlurmdStartTime=2024-02-05T15:14:20
   LastBusyTime=2024-02-16T15:59:38 ResumeAfterTime=None
   CfgTRES=cpu=16,mem=126214M,billing=638,gres/gpu=4,gres/gpu:rtxa4000=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s