The Center for Machine Learning (CML) at the University of Maryland is located within the Institute for Advanced Computer Studies. The CML has a cluster of computational (CPU/GPU) resources that are available to be scheduled.

<span style="font-size:150%">'''As of the [[MonthlyMaintenanceWindow | August 2023 maintenance window]], all compute nodes have moved into the [[Nexus]] cluster.''' Please see [[Nexus/CML]] for more details.</span>
=Compute Infrastructure=

Each of UMIACS' cluster computational infrastructures is accessed through a submission node. Users need to submit jobs through the [[SLURM]] resource manager once they have logged into the submission node. Each UMIACS cluster has different quality of service (QoS) levels, one of which is '''required''' to be selected when submitting a job. Many clusters, including this one, also have other resources such as GPUs that need to be requested for a job.

The current submission node(s) for '''CML''' are:
* <code>cmlsub00.umiacs.umd.edu</code>
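For example, once logged into the submission node, a job submission that selects a partition, account, QoS, and a GPU might look like the following (a minimal sketch; <code>myjob.sh</code> is a placeholder job script):

<pre>
sbatch --partition=dpart --account=cml --qos=default --gres=gpu:1 myjob.sh
</pre>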
The Center for Machine Learning's GPU resources come from a small investment of base Center funds and a number of investments by individual faculty members. The scheduler's resources are modeled around this concept. This means there are additional SLURM accounts that users need to be aware of if they are submitting to a non-scavenger partition.

==Partitions==

There are three partitions available to general CML [[SLURM]] users. If you do not specify a partition when submitting your job, you will be placed in the '''dpart''' partition.

* '''dpart''' - This is the default partition. Job allocations are guaranteed.
* '''scavenger''' - This is the alternate partition that allows longer job run times and more resources, but jobs are preemptable when jobs in the other partitions are ready to be scheduled (see the example below).
* '''cpu''' - This partition is for CPU-focused jobs. Job allocations are guaranteed.

There is one additional partition available solely to Furong Huang's sponsored accounts.

* '''furongh''' - This partition provides exclusive priority access to Furong Huang's purchased A6000 node.
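For example, to submit a preemptable job to the '''scavenger''' partition (a sketch, assuming the scavenger account and QoS pairing shown in the sections below; <code>myjob.sh</code> is a placeholder job script):

<pre>
sbatch --partition=scavenger --account=scavenger --qos=scavenger --gres=gpu:2 myjob.sh
</pre>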
==Accounts==

The Center has a base SLURM account, <code>cml</code>, which has a modest number of guaranteed billing resources available to all cluster users at any given time. Faculty who have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which grants a number of guaranteed billing resources corresponding to the amount that they invested. If you do not specify an account when submitting your job, you will be placed in the '''cml''' account.
<pre>
$ sacctmgr show accounts
   Account                Descr                  Org
---------- -------------------- --------------------
   abhinav  abhinav shrivastava                  cml
       cml                  cml                  cml
   furongh         furong huang                  cml
  hajiagha  mohammad hajiaghayi                  cml
      john       john dickerson                  cml
    ramani    ramani duraiswami                  cml
      root default root account                 root
 scavenger            scavenger            scavenger
    sfeizi         soheil feizi                  cml
   tokekar       pratap tokekar                  cml
      tomg        tom goldstein                  cml
</pre>
You can check your account associations by running the '''show_assoc''' command to see the accounts you are associated with. Please [[HelpDesk | contact staff]] and include your faculty member in the conversation if you do not see the appropriate association.
<pre>
$ show_assoc
      User    Account   Def Acct   Def QOS                                  QOS
---------- ---------- ---------- --------- ------------------------------------
      tomg       tomg                                        default,high,medium
      tomg        cml                                         cpu,default,medium
      tomg  scavenger                                                   scavenger
</pre>
You can also see the total number of trackable resources (TRES) allowed for each account by running the following command. Please make sure you give the appropriate account that you are looking for. The billing number displayed here is the sum of the [[SLURM/Priority#Modern | resource weightings]] for all nodes appropriated to that account.
<pre>
$ sacctmgr show assoc account=tomg format=user,account,qos,grptres
      User    Account                  QOS       GrpTRES
---------- ---------- -------------------- -------------
                 tomg                        billing=8107
</pre>
==QoS==

CML currently has 5 QoS for the '''dpart''' partition (though <code>high_long</code> and <code>very_high</code> may not be available to all faculty accounts), 1 QoS for the '''scavenger''' partition, and 1 QoS for the '''cpu''' partition. If you do not specify a QoS when submitting your job, you will receive the '''default''' QoS. The important distinction is that each QoS allows a different maximum wall time, a different total number of jobs running at once, and a different maximum amount of trackable resources (TRES) per job. In the scavenger QoS, you are additionally restricted by the total amount of TRES per user (across multiple jobs).
<pre>
$ show_qos
        Name     MaxWall MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES
------------ ----------- ------- ------------------------------ ------------------------------ --------------------
      medium  3-00:00:00       2       cpu=8,gres/gpu=2,mem=64G
     default  7-00:00:00       2       cpu=4,gres/gpu=1,mem=32G
        high  1-12:00:00       2     cpu=16,gres/gpu=4,mem=128G
   scavenger  3-00:00:00                                                           gres/gpu=24
      normal
         cpu  7-00:00:00       8
   very_high  1-12:00:00       8     cpu=32,gres/gpu=8,mem=256G                    gres/gpu=12
   high_long 14-00:00:00       8              cpu=32,gres/gpu=8                     gres/gpu=8
</pre>
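For example, an interactive job in the '''high''' QoS can request up to its per-job limits listed above (a sketch; adjust the values to your needs):

<pre>
srun --pty --qos=high --gres=gpu:4 --cpus-per-task=16 --mem=128G --time=1-12:00:00 bash
</pre>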
==GPUs==

Jobs that require GPU resources need to explicitly request those resources in their job submission. This is done through Generic Resource Scheduling (GRES). Users may use the most generic identifier (in this case '''gpu'''), a colon, and a number to select GPUs without explicitly naming the type of GPU (e.g., <code>--gres=gpu:4</code> for 4 GPUs of any type).
<pre>
$ sinfo -o "%20N %10c %10m %25f %40G"
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES
cmlgrad[02,05]       32         385421     Xeon,4216                 gpu:rtx2080ti:7,gpu:rtx3070:1
cml[00-11,13-16],cml 32         353924+    Xeon,4216                 gpu:rtx2080ti:8
cmlcpu[01-04]        20         386675     Xeon,E5-2660              (null)
cmlcpu[00,06-07]     24         386675+    Xeon,E5-2680              (null)
cml12                32         385429     Xeon,4216                 gpu:rtx2080ti:7,gpu:rtxa4000:1
cml[17-29]           32         257654     Zen,EPYC-7282             gpu:rtxa4000:8
</pre>
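A specific GPU type can also be requested by including the type name from the GRES column (a sketch, using the <code>rtxa4000</code> type listed above):

<pre>
srun --pty --gres=gpu:rtxa4000:2 --qos=medium --mem=32G --time=02:00:00 bash
</pre>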
==Job Submission and Management==

Users should review our [[SLURM]] [[SLURM/JobSubmission | job submission]] and [[SLURM/JobStatus | job management]] documentation.

A very quick way to get an interactive shell is to run the following on the submission node. It allocates 1 GPU and 16GB of memory (system RAM) in the '''default''' QoS for a maximum time of 4 hours. If the job goes beyond either of these limits (the memory allocation or the maximum time), it will be terminated immediately.
<pre>
srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash
</pre>

<pre>
[username@cmlsub00:~ ] $ srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash
[username@cml00:~ ] $ nvidia-smi -L
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-20846848-e66d-866c-ecbe-89f2623f3b9a)
</pre>
If you are going to run in a faculty account instead of the default <code>cml</code> account, you will need to specify the <code>--account=</code> flag, as in the example below.
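For example (a sketch; substitute a faculty account that you are associated with, such as one listed by <code>show_assoc</code>):

<pre>
srun --pty --account=tomg --qos=medium --gres=gpu:1 --mem=16G --time=04:00:00 bash
</pre>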
A quick example of running an interactive job using the cpu partition is shown below; the cpu partition uses the default account <code>cml</code>.

<pre>
-bash-4.2$ srun --partition=cpu --qos=cpu bash -c 'echo "Hello World from" `hostname`'
</pre>
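For non-interactive work, the same options can be carried in a batch script as <code>#SBATCH</code> directives. A minimal sketch (the job name and resource values are placeholders; see the [[SLURM/JobSubmission | job submission]] documentation for full details):

<pre>
#!/bin/bash
#SBATCH --job-name=example        # placeholder job name
#SBATCH --partition=dpart         # default partition
#SBATCH --account=cml             # base CML account
#SBATCH --qos=default             # default QoS
#SBATCH --gres=gpu:1              # request 1 GPU of any type
#SBATCH --mem=16G                 # system RAM for the job
#SBATCH --time=04:00:00           # maximum wall time

# Commands below run on the allocated compute node
nvidia-smi -L
</pre>

Submit the script from the submission node with <code>sbatch</code> followed by the script name.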
=Data Storage=

Information on data storage available in CML's computational infrastructure can be found [[Nexus/CML#Storage | here]].