<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.umiacs.umd.edu/umiacs/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Pmehta1</id>
	<title>UMIACS - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.umiacs.umd.edu/umiacs/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Pmehta1"/>
	<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php/Special:Contributions/Pmehta1"/>
	<updated>2026-05-13T06:57:31Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.43.7</generator>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=CML&amp;diff=9354</id>
		<title>CML</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=CML&amp;diff=9354"/>
		<updated>2020-08-24T16:39:29Z</updated>

		<summary type="html">&lt;p&gt;Pmehta1: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The Center for Machine Learning ([https://ml.umd.edu CML]) at the University of Maryland is located within the Institute for Advanced Computer Studies.  The CML has a cluster of computational (CPU/GPU) resources that are available to be scheduled.&lt;br /&gt;
&lt;br /&gt;
= Compute Infrastructure =&lt;br /&gt;
Each of UMIACS&#039; computational clusters is accessed through a submission node.  Users will need to submit jobs through the [[SLURM]] resource manager once they have logged into the submission node.  Each cluster in UMIACS has different quality of service (QoS) levels, one of which is &#039;&#039;&#039;required&#039;&#039;&#039; to be selected when submitting a job.  Many clusters, including this one, also have other resources, such as GPUs, that need to be requested for a job.  &lt;br /&gt;
&lt;br /&gt;
The current submission node(s) for &#039;&#039;&#039;CML&#039;&#039;&#039; are:&lt;br /&gt;
* &amp;lt;code&amp;gt;cmlsub00.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Center for Machine Learning GPU resources are a small investment from the base Center funds and a number of investments by individual faculty members.  The scheduler&#039;s resources are modeled around this concept.  This means there are additional Slurm accounts that users will need to be aware of if they are submitting in the non-scavenger partition.&lt;br /&gt;
&lt;br /&gt;
== Partitions ==&lt;br /&gt;
There are three partitions to the CML [[SLURM]] computational infrastructure.  If you do not specify a partition when submitting your job, you will receive the &#039;&#039;&#039;dpart&#039;&#039;&#039; partition.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;dpart&#039;&#039;&#039; - This is the default partition and job allocations are guaranteed.&lt;br /&gt;
* &#039;&#039;&#039;scavenger&#039;&#039;&#039; - This is the alternate partition that allows longer run times and more resources, but jobs in it are preemptable when jobs in the &#039;&#039;&#039;dpart&#039;&#039;&#039; partition are ready to be scheduled.&lt;br /&gt;
* &#039;&#039;&#039;cpu&#039;&#039;&#039; - This partition is for CPU-focused jobs and job allocations are guaranteed.&lt;br /&gt;
&lt;br /&gt;
== Accounts ==&lt;br /&gt;
The Center has a base account &amp;lt;code&amp;gt;cml&amp;lt;/code&amp;gt;, which has a modest amount of resources (currently 16 GPUs) available in it.  Faculty who have invested in the cluster have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed GPU resources corresponding to the amount they invested.  If you do not specify an account when submitting your job, you will receive the &amp;lt;code&amp;gt;cml&amp;lt;/code&amp;gt; account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# sacctmgr show accounts&lt;br /&gt;
   Account                Descr                  Org&lt;br /&gt;
---------- -------------------- --------------------&lt;br /&gt;
   abhinav  abhinav shrivastava                  cml&lt;br /&gt;
       cml                  cml                  cml&lt;br /&gt;
   furongh         furong huang                  cml&lt;br /&gt;
      john       john dickerson                  cml&lt;br /&gt;
      root default root account                 root&lt;br /&gt;
 scavenger            scavenger            scavenger&lt;br /&gt;
    sfeizi         soheil feizi                  cml&lt;br /&gt;
      tomg        tom goldstein                  cml&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can check your account associations by running the &#039;&#039;&#039;show_assoc&#039;&#039;&#039; command.  Please contact staff@umiacs.umd.edu and CC your faculty member if you do not see the appropriate association. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_assoc&lt;br /&gt;
      User    Account   Def Acct   Def QOS                                  QOS&lt;br /&gt;
---------- ---------- ---------- --------- ------------------------------------&lt;br /&gt;
      tomg       tomg                                       default,high,medium&lt;br /&gt;
      tomg        cml                                        cpu,default,medium&lt;br /&gt;
      tomg  scavenger                                                 scavenger&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command.  Be sure to specify the account you are looking for.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sacctmgr show assoc account=tomg format=user,account,qos,grptres&lt;br /&gt;
      User    Account                  QOS       GrpTRES&lt;br /&gt;
---------- ---------- -------------------- -------------&lt;br /&gt;
                 tomg                        gres/gpu=24&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== QoS ==&lt;br /&gt;
CML currently has 3 QoS for the &#039;&#039;&#039;dpart&#039;&#039;&#039; partition, 1 QoS for the &#039;&#039;&#039;scavenger&#039;&#039;&#039; partition, and 1 QoS for the &#039;&#039;&#039;cpu&#039;&#039;&#039; partition.  You are &#039;&#039;&#039;required&#039;&#039;&#039; to specify a QoS when submitting your job.  Each QoS differs in its maximum wall time, the total number of jobs that can run at once, and the maximum number of trackable resources (TRES) per job.  The scavenger QoS imposes one additional constraint: a limit on the total number of TRES per user (over multiple jobs). &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# show_qos&lt;br /&gt;
      Name     MaxWall MaxJobs                        MaxTRES     MaxTRESPU&lt;br /&gt;
---------- ----------- ------- ------------------------------ -------------&lt;br /&gt;
    medium  3-00:00:00       1       cpu=8,mem=64G,gres/gpu=2&lt;br /&gt;
   default  7-00:00:00       2       cpu=4,mem=32G,gres/gpu=1&lt;br /&gt;
      high  1-12:00:00       1     cpu=16,mem=128G,gres/gpu=4&lt;br /&gt;
 scavenger  3-00:00:00                                          gres/gpu=36&lt;br /&gt;
       cpu  1-00:00:00       1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== GPUs ==&lt;br /&gt;
Jobs that require GPU resources need to explicitly request the resources within their job submission.  This is done through Generic Resource Scheduling (GRES).  Currently all nodes in the cluster are homogeneous; however, in the future this may not be the case.  Users may use the most generic identifier (in this case &#039;&#039;&#039;gpu&#039;&#039;&#039;), a colon, and a number to select GPUs without explicitly naming the type (e.g. &amp;lt;code&amp;gt;--gres gpu:4&amp;lt;/code&amp;gt; for 4 GPUs).  &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sinfo -o &amp;quot;%15N %10c %10m  %25f %25G&amp;quot;&lt;br /&gt;
NODELIST        CPUS       MEMORY      AVAIL_FEATURES            GRES&lt;br /&gt;
cml[00-09]      32         1+          (null)                    gpu:rtx2080ti:8&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Job Submission and Management ==&lt;br /&gt;
Users should review our [[SLURM]] [[SLURM/JobSubmission | job submission]] and [[SLURM/JobStatus | job management]] documentation.  &lt;br /&gt;
&lt;br /&gt;
A very quick way to get an interactive shell is the following command, run on the submission node.  It will allocate 1 GPU with 16GB of memory (system RAM) in the default QoS for a maximum time of 4 hours.  If the job exceeds either limit (the memory allocation or the maximum time), it will be terminated immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[derek@cmlsub00:~ ] $ srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash&lt;br /&gt;
[derek@cml00:~ ] $ nvidia-smi -L&lt;br /&gt;
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-20846848-e66d-866c-ecbe-89f2623f3b9a)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you are going to run in a faculty account instead of the default &amp;lt;code&amp;gt;cml&amp;lt;/code&amp;gt; account, you will need to specify the &amp;lt;code&amp;gt;--account=&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&lt;br /&gt;
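For example, to run the same interactive job under a faculty account (here &amp;lt;code&amp;gt;tomg&amp;lt;/code&amp;gt;, one of the accounts shown in the association listing above; substitute an account you are associated with):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
srun --pty --account=tomg --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;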
&lt;br /&gt;
A quick example of running a job in the cpu partition.  The cpu partition uses the default account &amp;lt;code&amp;gt;cml&amp;lt;/code&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
-bash-4.2$ srun --partition=cpu --qos=cpu bash -c &#039;echo &amp;quot;Hello World from&amp;quot; `hostname`&#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Data Storage=&lt;br /&gt;
Until the final storage investment arrives, we have made available a temporary allocation of storage.  This section is subject to change.  There are 3 types of storage available to users in the CML:&lt;br /&gt;
* Home directories&lt;br /&gt;
* Project directories&lt;br /&gt;
* Scratch directories&lt;br /&gt;
&lt;br /&gt;
== Home Directories ==&lt;br /&gt;
Home directories in the CML computational infrastructure are available from the Institute&#039;s [[NFShomes]] as &amp;lt;code&amp;gt;/nfshomes/USERNAME&amp;lt;/code&amp;gt; where USERNAME is your username.  These home directories have very limited storage and are intended for your personal files, configuration, and source code.  Your home directory is &#039;&#039;&#039;not&#039;&#039;&#039; intended for data sets or other large scale data holdings.  Users are encouraged to utilize our [[GitLab]] infrastructure to host their code repositories.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE&#039;&#039;&#039;: To check your quota on this directory you will need to use the &amp;lt;code&amp;gt;quota -s&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
Your home directory data is fully protected: it has snapshots and is backed up nightly.&lt;br /&gt;
&lt;br /&gt;
== Project Directories ==&lt;br /&gt;
Users within the CML compute infrastructure can request project-based allocations of up to 2TB for up to 120 days from [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu] with approval from a CML faculty member and the director of CML.  These allocations will be available from &#039;&#039;&#039;/fs/cml-projects&#039;&#039;&#039; under a name that you provide when you request the allocation.  Once the allocation period is over, the user will be contacted and given a window of opportunity to clean up and secure their data before staff remove the allocation.&lt;br /&gt;
&lt;br /&gt;
This data is backed up nightly.&lt;br /&gt;
&lt;br /&gt;
== Scratch Directories ==&lt;br /&gt;
Scratch data has no data protection including no snapshots and the data is not backed up. There are two types of scratch directories in the CML compute infrastructure:&lt;br /&gt;
* Network scratch directory&lt;br /&gt;
* Local scratch directories&lt;br /&gt;
&lt;br /&gt;
=== Network Scratch Directory===&lt;br /&gt;
Users granted access to the CML compute infrastructure are each allocated &#039;&#039;&#039;400GB&#039;&#039;&#039; of network attached scratch.  This is available as &amp;lt;code&amp;gt;/cmlscratch/USERNAME&amp;lt;/code&amp;gt; where USERNAME is your username.  This directory is &#039;&#039;&#039;automounted&#039;&#039;&#039; so you will need to &amp;lt;code&amp;gt;cd&amp;lt;/code&amp;gt; into the directory or request/specify a fully qualified file path to access this.&lt;br /&gt;
&lt;br /&gt;
Users may request an additional allocation of scratch space up to a total of &#039;&#039;&#039;800GB&#039;&#039;&#039; by contacting [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].&lt;br /&gt;
&lt;br /&gt;
=== Local Scratch Directories===&lt;br /&gt;
Each computational node that a user can schedule compute jobs on has one or more local scratch directories.  These are always named &amp;lt;code&amp;gt;/scratch0&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;/scratch1&amp;lt;/code&amp;gt;, etc.  These are almost always more performant than any other storage available to the job.  However, users must stage their data within the confines of their job and stage the data out before the end of their job.&lt;br /&gt;
&lt;br /&gt;
These local scratch directories have a tmpwatch job which will &#039;&#039;&#039;delete unmodified data after 120 days&#039;&#039;&#039;.  Please make sure you secure any data you write to these directories at the end of your job.&lt;br /&gt;
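&lt;br /&gt;
A sketch of this staging pattern as a batch script (the paths and filenames here are illustrative; USERNAME is your username):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --qos=default&lt;br /&gt;
#SBATCH --mem=16G&lt;br /&gt;
#SBATCH --time=04:00:00&lt;br /&gt;
&lt;br /&gt;
# stage input data into node-local scratch&lt;br /&gt;
cp /cmlscratch/USERNAME/input.tar /scratch0&lt;br /&gt;
tar xf /scratch0/input.tar -C /scratch0&lt;br /&gt;
&lt;br /&gt;
# ... run your computation against /scratch0 ...&lt;br /&gt;
&lt;br /&gt;
# stage results out before the job ends&lt;br /&gt;
cp -r /scratch0/results /cmlscratch/USERNAME&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;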
&lt;br /&gt;
== Data sets ==&lt;br /&gt;
The following data sets are available read only for the Center.  If there are other data sets that you would like to see curated and available, please contact [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Data set&lt;br /&gt;
! Path&lt;br /&gt;
|-&lt;br /&gt;
| CelebA&lt;br /&gt;
| /fs/cml-datasets/CelebA&lt;br /&gt;
|-&lt;br /&gt;
| Cityscapes&lt;br /&gt;
| /fs/cml-datasets/cityscapes&lt;br /&gt;
|-&lt;br /&gt;
| COCO&lt;br /&gt;
| /fs/cml-datasets/coco&lt;br /&gt;
|-&lt;br /&gt;
| Diversity in Faces&lt;br /&gt;
| /fs/cml-datasets/diversity_in_faces&lt;br /&gt;
|-&lt;br /&gt;
| ImageNet ILSVRC2012&lt;br /&gt;
| /fs/cml-datasets/ImageNet/ILSVRC2012&lt;br /&gt;
|-&lt;br /&gt;
| LFW&lt;br /&gt;
| /fs/cml-datasets/facial_test_data&lt;br /&gt;
|-&lt;br /&gt;
| LibriSpeech&lt;br /&gt;
| /fs/cml-datasets/LibriSpeech&lt;br /&gt;
|-&lt;br /&gt;
| LSUN&lt;br /&gt;
| /fs/cml-datasets/LSUN&lt;br /&gt;
|-&lt;br /&gt;
| MegaFace&lt;br /&gt;
| /fs/cml-datasets/megaface&lt;br /&gt;
|-&lt;br /&gt;
| MS-Celeb-1M&lt;br /&gt;
| /fs/cml-datasets/MS_Celeb_aligned_112&lt;br /&gt;
|-&lt;br /&gt;
| roberta&lt;br /&gt;
| /fs/cml-datasets/roberta&lt;br /&gt;
|-&lt;br /&gt;
| ShapeNetCore.v2&lt;br /&gt;
| /fs/cml-datasets/ShapeNetCore.v2&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Pmehta1</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=CML&amp;diff=9025</id>
		<title>CML</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=CML&amp;diff=9025"/>
		<updated>2020-04-06T18:53:36Z</updated>

		<summary type="html">&lt;p&gt;Pmehta1: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The Center for Machine Learning ([https://ml.umd.edu CML]) at the University of Maryland is located within the Institute for Advanced Computer Studies.  The CML has a cluster of computational (CPU/GPU) resources that are available to be scheduled.&lt;br /&gt;
&lt;br /&gt;
=Compute Infrastructure=&lt;br /&gt;
&lt;br /&gt;
Each of UMIACS&#039; computational clusters is accessed through the submission node.  Users will need to submit jobs through the [[SLURM]] resource manager once they have logged into the submission node.  Each cluster in UMIACS has different quality of service (QoS) levels that are &#039;&#039;&#039;required&#039;&#039;&#039; to be selected upon submission of a job, and many, like this one, have other resources such as GPUs that need to be requested for a job.  &lt;br /&gt;
&lt;br /&gt;
The current submission node(s) for &#039;&#039;&#039;CML&#039;&#039;&#039; are:&lt;br /&gt;
* &amp;lt;code&amp;gt;cmlsub00.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Center for Machine Learning GPU resources are a small investment from the base Center funds and a number of investments by individual faculty members.  The scheduler&#039;s resources going forward are therefore modeled around this concept.  This means there are additional Slurm accounts that users will need to be aware of if they are submitting in the non-scavenger partition.&lt;br /&gt;
&lt;br /&gt;
== Partitions ==&lt;br /&gt;
There are two partitions to the CML [[SLURM]] computational infrastructure.  If you do not specify a partition when submitting your job, you will receive the &#039;&#039;&#039;dpart&#039;&#039;&#039; partition.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;dpart&#039;&#039;&#039; - This is the default partition and job allocations are guaranteed.&lt;br /&gt;
* &#039;&#039;&#039;scavenger&#039;&#039;&#039; - This is the alternate partition that allows longer run times and more resources, but jobs in it are preemptable when jobs in the &#039;&#039;&#039;dpart&#039;&#039;&#039; partition are ready to be scheduled.&lt;br /&gt;
&lt;br /&gt;
== Accounts ==&lt;br /&gt;
The Center has a base account &amp;lt;code&amp;gt;cml&amp;lt;/code&amp;gt;, which has a modest amount of resources (currently 16 GPUs) available in it.  Faculty who have invested in the cluster have an additional account that provides the guaranteed GPU resources they invested in.  If you do not specify an account when submitting your job, you will receive the &amp;lt;code&amp;gt;cml&amp;lt;/code&amp;gt; account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sacctmgr show accounts&lt;br /&gt;
   Account                Descr                  Org&lt;br /&gt;
---------- -------------------- --------------------&lt;br /&gt;
       cml                  cml                  cml&lt;br /&gt;
   furongh         furong huang                  cml&lt;br /&gt;
      john       john dickerson                  cml&lt;br /&gt;
      root default root account                 root&lt;br /&gt;
 scavenger            scavenger            scavenger&lt;br /&gt;
    sfeizi         soheil feizi                  cml&lt;br /&gt;
      tomg        tom goldstein                  cml&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can check your account associations by running the &#039;&#039;&#039;show_assoc&#039;&#039;&#039; command.  Please contact staff@umiacs.umd.edu and CC your faculty member if you do not see the appropriate association. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_assoc&lt;br /&gt;
      User    Account   Def Acct   Def QOS                                  QOS&lt;br /&gt;
---------- ---------- ---------- --------- ------------------------------------&lt;br /&gt;
      tomg       tomg                                       default,high,medium&lt;br /&gt;
      tomg        cml                                            default,medium&lt;br /&gt;
      tomg  scavenger                                                 scavenger&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command.  Be sure to specify the account you are looking for.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sacctmgr show assoc account=tomg format=user,account,qos,grptres&lt;br /&gt;
      User    Account                  QOS       GrpTRES&lt;br /&gt;
---------- ---------- -------------------- -------------&lt;br /&gt;
                 tomg                        gres/gpu=24&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== QoS ==&lt;br /&gt;
CML currently has 3 QoS for the &#039;&#039;&#039;dpart&#039;&#039;&#039; partition and 1 QoS for the &#039;&#039;&#039;scavenger&#039;&#039;&#039; partition.  You are &#039;&#039;&#039;required&#039;&#039;&#039; to specify a QoS when submitting your job.  Each QoS differs in its maximum wall time, the total number of jobs that can run at once, and the maximum number of trackable resources (TRES) per job.  The scavenger QoS imposes one additional constraint: a limit on the total number of TRES per user (over multiple jobs). &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# show_qos&lt;br /&gt;
      Name     MaxWall MaxJobs                        MaxTRES     MaxTRESPU&lt;br /&gt;
---------- ----------- ------- ------------------------------ -------------&lt;br /&gt;
    medium  3-00:00:00       1       cpu=8,mem=64G,gres/gpu=2&lt;br /&gt;
   default  7-00:00:00       2       cpu=4,mem=32G,gres/gpu=1&lt;br /&gt;
      high  1-12:00:00       1     cpu=16,mem=128G,gres/gpu=4&lt;br /&gt;
 scavenger  3-00:00:00                                          gres/gpu=36&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== GPUs ==&lt;br /&gt;
Jobs that require GPU resources need to explicitly request the resources within their job submission.  This is done through Generic Resource Scheduling (GRES).  Currently all nodes in the cluster are homogeneous; however, in the future this may not be the case.  Users may use the most generic identifier (in this case &#039;&#039;&#039;gpu&#039;&#039;&#039;), a colon, and a number to select GPUs without explicitly naming the type (e.g. &amp;lt;code&amp;gt;--gres gpu:4&amp;lt;/code&amp;gt; for 4 GPUs).  &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sinfo -o &amp;quot;%15N %10c %10m  %25f %25G&amp;quot;&lt;br /&gt;
NODELIST        CPUS       MEMORY      AVAIL_FEATURES            GRES&lt;br /&gt;
cml[00-09]      32         1+          (null)                    gpu:rtx2080ti:8&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Job Submission and Management ==&lt;br /&gt;
Users should review our [[SLURM]] [[SLURM/JobSubmission | job submission]] and [[SLURM/JobStatus | job management]] documentation.  &lt;br /&gt;
&lt;br /&gt;
A very quick way to get an interactive shell is the following command, run on the submission node.  It will allocate 1 GPU with 16GB of memory (system RAM) in the default QoS for a maximum time of 4 hours.  If the job exceeds either limit (the memory allocation or the maximum time), it will be terminated immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[derek@cmlsub00:~ ] $ srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash&lt;br /&gt;
[derek@cml00:~ ] $ nvidia-smi -L&lt;br /&gt;
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-20846848-e66d-866c-ecbe-89f2623f3b9a)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you are going to run in a faculty account instead of the default &amp;lt;code&amp;gt;cml&amp;lt;/code&amp;gt; account, you will need to specify the &amp;lt;code&amp;gt;--account=&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&lt;br /&gt;
=Data Storage=&lt;br /&gt;
&lt;br /&gt;
Until the final storage investment arrives, we have made available a temporary allocation of storage.  This section is subject to change.  There are 3 types of storage available to users in the CML: home directories, project directories, and scratch directories.&lt;br /&gt;
&lt;br /&gt;
== Home Directories ==&lt;br /&gt;
 &lt;br /&gt;
Home directories in the CML computational infrastructure are available from the Institute&#039;s [[NFShomes]] as &amp;lt;code&amp;gt;/nfshomes/USERNAME&amp;lt;/code&amp;gt; where USERNAME is your username.  These home directories have very limited storage and are intended for your personal files, configuration, and source code.  Your home directory is &#039;&#039;&#039;not&#039;&#039;&#039; intended for data sets or other large scale data holdings.  Users are encouraged to utilize our [[GitLab]] infrastructure to host their code repositories.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE&#039;&#039;&#039;: To check your quota on this directory you will need to use the &amp;lt;code&amp;gt;quota -s&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
Your home directory data is fully protected: it has snapshots and is backed up nightly.&lt;br /&gt;
&lt;br /&gt;
== Project Directories ==&lt;br /&gt;
&lt;br /&gt;
Users within the CML compute infrastructure can request project-based allocations of up to 2TB for up to 120 days from [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu] with approval from a CML faculty member and the director.  These allocations will be available from &#039;&#039;&#039;/fs/cml-projects&#039;&#039;&#039; under a name that you provide when you request the allocation.  Once the allocation period is over, the user will be contacted and given a window of opportunity to clean up and secure their data before staff remove the allocation.&lt;br /&gt;
&lt;br /&gt;
This data is backed up nightly.&lt;br /&gt;
&lt;br /&gt;
== Scratch Directories ==&lt;br /&gt;
&lt;br /&gt;
There are two types of scratch directories in the CML compute infrastructure: network and local scratch directories.  Scratch data has no data protection, including no snapshots, and the data is not backed up. &lt;br /&gt;
&lt;br /&gt;
=== Network Scratch Directory===&lt;br /&gt;
Users granted access to the CML compute infrastructure are each allocated &#039;&#039;&#039;400GB&#039;&#039;&#039; of network attached scratch.  This is available as &amp;lt;code&amp;gt;/cmlscratch/USERNAME&amp;lt;/code&amp;gt; where USERNAME is your username.  This directory is &#039;&#039;&#039;automounted&#039;&#039;&#039; so you will need to &amp;lt;code&amp;gt;cd&amp;lt;/code&amp;gt; into the directory or request/specify a fully qualified file path to access this.&lt;br /&gt;
&lt;br /&gt;
Users may request an additional allocation of scratch space up to &#039;&#039;&#039;800GB&#039;&#039;&#039; by contacting [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].&lt;br /&gt;
&lt;br /&gt;
=== Local Scratch Directory===&lt;br /&gt;
Each computational node that a user can schedule compute jobs on has one or more local scratch directories.  These are always named &amp;lt;code&amp;gt;/scratch0&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;/scratch1&amp;lt;/code&amp;gt;, etc.  These are almost always more performant than any other storage available to the job.  However, users must stage their data within the confines of their job and stage the data out before the end of their job.&lt;br /&gt;
&lt;br /&gt;
These local scratch directories will have a tmpwatch job which will &#039;&#039;&#039;delete unmodified data after 120 days&#039;&#039;&#039;.  Please make sure you secure any data you write to these directories at the end of your job.&lt;br /&gt;
&lt;br /&gt;
== Data sets ==&lt;br /&gt;
&lt;br /&gt;
The following data sets are available read only for the Center.  If there are other data sets that you would like to see curated and available please contact [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Data set&lt;br /&gt;
! Path&lt;br /&gt;
|-&lt;br /&gt;
| CelebA&lt;br /&gt;
| /fs/cml-datasets/CelebA&lt;br /&gt;
|-&lt;br /&gt;
| COCO&lt;br /&gt;
| /fs/cml-datasets/coco&lt;br /&gt;
|-&lt;br /&gt;
| ImageNet ILSVRC2012&lt;br /&gt;
| /fs/cml-datasets/ImageNet/ILSVRC2012&lt;br /&gt;
|-&lt;br /&gt;
| roberta&lt;br /&gt;
| /fs/cml-datasets/roberta&lt;br /&gt;
|-&lt;br /&gt;
| ShapeNetCore.v2&lt;br /&gt;
| /fs/cml-datasets/ShapeNetCore.v2&lt;br /&gt;
|-&lt;br /&gt;
| LSUN&lt;br /&gt;
| /fs/cml-datasets/LSUN&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Pmehta1</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=SLURM/JobSubmission&amp;diff=8687</id>
		<title>SLURM/JobSubmission</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=SLURM/JobSubmission&amp;diff=8687"/>
		<updated>2019-11-21T20:27:31Z</updated>

		<summary type="html">&lt;p&gt;Pmehta1: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Job Submission=&lt;br /&gt;
&lt;br /&gt;
SLURM offers a variety of ways to run jobs. It is important to understand the different options available and how to request the resources required for a job in order for it to run successfully. All job submission should be done from submit nodes; any computational code should be run in a job allocation on compute nodes. The following commands outline how to allocate resources on the compute nodes and submit processes to be run on the allocated nodes.&lt;br /&gt;
&lt;br /&gt;
Please note that the hard maximum number of jobs the SLURM scheduler can handle is 10000.  It is best to limit the number of jobs you have submitted at any given time to less than half this amount, in case another user also wants to submit a large number of jobs.&lt;br /&gt;
&lt;br /&gt;
==srun==&lt;br /&gt;
&amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; is the command used to run a process on the compute nodes in the cluster.  You pass it a command (which could be a script); the command is run on a compute node, and &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; returns when it completes.  &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; accepts many command line options to specify the resources required by the command passed to it.  Some common command line arguments are listed below, and full documentation of all available options is available in the man page for &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;, which can be accessed by running &amp;lt;code&amp;gt;man srun&amp;lt;/code&amp;gt;. &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
tgray26@opensub01:srun --mem=100mb --time=1:00:00 bash -c &#039;echo &amp;quot;Hello World from&amp;quot; `hostname`&#039;&lt;br /&gt;
Hello World from openlab06.umiacs.umd.edu&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
It is important to understand that &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; is an interactive command.  By default, input to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; is broadcast to all compute nodes running your process and output from the compute nodes is redirected to &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;.  This behavior can be changed; however, &#039;&#039;&#039;srun will always wait for the command passed to finish before exiting, so if you start a long running process and end your terminal session, your process will stop running on the compute nodes and your job will end&#039;&#039;&#039;.  To run a non-interactive submission that will remain running after you logout, you will need to wrap your &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; commands in a batch script and submit it with [[#sbatch | sbatch]].&lt;br /&gt;
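A minimal sketch of such a batch script (contents illustrative), saved as, e.g., &amp;lt;code&amp;gt;hello.sh&amp;lt;/code&amp;gt; and submitted with &amp;lt;code&amp;gt;sbatch hello.sh&amp;lt;/code&amp;gt;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --job-name=helloWorld&lt;br /&gt;
#SBATCH --mem=100mb&lt;br /&gt;
#SBATCH --time=1:00:00&lt;br /&gt;
#SBATCH --output=helloWorld.out&lt;br /&gt;
&lt;br /&gt;
srun bash -c &#039;echo &amp;quot;Hello World from&amp;quot; `hostname`&#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;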
===Common srun arguments===&lt;br /&gt;
* &amp;lt;code&amp;gt;--mem=1gb&amp;lt;/code&amp;gt; &#039;&#039;if no unit is given MB is assumed&#039;&#039;&lt;br /&gt;
* &amp;lt;code&amp;gt;--nodes=2&amp;lt;/code&amp;gt; &#039;&#039;if passed to srun, the given command will be run concurrently on each node&#039;&#039;&lt;br /&gt;
* &amp;lt;code&amp;gt;--qos=dpart&amp;lt;/code&amp;gt; &#039;&#039;to see the available QOS options on a cluster, run&#039;&#039; &amp;lt;code&amp;gt;show_qos&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;--time=hh:mm:ss&amp;lt;/code&amp;gt; &#039;&#039;time needed to run your job&#039;&#039;&lt;br /&gt;
* &amp;lt;code&amp;gt;--job-name=helloWorld&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;--output filename&amp;lt;/code&amp;gt; &#039;&#039;file to redirect stdout to&#039;&#039;&lt;br /&gt;
* &amp;lt;code&amp;gt;--error filename&amp;lt;/code&amp;gt; &#039;&#039;file to redirect stderr to&#039;&#039;&lt;br /&gt;
* &amp;lt;code&amp;gt;--partition $PNAME&amp;lt;/code&amp;gt; &#039;&#039;request job run in the $PNAME partition&#039;&#039;&lt;br /&gt;
* &amp;lt;code&amp;gt;--ntasks 2&amp;lt;/code&amp;gt; &#039;&#039;request 2 &amp;quot;tasks&amp;quot;, which map to cores on a CPU; if passed to srun, the given command will be run concurrently on each core&#039;&#039;&lt;br /&gt;
* &amp;lt;code&amp;gt;--account=accountname&amp;lt;/code&amp;gt; &#039;&#039;use qos specific to an account&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
===Interactive Shell Sessions===&lt;br /&gt;
An interactive shell session on a compute node can be useful for debugging or developing code that isn&#039;t ready to be run as a batch job. To get an interactive shell on a node, use &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; to invoke a shell:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
tgray26@opensub01:srun --pty --mem 1gb --time=01:00:00 bash&lt;br /&gt;
tgray26@openlab06:&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Please do not leave interactive shells running for long periods of time when you are not working. This blocks resources from being used by everyone else.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==salloc==&lt;br /&gt;
The salloc command can also be used to request resources be allocated without needing a batch script. Running salloc with a list of resources will allocate the resources you requested, create a job, and drop you into a subshell with the environment variables necessary to run commands in the newly created job allocation. When your time is up or you exit the subshell, your job allocation will be relinquished.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
tgray26@opensub00:salloc -N 1 --mem=2gb --time=01:00:00&lt;br /&gt;
salloc: Granted job allocation 159&lt;br /&gt;
tgray26@opensub00:srun /usr/bin/hostname&lt;br /&gt;
openlab00.umiacs.umd.edu&lt;br /&gt;
tgray26@opensub00:exit&lt;br /&gt;
exit&lt;br /&gt;
salloc: Relinquishing job allocation 159&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Please note that any commands not invoked with srun will be run locally on the submit node. Please be careful when using salloc.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==sbatch==&lt;br /&gt;
The sbatch command allows you to write a batch script to be submitted and run non-interactively on the compute nodes. To run a simple Hello World command on the compute nodes, you could write a file, helloWorld.sh, with the following contents:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
srun bash -c &#039;echo Hello World from `hostname`&#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Then you need to submit the script with sbatch and request resources:&lt;br /&gt;
&amp;lt;pre&amp;gt;tgray26@opensub00:sbatch --mem=1gb --time=1:00:00 helloWorld.sh&lt;br /&gt;
Submitted batch job 121&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
SLURM will return a job number that you can use to check the status of your job with squeue:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
tgray26@opensub00:squeue&lt;br /&gt;
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
               121     dpart helloWor  tgray26  R       0:01      2 openlab[00-01]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
====Advanced Batch Scripts====&lt;br /&gt;
You can also write a batch script with all of your resources/options defined in the script itself. This is useful for jobs that need to be run 10s/100s/1000s of times. You can then handle any necessary environment setup and run commands on the resources you requested by invoking commands with srun. The srun commands can also be more complex and can be told to use only portions of your entire job allocation; each of these distinct srun commands makes up one &amp;quot;job step&amp;quot;. The batch script is run on the first node of your job allocation, and each job step is run on whatever resources you tell it to use. The following example batch script requests 2 nodes in the cluster, loads a specific version of Python into the environment, and submits two job steps, each using one node. Since srun blocks until the command finishes, the &#039;&amp;amp;&#039; operator is used to background the processes so that both job steps can run at once; this means the wait command is then needed to block processing until all background processes have finished.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
&lt;br /&gt;
# Lines that begin with #SBATCH specify options to be used by SLURM for scheduling&lt;br /&gt;
&lt;br /&gt;
#SBATCH --job-name=helloWorld                                   # sets the job name&lt;br /&gt;
#SBATCH --output helloWorld.out.%j                              # indicates a file to redirect STDOUT to; %j is the jobid &lt;br /&gt;
#SBATCH --error helloWorld.out.%j                               # indicates a file to redirect STDERR to; %j is the jobid&lt;br /&gt;
#SBATCH --time=00:05:00                                         # how long you think your job will take to complete; format=hh:mm:ss&lt;br /&gt;
#SBATCH --qos=dpart                                             # set QOS, this will determine what resources can be requested&lt;br /&gt;
#SBATCH --nodes=2                                               # number of nodes to allocate for your job&lt;br /&gt;
#SBATCH --ntasks=4                                              # request 4 cpu cores be reserved for your job in total&lt;br /&gt;
#SBATCH --ntasks-per-node=2                                     # request 2 cpu cores be reserved per node&lt;br /&gt;
#SBATCH --mem 1gb                                               # memory required by job; if unit is not specified MB will be assumed&lt;br /&gt;
&lt;br /&gt;
module load Python/2.7.9                                        # run any commands necessary to setup your environment&lt;br /&gt;
&lt;br /&gt;
srun -N 1 --mem=512mb bash -c &amp;quot;hostname; python --version&amp;quot; &amp;amp;    # use srun to invoke commands within your job; using an &#039;&amp;amp;&#039;&lt;br /&gt;
srun -N 1 --mem=512mb bash -c &amp;quot;hostname; python --version&amp;quot; &amp;amp;    # will background the process allowing them to run concurrently&lt;br /&gt;
wait                                                            # wait for any background processes to complete&lt;br /&gt;
&lt;br /&gt;
# once the end of the batch script is reached your job allocation will be revoked&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another useful thing to know is that you can pass additional arguments into your sbatch scripts on the command line and reference them inside the script as &amp;lt;code&amp;gt;${1}&amp;lt;/code&amp;gt; for the first argument, &amp;lt;code&amp;gt;${2}&amp;lt;/code&amp;gt; for the second, and so on.&lt;br /&gt;
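As a minimal sketch (the script name, argument value, and resource amounts here are hypothetical), such a script might look like:&lt;br /&gt;

```shell
#!/bin/bash
# helloArg.sh -- hypothetical example of referencing sbatch command-line arguments
#SBATCH --job-name=helloArg
#SBATCH --time=00:05:00
#SBATCH --mem=100mb

# ${1} expands (inside the batch script) to the first argument given after
# the script name; \$(hostname) is escaped so it expands on the compute node
srun bash -c "echo Processing ${1} on \$(hostname)"
```

Submitting this as &amp;lt;code&amp;gt;sbatch helloArg.sh myDataset&amp;lt;/code&amp;gt; would make &amp;lt;code&amp;gt;${1}&amp;lt;/code&amp;gt; expand to &amp;lt;code&amp;gt;myDataset&amp;lt;/code&amp;gt; inside the script.&lt;br /&gt;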
&lt;br /&gt;
====More Examples====&lt;br /&gt;
&lt;br /&gt;
* [[SLURM/ArrayJobs]]&lt;br /&gt;
&lt;br /&gt;
===scancel===&lt;br /&gt;
The scancel command can be used to cancel job allocations or job steps that are no longer needed. It can be passed individual job IDs, or options to cancel all of your jobs or all jobs that meet certain criteria.&lt;br /&gt;
*&amp;lt;code&amp;gt;scancel 255&amp;lt;/code&amp;gt;     &#039;&#039;cancel job 255&#039;&#039;&lt;br /&gt;
*&amp;lt;code&amp;gt;scancel 255.3&amp;lt;/code&amp;gt;     &#039;&#039;cancel job step 3 of job 255&#039;&#039;&lt;br /&gt;
*&amp;lt;code&amp;gt;scancel --user tgray26 --partition dpart&amp;lt;/code&amp;gt;    &#039;&#039;cancel all jobs for tgray26 in the dpart partition&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Identifying Resources and Features=&lt;br /&gt;
The sinfo command can show you additional features of nodes in the cluster, but you need to ask it to show some non-default options using a command like this&lt;br /&gt;
&amp;lt;code&amp;gt;sinfo -o &amp;quot;%15N %10c %10m  %25f %10G&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sinfo -o &amp;quot;%40N %8c %8m  %20f %25G&amp;quot;&lt;br /&gt;
NODELIST                                 CPUS     MEMORY    AVAIL_FEATURES       GRES&lt;br /&gt;
openlab[30-33]                           64       257759    Opteron,6274         (null)&lt;br /&gt;
openlab[00-07]                           8        7822      Opteron,2354         (null)&lt;br /&gt;
openlab[10-11,13-18,20-23,25,27-29]      16       23939     Xeon,x5560           (null)&lt;br /&gt;
openlab08                                32       128720    Xeon,E5-2690         gpu:k20:2&lt;br /&gt;
openlab09                                32       128722    Xeon,E5-2690         gpu:m40:1,gpu:k20:2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also identify further specific information about a node using [https://wiki.umiacs.umd.edu/umiacs/index.php/SLURM/ClusterStatus#scontrol scontrol].&lt;br /&gt;
&lt;br /&gt;
=Requesting GPUs=&lt;br /&gt;
If you need to do processing on a GPU, you will need to request that your job have access to GPUs, just as you need to request processors or cpu cores. You will also need to make sure that you submit your job to the correct partition, since nodes with GPUs are often put into their own partition to prevent the nodes from being tied up by jobs that don&#039;t utilize GPUs. In SLURM, GPUs are considered &amp;quot;generic resources&amp;quot;, also known as GRES. To request that some number of GPUs be reserved/available for your job, use the flag &amp;lt;code&amp;gt;--gres=gpu:2&amp;lt;/code&amp;gt;; if there are multiple types of GPUs available in the cluster and you need a specific type, you can provide the type in the gres flag: &amp;lt;code&amp;gt;--gres=gpu:k20:1&amp;lt;/code&amp;gt;.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
tgray26@opensub01:srun --pty --partition gpu --qos=gpu --gres=gpu:2 nvidia-smi&lt;br /&gt;
Wed Jul 13 15:33:18 2016&lt;br /&gt;
+------------------------------------------------------+&lt;br /&gt;
| NVIDIA-SMI 361.28     Driver Version: 361.28         |&lt;br /&gt;
|-------------------------------+----------------------+----------------------+&lt;br /&gt;
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |&lt;br /&gt;
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |&lt;br /&gt;
|===============================+======================+======================|&lt;br /&gt;
|   0  Tesla K20c          Off  | 0000:03:00.0     Off |                    0 |&lt;br /&gt;
| 30%   24C    P0    48W / 225W |     11MiB /  4799MiB |      0%      Default |&lt;br /&gt;
+-------------------------------+----------------------+----------------------+&lt;br /&gt;
|   1  Tesla K20c          Off  | 0000:84:00.0     Off |                    0 |&lt;br /&gt;
| 30%   23C    P0    52W / 225W |     11MiB /  4799MiB |     93%      Default |&lt;br /&gt;
+-------------------------------+----------------------+----------------------+&lt;br /&gt;
&lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
| Processes:                                                       GPU Memory |&lt;br /&gt;
|  GPU       PID  Type  Process name                               Usage      |&lt;br /&gt;
|=============================================================================|&lt;br /&gt;
|  No running processes found                                                 |&lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Please note that your job will only be able to see/access the GPUs you requested. If you only need 1 GPU, please request only 1 GPU so that the others are left available for other users:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
tgray26@opensub01:srun --pty --partition gpu --qos=gpu --gres=gpu:k20:1 nvidia-smi&lt;br /&gt;
Wed Jul 13 15:31:29 2016&lt;br /&gt;
+------------------------------------------------------+&lt;br /&gt;
| NVIDIA-SMI 361.28     Driver Version: 361.28         |&lt;br /&gt;
|-------------------------------+----------------------+----------------------+&lt;br /&gt;
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |&lt;br /&gt;
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |&lt;br /&gt;
|===============================+======================+======================|&lt;br /&gt;
|   0  Tesla K20c          Off  | 0000:03:00.0     Off |                    0 |&lt;br /&gt;
| 30%   24C    P0    50W / 225W |     11MiB /  4799MiB |     92%      Default |&lt;br /&gt;
+-------------------------------+----------------------+----------------------+&lt;br /&gt;
&lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
| Processes:                                                       GPU Memory |&lt;br /&gt;
|  GPU       PID  Type  Process name                               Usage      |&lt;br /&gt;
|=============================================================================|&lt;br /&gt;
|  No running processes found                                                 |&lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; flag may also be passed to [[#sbatch | sbatch]] and [[#salloc | salloc]] rather than directly to [[#srun | srun]].&lt;br /&gt;
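For example, a minimal batch script requesting a single GPU could look like the following (the job name and resource amounts are illustrative; the partition and QOS names follow the interactive examples above):&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --job-name=gpuHello
#SBATCH --partition=gpu      # GPU nodes are often in their own partition
#SBATCH --qos=gpu
#SBATCH --gres=gpu:1         # request one GPU of any type
#SBATCH --time=00:05:00
#SBATCH --mem=1gb

# the job step only sees the GPU(s) allocated to the job
srun nvidia-smi
```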
&lt;br /&gt;
=MPI example=&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/usr/bin/bash &lt;br /&gt;
#SBATCH --job-name=mpi_test # Job name &lt;br /&gt;
#SBATCH --nodes=4 # Number of nodes &lt;br /&gt;
#SBATCH --ntasks=8 # Number of MPI ranks &lt;br /&gt;
#SBATCH --ntasks-per-node=2 # Number of MPI ranks per node &lt;br /&gt;
#SBATCH --ntasks-per-socket=1 # Number of tasks per processor socket on the node &lt;br /&gt;
#SBATCH --time=00:30:00 # Time limit hrs:min:sec &lt;br /&gt;
&lt;br /&gt;
module load mpi &lt;br /&gt;
&lt;br /&gt;
srun --mpi=openmpi /nfshomes/derek/testing/mpi/a.out &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Pmehta1</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Rclone&amp;diff=8674</id>
		<title>Rclone</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Rclone&amp;diff=8674"/>
		<updated>2019-11-15T20:58:16Z</updated>

		<summary type="html">&lt;p&gt;Pmehta1: Created page with &amp;quot;Rclone is a command line program useful for syncing files and directories. Its functionality is similar to rsync, but has additional capabilities that support cloud storage se...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Rclone is a command line program useful for syncing files and directories. Its functionality is similar to rsync, but it has additional capabilities that support cloud storage services such as Dropbox, OpenDrive, Amazon S3, and many more. &lt;br /&gt;
&lt;br /&gt;
Below are directions on how to set up rclone remotely on a headless machine. If you would like to set up a specific Google Drive or Dropbox remote, you can go directly to the section labeled &amp;quot;Google Drive&amp;quot; or &amp;quot;Dropbox&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
==Remote Setup on a Headless System==&lt;br /&gt;
&lt;br /&gt;
To set up rclone remotely, you will first have to [[SSH]] into one of UMIACS&#039; hosts and load the rclone module by typing &amp;lt;tt&amp;gt;module load rclone&amp;lt;/tt&amp;gt;. If this does not work, check whether the rclone module is available by running the command &amp;lt;tt&amp;gt;module avail rclone&amp;lt;/tt&amp;gt;. If the module is not available, SSH to a host where it is. &lt;br /&gt;
&lt;br /&gt;
Once you have loaded the module, there are two ways to configure rclone remotely: the first is by copying the rclone config file, and the second is by using rclone authorize.&lt;br /&gt;
&lt;br /&gt;
===Rclone configuration file===&lt;br /&gt;
&lt;br /&gt;
To copy the rclone config file, first configure rclone on your desktop machine by running the command &amp;lt;tt&amp;gt;rclone config&amp;lt;/tt&amp;gt;, then find the location of the configuration file using the command &amp;lt;tt&amp;gt;rclone config file&amp;lt;/tt&amp;gt;, which should yield output similar to what is below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ rclone config file&lt;br /&gt;
Configuration file is stored at:&lt;br /&gt;
/home/user/.rclone.conf&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This file can then be transferred to the remote box using scp, copy/paste, ftp, sftp, etc. The command &amp;lt;tt&amp;gt;rclone config file&amp;lt;/tt&amp;gt; can also be run on the remote box to find the correct location to which the file should be moved.&lt;br /&gt;
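For example, the transfer with scp might look like the following (the username and hostname are placeholders; use the paths reported by &amp;lt;tt&amp;gt;rclone config file&amp;lt;/tt&amp;gt; on each machine):&lt;br /&gt;

```shell
# run from your desktop machine; "user" and "remotehost" are placeholders
scp ~/.rclone.conf user@remotehost:~/.rclone.conf
```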
&lt;br /&gt;
===Rclone authorize===&lt;br /&gt;
&lt;br /&gt;
To configure using rclone authorize instead, run &amp;lt;tt&amp;gt;rclone config&amp;lt;/tt&amp;gt; on your headless machine; when you reach the &amp;lt;tt&amp;gt;Remote config&amp;lt;/tt&amp;gt; step, select the option for working on a headless machine by answering the auto config prompt with an &amp;lt;tt&amp;gt;n&amp;lt;/tt&amp;gt;. The resulting output should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Remote config&lt;br /&gt;
Use auto config?&lt;br /&gt;
 * Say Y if not sure&lt;br /&gt;
 * Say N if you are working on a remote or headless machine&lt;br /&gt;
y) Yes&lt;br /&gt;
n) No&lt;br /&gt;
y/n&amp;gt; n&lt;br /&gt;
&lt;br /&gt;
For this to work, you will need rclone available on a machine that has a web browser available.&lt;br /&gt;
Execute the following on your machine:&lt;br /&gt;
	rclone authorize &amp;quot;amazon cloud drive&amp;quot;&lt;br /&gt;
Then paste the result below:&lt;br /&gt;
result&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Next, on your main desktop machine, run the command &amp;lt;tt&amp;gt;rclone authorize &amp;quot;amazon cloud drive&amp;quot;&amp;lt;/tt&amp;gt; and follow the link in the output. You should then receive a secret token on your desktop machine&#039;s console, in a message similar to the one below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rclone authorize &amp;quot;amazon cloud drive&amp;quot;&lt;br /&gt;
If your browser doesn&#039;t open automatically go to the following link: http://127.0.0.1:53682/auth&lt;br /&gt;
Log in and authorize rclone for access&lt;br /&gt;
Waiting for code...&lt;br /&gt;
Got code&lt;br /&gt;
Paste the following into your remote machine ---&amp;gt;&lt;br /&gt;
SECRET_TOKEN&lt;br /&gt;
&amp;lt;---End paste&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Go back to your headless machine and paste the secret token into the console where it says &amp;lt;tt&amp;gt;result&amp;gt;&amp;lt;/tt&amp;gt;. Finally, approve the token by answering the prompts that follow:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
result&amp;gt; SECRET_TOKEN&lt;br /&gt;
--------------------&lt;br /&gt;
[acd12]&lt;br /&gt;
client_id = &lt;br /&gt;
client_secret = &lt;br /&gt;
token = SECRET_TOKEN&lt;br /&gt;
--------------------&lt;br /&gt;
y) Yes this is OK&lt;br /&gt;
e) Edit this remote&lt;br /&gt;
d) Delete this remote&lt;br /&gt;
y/e/d&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are any issues with the remote setup, visit https://rclone.org/remote_setup/ for more information.&lt;br /&gt;
&lt;br /&gt;
==Google Drive==&lt;br /&gt;
&lt;br /&gt;
These instructions assume that you have SSHd into one of UMIACS&#039; Bastion hosts and have successfully been able to load the rclone module using the command &amp;lt;tt&amp;gt;module load rclone&amp;lt;/tt&amp;gt;. If this is not the case, please do so before proceeding with the Google Drive remote setup.&lt;br /&gt;
&lt;br /&gt;
First, make a new remote by running the command &amp;lt;tt&amp;gt;rclone config&amp;lt;/tt&amp;gt;. This should produce output similar to what is below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
n) New remote&lt;br /&gt;
r) Rename remote&lt;br /&gt;
c) Copy remote&lt;br /&gt;
s) Set configuration password&lt;br /&gt;
q) Quit config&lt;br /&gt;
n/r/c/s/q&amp;gt; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Choose to make a new remote, and give it a name. The setup will then prompt you to choose a type of storage to configure. You can either choose the number from the options that corresponds to &amp;quot;Google Drive&amp;quot;, or simply type in &amp;lt;tt&amp;gt;drive&amp;lt;/tt&amp;gt;. When prompted for your Google Application Client Id and Secret, it is okay to leave those blank. &lt;br /&gt;
&lt;br /&gt;
Rclone will then prompt for a scope that it should use when requesting access to your drive, which you can provide by selecting one of the numbers on the screen. You can leave both the root folder id and the service account file blank. You will then be prompted to use auto config:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Remote config&lt;br /&gt;
Use auto config?&lt;br /&gt;
 * Say Y if not sure&lt;br /&gt;
 * Say N if you are working on a remote or headless machine or Y didn&#039;t work&lt;br /&gt;
y) Yes&lt;br /&gt;
n) No&lt;br /&gt;
y/n&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Choosing the &amp;lt;tt&amp;gt;y&amp;lt;/tt&amp;gt; option is probably best here unless you are setting up rclone on a headless machine. The console should then redirect you to a browser page that will prompt you to log in and authorize rclone for access. Once you have followed the instructions on the webpage, the console will display your credentials and ask for approval to confirm the remote.&lt;br /&gt;
&lt;br /&gt;
==Dropbox==&lt;br /&gt;
&lt;br /&gt;
These instructions assume that you have SSHd into one of UMIACS&#039; Bastion hosts and have successfully been able to load the rclone module using the command &amp;lt;tt&amp;gt;module load rclone&amp;lt;/tt&amp;gt;. If this is not the case, please do so before proceeding with the Dropbox remote setup.&lt;br /&gt;
&lt;br /&gt;
First, make a new remote by running the command &amp;lt;tt&amp;gt;rclone config&amp;lt;/tt&amp;gt;. This should produce output similar to what is below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
n) New remote&lt;br /&gt;
r) Rename remote&lt;br /&gt;
c) Copy remote&lt;br /&gt;
s) Set configuration password&lt;br /&gt;
q) Quit config&lt;br /&gt;
n/r/c/s/q&amp;gt; &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Choose to make a new remote, and give it a name. The setup will then prompt you to choose a type of storage to configure. You can either choose the number from the options that corresponds to &amp;quot;Dropbox&amp;quot;, or simply type in &amp;lt;tt&amp;gt;dropbox&amp;lt;/tt&amp;gt;. When prompted for your Dropbox App Key and Secret, it is okay to leave those blank. &lt;br /&gt;
&lt;br /&gt;
The console will then redirect you to a browser page that will prompt you to enter a code displayed on the console and authorize rclone for access. Once you have followed the instructions on the webpage, the console will display your credentials and ask for approval to confirm the remote.&lt;/div&gt;</summary>
		<author><name>Pmehta1</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Iribe/ConferenceRooms&amp;diff=8649</id>
		<title>Iribe/ConferenceRooms</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Iribe/ConferenceRooms&amp;diff=8649"/>
		<updated>2019-10-23T19:08:32Z</updated>

		<summary type="html">&lt;p&gt;Pmehta1: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Types of Rooms ==&lt;br /&gt;
; [[Iribe/ConferenceRooms/Moderated | Moderated Rooms]]&lt;br /&gt;
: Seminar and large conference areas must be scheduled through the designated moderator for the room.&lt;br /&gt;
; [[Iribe/ConferenceRooms/AutoAccept | Auto-Accept Rooms]]&lt;br /&gt;
: These 6-12 person rooms are scheduled, but will automatically accept your meeting without requiring moderation.&lt;br /&gt;
; [[Iribe/ConferenceRooms/HuddleRoom | Huddle Rooms]]&lt;br /&gt;
: Meant for last-minute, short-term meetings for one to three people or for personal phone calls.&lt;br /&gt;
: Huddle rooms can only be scheduled at the touch panel. They cannot be scheduled ahead of time through Google Calendar. &lt;br /&gt;
&lt;br /&gt;
== Common Tasks ==&lt;br /&gt;
* [[Iribe/ConferenceRooms/View | Viewing one or more rooms]]&lt;br /&gt;
* [[Iribe/ConferenceRooms/Reserve | Reserving a room]]&lt;br /&gt;
* [[Iribe/ConferenceRooms/Moderation | [Moderator] Approving a room request ]]&lt;br /&gt;
* [[Iribe/ConferenceRooms/Moderation | [Moderator] Enabling notifications for pending requests ]]&lt;br /&gt;
&lt;br /&gt;
== Viewing ==&lt;br /&gt;
Conference room availability can be [[Iribe/ConferenceRooms/View | viewed]] with UMD’s Google Calendar system or from a touch panel outside each individual room.  The touch panel quickly allows you to see the status of the room: red if the room is reserved right now and green if it is open.&lt;br /&gt;
&lt;br /&gt;
We also offer a [[Iribe/ConferenceRooms/List|list]] of conference rooms scheduled with the UMD Google Calendar system.&lt;br /&gt;
&lt;br /&gt;
== Scheduling ==&lt;br /&gt;
Conference rooms are [[Iribe/ConferenceRooms/Reserve | reserved]] with UMD’s Google Calendar system along with a touch panel outside the room.  The touch panel quickly allows you to see the status of the room: red if the room is reserved right now and green if it is open.  You can walk up to the panel and reserve the room if it is unoccupied within the next 12 hours (non-moderated rooms only).&lt;br /&gt;
&lt;br /&gt;
== Room Capabilities ==&lt;br /&gt;
Our conference rooms have different AV capabilities based on the size of the room you are using.  We are still working with the AV contractors on a number of the capabilities in these rooms, but you should at a minimum be able to plug in your laptop to display to the projector and screen(s).  In the future we will outline and give sessions on how to effectively use the different functionalities in our conference rooms.&lt;br /&gt;
&lt;br /&gt;
* [[Iribe/ConferenceRooms/Solstice | Solstice Mersive Pods]]&lt;br /&gt;
* [[Iribe/ConferenceRooms/Recording | Recording and Streaming]]&lt;/div&gt;</summary>
		<author><name>Pmehta1</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Iribe/ConferenceRooms&amp;diff=8648</id>
		<title>Iribe/ConferenceRooms</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Iribe/ConferenceRooms&amp;diff=8648"/>
		<updated>2019-10-23T19:06:10Z</updated>

		<summary type="html">&lt;p&gt;Pmehta1: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Types of Rooms ==&lt;br /&gt;
; [[Iribe/ConferenceRooms/Moderated | Moderated Rooms]]&lt;br /&gt;
: Seminar and large conference areas must be scheduled through the designated moderator for the room.&lt;br /&gt;
; [[Iribe/ConferenceRooms/AutoAccept | Auto-Accept Rooms]]&lt;br /&gt;
: These 6-12 person rooms are scheduled, but will automatically accept your meeting without requiring moderation.&lt;br /&gt;
; [[Iribe/ConferenceRooms/HuddleRoom | Huddle Rooms]]&lt;br /&gt;
: Meant for last-minute, short-term meetings for one to three people or for personal phone calls.&lt;br /&gt;
: Huddle rooms can only be scheduled at the touch panel. They cannot be scheduled ahead of time through Google Calendar. &lt;br /&gt;
: Undergraduates do not have the swipe access required to enter the room even if they are able to schedule a time slot at the touch panel.&lt;br /&gt;
&lt;br /&gt;
== Common Tasks ==&lt;br /&gt;
* [[Iribe/ConferenceRooms/View | Viewing one or more rooms]]&lt;br /&gt;
* [[Iribe/ConferenceRooms/Reserve | Reserving a room]]&lt;br /&gt;
* [[Iribe/ConferenceRooms/Moderation | [Moderator] Approving a room request ]]&lt;br /&gt;
* [[Iribe/ConferenceRooms/Moderation | [Moderator] Enabling notifications for pending requests ]]&lt;br /&gt;
&lt;br /&gt;
== Viewing ==&lt;br /&gt;
Conference room availability can be [[Iribe/ConferenceRooms/View | viewed]] with UMD’s Google Calendar system or from a touch panel outside each individual room.  The touch panel quickly allows you to see the status of the room: red if the room is reserved right now and green if it is open.&lt;br /&gt;
&lt;br /&gt;
We also offer a [[Iribe/ConferenceRooms/List|list]] of conference rooms scheduled with the UMD Google Calendar system.&lt;br /&gt;
&lt;br /&gt;
== Scheduling ==&lt;br /&gt;
Conference rooms are [[Iribe/ConferenceRooms/Reserve | reserved]] with UMD’s Google Calendar system along with a touch panel outside the room.  The touch panel quickly allows you to see the status of the room: red if the room is reserved right now and green if it is open.  You can walk up to the panel and reserve the room if it is unoccupied within the next 12 hours (non-moderated rooms only).&lt;br /&gt;
&lt;br /&gt;
== Room Capabilities ==&lt;br /&gt;
Our conference rooms have different AV capabilities based on the size of the room you are using.  We are still working with the AV contractors on a number of the capabilities in these rooms, but you should at a minimum be able to plug in your laptop to display to the projector and screen(s).  In the future we will outline and give sessions on how to effectively use the different functionalities in our conference rooms.&lt;br /&gt;
&lt;br /&gt;
* [[Iribe/ConferenceRooms/Solstice | Solstice Mersive Pods]]&lt;br /&gt;
* [[Iribe/ConferenceRooms/Recording | Recording and Streaming]]&lt;/div&gt;</summary>
		<author><name>Pmehta1</name></author>
	</entry>
</feed>