<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.umiacs.umd.edu/umiacs/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Derek</id>
	<title>UMIACS - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.umiacs.umd.edu/umiacs/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Derek"/>
	<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php/Special:Contributions/Derek"/>
	<updated>2026-04-04T07:48:36Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.43.7</generator>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=January_2025&amp;diff=12481</id>
		<title>January 2025</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=January_2025&amp;diff=12481"/>
		<updated>2025-01-31T21:02:16Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;If you are viewing this page, then you are connected to the UMD GlobalProtect VPN or locally connected to a UMIACS network.&lt;br /&gt;
&lt;br /&gt;
Access to publicly-routed UMIACS resources such as our [[GitLab]], [[OBJ | Object Store]], and [[Intranet]] page, as well as [[Main Page | this wiki]] and other UMIACS-hosted wikis, is now available for UMIACS users who have reset their UMIACS password as advised previously.&lt;br /&gt;
&lt;br /&gt;
Submission nodes for [[Nexus]] are now available for [[SSH]] connections. If you are connected via the UMD GlobalProtect VPN, you will need to multi-factor authenticate with UMIACS&#039; [[Duo]] instance to be able to connect via SSH.&lt;br /&gt;
&lt;br /&gt;
Outbound traffic from UMIACS networks is now available and users should be able to connect to upstream remote services.  This includes the full use of the Iribe conference rooms.&lt;br /&gt;
&lt;br /&gt;
If you have any questions or concerns, please [[HelpDesk | contact staff]].&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Podman&amp;diff=11833</id>
		<title>Podman</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Podman&amp;diff=11833"/>
		<updated>2024-05-20T13:18:37Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Storage */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://podman.io/ Podman] is a daemonless container engine alternative to [https://www.docker.com/ Docker].  We don&#039;t support Docker in many of our environments because it grants trivial administrative control over the host that the Docker daemon runs on.  Podman, on the other hand, can run containers in user namespaces.  This means that for every user namespace you create in the kernel, the processes within it map to a new uid/gid range.  For example, if you are root in your container, you will not be uid 0 outside the container, but instead you will be uid 4294000000.&lt;br /&gt;
&lt;br /&gt;
We still believe that [[Apptainer]] is the best option for running containerized workloads on our cluster-based resources.  Podman is a good option for developing the containers to be run via Apptainer or for building a deliverable for a funding agency.  Please [[HelpDesk | contact staff]] if you would like Podman installed on a workstation or standalone server.  More information on running Podman [https://github.com/containers/podman/blob/main/docs/tutorials/rootless_tutorial.md rootless] is available.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
To get started there are a few things that you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, run the &amp;lt;code&amp;gt;podman&amp;lt;/code&amp;gt; command.  If it reports command not found, or you get an ERRO message like the one below about no subuid ranges, and you are on a workstation or standalone (non-cluster) server, please [[HelpDesk | contact staff]] with the error and the host that you are using.  We will need to do some setup steps to get the host ready.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ podman&lt;br /&gt;
ERRO[0000] cannot find mappings for user username: No subuid ranges found for user &amp;quot;username&amp;quot; in /etc/subuid&lt;br /&gt;
Error: missing command &#039;podman COMMAND&#039;&lt;br /&gt;
Try &#039;podman --help&#039; for more information.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
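&lt;br /&gt;
For reference, once staff has configured a host, your account will have a subordinate uid range registered in &amp;lt;code&amp;gt;/etc/subuid&amp;lt;/code&amp;gt;.  You can check for it as below (the exact range shown is only illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ grep username /etc/subuid&lt;br /&gt;
username:4294000000:65536&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;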
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
Container images are made up of layers, and these are stored under the graphroot setting in &amp;lt;code&amp;gt;~/.config/containers/storage.conf&amp;lt;/code&amp;gt;, which by default points into your home directory.  Because our home directories are served over NFS, there is an [https://www.redhat.com/sysadmin/rootless-podman-nfs issue]: due to the user namespace mapping described above, you will not be able to access your home directory while building the layers.&lt;br /&gt;
&lt;br /&gt;
You need to update the &amp;lt;code&amp;gt;graphroot&amp;lt;/code&amp;gt; setting to a local directory on the host.  The file &amp;lt;code&amp;gt;~/.config/containers/storage.conf&amp;lt;/code&amp;gt; may not exist until you run &amp;lt;code&amp;gt;podman&amp;lt;/code&amp;gt; for the first time; however, you can create it manually.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[storage]&lt;br /&gt;
  driver = &amp;quot;vfs&amp;quot;&lt;br /&gt;
  graphroot = &amp;quot;/scratch0/username/.local/share/containers/storage&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
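&lt;br /&gt;
After updating the file, one way to confirm that Podman picked up the new location is to query its storage configuration (the path shown assumes the example &amp;lt;code&amp;gt;graphroot&amp;lt;/code&amp;gt; above):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ podman info --format &#039;{{.Store.GraphRoot}}&#039;&lt;br /&gt;
/scratch0/username/.local/share/containers/storage&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;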
&lt;br /&gt;
When building larger images, Podman may fill up the default directory used for imageCopyTmpDir (/var/tmp).  If this happens, you will need to specify a different directory using the TMPDIR environment variable.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;export TMPDIR=&amp;quot;/scratch0/example_tmp_directory&amp;quot;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== GPUs ==&lt;br /&gt;
Running Podman with local Nvidia GPUs requires some additional configuration steps that staff must perform on any individual workstation or standalone (non-cluster) server that runs Podman.  This includes ensuring the &amp;lt;tt&amp;gt;nvidia-container-runtime&amp;lt;/tt&amp;gt; package is installed.&lt;br /&gt;
&lt;br /&gt;
For example you can run &amp;lt;code&amp;gt;nvidia-smi&amp;lt;/code&amp;gt; from within the official Nvidia CUDA containers with a command like this, optionally replacing the tag for different CUDA versions/OS images:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ podman run --rm --hooks-dir=/usr/share/containers/oci/hooks.d docker.io/nvidia/cuda:12.2.2-base-ubi8 nvidia-smi&lt;br /&gt;
Thu Apr 16 18:47:04 2020&lt;br /&gt;
+---------------------------------------------------------------------------------------+&lt;br /&gt;
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |&lt;br /&gt;
+---------------------------------------------------------------------------------------+&lt;br /&gt;
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |&lt;br /&gt;
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |&lt;br /&gt;
|                                         |                      |               MIG M. |&lt;br /&gt;
|=========================================+======================+======================|&lt;br /&gt;
|   0  NVIDIA RTX A6000               Off | 00000000:01:00.0 Off |                  Off |&lt;br /&gt;
| 30%   28C    P8               6W / 300W |      2MiB / 49140MiB |      0%      Default |&lt;br /&gt;
|                                         |                      |                  N/A |&lt;br /&gt;
+-----------------------------------------+----------------------+----------------------+&lt;br /&gt;
&lt;br /&gt;
+---------------------------------------------------------------------------------------+&lt;br /&gt;
| Processes:                                                                            |&lt;br /&gt;
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |&lt;br /&gt;
|        ID   ID                                                             Usage      |&lt;br /&gt;
|=======================================================================================|&lt;br /&gt;
|  No running processes found                                                           |&lt;br /&gt;
+---------------------------------------------------------------------------------------+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The full list of tags can be found at https://hub.docker.com/r/nvidia/cuda/tags.&lt;br /&gt;
&lt;br /&gt;
== Example ==&lt;br /&gt;
To build your own image, you can start from an example we provide: https://gitlab.umiacs.umd.edu/derek/gpudocker.&lt;br /&gt;
&lt;br /&gt;
First, clone the repository, change into its directory, and build the image with podman.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://gitlab.umiacs.umd.edu/derek/gpudocker.git&lt;br /&gt;
cd gpudocker&lt;br /&gt;
podman build -t gpudocker .&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then you can run the test script to verify.  Notice that we pass the local directory &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; as a path into the image so we can run a script.  This can also be useful for your output data, since anything you write anywhere else in the container will not be available outside the container.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ podman run --volume `pwd`/test:/mnt --hooks-dir=/usr/share/containers/oci/hooks.d gpudocker python3 /mnt/test_torch.py&lt;br /&gt;
GPU found 0: GeForce GTX 1080 Ti&lt;br /&gt;
tensor([[0.3479, 0.6594, 0.5791],&lt;br /&gt;
        [0.6065, 0.3415, 0.9328],&lt;br /&gt;
        [0.9117, 0.3541, 0.9050],&lt;br /&gt;
        [0.6611, 0.5361, 0.3212],&lt;br /&gt;
        [0.8574, 0.5116, 0.7021]])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you instead want to push modifications to this example to your own container registry such that you can pull the container image down later, please see the README.md located in the example repository itself.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=WebCrawling&amp;diff=11648</id>
		<title>WebCrawling</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=WebCrawling&amp;diff=11648"/>
		<updated>2024-03-01T14:01:48Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Web crawling, scraping, or otherwise downloading large sets of publicly available information on the Internet (hereafter referred to as just &amp;quot;crawling&amp;quot;) should be handled with care.&lt;br /&gt;
&lt;br /&gt;
You should always understand whether any publicly accessible data you are planning on crawling is encumbered.  Some things that are publicly accessible are still under copyright, and it is important to understand the restrictions on any data that you crawl from remotely accessible sites.&lt;br /&gt;
&lt;br /&gt;
There are a number of different organizations that still give access to academic institutions (such as UMIACS) through IP-based filtering, but have restrictions on crawling.  The services that they provide may also not be adequately built for programmatic crawling.  In these cases, the organizations may ban IP addresses or ranges where crawling is observed, which can then prevent other individuals from using these services from the same addresses.  In the case of UMIACS, bad behavior on the part of a single user on UMIACS systems can result in all access to a service being banned from UMIACS&#039; public IP addresses.&lt;br /&gt;
&lt;br /&gt;
Some examples of databases or sites that may have restrictions on crawling are (not an exhaustive list):&lt;br /&gt;
* University of Maryland Library Resources - https://lib.guides.umd.edu/c.php?g=326950&amp;amp;p=2194463&lt;br /&gt;
* NCBI - https://www.ncbi.nlm.nih.gov/search/&lt;br /&gt;
&lt;br /&gt;
You should never try to evade limitations or restrictions imposed by the site or its owner on a publicly available service that you are looking to crawl.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=WebCrawling&amp;diff=11625</id>
		<title>WebCrawling</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=WebCrawling&amp;diff=11625"/>
		<updated>2024-02-21T15:02:58Z</updated>

		<summary type="html">&lt;p&gt;Derek: Created page with &amp;quot;Web crawling, scraping or otherwise downloading large sets of publicly available information on the Internet should be handled with care.  You should always understand if any data you are planning on scraping is publicly available but is still encumbered.  Some things that are publicly accessible are still under copyright and are important to understand what restrictions on any data that you crawl from remotely accessible sites.  There are a number of different sites tha...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Web crawling, scraping or otherwise downloading large sets of publicly available information on the Internet should be handled with care.&lt;br /&gt;
&lt;br /&gt;
You should always understand whether any data you are planning on scraping is publicly available but still encumbered.  Some things that are publicly accessible are still under copyright, and it is important to understand what restrictions apply to any data that you crawl from remotely accessible sites.&lt;br /&gt;
&lt;br /&gt;
There are a number of different sites that still give access to academic institutions through IP-based filtering but have restrictions on scraping or crawling.  It can also be the case that the service is not adequately built for programmatic crawling.  In this case, often the only option these organizations have against sites that crawl them is to ban IP addresses (or ranges).  This of course can impact individuals who then try to use these services from UMIACS public IPs and find themselves blocked.&lt;br /&gt;
&lt;br /&gt;
Some examples (not an exhaustive list) of databases or sites that may have restrictions:&lt;br /&gt;
&lt;br /&gt;
* University of Maryland Library Resources - https://lib.guides.umd.edu/c.php?g=326950&amp;amp;p=2194463&lt;br /&gt;
* NCBI - https://www.ncbi.nlm.nih.gov/search/&lt;br /&gt;
&lt;br /&gt;
Users should never try to evade limitations or restrictions on data within a publicly available service they are looking to scrape.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=S3Clients&amp;diff=11041</id>
		<title>S3Clients</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=S3Clients&amp;diff=11041"/>
		<updated>2023-06-21T13:15:38Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* mc */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Many popular S3 desktop clients can be used to access the [[OBJ | UMIACS Object Store]].  These tools complement the [[UMobj]] command line utilities and the built-in web interface by providing integration with the native file explorer on your desktop machine.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Many of these clients have features that are not supported by our Object Store in UMIACS.  One prominent example of this is permissions. We suggest you instead manage permissions from the [https://obj.umiacs.umd.edu/obj built-in web application] for the Object Store.&lt;br /&gt;
&lt;br /&gt;
=Graphical Clients=&lt;br /&gt;
&lt;br /&gt;
==Cyberduck==&lt;br /&gt;
* http://cyberduck.ch/&lt;br /&gt;
&lt;br /&gt;
This is a free S3 browser for Windows and Mac (it is, however, nagware that asks for a donation).  It supports our S3 Object Stores using the &amp;quot;S3 (Amazon Simple Storage Service)&amp;quot; drop-down menu choice in the add bookmark dialog.  The following fields are required.&lt;br /&gt;
&lt;br /&gt;
[[Image:Cyberduck.png|400px]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Server&#039;&#039;&#039; - This should be your object store (&amp;lt;code&amp;gt;obj.umiacs.umd.edu&amp;lt;/code&amp;gt;)&lt;br /&gt;
* &#039;&#039;&#039;Access Key ID&#039;&#039;&#039; - This is your access key as provided to you in the object store&lt;br /&gt;
* &#039;&#039;&#039;Password&#039;&#039;&#039; - This is your secret key as provided to you in the object store&lt;br /&gt;
&lt;br /&gt;
You will be prompted for your secret key when you connect and may choose to save the password.&lt;br /&gt;
&lt;br /&gt;
==Transmit==&lt;br /&gt;
* http://panic.com/transmit/&lt;br /&gt;
This is a paid file transfer application for Mac.  It supports our S3 Object Stores using the &amp;quot;S3&amp;quot; menu choice after clicking the plus sign to add a favorite.   The following fields are required:&lt;br /&gt;
&lt;br /&gt;
[[Image:Transmit.png|400px]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Server&#039;&#039;&#039; - This should be your object store (&amp;lt;code&amp;gt;obj.umiacs.umd.edu&amp;lt;/code&amp;gt;)&lt;br /&gt;
* &#039;&#039;&#039;Access Key ID&#039;&#039;&#039; - This is your access key as provided to you in the object store&lt;br /&gt;
* &#039;&#039;&#039;Secret&#039;&#039;&#039; - This is your secret key as provided to you in the object store&lt;br /&gt;
&lt;br /&gt;
These settings can be saved as a favorite for easy access.  Transmit also allows you to mount your Obj buckets as local disks, which will support easy drag-and-drop of files.&lt;br /&gt;
&lt;br /&gt;
=Command Line Clients=&lt;br /&gt;
==s3cmd==&lt;br /&gt;
Command line client for accessing S3-like services.&lt;br /&gt;
&lt;br /&gt;
* http://s3tools.org/s3cmd&lt;br /&gt;
&lt;br /&gt;
You need to configure a file like &amp;lt;code&amp;gt;~/.s3cmd&amp;lt;/code&amp;gt; that looks like the following with your ACCESS_KEY and SECRET_KEY substituted.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[default]&lt;br /&gt;
access_key = ACCESS_KEY&lt;br /&gt;
host_base = obj.umiacs.umd.edu&lt;br /&gt;
host_bucket = %(bucket)s.obj.umiacs.umd.edu&lt;br /&gt;
secret_key = SECRET_KEY&lt;br /&gt;
use_https = True&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
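&lt;br /&gt;
With the configuration file in place, you can then list your buckets and transfer objects.  For example (the bucket name &amp;lt;code&amp;gt;mybucket&amp;lt;/code&amp;gt; below is just a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ s3cmd ls&lt;br /&gt;
$ s3cmd put localfile.txt s3://mybucket/&lt;br /&gt;
$ s3cmd get s3://mybucket/localfile.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;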
&lt;br /&gt;
==mc==&lt;br /&gt;
The Minio Client is a comprehensive single binary (Go) command line client for cloud based storage services.&lt;br /&gt;
&lt;br /&gt;
* https://min.io/download&lt;br /&gt;
&lt;br /&gt;
Users in UMIACS can run this client on supported systems by adding the software [[Modules|module]] for &amp;lt;code&amp;gt;mc&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
module add mc&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will need to set up a host entry for Obj; first, retrieve the ACCESS_KEY and SECRET_KEY for your personal or LabAccount in the [https://obj.umiacs.umd.edu/obj/user/ Object Store].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mc config host add obj http://obj.umiacs.umd.edu ACCESS_KEY SECRET_KEY&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can see what host(s) you have configured with the command &amp;lt;code&amp;gt;mc config host ls&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
You can then use the normal &amp;lt;code&amp;gt;mc&amp;lt;/code&amp;gt; commands like the following to list the contents of a bucket.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mc ls obj/iso&lt;br /&gt;
[2017-02-10 16:45:04 EST] 3.5GiB rhel-server-7.3-x86_64-dvd.iso&lt;br /&gt;
[2017-02-13 12:21:33 EST] 4.0GiB rhel-workstation-7.3-x86_64-dvd.iso&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
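&lt;br /&gt;
Beyond listing, &amp;lt;code&amp;gt;mc&amp;lt;/code&amp;gt; can also copy objects to and from a bucket (the bucket name &amp;lt;code&amp;gt;mybucket&amp;lt;/code&amp;gt; below is just a placeholder):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mc cp localfile.txt obj/mybucket/&lt;br /&gt;
$ mc cp obj/mybucket/localfile.txt .&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;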
&lt;br /&gt;
You can also search for specific files by glob using the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; sub-command of &amp;lt;code&amp;gt;mc&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mc find derek/derek_support --name &amp;quot;*.log&amp;quot;&lt;br /&gt;
derek/derek_support/mds_20170918/ceph-mds.objmds01.log&lt;br /&gt;
derek/derek_support/satellite.log&lt;br /&gt;
derek/derek_support/umiacs-49168.log&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=S3Clients&amp;diff=10808</id>
		<title>S3Clients</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=S3Clients&amp;diff=10808"/>
		<updated>2023-02-07T20:34:37Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* mc */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Many popular S3 desktop clients can be used to access the [[OBJ | UMIACS Object Store]].  These tools complement the [[UMobj]] command line utilities and the built-in web interface by providing integration with the native file explorer on your desktop machine.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Many of these clients have features that are not supported by our Object Store in UMIACS.  One prominent example of this is permissions. We suggest you instead manage permissions from the [https://obj.umiacs.umd.edu/obj built-in web application] for the Object Store.&lt;br /&gt;
&lt;br /&gt;
=Graphical Clients=&lt;br /&gt;
&lt;br /&gt;
==Cyberduck==&lt;br /&gt;
* http://cyberduck.ch/&lt;br /&gt;
&lt;br /&gt;
This is a free S3 browser for Windows and Mac (it is, however, nagware that asks for a donation).  It supports our S3 Object Stores using the &amp;quot;S3 (Amazon Simple Storage Service)&amp;quot; drop-down menu choice in the add bookmark dialog.  The following fields are required.&lt;br /&gt;
&lt;br /&gt;
[[Image:Cyberduck.png|400px]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Server&#039;&#039;&#039; - This should be your object store (&amp;lt;code&amp;gt;obj.umiacs.umd.edu&amp;lt;/code&amp;gt;)&lt;br /&gt;
* &#039;&#039;&#039;Access Key ID&#039;&#039;&#039; - This is your access key as provided to you in the object store&lt;br /&gt;
* &#039;&#039;&#039;Password&#039;&#039;&#039; - This is your secret key as provided to you in the object store&lt;br /&gt;
&lt;br /&gt;
You will be prompted for your secret key when you connect and may choose to save the password.&lt;br /&gt;
&lt;br /&gt;
==Transmit==&lt;br /&gt;
* http://panic.com/transmit/&lt;br /&gt;
This is a paid file transfer application for Mac.  It supports our S3 Object Stores using the &amp;quot;S3&amp;quot; menu choice after clicking the plus sign to add a favorite.   The following fields are required:&lt;br /&gt;
&lt;br /&gt;
[[Image:Transmit.png|400px]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Server&#039;&#039;&#039; - This should be your object store (&amp;lt;code&amp;gt;obj.umiacs.umd.edu&amp;lt;/code&amp;gt;)&lt;br /&gt;
* &#039;&#039;&#039;Access Key ID&#039;&#039;&#039; - This is your access key as provided to you in the object store&lt;br /&gt;
* &#039;&#039;&#039;Secret&#039;&#039;&#039; - This is your secret key as provided to you in the object store&lt;br /&gt;
&lt;br /&gt;
These settings can be saved as a favorite for easy access.  Transmit also allows you to mount your Obj buckets as local disks, which will support easy drag-and-drop of files.&lt;br /&gt;
&lt;br /&gt;
=Command Line Clients=&lt;br /&gt;
==s3cmd==&lt;br /&gt;
Command line client for accessing S3-like services.&lt;br /&gt;
&lt;br /&gt;
* http://s3tools.org/s3cmd&lt;br /&gt;
&lt;br /&gt;
You need to configure a file like &amp;lt;code&amp;gt;~/.s3cmd&amp;lt;/code&amp;gt; that looks like the following with your ACCESS_KEY and SECRET_KEY substituted.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[default]&lt;br /&gt;
access_key = ACCESS_KEY&lt;br /&gt;
host_base = obj.umiacs.umd.edu&lt;br /&gt;
host_bucket = %(bucket)s.obj.umiacs.umd.edu&lt;br /&gt;
secret_key = SECRET_KEY&lt;br /&gt;
use_https = True&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mc==&lt;br /&gt;
The Minio Client is a comprehensive single binary (Go) command line client for cloud based storage services.&lt;br /&gt;
&lt;br /&gt;
* https://min.io/download&lt;br /&gt;
&lt;br /&gt;
Users in UMIACS can run this client on supported systems by adding the software [[Modules|module]] for &amp;lt;code&amp;gt;mc&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
module add mc&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will need to set up an alias for Obj; first, retrieve the ACCESS_KEY and SECRET_KEY for your personal or LabAccount in the [https://obj.umiacs.umd.edu/obj/user/ Object Store].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mc alias set obj http://obj.umiacs.umd.edu ACCESS_KEY SECRET_KEY&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can then use the normal &amp;lt;code&amp;gt;mc&amp;lt;/code&amp;gt; commands like the following to list the contents of a bucket.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mc ls obj/iso&lt;br /&gt;
[2017-02-10 16:45:04 EST] 3.5GiB rhel-server-7.3-x86_64-dvd.iso&lt;br /&gt;
[2017-02-13 12:21:33 EST] 4.0GiB rhel-workstation-7.3-x86_64-dvd.iso&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also search for specific files by glob using the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; sub-command of &amp;lt;code&amp;gt;mc&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mc find derek/derek_support --name &amp;quot;*.log&amp;quot;&lt;br /&gt;
derek/derek_support/mds_20170918/ceph-mds.objmds01.log&lt;br /&gt;
derek/derek_support/satellite.log&lt;br /&gt;
derek/derek_support/umiacs-49168.log&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Podman&amp;diff=10778</id>
		<title>Podman</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Podman&amp;diff=10778"/>
		<updated>2022-12-21T18:55:41Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://podman.io/ Podman] is a daemonless container engine alternative to [https://www.docker.com/ Docker].  We don&#039;t support Docker in many of our environments because it grants trivial administrative control over the host that the Docker daemon runs on.  Podman, on the other hand, can run containers in user namespaces.  This means that for every user namespace you create in the kernel, the processes within it map to a new uid/gid range.  For example, if you are root in your container, you will not be uid 0 outside the container, but instead you will be uid 4294000000.&lt;br /&gt;
&lt;br /&gt;
We still believe that [[Singularity]] is the best option for running containerized workloads on our cluster-based resources.  Podman is a good option for developing the containers to be run via [[Singularity]] or for building a deliverable for a funding agency.  Please [[HelpDesk | contact staff]] if you would like Podman installed on a workstation or standalone server.  More information on running Podman [https://github.com/containers/podman/blob/main/docs/tutorials/rootless_tutorial.md rootless] is available.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
To get started there are a few things that you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, run the &#039;&#039;&#039;podman&#039;&#039;&#039; command.  If it reports command not found, or you get an ERRO message like the one below about no subuid ranges, please [[HelpDesk | contact staff]] with the error and the host that you are using.  We will need to do some setup steps to get the host ready.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[username@zerus:~ ] $ podman&lt;br /&gt;
ERRO[0000] cannot find mappings for user username: No subuid ranges found for user &amp;quot;username&amp;quot; in /etc/subuid&lt;br /&gt;
Error: missing command &#039;podman COMMAND&#039;&lt;br /&gt;
Try &#039;podman --help&#039; for more information.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
Container images are made up of layers, and these are stored under the graphroot setting in &amp;lt;code&amp;gt;~/.config/containers/storage.conf&amp;lt;/code&amp;gt;, which by default points into your home directory.  Because our home directories are served over NFS, there is an [https://www.redhat.com/sysadmin/rootless-podman-nfs issue]: due to the user namespace mapping described above, you will not be able to access your home directory while building the layers.&lt;br /&gt;
&lt;br /&gt;
You need to update the &amp;lt;code&amp;gt;graphroot&amp;lt;/code&amp;gt; setting to a local directory on the host.  The file &amp;lt;code&amp;gt;~/.config/containers/storage.conf&amp;lt;/code&amp;gt; may not exist until you run &amp;lt;code&amp;gt;podman&amp;lt;/code&amp;gt; the first time. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[storage]&lt;br /&gt;
  driver = &amp;quot;vfs&amp;quot;&lt;br /&gt;
  runroot = &amp;quot;/tmp/run-2174&amp;quot;&lt;br /&gt;
  graphroot = &amp;quot;/scratch1/username/.local/share/containers/storage&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== GPUs ==&lt;br /&gt;
Running Podman with the local Nvidia GPUs requires some additional configuration steps that staff has to add to any individual host that runs Podman.  This includes ensuring the &amp;lt;tt&amp;gt;nvidia-container-runtime&amp;lt;/tt&amp;gt; package is installed.&lt;br /&gt;
&lt;br /&gt;
For example you can run &amp;lt;code&amp;gt;nvidia-smi&amp;lt;/code&amp;gt; from within the official Nvidia CUDA containers with a command like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ podman run --rm --hooks-dir=/usr/share/containers/oci/hooks.d docker.io/nvidia/cuda nvidia-smi&lt;br /&gt;
Thu Apr 16 18:47:04 2020&lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |&lt;br /&gt;
|-------------------------------+----------------------+----------------------+&lt;br /&gt;
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |&lt;br /&gt;
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |&lt;br /&gt;
|===============================+======================+======================|&lt;br /&gt;
|   0  GeForce GTX TIT...  Off  | 00000000:03:00.0 Off |                  N/A |&lt;br /&gt;
| 22%   40C    P8    14W / 250W |    142MiB / 12212MiB |      1%      Default |&lt;br /&gt;
+-------------------------------+----------------------+----------------------+&lt;br /&gt;
|   1  GeForce GTX TIT...  Off  | 00000000:04:00.0 Off |                  N/A |&lt;br /&gt;
| 22%   34C    P8    15W / 250W |      1MiB / 12212MiB |      1%      Default |&lt;br /&gt;
+-------------------------------+----------------------+----------------------+&lt;br /&gt;
&lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
| Processes:                                                       GPU Memory |&lt;br /&gt;
|  GPU       PID   Type   Process name                             Usage      |&lt;br /&gt;
|=============================================================================|&lt;br /&gt;
+-----------------------------------------------------------------------------+&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Example ==&lt;br /&gt;
To build your own image, you can start from an example we provide: https://gitlab.umiacs.umd.edu/derek/gpudocker.&lt;br /&gt;
&lt;br /&gt;
First, clone the repository, change into it, and build the image with podman.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
git clone https://gitlab.umiacs.umd.edu/derek/gpudocker.git&lt;br /&gt;
cd gpudocker&lt;br /&gt;
podman build -t gpudocker .&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then you can run the test script to verify the image.  Notice that we mount the local directory &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; as a path inside the container so we can run a script from it.  Mounting a directory like this is also useful for output data, since anything written anywhere else in the container will not be available outside the container.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ podman run --volume `pwd`/test:/mnt --hooks-dir=/usr/share/containers/oci/hooks.d gpudocker python3 /mnt/test_torch.py&lt;br /&gt;
GPU found 0: GeForce GTX 1080 Ti&lt;br /&gt;
tensor([[0.3479, 0.6594, 0.5791],&lt;br /&gt;
        [0.6065, 0.3415, 0.9328],&lt;br /&gt;
        [0.9117, 0.3541, 0.9050],&lt;br /&gt;
        [0.6611, 0.5361, 0.3212],&lt;br /&gt;
        [0.8574, 0.5116, 0.7021]])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CBCB&amp;diff=10772</id>
		<title>Nexus/CBCB</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CBCB&amp;diff=10772"/>
		<updated>2022-11-18T16:18:30Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Storage */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [[Nexus]] computational resources and scheduler house the CBCB&#039;s new computational partition.&lt;br /&gt;
&lt;br /&gt;
= Submission Nodes =&lt;br /&gt;
There are two submission nodes for Nexus available exclusively to CBCB users.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;nexuscbcb00.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;nexuscbcb01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Resources = &lt;br /&gt;
The new CBCB partition has 22 new nodes, each with 32 AMD EPYC-7313 cores and 2000GB of memory.  CBCB users can also submit jobs to, and access resources such as GPUs in, other partitions in [[Nexus]].&lt;br /&gt;
&lt;br /&gt;
= QoS = &lt;br /&gt;
Currently, CBCB users have access to all the default QoS options in the cbcb partition using the cbcb account; however, there is one additional QoS called &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt; that allows significantly more memory to be allocated.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_qos&lt;br /&gt;
        Name     MaxWall MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES&lt;br /&gt;
------------ ----------- ------- ------------------------------ ------------------------------ --------------------&lt;br /&gt;
      normal&lt;br /&gt;
   scavenger  2-00:00:00             cpu=64,gres/gpu=8,mem=256G   cpu=192,gres/gpu=24,mem=768G&lt;br /&gt;
      medium  2-00:00:00               cpu=8,gres/gpu=2,mem=64G&lt;br /&gt;
        high  1-00:00:00             cpu=16,gres/gpu=4,mem=128G&lt;br /&gt;
     default  3-00:00:00               cpu=4,gres/gpu=1,mem=32G&lt;br /&gt;
        tron                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
   huge-long 10-00:00:00             cpu=32,gres/gpu=8,mem=256G&lt;br /&gt;
        clip                                                                                      cpu=339,mem=2926G&lt;br /&gt;
       class                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
       gamma                                                                                      cpu=179,mem=1511G&lt;br /&gt;
         mc2                                                                                      cpu=307,mem=1896G&lt;br /&gt;
        cbcb                                                                                     cpu=913,mem=46931G&lt;br /&gt;
     highmem 21-00:00:00                       cpu=32,mem=2000G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Jobs =&lt;br /&gt;
You will need to specify &amp;lt;code&amp;gt;--partition=cbcb&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--account=cbcb&amp;lt;/code&amp;gt;, and a specific &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt; when you submit jobs to the CBCB partition.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[derek@nexuscbcb00:~ ] $ srun --pty --ntasks=16 --mem=2000G --qos=highmem --partition=cbcb --account=cbcb --time 1-00:00:00 bash&lt;br /&gt;
srun: job 218874 queued and waiting for resources&lt;br /&gt;
srun: job 218874 has been allocated resources&lt;br /&gt;
[derek@cbcb00:~ ] $ scontrol show job 218874&lt;br /&gt;
JobId=218874 JobName=bash&lt;br /&gt;
   UserId=derek(2174) GroupId=derek(22174) MCS_label=N/A&lt;br /&gt;
   Priority=897 Nice=0 Account=cbcb QOS=highmem&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56&lt;br /&gt;
   AccrueTime=2022-11-18T11:13:56&lt;br /&gt;
   StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A&lt;br /&gt;
   PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main&lt;br /&gt;
   Partition=cbcb AllocNode:Sid=nexuscbcb00:25443&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=cbcb00&lt;br /&gt;
   BatchHost=cbcb00&lt;br /&gt;
   NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   TRES=cpu=16,mem=2000G,node=1,billing=2266&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=bash&lt;br /&gt;
   WorkDir=/nfshomes/derek&lt;br /&gt;
   Power=&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Storage =&lt;br /&gt;
CBCB&#039;s current [https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage storage] allocation remains in place.  All data file systems that were available in the previous CBCB cluster are available computationally in Nexus.  Please note the change to your home directory described in the migration section below.&lt;br /&gt;
&lt;br /&gt;
CBCB users can also request allocations under our [[Nexus#Storage]] policies.&lt;br /&gt;
&lt;br /&gt;
= Migration =&lt;br /&gt;
&lt;br /&gt;
== Home Directories ==&lt;br /&gt;
The [[Nexus]] uses our [[NFShomes]] home directories, not /cbcbhomes/$USERNAME.  As part of migrating into Nexus, you may want to copy any shell customization from your existing &amp;lt;code&amp;gt;/cbcbhomes&amp;lt;/code&amp;gt; directory to your new home directory.  To make this transition easier, &amp;lt;code&amp;gt;/cbcbhomes&amp;lt;/code&amp;gt; is available on the CBCB submission nodes.&lt;br /&gt;
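As a sketch, the shell-customization copy described above might look like the following; the list of dotfiles and the example paths are assumptions, so adjust them to what you actually have in &amp;lt;code&amp;gt;/cbcbhomes&amp;lt;/code&amp;gt;.

```shell
# Sketch: copy common shell dotfiles from an old home directory to a new one.
# Copies only files that exist, and never overwrites (-n) files already present.
copy_dotfiles() {
    src="$1"; dest="$2"
    for f in .bashrc .bash_profile .profile .bash_aliases; do
        if [ -f "$src/$f" ]; then
            cp -n "$src/$f" "$dest/"
        fi
    done
}
# On a CBCB submission node, something like:
#   copy_dotfiles "/cbcbhomes/$USER" "$HOME"
```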
&lt;br /&gt;
== Operating System / Software ==&lt;br /&gt;
Previously, CBCB&#039;s cluster ran RHEL7.  The [[Nexus]] runs exclusively RHEL8, so any software you compiled yourself may need to be re-compiled to work correctly in the new environment.  The CBCB [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules module tree] for RHEL8 has just been started (and may not yet be fully populated); if you do not see the modules you need, please reach out to the maintainers.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CBCB&amp;diff=10771</id>
		<title>Nexus/CBCB</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CBCB&amp;diff=10771"/>
		<updated>2022-11-18T16:17:02Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [[Nexus]] computational resources and scheduler house the CBCB&#039;s new computational partition.&lt;br /&gt;
&lt;br /&gt;
= Submission Nodes =&lt;br /&gt;
There are two submission nodes for Nexus available exclusively to CBCB users.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;nexuscbcb00.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;nexuscbcb01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Resources = &lt;br /&gt;
The new CBCB partition has 22 new nodes, each with 32 AMD EPYC-7313 cores and 2000GB of memory.  CBCB users can also submit jobs to, and access resources such as GPUs in, other partitions in [[Nexus]].&lt;br /&gt;
&lt;br /&gt;
= QoS = &lt;br /&gt;
Currently, CBCB users have access to all the default QoS options in the cbcb partition using the cbcb account; however, there is one additional QoS called &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt; that allows significantly more memory to be allocated.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_qos&lt;br /&gt;
        Name     MaxWall MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES&lt;br /&gt;
------------ ----------- ------- ------------------------------ ------------------------------ --------------------&lt;br /&gt;
      normal&lt;br /&gt;
   scavenger  2-00:00:00             cpu=64,gres/gpu=8,mem=256G   cpu=192,gres/gpu=24,mem=768G&lt;br /&gt;
      medium  2-00:00:00               cpu=8,gres/gpu=2,mem=64G&lt;br /&gt;
        high  1-00:00:00             cpu=16,gres/gpu=4,mem=128G&lt;br /&gt;
     default  3-00:00:00               cpu=4,gres/gpu=1,mem=32G&lt;br /&gt;
        tron                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
   huge-long 10-00:00:00             cpu=32,gres/gpu=8,mem=256G&lt;br /&gt;
        clip                                                                                      cpu=339,mem=2926G&lt;br /&gt;
       class                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
       gamma                                                                                      cpu=179,mem=1511G&lt;br /&gt;
         mc2                                                                                      cpu=307,mem=1896G&lt;br /&gt;
        cbcb                                                                                     cpu=913,mem=46931G&lt;br /&gt;
     highmem 21-00:00:00                       cpu=32,mem=2000G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Jobs =&lt;br /&gt;
You will need to specify &amp;lt;code&amp;gt;--partition=cbcb&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--account=cbcb&amp;lt;/code&amp;gt;, and a specific &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt; when you submit jobs to the CBCB partition.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[derek@nexuscbcb00:~ ] $ srun --pty --ntasks=16 --mem=2000G --qos=highmem --partition=cbcb --account=cbcb --time 1-00:00:00 bash&lt;br /&gt;
srun: job 218874 queued and waiting for resources&lt;br /&gt;
srun: job 218874 has been allocated resources&lt;br /&gt;
[derek@cbcb00:~ ] $ scontrol show job 218874&lt;br /&gt;
JobId=218874 JobName=bash&lt;br /&gt;
   UserId=derek(2174) GroupId=derek(22174) MCS_label=N/A&lt;br /&gt;
   Priority=897 Nice=0 Account=cbcb QOS=highmem&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56&lt;br /&gt;
   AccrueTime=2022-11-18T11:13:56&lt;br /&gt;
   StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A&lt;br /&gt;
   PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main&lt;br /&gt;
   Partition=cbcb AllocNode:Sid=nexuscbcb00:25443&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=cbcb00&lt;br /&gt;
   BatchHost=cbcb00&lt;br /&gt;
   NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   TRES=cpu=16,mem=2000G,node=1,billing=2266&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=bash&lt;br /&gt;
   WorkDir=/nfshomes/derek&lt;br /&gt;
   Power=&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Storage =&lt;br /&gt;
CBCB&#039;s current [https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage storage] allocation remains in place.  Please note the home directory migration information below.&lt;br /&gt;
&lt;br /&gt;
CBCB users can also request allocations under our [[Nexus#Storage]] policies.&lt;br /&gt;
&lt;br /&gt;
= Migration =&lt;br /&gt;
&lt;br /&gt;
== Home Directories ==&lt;br /&gt;
The [[Nexus]] uses our [[NFShomes]] home directories, not /cbcbhomes/$USERNAME.  As part of migrating into Nexus, you may want to copy any shell customization from your existing &amp;lt;code&amp;gt;/cbcbhomes&amp;lt;/code&amp;gt; directory to your new home directory.  To make this transition easier, &amp;lt;code&amp;gt;/cbcbhomes&amp;lt;/code&amp;gt; is available on the CBCB submission nodes.&lt;br /&gt;
&lt;br /&gt;
== Operating System / Software ==&lt;br /&gt;
Previously, CBCB&#039;s cluster ran RHEL7.  The [[Nexus]] runs exclusively RHEL8, so any software you compiled yourself may need to be re-compiled to work correctly in the new environment.  The CBCB [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules module tree] for RHEL8 has just been started (and may not yet be fully populated); if you do not see the modules you need, please reach out to the maintainers.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CBCB&amp;diff=10770</id>
		<title>Nexus/CBCB</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CBCB&amp;diff=10770"/>
		<updated>2022-11-18T16:14:33Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [[Nexus]] computational resources and scheduler house the CBCB&#039;s new computational partition.&lt;br /&gt;
&lt;br /&gt;
= Submission Nodes =&lt;br /&gt;
There are two submission nodes for Nexus available exclusively to CBCB users.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;nexuscbcb00.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;nexuscbcb01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Resources = &lt;br /&gt;
The new CBCB partition has 22 new nodes, each with 32 AMD EPYC-7313 cores and 2000GB of memory.  CBCB users can also submit jobs to, and access resources such as GPUs in, other partitions in [[Nexus]].&lt;br /&gt;
&lt;br /&gt;
= QoS = &lt;br /&gt;
Currently, CBCB users have access to all the default QoS options in the cbcb partition using the cbcb account; however, there is one additional QoS called &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt; that allows significantly more memory to be allocated.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_qos&lt;br /&gt;
        Name     MaxWall MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES&lt;br /&gt;
------------ ----------- ------- ------------------------------ ------------------------------ --------------------&lt;br /&gt;
      normal&lt;br /&gt;
   scavenger  2-00:00:00             cpu=64,gres/gpu=8,mem=256G   cpu=192,gres/gpu=24,mem=768G&lt;br /&gt;
      medium  2-00:00:00               cpu=8,gres/gpu=2,mem=64G&lt;br /&gt;
        high  1-00:00:00             cpu=16,gres/gpu=4,mem=128G&lt;br /&gt;
     default  3-00:00:00               cpu=4,gres/gpu=1,mem=32G&lt;br /&gt;
        tron                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
   huge-long 10-00:00:00             cpu=32,gres/gpu=8,mem=256G&lt;br /&gt;
        clip                                                                                      cpu=339,mem=2926G&lt;br /&gt;
       class                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
       gamma                                                                                      cpu=179,mem=1511G&lt;br /&gt;
         mc2                                                                                      cpu=307,mem=1896G&lt;br /&gt;
        cbcb                                                                                     cpu=913,mem=46931G&lt;br /&gt;
     highmem 21-00:00:00                       cpu=32,mem=2000G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Jobs =&lt;br /&gt;
You will need to specify &amp;lt;code&amp;gt;--partition=cbcb&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--account=cbcb&amp;lt;/code&amp;gt;, and a specific &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt; when you submit jobs to the CBCB partition.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[derek@nexuscbcb00:~ ] $ srun --pty --ntasks=16 --mem=2000G --qos=highmem --partition=cbcb --account=cbcb --time 1-00:00:00 bash&lt;br /&gt;
srun: job 218874 queued and waiting for resources&lt;br /&gt;
srun: job 218874 has been allocated resources&lt;br /&gt;
[derek@cbcb00:~ ] $ scontrol show job 218874&lt;br /&gt;
JobId=218874 JobName=bash&lt;br /&gt;
   UserId=derek(2174) GroupId=derek(22174) MCS_label=N/A&lt;br /&gt;
   Priority=897 Nice=0 Account=cbcb QOS=highmem&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:06 TimeLimit=1-00:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2022-11-18T11:13:56 EligibleTime=2022-11-18T11:13:56&lt;br /&gt;
   AccrueTime=2022-11-18T11:13:56&lt;br /&gt;
   StartTime=2022-11-18T11:13:56 EndTime=2022-11-19T11:13:56 Deadline=N/A&lt;br /&gt;
   PreemptEligibleTime=2022-11-18T11:13:56 PreemptTime=None&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:13:56 Scheduler=Main&lt;br /&gt;
   Partition=cbcb AllocNode:Sid=nexuscbcb00:25443&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=cbcb00&lt;br /&gt;
   BatchHost=cbcb00&lt;br /&gt;
   NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   TRES=cpu=16,mem=2000G,node=1,billing=2266&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=bash&lt;br /&gt;
   WorkDir=/nfshomes/derek&lt;br /&gt;
   Power=&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Migration =&lt;br /&gt;
&lt;br /&gt;
== Home Directories ==&lt;br /&gt;
The [[Nexus]] uses our [[NFShomes]] home directories, not /cbcbhomes/$USERNAME.  As part of migrating into Nexus, you may want to copy any shell customization from your existing &amp;lt;code&amp;gt;/cbcbhomes&amp;lt;/code&amp;gt; directory to your new home directory.  To make this transition easier, &amp;lt;code&amp;gt;/cbcbhomes&amp;lt;/code&amp;gt; is available on the CBCB submission nodes.&lt;br /&gt;
&lt;br /&gt;
== Operating System / Software ==&lt;br /&gt;
Previously, CBCB&#039;s cluster ran RHEL7.  The [[Nexus]] runs exclusively RHEL8, so any software you compiled yourself may need to be re-compiled to work correctly in the new environment.  The CBCB [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules module tree] for RHEL8 has just been started (and may not yet be fully populated); if you do not see the modules you need, please reach out to the maintainers.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus&amp;diff=10769</id>
		<title>Nexus</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus&amp;diff=10769"/>
		<updated>2022-11-18T16:12:44Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Quality of Service (QoS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The Nexus is the combined scheduler of resources in UMIACS.  Many of our existing computational clusters that have discrete schedulers will be folded into this scheduler in the future (see [[#Migrations | below]]).  The resource manager for Nexus (as with our other existing computational clusters) is [[SLURM]].  Resources are arranged into partitions where users are able to schedule computational jobs.  Users are arranged into a number of SLURM accounts based on faculty, lab, or center investments.&lt;br /&gt;
&lt;br /&gt;
= Getting Started =&lt;br /&gt;
All accounts in UMIACS are sponsored.  If you don&#039;t already have a UMIACS account, please see [[Accounts]] for information on getting one.  You need a full UMIACS account (not a [[Accounts/Collaborator | collaborator account]]) in order to access Nexus.&lt;br /&gt;
&lt;br /&gt;
== Access ==&lt;br /&gt;
The submission nodes for the Nexus computational resources are determined by department, center, or lab affiliation.  You can log into the [https://intranet.umiacs.umd.edu/directory/cr/ UMIACS Directory CR application] and select the Computational Resource (CR) in the list that has the prefix &amp;lt;code&amp;gt;nexus&amp;lt;/code&amp;gt;. The Hosts section lists your available login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039; - UMIACS requires multi-factor authentication through our [[Duo]] instance.  This is completely discrete from both UMD&#039;s and CSD&#039;s Duo instances.  You will need to enroll one or more devices to access resources in UMIACS, and will be prompted to enroll when you log into the Directory application for the first time.&lt;br /&gt;
&lt;br /&gt;
Once you have identified your submission nodes, you can [[SSH]] directly into them.  From there, you are able to submit to the cluster via our [[SLURM]] workload manager.  You need to make sure that your submitted jobs have the correct account, partition, and qos.&lt;br /&gt;
&lt;br /&gt;
== Jobs ==&lt;br /&gt;
[[SLURM]] jobs are submitted with either &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, depending on whether you are running an interactive job or a batch job, respectively.  You need to provide the where/how/who to run the job and specify the resources you need to run with.&lt;br /&gt;
&lt;br /&gt;
For the where/how/who, you may be required to specify &amp;lt;code&amp;gt;--partition&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt;, and/or &amp;lt;code&amp;gt;--account&amp;lt;/code&amp;gt; (respectively) in order to submit jobs to the Nexus.&lt;br /&gt;
&lt;br /&gt;
For resources, you may need to specify &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; for time, &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; for CPUs, &amp;lt;code&amp;gt;--mem&amp;lt;/code&amp;gt; for RAM, and &amp;lt;code&amp;gt;--gres=gpu&amp;lt;/code&amp;gt; for GPUs in your submission arguments to meet your requirements.  There are defaults for all four, so if you don&#039;t specify something, you may be scheduled with a very minimal set of time and resources (e.g., by default, NO GPUs are included if you do not specify &amp;lt;code&amp;gt;--gres=gpu&amp;lt;/code&amp;gt;).  For more information about submission flags for GPU resources, see [[SLURM/JobSubmission#Requesting_GPUs]].  You can also run &amp;lt;code&amp;gt;man srun&amp;lt;/code&amp;gt; on your submission node for a complete list of available submission arguments.&lt;br /&gt;
&lt;br /&gt;
=== Interactive ===&lt;br /&gt;
Once logged into a submission node, you can run simple interactive jobs.  If your session is interrupted from the submission node, the job will be killed.  As such, we encourage use of a terminal multiplexer such as [[Tmux]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --pty --ntasks 4 --mem=2gb --gres=gpu:1 nvidia-smi -L&lt;br /&gt;
GPU 0: NVIDIA RTX A4000 (UUID: GPU-ae5dc1f5-c266-5b9f-58d5-7976e62b3ca1)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Batch ===&lt;br /&gt;
Batch jobs are scheduled with a script file with an optional ability to embed job scheduling parameters via variables that are defined by &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; lines at the top of the file.  You can find some examples in our [[SLURM/JobSubmission]] documentation.&lt;br /&gt;
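As a minimal sketch of such a script, the following writes a batch file with embedded &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; parameters; the script name, resource numbers, and payload command are illustrative, and you would add the &amp;lt;code&amp;gt;--partition&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;--account&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt; lines appropriate to your association.

```shell
# Sketch: create a batch script with embedded #SBATCH scheduling parameters.
# All values below are illustrative, not site-mandated defaults.
cat > example_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --time=04:00:00
#SBATCH --ntasks=4
#SBATCH --mem=16gb
#SBATCH --gres=gpu:1
nvidia-smi -L
EOF
# Submit with: sbatch example_job.sh
```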
&lt;br /&gt;
= Partitions = &lt;br /&gt;
The SLURM resource manager uses partitions to act as job queues, which can enforce size, time, and user limits.  The Nexus (when fully operational) will have a number of different partitions of resources.  Different Centers, Labs, and Faculty will be able to invest in computational resources that will be restricted to approved users through these partitions.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Partitions usable by all non-[[ClassAccounts |class account]] users:&#039;&#039;&#039;&lt;br /&gt;
* [[Nexus/Tron]] - Pool of resources available to all UMIACS and CSD faculty and graduate students.&lt;br /&gt;
* Scavenger - [https://slurm.schedmd.com/preempt.html Preemption] partition that supports nodes from multiple other partitions.  More resources are available to schedule simultaneously than in other partitions, however jobs are subject to preemption rules.  You are responsible for ensuring your jobs handle this preemption correctly.  The SLURM scheduler will simply restart a preempted job with the same submission arguments when it is available to run again.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Partitions usable by [[ClassAccounts]]:&#039;&#039;&#039;&lt;br /&gt;
* [[ClassAccounts | Class]] - Pool available for UMIACS/CSD class accounts.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Partitions usable by specific lab/center users:&#039;&#039;&#039;&lt;br /&gt;
* [[Nexus/CLIP]] - CLIP lab pool available for CLIP lab members.&lt;br /&gt;
* [[Nexus/Gamma]] - GAMMA lab pool available for GAMMA lab members.&lt;br /&gt;
* [[Nexus/MC2]] - MC2 lab pool available for MC2 lab members.&lt;br /&gt;
* [[Nexus/CBCB]] - CBCB lab pool available for CBCB lab members.&lt;br /&gt;
&lt;br /&gt;
= Quality of Service (QoS) =&lt;br /&gt;
SLURM uses QoS definitions to impose limits on users&#039; job sizes.  Note that you should still try to allocate only the minimum resources for your jobs, as the resources each of your jobs schedules are counted against your [https://slurm.schedmd.com/fair_tree.html FairShare priority] in the future.&lt;br /&gt;
* default - Default QoS. Limited to 4 cores, 32GB RAM, and 1 GPU per job.  The maximum wall time per job is 3 days.&lt;br /&gt;
* medium - Limited to 8 cores, 64GB RAM, and 2 GPUs per job.  The maximum wall time per job is 2 days.&lt;br /&gt;
* high - Limited to 16 cores, 128GB RAM, and 4 GPUs per job.  The maximum wall time per job is 1 day.&lt;br /&gt;
* scavenger - Limited to 64 cores, 256GB RAM, and 8 GPUs per job.  The maximum wall time per job is 2 days.  Only 192 total cores, 768GB total RAM, and 24 total GPUs are permitted simultaneously across all of your jobs running in this QoS.  This QoS is both only available in the scavenger partition and the only QoS available in the scavenger partition. To use this QoS, include &amp;lt;code&amp;gt;--partition=scavenger&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--account=scavenger&amp;lt;/code&amp;gt; in your submission arguments. Do not include any QoS argument other than &amp;lt;code&amp;gt;--qos=scavenger&amp;lt;/code&amp;gt; (optional) or the submission will fail.&lt;br /&gt;
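As a concrete illustration of the scavenger rules above, a submission might be composed as follows; the job script name and resource numbers are hypothetical, and the point is that the partition, account, and (optional) QoS flags must all agree.

```shell
# Sketch: composing a scavenger submission per the rules above.
# my_job.sh and the resource numbers are illustrative only.
FLAGS="--partition=scavenger --account=scavenger --qos=scavenger"
RESOURCES="--ntasks=8 --mem=64gb --gres=gpu:2 --time=2-00:00:00"
echo "sbatch $FLAGS $RESOURCES my_job.sh"
```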
&lt;br /&gt;
You can display these QoSes from the command line using the &amp;lt;code&amp;gt;show_qos&amp;lt;/code&amp;gt; command.  Other partition-specific, lab- or group-specific, or reserved QoSes may also appear in the listing; the four QoSes above are the ones that everyone can submit to.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# show_qos&lt;br /&gt;
        Name     MaxWall MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES&lt;br /&gt;
------------ ----------- ------- ------------------------------ ------------------------------ --------------------&lt;br /&gt;
      normal&lt;br /&gt;
   scavenger  2-00:00:00             cpu=64,gres/gpu=8,mem=256G   cpu=192,gres/gpu=24,mem=768G&lt;br /&gt;
      medium  2-00:00:00               cpu=8,gres/gpu=2,mem=64G&lt;br /&gt;
        high  1-00:00:00             cpu=16,gres/gpu=4,mem=128G&lt;br /&gt;
     default  3-00:00:00               cpu=4,gres/gpu=1,mem=32G&lt;br /&gt;
        tron                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
   huge-long 10-00:00:00             cpu=32,gres/gpu=8,mem=256G&lt;br /&gt;
        clip                                                                                      cpu=339,mem=2926G&lt;br /&gt;
       class                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
       gamma                                                                                      cpu=179,mem=1511G&lt;br /&gt;
         mc2                                                                                      cpu=307,mem=1896G&lt;br /&gt;
        cbcb                                                                                     cpu=913,mem=46931G&lt;br /&gt;
     highmem 21-00:00:00                       cpu=32,mem=2000G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please note that in the default non-preemption partition (&amp;lt;code&amp;gt;tron&amp;lt;/code&amp;gt;), you will be restricted to 32 total cores, 256GB total RAM, and 4 total GPUs at once across all jobs you have running in the QoSes allowed by that partition.  This is codified by the reserved QoS also named &amp;lt;code&amp;gt;tron&amp;lt;/code&amp;gt; in the output above. Lab/group-specific partitions may also have similar restrictions across all users in that lab/group that are using the partition (codified by &amp;lt;code&amp;gt;GrpTRES&amp;lt;/code&amp;gt; in the output above for the QoS name that matches the lab/group partition).&lt;br /&gt;
&lt;br /&gt;
To find out what accounts and partitions you have access to, use the &amp;lt;code&amp;gt;show_assoc&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
= Storage =&lt;br /&gt;
All storage available in Nexus is currently [[NFS]] based.  We will be introducing some changes for Phase 2 to support high performance GPUDirect Storage (GDS).  These storage allocation procedures will be revised and approved by a joint UMIACS and CSD faculty committee by the launch of Phase 2.&lt;br /&gt;
&lt;br /&gt;
== Home Directories ==&lt;br /&gt;
Home directories in the Nexus computational infrastructure are available from the Institute&#039;s [[NFShomes]] as &amp;lt;code&amp;gt;/nfshomes/USERNAME&amp;lt;/code&amp;gt; where USERNAME is your username.  These home directories have very limited storage (20GB, cannot be increased) and are intended for your personal files, configuration, and source code.  Your home directory is &#039;&#039;&#039;not&#039;&#039;&#039; intended for data sets or other large scale data holdings.  Users are encouraged to utilize our [[GitLab]] infrastructure to host their code repositories.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE&#039;&#039;&#039;: To check your quota on this directory, use the &lt;code&gt;quota -s&lt;/code&gt; command.&lt;br /&gt;
&lt;br /&gt;
Your home directory data is fully protected: it has [[Snapshots | snapshots]] and is [[NightlyBackups | backed up nightly]].&lt;br /&gt;
&lt;br /&gt;
Other standalone compute clusters have begun to fold into partitions in Nexus.  The corresponding home directories used by these clusters (if not &amp;lt;code&amp;gt;/nfshomes&amp;lt;/code&amp;gt;) will be gradually phased out in favor of the &amp;lt;code&amp;gt;/nfshomes&amp;lt;/code&amp;gt; home directories.&lt;br /&gt;
&lt;br /&gt;
== Scratch Directories ==&lt;br /&gt;
Scratch data has no data protection: it is not snapshotted and is not backed up. There are two types of scratch directories in the Nexus compute infrastructure:&lt;br /&gt;
* Network scratch directories&lt;br /&gt;
* Local scratch directories&lt;br /&gt;
&lt;br /&gt;
Please note that [[ClassAccounts | class accounts]] do not have network scratch directories.&lt;br /&gt;
&lt;br /&gt;
=== Network Scratch Directories ===&lt;br /&gt;
You are allocated 200GB of scratch space via NFS from &lt;code&gt;/fs/nexus-scratch/$username&lt;/code&gt;.  &#039;&#039;&#039;It is not backed up or protected in any way.&#039;&#039;&#039;  This directory is &#039;&#039;&#039;automounted&#039;&#039;&#039;, so you will need to &lt;code&gt;cd&lt;/code&gt; into the directory or specify its fully qualified path to access it.&lt;br /&gt;
&lt;br /&gt;
You may request a permanent increase of up to 400GB total space without any faculty approval by [[HelpDesk | contacting staff]].  If you need space beyond 400GB, you will need faculty approval and/or a project directory.&lt;br /&gt;
&lt;br /&gt;
This file system is available on all submission, data management, and computational nodes within the cluster.&lt;br /&gt;
&lt;br /&gt;
=== Local Scratch Directories ===&lt;br /&gt;
Each computational node that you can schedule compute jobs on also has one or more local scratch directories.  These are always named &lt;code&gt;/scratch0&lt;/code&gt;, &lt;code&gt;/scratch1&lt;/code&gt;, etc.  These are almost always more performant than any other storage available to the job.  However, you must stage your data into these directories within the confines of your job and stage it back out before the end of your job.&lt;br /&gt;
&lt;br /&gt;
These local scratch directories have a tmpwatch job which will &#039;&#039;&#039;delete data unaccessed for 90 days&#039;&#039;&#039;; it is scheduled via maintenance jobs to run once a month at 1am.  Different nodes run the maintenance jobs on different days of the month to ensure the cluster remains highly available at all times.  Please make sure you secure any data you write to these directories before the end of your job.&lt;br /&gt;
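The stage-in / compute / stage-out pattern described above can be sketched as follows. Temporary directories created with &lt;code&gt;mktemp -d&lt;/code&gt; stand in for the real paths (&lt;code&gt;/fs/nexus-scratch&lt;/code&gt; and &lt;code&gt;/scratch0&lt;/code&gt;) so the sketch is runnable outside the cluster; the input data and compute step are purely illustrative.

```shell
# Illustrative sketch of the stage-in / compute / stage-out pattern.
# mktemp stands in for the real Nexus paths so this runs anywhere;
# inside a real job you would use the actual directories.
NET_SCRATCH=$(mktemp -d)    # stands in for /fs/nexus-scratch/$username
LOCAL_SCRATCH=$(mktemp -d)  # stands in for /scratch0
echo "input data" > "$NET_SCRATCH/input.txt"

# 1. Stage in: copy input to the faster local scratch.
cp "$NET_SCRATCH/input.txt" "$LOCAL_SCRATCH/"

# 2. Compute against the local copy (here, just uppercase it).
cat "$LOCAL_SCRATCH/input.txt" | tr 'a-z' 'A-Z' > "$LOCAL_SCRATCH/output.txt"

# 3. Stage out: copy results back before the job ends, since local
#    scratch data may be removed by tmpwatch after 90 days unaccessed.
cp "$LOCAL_SCRATCH/output.txt" "$NET_SCRATCH/"
cat "$NET_SCRATCH/output.txt"    # INPUT DATA
```

In a real job, the stage-out step is the part that must not be skipped: anything left only on local scratch is subject to deletion.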
&lt;br /&gt;
== Faculty Allocations ==&lt;br /&gt;
Each faculty member can be allocated 1TB of lab space upon request.  We can also support grouping these individual allocations together into larger center, lab, or research group allocations if desired by the faculty.  Please [[HelpDesk | contact staff]] to inquire.&lt;br /&gt;
&lt;br /&gt;
This lab space does not have [[Snapshots | snapshots]] by default (they are available upon request), but it is [[NightlyBackups | backed up]].&lt;br /&gt;
&lt;br /&gt;
== Project Allocations ==&lt;br /&gt;
Project allocations are available per user for 270 TB days; for example, you can have a 1TB allocation for up to 270 days, a 3TB allocation for up to 90 days, etc.  A single faculty member cannot have more than 20TB of sponsored account project allocations active at any point.&lt;br /&gt;
&lt;br /&gt;
The minimum storage space you can request (maximum length) is 500GB (540 days) and the minimum allocation length you can request (maximum storage) is 30 days (9TB).&lt;br /&gt;
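The 270 TB-day budget works out to a simple inverse relationship between size and duration; a quick sketch of the arithmetic:

```shell
# Project allocation budget: size_in_TB * days = 270 TB days,
# so the maximum duration for a given size is 270 / size_in_TB.
budget=270
for tb in 1 3 9; do
  echo "${tb}TB for up to $((budget / tb)) days"
done
# 1TB for up to 270 days
# 3TB for up to 90 days
# 9TB for up to 30 days
```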
&lt;br /&gt;
To request an allocation, please [[HelpDesk | contact staff]] with your account sponsor involved in the conversation.  Please include the following details:&lt;br /&gt;
* Project Name (short)&lt;br /&gt;
* Description&lt;br /&gt;
* Size (1TB, 2TB, etc.)&lt;br /&gt;
* Length in days (270 days, 135 days, etc.)&lt;br /&gt;
* Other user(s) that need to access the allocation, if any&lt;br /&gt;
&lt;br /&gt;
These allocations will be available via &amp;lt;code&amp;gt;/fs/nexus-projects/$project_name&amp;lt;/code&amp;gt;.  &#039;&#039;&#039;Renewal is not guaranteed to be available due to limits on the amount of total storage.&#039;&#039;&#039;  Near the end of the allocation period, staff will contact you and ask if you are still in need of the storage allocation.  If you are no longer in need of the storage allocation, you will need to relocate all desired data within 14 days of the end of the allocation period.  Staff will then remove the allocation.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
We have read-only dataset storage available at &amp;lt;code&amp;gt;/fs/nexus-datasets&amp;lt;/code&amp;gt;.  If there are datasets that you would like to see curated and available, please see [[Datasets | this page]].&lt;br /&gt;
&lt;br /&gt;
We will have a more formal process to approve datasets by Phase 2 of Nexus.&lt;br /&gt;
&lt;br /&gt;
= Migrations =&lt;br /&gt;
If you are a user of an existing cluster that is in the process of being folded into Nexus now or in the near future, your cluster-specific migration information will be listed here.&lt;br /&gt;
* [[Nexus/CLIP | CLIP]]&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus&amp;diff=10768</id>
		<title>Nexus</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus&amp;diff=10768"/>
		<updated>2022-11-18T16:12:16Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Partitions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The Nexus is the combined scheduler of resources in UMIACS.  Many of our existing computational clusters that have discrete schedulers will be folded into this scheduler in the future (see [[#Migrations | below]]).  The resource manager for Nexus (as with our other existing computational clusters) is [[SLURM]].  Resources are arranged into partitions where users are able to schedule computational jobs.  Users are arranged into a number of SLURM accounts based on faculty, lab, or center investments.&lt;br /&gt;
&lt;br /&gt;
= Getting Started =&lt;br /&gt;
All accounts in UMIACS are sponsored.  If you don&#039;t already have a UMIACS account, please see [[Accounts]] for information on getting one.  You need a full UMIACS account (not a [[Accounts/Collaborator | collaborator account]]) in order to access Nexus.&lt;br /&gt;
&lt;br /&gt;
== Access ==&lt;br /&gt;
The submission nodes for the Nexus computational resources are determined by department, center, or lab affiliation.  You can log into the [https://intranet.umiacs.umd.edu/directory/cr/ UMIACS Directory CR application] and select the Computational Resource (CR) in the list that has the prefix &amp;lt;code&amp;gt;nexus&amp;lt;/code&amp;gt;. The Hosts section lists your available login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039; - UMIACS requires multi-factor authentication through our [[Duo]] instance.  This is completely discrete from both UMD&#039;s and CSD&#039;s Duo instances.  You will need to enroll one or more devices to access resources in UMIACS, and will be prompted to enroll when you log into the Directory application for the first time.&lt;br /&gt;
&lt;br /&gt;
Once you have identified your submission nodes, you can [[SSH]] directly into them.  From there, you are able to submit to the cluster via our [[SLURM]] workload manager.  You need to make sure that your submitted jobs have the correct account, partition, and qos.&lt;br /&gt;
&lt;br /&gt;
== Jobs ==&lt;br /&gt;
[[SLURM]] jobs are submitted via either &lt;code&gt;srun&lt;/code&gt; or &lt;code&gt;sbatch&lt;/code&gt;, depending on whether you are running an interactive or batch job, respectively.  You need to provide the where/how/who to run the job and specify the resources you need to run with.&lt;br /&gt;
&lt;br /&gt;
For the where/how/who, you may be required to specify &amp;lt;code&amp;gt;--partition&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt;, and/or &amp;lt;code&amp;gt;--account&amp;lt;/code&amp;gt; (respectively) to be able to adequately submit jobs to the Nexus.&lt;br /&gt;
&lt;br /&gt;
For resources, you may need to specify &lt;code&gt;--time&lt;/code&gt; for time, &lt;code&gt;--ntasks&lt;/code&gt; for CPUs, &lt;code&gt;--mem&lt;/code&gt; for RAM, and &lt;code&gt;--gres=gpu&lt;/code&gt; for GPUs in your submission arguments to meet your requirements.  There are defaults for all four, so if you don&#039;t specify something, you may be scheduled with a very minimal set of time and resources (e.g., by default, NO GPUs are included if you do not specify &lt;code&gt;--gres=gpu&lt;/code&gt;).  For more information about submission flags for GPU resources, see [[SLURM/JobSubmission#Requesting_GPUs]].  You can also run &lt;code&gt;man srun&lt;/code&gt; on your submission node for a complete list of available submission arguments.&lt;br /&gt;
&lt;br /&gt;
=== Interactive ===&lt;br /&gt;
Once logged into a submission node, you can run simple interactive jobs.  If your session is interrupted from the submission node, the job will be killed.  As such, we encourage use of a terminal multiplexer such as [[Tmux]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --pty --ntasks 4 --mem=2gb --gres=gpu:1 nvidia-smi -L&lt;br /&gt;
GPU 0: NVIDIA RTX A4000 (UUID: GPU-ae5dc1f5-c266-5b9f-58d5-7976e62b3ca1)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Batch ===&lt;br /&gt;
Batch jobs are scheduled with a script file, optionally embedding job scheduling parameters via &lt;code&gt;#SBATCH&lt;/code&gt; lines at the top of the file.  You can find some examples in our [[SLURM/JobSubmission]] documentation.&lt;br /&gt;
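A minimal sketch of such a script (the partition, QoS, and resource values here are illustrative; adjust them to the accounts and partitions you actually have access to):

```shell
#!/bin/bash
#SBATCH --job-name=example      # scheduling parameters are embedded
#SBATCH --partition=tron        # as #SBATCH comment lines at the top
#SBATCH --qos=default
#SBATCH --ntasks=4
#SBATCH --mem=2gb
#SBATCH --time=1-00:00:00

# Payload: runs on the allocated node once the job starts.
msg="running on $(hostname)"
echo "$msg"
```

Submit the script with &lt;code&gt;sbatch script.sh&lt;/code&gt;; the &lt;code&gt;#SBATCH&lt;/code&gt; lines are ordinary comments to the shell but are read by SLURM at submission time.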
&lt;br /&gt;
= Partitions = &lt;br /&gt;
The SLURM resource manager uses partitions to act as job queues, which can enforce size, time, and user limits.  The Nexus (when fully operational) will have a number of different partitions of resources.  Different centers, labs, and faculty will be able to invest in computational resources that will be restricted to approved users through these partitions.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Partitions usable by all non-[[ClassAccounts |class account]] users:&#039;&#039;&#039;&lt;br /&gt;
* [[Nexus/Tron]] - Pool of resources available to all UMIACS and CSD faculty and graduate students.&lt;br /&gt;
* Scavenger - [https://slurm.schedmd.com/preempt.html Preemption] partition that supports nodes from multiple other partitions.  More resources are available to schedule simultaneously than in other partitions, however jobs are subject to preemption rules.  You are responsible for ensuring your jobs handle this preemption correctly.  The SLURM scheduler will simply restart a preempted job with the same submission arguments when it is available to run again.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Partitions usable by [[ClassAccounts]]:&#039;&#039;&#039;&lt;br /&gt;
* [[ClassAccounts | Class]] - Pool available for UMIACS/CSD class accounts.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Partitions usable by specific lab/center users:&#039;&#039;&#039;&lt;br /&gt;
* [[Nexus/CLIP]] - CLIP lab pool available for CLIP lab members.&lt;br /&gt;
* [[Nexus/Gamma]] - GAMMA lab pool available for GAMMA lab members.&lt;br /&gt;
* [[Nexus/MC2]] - MC2 lab pool available for MC2 lab members.&lt;br /&gt;
* [[Nexus/CBCB]] - CBCB lab pool available for CBCB lab members.&lt;br /&gt;
&lt;br /&gt;
= Quality of Service (QoS) =&lt;br /&gt;
SLURM uses a QoS to provide limits on job sizes to users.  Note that you should still try to allocate only the minimum resources for your jobs, as the resources that each of your jobs schedules are counted against your [https://slurm.schedmd.com/fair_tree.html FairShare priority] in the future.&lt;br /&gt;
* default - Default QoS. Limited to 4 cores, 32GB RAM, and 1 GPU per job.  The maximum wall time per job is 3 days.&lt;br /&gt;
* medium - Limited to 8 cores, 64GB RAM, and 2 GPUs per job.  The maximum wall time per job is 2 days.&lt;br /&gt;
* high - Limited to 16 cores, 128GB RAM, and 4 GPUs per job.  The maximum wall time per job is 1 day.&lt;br /&gt;
* scavenger - Limited to 64 cores, 256GB RAM, and 8 GPUs per job.  The maximum wall time per job is 2 days.  Only 192 total cores, 768GB total RAM, and 24 total GPUs are permitted simultaneously across all of your jobs running in this QoS.  This QoS is only available in the scavenger partition, and it is the only QoS available in that partition. To use it, include &lt;code&gt;--partition=scavenger&lt;/code&gt; and &lt;code&gt;--account=scavenger&lt;/code&gt; in your submission arguments. Do not include any QoS argument other than &lt;code&gt;--qos=scavenger&lt;/code&gt; (optional) or the submission will fail.&lt;br /&gt;
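The scavenger submission arguments described above can be embedded at the top of a batch script; this fragment is a sketch, and the resource values below are illustrative rather than required:

```shell
#SBATCH --partition=scavenger   # required for the scavenger QoS
#SBATCH --account=scavenger     # required
#SBATCH --qos=scavenger         # optional; any other --qos will fail
#SBATCH --ntasks=8              # illustrative resource requests
#SBATCH --mem=32gb
#SBATCH --gres=gpu:2
```

Remember that jobs submitted this way can be preempted and re-queued at any time.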
&lt;br /&gt;
You can display these QoSes from the command line using the &lt;code&gt;show_qos&lt;/code&gt; command. Other partition-specific, lab/group-specific, or reserved QoSes may also appear in the listing. The above four QoSes are the ones that everyone can submit to.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# show_qos&lt;br /&gt;
        Name     MaxWall MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES&lt;br /&gt;
------------ ----------- ------- ------------------------------ ------------------------------ --------------------&lt;br /&gt;
      normal&lt;br /&gt;
   scavenger  2-00:00:00             cpu=64,gres/gpu=8,mem=256G   cpu=192,gres/gpu=24,mem=768G&lt;br /&gt;
      medium  2-00:00:00               cpu=8,gres/gpu=2,mem=64G&lt;br /&gt;
        high  1-00:00:00             cpu=16,gres/gpu=4,mem=128G&lt;br /&gt;
     default  3-00:00:00               cpu=4,gres/gpu=1,mem=32G&lt;br /&gt;
        tron                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
   huge-long 10-00:00:00             cpu=32,gres/gpu=8,mem=256G&lt;br /&gt;
        clip                                                                                      cpu=339,mem=2926G&lt;br /&gt;
       class                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
       gamma                                                                                      cpu=179,mem=1511G&lt;br /&gt;
         mc2                                                                                      cpu=307,mem=1896G&lt;br /&gt;
        cbcb                                                                                     cpu=913,mem=46931G&lt;br /&gt;
     highmem 21-00:00:00                    cpu=32,mem=2063238M&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please note that in the default non-preemption partition (&amp;lt;code&amp;gt;tron&amp;lt;/code&amp;gt;), you will be restricted to 32 total cores, 256GB total RAM, and 4 total GPUs at once across all jobs you have running in the QoSes allowed by that partition.  This is codified by the reserved QoS also named &amp;lt;code&amp;gt;tron&amp;lt;/code&amp;gt; in the output above. Lab/group-specific partitions may also have similar restrictions across all users in that lab/group that are using the partition (codified by &amp;lt;code&amp;gt;GrpTRES&amp;lt;/code&amp;gt; in the output above for the QoS name that matches the lab/group partition).&lt;br /&gt;
&lt;br /&gt;
To find out what accounts and partitions you have access to, use the &amp;lt;code&amp;gt;show_assoc&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
= Storage =&lt;br /&gt;
All storage available in Nexus is currently [[NFS]] based.  We will be introducing changes in Phase 2 to support high performance GPUDirect Storage (GDS).  These storage allocation procedures will be revised by a joint UMIACS and CSD faculty committee and approved by the launch of Phase 2.&lt;br /&gt;
&lt;br /&gt;
== Home Directories ==&lt;br /&gt;
Home directories in the Nexus computational infrastructure are available from the Institute&#039;s [[NFShomes]] as &lt;code&gt;/nfshomes/USERNAME&lt;/code&gt; where USERNAME is your username.  These home directories have very limited storage (20GB, cannot be increased) and are intended for your personal files, configuration, and source code.  Your home directory is &#039;&#039;&#039;not&#039;&#039;&#039; intended for data sets or other large-scale data holdings.  Users are encouraged to utilize our [[GitLab]] infrastructure to host their code repositories.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE&#039;&#039;&#039;: To check your quota on this directory, use the &lt;code&gt;quota -s&lt;/code&gt; command.&lt;br /&gt;
&lt;br /&gt;
Your home directory data is fully protected: it has [[Snapshots | snapshots]] and is [[NightlyBackups | backed up nightly]].&lt;br /&gt;
&lt;br /&gt;
Other standalone compute clusters have begun to fold into partitions in Nexus.  The corresponding home directories used by these clusters (if not &amp;lt;code&amp;gt;/nfshomes&amp;lt;/code&amp;gt;) will be gradually phased out in favor of the &amp;lt;code&amp;gt;/nfshomes&amp;lt;/code&amp;gt; home directories.&lt;br /&gt;
&lt;br /&gt;
== Scratch Directories ==&lt;br /&gt;
Scratch data has no data protection: it is not snapshotted and is not backed up. There are two types of scratch directories in the Nexus compute infrastructure:&lt;br /&gt;
* Network scratch directories&lt;br /&gt;
* Local scratch directories&lt;br /&gt;
&lt;br /&gt;
Please note that [[ClassAccounts | class accounts]] do not have network scratch directories.&lt;br /&gt;
&lt;br /&gt;
=== Network Scratch Directories ===&lt;br /&gt;
You are allocated 200GB of scratch space via NFS from &lt;code&gt;/fs/nexus-scratch/$username&lt;/code&gt;.  &#039;&#039;&#039;It is not backed up or protected in any way.&#039;&#039;&#039;  This directory is &#039;&#039;&#039;automounted&#039;&#039;&#039;, so you will need to &lt;code&gt;cd&lt;/code&gt; into the directory or specify its fully qualified path to access it.&lt;br /&gt;
&lt;br /&gt;
You may request a permanent increase of up to 400GB total space without any faculty approval by [[HelpDesk | contacting staff]].  If you need space beyond 400GB, you will need faculty approval and/or a project directory.&lt;br /&gt;
&lt;br /&gt;
This file system is available on all submission, data management, and computational nodes within the cluster.&lt;br /&gt;
&lt;br /&gt;
=== Local Scratch Directories ===&lt;br /&gt;
Each computational node that you can schedule compute jobs on also has one or more local scratch directories.  These are always named &lt;code&gt;/scratch0&lt;/code&gt;, &lt;code&gt;/scratch1&lt;/code&gt;, etc.  These are almost always more performant than any other storage available to the job.  However, you must stage your data into these directories within the confines of your job and stage it back out before the end of your job.&lt;br /&gt;
&lt;br /&gt;
These local scratch directories have a tmpwatch job which will &#039;&#039;&#039;delete data unaccessed for 90 days&#039;&#039;&#039;; it is scheduled via maintenance jobs to run once a month at 1am.  Different nodes run the maintenance jobs on different days of the month to ensure the cluster remains highly available at all times.  Please make sure you secure any data you write to these directories before the end of your job.&lt;br /&gt;
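The stage-in / compute / stage-out pattern described above can be sketched as follows. Temporary directories created with &lt;code&gt;mktemp -d&lt;/code&gt; stand in for the real paths (&lt;code&gt;/fs/nexus-scratch&lt;/code&gt; and &lt;code&gt;/scratch0&lt;/code&gt;) so the sketch is runnable outside the cluster; the input data and compute step are purely illustrative.

```shell
# Illustrative sketch of the stage-in / compute / stage-out pattern.
# mktemp stands in for the real Nexus paths so this runs anywhere;
# inside a real job you would use the actual directories.
NET_SCRATCH=$(mktemp -d)    # stands in for /fs/nexus-scratch/$username
LOCAL_SCRATCH=$(mktemp -d)  # stands in for /scratch0
echo "input data" > "$NET_SCRATCH/input.txt"

# 1. Stage in: copy input to the faster local scratch.
cp "$NET_SCRATCH/input.txt" "$LOCAL_SCRATCH/"

# 2. Compute against the local copy (here, just uppercase it).
cat "$LOCAL_SCRATCH/input.txt" | tr 'a-z' 'A-Z' > "$LOCAL_SCRATCH/output.txt"

# 3. Stage out: copy results back before the job ends, since local
#    scratch data may be removed by tmpwatch after 90 days unaccessed.
cp "$LOCAL_SCRATCH/output.txt" "$NET_SCRATCH/"
cat "$NET_SCRATCH/output.txt"    # INPUT DATA
```

In a real job, the stage-out step is the part that must not be skipped: anything left only on local scratch is subject to deletion.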
&lt;br /&gt;
== Faculty Allocations ==&lt;br /&gt;
Each faculty member can be allocated 1TB of lab space upon request.  We can also support grouping these individual allocations together into larger center, lab, or research group allocations if desired by the faculty.  Please [[HelpDesk | contact staff]] to inquire.&lt;br /&gt;
&lt;br /&gt;
This lab space does not have [[Snapshots | snapshots]] by default (they are available upon request), but it is [[NightlyBackups | backed up]].&lt;br /&gt;
&lt;br /&gt;
== Project Allocations ==&lt;br /&gt;
Project allocations are available per user for 270 TB days; for example, you can have a 1TB allocation for up to 270 days, a 3TB allocation for up to 90 days, etc.  A single faculty member cannot have more than 20TB of sponsored account project allocations active at any point.&lt;br /&gt;
&lt;br /&gt;
The minimum storage space you can request (maximum length) is 500GB (540 days) and the minimum allocation length you can request (maximum storage) is 30 days (9TB).&lt;br /&gt;
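The 270 TB-day budget works out to a simple inverse relationship between size and duration; a quick sketch of the arithmetic:

```shell
# Project allocation budget: size_in_TB * days = 270 TB days,
# so the maximum duration for a given size is 270 / size_in_TB.
budget=270
for tb in 1 3 9; do
  echo "${tb}TB for up to $((budget / tb)) days"
done
# 1TB for up to 270 days
# 3TB for up to 90 days
# 9TB for up to 30 days
```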
&lt;br /&gt;
To request an allocation, please [[HelpDesk | contact staff]] with your account sponsor involved in the conversation.  Please include the following details:&lt;br /&gt;
* Project Name (short)&lt;br /&gt;
* Description&lt;br /&gt;
* Size (1TB, 2TB, etc.)&lt;br /&gt;
* Length in days (270 days, 135 days, etc.)&lt;br /&gt;
* Other user(s) that need to access the allocation, if any&lt;br /&gt;
&lt;br /&gt;
These allocations will be available via &amp;lt;code&amp;gt;/fs/nexus-projects/$project_name&amp;lt;/code&amp;gt;.  &#039;&#039;&#039;Renewal is not guaranteed to be available due to limits on the amount of total storage.&#039;&#039;&#039;  Near the end of the allocation period, staff will contact you and ask if you are still in need of the storage allocation.  If you are no longer in need of the storage allocation, you will need to relocate all desired data within 14 days of the end of the allocation period.  Staff will then remove the allocation.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
We have read-only dataset storage available at &amp;lt;code&amp;gt;/fs/nexus-datasets&amp;lt;/code&amp;gt;.  If there are datasets that you would like to see curated and available, please see [[Datasets | this page]].&lt;br /&gt;
&lt;br /&gt;
We will have a more formal process to approve datasets by Phase 2 of Nexus.&lt;br /&gt;
&lt;br /&gt;
= Migrations =&lt;br /&gt;
If you are a user of an existing cluster that is in the process of being folded into Nexus now or in the near future, your cluster-specific migration information will be listed here.&lt;br /&gt;
* [[Nexus/CLIP | CLIP]]&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CBCB&amp;diff=10767</id>
		<title>Nexus/CBCB</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CBCB&amp;diff=10767"/>
		<updated>2022-11-18T16:11:00Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [[Nexus]] computational resources and scheduler house the CBCB&#039;s new computational partition.&lt;br /&gt;
&lt;br /&gt;
= Submission Nodes =&lt;br /&gt;
There are two submission nodes for Nexus available exclusively to CBCB users.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;nexuscbcb00.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;nexuscbcb01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Resources = &lt;br /&gt;
The new CBCB partition has 22 new nodes, each with 32 AMD EPYC-7313 cores and 2000GB of memory.  CBCB users can also submit jobs to other partitions in [[Nexus]] to access resources such as GPUs.&lt;br /&gt;
&lt;br /&gt;
= QoS = &lt;br /&gt;
Currently, CBCB users have access to all the default QoSes in the cbcb partition using the cbcb account.  However, there is one additional QoS called &lt;code&gt;highmem&lt;/code&gt; that allows significantly more memory to be allocated.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_qos&lt;br /&gt;
        Name     MaxWall MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES&lt;br /&gt;
------------ ----------- ------- ------------------------------ ------------------------------ --------------------&lt;br /&gt;
      normal&lt;br /&gt;
   scavenger  2-00:00:00             cpu=64,gres/gpu=8,mem=256G   cpu=192,gres/gpu=24,mem=768G&lt;br /&gt;
      medium  2-00:00:00               cpu=8,gres/gpu=2,mem=64G&lt;br /&gt;
        high  1-00:00:00             cpu=16,gres/gpu=4,mem=128G&lt;br /&gt;
     default  3-00:00:00               cpu=4,gres/gpu=1,mem=32G&lt;br /&gt;
        tron                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
   huge-long 10-00:00:00             cpu=32,gres/gpu=8,mem=256G&lt;br /&gt;
        clip                                                                                      cpu=339,mem=2926G&lt;br /&gt;
       class                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
       gamma                                                                                      cpu=179,mem=1511G&lt;br /&gt;
         mc2                                                                                      cpu=307,mem=1896G&lt;br /&gt;
        cbcb                                                                                     cpu=913,mem=46931G&lt;br /&gt;
     highmem 21-00:00:00                       cpu=32,mem=2000G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Jobs =&lt;br /&gt;
You will need to specify &lt;code&gt;--partition=cbcb&lt;/code&gt;, &lt;code&gt;--account=cbcb&lt;/code&gt;, and a specific &lt;code&gt;--qos&lt;/code&gt; when you submit jobs to the CBCB partition.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[derek@nexuscbcb00:~ ] $ srun --pty --mem=2000G --qos=highmem --partition=cbcb --account=cbcb --time 1-00:00:00 bash&lt;br /&gt;
srun: job 218872 queued and waiting for resources&lt;br /&gt;
srun: job 218872 has been allocated resources&lt;br /&gt;
[derek@cbcb00:~ ] $ scontrol show job 218872&lt;br /&gt;
JobId=218872 JobName=bash&lt;br /&gt;
   UserId=derek(2174) GroupId=derek(22174) MCS_label=N/A&lt;br /&gt;
   Priority=897 Nice=0 Account=cbcb QOS=highmem&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:07 TimeLimit=1-00:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2022-11-18T11:09:28 EligibleTime=2022-11-18T11:09:28&lt;br /&gt;
   AccrueTime=2022-11-18T11:09:28&lt;br /&gt;
   StartTime=2022-11-18T11:09:28 EndTime=2022-11-19T11:09:28 Deadline=N/A&lt;br /&gt;
   PreemptEligibleTime=2022-11-18T11:09:28 PreemptTime=None&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:09:28 Scheduler=Main&lt;br /&gt;
   Partition=cbcb AllocNode:Sid=nexuscbcb00:25443&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=cbcb00&lt;br /&gt;
   BatchHost=cbcb00&lt;br /&gt;
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   TRES=cpu=1,mem=2000G,node=1,billing=2251&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=bash&lt;br /&gt;
   WorkDir=/nfshomes/derek&lt;br /&gt;
   Power=&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Migration =&lt;br /&gt;
&lt;br /&gt;
== Home Directories ==&lt;br /&gt;
The [[Nexus]] uses our [[NFShomes]] home directories, not /cbcbhomes/$USERNAME.  As part of the process of migrating into Nexus, you may need or want to copy any shell customization from your existing &lt;code&gt;/cbcbhomes&lt;/code&gt; directory to your new home directory.  To make this transition easier, &lt;code&gt;/cbcbhomes&lt;/code&gt; is available on the CBCB submission nodes.&lt;br /&gt;
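The copy itself can be sketched as below; temporary directories stand in for the real homes so the sketch runs anywhere, and the dotfile content is illustrative. On a CBCB submission node, you would copy from &lt;code&gt;/cbcbhomes/$USER&lt;/code&gt; into your new &lt;code&gt;/nfshomes&lt;/code&gt; home directory.

```shell
# Temp dirs stand in for the real home directories in this sketch.
OLD_HOME=$(mktemp -d)   # stands in for /cbcbhomes/$USER
NEW_HOME=$(mktemp -d)   # stands in for /nfshomes/$USER
printf 'export EDITOR=vim\n' > "$OLD_HOME/.bashrc"

# Copy shell customization without clobbering anything you may have
# already set up in the new home directory (-n = no-clobber).
cp -n "$OLD_HOME/.bashrc" "$NEW_HOME/"
ls -A "$NEW_HOME"    # .bashrc
```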
&lt;br /&gt;
== Operating System / Software ==&lt;br /&gt;
Previously, CBCB&#039;s cluster ran RHEL7.  The [[Nexus]] runs exclusively RHEL8, so any software you may have compiled may need to be re-compiled to work correctly in this new environment.  The CBCB [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules module tree] for RHEL8 has just been started (and may not be fully populated); if you do not see the modules you need, please reach out to the maintainers.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CBCB&amp;diff=10766</id>
		<title>Nexus/CBCB</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CBCB&amp;diff=10766"/>
		<updated>2022-11-18T16:10:23Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Resources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [[Nexus]] computational resources and scheduler house the CBCB&#039;s new computational partition.&lt;br /&gt;
&lt;br /&gt;
= Submission Nodes =&lt;br /&gt;
There are two submission nodes for Nexus available exclusively to CBCB users.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;nexuscbcb00.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;nexuscbcb01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Resources = &lt;br /&gt;
The new CBCB partition has 22 new nodes, each with 32 AMD EPYC-7313 cores and 2000GB of memory.  CBCB users can also submit jobs to other partitions in [[Nexus]] to access resources such as GPUs.&lt;br /&gt;
&lt;br /&gt;
= QoS = &lt;br /&gt;
Currently, CBCB users have access to all the default QoSes in the cbcb partition using the cbcb account.  However, there is one additional QoS called &lt;code&gt;highmem&lt;/code&gt; that allows significantly more memory to be allocated.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_qos&lt;br /&gt;
        Name     MaxWall MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES&lt;br /&gt;
------------ ----------- ------- ------------------------------ ------------------------------ --------------------&lt;br /&gt;
      normal&lt;br /&gt;
   scavenger  2-00:00:00             cpu=64,gres/gpu=8,mem=256G   cpu=192,gres/gpu=24,mem=768G&lt;br /&gt;
      medium  2-00:00:00               cpu=8,gres/gpu=2,mem=64G&lt;br /&gt;
        high  1-00:00:00             cpu=16,gres/gpu=4,mem=128G&lt;br /&gt;
     default  3-00:00:00               cpu=4,gres/gpu=1,mem=32G&lt;br /&gt;
        tron                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
   huge-long 10-00:00:00             cpu=32,gres/gpu=8,mem=256G&lt;br /&gt;
        clip                                                                                      cpu=339,mem=2926G&lt;br /&gt;
       class                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
       gamma                                                                                      cpu=179,mem=1511G&lt;br /&gt;
         mc2                                                                                      cpu=307,mem=1896G&lt;br /&gt;
        cbcb                                                                                     cpu=913,mem=46931G&lt;br /&gt;
     highmem 21-00:00:00                       cpu=32,mem=2000G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Jobs =&lt;br /&gt;
You will need to specify a &amp;lt;code&amp;gt;--partition&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--account&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt; when you submit jobs to the CBCB partition.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[derek@nexuscbcb00:~ ] $ srun --pty --mem=2000G --qos=highmem --partition=cbcb --account=cbcb --time 1-00:00:00 bash&lt;br /&gt;
srun: job 218872 queued and waiting for resources&lt;br /&gt;
srun: job 218872 has been allocated resources&lt;br /&gt;
[derek@cbcb00:~ ] $ scontrol show job 218872&lt;br /&gt;
JobId=218872 JobName=bash&lt;br /&gt;
   UserId=derek(2174) GroupId=derek(22174) MCS_label=N/A&lt;br /&gt;
   Priority=897 Nice=0 Account=cbcb QOS=highmem&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:07 TimeLimit=1-00:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2022-11-18T11:09:28 EligibleTime=2022-11-18T11:09:28&lt;br /&gt;
   AccrueTime=2022-11-18T11:09:28&lt;br /&gt;
   StartTime=2022-11-18T11:09:28 EndTime=2022-11-19T11:09:28 Deadline=N/A&lt;br /&gt;
   PreemptEligibleTime=2022-11-18T11:09:28 PreemptTime=None&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:09:28 Scheduler=Main&lt;br /&gt;
   Partition=cbcb AllocNode:Sid=nexuscbcb00:25443&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=cbcb00&lt;br /&gt;
   BatchHost=cbcb00&lt;br /&gt;
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   TRES=cpu=1,mem=2000G,node=1,billing=2251&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=bash&lt;br /&gt;
   WorkDir=/nfshomes/derek&lt;br /&gt;
   Power=&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Migration =&lt;br /&gt;
&lt;br /&gt;
== Home Directories ==&lt;br /&gt;
The [[Nexus]] uses our [[NFShomes]] home directories, not /cbcbhomes/$USERNAME.  As part of migrating into Nexus, you may want to copy any shell customizations from your existing &amp;lt;code&amp;gt;/cbcbhomes&amp;lt;/code&amp;gt; directory to your new home directory.  To make this transition easier, &amp;lt;code&amp;gt;/cbcbhomes&amp;lt;/code&amp;gt; is available on the CBCB submission nodes.&lt;br /&gt;
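A minimal sketch of copying shell customizations, run on a CBCB submission node where both home directories are mounted. The dotfile names here are examples only, not a complete inventory; adjust the list to whatever customizations you actually use.

```shell
# Copy a few common dotfiles from the old home directory to the new one.
# The file list is illustrative; edit it before running.
for f in .bashrc .bash_profile .vimrc; do
    if [ -f "/cbcbhomes/$USER/$f" ]; then
        # -n: never overwrite a file that already exists in the new home
        cp -n "/cbcbhomes/$USER/$f" "$HOME/"
    fi
done
```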
&lt;br /&gt;
== Operating System / Software ==&lt;br /&gt;
Previously, CBCB&#039;s cluster ran RHEL7.  The [[Nexus]] runs exclusively RHEL8, so any software you have compiled may need to be re-compiled to work correctly in this new environment.  The CBCB [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules module tree] for RHEL8 has just been started (and may not be fully populated); if you do not see the modules you need, please reach out to the maintainers.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CBCB&amp;diff=10765</id>
		<title>Nexus/CBCB</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CBCB&amp;diff=10765"/>
		<updated>2022-11-18T16:09:48Z</updated>

		<summary type="html">&lt;p&gt;Derek: Created page with &amp;quot;The Nexus computational resources and scheduler house the CBCB&amp;#039;s new computational partition.  = Submission Nodes = There are two submission nodes for Nexus exclusively av...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [[Nexus]] computational resources and scheduler house the CBCB&#039;s new computational partition.&lt;br /&gt;
&lt;br /&gt;
= Submission Nodes =&lt;br /&gt;
There are two submission nodes for Nexus exclusively available for CBCB users.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;nexuscbcb00.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;nexuscbcb01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Resources = &lt;br /&gt;
The new CBCB partition has 22 new nodes, each with 32 AMD EPYC-7313 cores and 2000GB of memory.  CBCB users can also submit jobs to other partitions in [[Nexus]].&lt;br /&gt;
&lt;br /&gt;
= QoS = &lt;br /&gt;
Currently, CBCB users have access to all the default QoS options in the cbcb partition using the cbcb account.  However, there is one additional QoS called &amp;lt;code&amp;gt;highmem&amp;lt;/code&amp;gt; that allows significantly more memory to be allocated.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_qos&lt;br /&gt;
        Name     MaxWall MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES&lt;br /&gt;
------------ ----------- ------- ------------------------------ ------------------------------ --------------------&lt;br /&gt;
      normal&lt;br /&gt;
   scavenger  2-00:00:00             cpu=64,gres/gpu=8,mem=256G   cpu=192,gres/gpu=24,mem=768G&lt;br /&gt;
      medium  2-00:00:00               cpu=8,gres/gpu=2,mem=64G&lt;br /&gt;
        high  1-00:00:00             cpu=16,gres/gpu=4,mem=128G&lt;br /&gt;
     default  3-00:00:00               cpu=4,gres/gpu=1,mem=32G&lt;br /&gt;
        tron                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
   huge-long 10-00:00:00             cpu=32,gres/gpu=8,mem=256G&lt;br /&gt;
        clip                                                                                      cpu=339,mem=2926G&lt;br /&gt;
       class                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
       gamma                                                                                      cpu=179,mem=1511G&lt;br /&gt;
         mc2                                                                                      cpu=307,mem=1896G&lt;br /&gt;
        cbcb                                                                                     cpu=913,mem=46931G&lt;br /&gt;
     highmem 21-00:00:00                       cpu=32,mem=2000G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Jobs =&lt;br /&gt;
You will need to specify a &amp;lt;code&amp;gt;--partition&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--account&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt; when you submit jobs to the CBCB partition.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[derek@nexuscbcb00:~ ] $ srun --pty --mem=2000G --qos=highmem --partition=cbcb --account=cbcb --time 1-00:00:00 bash&lt;br /&gt;
srun: job 218872 queued and waiting for resources&lt;br /&gt;
srun: job 218872 has been allocated resources&lt;br /&gt;
[derek@cbcb00:~ ] $ scontrol show job 218872&lt;br /&gt;
JobId=218872 JobName=bash&lt;br /&gt;
   UserId=derek(2174) GroupId=derek(22174) MCS_label=N/A&lt;br /&gt;
   Priority=897 Nice=0 Account=cbcb QOS=highmem&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0&lt;br /&gt;
   RunTime=00:00:07 TimeLimit=1-00:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2022-11-18T11:09:28 EligibleTime=2022-11-18T11:09:28&lt;br /&gt;
   AccrueTime=2022-11-18T11:09:28&lt;br /&gt;
   StartTime=2022-11-18T11:09:28 EndTime=2022-11-19T11:09:28 Deadline=N/A&lt;br /&gt;
   PreemptEligibleTime=2022-11-18T11:09:28 PreemptTime=None&lt;br /&gt;
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-18T11:09:28 Scheduler=Main&lt;br /&gt;
   Partition=cbcb AllocNode:Sid=nexuscbcb00:25443&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=cbcb00&lt;br /&gt;
   BatchHost=cbcb00&lt;br /&gt;
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*&lt;br /&gt;
   TRES=cpu=1,mem=2000G,node=1,billing=2251&lt;br /&gt;
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=2000G MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) DelayBoot=00:00:00&lt;br /&gt;
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=bash&lt;br /&gt;
   WorkDir=/nfshomes/derek&lt;br /&gt;
   Power=&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Migration =&lt;br /&gt;
&lt;br /&gt;
== Home Directories ==&lt;br /&gt;
The [[Nexus]] uses our [[NFShomes]] home directories, not /cbcbhomes/$USERNAME.  As part of migrating into Nexus, you may want to copy any shell customizations from your existing &amp;lt;code&amp;gt;/cbcbhomes&amp;lt;/code&amp;gt; directory to your new home directory.  To make this transition easier, &amp;lt;code&amp;gt;/cbcbhomes&amp;lt;/code&amp;gt; is available on the CBCB submission nodes.&lt;br /&gt;
&lt;br /&gt;
== Operating System / Software ==&lt;br /&gt;
Previously, CBCB&#039;s cluster ran RHEL7.  The [[Nexus]] runs exclusively RHEL8, so any software you have compiled may need to be re-compiled to work correctly in this new environment.  The CBCB [https://wiki.umiacs.umd.edu/cbcb/index.php/CBCB_Software_Modules module tree] for RHEL8 has just been started (and may not be fully populated); if you do not see the modules you need, please reach out to the maintainers.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CLIP&amp;diff=10758</id>
		<title>Nexus/CLIP</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CLIP&amp;diff=10758"/>
		<updated>2022-11-16T14:57:16Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Timeline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Overview==&lt;br /&gt;
The [https://wiki.umiacs.umd.edu/clip/index.php/Main_Page CLIP] lab&#039;s cluster compute nodes will be gradually folded into UMIACS&#039; new [[Nexus]] cluster beginning on Monday, July 25th, 2022 at 9am in order to further the goal of consolidating all compute nodes in UMIACS onto one common [[SLURM]] scheduler.&lt;br /&gt;
&lt;br /&gt;
The Nexus cluster already has a large pool of compute resources made possible through leftover funding for the [[Iribe | Brendan Iribe Center]]. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].&lt;br /&gt;
&lt;br /&gt;
As part of the transition, compute nodes will be reinstalled with Red Hat Enterprise Linux 8 (RHEL8) as their operating system. The nodes are currently installed with Red Hat Enterprise Linux 7 (RHEL7). GPU compute nodes&#039; names will also change to just &amp;lt;code&amp;gt;clip##&amp;lt;/code&amp;gt; for consistency with Nexus&#039; naming scheme. CPU-only compute nodes will fold into the &amp;lt;tt&amp;gt;legacy&amp;lt;/tt&amp;gt; partition and will thus be named &amp;lt;code&amp;gt;legacy##&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Data stored on the local scratch drives of compute nodes (/scratch0, /scratch1, etc.) will not persist through the reinstalls. Please secure all data in these local scratch drives to a network-attached storage location prior to each node&#039;s move date as listed below.&lt;br /&gt;
&lt;br /&gt;
You may need to re-compile or re-link your applications due to the changes to the underlying operating system libraries. We have tried to maintain a similar set of software in our GNU [[Modules]] software trees for both operating systems. However, please let us know if something is missing after the upgrades.&lt;br /&gt;
&lt;br /&gt;
In addition, the general purpose nodes &amp;lt;code&amp;gt;context00.umiacs.umd.edu&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;context01.umiacs.umd.edu&amp;lt;/code&amp;gt; were retired on Tuesday, September 6th, 2022 at 9am. Please use &amp;lt;code&amp;gt;clipsub00.umiacs.umd.edu&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clipsub01.umiacs.umd.edu&amp;lt;/code&amp;gt; (or the &amp;lt;code&amp;gt;nexusclip&amp;lt;/code&amp;gt; submission nodes) for any general purpose CLIP compute needs.&lt;br /&gt;
&lt;br /&gt;
Lastly, /cliphomes directories will be deprecated sometime in the coming year. The Nexus cluster uses [[NFShomes | /nfshomes]] directories for home directory storage space. There will be a future announcement about this deprecation that includes a concrete date after the cluster node moves are done or nearly done. /cliphomes will be made read-only once the cluster node moves are done.&lt;br /&gt;
&lt;br /&gt;
Please see the [[#Timeline | Timeline]] section below for concrete dates in chronological order.&lt;br /&gt;
&lt;br /&gt;
Please [[HelpDesk | contact staff]] with any questions or concerns.&lt;br /&gt;
&lt;br /&gt;
==Usage==&lt;br /&gt;
The Nexus cluster submission nodes that are allocated to CLIP are &amp;lt;code&amp;gt;nexusclip00.umiacs.umd.edu&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;nexusclip01.umiacs.umd.edu&amp;lt;/code&amp;gt;. You will need to log onto one of these submission nodes to use the moved compute nodes. Submission from &amp;lt;code&amp;gt;clipsub00.umiacs.umd.edu&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;clipsub01.umiacs.umd.edu&amp;lt;/code&amp;gt; will not work.&lt;br /&gt;
&lt;br /&gt;
CLIP users (exclusively) can schedule non-interruptible jobs on the moved nodes, both GPU-capable and CPU-only, by including the &amp;lt;code&amp;gt;--partition=clip&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--account=clip&amp;lt;/code&amp;gt; submission arguments. Please note that the partition has a &amp;lt;code&amp;gt;GrpTRES&amp;lt;/code&amp;gt; limit of 100% of the available cores/RAM on clip## nodes plus 50% of the available cores/RAM on legacy## nodes, so your job may need to wait if all available cores/RAM (or GPUs) are in use.&lt;br /&gt;
&lt;br /&gt;
The Quality of Service (QoS) options present on the CLIP SLURM scheduler will not be migrated into the Nexus SLURM scheduler by default. The &amp;lt;code&amp;gt;huge-long&amp;lt;/code&amp;gt; QoS can be used to request resources beyond those available in the universal Nexus QoSes listed [[Nexus#Quality_of_Service_.28QoS.29 | here]]. If you are interested in migrating a QoS from the CLIP scheduler to the Nexus scheduler, please [[HelpDesk | contact staff]] and we will evaluate the request.&lt;br /&gt;
&lt;br /&gt;
==Timeline==&lt;br /&gt;
All events may begin as early as 9am US Eastern time on the dates indicated. Each event will be completed by the end of that business week (i.e., Friday at 5pm).&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Date&lt;br /&gt;
! Event&lt;br /&gt;
|-&lt;br /&gt;
| July 25th 2022&lt;br /&gt;
| &amp;lt;code&amp;gt;clipgpu00&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clipgpu01&amp;lt;/code&amp;gt; are moved into Nexus as &amp;lt;code&amp;gt;clip00&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clip01&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| August 1st 2022&lt;br /&gt;
| &amp;lt;code&amp;gt;clipgpu02&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clipgpu03&amp;lt;/code&amp;gt; are moved into Nexus as &amp;lt;code&amp;gt;clip02&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clip03&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| August 8th 2022&lt;br /&gt;
| &amp;lt;code&amp;gt;clipgpu04&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clipgpu05&amp;lt;/code&amp;gt; are moved into Nexus as &amp;lt;code&amp;gt;clip04&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clip05&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| August 15th 2022&lt;br /&gt;
| &amp;lt;code&amp;gt;clipgpu06&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;materialgpu00&amp;lt;/code&amp;gt; are moved into Nexus as &amp;lt;code&amp;gt;clip06&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clip07&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| August 22nd 2022&lt;br /&gt;
| &amp;lt;code&amp;gt;materialgpu01&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;materialgpu02&amp;lt;/code&amp;gt; are moved into Nexus as &amp;lt;code&amp;gt;clip08&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clip09&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| September 6th 2022&lt;br /&gt;
| &amp;lt;code&amp;gt;context00&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;context01&amp;lt;/code&amp;gt; are taken offline&lt;br /&gt;
|-&lt;br /&gt;
| Fall 2022&lt;br /&gt;
| Announcement is made about the deprecation of &amp;lt;code&amp;gt;/fs/cliphomes&amp;lt;/code&amp;gt; directories&lt;br /&gt;
|- &lt;br /&gt;
| January 2nd 2023&lt;br /&gt;
| &amp;lt;code&amp;gt;chroneme[04,06-07], d[41-46], phoneme[00-09]&amp;lt;/code&amp;gt; will be re-installed as legacy nodes and made available in Nexus&lt;br /&gt;
|-&lt;br /&gt;
| February 1st 2023&lt;br /&gt;
| Retirement of the old clip cluster infrastructure (scheduler and nodes).  Submission nodes will continue to be available and will have a 30-day deprecation notice.&lt;br /&gt;
|-&lt;br /&gt;
| March 1st 2023&lt;br /&gt;
| Retirement of the submission nodes &amp;lt;code&amp;gt;clipsub00.umiacs.umd.edu&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clipsub01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CLIP&amp;diff=10757</id>
		<title>Nexus/CLIP</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/CLIP&amp;diff=10757"/>
		<updated>2022-11-16T14:56:18Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Timeline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Overview==&lt;br /&gt;
The [https://wiki.umiacs.umd.edu/clip/index.php/Main_Page CLIP] lab&#039;s cluster compute nodes will be gradually folded into UMIACS&#039; new [[Nexus]] cluster beginning on Monday, July 25th, 2022 at 9am in order to further the goal of consolidating all compute nodes in UMIACS onto one common [[SLURM]] scheduler.&lt;br /&gt;
&lt;br /&gt;
The Nexus cluster already has a large pool of compute resources made possible through leftover funding for the [[Iribe | Brendan Iribe Center]]. Details on common nodes already in the cluster (Tron partition) can be found [[Nexus/Tron | here]].&lt;br /&gt;
&lt;br /&gt;
As part of the transition, compute nodes will be reinstalled with Red Hat Enterprise Linux 8 (RHEL8) as their operating system. The nodes are currently installed with Red Hat Enterprise Linux 7 (RHEL7). GPU compute nodes&#039; names will also change to just &amp;lt;code&amp;gt;clip##&amp;lt;/code&amp;gt; for consistency with Nexus&#039; naming scheme. CPU-only compute nodes will fold into the &amp;lt;tt&amp;gt;legacy&amp;lt;/tt&amp;gt; partition and will thus be named &amp;lt;code&amp;gt;legacy##&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Data stored on the local scratch drives of compute nodes (/scratch0, /scratch1, etc.) will not persist through the reinstalls. Please secure all data in these local scratch drives to a network-attached storage location prior to each node&#039;s move date as listed below.&lt;br /&gt;
&lt;br /&gt;
You may need to re-compile or re-link your applications due to the changes to the underlying operating system libraries. We have tried to maintain a similar set of software in our GNU [[Modules]] software trees for both operating systems. However, please let us know if something is missing after the upgrades.&lt;br /&gt;
&lt;br /&gt;
In addition, the general purpose nodes &amp;lt;code&amp;gt;context00.umiacs.umd.edu&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;context01.umiacs.umd.edu&amp;lt;/code&amp;gt; were retired on Tuesday, September 6th, 2022 at 9am. Please use &amp;lt;code&amp;gt;clipsub00.umiacs.umd.edu&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clipsub01.umiacs.umd.edu&amp;lt;/code&amp;gt; (or the &amp;lt;code&amp;gt;nexusclip&amp;lt;/code&amp;gt; submission nodes) for any general purpose CLIP compute needs.&lt;br /&gt;
&lt;br /&gt;
Lastly, /cliphomes directories will be deprecated sometime in the coming year. The Nexus cluster uses [[NFShomes | /nfshomes]] directories for home directory storage space. There will be a future announcement about this deprecation that includes a concrete date after the cluster node moves are done or nearly done. /cliphomes will be made read-only once the cluster node moves are done.&lt;br /&gt;
&lt;br /&gt;
Please see the [[#Timeline | Timeline]] section below for concrete dates in chronological order.&lt;br /&gt;
&lt;br /&gt;
Please [[HelpDesk | contact staff]] with any questions or concerns.&lt;br /&gt;
&lt;br /&gt;
==Usage==&lt;br /&gt;
The Nexus cluster submission nodes that are allocated to CLIP are &amp;lt;code&amp;gt;nexusclip00.umiacs.umd.edu&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;nexusclip01.umiacs.umd.edu&amp;lt;/code&amp;gt;. You will need to log onto one of these submission nodes to use the moved compute nodes. Submission from &amp;lt;code&amp;gt;clipsub00.umiacs.umd.edu&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;clipsub01.umiacs.umd.edu&amp;lt;/code&amp;gt; will not work.&lt;br /&gt;
&lt;br /&gt;
CLIP users (exclusively) can schedule non-interruptible jobs on the moved nodes, both GPU-capable and CPU-only, by including the &amp;lt;code&amp;gt;--partition=clip&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--account=clip&amp;lt;/code&amp;gt; submission arguments. Please note that the partition has a &amp;lt;code&amp;gt;GrpTRES&amp;lt;/code&amp;gt; limit of 100% of the available cores/RAM on clip## nodes plus 50% of the available cores/RAM on legacy## nodes, so your job may need to wait if all available cores/RAM (or GPUs) are in use.&lt;br /&gt;
&lt;br /&gt;
The Quality of Service (QoS) options present on the CLIP SLURM scheduler will not be migrated into the Nexus SLURM scheduler by default. The &amp;lt;code&amp;gt;huge-long&amp;lt;/code&amp;gt; QoS can be used to request resources beyond those available in the universal Nexus QoSes listed [[Nexus#Quality_of_Service_.28QoS.29 | here]]. If you are interested in migrating a QoS from the CLIP scheduler to the Nexus scheduler, please [[HelpDesk | contact staff]] and we will evaluate the request.&lt;br /&gt;
&lt;br /&gt;
==Timeline==&lt;br /&gt;
All events may begin as early as 9am US Eastern time on the dates indicated. Each event will be completed by the end of that business week (i.e., Friday at 5pm).&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Date&lt;br /&gt;
! Event&lt;br /&gt;
|-&lt;br /&gt;
| July 25th 2022&lt;br /&gt;
| &amp;lt;code&amp;gt;clipgpu00&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clipgpu01&amp;lt;/code&amp;gt; are moved into Nexus as &amp;lt;code&amp;gt;clip00&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clip01&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| August 1st 2022&lt;br /&gt;
| &amp;lt;code&amp;gt;clipgpu02&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clipgpu03&amp;lt;/code&amp;gt; are moved into Nexus as &amp;lt;code&amp;gt;clip02&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clip03&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| August 8th 2022&lt;br /&gt;
| &amp;lt;code&amp;gt;clipgpu04&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clipgpu05&amp;lt;/code&amp;gt; are moved into Nexus as &amp;lt;code&amp;gt;clip04&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clip05&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| August 15th 2022&lt;br /&gt;
| &amp;lt;code&amp;gt;clipgpu06&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;materialgpu00&amp;lt;/code&amp;gt; are moved into Nexus as &amp;lt;code&amp;gt;clip06&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clip07&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| August 22nd 2022&lt;br /&gt;
| &amp;lt;code&amp;gt;materialgpu01&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;materialgpu02&amp;lt;/code&amp;gt; are moved into Nexus as &amp;lt;code&amp;gt;clip08&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clip09&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| September 6th 2022&lt;br /&gt;
| &amp;lt;code&amp;gt;context00&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;context01&amp;lt;/code&amp;gt; are taken offline&lt;br /&gt;
|-&lt;br /&gt;
| Ongoing&lt;br /&gt;
| CPU-only compute nodes are moved into Nexus as &amp;lt;code&amp;gt;legacy##&amp;lt;/code&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Fall 2022&lt;br /&gt;
| Announcement is made about the deprecation of &amp;lt;code&amp;gt;/fs/cliphomes&amp;lt;/code&amp;gt; directories&lt;br /&gt;
|- &lt;br /&gt;
| January 2nd 2023&lt;br /&gt;
| &amp;lt;code&amp;gt;chroneme[04,06-07], d[41-46], phoneme[00-09]&amp;lt;/code&amp;gt; will be re-installed as legacy nodes and made available in Nexus&lt;br /&gt;
|-&lt;br /&gt;
| February 1st 2023&lt;br /&gt;
| Retirement of the old clip cluster infrastructure (scheduler and nodes).  Submission nodes will continue to be available.&lt;br /&gt;
|-&lt;br /&gt;
| March 1st 2023&lt;br /&gt;
| Retirement of the submission nodes &lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus&amp;diff=10749</id>
		<title>Nexus</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus&amp;diff=10749"/>
		<updated>2022-11-09T15:00:36Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Quality of Service (QoS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The Nexus is the combined scheduler of resources in UMIACS.  Many of our existing computational clusters that have discrete schedulers will be folded into this scheduler in the future (see [[#Migrations | below]]).  The resource manager for Nexus (as with our other existing computational clusters) is [[SLURM]].  Resources are arranged into partitions where users are able to schedule computational jobs.  Users are arranged into a number of SLURM accounts based on faculty, lab, or center investments.&lt;br /&gt;
&lt;br /&gt;
= Getting Started =&lt;br /&gt;
All accounts in UMIACS are sponsored.  If you don&#039;t already have a UMIACS account, please see [[Accounts]] for information on getting one.  You need a full UMIACS account (not a [[Accounts/Collaborator | collaborator account]]) in order to access Nexus.&lt;br /&gt;
&lt;br /&gt;
== Access ==&lt;br /&gt;
The submission nodes for the Nexus computational resources are determined by department, center, or lab affiliation.  You can log into the [https://intranet.umiacs.umd.edu/directory/cr/ UMIACS Directory CR application] and select the Computational Resource (CR) in the list that has the prefix &amp;lt;code&amp;gt;nexus&amp;lt;/code&amp;gt;. The Hosts section lists your available login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039; - UMIACS requires multi-factor authentication through our [[Duo]] instance.  This is completely discrete from both UMD&#039;s and CSD&#039;s Duo instances.  You will need to enroll one or more devices to access resources in UMIACS, and will be prompted to enroll when you log into the Directory application for the first time.&lt;br /&gt;
&lt;br /&gt;
Once you have identified your submission nodes, you can [[SSH]] directly into them.  From there, you are able to submit to the cluster via our [[SLURM]] workload manager.  You need to make sure that your submitted jobs have the correct account, partition, and qos.&lt;br /&gt;
&lt;br /&gt;
== Jobs ==&lt;br /&gt;
[[SLURM]] jobs are submitted with either &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, depending on whether you are running an interactive or a batch job, respectively.  You need to provide the where/how/who for the job and specify the resources you need to run with.&lt;br /&gt;
&lt;br /&gt;
For the where/how/who, you may be required to specify &amp;lt;code&amp;gt;--partition&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt;, and/or &amp;lt;code&amp;gt;--account&amp;lt;/code&amp;gt; (respectively) when submitting jobs to the Nexus.&lt;br /&gt;
&lt;br /&gt;
For resources, you may need to specify &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; for wall time, &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; for CPUs, &amp;lt;code&amp;gt;--mem&amp;lt;/code&amp;gt; for RAM, and &amp;lt;code&amp;gt;--gres=gpu&amp;lt;/code&amp;gt; for GPUs in your submission arguments to meet your requirements.  There are defaults for all four, so if you don&#039;t specify something, you may be scheduled with a very minimal set of time and resources (e.g., by default, NO GPUs are included if you do not specify &amp;lt;code&amp;gt;--gres=gpu&amp;lt;/code&amp;gt;).  For more information about submission flags for GPU resources, see [[SLURM/JobSubmission#Requesting_GPUs]].  You can also run &amp;lt;code&amp;gt;man srun&amp;lt;/code&amp;gt; on your submission node for a complete list of available submission arguments.&lt;br /&gt;
&lt;br /&gt;
=== Interactive ===&lt;br /&gt;
Once logged into a submission node, you can run simple interactive jobs.  If your session is interrupted from the submission node, the job will be killed.  As such, we encourage use of a terminal multiplexer such as [[Tmux]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --pty --ntasks 4 --mem=2gb --gres=gpu:1 nvidia-smi -L&lt;br /&gt;
GPU 0: NVIDIA RTX A4000 (UUID: GPU-ae5dc1f5-c266-5b9f-58d5-7976e62b3ca1)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Batch ===&lt;br /&gt;
Batch jobs are scheduled with a script file with an optional ability to embed job scheduling parameters via variables that are defined by &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; lines at the top of the file.  You can find some examples in our [[SLURM/JobSubmission]] documentation.&lt;br /&gt;
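As a sketch of such a script: the partition, account, and QoS values below are placeholders, not a prescription; substitute the ones you actually have access to.

```shell
#!/bin/bash
#SBATCH --job-name=example-job      # illustrative job name
#SBATCH --partition=tron            # placeholder: use your partition
#SBATCH --account=nexus             # placeholder: use your SLURM account
#SBATCH --qos=default
#SBATCH --ntasks=4                  # 4 CPUs
#SBATCH --mem=2gb
#SBATCH --time=01:00:00             # 1 hour wall time

# The body is an ordinary shell script; SLURM runs it on the allocated node.
hostname
```

The script would then be submitted with something like &amp;lt;code&amp;gt;sbatch script.sh&amp;lt;/code&amp;gt;; the &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; comment lines are read by SLURM as if they were command-line arguments.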
&lt;br /&gt;
= Partitions = &lt;br /&gt;
The SLURM resource manager uses partitions that act as job queues, which can enforce size, time, and user limits.  The Nexus (when fully operational) will have a number of different partitions of resources.  Different Centers, Labs, and Faculty will be able to invest in computational resources that will be restricted to approved users through these partitions.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Partitions usable by all non-[[ClassAccounts |class account]] users:&#039;&#039;&#039;&lt;br /&gt;
* [[Nexus/Tron]] - Pool of resources available to all UMIACS and CSD faculty and graduate students.&lt;br /&gt;
* Scavenger - [https://slurm.schedmd.com/preempt.html Preemption] partition that supports nodes from multiple other partitions.  More resources are available to schedule simultaneously than in other partitions, however jobs are subject to preemption rules.  You are responsible for ensuring your jobs handle this preemption correctly.  The SLURM scheduler will simply restart a preempted job with the same submission arguments when it is available to run again.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Partitions usable by [[ClassAccounts]]:&#039;&#039;&#039;&lt;br /&gt;
* [[ClassAccounts | Class]] - Pool available for UMIACS/CSD class accounts.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Partitions usable by specific lab/center users:&#039;&#039;&#039;&lt;br /&gt;
* [[Nexus/CLIP]] - CLIP lab pool available for CLIP lab members.&lt;br /&gt;
* [[Nexus/Gamma]] - GAMMA lab pool available for GAMMA lab members.&lt;br /&gt;
* [[Nexus/MC2]] - MC2 lab pool available for MC2 lab members.&lt;br /&gt;
&lt;br /&gt;
= Quality of Service (QoS) =&lt;br /&gt;
SLURM uses a QoS to provide limits on job sizes to users.  Note that you should still try to allocate only the minimum resources for your jobs, as the resources that each of your jobs schedules are counted against your [https://slurm.schedmd.com/fair_tree.html FairShare priority] in the future.&lt;br /&gt;
* default - Default QoS. Limited to 4 cores, 32GB RAM, and 1 GPU per job.  The maximum wall time per job is 3 days.&lt;br /&gt;
* medium - Limited to 8 cores, 64GB RAM, and 2 GPUs per job.  The maximum wall time per job is 2 days.&lt;br /&gt;
* high - Limited to 16 cores, 128GB RAM, and 4 GPUs per job.  The maximum wall time per job is 1 day.&lt;br /&gt;
* scavenger - Limited to 64 cores, 256GB RAM, and 8 GPUs per job.  The maximum wall time per job is 2 days.  Only 192 total cores, 768GB total RAM, and 24 total GPUs are permitted simultaneously across all of your jobs running in this QoS.  This QoS is available only in the scavenger partition, and it is the only QoS available in that partition. To use it, include &amp;lt;code&amp;gt;--partition=scavenger&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--account=scavenger&amp;lt;/code&amp;gt; in your submission arguments. Do not include any QoS argument other than the optional &amp;lt;code&amp;gt;--qos=scavenger&amp;lt;/code&amp;gt;, or the submission will fail.&lt;br /&gt;
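For example, an interactive scavenger submission might look like the following sketch (the resource sizes and the &amp;lt;code&amp;gt;train.py&amp;lt;/code&amp;gt; command are hypothetical):&lt;br /&gt;

```shell
srun --partition=scavenger --account=scavenger --qos=scavenger \
     --ntasks=4 --mem=32gb --gres=gpu:2 --time=1-00:00:00 \
     python train.py
```

Since the scheduler restarts preempted jobs with the same submission arguments, a long-running command like this should checkpoint its progress and resume from the latest checkpoint on startup.&lt;br /&gt;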
&lt;br /&gt;
You can display these QoSes from the command line using the &amp;lt;code&amp;gt;show_qos&amp;lt;/code&amp;gt; command. Other partition-specific, lab- or group-specific, or reserved QoSes may also appear in the listing; the four QoSes above are the ones that everyone can submit to.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# show_qos&lt;br /&gt;
        Name     MaxWall MaxJobs                        MaxTRES                      MaxTRESPU              GrpTRES&lt;br /&gt;
------------ ----------- ------- ------------------------------ ------------------------------ --------------------&lt;br /&gt;
      normal&lt;br /&gt;
   scavenger  2-00:00:00             cpu=64,gres/gpu=8,mem=256G   cpu=192,gres/gpu=24,mem=768G&lt;br /&gt;
      medium  2-00:00:00               cpu=8,gres/gpu=2,mem=64G&lt;br /&gt;
        high  1-00:00:00             cpu=16,gres/gpu=4,mem=128G&lt;br /&gt;
     default  3-00:00:00               cpu=4,gres/gpu=1,mem=32G&lt;br /&gt;
        tron                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
   huge-long 10-00:00:00             cpu=32,gres/gpu=8,mem=256G&lt;br /&gt;
        clip                                                                                      cpu=334,mem=2830G&lt;br /&gt;
       class                                                        cpu=32,gres/gpu=4,mem=256G&lt;br /&gt;
       gamma                                                                                      cpu=174,mem=1415G&lt;br /&gt;
         mc2                                                                                      cpu=302,mem=1800G&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please note that in the default non-preemption partition (&amp;lt;code&amp;gt;tron&amp;lt;/code&amp;gt;), you will be restricted to 32 total cores, 256GB total RAM, and 4 total GPUs at once across all jobs you have running in the QoSes allowed by that partition.  This is codified by the reserved QoS also named &amp;lt;code&amp;gt;tron&amp;lt;/code&amp;gt; in the output above. Lab/group-specific partitions may also have similar restrictions across all users in that lab/group that are using the partition (codified by &amp;lt;code&amp;gt;GrpTRES&amp;lt;/code&amp;gt; in the output above for the QoS name that matches the lab/group partition).&lt;br /&gt;
&lt;br /&gt;
To find out what accounts and partitions you have access to, use the &amp;lt;code&amp;gt;show_assoc&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
= Storage =&lt;br /&gt;
All storage available in Nexus is currently [[NFS]] based.  We will be introducing some changes for Phase 2 to support high performance GPUDirect Storage (GDS).  These storage allocation procedures will be revised and approved by a joint UMIACS and CSD faculty committee by the launch of Phase 2.&lt;br /&gt;
&lt;br /&gt;
== Home Directories ==&lt;br /&gt;
Home directories in the Nexus computational infrastructure are available from the Institute&#039;s [[NFShomes]] as &amp;lt;code&amp;gt;/nfshomes/USERNAME&amp;lt;/code&amp;gt;, where USERNAME is your username.  These home directories have very limited storage (20GB, cannot be increased) and are intended for your personal files, configuration, and source code.  Your home directory is &#039;&#039;&#039;not&#039;&#039;&#039; intended for data sets or other large scale data holdings.  You are encouraged to utilize our [[GitLab]] infrastructure to host your code repositories.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE&#039;&#039;&#039;: To check your quota on this directory you will need to use the &amp;lt;code&amp;gt;quota -s&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
Your home directory data is fully protected and has both [[Snapshots | snapshots]] and is [[NightlyBackups | backed up nightly]].&lt;br /&gt;
&lt;br /&gt;
Other standalone compute clusters have begun to fold into partitions in Nexus.  The corresponding home directories used by these clusters (if not &amp;lt;code&amp;gt;/nfshomes&amp;lt;/code&amp;gt;) will be gradually phased out in favor of the &amp;lt;code&amp;gt;/nfshomes&amp;lt;/code&amp;gt; home directories.&lt;br /&gt;
&lt;br /&gt;
== Scratch Directories ==&lt;br /&gt;
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the Nexus compute infrastructure:&lt;br /&gt;
* Network scratch directories&lt;br /&gt;
* Local scratch directories&lt;br /&gt;
&lt;br /&gt;
Please note that [[ClassAccounts | class accounts]] do not have network scratch directories.&lt;br /&gt;
&lt;br /&gt;
=== Network Scratch Directories ===&lt;br /&gt;
You are allocated 200GB of scratch space via NFS from &amp;lt;code&amp;gt;/fs/nexus-scratch/$username&amp;lt;/code&amp;gt;.  &#039;&#039;&#039;It is not backed up or protected in any way.&#039;&#039;&#039;  This directory is &#039;&#039;&#039;automounted&#039;&#039;&#039;, so you will need to &amp;lt;code&amp;gt;cd&amp;lt;/code&amp;gt; into the directory or specify a fully qualified file path to access it.&lt;br /&gt;
&lt;br /&gt;
You may request a permanent increase of up to 400GB total space without any faculty approval by [[HelpDesk | contacting staff]].  If you need space beyond 400GB, you will need faculty approval and/or a project directory.&lt;br /&gt;
&lt;br /&gt;
This file system is available on all submission, data management, and computational nodes within the cluster.&lt;br /&gt;
&lt;br /&gt;
=== Local Scratch Directories ===&lt;br /&gt;
Each computational node that you can schedule compute jobs on also has one or more local scratch directories.  These are always named &amp;lt;code&amp;gt;/scratch0&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;/scratch1&amp;lt;/code&amp;gt;, etc.  These are almost always more performant than any other storage available to the job.  However, you must stage your data in within the confines of your job and stage it back out before the end of your job.&lt;br /&gt;
&lt;br /&gt;
These local scratch directories have a tmpwatch job which will &#039;&#039;&#039;delete data that has not been accessed in 90 days&#039;&#039;&#039;; these maintenance jobs run once a month at 1am.  Different nodes run the maintenance jobs on different days of the month to ensure the cluster remains highly available at all times.  Please make sure you secure any data you write to these directories before the end of your job.&lt;br /&gt;
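The stage-in, compute, stage-out pattern described above can be sketched as a batch script (the paths, resource sizes, and compute step are hypothetical):&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --mem=16gb
#SBATCH --time=12:00:00

# Stage data in to fast local scratch (hypothetical source path)
JOBDIR="/scratch0/$USER/$SLURM_JOB_ID"
mkdir -p "$JOBDIR"
cp -r "/fs/nexus-scratch/$USER/dataset" "$JOBDIR/"

# ... run your computation against "$JOBDIR" here ...

# Stage results back out and clean up before the job ends
cp -r "$JOBDIR/results" "/fs/nexus-scratch/$USER/"
rm -rf "$JOBDIR"
```

Anything left behind in a local scratch directory is subject to the 90-day tmpwatch cleanup described above.&lt;br /&gt;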
&lt;br /&gt;
== Faculty Allocations ==&lt;br /&gt;
Each faculty member can be allocated 1TB of lab space upon request.  We can also support grouping these individual allocations together into larger center, lab, or research group allocations if desired by the faculty.  Please [[HelpDesk | contact staff]] to inquire.&lt;br /&gt;
&lt;br /&gt;
This lab space does not have [[Snapshots | snapshots]] by default (they are available upon request), but it is [[NightlyBackups | backed up]].&lt;br /&gt;
&lt;br /&gt;
== Project Allocations ==&lt;br /&gt;
Project allocations are available per user for up to 270 TB-days; for example, you can have a 1TB allocation for up to 270 days, a 3TB allocation for up to 90 days, etc.  A single faculty member cannot have more than 20TB of sponsored account project allocations active at any point.&lt;br /&gt;
&lt;br /&gt;
The minimum storage space you can request (maximum length) is 500GB (540 days) and the minimum allocation length you can request (maximum storage) is 30 days (9TB).&lt;br /&gt;
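The TB-day budget is simple arithmetic: the maximum allocation length in days is 270 divided by the size in TB.  As a quick sanity check (a hypothetical helper, using integer division over whole-TB sizes):&lt;br /&gt;

```shell
# Maximum allocation length in days for a given size in TB,
# under the 270 TB-day per-user budget
max_days() {
  echo $(( 270 / $1 ))
}

max_days 1   # 270
max_days 3   # 90
max_days 9   # 30
```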
&lt;br /&gt;
To request an allocation, please [[HelpDesk | contact staff]] with your account sponsor involved in the conversation.  Please include the following details:&lt;br /&gt;
* Project Name (short)&lt;br /&gt;
* Description&lt;br /&gt;
* Size (1TB, 2TB, etc.)&lt;br /&gt;
* Length in days (270 days, 135 days, etc.)&lt;br /&gt;
* Other user(s) that need to access the allocation, if any&lt;br /&gt;
&lt;br /&gt;
These allocations will be available via &amp;lt;code&amp;gt;/fs/nexus-projects/$project_name&amp;lt;/code&amp;gt;.  &#039;&#039;&#039;Renewal is not guaranteed to be available due to limits on the amount of total storage.&#039;&#039;&#039;  Near the end of the allocation period, staff will contact you and ask if you are still in need of the storage allocation.  If you are no longer in need of the storage allocation, you will need to relocate all desired data within 14 days of the end of the allocation period.  Staff will then remove the allocation.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
We have read-only dataset storage available at &amp;lt;code&amp;gt;/fs/nexus-datasets&amp;lt;/code&amp;gt;.  If there are datasets that you would like to see curated and available, please see [[Datasets | this page]].&lt;br /&gt;
&lt;br /&gt;
We will have a more formal process to approve datasets by Phase 2 of Nexus.&lt;br /&gt;
&lt;br /&gt;
= Migrations =&lt;br /&gt;
If you are a user of an existing cluster that is in the process of being folded into Nexus now or in the near future, your cluster-specific migration information will be listed here.&lt;br /&gt;
* [[Nexus/CLIP | CLIP]]&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus&amp;diff=10748</id>
		<title>Nexus</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus&amp;diff=10748"/>
		<updated>2022-11-09T15:00:02Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Quality of Service (QoS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The Nexus is the combined scheduler of resources in UMIACS.  Many of our existing computational clusters that have discrete schedulers will be folding into this scheduler in the future (see [[#Migrations | below]]).  The resource manager for Nexus (as with our other existing computational clusters) is [[SLURM]].  Resources are arranged into partitions where users are able to schedule computational jobs.  Users are arranged into a number of SLURM accounts based on faculty, lab, or center investments.&lt;br /&gt;
&lt;br /&gt;
= Getting Started =&lt;br /&gt;
All accounts in UMIACS are sponsored.  If you don&#039;t already have a UMIACS account, please see [[Accounts]] for information on getting one.  You need a full UMIACS account (not a [[Accounts/Collaborator | collaborator account]]) in order to access Nexus.&lt;br /&gt;
&lt;br /&gt;
== Access ==&lt;br /&gt;
The submission nodes for the Nexus computational resources are determined by department, center, or lab affiliation.  You can log into the [https://intranet.umiacs.umd.edu/directory/cr/ UMIACS Directory CR application] and select the Computational Resource (CR) in the list that has the prefix &amp;lt;code&amp;gt;nexus&amp;lt;/code&amp;gt;. The Hosts section lists your available login nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039; - UMIACS requires multi-factor authentication through our [[Duo]] instance.  This is completely separate from both UMD&#039;s and CSD&#039;s Duo instances.  You will need to enroll one or more devices to access resources in UMIACS, and will be prompted to enroll when you log into the Directory application for the first time.&lt;br /&gt;
&lt;br /&gt;
Once you have identified your submission nodes, you can [[SSH]] directly into them.  From there, you are able to submit to the cluster via our [[SLURM]] workload manager.  You need to make sure that your submitted jobs have the correct account, partition, and qos.&lt;br /&gt;
&lt;br /&gt;
== Jobs ==&lt;br /&gt;
[[SLURM]] jobs are submitted via either &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, depending on whether you are running an interactive or a batch job, respectively.  You need to provide the where/how/who to run the job and specify the resources you need to run with.&lt;br /&gt;
&lt;br /&gt;
For the where/how/who, you may be required to specify &amp;lt;code&amp;gt;--partition&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;--qos&amp;lt;/code&amp;gt;, and/or &amp;lt;code&amp;gt;--account&amp;lt;/code&amp;gt; (respectively) to be able to adequately submit jobs to the Nexus.&lt;br /&gt;
&lt;br /&gt;
For resources, you may need to specify &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; for time, &amp;lt;code&amp;gt;--ntasks&amp;lt;/code&amp;gt; for CPUs, &amp;lt;code&amp;gt;--mem&amp;lt;/code&amp;gt; for RAM, and &amp;lt;code&amp;gt;--gres=gpu&amp;lt;/code&amp;gt; for GPUs in your submission arguments to meet your requirements.  There are defaults for all four, so if you don&#039;t specify something, you may be scheduled with a very minimal set of time and resources (e.g., by default, NO GPUs are included if you do not specify &amp;lt;code&amp;gt;--gres=gpu&amp;lt;/code&amp;gt;).  For more information about submission flags for GPU resources, see [[SLURM/JobSubmission#Requesting_GPUs]].  You can also run &amp;lt;code&amp;gt;man srun&amp;lt;/code&amp;gt; on your submission node for a complete list of available submission arguments.&lt;br /&gt;
&lt;br /&gt;
=== Interactive ===&lt;br /&gt;
Once logged into a submission node, you can run simple interactive jobs.  If your session is interrupted from the submission node, the job will be killed.  As such, we encourage use of a terminal multiplexer such as [[Tmux]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --pty --ntasks 4 --mem=2gb --gres=gpu:1 nvidia-smi -L&lt;br /&gt;
GPU 0: NVIDIA RTX A4000 (UUID: GPU-ae5dc1f5-c266-5b9f-58d5-7976e62b3ca1)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Batch ===&lt;br /&gt;
Batch jobs are scheduled with a script file with an optional ability to embed job scheduling parameters via variables that are defined by &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; lines at the top of the file.  You can find some examples in our [[SLURM/JobSubmission]] documentation.&lt;br /&gt;
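As a minimal sketch of such a script (the job name, resource sizes, and command are hypothetical):&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=4
#SBATCH --mem=2gb
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

nvidia-smi -L
```

Submit it with &amp;lt;code&amp;gt;sbatch example.sh&amp;lt;/code&amp;gt;; by default, output is written to a &amp;lt;code&amp;gt;slurm-&amp;lt;jobid&amp;gt;.out&amp;lt;/code&amp;gt; file in the submission directory.&lt;br /&gt;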
&lt;br /&gt;
= Partitions = &lt;br /&gt;
The SLURM resource manager uses partitions to act as job queues, each of which can enforce size, time, and user limits.  The Nexus (when fully operational) will have a number of different partitions of resources.  Different Centers, Labs, and Faculty will be able to invest in computational resources that will be restricted to approved users through these partitions.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Partitions usable by all non-[[ClassAccounts |class account]] users:&#039;&#039;&#039;&lt;br /&gt;
* [[Nexus/Tron]] - Pool of resources available to all UMIACS and CSD faculty and graduate students.&lt;br /&gt;
* Scavenger - [https://slurm.schedmd.com/preempt.html Preemption] partition that supports nodes from multiple other partitions.  More resources are available to schedule simultaneously than in other partitions, however jobs are subject to preemption rules.  You are responsible for ensuring your jobs handle this preemption correctly.  The SLURM scheduler will simply restart a preempted job with the same submission arguments when it is available to run again.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Partitions usable by [[ClassAccounts]]:&#039;&#039;&#039;&lt;br /&gt;
* [[ClassAccounts | Class]] - Pool available for UMIACS/CSD class accounts.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Partitions usable by specific lab/center users:&#039;&#039;&#039;&lt;br /&gt;
* [[Nexus/CLIP]] - CLIP lab pool available for CLIP lab members.&lt;br /&gt;
* [[Nexus/Gamma]] - GAMMA lab pool available for GAMMA lab members.&lt;br /&gt;
* [[Nexus/MC2]] - MC2 lab pool available for MC2 lab members.&lt;br /&gt;
&lt;br /&gt;
= Quality of Service (QoS) =&lt;br /&gt;
SLURM uses QoSes to place limits on the size of users&#039; jobs.  Note that you should still try to allocate only the minimum resources your jobs need, as the resources that each of your jobs schedules are counted against your [https://slurm.schedmd.com/fair_tree.html FairShare priority] in the future.&lt;br /&gt;
* default - Default QoS. Limited to 4 cores, 32GB RAM, and 1 GPU per job.  The maximum wall time per job is 3 days.&lt;br /&gt;
* medium - Limited to 8 cores, 64GB RAM, and 2 GPUs per job.  The maximum wall time per job is 2 days.&lt;br /&gt;
* high - Limited to 16 cores, 128GB RAM, and 4 GPUs per job.  The maximum wall time per job is 1 day.&lt;br /&gt;
* scavenger - Limited to 64 cores, 256GB RAM, and 8 GPUs per job.  The maximum wall time per job is 2 days.  Only 192 total cores, 768GB total RAM, and 24 total GPUs are permitted simultaneously across all of your jobs running in this QoS.  This QoS is available only in the scavenger partition, and it is the only QoS available in that partition. To use it, include &amp;lt;code&amp;gt;--partition=scavenger&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--account=scavenger&amp;lt;/code&amp;gt; in your submission arguments. Do not include any QoS argument other than the optional &amp;lt;code&amp;gt;--qos=scavenger&amp;lt;/code&amp;gt;, or the submission will fail.&lt;br /&gt;
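For example, a batch scavenger submission might use the following arguments (the resource sizes and the &amp;lt;code&amp;gt;myjob.sh&amp;lt;/code&amp;gt; script are hypothetical):&lt;br /&gt;

```shell
sbatch --partition=scavenger --account=scavenger --qos=scavenger \
       --ntasks=4 --mem=32gb --gres=gpu:2 --time=1-00:00:00 \
       myjob.sh
```

Because the scheduler restarts preempted jobs with the same submission arguments, the hypothetical &amp;lt;code&amp;gt;myjob.sh&amp;lt;/code&amp;gt; should checkpoint its progress and resume from the latest checkpoint on startup.&lt;br /&gt;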
&lt;br /&gt;
You can display these QoSes from the command line using the &amp;lt;code&amp;gt;show_qos&amp;lt;/code&amp;gt; command. Other partition-specific, lab- or group-specific, or reserved QoSes may also appear in the listing; the four QoSes above are the ones that everyone can submit to.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# show_qos&lt;br /&gt;
            Name     MaxWall MaxJobs                       MaxTRES                     MaxTRESPU                       GrpTRES Priority&lt;br /&gt;
---------------- ----------- ------- ----------------------------- ----------------------------- ----------------------------- --------&lt;br /&gt;
          normal                                                                                                                      0&lt;br /&gt;
       scavenger  2-00:00:00            cpu=64,gres/gpu=8,mem=256G  cpu=192,gres/gpu=24,mem=768G                                      0&lt;br /&gt;
          medium  2-00:00:00              cpu=8,gres/gpu=2,mem=64G                                                                    0&lt;br /&gt;
            high  1-00:00:00            cpu=16,gres/gpu=4,mem=128G                                                                    0&lt;br /&gt;
         default  3-00:00:00              cpu=4,gres/gpu=1,mem=32G                                                                    0&lt;br /&gt;
            tron                                                      cpu=32,gres/gpu=4,mem=256G                                      0&lt;br /&gt;
       huge-long 10-00:00:00            cpu=32,gres/gpu=8,mem=256G                                                                    0&lt;br /&gt;
            clip                                                                                          cpu=334,mem=2829447M        0&lt;br /&gt;
           class                                                      cpu=32,gres/gpu=4,mem=256G                                      0&lt;br /&gt;
           gamma                                                                                          cpu=174,mem=1414709M        0&lt;br /&gt;
             mc2                                                                                          cpu=302,mem=1799956M        0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Please note that in the default non-preemption partition (&amp;lt;code&amp;gt;tron&amp;lt;/code&amp;gt;), you will be restricted to 32 total cores, 256GB total RAM, and 4 total GPUs at once across all jobs you have running in the QoSes allowed by that partition.  This is codified by the reserved QoS also named &amp;lt;code&amp;gt;tron&amp;lt;/code&amp;gt; in the output above. Lab/group-specific partitions may also have similar restrictions across all users in that lab/group that are using the partition (codified by &amp;lt;code&amp;gt;GrpTRES&amp;lt;/code&amp;gt; in the output above for the QoS name that matches the lab/group partition).&lt;br /&gt;
&lt;br /&gt;
To find out what accounts and partitions you have access to, use the &amp;lt;code&amp;gt;show_assoc&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
= Storage =&lt;br /&gt;
All storage available in Nexus is currently [[NFS]] based.  We will be introducing some changes for Phase 2 to support high performance GPUDirect Storage (GDS).  These storage allocation procedures will be revised and approved by a joint UMIACS and CSD faculty committee by the launch of Phase 2.&lt;br /&gt;
&lt;br /&gt;
== Home Directories ==&lt;br /&gt;
Home directories in the Nexus computational infrastructure are available from the Institute&#039;s [[NFShomes]] as &amp;lt;code&amp;gt;/nfshomes/USERNAME&amp;lt;/code&amp;gt;, where USERNAME is your username.  These home directories have very limited storage (20GB, cannot be increased) and are intended for your personal files, configuration, and source code.  Your home directory is &#039;&#039;&#039;not&#039;&#039;&#039; intended for data sets or other large scale data holdings.  You are encouraged to utilize our [[GitLab]] infrastructure to host your code repositories.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE&#039;&#039;&#039;: To check your quota on this directory you will need to use the &amp;lt;code&amp;gt;quota -s&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
Your home directory data is fully protected and has both [[Snapshots | snapshots]] and is [[NightlyBackups | backed up nightly]].&lt;br /&gt;
&lt;br /&gt;
Other standalone compute clusters have begun to fold into partitions in Nexus.  The corresponding home directories used by these clusters (if not &amp;lt;code&amp;gt;/nfshomes&amp;lt;/code&amp;gt;) will be gradually phased out in favor of the &amp;lt;code&amp;gt;/nfshomes&amp;lt;/code&amp;gt; home directories.&lt;br /&gt;
&lt;br /&gt;
== Scratch Directories ==&lt;br /&gt;
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories in the Nexus compute infrastructure:&lt;br /&gt;
* Network scratch directories&lt;br /&gt;
* Local scratch directories&lt;br /&gt;
&lt;br /&gt;
Please note that [[ClassAccounts | class accounts]] do not have network scratch directories.&lt;br /&gt;
&lt;br /&gt;
=== Network Scratch Directories ===&lt;br /&gt;
You are allocated 200GB of scratch space via NFS from &amp;lt;code&amp;gt;/fs/nexus-scratch/$username&amp;lt;/code&amp;gt;.  &#039;&#039;&#039;It is not backed up or protected in any way.&#039;&#039;&#039;  This directory is &#039;&#039;&#039;automounted&#039;&#039;&#039;, so you will need to &amp;lt;code&amp;gt;cd&amp;lt;/code&amp;gt; into the directory or specify a fully qualified file path to access it.&lt;br /&gt;
&lt;br /&gt;
You may request a permanent increase of up to 400GB total space without any faculty approval by [[HelpDesk | contacting staff]].  If you need space beyond 400GB, you will need faculty approval and/or a project directory.&lt;br /&gt;
&lt;br /&gt;
This file system is available on all submission, data management, and computational nodes within the cluster.&lt;br /&gt;
&lt;br /&gt;
=== Local Scratch Directories ===&lt;br /&gt;
Each computational node that you can schedule compute jobs on also has one or more local scratch directories.  These are always named &amp;lt;code&amp;gt;/scratch0&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;/scratch1&amp;lt;/code&amp;gt;, etc.  These are almost always more performant than any other storage available to the job.  However, you must stage your data in within the confines of your job and stage it back out before the end of your job.&lt;br /&gt;
&lt;br /&gt;
These local scratch directories have a tmpwatch job which will &#039;&#039;&#039;delete data that has not been accessed in 90 days&#039;&#039;&#039;; these maintenance jobs run once a month at 1am.  Different nodes run the maintenance jobs on different days of the month to ensure the cluster remains highly available at all times.  Please make sure you secure any data you write to these directories before the end of your job.&lt;br /&gt;
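The stage-in, compute, stage-out pattern described above can be sketched as a batch script (the paths, resource sizes, and compute step are hypothetical):&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --mem=16gb
#SBATCH --time=12:00:00

# Stage data in to fast local scratch (hypothetical source path)
JOBDIR="/scratch0/$USER/$SLURM_JOB_ID"
mkdir -p "$JOBDIR"
cp -r "/fs/nexus-scratch/$USER/dataset" "$JOBDIR/"

# ... run your computation against "$JOBDIR" here ...

# Stage results back out and clean up before the job ends
cp -r "$JOBDIR/results" "/fs/nexus-scratch/$USER/"
rm -rf "$JOBDIR"
```

Anything left behind in a local scratch directory is subject to the 90-day tmpwatch cleanup described above.&lt;br /&gt;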
&lt;br /&gt;
== Faculty Allocations ==&lt;br /&gt;
Each faculty member can be allocated 1TB of lab space upon request.  We can also support grouping these individual allocations together into larger center, lab, or research group allocations if desired by the faculty.  Please [[HelpDesk | contact staff]] to inquire.&lt;br /&gt;
&lt;br /&gt;
This lab space does not have [[Snapshots | snapshots]] by default (they are available upon request), but it is [[NightlyBackups | backed up]].&lt;br /&gt;
&lt;br /&gt;
== Project Allocations ==&lt;br /&gt;
Project allocations are available per user for up to 270 TB-days; for example, you can have a 1TB allocation for up to 270 days, a 3TB allocation for up to 90 days, etc.  A single faculty member cannot have more than 20TB of sponsored account project allocations active at any point.&lt;br /&gt;
&lt;br /&gt;
The minimum storage space you can request (maximum length) is 500GB (540 days) and the minimum allocation length you can request (maximum storage) is 30 days (9TB).&lt;br /&gt;
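Since the budget is 270 TB-days, the maximum allocation length in days is simply 270 divided by the size in TB.  As a quick sanity check (a hypothetical helper, using integer division over whole-TB sizes):&lt;br /&gt;

```shell
# Maximum allocation length in days for a given size in TB,
# under the 270 TB-day per-user budget
max_days() {
  echo $(( 270 / $1 ))
}

max_days 1   # 270
max_days 3   # 90
max_days 9   # 30
```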
&lt;br /&gt;
To request an allocation, please [[HelpDesk | contact staff]] with your account sponsor involved in the conversation.  Please include the following details:&lt;br /&gt;
* Project Name (short)&lt;br /&gt;
* Description&lt;br /&gt;
* Size (1TB, 2TB, etc.)&lt;br /&gt;
* Length in days (270 days, 135 days, etc.)&lt;br /&gt;
* Other user(s) that need to access the allocation, if any&lt;br /&gt;
&lt;br /&gt;
These allocations will be available via &amp;lt;code&amp;gt;/fs/nexus-projects/$project_name&amp;lt;/code&amp;gt;.  &#039;&#039;&#039;Renewal is not guaranteed to be available due to limits on the amount of total storage.&#039;&#039;&#039;  Near the end of the allocation period, staff will contact you and ask if you are still in need of the storage allocation.  If you are no longer in need of the storage allocation, you will need to relocate all desired data within 14 days of the end of the allocation period.  Staff will then remove the allocation.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
We have read-only dataset storage available at &amp;lt;code&amp;gt;/fs/nexus-datasets&amp;lt;/code&amp;gt;.  If there are datasets that you would like to see curated and available, please see [[Datasets | this page]].&lt;br /&gt;
&lt;br /&gt;
We will have a more formal process to approve datasets by Phase 2 of Nexus.&lt;br /&gt;
&lt;br /&gt;
= Migrations =&lt;br /&gt;
If you are a user of an existing cluster that is in the process of being folded into Nexus now or in the near future, your cluster-specific migration information will be listed here.&lt;br /&gt;
* [[Nexus/CLIP | CLIP]]&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10733</id>
		<title>Apptainer</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10733"/>
		<updated>2022-11-01T16:30:45Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Nexus Containers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Singularity was rebranded as Apptainer.  You should still be able to run commands on the system with &amp;lt;code&amp;gt;singularity&amp;lt;/code&amp;gt;; however, you should start migrating to the &amp;lt;code&amp;gt;apptainer&amp;lt;/code&amp;gt; command.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
[https://apptainer.org Apptainer] is a container platform that doesn&#039;t elevate the privileges of a user running the container.  This is important as UMIACS runs many multi-tenant hosts and doesn&#039;t provide administrative control to users on them.&lt;br /&gt;
&lt;br /&gt;
You can find out the current version that we provide by running the &#039;&#039;&#039;apptainer --version&#039;&#039;&#039; command.  If this instead says &amp;lt;code&amp;gt;apptainer: command not found&amp;lt;/code&amp;gt;, please contact staff and we will ensure that the software is made available on the host in question.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# apptainer --version&lt;br /&gt;
apptainer version 1.1.0-1.el7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apptainer can run a variety of images including its own format and [https://apptainer.org/docs/user/1.1/docker_and_oci.html Docker images].  To create images, you need to have administrative rights. Therefore, you will need to do this on a host that you have administrative access to (laptop or personal desktop) rather than a UMIACS-supported host.&lt;br /&gt;
&lt;br /&gt;
If you are going to pull large images, you may run out of space in your home directory. We suggest you run the following commands to set up an alternate cache directory.  We are using &amp;lt;code&amp;gt;/scratch0&amp;lt;/code&amp;gt;, but you can substitute any large enough network scratch or project directory you would like.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export WORKDIR=/scratch0/$USER&lt;br /&gt;
export APPTAINER_CACHEDIR=${WORKDIR}/.cache&lt;br /&gt;
mkdir -p $APPTAINER_CACHEDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We suggest you pull images down into an intermediate file (a &#039;&#039;&#039;SIF&#039;&#039;&#039; file), as you then do not have to worry about re-caching the image.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer pull cuda10.2.sif docker://nvidia/cuda:10.2-devel&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob d5d706ce7b29 done&lt;br /&gt;
Copying blob b4dc78aeafca done&lt;br /&gt;
Copying blob 24a22c1b7260 done&lt;br /&gt;
Copying blob 8dea37be3176 done&lt;br /&gt;
Copying blob 25fa05cd42bd done&lt;br /&gt;
Copying blob a57130ec8de1 done&lt;br /&gt;
Copying blob 880a66924cf5 done&lt;br /&gt;
Copying config db554d658b done&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2022/10/14 10:31:17  info unpack layer: sha256:25fa05cd42bd8fabb25d2a6f3f8c9f7ab34637903d00fd2ed1c1d0fa980427dd&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:24a22c1b72605a4dbcec13b743ef60a6cbb43185fe46fd8a35941f9af7c11153&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:8dea37be3176a88fae41c265562d5fb438d9281c356dcb4edeaa51451dbdfdb2&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:b4dc78aeafca6321025300e9d3050c5ba3fb2ac743ae547c6e1efa3f9284ce0b&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:a57130ec8de1e44163e965620d5aed2abe6cddf48b48272964bfd8bca101df38&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:d5d706ce7b293ffb369d3bf0e3f58f959977903b82eb26433fe58645f79b778b&lt;br /&gt;
2022/10/14 10:31:49  info unpack layer: sha256:880a66924cf5e11df601a4f531f3741c6867a3e05238bc9b7cebb2a68d479204&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer inspect cuda10.2.sif&lt;br /&gt;
maintainer: NVIDIA CORPORATION &amp;lt;cudatools@nvidia.com&amp;gt;&lt;br /&gt;
org.label-schema.build-arch: amd64&lt;br /&gt;
org.label-schema.build-date: Friday_14_October_2022_10:32:42_EDT&lt;br /&gt;
org.label-schema.schema-version: 1.0&lt;br /&gt;
org.label-schema.usage.apptainer.version: 1.1.0-1.el7&lt;br /&gt;
org.label-schema.usage.singularity.deffile.bootstrap: docker&lt;br /&gt;
org.label-schema.usage.singularity.deffile.from: nvidia/cuda:10.2-devel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run the local image with the &#039;&#039;&#039;run&#039;&#039;&#039; command or start a shell with the &#039;&#039;&#039;shell&#039;&#039;&#039; command.  Please note that if you are in an environment with GPUs and you want to access them inside the container you need to specify the &#039;&#039;&#039;--nv&#039;&#039;&#039; flag.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer run --nv cuda10.2.sif nvidia-smi -L&lt;br /&gt;
GPU 0: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-8e040d17-402e-cc86-4e83-eb2b1d501f1e)&lt;br /&gt;
GPU 1: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-d681a21a-8cdd-e624-6bf8-5b0234584ba2)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Nexus Containers==&lt;br /&gt;
In our [[Nexus]] environment we have some example containers based on our [https://gitlab.umiacs.umd.edu/derek/pytorch_docker pytorch_docker] project.  These can be found in &amp;lt;code&amp;gt;/fs/nexus-containers/pytorch&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
You can run one of the example images by doing the following (you should have already allocated an interactive job with a GPU in [[Nexus]]).  It will use the default [https://gitlab.umiacs.umd.edu/derek/pytorch_docker/-/blob/master/tensor.py script] found at &amp;lt;code&amp;gt;/srv/tensor.py&amp;lt;/code&amp;gt; within the image.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ hostname &amp;amp;&amp;amp; nvidia-smi -L&lt;br /&gt;
tron38.umiacs.umd.edu&lt;br /&gt;
GPU 0: NVIDIA RTX A4000 (UUID: GPU-4a0a5644-9fc8-84b4-5d22-65d45ca36506)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer run --nv /fs/nexus-containers/pytorch/pytorch_1.13.0+cu117.sif&lt;br /&gt;
99 984.5538940429688&lt;br /&gt;
199 654.1710815429688&lt;br /&gt;
299 435.662353515625&lt;br /&gt;
399 291.1429138183594&lt;br /&gt;
499 195.5575714111328&lt;br /&gt;
599 132.3363037109375&lt;br /&gt;
699 90.5206069946289&lt;br /&gt;
799 62.86213684082031&lt;br /&gt;
899 44.56754684448242&lt;br /&gt;
999 32.466392517089844&lt;br /&gt;
1099 24.461835861206055&lt;br /&gt;
1199 19.166893005371094&lt;br /&gt;
1299 15.6642427444458&lt;br /&gt;
1399 13.347112655639648&lt;br /&gt;
1499 11.814264297485352&lt;br /&gt;
1599 10.800163269042969&lt;br /&gt;
1699 10.129261016845703&lt;br /&gt;
1799 9.685370445251465&lt;br /&gt;
1899 9.391674041748047&lt;br /&gt;
1999 9.19735336303711&lt;br /&gt;
Result: y = 0.0022362577728927135 + 0.837898313999176 x + -0.0003857926349155605 x^2 + -0.09065020829439163 x^3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To get data into the container, you need to pass some [https://apptainer.org/docs/user/main/bind_paths_and_mounts.html bind mounts] (your home directory is bound by default).  In this example we exec an interactive session, binding our [[Nexus]] scratch directory; &amp;lt;code&amp;gt;exec&amp;lt;/code&amp;gt; allows us to specify the command we want to run inside the container.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
apptainer exec --nv --bind /fs/nexus-scratch/derek:/fs/nexus-scratch/derek /fs/nexus-containers/pytorch/pytorch_1.13.0+cu117.sif bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can now write and run your own PyTorch Python code interactively within the container, or make a Python script that you can call directly from the &amp;lt;code&amp;gt;apptainer exec&amp;lt;/code&amp;gt; command for batch processing.&lt;br /&gt;
&lt;br /&gt;
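For batch processing, the same &amp;lt;code&amp;gt;exec&amp;lt;/code&amp;gt; pattern can call a script directly instead of &amp;lt;code&amp;gt;bash&amp;lt;/code&amp;gt;.  A minimal sketch, assuming a hypothetical script at &amp;lt;code&amp;gt;/fs/nexus-scratch/derek/train.py&amp;lt;/code&amp;gt; (substitute your own script path):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer exec --nv --bind /fs/nexus-scratch/derek:/fs/nexus-scratch/derek \&lt;br /&gt;
    /fs/nexus-containers/pytorch/pytorch_1.13.0+cu117.sif \&lt;br /&gt;
    python3 /fs/nexus-scratch/derek/train.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;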
==Docker Workflow Example==&lt;br /&gt;
We have a [https://gitlab.umiacs.umd.edu/derek/pytorch_docker pytorch_docker] example workflow using our [[GitLab]] as a Docker registry.  You can clone the repository and further customize this to your needs. The workflow is:&lt;br /&gt;
&lt;br /&gt;
# Run Docker on a laptop or personal desktop to create the image.&lt;br /&gt;
# Tag the image and push it to your repository (this can be any Docker registry).&lt;br /&gt;
# Pull the image down onto one of our workstations/clusters and run it with your data. &lt;br /&gt;
&lt;br /&gt;
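Steps 1 and 2 above happen on your own machine with Docker.  A minimal sketch, assuming the image name &amp;lt;code&amp;gt;pytorch_docker&amp;lt;/code&amp;gt; and our GitLab registry (substitute your own registry path):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ docker build -t pytorch_docker .&lt;br /&gt;
$ docker tag pytorch_docker registry.umiacs.umd.edu/derek/pytorch_docker&lt;br /&gt;
$ docker login registry.umiacs.umd.edu&lt;br /&gt;
$ docker push registry.umiacs.umd.edu/derek/pytorch_docker&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;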
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer pull pytorch_docker.sif docker://registry.umiacs.umd.edu/derek/pytorch_docker&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob 85386706b020 done&lt;br /&gt;
...&lt;br /&gt;
2022/10/14 10:58:36  info unpack layer: sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
2022/10/14 10:58:43  info unpack layer: sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer exec --nv pytorch_docker.sif python3 -c &#039;from __future__ import print_function; import torch; print(torch.cuda.current_device()); x = torch.rand(5, 3); print(x)&#039;&lt;br /&gt;
0&lt;br /&gt;
tensor([[0.3273, 0.7174, 0.3587],&lt;br /&gt;
        [0.2250, 0.3896, 0.4136],&lt;br /&gt;
        [0.3626, 0.0383, 0.6274],&lt;br /&gt;
        [0.6241, 0.8079, 0.2950],&lt;br /&gt;
        [0.0804, 0.9705, 0.0030]])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10732</id>
		<title>Apptainer</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10732"/>
		<updated>2022-11-01T16:30:32Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Singularity was rebranded as Apptainer.  You should still be able to run commands on the system with &amp;lt;code&amp;gt;singularity&amp;lt;/code&amp;gt;; however, you should start migrating to the &amp;lt;code&amp;gt;apptainer&amp;lt;/code&amp;gt; command.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
[https://apptainer.org Apptainer] is a container platform that doesn&#039;t elevate the privileges of a user running the container.  This is important as UMIACS runs many multi-tenant hosts and doesn&#039;t provide administrative control to users on them.&lt;br /&gt;
&lt;br /&gt;
You can find the current version we provide by running the &#039;&#039;&#039;apptainer --version&#039;&#039;&#039; command.  If this instead returns &amp;lt;code&amp;gt;apptainer: command not found&amp;lt;/code&amp;gt;, please contact staff and we will ensure that the software is available on the host in question.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# apptainer --version&lt;br /&gt;
apptainer version 1.1.0-1.el7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apptainer can run a variety of images including its own format and [https://apptainer.org/docs/user/1.1/docker_and_oci.html Docker images].  To create images, you need to have administrative rights. Therefore, you will need to do this on a host that you have administrative access to (laptop or personal desktop) rather than a UMIACS-supported host.&lt;br /&gt;
&lt;br /&gt;
If you are going to pull large images, you may run out of space in your home directory. We suggest you run the following commands to set up an alternate cache directory.  We are using &amp;lt;code&amp;gt;/scratch0&amp;lt;/code&amp;gt;, but you can substitute any sufficiently large network scratch or project directory.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export WORKDIR=/scratch0/$USER&lt;br /&gt;
export APPTAINER_CACHEDIR=${WORKDIR}/.cache&lt;br /&gt;
mkdir -p $APPTAINER_CACHEDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We suggest you pull images down into an intermediate file (a &#039;&#039;&#039;SIF&#039;&#039;&#039; file) so that you do not have to re-cache the image later.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer pull cuda10.2.sif docker://nvidia/cuda:10.2-devel&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob d5d706ce7b29 done&lt;br /&gt;
Copying blob b4dc78aeafca done&lt;br /&gt;
Copying blob 24a22c1b7260 done&lt;br /&gt;
Copying blob 8dea37be3176 done&lt;br /&gt;
Copying blob 25fa05cd42bd done&lt;br /&gt;
Copying blob a57130ec8de1 done&lt;br /&gt;
Copying blob 880a66924cf5 done&lt;br /&gt;
Copying config db554d658b done&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2022/10/14 10:31:17  info unpack layer: sha256:25fa05cd42bd8fabb25d2a6f3f8c9f7ab34637903d00fd2ed1c1d0fa980427dd&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:24a22c1b72605a4dbcec13b743ef60a6cbb43185fe46fd8a35941f9af7c11153&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:8dea37be3176a88fae41c265562d5fb438d9281c356dcb4edeaa51451dbdfdb2&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:b4dc78aeafca6321025300e9d3050c5ba3fb2ac743ae547c6e1efa3f9284ce0b&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:a57130ec8de1e44163e965620d5aed2abe6cddf48b48272964bfd8bca101df38&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:d5d706ce7b293ffb369d3bf0e3f58f959977903b82eb26433fe58645f79b778b&lt;br /&gt;
2022/10/14 10:31:49  info unpack layer: sha256:880a66924cf5e11df601a4f531f3741c6867a3e05238bc9b7cebb2a68d479204&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer inspect cuda10.2.sif&lt;br /&gt;
maintainer: NVIDIA CORPORATION &amp;lt;cudatools@nvidia.com&amp;gt;&lt;br /&gt;
org.label-schema.build-arch: amd64&lt;br /&gt;
org.label-schema.build-date: Friday_14_October_2022_10:32:42_EDT&lt;br /&gt;
org.label-schema.schema-version: 1.0&lt;br /&gt;
org.label-schema.usage.apptainer.version: 1.1.0-1.el7&lt;br /&gt;
org.label-schema.usage.singularity.deffile.bootstrap: docker&lt;br /&gt;
org.label-schema.usage.singularity.deffile.from: nvidia/cuda:10.2-devel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run the local image with the &#039;&#039;&#039;run&#039;&#039;&#039; command or start a shell with the &#039;&#039;&#039;shell&#039;&#039;&#039; command.  Please note that if you are in an environment with GPUs and you want to access them inside the container you need to specify the &#039;&#039;&#039;--nv&#039;&#039;&#039; flag.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer run --nv cuda10.2.sif nvidia-smi -L&lt;br /&gt;
GPU 0: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-8e040d17-402e-cc86-4e83-eb2b1d501f1e)&lt;br /&gt;
GPU 1: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-d681a21a-8cdd-e624-6bf8-5b0234584ba2)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Nexus Containers==&lt;br /&gt;
In our [[Nexus]] environment we have some example containers based on our [https://gitlab.umiacs.umd.edu/derek/pytorch_docker pytorch_docker] project.  These can be found in &amp;lt;code&amp;gt;/fs/nexus-containers/pytorch&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
You can run one of the example images by doing the following (you should have already allocated an interactive job with a GPU in [[Nexus]]).  It will use the default [https://gitlab.umiacs.umd.edu/derek/pytorch_docker/-/blob/master/tensor.py script] found at &amp;lt;code&amp;gt;/srv/tensor.py&amp;lt;/code&amp;gt; within the image.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ hostname &amp;amp;&amp;amp; nvidia-smi -L&lt;br /&gt;
tron38.umiacs.umd.edu&lt;br /&gt;
GPU 0: NVIDIA RTX A4000 (UUID: GPU-4a0a5644-9fc8-84b4-5d22-65d45ca36506)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer run --nv /fs/nexus-containers/pytorch/pytorch_1.13.0+cu117.sif&lt;br /&gt;
99 984.5538940429688&lt;br /&gt;
199 654.1710815429688&lt;br /&gt;
299 435.662353515625&lt;br /&gt;
399 291.1429138183594&lt;br /&gt;
499 195.5575714111328&lt;br /&gt;
599 132.3363037109375&lt;br /&gt;
699 90.5206069946289&lt;br /&gt;
799 62.86213684082031&lt;br /&gt;
899 44.56754684448242&lt;br /&gt;
999 32.466392517089844&lt;br /&gt;
1099 24.461835861206055&lt;br /&gt;
1199 19.166893005371094&lt;br /&gt;
1299 15.6642427444458&lt;br /&gt;
1399 13.347112655639648&lt;br /&gt;
1499 11.814264297485352&lt;br /&gt;
1599 10.800163269042969&lt;br /&gt;
1699 10.129261016845703&lt;br /&gt;
1799 9.685370445251465&lt;br /&gt;
1899 9.391674041748047&lt;br /&gt;
1999 9.19735336303711&lt;br /&gt;
Result: y = 0.0022362577728927135 + 0.837898313999176 x + -0.0003857926349155605 x^2 + -0.09065020829439163 x^3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To get data into the container, you need to pass some [https://apptainer.org/docs/user/main/bind_paths_and_mounts.html bind mounts] (your home directory is bound by default).  In this example we exec an interactive session, binding our [[Nexus]] scratch directory; &amp;lt;code&amp;gt;exec&amp;lt;/code&amp;gt; allows us to specify the command we want to run inside the container.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
apptainer exec --nv --bind /fs/nexus-scratch/derek:/fs/nexus-scratch/derek /fs/nexus-containers/pytorch/pytorch_1.13.0+cu117.sif bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can now write and run your own PyTorch Python code interactively within the container, or make a Python script that you can call directly from the &amp;lt;code&amp;gt;apptainer exec&amp;lt;/code&amp;gt; command for batch processing.&lt;br /&gt;
&lt;br /&gt;
==Docker Workflow Example==&lt;br /&gt;
We have a [https://gitlab.umiacs.umd.edu/derek/pytorch_docker pytorch_docker] example workflow using our [[GitLab]] as a Docker registry.  You can clone the repository and further customize this to your needs. The workflow is:&lt;br /&gt;
&lt;br /&gt;
# Run Docker on a laptop or personal desktop to create the image.&lt;br /&gt;
# Tag the image and push it to your repository (this can be any Docker registry).&lt;br /&gt;
# Pull the image down onto one of our workstations/clusters and run it with your data. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer pull pytorch_docker.sif docker://registry.umiacs.umd.edu/derek/pytorch_docker&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob 85386706b020 done&lt;br /&gt;
...&lt;br /&gt;
2022/10/14 10:58:36  info unpack layer: sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
2022/10/14 10:58:43  info unpack layer: sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer exec --nv pytorch_docker.sif python3 -c &#039;from __future__ import print_function; import torch; print(torch.cuda.current_device()); x = torch.rand(5, 3); print(x)&#039;&lt;br /&gt;
0&lt;br /&gt;
tensor([[0.3273, 0.7174, 0.3587],&lt;br /&gt;
        [0.2250, 0.3896, 0.4136],&lt;br /&gt;
        [0.3626, 0.0383, 0.6274],&lt;br /&gt;
        [0.6241, 0.8079, 0.2950],&lt;br /&gt;
        [0.0804, 0.9705, 0.0030]])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Singularity&amp;diff=10691</id>
		<title>Singularity</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Singularity&amp;diff=10691"/>
		<updated>2022-10-14T16:57:53Z</updated>

		<summary type="html">&lt;p&gt;Derek: Redirected page to Apptainer&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Apptainer]]&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10690</id>
		<title>Apptainer</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10690"/>
		<updated>2022-10-14T15:20:29Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Singularity was rebranded as Apptainer.  You should still be able to run commands on the system with &amp;lt;code&amp;gt;singularity&amp;lt;/code&amp;gt;; however, you should start migrating to the &amp;lt;code&amp;gt;apptainer&amp;lt;/code&amp;gt; command.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
[https://apptainer.org Apptainer] is a container platform that doesn&#039;t elevate the privileges of a user running the container.  This is important as UMIACS runs many multi-tenant hosts and doesn&#039;t provide administrative control to users on them.&lt;br /&gt;
&lt;br /&gt;
You can find the current version we provide by running the &#039;&#039;&#039;apptainer --version&#039;&#039;&#039; command.  If this instead returns &amp;lt;code&amp;gt;apptainer: command not found&amp;lt;/code&amp;gt;, please contact staff and we will ensure that the software is available on the host in question.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# apptainer --version&lt;br /&gt;
apptainer version 1.1.0-1.el7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apptainer can run a variety of images including its own format and [https://apptainer.org/docs/user/1.1/docker_and_oci.html Docker images].  To create images, you need to have administrative rights. Therefore, you will need to do this on a host that you have administrative access to (laptop or personal desktop) rather than a UMIACS-supported host.&lt;br /&gt;
&lt;br /&gt;
If you are going to pull large images, you may run out of space in your home directory. We suggest you run the following commands to set up an alternate cache directory.  We are using &amp;lt;code&amp;gt;/scratch0&amp;lt;/code&amp;gt;, but you can substitute any sufficiently large network scratch or project directory.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export WORKDIR=/scratch0/$USER&lt;br /&gt;
export APPTAINER_CACHEDIR=${WORKDIR}/.cache&lt;br /&gt;
mkdir -p $APPTAINER_CACHEDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We suggest you pull images down into an intermediate file (a &#039;&#039;&#039;SIF&#039;&#039;&#039; file) so that you do not have to re-cache the image later.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer pull cuda10.2.sif docker://nvidia/cuda:10.2-devel&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob d5d706ce7b29 done&lt;br /&gt;
Copying blob b4dc78aeafca done&lt;br /&gt;
Copying blob 24a22c1b7260 done&lt;br /&gt;
Copying blob 8dea37be3176 done&lt;br /&gt;
Copying blob 25fa05cd42bd done&lt;br /&gt;
Copying blob a57130ec8de1 done&lt;br /&gt;
Copying blob 880a66924cf5 done&lt;br /&gt;
Copying config db554d658b done&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2022/10/14 10:31:17  info unpack layer: sha256:25fa05cd42bd8fabb25d2a6f3f8c9f7ab34637903d00fd2ed1c1d0fa980427dd&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:24a22c1b72605a4dbcec13b743ef60a6cbb43185fe46fd8a35941f9af7c11153&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:8dea37be3176a88fae41c265562d5fb438d9281c356dcb4edeaa51451dbdfdb2&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:b4dc78aeafca6321025300e9d3050c5ba3fb2ac743ae547c6e1efa3f9284ce0b&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:a57130ec8de1e44163e965620d5aed2abe6cddf48b48272964bfd8bca101df38&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:d5d706ce7b293ffb369d3bf0e3f58f959977903b82eb26433fe58645f79b778b&lt;br /&gt;
2022/10/14 10:31:49  info unpack layer: sha256:880a66924cf5e11df601a4f531f3741c6867a3e05238bc9b7cebb2a68d479204&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer inspect cuda10.2.sif&lt;br /&gt;
maintainer: NVIDIA CORPORATION &amp;lt;cudatools@nvidia.com&amp;gt;&lt;br /&gt;
org.label-schema.build-arch: amd64&lt;br /&gt;
org.label-schema.build-date: Friday_14_October_2022_10:32:42_EDT&lt;br /&gt;
org.label-schema.schema-version: 1.0&lt;br /&gt;
org.label-schema.usage.apptainer.version: 1.1.0-1.el7&lt;br /&gt;
org.label-schema.usage.singularity.deffile.bootstrap: docker&lt;br /&gt;
org.label-schema.usage.singularity.deffile.from: nvidia/cuda:10.2-devel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run the local image with the &#039;&#039;&#039;run&#039;&#039;&#039; command or start a shell with the &#039;&#039;&#039;shell&#039;&#039;&#039; command.  Please note that if you are in an environment with GPUs and you want to access them inside the container you need to specify the &#039;&#039;&#039;--nv&#039;&#039;&#039; flag.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer run --nv cuda10.2.sif nvidia-smi -L&lt;br /&gt;
GPU 0: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-8e040d17-402e-cc86-4e83-eb2b1d501f1e)&lt;br /&gt;
GPU 1: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-d681a21a-8cdd-e624-6bf8-5b0234584ba2)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Example==&lt;br /&gt;
We have a [https://gitlab.umiacs.umd.edu/derek/gpudocker gpudocker] example workflow using our [[GitLab]] as a Docker registry.  You can clone the repository and further customize this to your needs. The workflow is:&lt;br /&gt;
# Run Docker on a laptop or personal desktop to create the image.&lt;br /&gt;
# Tag the image and push it to the repository.&lt;br /&gt;
# Pull the image down onto one of our workstations/clusters and run it with your data. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer pull gpudocker.sif docker://registry.umiacs.umd.edu/derek/gpudocker&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob 85386706b020 done&lt;br /&gt;
Copying blob 45d437916d57 done&lt;br /&gt;
Copying blob c1bbdc448b72 done&lt;br /&gt;
Copying blob d8f1569ddae6 done&lt;br /&gt;
Copying blob 7ddbc47eeb70 done&lt;br /&gt;
Copying blob 8c3b70e39044 done&lt;br /&gt;
Copying blob ee9b457b77d0 done&lt;br /&gt;
Copying blob be4f3343ecd3 done&lt;br /&gt;
Copying blob 30b4effda4fd done&lt;br /&gt;
Copying blob b6f46848806c done&lt;br /&gt;
Copying blob 44845dc671f7 done&lt;br /&gt;
Copying config 1b2e5b7b99 done&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2022/10/14 10:57:04  info unpack layer: sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
2022/10/14 10:57:06  info unpack layer: sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
2022/10/14 10:57:06  info unpack layer: sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
2022/10/14 10:57:06  info unpack layer: sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
2022/10/14 10:57:06  info unpack layer: sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
2022/10/14 10:57:06  info unpack layer: sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
2022/10/14 10:57:07  info unpack layer: sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
2022/10/14 10:57:07  info unpack layer: sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
2022/10/14 10:57:44  info unpack layer: sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
2022/10/14 10:58:36  info unpack layer: sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
2022/10/14 10:58:43  info unpack layer: sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer run --nv gpudocker.sif python3 -c &#039;from __future__ import print_function; import torch; print(torch.cuda.current_device()); x = torch.rand(5, 3); print(x)&#039;&lt;br /&gt;
0&lt;br /&gt;
tensor([[0.9762, 0.9717, 0.1510],&lt;br /&gt;
        [0.0995, 0.8492, 0.9194],&lt;br /&gt;
        [0.2057, 0.7201, 0.1342],&lt;br /&gt;
        [0.4137, 0.8130, 0.6911],&lt;br /&gt;
        [0.7236, 0.1144, 0.0854]])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10689</id>
		<title>Apptainer</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10689"/>
		<updated>2022-10-14T14:40:52Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Singularity was rebranded as Apptainer.  You should still be able to run commands on the system with &amp;lt;code&amp;gt;singularity&amp;lt;/code&amp;gt;; however, you should start migrating to the &amp;lt;code&amp;gt;apptainer&amp;lt;/code&amp;gt; command.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
[https://apptainer.org Apptainer] is a container platform that doesn&#039;t elevate the privileges of a user running the container.  This is important as UMIACS runs many multi-tenant hosts and doesn&#039;t provide administrative control to users on them.&lt;br /&gt;
&lt;br /&gt;
You can find the current version we provide by running the &#039;&#039;&#039;apptainer --version&#039;&#039;&#039; command.  If this instead returns &amp;lt;code&amp;gt;apptainer: command not found&amp;lt;/code&amp;gt;, please contact staff and we will ensure that the software is available on the host in question.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# apptainer --version&lt;br /&gt;
apptainer version 1.1.0-1.el7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apptainer can run a variety of images including its own format and [https://apptainer.org/docs/user/1.1/docker_and_oci.html Docker images].  To create images, you need to have administrative rights. Therefore, you will need to do this on a host that you have administrative access to (laptop or personal desktop) rather than a UMIACS-supported host.&lt;br /&gt;
&lt;br /&gt;
If you are going to pull large images, you may run out of space in your home directory. We suggest you run the following commands to set up an alternate cache directory.  We are using &amp;lt;code&amp;gt;/scratch0&amp;lt;/code&amp;gt;, but you can substitute any sufficiently large network scratch or project directory.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export WORKDIR=/scratch0/$USER&lt;br /&gt;
export APPTAINER_CACHEDIR=${WORKDIR}/.cache&lt;br /&gt;
mkdir -p $APPTAINER_CACHEDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We suggest you pull images down into an intermediate file (a &#039;&#039;&#039;SIF&#039;&#039;&#039; file) so that you do not have to re-cache the image later.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer pull cuda10.2.sif docker://nvidia/cuda:10.2-devel&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob d5d706ce7b29 done&lt;br /&gt;
Copying blob b4dc78aeafca done&lt;br /&gt;
Copying blob 24a22c1b7260 done&lt;br /&gt;
Copying blob 8dea37be3176 done&lt;br /&gt;
Copying blob 25fa05cd42bd done&lt;br /&gt;
Copying blob a57130ec8de1 done&lt;br /&gt;
Copying blob 880a66924cf5 done&lt;br /&gt;
Copying config db554d658b done&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2022/10/14 10:31:17  info unpack layer: sha256:25fa05cd42bd8fabb25d2a6f3f8c9f7ab34637903d00fd2ed1c1d0fa980427dd&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:24a22c1b72605a4dbcec13b743ef60a6cbb43185fe46fd8a35941f9af7c11153&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:8dea37be3176a88fae41c265562d5fb438d9281c356dcb4edeaa51451dbdfdb2&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:b4dc78aeafca6321025300e9d3050c5ba3fb2ac743ae547c6e1efa3f9284ce0b&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:a57130ec8de1e44163e965620d5aed2abe6cddf48b48272964bfd8bca101df38&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:d5d706ce7b293ffb369d3bf0e3f58f959977903b82eb26433fe58645f79b778b&lt;br /&gt;
2022/10/14 10:31:49  info unpack layer: sha256:880a66924cf5e11df601a4f531f3741c6867a3e05238bc9b7cebb2a68d479204&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer inspect cuda10.2.sif&lt;br /&gt;
maintainer: NVIDIA CORPORATION &amp;lt;cudatools@nvidia.com&amp;gt;&lt;br /&gt;
org.label-schema.build-arch: amd64&lt;br /&gt;
org.label-schema.build-date: Friday_14_October_2022_10:32:42_EDT&lt;br /&gt;
org.label-schema.schema-version: 1.0&lt;br /&gt;
org.label-schema.usage.apptainer.version: 1.1.0-1.el7&lt;br /&gt;
org.label-schema.usage.singularity.deffile.bootstrap: docker&lt;br /&gt;
org.label-schema.usage.singularity.deffile.from: nvidia/cuda:10.2-devel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run the local image with the &#039;&#039;&#039;run&#039;&#039;&#039; command or start a shell with the &#039;&#039;&#039;shell&#039;&#039;&#039; command.  Please note that if you are in an environment with GPUs and you want to access them inside the container you need to specify the &#039;&#039;&#039;--nv&#039;&#039;&#039; flag.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer run --nv cuda10.2.sif nvidia-smi -L&lt;br /&gt;
GPU 0: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-8e040d17-402e-cc86-4e83-eb2b1d501f1e)&lt;br /&gt;
GPU 1: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-d681a21a-8cdd-e624-6bf8-5b0234584ba2)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Example==&lt;br /&gt;
We have a [https://gitlab.umiacs.umd.edu/derek/gpudocker gpudocker] example workflow using our [[GitLab]] as a Docker registry.  You can clone the repository and further customize this to your needs. The workflow is:&lt;br /&gt;
# Run Docker on a laptop or personal desktop to create the image.&lt;br /&gt;
# Tag the image and push it to the repository.&lt;br /&gt;
# Pull the image down onto one of our workstations/clusters and run it with your data. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity pull gpudocker.sif docker://registry.umiacs.umd.edu/derek/gpudocker&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
 25.45 MiB / 25.45 MiB [====================================================] 2s&lt;br /&gt;
Copying blob sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
 34.53 KiB / 34.53 KiB [====================================================] 0s&lt;br /&gt;
Copying blob sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
 845 B / 845 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
 162 B / 162 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
 6.88 MiB / 6.88 MiB [======================================================] 0s&lt;br /&gt;
Copying blob sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
 8.05 MiB / 8.05 MiB [======================================================] 1s&lt;br /&gt;
Copying blob sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
 184 B / 184 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
 656.83 MiB / 656.83 MiB [=================================================] 28s&lt;br /&gt;
Copying blob sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
 782.81 MiB / 782.81 MiB [=================================================] 37s&lt;br /&gt;
Copying blob sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
 100.58 MiB / 100.58 MiB [==================================================] 5s&lt;br /&gt;
Copying blob sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
 1.47 GiB / 1.47 GiB [====================================================] 1m4s&lt;br /&gt;
Copying config sha256:1b2e5b7b99af9d797ef6fbd091a6a2c6a30e519e31a74f5e9cacb4c8c462d6ed&lt;br /&gt;
 7.56 KiB / 7.56 KiB [======================================================] 0s&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2020/04/14 12:21:17  info unpack layer: sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
2020/04/14 12:21:35  info unpack layer: sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
2020/04/14 12:21:55  info unpack layer: sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
2020/04/14 12:21:58  info unpack layer: sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
INFO:    Build complete: gpudocker.sif&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity run --nv gpudocker.sif python3 -c &#039;from __future__ import print_function; import torch; print(torch.cuda.current_device()); x = torch.rand(5, 3); print(x)&#039;&lt;br /&gt;
0&lt;br /&gt;
tensor([[0.5299, 0.9827, 0.7858],&lt;br /&gt;
        [0.2044, 0.6783, 0.2606],&lt;br /&gt;
        [0.0538, 0.4272, 0.9361],&lt;br /&gt;
        [0.1980, 0.2654, 0.4160],&lt;br /&gt;
        [0.1680, 0.8407, 0.0509]])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10688</id>
		<title>Apptainer</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10688"/>
		<updated>2022-10-14T14:39:14Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Singularity was rebranded as Apptainer.  You should still be able to run commands on the system with &amp;lt;code&amp;gt;singularity&amp;lt;/code&amp;gt;; however, you should start migrating to the &amp;lt;code&amp;gt;apptainer&amp;lt;/code&amp;gt; command.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
[https://apptainer.org Apptainer] is a container platform that doesn&#039;t elevate the privileges of a user running the container.  This is important as UMIACS runs many multi-tenant hosts and doesn&#039;t provide administrative control to users on them.&lt;br /&gt;
&lt;br /&gt;
You can find the current version we provide by running the &#039;&#039;&#039;apptainer --version&#039;&#039;&#039; command.  If this instead returns &amp;lt;code&amp;gt;apptainer: command not found&amp;lt;/code&amp;gt;, please contact staff and we will ensure that the software is available on the host in question.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# apptainer --version&lt;br /&gt;
apptainer version 1.1.0-1.el7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apptainer can run a variety of images including its own format and [https://apptainer.org/docs/user/1.1/docker_and_oci.html Docker images].  Creating images requires administrative rights, so you will need to do this on a host you administer (a laptop or personal desktop) rather than on a UMIACS-supported host.&lt;br /&gt;
&lt;br /&gt;
If you are going to pull large images, you may run out of space in your home directory. We suggest you run the following commands to set up an alternate cache directory.  We are using &amp;lt;code&amp;gt;/scratch0&amp;lt;/code&amp;gt;, but you can substitute any sufficiently large network scratch or project directory.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export WORKDIR=/scratch0/$USER&lt;br /&gt;
export APPTAINER_CACHEDIR=${WORKDIR}/.cache&lt;br /&gt;
mkdir -p $APPTAINER_CACHEDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
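Large pulls also consume temporary space while the OCI layers are converted to SIF; Apptainer honors the APPTAINER_TMPDIR environment variable for that scratch space, so you may want to point it at the same area. A sketch, using a placeholder path in place of your scratch directory:

```shell
# Point both the cache and the temporary build space at scratch
# (placeholder path; substitute your own scratch or project directory,
# e.g. /scratch0/$USER as above).
export WORKDIR=/tmp/apptainer-demo
export APPTAINER_CACHEDIR=${WORKDIR}/.cache
export APPTAINER_TMPDIR=${WORKDIR}/.tmp
mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"
```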
&lt;br /&gt;
We suggest pulling images down into an intermediate file (a &#039;&#039;&#039;SIF&#039;&#039;&#039; file) so that you do not have to worry about re-caching the image.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer pull cuda10.2.sif docker://nvidia/cuda:10.2-devel&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob d5d706ce7b29 done&lt;br /&gt;
Copying blob b4dc78aeafca done&lt;br /&gt;
Copying blob 24a22c1b7260 done&lt;br /&gt;
Copying blob 8dea37be3176 done&lt;br /&gt;
Copying blob 25fa05cd42bd done&lt;br /&gt;
Copying blob a57130ec8de1 done&lt;br /&gt;
Copying blob 880a66924cf5 done&lt;br /&gt;
Copying config db554d658b done&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2022/10/14 10:31:17  info unpack layer: sha256:25fa05cd42bd8fabb25d2a6f3f8c9f7ab34637903d00fd2ed1c1d0fa980427dd&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:24a22c1b72605a4dbcec13b743ef60a6cbb43185fe46fd8a35941f9af7c11153&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:8dea37be3176a88fae41c265562d5fb438d9281c356dcb4edeaa51451dbdfdb2&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:b4dc78aeafca6321025300e9d3050c5ba3fb2ac743ae547c6e1efa3f9284ce0b&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:a57130ec8de1e44163e965620d5aed2abe6cddf48b48272964bfd8bca101df38&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:d5d706ce7b293ffb369d3bf0e3f58f959977903b82eb26433fe58645f79b778b&lt;br /&gt;
2022/10/14 10:31:49  info unpack layer: sha256:880a66924cf5e11df601a4f531f3741c6867a3e05238bc9b7cebb2a68d479204&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer inspect cuda10.2.sif&lt;br /&gt;
maintainer: NVIDIA CORPORATION &amp;lt;cudatools@nvidia.com&amp;gt;&lt;br /&gt;
org.label-schema.build-arch: amd64&lt;br /&gt;
org.label-schema.build-date: Friday_14_October_2022_10:32:42_EDT&lt;br /&gt;
org.label-schema.schema-version: 1.0&lt;br /&gt;
org.label-schema.usage.apptainer.version: 1.1.0-1.el7&lt;br /&gt;
org.label-schema.usage.singularity.deffile.bootstrap: docker&lt;br /&gt;
org.label-schema.usage.singularity.deffile.from: nvidia/cuda:10.2-devel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run the local image with the &#039;&#039;&#039;run&#039;&#039;&#039; command or start a shell with the &#039;&#039;&#039;shell&#039;&#039;&#039; command.  Please note that if you are in an environment with GPUs and want to access them inside the container, you need to specify the &#039;&#039;&#039;--nv&#039;&#039;&#039; flag.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer run --nv cuda10.2.sif nvidia-smi -L&lt;br /&gt;
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-9ee980c3-8746-08dd-8e14-82fbaf88367e)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Example==&lt;br /&gt;
We have a [https://gitlab.umiacs.umd.edu/derek/gpudocker gpudocker] example workflow using our [[GitLab]] as a Docker registry.  You can clone the repository and further customize this to your needs. The workflow is:&lt;br /&gt;
# Run Docker on a laptop or personal desktop to create the image.&lt;br /&gt;
# Tag the image and push it to the repository.&lt;br /&gt;
# Pull the image down onto one of our workstations/clusters and run it with your data. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity pull gpudocker.sif docker://registry.umiacs.umd.edu/derek/gpudocker&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
 25.45 MiB / 25.45 MiB [====================================================] 2s&lt;br /&gt;
Copying blob sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
 34.53 KiB / 34.53 KiB [====================================================] 0s&lt;br /&gt;
Copying blob sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
 845 B / 845 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
 162 B / 162 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
 6.88 MiB / 6.88 MiB [======================================================] 0s&lt;br /&gt;
Copying blob sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
 8.05 MiB / 8.05 MiB [======================================================] 1s&lt;br /&gt;
Copying blob sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
 184 B / 184 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
 656.83 MiB / 656.83 MiB [=================================================] 28s&lt;br /&gt;
Copying blob sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
 782.81 MiB / 782.81 MiB [=================================================] 37s&lt;br /&gt;
Copying blob sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
 100.58 MiB / 100.58 MiB [==================================================] 5s&lt;br /&gt;
Copying blob sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
 1.47 GiB / 1.47 GiB [====================================================] 1m4s&lt;br /&gt;
Copying config sha256:1b2e5b7b99af9d797ef6fbd091a6a2c6a30e519e31a74f5e9cacb4c8c462d6ed&lt;br /&gt;
 7.56 KiB / 7.56 KiB [======================================================] 0s&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2020/04/14 12:21:17  info unpack layer: sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
2020/04/14 12:21:35  info unpack layer: sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
2020/04/14 12:21:55  info unpack layer: sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
2020/04/14 12:21:58  info unpack layer: sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
INFO:    Build complete: gpudocker.sif&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity run --nv gpudocker.sif python3 -c &#039;from __future__ import print_function; import torch; print(torch.cuda.current_device()); x = torch.rand(5, 3); print(x)&#039;&lt;br /&gt;
0&lt;br /&gt;
tensor([[0.5299, 0.9827, 0.7858],&lt;br /&gt;
        [0.2044, 0.6783, 0.2606],&lt;br /&gt;
        [0.0538, 0.4272, 0.9361],&lt;br /&gt;
        [0.1980, 0.2654, 0.4160],&lt;br /&gt;
        [0.1680, 0.8407, 0.0509]])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10687</id>
		<title>Apptainer</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10687"/>
		<updated>2022-10-14T14:38:18Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Singularity was rebranded as Apptainer.  You should still be able to run commands on the system with &amp;lt;code&amp;gt;singularity&amp;lt;/code&amp;gt;; however, you should start migrating to the &amp;lt;code&amp;gt;apptainer&amp;lt;/code&amp;gt; command.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
[https://apptainer.org Apptainer] is a container platform that doesn&#039;t elevate the privileges of a user running the container.  This is important as UMIACS runs many multi-tenant hosts and doesn&#039;t provide administrative control to users on them.&lt;br /&gt;
&lt;br /&gt;
You can find the current version we provide by running the &#039;&#039;&#039;apptainer --version&#039;&#039;&#039; command.  If this instead returns &amp;lt;code&amp;gt;apptainer: command not found&amp;lt;/code&amp;gt;, please contact staff and we will ensure that the software is available on the host in question.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# apptainer --version&lt;br /&gt;
apptainer version 1.1.0-1.el7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apptainer can run a variety of images including its own format and [https://apptainer.org/docs/user/1.1/docker_and_oci.html Docker images].  Creating images requires administrative rights, so you will need to do this on a host you administer (a laptop or personal desktop) rather than on a UMIACS-supported host.&lt;br /&gt;
&lt;br /&gt;
If you are going to pull large images, you may run out of space in your home directory. We suggest you run the following commands to set up an alternate cache directory.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export WORKDIR=/scratch0/username&lt;br /&gt;
export APPTAINER_CACHEDIR=${WORKDIR}/.cache&lt;br /&gt;
mkdir -p $APPTAINER_CACHEDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We suggest pulling images down into an intermediate file (a &#039;&#039;&#039;SIF&#039;&#039;&#039; file) so that you do not have to worry about re-caching the image.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer pull cuda10.2.sif docker://nvidia/cuda:10.2-devel&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob d5d706ce7b29 done&lt;br /&gt;
Copying blob b4dc78aeafca done&lt;br /&gt;
Copying blob 24a22c1b7260 done&lt;br /&gt;
Copying blob 8dea37be3176 done&lt;br /&gt;
Copying blob 25fa05cd42bd done&lt;br /&gt;
Copying blob a57130ec8de1 done&lt;br /&gt;
Copying blob 880a66924cf5 done&lt;br /&gt;
Copying config db554d658b done&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2022/10/14 10:31:17  info unpack layer: sha256:25fa05cd42bd8fabb25d2a6f3f8c9f7ab34637903d00fd2ed1c1d0fa980427dd&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:24a22c1b72605a4dbcec13b743ef60a6cbb43185fe46fd8a35941f9af7c11153&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:8dea37be3176a88fae41c265562d5fb438d9281c356dcb4edeaa51451dbdfdb2&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:b4dc78aeafca6321025300e9d3050c5ba3fb2ac743ae547c6e1efa3f9284ce0b&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:a57130ec8de1e44163e965620d5aed2abe6cddf48b48272964bfd8bca101df38&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:d5d706ce7b293ffb369d3bf0e3f58f959977903b82eb26433fe58645f79b778b&lt;br /&gt;
2022/10/14 10:31:49  info unpack layer: sha256:880a66924cf5e11df601a4f531f3741c6867a3e05238bc9b7cebb2a68d479204&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer inspect cuda10.2.sif&lt;br /&gt;
maintainer: NVIDIA CORPORATION &amp;lt;cudatools@nvidia.com&amp;gt;&lt;br /&gt;
org.label-schema.build-arch: amd64&lt;br /&gt;
org.label-schema.build-date: Friday_14_October_2022_10:32:42_EDT&lt;br /&gt;
org.label-schema.schema-version: 1.0&lt;br /&gt;
org.label-schema.usage.apptainer.version: 1.1.0-1.el7&lt;br /&gt;
org.label-schema.usage.singularity.deffile.bootstrap: docker&lt;br /&gt;
org.label-schema.usage.singularity.deffile.from: nvidia/cuda:10.2-devel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run the local image with the &#039;&#039;&#039;run&#039;&#039;&#039; command or start a shell with the &#039;&#039;&#039;shell&#039;&#039;&#039; command.  Please note that if you are in an environment with GPUs and want to access them inside the container, you need to specify the &#039;&#039;&#039;--nv&#039;&#039;&#039; flag.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer run --nv cuda10.2.sif nvidia-smi -L&lt;br /&gt;
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-9ee980c3-8746-08dd-8e14-82fbaf88367e)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Example==&lt;br /&gt;
We have a [https://gitlab.umiacs.umd.edu/derek/gpudocker gpudocker] example workflow using our [[GitLab]] as a Docker registry.  You can clone the repository and further customize this to your needs. The workflow is:&lt;br /&gt;
# Run Docker on a laptop or personal desktop to create the image.&lt;br /&gt;
# Tag the image and push it to the repository.&lt;br /&gt;
# Pull the image down onto one of our workstations/clusters and run it with your data. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity pull gpudocker.sif docker://registry.umiacs.umd.edu/derek/gpudocker&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
 25.45 MiB / 25.45 MiB [====================================================] 2s&lt;br /&gt;
Copying blob sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
 34.53 KiB / 34.53 KiB [====================================================] 0s&lt;br /&gt;
Copying blob sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
 845 B / 845 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
 162 B / 162 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
 6.88 MiB / 6.88 MiB [======================================================] 0s&lt;br /&gt;
Copying blob sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
 8.05 MiB / 8.05 MiB [======================================================] 1s&lt;br /&gt;
Copying blob sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
 184 B / 184 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
 656.83 MiB / 656.83 MiB [=================================================] 28s&lt;br /&gt;
Copying blob sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
 782.81 MiB / 782.81 MiB [=================================================] 37s&lt;br /&gt;
Copying blob sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
 100.58 MiB / 100.58 MiB [==================================================] 5s&lt;br /&gt;
Copying blob sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
 1.47 GiB / 1.47 GiB [====================================================] 1m4s&lt;br /&gt;
Copying config sha256:1b2e5b7b99af9d797ef6fbd091a6a2c6a30e519e31a74f5e9cacb4c8c462d6ed&lt;br /&gt;
 7.56 KiB / 7.56 KiB [======================================================] 0s&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2020/04/14 12:21:17  info unpack layer: sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
2020/04/14 12:21:35  info unpack layer: sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
2020/04/14 12:21:55  info unpack layer: sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
2020/04/14 12:21:58  info unpack layer: sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
INFO:    Build complete: gpudocker.sif&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity run --nv gpudocker.sif python3 -c &#039;from __future__ import print_function; import torch; print(torch.cuda.current_device()); x = torch.rand(5, 3); print(x)&#039;&lt;br /&gt;
0&lt;br /&gt;
tensor([[0.5299, 0.9827, 0.7858],&lt;br /&gt;
        [0.2044, 0.6783, 0.2606],&lt;br /&gt;
        [0.0538, 0.4272, 0.9361],&lt;br /&gt;
        [0.1980, 0.2654, 0.4160],&lt;br /&gt;
        [0.1680, 0.8407, 0.0509]])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10686</id>
		<title>Apptainer</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10686"/>
		<updated>2022-10-14T14:37:52Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Singularity was rebranded as Apptainer.  You should still be able to run commands on the system with singularity; however, you should start migrating to the apptainer command.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
[https://apptainer.org Apptainer] is a container platform that doesn&#039;t elevate the privileges of a user running the container.  This is important as UMIACS runs many multi-tenant hosts and doesn&#039;t provide administrative control to users on them.&lt;br /&gt;
&lt;br /&gt;
You can find the current version we provide by running the &#039;&#039;&#039;apptainer --version&#039;&#039;&#039; command.  If this instead returns &amp;lt;code&amp;gt;apptainer: command not found&amp;lt;/code&amp;gt;, please contact staff and we will ensure that the software is available on the host in question.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# apptainer --version&lt;br /&gt;
apptainer version 1.1.0-1.el7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apptainer can run a variety of images including its own format and [https://apptainer.org/docs/user/1.1/docker_and_oci.html Docker images].  Creating images requires administrative rights, so you will need to do this on a host you administer (a laptop or personal desktop) rather than on a UMIACS-supported host.&lt;br /&gt;
&lt;br /&gt;
If you are going to pull large images, you may run out of space in your home directory. We suggest you run the following commands to set up an alternate cache directory.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export WORKDIR=/scratch0/username&lt;br /&gt;
export APPTAINER_CACHEDIR=${WORKDIR}/.cache&lt;br /&gt;
mkdir -p $APPTAINER_CACHEDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We suggest pulling images down into an intermediate file (a &#039;&#039;&#039;SIF&#039;&#039;&#039; file) so that you do not have to worry about re-caching the image.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer pull cuda10.2.sif docker://nvidia/cuda:10.2-devel&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob d5d706ce7b29 done&lt;br /&gt;
Copying blob b4dc78aeafca done&lt;br /&gt;
Copying blob 24a22c1b7260 done&lt;br /&gt;
Copying blob 8dea37be3176 done&lt;br /&gt;
Copying blob 25fa05cd42bd done&lt;br /&gt;
Copying blob a57130ec8de1 done&lt;br /&gt;
Copying blob 880a66924cf5 done&lt;br /&gt;
Copying config db554d658b done&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2022/10/14 10:31:17  info unpack layer: sha256:25fa05cd42bd8fabb25d2a6f3f8c9f7ab34637903d00fd2ed1c1d0fa980427dd&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:24a22c1b72605a4dbcec13b743ef60a6cbb43185fe46fd8a35941f9af7c11153&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:8dea37be3176a88fae41c265562d5fb438d9281c356dcb4edeaa51451dbdfdb2&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:b4dc78aeafca6321025300e9d3050c5ba3fb2ac743ae547c6e1efa3f9284ce0b&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:a57130ec8de1e44163e965620d5aed2abe6cddf48b48272964bfd8bca101df38&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:d5d706ce7b293ffb369d3bf0e3f58f959977903b82eb26433fe58645f79b778b&lt;br /&gt;
2022/10/14 10:31:49  info unpack layer: sha256:880a66924cf5e11df601a4f531f3741c6867a3e05238bc9b7cebb2a68d479204&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer inspect cuda10.2.sif&lt;br /&gt;
maintainer: NVIDIA CORPORATION &amp;lt;cudatools@nvidia.com&amp;gt;&lt;br /&gt;
org.label-schema.build-arch: amd64&lt;br /&gt;
org.label-schema.build-date: Friday_14_October_2022_10:32:42_EDT&lt;br /&gt;
org.label-schema.schema-version: 1.0&lt;br /&gt;
org.label-schema.usage.apptainer.version: 1.1.0-1.el7&lt;br /&gt;
org.label-schema.usage.singularity.deffile.bootstrap: docker&lt;br /&gt;
org.label-schema.usage.singularity.deffile.from: nvidia/cuda:10.2-devel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run the local image with the &#039;&#039;&#039;run&#039;&#039;&#039; command or start a shell with the &#039;&#039;&#039;shell&#039;&#039;&#039; command.  Please note that if you are in an environment with GPUs and want to access them inside the container, you need to specify the &#039;&#039;&#039;--nv&#039;&#039;&#039; flag.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer run --nv cuda10.2.sif nvidia-smi -L&lt;br /&gt;
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-9ee980c3-8746-08dd-8e14-82fbaf88367e)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Example==&lt;br /&gt;
We have a [https://gitlab.umiacs.umd.edu/derek/gpudocker gpudocker] example workflow using our [[GitLab]] as a Docker registry.  You can clone the repository and further customize this to your needs. The workflow is:&lt;br /&gt;
# Run Docker on a laptop or personal desktop to create the image.&lt;br /&gt;
# Tag the image and push it to the repository.&lt;br /&gt;
# Pull the image down onto one of our workstations/clusters and run it with your data. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity pull gpudocker.sif docker://registry.umiacs.umd.edu/derek/gpudocker&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
 25.45 MiB / 25.45 MiB [====================================================] 2s&lt;br /&gt;
Copying blob sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
 34.53 KiB / 34.53 KiB [====================================================] 0s&lt;br /&gt;
Copying blob sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
 845 B / 845 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
 162 B / 162 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
 6.88 MiB / 6.88 MiB [======================================================] 0s&lt;br /&gt;
Copying blob sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
 8.05 MiB / 8.05 MiB [======================================================] 1s&lt;br /&gt;
Copying blob sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
 184 B / 184 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
 656.83 MiB / 656.83 MiB [=================================================] 28s&lt;br /&gt;
Copying blob sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
 782.81 MiB / 782.81 MiB [=================================================] 37s&lt;br /&gt;
Copying blob sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
 100.58 MiB / 100.58 MiB [==================================================] 5s&lt;br /&gt;
Copying blob sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
 1.47 GiB / 1.47 GiB [====================================================] 1m4s&lt;br /&gt;
Copying config sha256:1b2e5b7b99af9d797ef6fbd091a6a2c6a30e519e31a74f5e9cacb4c8c462d6ed&lt;br /&gt;
 7.56 KiB / 7.56 KiB [======================================================] 0s&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2020/04/14 12:21:17  info unpack layer: sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
2020/04/14 12:21:35  info unpack layer: sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
2020/04/14 12:21:55  info unpack layer: sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
2020/04/14 12:21:58  info unpack layer: sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
INFO:    Build complete: gpudocker.sif&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity run --nv gpudocker.sif python3 -c &#039;from __future__ import print_function; import torch; print(torch.cuda.current_device()); x = torch.rand(5, 3); print(x)&#039;&lt;br /&gt;
0&lt;br /&gt;
tensor([[0.5299, 0.9827, 0.7858],&lt;br /&gt;
        [0.2044, 0.6783, 0.2606],&lt;br /&gt;
        [0.0538, 0.4272, 0.9361],&lt;br /&gt;
        [0.1980, 0.2654, 0.4160],&lt;br /&gt;
        [0.1680, 0.8407, 0.0509]])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10685</id>
		<title>Apptainer</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10685"/>
		<updated>2022-10-14T14:37:29Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Note:&#039;&#039;&#039; Singularity was rebranded as Apptainer.  You can still run commands on the system with &#039;&#039;&#039;singularity&#039;&#039;&#039;, but you should start migrating to the &#039;&#039;&#039;apptainer&#039;&#039;&#039; command.  All other functionality is the same other than the command name.&lt;br /&gt;
&lt;br /&gt;
[https://apptainer.org Apptainer] is a container platform that doesn&#039;t elevate the privileges of a user running the container.  This is important as UMIACS runs many multi-tenant hosts and doesn&#039;t provide administrative control to users on them.&lt;br /&gt;
&lt;br /&gt;
You can find out the current version we provide by running the &#039;&#039;&#039;apptainer --version&#039;&#039;&#039; command.  If this instead says &amp;lt;code&amp;gt;apptainer: command not found&amp;lt;/code&amp;gt;, please contact staff and we will ensure that the software is available on the host in question.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# apptainer --version&lt;br /&gt;
apptainer version 1.1.0-1.el7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apptainer can run a variety of images including its own format and [https://apptainer.org/docs/user/1.1/docker_and_oci.html Docker images].  To create images, you need to have administrative rights. Therefore, you will need to do this on a host that you have administrative access to (laptop or personal desktop) rather than a UMIACS-supported host.&lt;br /&gt;
&lt;br /&gt;
If you are going to pull large images, you may run out of space in your home directory. We suggest you run the following commands to set up an alternate cache directory.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export WORKDIR=/scratch0/username&lt;br /&gt;
export APPTAINER_CACHEDIR=${WORKDIR}/.cache&lt;br /&gt;
mkdir -p $APPTAINER_CACHEDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
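The setup above can be sketched end to end; the throwaway directory below stands in for /scratch0/username, which is the location you would actually use on a workstation.&lt;br /&gt;

```shell
# Sketch of the alternate-cache setup above; mktemp stands in for
# /scratch0/username so the sketch runs anywhere.
WORKDIR="$(mktemp -d)"
export APPTAINER_CACHEDIR="${WORKDIR}/.cache"
mkdir -p "$APPTAINER_CACHEDIR"
echo "cache dir: $APPTAINER_CACHEDIR"
```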
&lt;br /&gt;
We suggest you pull images down into an intermediate &#039;&#039;&#039;SIF&#039;&#039;&#039; file so that you do not have to worry about re-caching the image.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer pull cuda10.2.sif docker://nvidia/cuda:10.2-devel&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob d5d706ce7b29 done&lt;br /&gt;
Copying blob b4dc78aeafca done&lt;br /&gt;
Copying blob 24a22c1b7260 done&lt;br /&gt;
Copying blob 8dea37be3176 done&lt;br /&gt;
Copying blob 25fa05cd42bd done&lt;br /&gt;
Copying blob a57130ec8de1 done&lt;br /&gt;
Copying blob 880a66924cf5 done&lt;br /&gt;
Copying config db554d658b done&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2022/10/14 10:31:17  info unpack layer: sha256:25fa05cd42bd8fabb25d2a6f3f8c9f7ab34637903d00fd2ed1c1d0fa980427dd&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:24a22c1b72605a4dbcec13b743ef60a6cbb43185fe46fd8a35941f9af7c11153&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:8dea37be3176a88fae41c265562d5fb438d9281c356dcb4edeaa51451dbdfdb2&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:b4dc78aeafca6321025300e9d3050c5ba3fb2ac743ae547c6e1efa3f9284ce0b&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:a57130ec8de1e44163e965620d5aed2abe6cddf48b48272964bfd8bca101df38&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:d5d706ce7b293ffb369d3bf0e3f58f959977903b82eb26433fe58645f79b778b&lt;br /&gt;
2022/10/14 10:31:49  info unpack layer: sha256:880a66924cf5e11df601a4f531f3741c6867a3e05238bc9b7cebb2a68d479204&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer inspect cuda10.2.sif&lt;br /&gt;
maintainer: NVIDIA CORPORATION &amp;lt;cudatools@nvidia.com&amp;gt;&lt;br /&gt;
org.label-schema.build-arch: amd64&lt;br /&gt;
org.label-schema.build-date: Friday_14_October_2022_10:32:42_EDT&lt;br /&gt;
org.label-schema.schema-version: 1.0&lt;br /&gt;
org.label-schema.usage.apptainer.version: 1.1.0-1.el7&lt;br /&gt;
org.label-schema.usage.singularity.deffile.bootstrap: docker&lt;br /&gt;
org.label-schema.usage.singularity.deffile.from: nvidia/cuda:10.2-devel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run the local image with the &#039;&#039;&#039;run&#039;&#039;&#039; command or start a shell with the &#039;&#039;&#039;shell&#039;&#039;&#039; command.  Please note that if you are in an environment with GPUs and you want to access them inside the container, you need to specify the &#039;&#039;&#039;--nv&#039;&#039;&#039; flag.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer run --nv cuda10.2.sif nvidia-smi -L&lt;br /&gt;
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-9ee980c3-8746-08dd-8e14-82fbaf88367e)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Example==&lt;br /&gt;
We have a [https://gitlab.umiacs.umd.edu/derek/gpudocker gpudocker] example workflow using our [[GitLab]] as a Docker registry.  You can clone the repository and further customize this to your needs. The workflow is:&lt;br /&gt;
# Run Docker on a laptop or personal desktop to create the image.&lt;br /&gt;
# Tag the image and push it to the repository.&lt;br /&gt;
# Pull the image down onto one of our workstations/clusters and run it with your data. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity pull gpudocker.sif docker://registry.umiacs.umd.edu/derek/gpudocker&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
 25.45 MiB / 25.45 MiB [====================================================] 2s&lt;br /&gt;
Copying blob sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
 34.53 KiB / 34.53 KiB [====================================================] 0s&lt;br /&gt;
Copying blob sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
 845 B / 845 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
 162 B / 162 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
 6.88 MiB / 6.88 MiB [======================================================] 0s&lt;br /&gt;
Copying blob sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
 8.05 MiB / 8.05 MiB [======================================================] 1s&lt;br /&gt;
Copying blob sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
 184 B / 184 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
 656.83 MiB / 656.83 MiB [=================================================] 28s&lt;br /&gt;
Copying blob sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
 782.81 MiB / 782.81 MiB [=================================================] 37s&lt;br /&gt;
Copying blob sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
 100.58 MiB / 100.58 MiB [==================================================] 5s&lt;br /&gt;
Copying blob sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
 1.47 GiB / 1.47 GiB [====================================================] 1m4s&lt;br /&gt;
Copying config sha256:1b2e5b7b99af9d797ef6fbd091a6a2c6a30e519e31a74f5e9cacb4c8c462d6ed&lt;br /&gt;
 7.56 KiB / 7.56 KiB [======================================================] 0s&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2020/04/14 12:21:17  info unpack layer: sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
2020/04/14 12:21:35  info unpack layer: sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
2020/04/14 12:21:55  info unpack layer: sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
2020/04/14 12:21:58  info unpack layer: sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
INFO:    Build complete: gpudocker.sif&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity run --nv gpudocker.sif python3 -c &#039;from __future__ import print_function; import torch; print(torch.cuda.current_device()); x = torch.rand(5, 3); print(x)&#039;&lt;br /&gt;
0&lt;br /&gt;
tensor([[0.5299, 0.9827, 0.7858],&lt;br /&gt;
        [0.2044, 0.6783, 0.2606],&lt;br /&gt;
        [0.0538, 0.4272, 0.9361],&lt;br /&gt;
        [0.1980, 0.2654, 0.4160],&lt;br /&gt;
        [0.1680, 0.8407, 0.0509]])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10684</id>
		<title>Apptainer</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Apptainer&amp;diff=10684"/>
		<updated>2022-10-14T14:37:15Z</updated>

		<summary type="html">&lt;p&gt;Derek: Created page with &amp;quot;***Note*** Singularity was rebranded as Apptainer.  You should still be able to run commands on the system with singularity however should should start migrating to using the...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;***Note*** Singularity was rebranded as Apptainer.  You can still run commands on the system with singularity, but you should start migrating to the apptainer command.  All other functionality is the same other than the command name.&lt;br /&gt;
&lt;br /&gt;
[https://apptainer.org Apptainer] is a container platform that doesn&#039;t elevate the privileges of a user running the container.  This is important as UMIACS runs many multi-tenant hosts and doesn&#039;t provide administrative control to users on them.&lt;br /&gt;
&lt;br /&gt;
You can find out the current version we provide by running the &#039;&#039;&#039;apptainer --version&#039;&#039;&#039; command.  If this instead says &amp;lt;code&amp;gt;apptainer: command not found&amp;lt;/code&amp;gt;, please contact staff and we will ensure that the software is available on the host in question.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# apptainer --version&lt;br /&gt;
apptainer version 1.1.0-1.el7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Apptainer can run a variety of images including its own format and [https://apptainer.org/docs/user/1.1/docker_and_oci.html Docker images].  To create images, you need to have administrative rights. Therefore, you will need to do this on a host that you have administrative access to (laptop or personal desktop) rather than a UMIACS-supported host.&lt;br /&gt;
&lt;br /&gt;
If you are going to pull large images, you may run out of space in your home directory. We suggest you run the following commands to set up an alternate cache directory.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
export WORKDIR=/scratch0/username&lt;br /&gt;
export APPTAINER_CACHEDIR=${WORKDIR}/.cache&lt;br /&gt;
mkdir -p $APPTAINER_CACHEDIR&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We suggest you pull images down into an intermediate &#039;&#039;&#039;SIF&#039;&#039;&#039; file so that you do not have to worry about re-caching the image.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer pull cuda10.2.sif docker://nvidia/cuda:10.2-devel&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob d5d706ce7b29 done&lt;br /&gt;
Copying blob b4dc78aeafca done&lt;br /&gt;
Copying blob 24a22c1b7260 done&lt;br /&gt;
Copying blob 8dea37be3176 done&lt;br /&gt;
Copying blob 25fa05cd42bd done&lt;br /&gt;
Copying blob a57130ec8de1 done&lt;br /&gt;
Copying blob 880a66924cf5 done&lt;br /&gt;
Copying config db554d658b done&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2022/10/14 10:31:17  info unpack layer: sha256:25fa05cd42bd8fabb25d2a6f3f8c9f7ab34637903d00fd2ed1c1d0fa980427dd&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:24a22c1b72605a4dbcec13b743ef60a6cbb43185fe46fd8a35941f9af7c11153&lt;br /&gt;
2022/10/14 10:31:19  info unpack layer: sha256:8dea37be3176a88fae41c265562d5fb438d9281c356dcb4edeaa51451dbdfdb2&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:b4dc78aeafca6321025300e9d3050c5ba3fb2ac743ae547c6e1efa3f9284ce0b&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:a57130ec8de1e44163e965620d5aed2abe6cddf48b48272964bfd8bca101df38&lt;br /&gt;
2022/10/14 10:31:20  info unpack layer: sha256:d5d706ce7b293ffb369d3bf0e3f58f959977903b82eb26433fe58645f79b778b&lt;br /&gt;
2022/10/14 10:31:49  info unpack layer: sha256:880a66924cf5e11df601a4f531f3741c6867a3e05238bc9b7cebb2a68d479204&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer inspect cuda10.2.sif&lt;br /&gt;
maintainer: NVIDIA CORPORATION &amp;lt;cudatools@nvidia.com&amp;gt;&lt;br /&gt;
org.label-schema.build-arch: amd64&lt;br /&gt;
org.label-schema.build-date: Friday_14_October_2022_10:32:42_EDT&lt;br /&gt;
org.label-schema.schema-version: 1.0&lt;br /&gt;
org.label-schema.usage.apptainer.version: 1.1.0-1.el7&lt;br /&gt;
org.label-schema.usage.singularity.deffile.bootstrap: docker&lt;br /&gt;
org.label-schema.usage.singularity.deffile.from: nvidia/cuda:10.2-devel&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now you can run the local image with the &#039;&#039;&#039;run&#039;&#039;&#039; command or start a shell with the &#039;&#039;&#039;shell&#039;&#039;&#039; command.  Please note that if you are in an environment with GPUs and you want to access them inside the container, you need to specify the &#039;&#039;&#039;--nv&#039;&#039;&#039; flag.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ apptainer run --nv cuda10.2.sif nvidia-smi -L&lt;br /&gt;
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-9ee980c3-8746-08dd-8e14-82fbaf88367e)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Example==&lt;br /&gt;
We have a [https://gitlab.umiacs.umd.edu/derek/gpudocker gpudocker] example workflow using our [[GitLab]] as a Docker registry.  You can clone the repository and further customize this to your needs. The workflow is:&lt;br /&gt;
# Run Docker on a laptop or personal desktop to create the image.&lt;br /&gt;
# Tag the image and push it to the repository.&lt;br /&gt;
# Pull the image down onto one of our workstations/clusters and run it with your data. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity pull gpudocker.sif docker://registry.umiacs.umd.edu/derek/gpudocker&lt;br /&gt;
INFO:    Converting OCI blobs to SIF format&lt;br /&gt;
INFO:    Starting build...&lt;br /&gt;
Getting image source signatures&lt;br /&gt;
Copying blob sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
 25.45 MiB / 25.45 MiB [====================================================] 2s&lt;br /&gt;
Copying blob sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
 34.53 KiB / 34.53 KiB [====================================================] 0s&lt;br /&gt;
Copying blob sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
 845 B / 845 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
 162 B / 162 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
 6.88 MiB / 6.88 MiB [======================================================] 0s&lt;br /&gt;
Copying blob sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
 8.05 MiB / 8.05 MiB [======================================================] 1s&lt;br /&gt;
Copying blob sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
 184 B / 184 B [============================================================] 0s&lt;br /&gt;
Copying blob sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
 656.83 MiB / 656.83 MiB [=================================================] 28s&lt;br /&gt;
Copying blob sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
 782.81 MiB / 782.81 MiB [=================================================] 37s&lt;br /&gt;
Copying blob sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
 100.58 MiB / 100.58 MiB [==================================================] 5s&lt;br /&gt;
Copying blob sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
 1.47 GiB / 1.47 GiB [====================================================] 1m4s&lt;br /&gt;
Copying config sha256:1b2e5b7b99af9d797ef6fbd091a6a2c6a30e519e31a74f5e9cacb4c8c462d6ed&lt;br /&gt;
 7.56 KiB / 7.56 KiB [======================================================] 0s&lt;br /&gt;
Writing manifest to image destination&lt;br /&gt;
Storing signatures&lt;br /&gt;
2020/04/14 12:21:17  info unpack layer: sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:c1bbdc448b7263673926b8fe2e88491e5083a8b4b06ddfabf311f2fc5f27e2ff&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:8c3b70e3904492c753652606df4726430426f42ea56e06ea924d6fea7ae162a1&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:45d437916d5781043432f2d72608049dcf74ddbd27daa01a25fa63c8f1b9adc4&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:d8f1569ddae616589c5a2dabf668fadd250ee9d89253ef16f0cb0c8a9459b322&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:85386706b02069c58ffaea9de66c360f9d59890e56f58485d05c1a532ca30db1&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:ee9b457b77d047ff322858e2de025e266ff5908aec569560e77e2e4451fc23f4&lt;br /&gt;
2020/04/14 12:21:18  info unpack layer: sha256:be4f3343ecd31ebf7ec8809f61b1d36c2c2f98fc4e63582401d9108575bc443a&lt;br /&gt;
2020/04/14 12:21:35  info unpack layer: sha256:30b4effda4fdab95ec4eba8873f86e7574c2edddf4dc5df8212e3eda1545aafa&lt;br /&gt;
2020/04/14 12:21:55  info unpack layer: sha256:b6f46848806c8750a68edc4463bf146ed6c3c4af18f5d3f23281dcdfb1c65055&lt;br /&gt;
2020/04/14 12:21:58  info unpack layer: sha256:44845dc671f759820baac0376198141ca683f554bb16a177a3cfe262c9e368ff&lt;br /&gt;
INFO:    Creating SIF file...&lt;br /&gt;
INFO:    Build complete: gpudocker.sif&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity run --nv gpudocker.sif python3 -c &#039;from __future__ import print_function; import torch; print(torch.cuda.current_device()); x = torch.rand(5, 3); print(x)&#039;&lt;br /&gt;
0&lt;br /&gt;
tensor([[0.5299, 0.9827, 0.7858],&lt;br /&gt;
        [0.2044, 0.6783, 0.2606],&lt;br /&gt;
        [0.0538, 0.4272, 0.9361],&lt;br /&gt;
        [0.1980, 0.2654, 0.4160],&lt;br /&gt;
        [0.1680, 0.8407, 0.0509]])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10636</id>
		<title>ClassAccounts</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10636"/>
		<updated>2022-08-25T20:45:44Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Overview==&lt;br /&gt;
UMIACS Class Accounts are currently intended to support classes for all of UMIACS/CSD via the [[Nexus]] cluster. All new class accounts will be serviced solely through this cluster.  Faculty may request that a class be supported by contacting [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].&lt;br /&gt;
&lt;br /&gt;
==Getting an account==&lt;br /&gt;
Your TA will request an account for you. Once this is done, you will be notified by email that you have an account to redeem.  If you have not received an email, please contact your TA. &#039;&#039;&#039;You must redeem the account within 7 days or else the redemption token will expire.&#039;&#039;&#039;  If your redemption token does expire, please contact your TA to have it renewed.&lt;br /&gt;
&lt;br /&gt;
Once you do redeem your account, you will need to wait until you get a confirmation email that your account has been installed.  This is typically done once a day on days that the University is open for business.&lt;br /&gt;
&lt;br /&gt;
===Registering for Duo===&lt;br /&gt;
UMIACS requires that all Class accounts be registered for MFA (multi-factor authentication) under our [[Duo]] instance (note that this is different than UMD&#039;s general Duo instance). &#039;&#039;&#039;You will not be able to log onto the class submission host until you register.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you see the following error in your SSH client you have not yet enrolled/registered in Duo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Access is not allowed because you are not enrolled in Duo. Please contact your organization&#039;s IT help desk.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to register, [https://intranet.umiacs.umd.edu/directory visit our directory app] and log in with your Class username and password. You will then receive a prompt to enroll in Duo. For assistance in enrollment, you can visit our [[Duo | Duo help page]].&lt;br /&gt;
&lt;br /&gt;
Once notified that your account has been installed and you have registered in our Duo instance, you can access the following class submission host(s) using [[SSH]] with your assigned username and your chosen password:&lt;br /&gt;
* &amp;lt;code&amp;gt;nexusclass00.umiacs.umd.edu&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;nexusclass01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Cleaning up your account before the end of the semester==&lt;br /&gt;
Class accounts for a given semester will be archived and deleted after that semester&#039;s completion as early as the following:&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;br /&gt;
&lt;br /&gt;
It is your responsibility to ensure you have backed up anything you want to keep from your class account&#039;s personal or group storage (below sections) prior to the relevant date.&lt;br /&gt;
&lt;br /&gt;
==Personal Storage==&lt;br /&gt;
Your home directory has a quota of 20GB and is located at:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/classhomes/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;semester&amp;gt;&amp;lt;/code&amp;gt; is one of &amp;quot;spring&amp;quot;, &amp;quot;summer&amp;quot;, &amp;quot;fall&amp;quot;, or &amp;quot;winter&amp;quot;; &amp;lt;code&amp;gt;&amp;lt;year&amp;gt;&amp;lt;/code&amp;gt; is the current year, e.g., &amp;quot;2021&amp;quot;; &amp;lt;code&amp;gt;&amp;lt;coursecode&amp;gt;&amp;lt;/code&amp;gt; is the class&#039; course code as listed in UMD&#039;s [https://app.testudo.umd.edu/soc/ Schedule of Classes] in all lowercase, e.g., &amp;quot;cmsc999z&amp;quot;; and &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; is the username mentioned in the email you received to redeem the account, e.g., &amp;quot;c999z000&amp;quot;.&lt;br /&gt;
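As a concrete sketch of the path composition described above (the semester, year, course code, and username below are made-up examples):&lt;br /&gt;

```shell
# Compose a class home directory from the placeholder values described above;
# all four values here are hypothetical examples, not real accounts.
semester=spring
year=2021
coursecode=cmsc999z
username=c999z000
home="/fs/classhomes/${semester}${year}/${coursecode}/${username}"
echo "$home"   # /fs/classhomes/spring2021/cmsc999z/c999z000
```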
&lt;br /&gt;
You can request up to another 100GB of personal storage if you would like by having your TA [[HelpDesk | contact staff]]. This storage will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Group Storage==&lt;br /&gt;
You can also request group storage if you would like by having your TA [[HelpDesk | contact staff]] to specify the usernames of the accounts that should be in the group. Only other class accounts in the same class can be added to the group. The quota will be 100GB multiplied by the number of accounts in the group and will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;groupname&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt; is composed of:&lt;br /&gt;
* the abbreviated course code as used in the username e.g., &amp;quot;c999z&amp;quot;&lt;br /&gt;
* the character &amp;quot;g&amp;quot;&lt;br /&gt;
* the number of the group (starting at 0 for the first group for the class requested to us) prepended with 0s to make the total group name 8 characters long&lt;br /&gt;
&lt;br /&gt;
e.g., &amp;quot;c999zg00&amp;quot;.&lt;br /&gt;
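The naming rule above can be sketched as a small shell helper; the function name and sample values are illustrative, not an official UMIACS tool.&lt;br /&gt;

```shell
# make_groupname: abbreviated course code + "g" + the group number,
# zero-padded so the whole name is 8 characters (illustrative helper only).
make_groupname() {
  code="$1"                   # abbreviated course code, e.g. c999z
  num="$2"                    # group number, starting at 0
  pad=$((8 - ${#code} - 1))   # digits remaining after the code and the "g"
  printf '%sg%0*d\n' "$code" "$pad" "$num"
}
make_groupname c999z 0   # c999zg00
```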
&lt;br /&gt;
==Cluster Usage==&lt;br /&gt;
&#039;&#039;&#039;You may not run computational jobs on any submission host.&#039;&#039;&#039;  You must schedule your jobs with the [[SLURM]] workload manager.  You can also find out more with the public documentation for the [https://slurm.schedmd.com/quickstart.html SLURM Workload Manager].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Any questions or issues with the cluster must be first made through your TA.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Class accounts only have access to the following submission parameters in SLURM.  You may be required to explicitly set each of these in your submission parameters.&lt;br /&gt;
&lt;br /&gt;
* Partition - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* Account - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* QoS - &amp;lt;code&amp;gt;default&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;medium&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;high&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Here is a basic example to schedule an interactive job running bash with a single GPU in the partition &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt; with the account &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt; running with the QoS of &amp;lt;code&amp;gt;default&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --pty --partition=class --account=class --qos=default --gres=gpu:1 bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-4.4$ hostname&lt;br /&gt;
tron14.umiacs.umd.edu&lt;br /&gt;
bash-4.4$ nvidia-smi -L&lt;br /&gt;
GPU 0: NVIDIA RTX A4000 (UUID: GPU-55f2d3b7-9162-8b02-50de-476a012c626c)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Available Nodes===&lt;br /&gt;
You can list the available nodes and their current state with the &amp;lt;code&amp;gt;show_nodes -p class&amp;lt;/code&amp;gt; command.  This list of nodes is not completely static as nodes may be pulled out of service to repair/replace GPUs or other components.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_nodes -p class&lt;br /&gt;
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES                             STATE      PARTITION&lt;br /&gt;
tron06               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron07               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron08               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron09               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron10               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron11               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron12               16         128525     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron13               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron14               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron15               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron16               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron17               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron18               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron19               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron20               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron21               16         128525     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron22               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron23               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron24               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron25               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron26               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron27               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron28               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron29               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron30               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron31               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron32               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron33               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron34               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron35               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron36               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron37               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron38               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron39               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron40               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron41               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron42               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron43               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron44               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron45               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also find more granular information about an individual node with the &amp;lt;code&amp;gt;scontrol show node&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scontrol show node tron27&lt;br /&gt;
NodeName=tron27 Arch=x86_64 CoresPerSocket=16&lt;br /&gt;
   CPUAlloc=0 CPUTot=16 CPULoad=0.00&lt;br /&gt;
   AvailableFeatures=rhel8,AMD,EPYC-7302&lt;br /&gt;
   ActiveFeatures=rhel8,AMD,EPYC-7302&lt;br /&gt;
   Gres=gpu:rtxa4000:4&lt;br /&gt;
   NodeAddr=tron27 NodeHostName=tron27 Version=21.08.8-2&lt;br /&gt;
   OS=Linux 4.18.0-372.19.1.el8_6.x86_64 #1 SMP Mon Jul 18 11:14:02 EDT 2022&lt;br /&gt;
   RealMemory=128521 AllocMem=0 FreeMem=125650 Sockets=1 Boards=1&lt;br /&gt;
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A&lt;br /&gt;
   Partitions=class,scavenger,tron&lt;br /&gt;
   BootTime=2022-08-18T17:34:44 SlurmdStartTime=2022-08-19T13:10:47&lt;br /&gt;
   LastBusyTime=2022-08-22T11:20:18&lt;br /&gt;
   CfgTRES=cpu=16,mem=128521M,billing=173,gres/gpu=4,gres/gpu:rtxa4000=4&lt;br /&gt;
   AllocTRES=&lt;br /&gt;
   CapWatts=n/a&lt;br /&gt;
   CurrentWatts=0 AveWatts=0&lt;br /&gt;
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10635</id>
		<title>ClassAccounts/Manage</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10635"/>
		<updated>2022-08-25T20:44:39Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Management of Users */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Class Accounts are provided through our [[Nexus]] infrastructure with a specific partition of nodes.  Each class will need to have the faculty member or TA(s) as the points of contact for the class.  All students within the class will need to raise issues or questions through these points of contact, who can relay or request support through our [[HelpDesk]].  Please refer to the [[ClassAccounts]] document for public facing information.&lt;br /&gt;
&lt;br /&gt;
=Request New Class=&lt;br /&gt;
&lt;br /&gt;
All classes must be requested by the faculty member teaching them by sending email to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu] to create a ticket.  Please list or cc your TA(s) and state what course you are teaching.  UMIACS staff will confirm the course through UMD Testudo and create the class within our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application.  We will associate the faculty member and TA(s) so that they can manage the class through this application, as long as they have a UMIACS account.  If they don&#039;t, we can advise the faculty member through this ticket on how to create and sponsor a [[Accounts/Collaborator]] account in UMIACS.&lt;br /&gt;
&lt;br /&gt;
Please include in your request any special storage allocation(s) you may need.&lt;br /&gt;
&lt;br /&gt;
Once the class is created, Faculty/TAs may want to create class accounts for themselves (which is allowed).  This ensures you see the steps that students will see when getting their account(s) and gives you the ability to test and manage any local resources for the class.&lt;br /&gt;
&lt;br /&gt;
=Management of Users=&lt;br /&gt;
Once the class is created, users go through the following lifecycle.&lt;br /&gt;
&lt;br /&gt;
* Faculty/TAs will create requests in our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application via the class dashboard.  This is done by submitting a form with one or more email addresses.  We only support creating accounts for @umd.edu and @terpmail.umd.edu addresses.&lt;br /&gt;
* Once a request for a student has been created, they will receive an email at this address with a URL containing a token that is only valid for 7 days.  Clicking this URL takes them to a page where they can redeem their account by entering some information about themselves and selecting a password for their account.&lt;br /&gt;
** If the student does not redeem their account, the Faculty/TAs can go into the class account request in the app and renew the token, which invalidates the old token, issues a new one, and sends a new email for account redemption.&lt;br /&gt;
* Accounts that are redeemed will show up as pending in the Accounts Installed section of the class dashboard.  UMIACS Technical Staff will install redeemed class accounts once per business day.  Once installed, the account(s) are listed as installed and the student will receive an email with information on getting started with their class account.&lt;br /&gt;
** Note that class accounts also need to set up [[Duo]] multi-factor authentication before getting access to the cluster.  When trying to SSH, they may see the message &amp;lt;code&amp;gt;Access is not allowed because you are not enrolled in Duo. Please contact your organization&#039;s IT help desk.&amp;lt;/code&amp;gt;  They will need to follow the instructions in [[ClassAccounts#Registering_for_Duo]].&lt;br /&gt;
** If a user does not provide a phone number when redeeming their account, we cannot reset their password automatically.  In this case the TA will need to make a request to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu], and staff will advise on the steps.&lt;br /&gt;
* Faculty/TAs will need to communicate any special instructions or information with their students on how to use the computational resources.&lt;br /&gt;
&lt;br /&gt;
=Deletion of Class=&lt;br /&gt;
We will notify all class account holder(s) and clean up accounts three times a year, after each semester ends.&lt;br /&gt;
&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10634</id>
		<title>ClassAccounts/Manage</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10634"/>
		<updated>2022-08-25T20:41:37Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Management of Users */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Class Accounts are provided through our [[Nexus]] infrastructure with a specific partition of nodes.  Each class will need to have the faculty member or TA(s) as the points of contact for the class.  All students within the class will need to raise issues or questions through these points of contact, who can relay or request support through our [[HelpDesk]].  Please refer to the [[ClassAccounts]] document for public facing information.&lt;br /&gt;
&lt;br /&gt;
=Request New Class=&lt;br /&gt;
&lt;br /&gt;
All classes must be requested by the faculty member teaching them by sending email to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu] to create a ticket.  Please list or cc your TA(s) and state what course you are teaching.  UMIACS staff will confirm the course through UMD Testudo and create the class within our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application.  We will associate the faculty member and TA(s) so that they can manage the class through this application, as long as they have a UMIACS account.  If they don&#039;t, we can advise the faculty member through this ticket on how to create and sponsor a [[Accounts/Collaborator]] account in UMIACS.&lt;br /&gt;
&lt;br /&gt;
Please include in your request any special storage allocation(s) you may need.&lt;br /&gt;
&lt;br /&gt;
Once the class is created, Faculty/TAs may want to create class accounts for themselves (which is allowed).  This ensures you see the steps that students will see when getting their account(s) and gives you the ability to test and manage any local resources for the class.&lt;br /&gt;
&lt;br /&gt;
=Management of Users=&lt;br /&gt;
Once the class is created, users go through the following lifecycle.&lt;br /&gt;
&lt;br /&gt;
* Faculty/TAs will create requests in our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application via the class dashboard.  This is done by submitting a form with one or more email addresses.  We only support creating accounts for @umd.edu and @terpmail.umd.edu addresses.&lt;br /&gt;
* Once a request for a student has been created, they will receive an email at this address with a URL containing a token that is only valid for 7 days.  Clicking this URL takes them to a page where they can redeem their account by entering some information about themselves and selecting a password for their account.&lt;br /&gt;
** If the student does not redeem their account, the Faculty/TAs can go into the class account request in the app and renew the token, which invalidates the old token, issues a new one, and sends a new email for account redemption.&lt;br /&gt;
* Accounts that are redeemed will show up as pending in the Accounts Installed section of the class dashboard.  UMIACS Technical Staff will install redeemed class accounts once per business day.  Once installed, the account(s) are listed as installed and the student will receive an email with information on getting started with their class account.&lt;br /&gt;
** Note that class accounts also need to set up [[Duo]] multi-factor authentication.&lt;br /&gt;
** If a user does not provide a phone number when redeeming their account, we cannot reset their password automatically.  In this case the TA will need to make a request to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu], and staff will advise on the steps.&lt;br /&gt;
* Faculty/TAs will need to communicate any special instructions or information with their students on how to use the computational resources.&lt;br /&gt;
&lt;br /&gt;
=Deletion of Class=&lt;br /&gt;
We will notify all class account holder(s) and clean up accounts three times a year, after each semester ends.&lt;br /&gt;
&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10633</id>
		<title>ClassAccounts/Manage</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10633"/>
		<updated>2022-08-25T20:37:50Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Request New Class */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Class Accounts are provided through our [[Nexus]] infrastructure with a specific partition of nodes.  Each class will need to have the faculty member or TA(s) as the points of contact for the class.  All students within the class will need to raise issues or questions through these points of contact, who can relay or request support through our [[HelpDesk]].  Please refer to the [[ClassAccounts]] document for public facing information.&lt;br /&gt;
&lt;br /&gt;
=Request New Class=&lt;br /&gt;
&lt;br /&gt;
All classes must be requested by the faculty member teaching them by sending email to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu] to create a ticket.  Please list or cc your TA(s) and state what course you are teaching.  UMIACS staff will confirm the course through UMD Testudo and create the class within our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application.  We will associate the faculty member and TA(s) so that they can manage the class through this application, as long as they have a UMIACS account.  If they don&#039;t, we can advise the faculty member through this ticket on how to create and sponsor a [[Accounts/Collaborator]] account in UMIACS.&lt;br /&gt;
&lt;br /&gt;
Please include in your request any special storage allocation(s) you may need.&lt;br /&gt;
&lt;br /&gt;
Once the class is created, Faculty/TAs may want to create class accounts for themselves (which is allowed).  This ensures you see the steps that students will see when getting their account(s) and gives you the ability to test and manage any local resources for the class.&lt;br /&gt;
&lt;br /&gt;
=Management of Users=&lt;br /&gt;
Once the class is created, users go through the following lifecycle.&lt;br /&gt;
&lt;br /&gt;
* Faculty/TAs will create requests in our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application via the class dashboard.  This is done by submitting a form with one or more email addresses.  We only support creating accounts for @umd.edu and @terpmail.umd.edu addresses.&lt;br /&gt;
* Once a request for a student has been created, they will receive an email at this address with a URL containing a token that is only valid for 7 days.  Clicking this URL takes them to a page where they can redeem their account by entering some information about themselves and selecting a password for their account.&lt;br /&gt;
** If the student does not redeem their account, the Faculty/TAs can go into the class account request in the app and renew the token, which invalidates the old token, issues a new one, and sends a new email for account redemption.&lt;br /&gt;
* Accounts that are redeemed will show up as pending in the Accounts Installed section of the class dashboard.  UMIACS Technical Staff will install redeemed class accounts once per business day.  Once installed, the account(s) are listed as installed and the student will receive an email with information on getting started with their class account.&lt;br /&gt;
* Faculty/TAs will need to communicate any special instructions or information with their students on how to use the computational resources.&lt;br /&gt;
&lt;br /&gt;
=Deletion of Class=&lt;br /&gt;
We will notify all class account holder(s) and clean up accounts three times a year, after each semester ends.&lt;br /&gt;
&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10632</id>
		<title>ClassAccounts/Manage</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10632"/>
		<updated>2022-08-25T20:37:06Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Class Accounts are provided through our [[Nexus]] infrastructure with a specific partition of nodes.  Each class will need to have the faculty member or TA(s) as the points of contact for the class.  All students within the class will need to raise issues or questions through these points of contact, who can relay or request support through our [[HelpDesk]].  Please refer to the [[ClassAccounts]] document for public facing information.&lt;br /&gt;
&lt;br /&gt;
=Request New Class=&lt;br /&gt;
&lt;br /&gt;
All classes must be requested by the faculty member teaching them by sending email to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].  Please list or cc your TA(s) and state what course you are teaching.  UMIACS staff will confirm the course through UMD Testudo and create the class within our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application.  We will associate the faculty member and TA(s) so that they can manage the class through this application, as long as they have a UMIACS account.  If they don&#039;t, we can advise the faculty member through this ticket on how to create and sponsor a [[Accounts/Collaborator]] account in UMIACS.&lt;br /&gt;
&lt;br /&gt;
Please include in your request any special storage allocation(s) you may need.&lt;br /&gt;
&lt;br /&gt;
Once the class is created, Faculty/TAs may want to create class accounts for themselves (which is allowed).  This ensures you see the steps that students will see when getting their account(s) and gives you the ability to test and manage any local resources for the class.&lt;br /&gt;
&lt;br /&gt;
=Management of Users=&lt;br /&gt;
Once the class is created, users go through the following lifecycle.&lt;br /&gt;
&lt;br /&gt;
* Faculty/TAs will create requests in our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application via the class dashboard.  This is done by submitting a form with one or more email addresses.  We only support creating accounts for @umd.edu and @terpmail.umd.edu addresses.&lt;br /&gt;
* Once a request for a student has been created, they will receive an email at this address with a URL containing a token that is only valid for 7 days.  Clicking this URL takes them to a page where they can redeem their account by entering some information about themselves and selecting a password for their account.&lt;br /&gt;
** If the student does not redeem their account, the Faculty/TAs can go into the class account request in the app and renew the token, which invalidates the old token, issues a new one, and sends a new email for account redemption.&lt;br /&gt;
* Accounts that are redeemed will show up as pending in the Accounts Installed section of the class dashboard.  UMIACS Technical Staff will install redeemed class accounts once per business day.  Once installed, the account(s) are listed as installed and the student will receive an email with information on getting started with their class account.&lt;br /&gt;
* Faculty/TAs will need to communicate any special instructions or information with their students on how to use the computational resources.&lt;br /&gt;
&lt;br /&gt;
=Deletion of Class=&lt;br /&gt;
We will notify all class account holder(s) and clean up accounts three times a year, after each semester ends.&lt;br /&gt;
&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10631</id>
		<title>ClassAccounts/Manage</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10631"/>
		<updated>2022-08-25T20:35:55Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Request New Class */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Class Accounts are provided through our [[Nexus]] infrastructure with a specific partition of nodes.  Each class will need to have the faculty member or TA(s) as the points of contact for the class.  All students within the class will need to raise issues or questions through these points of contact, who can relay or request support through our [[HelpDesk]].  Please refer to the [[ClassAccounts]] document for public facing information.&lt;br /&gt;
&lt;br /&gt;
=Request New Class=&lt;br /&gt;
&lt;br /&gt;
All classes must be requested by the faculty member teaching them by sending email to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].  Please list or cc your TA(s) and state what course you are teaching.  UMIACS staff will confirm the course through UMD Testudo and create the class within our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application.  We will associate the faculty member and TA(s) so that they can manage the class through this application, as long as they have a UMIACS account.  If they don&#039;t, we can advise the faculty member through this ticket on how to create and sponsor a [[Accounts/Collaborator]] account in UMIACS.&lt;br /&gt;
&lt;br /&gt;
Please include in your request any special storage allocation(s) you may need.&lt;br /&gt;
&lt;br /&gt;
Once the class is created, Faculty/TAs may want to create class accounts for themselves (which is allowed).  This ensures you see the steps that students will see when getting their account(s) and gives you the ability to test and manage any local resources for the class.&lt;br /&gt;
&lt;br /&gt;
=Management of Users=&lt;br /&gt;
Once the class is created, users go through the following lifecycle.&lt;br /&gt;
&lt;br /&gt;
* Faculty/TAs will create requests in our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application via the class dashboard.  This is done by submitting a form with one or more email addresses.  We only support creating accounts for @umd.edu and @terpmail.umd.edu addresses.&lt;br /&gt;
* Once a request for a student has been created, they will receive an email at this address with a URL containing a token that is only valid for 7 days.  Clicking this URL takes them to a page where they can redeem their account by entering some information about themselves and selecting a password for their account.&lt;br /&gt;
** If the student does not redeem their account, the Faculty/TAs can go into the class account request in the app and renew the token, which invalidates the old token, issues a new one, and sends a new email for account redemption.&lt;br /&gt;
* Accounts that are redeemed will show up as pending in the Accounts Installed section of the class dashboard.  UMIACS Technical Staff will install redeemed class accounts once per business day.  Once installed, the account(s) are listed as installed and the student will receive an email with information on getting started with their class account.&lt;br /&gt;
* Faculty/TAs will need to communicate any special instructions or information with their students on how to use the computational resources.&lt;br /&gt;
&lt;br /&gt;
=Deletion of Class=&lt;br /&gt;
We will notify all class account holder(s) and clean up accounts three times a year, after each semester ends.&lt;br /&gt;
&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10630</id>
		<title>ClassAccounts/Manage</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10630"/>
		<updated>2022-08-25T20:35:29Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Class Accounts are provided through our [[Nexus]] infrastructure with a specific partition of nodes.  Each class will need to have the faculty member or TA(s) as the points of contact for the class.  All students within the class will need to raise issues or questions through these points of contact, who can relay or request support through our [[HelpDesk]].  Please refer to the [[ClassAccounts]] document for public facing information.&lt;br /&gt;
&lt;br /&gt;
=Request New Class=&lt;br /&gt;
&lt;br /&gt;
All classes must be requested by the faculty member teaching them by sending email to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].  Please list or cc your TA(s) and state what course you are teaching.  UMIACS staff will confirm the course through UMD Testudo and create the class within our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application.  We will associate the faculty member and TA(s) so that they can manage the class through this application, as long as they have a UMIACS account.  If they don&#039;t, we can advise the faculty member through this ticket on how to create and sponsor a [[Accounts/Collaborator]] account in UMIACS.&lt;br /&gt;
&lt;br /&gt;
Please include in your request any special storage allocation(s) you may need.&lt;br /&gt;
&lt;br /&gt;
Once the class is created, Faculty/TAs may want to create class accounts for themselves (which is allowed).  This ensures you see the steps that students will see when getting their account(s) and gives you the ability to test and manage any local resources for the class.&lt;br /&gt;
&lt;br /&gt;
=Management of Users=&lt;br /&gt;
Once the class is created, users go through the following lifecycle.&lt;br /&gt;
&lt;br /&gt;
* Faculty/TAs will create requests in our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application via the class dashboard.  This is done by submitting a form with one or more email addresses.  We only support creating accounts for @umd.edu and @terpmail.umd.edu addresses.&lt;br /&gt;
* Once a request for a student has been created, they will receive an email at this address with a URL containing a token that is only valid for 7 days.  Clicking this URL takes them to a page where they can redeem their account by entering some information about themselves and selecting a password for their account.&lt;br /&gt;
** If the student does not redeem their account, the Faculty/TAs can go into the class account request in the app and renew the token, which invalidates the old token, issues a new one, and sends a new email for account redemption.&lt;br /&gt;
* Accounts that are redeemed will show up as pending in the Accounts Installed section of the class dashboard.  UMIACS Technical Staff will install redeemed class accounts once per business day.  Once installed, the account(s) are listed as installed and the student will receive an email with information on getting started with their class account.&lt;br /&gt;
* Faculty/TAs will need to communicate any special instructions or information with their students on how to use the computational resources.&lt;br /&gt;
&lt;br /&gt;
=Deletion of Class=&lt;br /&gt;
We will notify all class account holder(s) and clean up accounts three times a year, after each semester ends.&lt;br /&gt;
&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10629</id>
		<title>ClassAccounts/Manage</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10629"/>
		<updated>2022-08-25T20:32:55Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Class Accounts are provided through our [[Nexus]] infrastructure with a specific partition of nodes.  Each class will need to have the faculty member or TA(s) as the points of contact for the class.  All students within the class will need to raise issues or questions through these points of contact, who can relay or request support through our [[HelpDesk]].&lt;br /&gt;
&lt;br /&gt;
=Request New Class=&lt;br /&gt;
&lt;br /&gt;
All classes will need to be requested by the faculty member teaching them by sending email to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].  Please list or cc your TA(s) and state what course you are teaching.  UMIACS staff will confirm the course through UMD Testudo and create the class within our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application.  We will associate the faculty member and TA(s) so that they can manage the class through this application, as long as they have a UMIACS account.  If they don&#039;t, we can advise the faculty member through this ticket on how to create and sponsor a [[Accounts/Collaborator]] account in UMIACS.&lt;br /&gt;
&lt;br /&gt;
Please include in your request any special storage allocation(s) you may need.&lt;br /&gt;
&lt;br /&gt;
=Management of Users=&lt;br /&gt;
Once the class is created, users go through the following lifecycle.&lt;br /&gt;
&lt;br /&gt;
* Faculty/TAs will create requests in our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application via the class dashboard.  This is done by filling out a form with one or more email addresses.  We only support creating accounts for @umd.edu and @terpmail.umd.edu addresses.&lt;br /&gt;
* Once a request for a student has been created, the student will receive an email at that address with a URL containing a token that is only good for 7 days.  Clicking this URL takes them to a page where they redeem their account by entering some information about themselves and selecting a password.&lt;br /&gt;
** If the student does not redeem their account, the Faculty/TAs can go into the class account request in the app and renew the token, which will invalidate the old token, issue a new one, and send a new email for account redemption.&lt;br /&gt;
* Redeemed accounts will show up as pending in the Accounts Installed section of the class dashboard.  UMIACS Technical Staff will install redeemed class accounts once per business day.  Once installed, the account(s) are listed as such and the student will receive an email with information on getting started with their class accounts.&lt;br /&gt;
* Faculty/TAs will need to communicate to their students any special instructions or information on how to use the computational resources.&lt;br /&gt;
&lt;br /&gt;
=Deletion of Class=&lt;br /&gt;
We will notify and clean up all class account holders three times a year, after the end of each semester.&lt;br /&gt;
&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10628</id>
		<title>ClassAccounts/Manage</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10628"/>
		<updated>2022-08-25T20:29:57Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Request New Class */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Class Accounts are provided through our [[Nexus]] infrastructure with a specific partition of nodes.  Each class will need to have the faculty member or TA(s) as the points of contact for the class.  All students within the class will need to raise issues or questions through these points of contact, who can relay or request support through our [[HelpDesk]].&lt;br /&gt;
&lt;br /&gt;
=Request New Class=&lt;br /&gt;
&lt;br /&gt;
All classes will need to be requested by the faculty member teaching them by sending email to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].  Please list or cc your TA(s) and state what course you are teaching.  UMIACS staff will confirm the course through UMD Testudo and create the class within our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application.  We will associate the faculty member and TA(s) so that they can manage the class through this application, as long as they have a UMIACS account.  If they don&#039;t, we can advise the faculty member through this ticket on how to create and sponsor a [[Accounts/Collaborator]] account in UMIACS.&lt;br /&gt;
&lt;br /&gt;
Please include in your request any special storage allocation(s) you may need.&lt;br /&gt;
&lt;br /&gt;
=Management of Users=&lt;br /&gt;
Once the class is created, users go through the following lifecycle.&lt;br /&gt;
&lt;br /&gt;
* Faculty/TAs will create requests in our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application via the class dashboard.  This is done by filling out a form with one or more email addresses.  We only support creating accounts for @umd.edu and @terpmail.umd.edu addresses.&lt;br /&gt;
* Once a request for a student has been created, the student will receive an email at that address with a URL containing a token that is only good for 7 days.  Clicking this URL takes them to a page where they redeem their account by entering some information about themselves and selecting a password.&lt;br /&gt;
** If the student does not redeem their account, the Faculty/TAs can go into the class account request in the app and renew the token, which will invalidate the old token, issue a new one, and send a new email for account redemption.&lt;br /&gt;
* Redeemed accounts will show up as pending in the Accounts Installed section of the class dashboard.  UMIACS Technical Staff will install redeemed class accounts once per business day.  Once installed, the account(s) are listed as such and the student will receive an email with information on getting started with their class accounts.&lt;br /&gt;
* Faculty/TAs will need to communicate to their students any special instructions or information on how to use the computational resources.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10627</id>
		<title>ClassAccounts/Manage</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10627"/>
		<updated>2022-08-25T20:28:05Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Class Accounts are provided through our [[Nexus]] infrastructure with a specific partition of nodes.  Each class will need to have the faculty member or TA(s) as the points of contact for the class.  All students within the class will need to raise issues or questions through these points of contact, who can relay or request support through our [[HelpDesk]].&lt;br /&gt;
&lt;br /&gt;
=Request New Class=&lt;br /&gt;
&lt;br /&gt;
All classes will need to be requested by the faculty member teaching them by sending email to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].  Please list or cc your TA(s) and state what course you are teaching.  UMIACS staff will confirm the course through UMD Testudo and create the class within our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application.  We will associate the faculty member and TA(s) so that they can manage the class through this application, as long as they have a UMIACS account.  If they don&#039;t, we can advise the faculty member through this ticket on how to create and sponsor a [[Accounts/Collaborator]] account in UMIACS.&lt;br /&gt;
&lt;br /&gt;
=Management of Users=&lt;br /&gt;
Once the class is created, users go through the following lifecycle.&lt;br /&gt;
&lt;br /&gt;
* Faculty/TAs will create requests in our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application via the class dashboard.  This is done by filling out a form with one or more email addresses.  We only support creating accounts for @umd.edu and @terpmail.umd.edu addresses.&lt;br /&gt;
* Once a request for a student has been created, the student will receive an email at that address with a URL containing a token that is only good for 7 days.  Clicking this URL takes them to a page where they redeem their account by entering some information about themselves and selecting a password.&lt;br /&gt;
** If the student does not redeem their account, the Faculty/TAs can go into the class account request in the app and renew the token, which will invalidate the old token, issue a new one, and send a new email for account redemption.&lt;br /&gt;
* Redeemed accounts will show up as pending in the Accounts Installed section of the class dashboard.  UMIACS Technical Staff will install redeemed class accounts once per business day.  Once installed, the account(s) are listed as such and the student will receive an email with information on getting started with their class accounts.&lt;br /&gt;
* Faculty/TAs will need to communicate to their students any special instructions or information on how to use the computational resources.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10626</id>
		<title>ClassAccounts/Manage</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10626"/>
		<updated>2022-08-25T20:27:40Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Class Accounts are provided through our [[Nexus]] infrastructure with a specific partition of nodes.  Each class will need to have the faculty member or TA(s) as the points of contact for the class.  All students within the class will need to raise issues or questions through these points of contact, who can relay or request support through our [[HelpDesk]].&lt;br /&gt;
&lt;br /&gt;
=Request New Class=&lt;br /&gt;
&lt;br /&gt;
All classes will need to be requested by the faculty member teaching them by sending email to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].  Please list or cc your TA(s) and state what course you are teaching.  UMIACS staff will confirm the course through UMD Testudo and create the class within our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application.  We will associate the faculty member and TA(s) so that they can manage the class through this application, as long as they have a UMIACS account.  If they don&#039;t, we can advise the faculty member through this ticket on how to create and sponsor a [[Collaborator]] account in UMIACS.&lt;br /&gt;
&lt;br /&gt;
=Management of Users=&lt;br /&gt;
Once the class is created, users go through the following lifecycle.&lt;br /&gt;
&lt;br /&gt;
* Faculty/TAs will create requests in our [https://intranet.umiacs.umd.edu/requests/accounts/class/ Requests] application via the class dashboard.  This is done by filling out a form with one or more email addresses.  We only support creating accounts for @umd.edu and @terpmail.umd.edu addresses.&lt;br /&gt;
* Once a request for a student has been created, the student will receive an email at that address with a URL containing a token that is only good for 7 days.  Clicking this URL takes them to a page where they redeem their account by entering some information about themselves and selecting a password.&lt;br /&gt;
** If the student does not redeem their account, the Faculty/TAs can go into the class account request in the app and renew the token, which will invalidate the old token, issue a new one, and send a new email for account redemption.&lt;br /&gt;
* Redeemed accounts will show up as pending in the Accounts Installed section of the class dashboard.  UMIACS Technical Staff will install redeemed class accounts once per business day.  Once installed, the account(s) are listed as such and the student will receive an email with information on getting started with their class accounts.&lt;br /&gt;
* Faculty/TAs will need to communicate to their students any special instructions or information on how to use the computational resources.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10625</id>
		<title>ClassAccounts/Manage</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts/Manage&amp;diff=10625"/>
		<updated>2022-08-25T20:26:31Z</updated>

		<summary type="html">&lt;p&gt;Derek: Created page with &amp;quot;Class Accounts are provided through our Nexus infrastructure with a specific partition of nodes.  Each class will need to have the faculty member or TA(s) as the point of...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Class Accounts are provided through our [[Nexus]] infrastructure with a specific partition of nodes.  Each class will need to have the faculty member or TA(s) as the points of contact for the class.  All students within the class will need to raise issues or questions through these points of contact, who can relay or request support through our [[HelpDesk]].&lt;br /&gt;
&lt;br /&gt;
=Request New Class=&lt;br /&gt;
&lt;br /&gt;
All classes will need to be requested by the faculty member teaching them by sending email to [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].  Please list or cc your TA(s) and state what course you are teaching.  UMIACS staff will confirm the course through UMD Testudo and create the class within our [[Requests]] application.  We will associate the faculty member and TA(s) so that they can manage the class through this application, as long as they have a UMIACS account.  If they don&#039;t, we can advise the faculty member through this ticket on how to create and sponsor a [[Collaborator]] account in UMIACS.&lt;br /&gt;
&lt;br /&gt;
=Management of Users=&lt;br /&gt;
Once the class is created, users go through the following lifecycle.&lt;br /&gt;
&lt;br /&gt;
* Faculty/TAs will create requests in our [[Requests]] application via the class dashboard.  This is done by filling out a form with one or more email addresses.  We only support creating accounts for @umd.edu and @terpmail.umd.edu addresses.&lt;br /&gt;
* Once a request for a student has been created, the student will receive an email at that address with a URL containing a token that is only good for 7 days.  Clicking this URL takes them to a page where they redeem their account by entering some information about themselves and selecting a password.&lt;br /&gt;
** If the student does not redeem their account, the Faculty/TAs can go into the class account request in the app and renew the token, which will invalidate the old token, issue a new one, and send a new email for account redemption.&lt;br /&gt;
* Redeemed accounts will show up as pending in the Accounts Installed section of the class dashboard.  UMIACS Technical Staff will install redeemed class accounts once per business day.  Once installed, the account(s) are listed as such and the student will receive an email with information on getting started with their class accounts.&lt;br /&gt;
* Faculty/TAs will need to communicate to their students any special instructions or information on how to use the computational resources.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/Apptainer&amp;diff=10602</id>
		<title>Nexus/Apptainer</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/Apptainer&amp;diff=10602"/>
		<updated>2022-08-24T14:15:05Z</updated>

		<summary type="html">&lt;p&gt;Derek: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Running containers in a multi-tenant environment has a number of security considerations.  While Docker is popular, its most typical setups require a daemon with administrative-level privileges, which makes it untenable in such environments.  There has been a lot of work in this area, but for HPC environments, Singularity (now known as Apptainer) is a solution that enables container workloads in multi-tenant environments.&lt;br /&gt;
&lt;br /&gt;
The one consideration is that creating an image requires administrative rights on the machine.  For this reason, you can&#039;t build Apptainer images directly on our supported systems.  You can, however, download or pull images from other repositories, including Docker registries.&lt;br /&gt;
&lt;br /&gt;
=Bind Mounts=&lt;br /&gt;
Apptainer containers will not automatically mount data from the outside operating system other than your home directory.  Users need to manually bind other file paths into the container.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;--bind /fs/nexus-scratch/derek/project1:/mnt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this scenario we are binding the directory outside the container &amp;lt;code&amp;gt;/fs/nexus-scratch/derek/project1&amp;lt;/code&amp;gt; to exist in the path &amp;lt;code&amp;gt;/mnt&amp;lt;/code&amp;gt; inside the container.&lt;br /&gt;
&lt;br /&gt;
=Shared Containers=&lt;br /&gt;
&lt;br /&gt;
Portable images, called Singularity Image Format (.sif) files, can be copied and shared.  Nexus maintains some shared containers in &amp;lt;code&amp;gt;/fs/nexus-containers&amp;lt;/code&amp;gt;.  These are arranged by the application(s) installed in them.&lt;br /&gt;
&lt;br /&gt;
=GPUs=&lt;br /&gt;
&lt;br /&gt;
NVIDIA requires a very specific driver and set of libraries to run CUDA programs.  To ensure that all appropriate devices are created inside the container and that these libraries are made available in it, users need to use the &amp;lt;code&amp;gt;--nv&amp;lt;/code&amp;gt; flag when instantiating their container(s).&lt;br /&gt;
&lt;br /&gt;
=Example=&lt;br /&gt;
&lt;br /&gt;
If you have the following example file, &amp;lt;code&amp;gt;test.py&amp;lt;/code&amp;gt;, in &amp;lt;code&amp;gt;/fs/nexus-scratch/derek/singularity&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&lt;br /&gt;
import torch&lt;br /&gt;
&lt;br /&gt;
print(f&#039;Torch cuda is available: {torch.cuda.is_available()}&#039;)&lt;br /&gt;
print(f&#039;Torch cuda number of devices: {torch.cuda.device_count()}&#039;)&lt;br /&gt;
for g in range(torch.cuda.device_count()):&lt;br /&gt;
    print(f&#039;Torch cuda device {g}: {torch.cuda.get_device_name(g)}&#039;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ singularity exec --bind /fs/nexus-scratch/derek/singularity:/mnt --nv /fs/nexus-containers/pytorch/pytorch_1.10.2+cu113.sif python3 /mnt/test.py&lt;br /&gt;
Torch cuda is available: True&lt;br /&gt;
Torch cuda number of devices: 1&lt;br /&gt;
Torch cuda device 0: NVIDIA RTX A4000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/Apptainer&amp;diff=10601</id>
		<title>Nexus/Apptainer</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=Nexus/Apptainer&amp;diff=10601"/>
		<updated>2022-08-24T14:09:01Z</updated>

		<summary type="html">&lt;p&gt;Derek: Created page with &amp;quot;Running containers in a multi-tenant environment has a number of security considerations.  While Docker is popular the most typical setups require a daemon that has administra...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Running containers in a multi-tenant environment has a number of security considerations.  While Docker is popular, its most typical setups require a daemon with administrative-level privileges, which makes it untenable in such environments.  There has been a lot of work in this area, but for HPC environments, Singularity (now known as Apptainer) is a solution that enables container workloads in multi-tenant environments.&lt;br /&gt;
&lt;br /&gt;
The one consideration is that creating an image requires administrative rights on the machine.  For this reason, you can&#039;t build Apptainer images directly on our supported systems.  You can, however, download or pull images from other repositories, including Docker registries.&lt;br /&gt;
&lt;br /&gt;
=Bind Mounts=&lt;br /&gt;
Apptainer containers will not automatically mount data from the outside operating system other than your home directory.  Users need to manually bind other file paths into the container.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;--bind /fs/nexus-scratch/derek/project1:/mnt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this scenario we are binding the directory outside the container &amp;lt;code&amp;gt;/fs/nexus-scratch/derek/project1&amp;lt;/code&amp;gt; to exist in the path &amp;lt;code&amp;gt;/mnt&amp;lt;/code&amp;gt; inside the container.&lt;br /&gt;
&lt;br /&gt;
=Shared Containers=&lt;br /&gt;
&lt;br /&gt;
Portable images, called Singularity Image Format (.sif) files, can be copied and shared.  Nexus maintains some shared containers in &amp;lt;code&amp;gt;/fs/nexus-containers&amp;lt;/code&amp;gt;.  These are arranged by the application(s) installed in them.&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10600</id>
		<title>ClassAccounts</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10600"/>
		<updated>2022-08-23T20:13:38Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Overview==&lt;br /&gt;
UMIACS Class Accounts are currently intended to support classes for all of UMIACS/CSD via the [[Nexus]] cluster. All new class accounts will be serviced solely through this cluster.  Faculty may request that a class be supported by contacting [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].&lt;br /&gt;
&lt;br /&gt;
==Getting an account==&lt;br /&gt;
Your TA will request an account for you. Once this is done, you will be notified by email that you have an account to redeem.  If you have not received an email, please contact your TA. &#039;&#039;&#039;You must redeem the account within 7 days or else the redemption token will expire.&#039;&#039;&#039;  If your redemption token does expire, please contact your TA to have it renewed.&lt;br /&gt;
&lt;br /&gt;
Once you do redeem your account, you will need to wait until you get a confirmation email that your account has been installed.  This is typically done once a day on days that the University is open for business.&lt;br /&gt;
&lt;br /&gt;
===Registering for Duo===&lt;br /&gt;
UMIACS requires that all Class accounts be registered for MFA (multi-factor authentication) under our [[Duo]] instance (note that this is different than UMD&#039;s general Duo instance). &#039;&#039;&#039;You will not be able to log onto the class submission host until you register.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In order to register, [https://intranet.umiacs.umd.edu/directory visit our directory app] and log in with your Class username and password. You will then receive a prompt to enroll in Duo. For assistance in enrollment, you can visit our [[Duo | Duo help page]].&lt;br /&gt;
&lt;br /&gt;
Once notified that your account has been installed and you have registered in our Duo instance, you can access the following class submission host(s) using [[SSH]] with your assigned username and your chosen password:&lt;br /&gt;
* &amp;lt;code&amp;gt;nexusclass00.umiacs.umd.edu&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;nexusclass01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Cleaning up your account before the end of the semester==&lt;br /&gt;
Class accounts for a given semester will be archived and deleted after that semester&#039;s completion as early as the following:&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;br /&gt;
&lt;br /&gt;
It is your responsibility to ensure you have backed up anything you want to keep from your class account&#039;s personal or group storage (below sections) prior to the relevant date.&lt;br /&gt;
&lt;br /&gt;
==Personal Storage==&lt;br /&gt;
Your home directory has a quota of 20GB and is located at:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/classhomes/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;semester&amp;gt;&amp;lt;/code&amp;gt; is either &amp;quot;spring&amp;quot;, &amp;quot;summer&amp;quot;, &amp;quot;fall&amp;quot;, or &amp;quot;winter&amp;quot;, &amp;lt;code&amp;gt;&amp;lt;year&amp;gt;&amp;lt;/code&amp;gt; is the current year e.g., &amp;quot;2021&amp;quot;, &amp;lt;code&amp;gt;&amp;lt;coursecode&amp;gt;&amp;lt;/code&amp;gt; is the class&#039; course code as listed in UMD&#039;s [https://app.testudo.umd.edu/soc/ Schedule of Classes] in all lowercase e.g., &amp;quot;cmsc999z&amp;quot;, and &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; is the username mentioned in the email you received to redeem the account e.g., &amp;quot;c999z000&amp;quot;.&lt;br /&gt;
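As a minimal sketch, the path above can be assembled from its four parts; the values below are the examples given in the text, not a real account:

```python
# Illustrative only: values are the examples from the text above.
semester = "spring"        # one of "spring", "summer", "fall", "winter"
year = 2021
coursecode = "cmsc999z"    # lowercase course code from the Schedule of Classes
username = "c999z000"      # username from the redemption email

# Build the home directory path described above.
home = f"/fs/classhomes/{semester}{year}/{coursecode}/{username}"
print(home)  # /fs/classhomes/spring2021/cmsc999z/c999z000
```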
&lt;br /&gt;
You can request up to another 100GB of personal storage if you would like by having your TA [[HelpDesk | contact staff]]. This storage will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Group Storage==&lt;br /&gt;
You can also request group storage if you would like by having your TA [[HelpDesk | contact staff]] to specify the usernames of the accounts that should be in the group. Only other class accounts in the same class can be added to the group. The quota will be 100GB multiplied by the number of accounts in the group and will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;groupname&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt; is composed of:&lt;br /&gt;
* the abbreviated course code as used in the username e.g., &amp;quot;c999z&amp;quot;&lt;br /&gt;
* the character &amp;quot;g&amp;quot;&lt;br /&gt;
* the number of the group (starting at 0 for the first group requested for the class), padded with leading 0s to make the total group name 8 characters long&lt;br /&gt;
&lt;br /&gt;
e.g., &amp;quot;c999zg00&amp;quot;.&lt;br /&gt;
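The naming rule above can be sketched as a short snippet; the course code and group number are the examples from the text:

```python
# Illustrative only: builds the 8-character group name described above.
course = "c999z"   # abbreviated course code, as used in usernames
group_number = 0   # first group requested for the class

# Pad the group number with leading zeros so the whole name is 8 characters.
pad = 8 - len(course) - 1  # characters left for the number after course + "g"
groupname = f"{course}g{group_number:0{pad}d}"
print(groupname)  # c999zg00
```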
&lt;br /&gt;
==Cluster Usage==&lt;br /&gt;
&#039;&#039;&#039;You may not run computational jobs on any submission host.&#039;&#039;&#039;  You must schedule your jobs with the [[SLURM]] workload manager.  You can also find out more with the public documentation for the [https://slurm.schedmd.com/quickstart.html SLURM Workload Manager].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Any questions or issues with the cluster must be first made through your TA.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Class accounts only have access to the following submission parameters in SLURM.  You may be required to explicitly set each of these when submitting jobs.&lt;br /&gt;
&lt;br /&gt;
* Partition - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* Account - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* QoS - &amp;lt;code&amp;gt;default&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;medium&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;high&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
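As a hedged sketch, the three required parameters can be composed into an srun command line (the single-GPU request is just an example, not a requirement):

```python
# Illustrative only: composes an srun command line from the required
# submission parameters listed above.
params = {
    "partition": "class",
    "account": "class",
    "qos": "default",  # could also be "medium" or "high"
}
flags = " ".join(f"--{name}={value}" for name, value in params.items())
command = f"srun --pty {flags} --gres=gpu:1 bash"
print(command)
```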
===Example===&lt;br /&gt;
Here is a basic example that schedules an interactive job running bash with a single GPU in the &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt; partition, with the &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt; account and the &amp;lt;code&amp;gt;default&amp;lt;/code&amp;gt; QoS.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --pty --partition=class --account=class --qos=default --gres=gpu:1 bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bash-4.4$ hostname&lt;br /&gt;
tron14.umiacs.umd.edu&lt;br /&gt;
bash-4.4$ nvidia-smi -L&lt;br /&gt;
GPU 0: NVIDIA RTX A4000 (UUID: GPU-55f2d3b7-9162-8b02-50de-476a012c626c)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Available Nodes===&lt;br /&gt;
You can list the available nodes and their current state with the &amp;lt;code&amp;gt;show_nodes -p class&amp;lt;/code&amp;gt; command.  This list of nodes is not completely static as nodes may be pulled out of service to repair/replace GPUs or other components.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_nodes -p class&lt;br /&gt;
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES                             STATE      PARTITION&lt;br /&gt;
tron06               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron07               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron08               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron09               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron10               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron11               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron12               16         128525     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron13               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron14               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron15               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron16               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron17               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron18               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron19               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron20               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron21               16         128525     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron22               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron23               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron24               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron25               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron26               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron27               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron28               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron29               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron30               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron31               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron32               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron33               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron34               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron35               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron36               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron37               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron38               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron39               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron40               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron41               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron42               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron43               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron44               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron45               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also find more granular information about an individual node with the &amp;lt;code&amp;gt;scontrol show node&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scontrol show node tron27&lt;br /&gt;
NodeName=tron27 Arch=x86_64 CoresPerSocket=16&lt;br /&gt;
   CPUAlloc=0 CPUTot=16 CPULoad=0.00&lt;br /&gt;
   AvailableFeatures=rhel8,AMD,EPYC-7302&lt;br /&gt;
   ActiveFeatures=rhel8,AMD,EPYC-7302&lt;br /&gt;
   Gres=gpu:rtxa4000:4&lt;br /&gt;
   NodeAddr=tron27 NodeHostName=tron27 Version=21.08.8-2&lt;br /&gt;
   OS=Linux 4.18.0-372.19.1.el8_6.x86_64 #1 SMP Mon Jul 18 11:14:02 EDT 2022&lt;br /&gt;
   RealMemory=128521 AllocMem=0 FreeMem=125650 Sockets=1 Boards=1&lt;br /&gt;
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A&lt;br /&gt;
   Partitions=class,scavenger,tron&lt;br /&gt;
   BootTime=2022-08-18T17:34:44 SlurmdStartTime=2022-08-19T13:10:47&lt;br /&gt;
   LastBusyTime=2022-08-22T11:20:18&lt;br /&gt;
   CfgTRES=cpu=16,mem=128521M,billing=173,gres/gpu=4,gres/gpu:rtxa4000=4&lt;br /&gt;
   AllocTRES=&lt;br /&gt;
   CapWatts=n/a&lt;br /&gt;
   CurrentWatts=0 AveWatts=0&lt;br /&gt;
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10599</id>
		<title>ClassAccounts</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10599"/>
		<updated>2022-08-23T20:13:28Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Overview==&lt;br /&gt;
UMIACS Class Accounts are currently intended to support classes for all of UMIACS/CSD via the [[Nexus]] cluster. All new class accounts will be serviced solely through this cluster.  Faculty may request that a class be supported by contacting [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].&lt;br /&gt;
&lt;br /&gt;
==Getting an account==&lt;br /&gt;
Your TA will request an account for you. Once this is done, you will be notified by email that you have an account to redeem.  If you have not received an email, please contact your TA. &#039;&#039;&#039;You must redeem the account within 7 days or else the redemption token will expire.&#039;&#039;&#039;  If your redemption token does expire, please contact your TA to have it renewed.&lt;br /&gt;
&lt;br /&gt;
Once you do redeem your account, you will need to wait until you get a confirmation email that your account has been installed.  This is typically done once a day on days that the University is open for business.&lt;br /&gt;
&lt;br /&gt;
===Registering for Duo===&lt;br /&gt;
UMIACS requires that all Class accounts be registered for MFA (multi-factor authentication) under our [[Duo]] instance (note that this is different from UMD&#039;s general Duo instance). &#039;&#039;&#039;You will not be able to log onto the class submission host until you register.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In order to register, [https://intranet.umiacs.umd.edu/directory visit our directory app] and log in with your Class username and password. You will then receive a prompt to enroll in Duo. For assistance in enrollment, you can visit our [[Duo | Duo help page]].&lt;br /&gt;
&lt;br /&gt;
Once notified that your account has been installed and you have registered in our Duo instance, you can access the following class submission host(s) using [[SSH]] with your assigned username and your chosen password:&lt;br /&gt;
* &amp;lt;code&amp;gt;nexusclass00.umiacs.umd.edu&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;nexusclass01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Cleaning up your account before the end of the semester==&lt;br /&gt;
Class accounts for a given semester will be archived and deleted after that semester&#039;s completion as early as the following:&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;br /&gt;
&lt;br /&gt;
It is your responsibility to ensure you have backed up anything you want to keep from your class account&#039;s personal or group storage (below sections) prior to the relevant date.&lt;br /&gt;
&lt;br /&gt;
==Personal Storage==&lt;br /&gt;
Your home directory has a quota of 20GB and is located at:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/classhomes/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;semester&amp;gt;&amp;lt;/code&amp;gt; is one of &amp;quot;spring&amp;quot;, &amp;quot;summer&amp;quot;, &amp;quot;fall&amp;quot;, or &amp;quot;winter&amp;quot;; &amp;lt;code&amp;gt;&amp;lt;year&amp;gt;&amp;lt;/code&amp;gt; is the current year, e.g., &amp;quot;2021&amp;quot;; &amp;lt;code&amp;gt;&amp;lt;coursecode&amp;gt;&amp;lt;/code&amp;gt; is the class&#039;s course code as listed in UMD&#039;s [https://app.testudo.umd.edu/soc/ Schedule of Classes] in all lowercase, e.g., &amp;quot;cmsc999z&amp;quot;; and &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; is the username mentioned in the email you received to redeem the account, e.g., &amp;quot;c999z000&amp;quot;.&lt;br /&gt;
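As an illustration, the example values above expand to the following path. This is a sketch using this page&#039;s example values (spring 2021 is an assumed semester/year combination), not output from an actual account:&lt;br /&gt;

```shell
# Illustrative expansion of the home-directory path template using the
# example values from this page. The spring/2021 combination is assumed.
semester=spring
year=2021
coursecode=cmsc999z
username=c999z000
echo "/fs/classhomes/${semester}${year}/${coursecode}/${username}"
# -> /fs/classhomes/spring2021/cmsc999z/c999z000
```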
&lt;br /&gt;
You can request up to another 100GB of personal storage if you would like by having your TA [[HelpDesk | contact staff]]. This storage will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Group Storage==&lt;br /&gt;
You can also request group storage if you would like by having your TA [[HelpDesk | contact staff]] to specify the usernames of the accounts that should be in the group. Only other class accounts in the same class can be added to the group. The quota will be 100GB multiplied by the number of accounts in the group and will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;groupname&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt; is composed of:&lt;br /&gt;
* the abbreviated course code as used in the username e.g., &amp;quot;c999z&amp;quot;&lt;br /&gt;
* the character &amp;quot;g&amp;quot;&lt;br /&gt;
* the number of the group (starting at 0 for the first group requested for the class) padded with leading 0s so that the total group name is 8 characters long&lt;br /&gt;
&lt;br /&gt;
e.g., &amp;quot;c999zg00&amp;quot;.&lt;br /&gt;
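The naming rule above can be sketched in shell; the values are the illustrative examples from this page, not real accounts:&lt;br /&gt;

```shell
# Sketch of the group-name rule: abbreviated course code, then "g", then the
# group number zero-padded so the whole name is 8 characters long.
course=c999z   # abbreviated course code from the username, per the example
group=0        # first group requested for the class
digits=$(( 8 - ${#course} - 1 ))   # characters left for the number after "g"
printf '%sg%0*d\n' "$course" "$digits" "$group"
# -> c999zg00
```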
&lt;br /&gt;
==Cluster Usage==&lt;br /&gt;
&#039;&#039;&#039;You may not run computational jobs on any submission host.&#039;&#039;&#039;  You must schedule your jobs with the [[SLURM]] workload manager.  You can also find out more with the public documentation for the [https://slurm.schedmd.com/quickstart.html SLURM Workload Manager].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Any questions or issues with the cluster must first be raised through your TA.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Class accounts only have access to the following submission parameters in SLURM.  You may need to set each of these explicitly when submitting jobs.&lt;br /&gt;
&lt;br /&gt;
* Partition - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* Account - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* QoS - &amp;lt;code&amp;gt;default&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;medium&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;high&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Here is a basic example that schedules an interactive job running bash with a single GPU in the &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt; partition, under the &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt; account, with the &amp;lt;code&amp;gt;default&amp;lt;/code&amp;gt; QoS.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --pty --partition=class --account=class --qos=default --gres=gpu:1 bash&lt;br /&gt;
bash-4.4$ hostname&lt;br /&gt;
tron14.umiacs.umd.edu&lt;br /&gt;
bash-4.4$ nvidia-smi -L&lt;br /&gt;
GPU 0: NVIDIA RTX A4000 (UUID: GPU-55f2d3b7-9162-8b02-50de-476a012c626c)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
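For non-interactive work, the same parameters can also be set as #SBATCH directives in a batch script submitted with sbatch. This is a minimal sketch; the script name and the commands the job runs are illustrative assumptions, not taken from this page:&lt;br /&gt;

```shell
# Minimal batch-script sketch using the class submission parameters from
# this page. The file name (class-job.sh) and the job's commands are
# illustrative assumptions.
cat > class-job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=class
#SBATCH --account=class
#SBATCH --qos=default
#SBATCH --gres=gpu:1
hostname
nvidia-smi -L
EOF
# Submit with: sbatch class-job.sh
```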
===Available Nodes===&lt;br /&gt;
You can list the available nodes and their current state with the &amp;lt;code&amp;gt;show_nodes -p class&amp;lt;/code&amp;gt; command.  This list of nodes is not completely static as nodes may be pulled out of service to repair/replace GPUs or other components.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_nodes -p class&lt;br /&gt;
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES                             STATE      PARTITION&lt;br /&gt;
tron06               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron07               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron08               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron09               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron10               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron11               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron12               16         128525     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron13               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron14               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron15               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron16               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron17               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron18               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron19               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron20               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron21               16         128525     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron22               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron23               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron24               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron25               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron26               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron27               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron28               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron29               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron30               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron31               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron32               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron33               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron34               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron35               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron36               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron37               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron38               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron39               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron40               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron41               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron42               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron43               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron44               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron45               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also find more granular information about an individual node with the &amp;lt;code&amp;gt;scontrol show node&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scontrol show node tron27&lt;br /&gt;
NodeName=tron27 Arch=x86_64 CoresPerSocket=16&lt;br /&gt;
   CPUAlloc=0 CPUTot=16 CPULoad=0.00&lt;br /&gt;
   AvailableFeatures=rhel8,AMD,EPYC-7302&lt;br /&gt;
   ActiveFeatures=rhel8,AMD,EPYC-7302&lt;br /&gt;
   Gres=gpu:rtxa4000:4&lt;br /&gt;
   NodeAddr=tron27 NodeHostName=tron27 Version=21.08.8-2&lt;br /&gt;
   OS=Linux 4.18.0-372.19.1.el8_6.x86_64 #1 SMP Mon Jul 18 11:14:02 EDT 2022&lt;br /&gt;
   RealMemory=128521 AllocMem=0 FreeMem=125650 Sockets=1 Boards=1&lt;br /&gt;
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A&lt;br /&gt;
   Partitions=class,scavenger,tron&lt;br /&gt;
   BootTime=2022-08-18T17:34:44 SlurmdStartTime=2022-08-19T13:10:47&lt;br /&gt;
   LastBusyTime=2022-08-22T11:20:18&lt;br /&gt;
   CfgTRES=cpu=16,mem=128521M,billing=173,gres/gpu=4,gres/gpu:rtxa4000=4&lt;br /&gt;
   AllocTRES=&lt;br /&gt;
   CapWatts=n/a&lt;br /&gt;
   CurrentWatts=0 AveWatts=0&lt;br /&gt;
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10598</id>
		<title>ClassAccounts</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10598"/>
		<updated>2022-08-23T15:32:13Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Cluster Usage */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Overview==&lt;br /&gt;
UMIACS Class Accounts are currently intended to support classes for all of UMIACS/CSD via the [[Nexus]] cluster. All new class accounts will be serviced solely through this cluster.  Faculty may request that a class be supported by contacting [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].&lt;br /&gt;
&lt;br /&gt;
==Getting an account==&lt;br /&gt;
Your TA will request an account for you. Once this is done, you will be notified by email that you have an account to redeem.  If you have not received an email, please contact your TA. &#039;&#039;&#039;You must redeem the account within 7 days or else the redemption token will expire.&#039;&#039;&#039;  If your redemption token does expire, please contact your TA to have it renewed.&lt;br /&gt;
&lt;br /&gt;
Once you do redeem your account, you will need to wait until you get a confirmation email that your account has been installed.  This is typically done once a day on days that the University is open for business.&lt;br /&gt;
&lt;br /&gt;
===Registering for Duo===&lt;br /&gt;
UMIACS requires that all Class accounts be registered for MFA (multi-factor authentication) under our [[Duo]] instance (note that this is different from UMD&#039;s general Duo instance). &#039;&#039;&#039;You will not be able to log onto the class submission host until you register.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In order to register, [https://intranet.umiacs.umd.edu/directory visit our directory app] and log in with your Class username and password. You will then receive a prompt to enroll in Duo. For assistance in enrollment, you can visit our [[Duo | Duo help page]].&lt;br /&gt;
&lt;br /&gt;
Once notified that your account has been installed and you have registered in our Duo instance, you can access the following class submission host(s) using [[SSH]] with your assigned username and your chosen password:&lt;br /&gt;
* &amp;lt;code&amp;gt;nexusclass00.umiacs.umd.edu&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;nexusclass01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Cleaning up your account before the end of the semester==&lt;br /&gt;
Class accounts for a given semester will be archived and deleted after that semester&#039;s completion as early as the following:&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;br /&gt;
&lt;br /&gt;
It is your responsibility to ensure you have backed up anything you want to keep from your class account&#039;s personal or group storage (below sections) prior to the relevant date.&lt;br /&gt;
&lt;br /&gt;
==Personal Storage==&lt;br /&gt;
Your home directory has a quota of 20GB and is located at:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/classhomes/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;semester&amp;gt;&amp;lt;/code&amp;gt; is one of &amp;quot;spring&amp;quot;, &amp;quot;summer&amp;quot;, &amp;quot;fall&amp;quot;, or &amp;quot;winter&amp;quot;; &amp;lt;code&amp;gt;&amp;lt;year&amp;gt;&amp;lt;/code&amp;gt; is the current year, e.g., &amp;quot;2021&amp;quot;; &amp;lt;code&amp;gt;&amp;lt;coursecode&amp;gt;&amp;lt;/code&amp;gt; is the class&#039;s course code as listed in UMD&#039;s [https://app.testudo.umd.edu/soc/ Schedule of Classes] in all lowercase, e.g., &amp;quot;cmsc999z&amp;quot;; and &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; is the username mentioned in the email you received to redeem the account, e.g., &amp;quot;c999z000&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
You can request up to another 100GB of personal storage if you would like by having your TA [[HelpDesk | contact staff]]. This storage will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Group Storage==&lt;br /&gt;
You can also request group storage if you would like by having your TA [[HelpDesk | contact staff]] to specify the usernames of the accounts that should be in the group. Only other class accounts in the same class can be added to the group. The quota will be 100GB multiplied by the number of accounts in the group and will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;groupname&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt; is composed of:&lt;br /&gt;
* the abbreviated course code as used in the username e.g., &amp;quot;c999z&amp;quot;&lt;br /&gt;
* the character &amp;quot;g&amp;quot;&lt;br /&gt;
* the number of the group (starting at 0 for the first group requested for the class) padded with leading 0s so that the total group name is 8 characters long&lt;br /&gt;
&lt;br /&gt;
e.g., &amp;quot;c999zg00&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
==Cluster Usage==&lt;br /&gt;
&#039;&#039;&#039;You may not run computational jobs on any submission host.&#039;&#039;&#039;  You must schedule your jobs with the [[SLURM]] workload manager.  You can also find out more with the public documentation for the [https://slurm.schedmd.com/quickstart.html SLURM Workload Manager].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Any questions or issues with the cluster must first be raised through your TA.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Class accounts only have access to the following submission parameters in SLURM.  You may need to set each of these explicitly when submitting jobs.&lt;br /&gt;
&lt;br /&gt;
* Partition - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* Account - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* QoS - &amp;lt;code&amp;gt;default&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;medium&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;high&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Here is a basic example that schedules an interactive job running bash with a single GPU in the &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt; partition, under the &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt; account, with the &amp;lt;code&amp;gt;default&amp;lt;/code&amp;gt; QoS.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ srun --pty --partition=class --account=class --qos=default --gres=gpu:1 bash&lt;br /&gt;
bash-4.4$ hostname&lt;br /&gt;
tron14.umiacs.umd.edu&lt;br /&gt;
bash-4.4$ nvidia-smi -L&lt;br /&gt;
GPU 0: NVIDIA RTX A4000 (UUID: GPU-55f2d3b7-9162-8b02-50de-476a012c626c)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Available Nodes===&lt;br /&gt;
You can list the available nodes and their current state with the &amp;lt;code&amp;gt;show_nodes -p class&amp;lt;/code&amp;gt; command.  This list of nodes is not completely static as nodes may be pulled out of service to repair/replace GPUs or other components.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_nodes -p class&lt;br /&gt;
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES                             STATE      PARTITION&lt;br /&gt;
tron06               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron07               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron08               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron09               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron10               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron11               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron12               16         128525     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron13               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron14               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron15               16         128520     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron16               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron17               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron18               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron19               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron20               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron21               16         128525     rhel8,AMD,EPYC-7302P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron22               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron23               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron24               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron25               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron26               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron27               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron28               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron29               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron30               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron31               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron32               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron33               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron34               16         128524     rhel8,Zen,EPYC-7313P      gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron35               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron36               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron37               16         128521     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron38               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron39               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron40               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron41               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron42               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron43               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron44               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
tron45               16         128525     rhel8,AMD,EPYC-7302       gpu:rtxa4000:4                   idle       class&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also find more granular information about an individual node with the &amp;lt;code&amp;gt;scontrol show node&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scontrol show node tron27&lt;br /&gt;
NodeName=tron27 Arch=x86_64 CoresPerSocket=16&lt;br /&gt;
   CPUAlloc=0 CPUTot=16 CPULoad=0.00&lt;br /&gt;
   AvailableFeatures=rhel8,AMD,EPYC-7302&lt;br /&gt;
   ActiveFeatures=rhel8,AMD,EPYC-7302&lt;br /&gt;
   Gres=gpu:rtxa4000:4&lt;br /&gt;
   NodeAddr=tron27 NodeHostName=tron27 Version=21.08.8-2&lt;br /&gt;
   OS=Linux 4.18.0-372.19.1.el8_6.x86_64 #1 SMP Mon Jul 18 11:14:02 EDT 2022&lt;br /&gt;
   RealMemory=128521 AllocMem=0 FreeMem=125650 Sockets=1 Boards=1&lt;br /&gt;
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A&lt;br /&gt;
   Partitions=class,scavenger,tron&lt;br /&gt;
   BootTime=2022-08-18T17:34:44 SlurmdStartTime=2022-08-19T13:10:47&lt;br /&gt;
   LastBusyTime=2022-08-22T11:20:18&lt;br /&gt;
   CfgTRES=cpu=16,mem=128521M,billing=173,gres/gpu=4,gres/gpu:rtxa4000=4&lt;br /&gt;
   AllocTRES=&lt;br /&gt;
   CapWatts=n/a&lt;br /&gt;
   CurrentWatts=0 AveWatts=0&lt;br /&gt;
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10592</id>
		<title>ClassAccounts</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10592"/>
		<updated>2022-08-22T14:32:20Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Getting an account */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Overview==&lt;br /&gt;
UMIACS Class Accounts are currently intended to support classes for all of UMIACS/CSD via the [[Nexus]] cluster. All new class accounts will be serviced solely through this cluster.  Faculty may request that a class be supported by contacting [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].&lt;br /&gt;
&lt;br /&gt;
==Getting an account==&lt;br /&gt;
Your TA will request an account for you. Once this is done, you will be notified by email that you have an account to redeem.  If you have not received an email, please contact your TA. &#039;&#039;&#039;You must redeem the account within 7 days or else the redemption token will expire.&#039;&#039;&#039;  If your redemption token does expire, you can contact your TA to have it renewed.&lt;br /&gt;
&lt;br /&gt;
Once you do redeem your account, you will need to wait until you get a confirmation email that your account has been installed.  This is typically done once a day on days that the University is open for business.&lt;br /&gt;
&lt;br /&gt;
===Registering for Duo===&lt;br /&gt;
UMIACS requires that all Class accounts be registered for MFA (multi-factor authentication) under our [[Duo]] instance (note that this is different from UMD&#039;s general Duo instance). &#039;&#039;&#039;You will not be able to log onto the class submission host until you register.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In order to register, [https://intranet.umiacs.umd.edu/directory visit our directory app] and log in with your Class username and password. You will then receive a prompt to enroll in Duo. For assistance in enrollment, you can visit our [[Duo | Duo help page]].&lt;br /&gt;
&lt;br /&gt;
Once notified that your account has been installed and you have registered in our Duo instance, you can access the class submission host(s) for the unit sponsoring the class using [[SSH]] with your assigned username and the password you provided:&lt;br /&gt;
* &amp;lt;code&amp;gt;nexusclass00.umiacs.umd.edu&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;nexusclass01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
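For example, a connection from a terminal looks like the following (the username &amp;quot;c999z000&amp;quot; is the hypothetical example used later on this page; substitute the one from your redemption email):&lt;br /&gt;

```shell
# Connect to a class submission host over SSH.
# "c999z000" is a hypothetical class username; use your assigned one.
# You will be prompted for your password and then for Duo MFA.
ssh c999z000@nexusclass00.umiacs.umd.edu
```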
==Cleaning up your account before the end of the semester==&lt;br /&gt;
Class accounts for a given semester will be archived and deleted after that semester&#039;s completion as early as the following:&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;br /&gt;
&lt;br /&gt;
It is your responsibility to ensure you have backed up anything you want to keep from your class account&#039;s personal or group storage (below sections) prior to the relevant date.&lt;br /&gt;
&lt;br /&gt;
==Personal Storage==&lt;br /&gt;
Your home directory has a quota of 20GB and is located at:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/classhomes/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;semester&amp;gt;&amp;lt;/code&amp;gt; is either &amp;quot;spring&amp;quot;, &amp;quot;summer&amp;quot;, &amp;quot;fall&amp;quot;, or &amp;quot;winter&amp;quot;, &amp;lt;code&amp;gt;&amp;lt;year&amp;gt;&amp;lt;/code&amp;gt; is the current year, e.g., &amp;quot;2021&amp;quot;, &amp;lt;code&amp;gt;&amp;lt;coursecode&amp;gt;&amp;lt;/code&amp;gt; is the class&#039;s course code as listed in UMD&#039;s [https://app.testudo.umd.edu/soc/ Schedule of Classes] in all lowercase, e.g., &amp;quot;cmsc999z&amp;quot;, and &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; is the username mentioned in the email you received to redeem the account, e.g., &amp;quot;c999z000&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
You can request up to an additional 100GB of personal storage by having your TA [[HelpDesk | contact staff]]. This storage will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Group Storage==&lt;br /&gt;
You can also request group storage by having your TA [[HelpDesk | contact staff]] with the usernames of the accounts that should be in the group. Only other class accounts in the same class can be added to the group. The quota will be 100GB multiplied by the number of accounts in the group and will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;groupname&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt; is composed of:&lt;br /&gt;
* the abbreviated course code as used in the username e.g., &amp;quot;c999z&amp;quot;&lt;br /&gt;
* the character &amp;quot;g&amp;quot;&lt;br /&gt;
* the number of the group (starting at 0 for the first group requested for the class), zero-padded so that the total group name is 8 characters long&lt;br /&gt;
&lt;br /&gt;
e.g., &amp;quot;c999zg00&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
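As a sketch of the naming rule above (using the hypothetical &amp;quot;c999z&amp;quot; course code), the zero-padding can be computed as:&lt;br /&gt;

```shell
# Build a class group name: abbreviated course code + "g" + group number,
# zero-padded so the full name is exactly 8 characters.
course="c999z"   # hypothetical abbreviated course code
group=0          # first group requested for the class
width=$((8 - ${#course} - 1))    # digits available after the code and "g"
printf '%sg%0*d\n' "$course" "$width" "$group"
```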
==Cluster Usage==&lt;br /&gt;
&#039;&#039;&#039;You may not run computational jobs on any submission host.&#039;&#039;&#039;  You must schedule your jobs with the [[SLURM]] workload manager.  You can find out more in the public documentation for the [https://slurm.schedmd.com/quickstart.html SLURM Workload Manager].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Any questions or issues with the cluster must first be raised through your TA.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
===CML===&lt;br /&gt;
Class accounts only have access to the following submission parameters in SLURM.  You may need to set each of these explicitly when submitting jobs.&lt;br /&gt;
&lt;br /&gt;
* Partition - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* Account - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* QoS - &amp;lt;code&amp;gt;default&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
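Putting these parameters together, a minimal batch script might look like the following (the job name, GPU request, and &amp;quot;hostname&amp;quot; workload are placeholders for your actual job):&lt;br /&gt;

```shell
# Minimal batch script sketch for a class account; the job name, GPU
# request, and workload ("hostname") are placeholders for your own job.
cat > class_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=class-example
#SBATCH --partition=class
#SBATCH --account=class
#SBATCH --qos=default
#SBATCH --gres=gpu:rtxa4000:1
hostname
EOF
# Submit from a submission host with: sbatch class_job.sh
```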
====Available Nodes====&lt;br /&gt;
You can list the available nodes and their current state with the &amp;lt;code&amp;gt;show_nodes -p class&amp;lt;/code&amp;gt; command.  This list of nodes is not completely static as nodes may be pulled out of service to repair/replace GPUs or other components.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_nodes -p class&lt;br /&gt;
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES                      STATE      PARTITION&lt;br /&gt;
cmlgrad00            32         385421     Xeon,4216                 gpu:rtx2080ti:8           mix        class&lt;br /&gt;
cmlgrad01            32         385421     Xeon,4216                 gpu:rtx2080ti:8           alloc      class&lt;br /&gt;
cmlgrad02            32         385421     Xeon,4216                 gpu:rtx2080ti:7,gpu:rtx30 idle       class&lt;br /&gt;
cmlgrad03            32         385421     Xeon,4216                 gpu:rtx2080ti:8           mix        class&lt;br /&gt;
cmlgrad04            32         385421     Xeon,4216                 gpu:rtx2080ti:8           alloc      class&lt;br /&gt;
cmlgrad05            32         385421     Xeon,4216                 gpu:rtx3070:1,gpu:rtx2080 idle       class&lt;br /&gt;
cmlgrad06            32         385422     Xeon,4216                 gpu:rtx2080ti:8           alloc      class&lt;br /&gt;
cmlgrad07            32         385421     Xeon,4216                 gpu:rtx2080ti:8           mix        class&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also find more granular information about an individual node with the &amp;lt;code&amp;gt;scontrol show node&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scontrol show node cmlgrad02&lt;br /&gt;
NodeName=cmlgrad02 Arch=x86_64 CoresPerSocket=16&lt;br /&gt;
   CPUAlloc=0 CPUTot=32 CPULoad=0.07&lt;br /&gt;
   AvailableFeatures=Xeon,4216&lt;br /&gt;
   ActiveFeatures=Xeon,4216&lt;br /&gt;
   Gres=gpu:rtx2080ti:7,gpu:rtx3070:1&lt;br /&gt;
   NodeAddr=cmlgrad02 NodeHostName=cmlgrad02 Version=20.11.8&lt;br /&gt;
   OS=Linux 3.10.0-1160.45.1.el7.x86_64 #1 SMP Fri Sep 24 10:17:16 UTC 2021&lt;br /&gt;
   RealMemory=385421 AllocMem=0 FreeMem=376637 Sockets=2 Boards=1&lt;br /&gt;
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A&lt;br /&gt;
   Partitions=class,scavenger&lt;br /&gt;
   BootTime=2021-11-18T17:39:23 SlurmdStartTime=2021-11-29T12:42:36&lt;br /&gt;
   CfgTRES=cpu=32,mem=385421M,billing=487,gres/gpu=8&lt;br /&gt;
   AllocTRES=&lt;br /&gt;
   CapWatts=n/a&lt;br /&gt;
   CurrentWatts=0 AveWatts=0&lt;br /&gt;
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s&lt;br /&gt;
   Comment=(null)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Nexus===&lt;br /&gt;
TBD&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10591</id>
		<title>ClassAccounts</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10591"/>
		<updated>2022-08-22T14:26:30Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Overview */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Overview==&lt;br /&gt;
UMIACS Class Accounts are currently intended to support classes for all of UMIACS/CSD via the [[Nexus]] cluster. All new class accounts will be serviced solely through this cluster.  Faculty may request that a class be supported by contacting [mailto:staff@umiacs.umd.edu staff@umiacs.umd.edu].&lt;br /&gt;
&lt;br /&gt;
==Getting an account==&lt;br /&gt;
Your TA will request an account for you. Once this is done, you will be notified by email that you have an account to redeem.  If you have not received an email, please contact your TA. &#039;&#039;&#039;You must redeem the account within 7 days or else the redemption token will expire.&#039;&#039;&#039;  If your redemption token does expire, please have your TA [[HelpDesk | contact staff]] to have it renewed.  Staff will not renew any redemption tokens without TA approval.&lt;br /&gt;
&lt;br /&gt;
Once you do redeem your account, you will need to wait until you get a confirmation email that your account has been installed.  This is typically done once a day on days that the University is open for business.&lt;br /&gt;
&lt;br /&gt;
===Registering for Duo===&lt;br /&gt;
UMIACS requires that all Class accounts be registered for MFA (multi-factor authentication) under our [[Duo]] instance (note that this is different from UMD&#039;s general Duo instance). &#039;&#039;&#039;You will not be able to log onto the class submission host until you register.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In order to register, [https://intranet.umiacs.umd.edu/directory visit our directory app] and log in with your Class username and password. You will then receive a prompt to enroll in Duo. For assistance in enrollment, you can visit our [[Duo | Duo help page]].&lt;br /&gt;
&lt;br /&gt;
Once notified that your account has been installed and you have registered in our Duo instance, you can access the class submission host(s) for the unit sponsoring the class using [[SSH]] with your assigned username and the password you provided:&lt;br /&gt;
* &amp;lt;code&amp;gt;nexusclass00.umiacs.umd.edu&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;nexusclass01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Cleaning up your account before the end of the semester==&lt;br /&gt;
Class accounts for a given semester will be archived and deleted after that semester&#039;s completion as early as the following:&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;br /&gt;
&lt;br /&gt;
It is your responsibility to ensure you have backed up anything you want to keep from your class account&#039;s personal or group storage (below sections) prior to the relevant date.&lt;br /&gt;
&lt;br /&gt;
==Personal Storage==&lt;br /&gt;
Your home directory has a quota of 20GB and is located at:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/classhomes/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;semester&amp;gt;&amp;lt;/code&amp;gt; is either &amp;quot;spring&amp;quot;, &amp;quot;summer&amp;quot;, &amp;quot;fall&amp;quot;, or &amp;quot;winter&amp;quot;, &amp;lt;code&amp;gt;&amp;lt;year&amp;gt;&amp;lt;/code&amp;gt; is the current year, e.g., &amp;quot;2021&amp;quot;, &amp;lt;code&amp;gt;&amp;lt;coursecode&amp;gt;&amp;lt;/code&amp;gt; is the class&#039;s course code as listed in UMD&#039;s [https://app.testudo.umd.edu/soc/ Schedule of Classes] in all lowercase, e.g., &amp;quot;cmsc999z&amp;quot;, and &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; is the username mentioned in the email you received to redeem the account, e.g., &amp;quot;c999z000&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
You can request up to an additional 100GB of personal storage by having your TA [[HelpDesk | contact staff]]. This storage will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Group Storage==&lt;br /&gt;
You can also request group storage by having your TA [[HelpDesk | contact staff]] with the usernames of the accounts that should be in the group. Only other class accounts in the same class can be added to the group. The quota will be 100GB multiplied by the number of accounts in the group and will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;groupname&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt; is composed of:&lt;br /&gt;
* the abbreviated course code as used in the username e.g., &amp;quot;c999z&amp;quot;&lt;br /&gt;
* the character &amp;quot;g&amp;quot;&lt;br /&gt;
* the number of the group (starting at 0 for the first group requested for the class), zero-padded so that the total group name is 8 characters long&lt;br /&gt;
&lt;br /&gt;
e.g., &amp;quot;c999zg00&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
==Cluster Usage==&lt;br /&gt;
&#039;&#039;&#039;You may not run computational jobs on any submission host.&#039;&#039;&#039;  You must schedule your jobs with the [[SLURM]] workload manager.  You can find out more in the public documentation for the [https://slurm.schedmd.com/quickstart.html SLURM Workload Manager].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Any questions or issues with the cluster must first be raised through your TA.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
===CML===&lt;br /&gt;
Class accounts only have access to the following submission parameters in SLURM.  You may need to set each of these explicitly when submitting jobs.&lt;br /&gt;
&lt;br /&gt;
* Partition - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* Account - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* QoS - &amp;lt;code&amp;gt;default&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Available Nodes====&lt;br /&gt;
You can list the available nodes and their current state with the &amp;lt;code&amp;gt;show_nodes -p class&amp;lt;/code&amp;gt; command.  This list of nodes is not completely static as nodes may be pulled out of service to repair/replace GPUs or other components.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_nodes -p class&lt;br /&gt;
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES                      STATE      PARTITION&lt;br /&gt;
cmlgrad00            32         385421     Xeon,4216                 gpu:rtx2080ti:8           mix        class&lt;br /&gt;
cmlgrad01            32         385421     Xeon,4216                 gpu:rtx2080ti:8           alloc      class&lt;br /&gt;
cmlgrad02            32         385421     Xeon,4216                 gpu:rtx2080ti:7,gpu:rtx30 idle       class&lt;br /&gt;
cmlgrad03            32         385421     Xeon,4216                 gpu:rtx2080ti:8           mix        class&lt;br /&gt;
cmlgrad04            32         385421     Xeon,4216                 gpu:rtx2080ti:8           alloc      class&lt;br /&gt;
cmlgrad05            32         385421     Xeon,4216                 gpu:rtx3070:1,gpu:rtx2080 idle       class&lt;br /&gt;
cmlgrad06            32         385422     Xeon,4216                 gpu:rtx2080ti:8           alloc      class&lt;br /&gt;
cmlgrad07            32         385421     Xeon,4216                 gpu:rtx2080ti:8           mix        class&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also find more granular information about an individual node with the &amp;lt;code&amp;gt;scontrol show node&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scontrol show node cmlgrad02&lt;br /&gt;
NodeName=cmlgrad02 Arch=x86_64 CoresPerSocket=16&lt;br /&gt;
   CPUAlloc=0 CPUTot=32 CPULoad=0.07&lt;br /&gt;
   AvailableFeatures=Xeon,4216&lt;br /&gt;
   ActiveFeatures=Xeon,4216&lt;br /&gt;
   Gres=gpu:rtx2080ti:7,gpu:rtx3070:1&lt;br /&gt;
   NodeAddr=cmlgrad02 NodeHostName=cmlgrad02 Version=20.11.8&lt;br /&gt;
   OS=Linux 3.10.0-1160.45.1.el7.x86_64 #1 SMP Fri Sep 24 10:17:16 UTC 2021&lt;br /&gt;
   RealMemory=385421 AllocMem=0 FreeMem=376637 Sockets=2 Boards=1&lt;br /&gt;
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A&lt;br /&gt;
   Partitions=class,scavenger&lt;br /&gt;
   BootTime=2021-11-18T17:39:23 SlurmdStartTime=2021-11-29T12:42:36&lt;br /&gt;
   CfgTRES=cpu=32,mem=385421M,billing=487,gres/gpu=8&lt;br /&gt;
   AllocTRES=&lt;br /&gt;
   CapWatts=n/a&lt;br /&gt;
   CurrentWatts=0 AveWatts=0&lt;br /&gt;
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s&lt;br /&gt;
   Comment=(null)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Nexus===&lt;br /&gt;
TBD&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10590</id>
		<title>ClassAccounts</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10590"/>
		<updated>2022-08-22T14:25:40Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Overview */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Overview==&lt;br /&gt;
UMIACS Class Accounts are currently intended to support classes for all of UMIACS/CSD via the [[Nexus]] cluster. All new class accounts will be serviced solely through this cluster.  Faculty may request that a class be supported by contacting [mailto: staff@umiacs.umd.edu].&lt;br /&gt;
&lt;br /&gt;
==Getting an account==&lt;br /&gt;
Your TA will request an account for you. Once this is done, you will be notified by email that you have an account to redeem.  If you have not received an email, please contact your TA. &#039;&#039;&#039;You must redeem the account within 7 days or else the redemption token will expire.&#039;&#039;&#039;  If your redemption token does expire, please have your TA [[HelpDesk | contact staff]] to have it renewed.  Staff will not renew any redemption tokens without TA approval.&lt;br /&gt;
&lt;br /&gt;
Once you do redeem your account, you will need to wait until you get a confirmation email that your account has been installed.  This is typically done once a day on days that the University is open for business.&lt;br /&gt;
&lt;br /&gt;
===Registering for Duo===&lt;br /&gt;
UMIACS requires that all Class accounts be registered for MFA (multi-factor authentication) under our [[Duo]] instance (note that this is different from UMD&#039;s general Duo instance). &#039;&#039;&#039;You will not be able to log onto the class submission host until you register.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In order to register, [https://intranet.umiacs.umd.edu/directory visit our directory app] and log in with your Class username and password. You will then receive a prompt to enroll in Duo. For assistance in enrollment, you can visit our [[Duo | Duo help page]].&lt;br /&gt;
&lt;br /&gt;
Once notified that your account has been installed and you have registered in our Duo instance, you can access the class submission host(s) for the unit sponsoring the class using [[SSH]] with your assigned username and the password you provided:&lt;br /&gt;
* &amp;lt;code&amp;gt;nexusclass00.umiacs.umd.edu&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;nexusclass01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Cleaning up your account before the end of the semester==&lt;br /&gt;
Class accounts for a given semester will be archived and deleted after that semester&#039;s completion as early as the following:&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;br /&gt;
&lt;br /&gt;
It is your responsibility to ensure you have backed up anything you want to keep from your class account&#039;s personal or group storage (below sections) prior to the relevant date.&lt;br /&gt;
&lt;br /&gt;
==Personal Storage==&lt;br /&gt;
Your home directory has a quota of 20GB and is located at:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/classhomes/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;semester&amp;gt;&amp;lt;/code&amp;gt; is either &amp;quot;spring&amp;quot;, &amp;quot;summer&amp;quot;, &amp;quot;fall&amp;quot;, or &amp;quot;winter&amp;quot;, &amp;lt;code&amp;gt;&amp;lt;year&amp;gt;&amp;lt;/code&amp;gt; is the current year, e.g., &amp;quot;2021&amp;quot;, &amp;lt;code&amp;gt;&amp;lt;coursecode&amp;gt;&amp;lt;/code&amp;gt; is the class&#039;s course code as listed in UMD&#039;s [https://app.testudo.umd.edu/soc/ Schedule of Classes] in all lowercase, e.g., &amp;quot;cmsc999z&amp;quot;, and &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; is the username mentioned in the email you received to redeem the account, e.g., &amp;quot;c999z000&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
You can request up to an additional 100GB of personal storage by having your TA [[HelpDesk | contact staff]]. This storage will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Group Storage==&lt;br /&gt;
You can also request group storage by having your TA [[HelpDesk | contact staff]] with the usernames of the accounts that should be in the group. Only other class accounts in the same class can be added to the group. The quota will be 100GB multiplied by the number of accounts in the group and will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;groupname&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt; is composed of:&lt;br /&gt;
* the abbreviated course code as used in the username e.g., &amp;quot;c999z&amp;quot;&lt;br /&gt;
* the character &amp;quot;g&amp;quot;&lt;br /&gt;
* the number of the group (starting at 0 for the first group requested for the class), zero-padded so that the total group name is 8 characters long&lt;br /&gt;
&lt;br /&gt;
e.g., &amp;quot;c999zg00&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
==Cluster Usage==&lt;br /&gt;
&#039;&#039;&#039;You may not run computational jobs on any submission host.&#039;&#039;&#039;  You must schedule your jobs with the [[SLURM]] workload manager.  You can find out more in the public documentation for the [https://slurm.schedmd.com/quickstart.html SLURM Workload Manager].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Any questions or issues with the cluster must first be raised through your TA.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
===CML===&lt;br /&gt;
Class accounts only have access to the following submission parameters in SLURM.  You may need to set each of these explicitly when submitting jobs.&lt;br /&gt;
&lt;br /&gt;
* Partition - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* Account - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* QoS - &amp;lt;code&amp;gt;default&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Available Nodes====&lt;br /&gt;
You can list the available nodes and their current state with the &amp;lt;code&amp;gt;show_nodes -p class&amp;lt;/code&amp;gt; command.  This list of nodes is not completely static as nodes may be pulled out of service to repair/replace GPUs or other components.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_nodes -p class&lt;br /&gt;
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES                      STATE      PARTITION&lt;br /&gt;
cmlgrad00            32         385421     Xeon,4216                 gpu:rtx2080ti:8           mix        class&lt;br /&gt;
cmlgrad01            32         385421     Xeon,4216                 gpu:rtx2080ti:8           alloc      class&lt;br /&gt;
cmlgrad02            32         385421     Xeon,4216                 gpu:rtx2080ti:7,gpu:rtx30 idle       class&lt;br /&gt;
cmlgrad03            32         385421     Xeon,4216                 gpu:rtx2080ti:8           mix        class&lt;br /&gt;
cmlgrad04            32         385421     Xeon,4216                 gpu:rtx2080ti:8           alloc      class&lt;br /&gt;
cmlgrad05            32         385421     Xeon,4216                 gpu:rtx3070:1,gpu:rtx2080 idle       class&lt;br /&gt;
cmlgrad06            32         385422     Xeon,4216                 gpu:rtx2080ti:8           alloc      class&lt;br /&gt;
cmlgrad07            32         385421     Xeon,4216                 gpu:rtx2080ti:8           mix        class&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also find more granular information about an individual node with the &amp;lt;code&amp;gt;scontrol show node&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scontrol show node cmlgrad02&lt;br /&gt;
NodeName=cmlgrad02 Arch=x86_64 CoresPerSocket=16&lt;br /&gt;
   CPUAlloc=0 CPUTot=32 CPULoad=0.07&lt;br /&gt;
   AvailableFeatures=Xeon,4216&lt;br /&gt;
   ActiveFeatures=Xeon,4216&lt;br /&gt;
   Gres=gpu:rtx2080ti:7,gpu:rtx3070:1&lt;br /&gt;
   NodeAddr=cmlgrad02 NodeHostName=cmlgrad02 Version=20.11.8&lt;br /&gt;
   OS=Linux 3.10.0-1160.45.1.el7.x86_64 #1 SMP Fri Sep 24 10:17:16 UTC 2021&lt;br /&gt;
   RealMemory=385421 AllocMem=0 FreeMem=376637 Sockets=2 Boards=1&lt;br /&gt;
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A&lt;br /&gt;
   Partitions=class,scavenger&lt;br /&gt;
   BootTime=2021-11-18T17:39:23 SlurmdStartTime=2021-11-29T12:42:36&lt;br /&gt;
   CfgTRES=cpu=32,mem=385421M,billing=487,gres/gpu=8&lt;br /&gt;
   AllocTRES=&lt;br /&gt;
   CapWatts=n/a&lt;br /&gt;
   CurrentWatts=0 AveWatts=0&lt;br /&gt;
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s&lt;br /&gt;
   Comment=(null)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Nexus===&lt;br /&gt;
TBD&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10589</id>
		<title>ClassAccounts</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/umiacs/index.php?title=ClassAccounts&amp;diff=10589"/>
		<updated>2022-08-22T14:24:28Z</updated>

		<summary type="html">&lt;p&gt;Derek: /* Registering for Duo */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Overview==&lt;br /&gt;
UMIACS Class Accounts are currently intended to support classes for all of UMIACS/CSD via the [[Nexus]] cluster. All new class accounts will be serviced solely through this cluster.  Faculty may request that a class be supported by contacting staff@umiacs.umd.edu.&lt;br /&gt;
&lt;br /&gt;
==Getting an account==&lt;br /&gt;
Your TA will request an account for you. Once this is done, you will be notified by email that you have an account to redeem.  If you have not received an email, please contact your TA. &#039;&#039;&#039;You must redeem the account within 7 days or else the redemption token will expire.&#039;&#039;&#039;  If your redemption token does expire, please have your TA [[HelpDesk | contact staff]] to have it renewed.  Staff will not renew any redemption tokens without TA approval.&lt;br /&gt;
&lt;br /&gt;
Once you do redeem your account, you will need to wait until you get a confirmation email that your account has been installed.  This is typically done once a day on days that the University is open for business.&lt;br /&gt;
&lt;br /&gt;
===Registering for Duo===&lt;br /&gt;
UMIACS requires that all Class accounts be registered for MFA (multi-factor authentication) under our [[Duo]] instance (note that this is different from UMD&#039;s general Duo instance). &#039;&#039;&#039;You will not be able to log onto the class submission host until you register.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In order to register, [https://intranet.umiacs.umd.edu/directory visit our directory app] and log in with your Class username and password. You will then receive a prompt to enroll in Duo. For assistance in enrollment, you can visit our [[Duo | Duo help page]].&lt;br /&gt;
&lt;br /&gt;
Once notified that your account has been installed and you have registered in our Duo instance, you can access the class submission host(s) for the unit sponsoring the class using [[SSH]] with your assigned username and the password you provided:&lt;br /&gt;
* &amp;lt;code&amp;gt;nexusclass00.umiacs.umd.edu&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;nexusclass01.umiacs.umd.edu&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Cleaning up your account before the end of the semester==&lt;br /&gt;
Class accounts for a given semester will be archived and deleted after that semester&#039;s completion as early as the following:&lt;br /&gt;
* Spring semesters: June 1st of same year&lt;br /&gt;
* Summer semesters: September 1st of same year&lt;br /&gt;
* Fall semesters: January 1st of next year&lt;br /&gt;
&lt;br /&gt;
It is your responsibility to ensure you have backed up anything you want to keep from your class account&#039;s personal or group storage (below sections) prior to the relevant date.&lt;br /&gt;
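&lt;br /&gt;
For example, one way to back up your class home directory is to copy it to another machine with &amp;lt;code&amp;gt;scp&amp;lt;/code&amp;gt;, run from that other machine. The username, hostname, and path below use sample values from the sections that follow; substitute your own, along with a destination directory of your choosing:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
scp -r c999z000@nexusclass00.umiacs.umd.edu:/fs/classhomes/fall2021/cmsc999z/c999z000 ~/class-backup&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;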
&lt;br /&gt;
==Personal Storage==&lt;br /&gt;
Your home directory has a quota of 20GB and is located at:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/classhomes/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;semester&amp;gt;&amp;lt;/code&amp;gt; is one of &amp;quot;spring&amp;quot;, &amp;quot;summer&amp;quot;, &amp;quot;fall&amp;quot;, or &amp;quot;winter&amp;quot;; &amp;lt;code&amp;gt;&amp;lt;year&amp;gt;&amp;lt;/code&amp;gt; is the current year, e.g., &amp;quot;2021&amp;quot;; &amp;lt;code&amp;gt;&amp;lt;coursecode&amp;gt;&amp;lt;/code&amp;gt; is the class&#039;s course code as listed in UMD&#039;s [https://app.testudo.umd.edu/soc/ Schedule of Classes] in all lowercase, e.g., &amp;quot;cmsc999z&amp;quot;; and &amp;lt;code&amp;gt;&amp;lt;username&amp;gt;&amp;lt;/code&amp;gt; is the username mentioned in the email you received to redeem the account, e.g., &amp;quot;c999z000&amp;quot;.&lt;br /&gt;
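&lt;br /&gt;
Putting the sample values above together, the home directory for a fall 2021 offering would be:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/classhomes/fall2021/cmsc999z/c999z000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;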
&lt;br /&gt;
You can request up to an additional 100GB of personal storage by having your TA [[HelpDesk | contact staff]]. This storage will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;username&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Group Storage==&lt;br /&gt;
You can also request group storage by having your TA [[HelpDesk | contact staff]] with the usernames of the accounts that should be in the group. Only other class accounts in the same class can be added to the group. The quota will be 100GB multiplied by the number of accounts in the group, and the storage will be located at&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/&amp;lt;semester&amp;gt;&amp;lt;year&amp;gt;/&amp;lt;coursecode&amp;gt;/&amp;lt;groupname&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;code&amp;gt;&amp;lt;groupname&amp;gt;&amp;lt;/code&amp;gt; is composed of:&lt;br /&gt;
* the abbreviated course code as used in the username e.g., &amp;quot;c999z&amp;quot;&lt;br /&gt;
* the character &amp;quot;g&amp;quot;&lt;br /&gt;
* the number of the group (starting at 0 for the first group requested for the class), zero-padded so that the total group name is 8 characters long&lt;br /&gt;
&lt;br /&gt;
e.g., &amp;quot;c999zg00&amp;quot;.&lt;br /&gt;
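&lt;br /&gt;
Putting the sample values together, the storage for the first group in a fall 2021 offering of &amp;quot;cmsc999z&amp;quot; would be located at:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/class-projects/fall2021/cmsc999z/c999zg00&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;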
&lt;br /&gt;
==Cluster Usage==&lt;br /&gt;
&#039;&#039;&#039;You may not run computational jobs on any submission host.&#039;&#039;&#039;  You must schedule your jobs with the [[SLURM]] workload manager.  You can also find out more with the public documentation for the [https://slurm.schedmd.com/quickstart.html SLURM Workload Manager].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Any questions or issues with the cluster must first be raised through your TA.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
===CML===&lt;br /&gt;
Class accounts only have access to the following submission parameters in SLURM.  You may be required to set each of these explicitly when submitting jobs.&lt;br /&gt;
&lt;br /&gt;
* Partition - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* Account - &amp;lt;code&amp;gt;class&amp;lt;/code&amp;gt;&lt;br /&gt;
* QoS - &amp;lt;code&amp;gt;default&amp;lt;/code&amp;gt;&lt;br /&gt;
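&lt;br /&gt;
As a minimal sketch, a batch script using these parameters might look like the following. The &amp;lt;code&amp;gt;--gres&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; values and the command being run are illustrative only:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --partition=class&lt;br /&gt;
#SBATCH --account=class&lt;br /&gt;
#SBATCH --qos=default&lt;br /&gt;
#SBATCH --gres=gpu:1&lt;br /&gt;
#SBATCH --time=01:00:00&lt;br /&gt;
&lt;br /&gt;
nvidia-smi&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Submit the script with &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt;, or start an interactive session with, e.g., &amp;lt;code&amp;gt;srun --partition=class --account=class --qos=default --pty bash&amp;lt;/code&amp;gt;.&lt;br /&gt;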
&lt;br /&gt;
====Available Nodes====&lt;br /&gt;
You can list the available nodes and their current state with the &amp;lt;code&amp;gt;show_nodes -p class&amp;lt;/code&amp;gt; command.  This list of nodes is not completely static as nodes may be pulled out of service to repair/replace GPUs or other components.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ show_nodes -p class&lt;br /&gt;
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES                      STATE      PARTITION&lt;br /&gt;
cmlgrad00            32         385421     Xeon,4216                 gpu:rtx2080ti:8           mix        class&lt;br /&gt;
cmlgrad01            32         385421     Xeon,4216                 gpu:rtx2080ti:8           alloc      class&lt;br /&gt;
cmlgrad02            32         385421     Xeon,4216                 gpu:rtx2080ti:7,gpu:rtx30 idle       class&lt;br /&gt;
cmlgrad03            32         385421     Xeon,4216                 gpu:rtx2080ti:8           mix        class&lt;br /&gt;
cmlgrad04            32         385421     Xeon,4216                 gpu:rtx2080ti:8           alloc      class&lt;br /&gt;
cmlgrad05            32         385421     Xeon,4216                 gpu:rtx3070:1,gpu:rtx2080 idle       class&lt;br /&gt;
cmlgrad06            32         385422     Xeon,4216                 gpu:rtx2080ti:8           alloc      class&lt;br /&gt;
cmlgrad07            32         385421     Xeon,4216                 gpu:rtx2080ti:8           mix        class&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can also find more granular information about an individual node with the &amp;lt;code&amp;gt;scontrol show node&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ scontrol show node cmlgrad02&lt;br /&gt;
NodeName=cmlgrad02 Arch=x86_64 CoresPerSocket=16&lt;br /&gt;
   CPUAlloc=0 CPUTot=32 CPULoad=0.07&lt;br /&gt;
   AvailableFeatures=Xeon,4216&lt;br /&gt;
   ActiveFeatures=Xeon,4216&lt;br /&gt;
   Gres=gpu:rtx2080ti:7,gpu:rtx3070:1&lt;br /&gt;
   NodeAddr=cmlgrad02 NodeHostName=cmlgrad02 Version=20.11.8&lt;br /&gt;
   OS=Linux 3.10.0-1160.45.1.el7.x86_64 #1 SMP Fri Sep 24 10:17:16 UTC 2021&lt;br /&gt;
   RealMemory=385421 AllocMem=0 FreeMem=376637 Sockets=2 Boards=1&lt;br /&gt;
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A&lt;br /&gt;
   Partitions=class,scavenger&lt;br /&gt;
   BootTime=2021-11-18T17:39:23 SlurmdStartTime=2021-11-29T12:42:36&lt;br /&gt;
   CfgTRES=cpu=32,mem=385421M,billing=487,gres/gpu=8&lt;br /&gt;
   AllocTRES=&lt;br /&gt;
   CapWatts=n/a&lt;br /&gt;
   CurrentWatts=0 AveWatts=0&lt;br /&gt;
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s&lt;br /&gt;
   Comment=(null)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
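&lt;br /&gt;
The &amp;lt;code&amp;gt;Gres&amp;lt;/code&amp;gt; field above lists the GPU types present on each node. If your job needs a particular type, SLURM allows requesting it by name; for example, to request one RTX 2080 Ti for an interactive session:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
srun --partition=class --account=class --qos=default --gres=gpu:rtx2080ti:1 --pty bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;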
&lt;br /&gt;
===Nexus===&lt;br /&gt;
TBD&lt;/div&gt;</summary>
		<author><name>Derek</name></author>
	</entry>
</feed>