Nexus/GAMMA
Latest revision as of 17:25, 28 August 2025
The GAMMA lab has a partition of GPU nodes available in the Nexus. Only GAMMA lab members are able to run non-interruptible jobs on these nodes.
Access
You can always find out which hosts you can submit jobs from on the Nexus#Access page. The GAMMA lab in particular has a special submission node that has additional local storage available.
- nexusgamma00.umiacs.umd.edu
Please do not run anything on the submission node. Always allocate yourself machines on the compute nodes (see instructions below) to run any job.
Quality of Service
GAMMA users have access to all of the standard job QoSes in the gamma partition using the gamma account.
The additional job QoSes for the GAMMA partition specifically are:
- huge-long: Allows for longer jobs using higher overall resources.
- gamma-huge-long: Allows for longer jobs using higher overall resources, with even more available GPUs per job.
Please note that the partition has a GrpTRES limit of 100% of the available cores/RAM on the partition-specific nodes in aggregate plus 50% of the available cores/RAM on legacy## nodes in aggregate, so your job may need to wait if all available cores/RAM (or GPUs) are in use.
Time Limit
There is a 12 hour time limit on interactive jobs in the gamma partition. If you need to run longer jobs, you will need to modify your workflow into a job that can be submitted as a batch script.
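As a rough aid for deciding whether a job fits in an interactive session, the sketch below converts a Slurm-style time string to seconds and compares it with the 12 hour limit stated above. The function names `slurm_time_to_seconds` and `fits_interactive` are hypothetical (they are not part of Slurm or the Nexus tooling), and only the `HH:MM:SS` and `D-HH:MM:SS` forms are handled.

```shell
#!/bin/bash
# Hypothetical helpers, not part of Slurm or the Nexus tooling.

# Convert "HH:MM:SS" or "D-HH:MM:SS" to a number of seconds.
slurm_time_to_seconds() {
    local spec=$1 days=0 h m s
    if [[ $spec == *-* ]]; then
        days=${spec%%-*}   # text before the first "-"
        spec=${spec#*-}    # text after the first "-"
    fi
    IFS=: read -r h m s <<< "$spec"
    # 10# forces base 10 so values such as "08" are not parsed as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Succeed if the requested time fits within the 12 hour interactive limit.
fits_interactive() {
    local limit=$(( 12 * 3600 ))
    (( $(slurm_time_to_seconds "$1") <= limit ))
}
```

For example, `fits_interactive 1-23:00:00 || echo "submit with sbatch instead"` prints the message, because 1 day and 23 hours exceeds the interactive limit.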
Compute Nodes
| Nodenames | Type | Quantity | CPU cores per node | Memory per node | GPUs per node | 
|---|---|---|---|---|---|
| gammagpu[00-04,06-09] | A5000 GPU Node | 9 | 32 | 256GB | 8 | 
| gammagpu05 | A4000 GPU Node | 1 | 32 | 256GB | 8 | 
| gammagpu[10-17] | A6000 GPU Node | 8 | 16 | 128GB | 4 | 
| gammagpu[18-21] | L40S GPU Node | 4 | 32 | 256GB | 4 | 
| | Total | 22 | 576 | 4608GB | 128 |
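The totals row can be reproduced from the per-node figures. The following sketch simply multiplies each row's quantity by its per-node values and sums the results; the variable names are illustrative.

```shell
#!/bin/bash
# Rows copied from the table above:
# quantity, CPU cores per node, memory per node (GB), GPUs per node.
rows=(
    "9 32 256 8"    # gammagpu[00-04,06-09], A5000
    "1 32 256 8"    # gammagpu05, A4000
    "8 16 128 4"    # gammagpu[10-17], A6000
    "4 32 256 4"    # gammagpu[18-21], L40S
)

nodes=0 cores=0 mem_gb=0 gpus=0
for row in "${rows[@]}"; do
    read -r qty c m g <<< "$row"
    (( nodes += qty, cores += qty * c, mem_gb += qty * m, gpus += qty * g ))
done

echo "nodes=$nodes cores=$cores memory=${mem_gb}GB gpus=$gpus"
# prints: nodes=22 cores=576 memory=4608GB gpus=128
```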
Network
The network infrastructure supporting the GAMMA partition consists of:
- One pair of network switches connected to each other via dual 100GbE links for redundancy, serving the following compute nodes:
  - gammagpu[00-19]: Two 100GbE links per node, one to each switch in the pair (redundancy).
  - gammagpu[20-21]: One 100GbE link per node. One of the two nodes links to the first switch in the pair, and the other links to the second switch in the pair. These nodes do not have redundant links because the switches are currently at port capacity.
 
The fileserver hosting all GAMMA project, scratch, and dataset allocations first connects to the first pair of switches mentioned in the Network section of the Nexus/CML page and then to a pair of intermediary switches before reaching the compute nodes. The last hop from the pair of intermediary switches to the first pair of switches mentioned on this page is via eight 100GbE links, two to each switch in a set of four intermediary switches for redundancy and increased bandwidth.
For a broader overview of the network infrastructure supporting the Nexus cluster, please see Nexus/Network.
Example
From nexusgamma00.umiacs.umd.edu you can run the following example to submit an interactive job. Note that you need to specify both --partition and --account. Please refer to our SLURM documentation for how to further customize your submissions, including making a batch submission. The following command requests 8 GPUs in an interactive session, which is subject to the 12 hour time limit on interactive jobs in the gamma partition. Adjust the parameters to your needs. We discourage the use of srun and encourage the use of sbatch for fair use of GPUs.

    $ srun --pty --gres=gpu:8 --account=gamma --partition=gamma --qos=huge-long bash
    $ hostname
    gammagpu05.umiacs.umd.edu
    $ nvidia-smi -L
    GPU 0: NVIDIA RTX A4000 (UUID: GPU-07ef47fe-072c-5e08-ba19-b87442dd330f)
    GPU 1: NVIDIA RTX A4000 (UUID: GPU-7e30eac5-d1e2-ab92-686c-2aee8efbcb79)
    GPU 2: NVIDIA RTX A4000 (UUID: GPU-9a31a590-7f83-23a3-6499-05b67357dbf1)
    GPU 3: NVIDIA RTX A4000 (UUID: GPU-5c581a49-e0e5-8ec2-124d-b37e043a9086)
    GPU 4: NVIDIA RTX A4000 (UUID: GPU-a7061b64-b90e-8fec-ff31-8cc997c88880)
    GPU 5: NVIDIA RTX A4000 (UUID: GPU-2e590cba-70fd-4261-9f9f-c5ee813fd305)
    GPU 6: NVIDIA RTX A4000 (UUID: GPU-439de936-4b11-5ea6-184c-0a709d35d679)
    GPU 7: NVIDIA RTX A4000 (UUID: GPU-de7ef2e7-b6f8-fe81-dfab-a86629553bde)

You can also use sbatch to submit your job. Here are two examples of how to do that. Note that --pty is an srun option and is not used with sbatch.

    $ sbatch --gres=gpu:8 --account=gamma --partition=gamma --qos=huge-long --time=1-23:00:00 script.sh

OR

    $ sbatch script.sh

where script.sh contains:

    #!/bin/bash
    #SBATCH --gres=gpu:8
    #SBATCH --account=gamma
    #SBATCH --partition=gamma
    #SBATCH --qos=huge-long
    #SBATCH --time=1-23:00:00
    python your_file.py

Storage
Three types of user storage are available to GAMMA users:
- Home directories
- Project directories
- Scratch directories
There is also read-only storage available for Dataset directories.
GAMMA users can also request Nexus project allocations.
Home Directories
You have 30GB of home directory storage available at /nfshomes/<username>. It has both Snapshots and Backups enabled.
Home directories are intended to store personal or configuration files only. We encourage you to not share any data in your home directory. You are encouraged to utilize our GitLab infrastructure to host your code repositories.
NOTE: To check your quota on this directory, use the command df -h ~.
Project Directories
You can request project based allocations for up to 8TB and up to 180 days with approval from a GAMMA faculty member.
To request an allocation, please contact staff with the faculty member(s) that approved the project in the conversation. Please include the following details:
- Project Name (short)
- Description
- Size (1TB, 2TB, etc.)
- Length in days (30 days, 90 days, etc.)
- Other user(s) that need to access the allocation, if any
These allocations will be available from /fs/gamma-projects under a name that you provide when you request the allocation. Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation (requires re-approval from a GAMMA faculty member).
- If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period. Staff will then remove the allocation.
- If you do not respond to staff's request by the end of the allocation period, staff will make the allocation temporarily inaccessible.
  - If you do respond asking for renewal but the original faculty approver does not respond within two weeks of the end of the allocation period, staff will also make the allocation temporarily inaccessible.
  - If one month from the end of the allocation period is reached without both you and the faculty approver responding, staff will remove the allocation.
 
This data is backed up nightly.
Scratch Directories
Scratch data has no data protection: there are no snapshots and the data is not backed up. There are two types of scratch directories:
- Network scratch directory
- Local scratch directories
Network Scratch Directory
You are allocated 100GB of scratch space via NFS from /gammascratch/$username.  It is not backed up or protected in any way.  
This directory is automounted so you may not see your directory if you run ls /gammascratch but it will be mounted when you cd into your /gammascratch directory.
You may request a permanent increase of up to 200GB total space without any faculty approval by contacting staff. If you need space beyond 200GB, you will need faculty approval.
This file system is available on all submission and computational nodes within the cluster.
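To keep an eye on how much of the allocation is in use, something like the following sketch may help. The helper name `scratch_usage` is hypothetical, the 100GB default mirrors the allocation described above, and `du` can be slow on large directory trees over NFS.

```shell
#!/bin/bash
# Hypothetical helper: report disk usage of a directory against a quota in GB.
scratch_usage() {
    local dir=$1 quota_gb=${2:-100}
    local used_kb quota_kb
    # du -sk prints the total size in kilobytes followed by the path.
    used_kb=$(du -sk "$dir" | cut -f1) || return 1
    quota_kb=$(( quota_gb * 1024 * 1024 ))
    echo "$(( used_kb * 100 / quota_kb ))% of ${quota_gb}GB used in $dir"
}
```

A call such as `scratch_usage "/gammascratch/$USER"` assumes that `$USER` matches your directory name under /gammascratch; accessing the path this way also triggers the automount.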
Local Scratch Directories
These file systems are not available over NFS and there are no backups or snapshots available for these file systems.
- Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named /scratch0, /scratch1, etc. These directories are local to each node, i.e. the /scratch0 directories on two different nodes are completely separate.
  - These directories are almost always more performant than any other storage available to the job. However, you must stage data to these directories within the confines of your jobs and stage the data out before the end of your jobs.
  - These local scratch directories have a tmpwatch job which will delete unaccessed data after 90 days, scheduled via maintenance jobs to run once a month during our monthly maintenance windows. Again, please make sure you secure any data you write to these directories at the end of your job.
- GAMMA has invested in a 20TB NVMe scratch file system on nexusgamma00.umiacs.umd.edu that is available as /scratch1. To utilize this space, you will need to copy data to and from it over SSH from a compute node. To make this easier, you may want to set up SSH keys that will allow you to copy data without prompting for passwords.
  - The /scratch1 directory on nexusgamma00.umiacs.umd.edu doesn't have a tmpwatch. The files in this directory need to be manually removed once they are no longer needed.
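The stage-in and stage-out pattern described above could be sketched as follows. The function names `stage_in` and `stage_out` and the per-user, per-job directory layout are assumptions made for illustration, not an established convention on the cluster. `SLURM_JOB_ID` is set by Slurm inside a job; the fallback value is only there so the functions also work outside one.

```shell
#!/bin/bash
# Hypothetical staging helpers; names and directory layout are placeholders.
# SCRATCH_ROOT defaults to /scratch0, one of the local scratch directories.
SCRATCH_ROOT=${SCRATCH_ROOT:-/scratch0}

# Copy an input directory into a per-job directory on local scratch
# and print the path of the staged copy.
stage_in() {
    local src=$1
    local job_dir="$SCRATCH_ROOT/${USER:-user}/${SLURM_JOB_ID:-manual}"
    mkdir -p "$job_dir" &&
        cp -a "$src" "$job_dir/" &&
        echo "$job_dir/$(basename "$src")"
}

# Copy a results directory back to persistent storage,
# then remove the local copy to free scratch space.
stage_out() {
    local results=$1 dest=$2
    mkdir -p "$dest" &&
        cp -a "$results" "$dest/" &&
        rm -rf "$results"
}
```

In a batch script, `stage_in` would typically run before the main program and `stage_out` as the last step, so that results are secured before the job ends.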
Datasets
We have read-only dataset storage available at /fs/gamma-datasets. If there are datasets that you would like to see curated and made available, please see this page.
The list of GAMMA datasets we currently host can be viewed here.