Nexus/Vulcan
The Vulcan standalone cluster's compute nodes will fold into Nexus on Thursday, August 17th, 2023 during the scheduled maintenance window for August (5-8pm).
The Nexus cluster already has a large pool of compute resources made possible through leftover funding for the Brendan Iribe Center. Details on common nodes already in the cluster (Tron partition) can be found here.
In addition, the Vulcan cluster's standalone submission nodes vulcansub00.umiacs.umd.edu and vulcansub01.umiacs.umd.edu will be retired on Thursday, September 21st, 2023 during that month's maintenance window (5-8pm), as they will no longer be able to submit jobs to Vulcan compute nodes after the August maintenance window. Please use nexusvulcan00.umiacs.umd.edu and nexusvulcan01.umiacs.umd.edu for any general purpose Vulcan compute needs after this time.
As of Friday, July 21st, 2023, two 1080 Ti compute nodes (vulcan11 and vulcan12) have been moved into Nexus to give you a chance to test your new submission scripts now if you would like. Please continue to run your normal Vulcan workloads on the standalone Vulcan cluster for now, as two compute nodes will not be able to handle jobs from multiple users simultaneously. Only the vulcan-dpart and vulcan-scavenger partitions are available to test with.
Please see the Timeline section below for concrete dates in chronological order.
Please contact staff with any questions or concerns.
Usage
The Nexus cluster submission nodes that are allocated to Vulcan are nexusvulcan00.umiacs.umd.edu and nexusvulcan01.umiacs.umd.edu. You must use these nodes to submit jobs to Vulcan compute nodes after the August maintenance window. Submission from vulcansub00.umiacs.umd.edu or vulcansub01.umiacs.umd.edu will no longer work.
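For example, you can connect to one of these submission nodes with SSH (replace username with your UMIACS username):
$ ssh username@nexusvulcan00.umiacs.umd.edu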
All partitions, QoSes, and account names from the standalone Vulcan cluster are being moved over to Nexus when the compute nodes move. However, please note that vulcan- is prepended to all of the values that were present in the standalone Vulcan cluster to distinguish them from existing values in Nexus. The lone exception is the base account currently named vulcan in the standalone cluster (it is also named just vulcan in Nexus).
Here are some before/after examples of job submission with various parameters:
Standalone Vulcan cluster submission command | Nexus cluster submission command |
---|---|
srun --partition=dpart --qos=medium --account=abhinav --gres=gpu:rtxa4000:2 --pty bash | srun --partition=vulcan-dpart --qos=vulcan-medium --account=vulcan-abhinav --gres=gpu:rtxa4000:2 --pty bash |
srun --partition=cpu --qos=cpu --pty bash | srun --partition=vulcan-cpu --qos=vulcan-cpu --pty bash |
srun --partition=scavenger --qos=scavenger --account=vulcan --gres=gpu:4 --pty bash | srun --partition=vulcan-scavenger --qos=vulcan-scavenger --account=vulcan --gres=gpu:4 --pty bash |
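The same renaming applies to batch scripts. Below is a minimal sketch of an sbatch script using the renamed partition, QoS, and account values from the table above; the job name, time limit, and the command it runs are only illustrative:

#!/bin/bash
#SBATCH --job-name=vulcan-test        # illustrative job name
#SBATCH --partition=vulcan-dpart      # was: dpart
#SBATCH --qos=vulcan-medium           # was: medium
#SBATCH --account=vulcan-abhinav      # was: abhinav
#SBATCH --gres=gpu:rtxa4000:2
#SBATCH --time=1-00:00:00
srun hostname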
Vulcan users (exclusively) can schedule non-interruptible jobs on the moved nodes with any non-scavenger job parameters. Please note that the vulcan-dpart partition has a GrpTRES limit of 100% of the available cores/RAM on vulcan## nodes plus 50% of the available cores/RAM on legacy## nodes, so your job may need to wait if all available cores/RAM (or GPUs) are in use. It also has a limit of 500 submitted jobs per user at any one time so as to not overload the cluster. These limits are codified by the partition QoS named vulcan.
Please note that the Vulcan compute nodes are also in the institute-wide scavenger partition in Nexus. Vulcan users still have scavenging priority over these nodes via the vulcan-scavenger partition: all vulcan- partition jobs (other than vulcan-scavenger) can preempt both vulcan-scavenger and scavenger partition jobs, and vulcan-scavenger partition jobs can preempt scavenger partition jobs.
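Because scavenger jobs can be preempted at any time, it can help to submit them as requeueable so that Slurm restarts them when resources free up again. A minimal sketch (myjob.sh is a hypothetical script):
$ sbatch --partition=vulcan-scavenger --qos=vulcan-scavenger --account=vulcan --gres=gpu:2 --requeue myjob.sh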
Timeline
Each event will be completed within the timeframe specified.
Date | Event |
---|---|
July 21st, 2023 | Two compute nodes vulcan11 and vulcan12 are moved into Nexus so submission can be tested |
August 17th, 2023, 5-8pm | All other standalone Vulcan cluster compute nodes are moved into Nexus in corresponding vulcan- named partitions |
September 21st, 2023, 5-8pm | vulcansub00.umiacs.umd.edu and vulcansub01.umiacs.umd.edu are taken offline |
Migration
Home Directories
The Nexus uses NFShomes home directories. If your UMIACS account was created before February 22nd, 2023, you have been using /cfarhomes/<username> as your home directory on the standalone Vulcan cluster. While /cfarhomes is available on Nexus, your shell initialization scripts from it will not automatically load. Please copy over anything you need to your /nfshomes/<username> directory at your earliest convenience, as /cfarhomes may be retired in the coming year.
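For example, you could copy your shell initialization files over with rsync from any node that mounts both file systems (replace username with your UMIACS username; the exact files you need may differ):
$ rsync -av /cfarhomes/username/.bashrc /cfarhomes/username/.bash_profile /nfshomes/username/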
Post-Migration
There are currently 45 GPU nodes available running a mixture of NVIDIA RTX A6000, NVIDIA RTX A5000, NVIDIA RTX A4000, NVIDIA Quadro P6000, NVIDIA GeForce GTX 1080 Ti, NVIDIA GeForce RTX 2080 Ti, and NVIDIA Tesla P100 cards. There are also 2 CPU-only nodes available.
All nodes are scheduled with the SLURM resource manager.
Partitions
There are three partitions available to general Vulcan SLURM users. You must specify a partition when submitting your job.
- vulcan-dpart - This is the default partition. Job allocations are guaranteed.
- vulcan-scavenger - This is the alternate partition that allows longer run times and more resources, but jobs are preemptable when jobs in other vulcan- partitions are ready to be scheduled.
- vulcan-cpu - This partition is for CPU-focused jobs. Job allocations are guaranteed.
There are a few additional partitions available to subsets of Vulcan users based on specific requirements.
Accounts
Vulcan has a base SLURM account vulcan which has a modest number of guaranteed billing resources available to all cluster users at any given time. Faculty that have invested in Vulcan compute infrastructure have an additional account provided to their sponsored accounts on the cluster, which provides a number of guaranteed billing resources corresponding to the amount that they invested.
If you do not specify an account when submitting your job, you will receive the vulcan account. If your faculty sponsor has their own account, it is recommended to use that account for job submission.
The current faculty accounts are:
- abhinav
- djacobs
- jbhuang
- lsd
- metzler
- rama
- ramani
- yaser
- zwicker
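For example, a user sponsored by a contributing faculty member can explicitly submit against that faculty account (here vulcan-abhinav, the Nexus name for the abhinav account) rather than the base vulcan account:
$ srun --partition=vulcan-dpart --qos=vulcan-default --account=vulcan-abhinav --pty bash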
$ sacctmgr show account format=account%20,description%30,organization%10
             Account                          Descr        Org
-------------------- ------------------------------ ----------
                 ...                            ...        ...
              vulcan                         vulcan     vulcan
      vulcan-abhinav   vulcan - abhinav shrivastava     vulcan
      vulcan-djacobs          vulcan - david jacobs     vulcan
      vulcan-jbhuang         vulcan - jia-bin huang     vulcan
          vulcan-lsd           vulcan - larry davis     vulcan
      vulcan-metzler         vulcan - chris metzler     vulcan
         vulcan-rama        vulcan - rama chellappa     vulcan
       vulcan-ramani     vulcan - ramani duraiswami     vulcan
        vulcan-yaser          vulcan - yaser yacoob     vulcan
      vulcan-zwicker      vulcan - matthias zwicker     vulcan
                 ...                            ...        ...
Faculty can manage this list of users via our Directory application in the Security Groups section. The security group that controls access has the prefix vulcan_ and then the faculty username. It will also list slurm://nexusctl.umiacs.umd.edu as the associated URI.
You can check your account associations by running the show_assoc command to see the accounts you are associated with. Please contact staff and include your faculty member in the conversation if you do not see the appropriate association.
$ show_assoc
      User          Account MaxJobs       GrpTRES QOS
---------- ---------------- ------- ------------- --------------------------------------------------------------------------------
       ...              ...     ...           ... ...
   abhinav          abhinav      48               vulcan-cpu,vulcan-default,vulcan-high,vulcan-medium,vulcan-scavenger
   abhinav           vulcan      48               vulcan-cpu,vulcan-default,vulcan-medium,vulcan-scavenger
       ...              ...     ...           ... ...
You can also see the total number of Trackable Resources (TRES) allowed for each account by running the following command. Please make sure you specify the account you are looking for. As shown below, there is a concurrent limit of 64 total GPUs for all users not in a contributing faculty group.
$ sacctmgr show assoc account=vulcan format=user,account,qos,grptres
      User    Account                  QOS       GrpTRES
---------- ---------- -------------------- -------------
               vulcan                        gres/gpu=64
       ...        ...                  ...           ...
QoS
You need to decide which QoS to submit with, as it will set a certain number of restrictions on your job. If you do not specify a QoS when submitting your job using the --qos parameter, you will receive the vulcan-default QoS, assuming you are using a Vulcan account.
The following command will list the current QoS options. Either the vulcan-default, vulcan-medium, or vulcan-high QoS is required for the vulcan-dpart partition. Please note that only faculty accounts (see above) have access to the vulcan-high QoS.
The following example shows the current limits that each QoS has.
$ show_qos
                Name     MaxWall                        MaxTRES MaxJobsPU MaxSubmitPU                      MaxTRESPU              GrpTRES
-------------------- ----------- ------------------------------ --------- ----------- ------------------------------ --------------------
                 ...         ...                            ...       ...         ...                            ...                  ...
       vulcan-medium  3-00:00:00       cpu=8,gres/gpu=2,mem=64G         2
         vulcan-high  1-12:00:00     cpu=16,gres/gpu=4,mem=128G         2
      vulcan-default  7-00:00:00       cpu=4,gres/gpu=1,mem=32G         2
    vulcan-scavenger  3-00:00:00     cpu=32,gres/gpu=8,mem=256G
        vulcan-janus  3-00:00:00    cpu=32,gres/gpu=10,mem=256G
       vulcan-exempt  7-00:00:00     cpu=32,gres/gpu=8,mem=256G         2
          vulcan-cpu  2-00:00:00                cpu=1024,mem=4T         4
    vulcan-exclusive 30-00:00:00
       vulcan-sailon  3-00:00:00     cpu=32,gres/gpu=8,mem=256G                                                                gres/gpu=48
                 ...         ...                            ...       ...         ...                            ...                  ...
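For example, a job that needs 2 GPUs must use the vulcan-medium (or vulcan-high) QoS, since vulcan-default only allows 1 GPU per job. The request below is chosen to stay within the vulcan-medium limits shown above:
$ srun --partition=vulcan-dpart --qos=vulcan-medium --account=vulcan --cpus-per-task=8 --mem=64G --gres=gpu:2 --time=3-00:00:00 --pty bash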
Storage
Vulcan has the following storage available. Please also review UMIACS Local Data Storage policies including any volume that is labeled as scratch.
Vulcan users can also request Nexus project allocations.
Home Directory
You have 20GB of storage available at /nfshomes/<username>. It has both Snapshots and Backups available if need be.
Home directories are intended to store personal or configuration files only. We encourage users to not share any data in their home directory.
Scratch Directories
Scratch data has no data protection including no snapshots and the data is not backed up. There are two types of scratch directories in the Vulcan compute infrastructure:
- Network scratch directory
- Local scratch directories
Network Scratch Directory
You are allocated 300GB of scratch space via NFS from /vulcanscratch/$username. It is not backed up or protected in any way. This directory is automounted, so you will need to cd into the directory or request/specify a fully qualified file path to access it.
You may request a temporary increase of up to 500GB total space for a maximum of 120 days without any faculty approval by contacting staff@umiacs.umd.edu. Once the temporary increase period is over, you will be contacted and given a one-week window of opportunity to clean and secure your data before staff will forcibly remove data to get your space back under 300GB. If you need space beyond 500GB or for longer than 120 days, you will need faculty approval and/or a project directory.
This file system is available on all submission, data management, and computational nodes within the cluster.
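Since the directory is automounted, simply referencing the full path will mount it on demand, for example (replace username with your UMIACS username):
$ cd /vulcanscratch/username
$ ls /vulcanscratch/username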
Local Scratch Directories
Each computational node that you can schedule compute jobs on has one or more local scratch directories. These are always named /scratch0, /scratch1, etc. These are almost always more performant than any other storage available to the job. However, you must stage your data in within the confines of your job and stage the data out before the end of your job.
These local scratch directories have a tmpwatch job which will delete unaccessed data after 90 days, scheduled via maintenance jobs to run once a month at 1am. Different nodes will run the maintenance jobs on different days of the month to ensure the cluster is still highly available at all times. Please make sure you secure any data you write to these directories at the end of your job.
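A minimal sketch of that stage-in, compute, stage-out pattern in a batch script (the input/output paths and train.py are hypothetical placeholders):

#!/bin/bash
#SBATCH --partition=vulcan-dpart
#SBATCH --qos=vulcan-default
#SBATCH --account=vulcan
#SBATCH --gres=gpu:1
# Stage input data in to fast node-local scratch
mkdir -p /scratch0/$USER/myjob
cp -r /vulcanscratch/$USER/input /scratch0/$USER/myjob/
# Run the computation against the local copy
python train.py --data /scratch0/$USER/myjob/input --out /scratch0/$USER/myjob/output
# Stage results back out and clean up before the job ends
cp -r /scratch0/$USER/myjob/output /vulcanscratch/$USER/
rm -rf /scratch0/$USER/myjob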
Datasets
We have read-only dataset storage available at /fs/vulcan-datasets. If there are datasets that you would like to see curated and available, please see this page.
The following is the list of datasets available:
Dataset | Path |
---|---|
3D-FRONT | /fs/vulcan-datasets/3d-front |
3D-FUTURE | /fs/vulcan-datasets/3d-future |
Action Genome | /fs/vulcan-datasets/AG |
ActivityNet | /fs/vulcan-datasets/ActivityNet |
CATER | /fs/vulcan-datasets/CATER |
COVID-DA | /fs/vulcan-datasets/COVID-DA |
CelebA | /fs/vulcan-datasets/CelebA |
CelebA-HQ | /fs/vulcan-datasets/CelebA-HQ |
CelebAMask-HQ | /fs/vulcan-datasets/CelebAMask-HQ |
Charades | /fs/vulcan-datasets/Charades |
CharadesEgo | /fs/vulcan-datasets/CharadesEgo |
CIFAR10 | /fs/vulcan-datasets/cifar-10-python |
CIFAR100 | /fs/vulcan-datasets/cifar-100-python |
CityScapes | /fs/vulcan-datasets/cityscapes |
COCO | /fs/vulcan-datasets/coco |
Conceptual Captions | /fs/vulcan-datasets/conceptual_captions |
CUB | /fs/vulcan-datasets/CUB |
DeepFashion | /fs/vulcan-datasets/DeepFashion |
Digits | /fs/vulcan-datasets/digits_full |
Edges2handbags | /fs/vulcan-datasets/edges2handbags |
Edges2shoes | /fs/vulcan-datasets/edges2shoes |
EGTEA | /fs/vulcan-datasets/EGTEA |
emnist | /fs/vulcan-datasets/emnist |
EPIC Kitchens 2018 | /fs/vulcan-datasets/Epics-kitchen-2018 |
EPIC Kitchens 2020 | /fs/vulcan-datasets/EPIC-Kitchens-2020 |
Facades | /fs/vulcan-datasets/facades |
from_games (GTA5) | /fs/vulcan-datasets/from_games |
FFHQ | /fs/vulcan-datasets/ffhq-dataset |
FineGym | /fs/vulcan-datasets/FineGym |
Google Landmarks Dataset v2 | /fs/vulcan-datasets/google-landmark-v2 |
HAA500 | /fs/vulcan-datasets/haa500 |
HICO | /fs/vulcan-datasets/HICO |
HMDB51 | /fs/vulcan-datasets/HMDB51 |
Honda_100h | /fs/vulcan-datasets/honda_100h |
HPatches | /fs/vulcan-datasets/HPatches |
Human3.6M | /fs/vulcan-datasets/human3.6 |
IM2GPS (test only) | /fs/vulcan-datasets/im2gps |
ImageNet | /fs/vulcan-datasets/imagenet |
iNaturalist Dataset 2021 | /fs/vulcan-datasets/inat_comp_2021 |
InteriorNet | /fs/vulcan-datasets/InteriorNet |
Kinetics-400 | /fs/vulcan-datasets/Kinetics-400 |
Labelled Faces in the Wild | /fs/vulcan-datasets/lfw |
LibriSpeech | /fs/vulcan-datasets/LibriSpeech |
LSUN | /fs/vulcan-datasets/LSUN |
LVIS | /fs/vulcan-datasets/LVIS |
Maps | /fs/vulcan-datasets/maps |
Matterport3D | /fs/vulcan-datasets/Matterport3D |
MegaDepth | /fs/vulcan-datasets/MegaDepth |
MineRL | /fs/vulcan-datasets/MineRL |
Mini-ImageNet | /fs/vulcan-datasets/miniImagenet |
MIT Indoor | /fs/vulcan-datasets/mit_indoor |
MIT Places | /fs/vulcan-datasets/mit_places |
Multi-PIE Face | /fs/vulcan-datasets/multipie |
Night2day | /fs/vulcan-datasets/night2day |
ObjectNet3D | /fs/vulcan-datasets/ObjectNet3D |
Occluded Video Instance Segmentation | /fs/vulcan-datasets/ovis-2021 |
Office | /fs/vulcan-datasets/office |
Office-Home | /fs/vulcan-datasets/office_home |
omniglot | /fs/vulcan-datasets/omniglot |
OOPS | /fs/vulcan-datasets/OOPS |
OpenImagesv4 | /fs/vulcan-datasets/OpenImagesv4 |
PartNet | /fs/vulcan-datasets/PartNet |
Pascal VOC | /fs/vulcan-datasets/pascal_voc |
PIC (HOI-A) | /fs/vulcan-datasets/PIC |
PubLayNet | /fs/vulcan-datasets/PubLayNet |
Replica | /fs/vulcan-datasets/Replica |
ScanNet | /fs/vulcan-datasets/ScanNet |
ShapeNetCore.v2 | /fs/vulcan-datasets/ShapeNetCore.v2 |
Something-Something-V1 | /fs/vulcan-datasets/SomethingV1 |
Something-Something-V2 | /fs/vulcan-datasets/SomethingV2 |
SYNTHIA-RAND-CITYSCAPES | /fs/vulcan-datasets/SYNTHIA-RAND-CITYSCAPES |
TAPOS | /fs/vulcan-datasets/TAPOS |
Tiny ImageNet | /fs/vulcan-datasets/tiny_imagenet |
Tumblr GIF Description | /fs/vulcan-datasets/TGIF |
Thingi10K | /fs/vulcan-datasets/Thingi10K |
UCF101 | /fs/vulcan-datasets/UCF101 |
VirtualHomes | /fs/vulcan-datasets/VirtualHomes |
visda17 | /fs/vulcan-datasets/visda17 |
visda17_openset | /fs/vulcan-datasets/VISDA |
visda19 | /fs/vulcan-datasets/visda |
Visual Genome | /fs/vulcan-datasets/VG |
Visual Relationship Detection | /fs/vulcan-datasets/VRD |
VOCdevkit | /fs/vulcan-datasets/VOCdevkit |
VoxCeleb2 | /fs/vulcan-datasets/VoxCeleb2 |
WILDS | /fs/vulcan-datasets/WILDS |
xView2 | /fs/vulcan-datasets/xView2 |
YCB Object Models | /fs/vulcan-datasets/YCB |
YouTube8M | /fs/vulcan-datasets/YouTube8M |
YouTubeVIS-2019 | /fs/vulcan-datasets/YouTubeVIS-2019 |
YouTubeVIS-2021 | /fs/vulcan-datasets/YouTubeVIS-2021 |
Project Storage
Users within the Vulcan compute infrastructure can request project-based allocations of up to 10TB for up to 180 days by contacting staff with approval from the Vulcan faculty manager (Dr. Shrivastava). These allocations will be available from /fs/vulcan-projects under a name that you provide when you request the allocation. Near the end of the allocation period, staff will contact you and ask if you would like to renew the allocation for up to another 180 days (requires re-approval from Dr. Shrivastava). If you are no longer in need of the storage allocation, you will need to relocate all desired data within two weeks of the end of the allocation period; staff will then remove the allocation. If you or the original faculty approver do not respond to staff's request within one month of the end of the allocation period, staff will remove the allocation.
This data, by default, will be backed up nightly and have a limited snapshot schedule (1 daily snapshot). Upon request, staff can both exclude the data from backups and/or disable snapshots on the project storage volume. We currently have 100TB total to support these projects which includes the snapshot data for this volume.
Object Storage
All Vulcan users can request project allocations in the UMIACS Object Store. Please email staff@umiacs.umd.edu with a short project name and the amount of storage you will need to get started.
An example of how to use the umobj command line utilities can be found here. A full set of documentation for the utilities can be found on the umobj GitLab page.