MBRC: Difference between revisions

From UMIACS
(Replaced content with "The [https://mbrc.umd.edu MBRC] at the University of Maryland is located within the Institute for Advanced Computer Studies. The MBRC has a cluster of computational (CPU/GPU) resources that are available to be scheduled. Details on this cluster can be found here.")
Tag: Replaced
 
(4 intermediate revisions by the same user not shown)
The MBRC ([https://mbrc.umd.edu MBRC]) at the University of Maryland is located within the Institute for Advanced Computer Studies.  The MBRC has a cluster of computational (CPU/GPU) resources that are available to be scheduled.
The [https://mbrc.umd.edu MBRC] at the University of Maryland is located within the Institute for Advanced Computer Studies.  The MBRC has a cluster of computational (CPU/GPU) resources that are available to be scheduled.  Details on this cluster can be found [[Nexus/MBRC | here]].
 
=Compute Infrastructure=
 
Each of the UMIACS cluster computational infrastructures is accessed through its submission node.  Once logged into the submission node, users submit jobs through the [[SLURM]] resource manager.  Each cluster in UMIACS has its own quality of service (QoS) levels, one of which must be selected when submitting a job, and many clusters, like this one, have other specific resources such as GPUs that must be requested for a job.
 
The current submission node for ''MBRC'' is:
* <code>mbrcsub00.umiacs.umd.edu</code>
 
== Partition ==
The MBRC [[SLURM]] computational infrastructure has two partitions.  If you do not specify a partition when submitting your job, it is placed in '''dpart'''.
 
* '''dpart''' - This is the default partition and job allocations are guaranteed.
* '''scavenger''' - This is the alternate partition that allows jobs longer run times and more resources, but its jobs can be preempted when jobs in '''dpart''' are ready to be scheduled.
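
As a minimal sketch of selecting a partition explicitly, the helper below only prints the <code>srun</code> command line it would use, so it can be tried on any machine. The function name <code>mbrc_srun_cmd</code> is hypothetical. <code>--partition</code> and <code>--qos</code> are standard SLURM options, but pairing the scavenger partition with a QoS named scavenger is an assumption based on the QoS names listed on this page.

```shell
#!/bin/sh
# Print (do not execute) an srun command for the chosen partition.
# mbrc_srun_cmd is a hypothetical helper name, not part of SLURM.
mbrc_srun_cmd() {
    partition="${1:-dpart}"   # dpart is the default partition
    case "$partition" in
        dpart)     qos="default" ;;
        scavenger) qos="scavenger" ;;   # assumption: QoS name matches the partition
        *) echo "unknown partition: $partition" >&2; return 1 ;;
    esac
    echo "srun --pty --partition=$partition --qos=$qos bash"
}

mbrc_srun_cmd dpart
mbrc_srun_cmd scavenger
```

Copy one of the printed lines and run it on the submission node to start the interactive job.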
 
== QoS ==
MBRC currently has 1 QoS for the '''dpart''' partition and 1 QoS for the '''scavenger''' partition.  The important differences between QoS levels are the maximum wall time, the total number of jobs you can run at once, and the maximum trackable resources (TRES) per job.  The scavenger QoS adds one more constraint: a limit on the total TRES per user (across multiple jobs).
 
<pre>
# show_qos
      Name    MaxWall MaxJobs                        MaxTRES    MaxTRESPU
---------- ----------- ------- ------------------------------ -------------
  default  1-00:00:00      1      cpu=4,mem=32G,gres/gpu=2
scavenger  2-00:00:00                                          gres/gpu=8
</pre>
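
The following sketch checks a planned request against the default QoS limits shown above (cpu=4, mem=32G, gres/gpu=2) before submitting. The helper name <code>fits_default_qos</code> is hypothetical and the limits are hard-coded from the table, so confirm the current values with <code>show_qos</code> before relying on it.

```shell
#!/bin/sh
# Check a planned request against the default QoS MaxTRES listed above.
# Arguments: CPU count, memory in GB, GPU count.
# fits_default_qos is a hypothetical helper; the limits are copied from the table.
fits_default_qos() {
    cpus="$1"; mem_gb="$2"; gpus="$3"
    [ "$cpus" -le 4 ] && [ "$mem_gb" -le 32 ] && [ "$gpus" -le 2 ]
}

if fits_default_qos 4 16 1; then
    echo "4 CPUs, 16G, 1 GPU: fits the default QoS"
fi
if ! fits_default_qos 4 16 4; then
    echo "4 CPUs, 16G, 4 GPUs: exceeds the default QoS"
fi
```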
 
== GPUs ==
Jobs that require GPU resources need to explicitly request the resources within their job submission.  This is done through Generic Resource Scheduling (GRES).  Currently all nodes in the cluster are homogeneous; however, in the future this may not be the case.  To select GPUs without naming a specific type, users may use the most generic identifier, '''gpu''', followed by a colon and a number (e.g. <code>--gres=gpu:4</code> for 4 GPUs).
 
<pre>
$ sinfo -o "%15n %10c %10m  %25f %25G"
NODELIST        CPUS      MEMORY      AVAIL_FEATURES            GRES
mbrc[00]      32        191896          (null)                    gpu:rtx2080ti:8
mbrc[01]      32        191896          (null)                    gpu:rtx2080ti:8
</pre>
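
The GRES column above has the form <code>name:type:count</code>. As a small sketch, the snippet below splits such a string with POSIX parameter expansion and prints a generic and a type-specific <code>--gres</code> request. The type-specific form follows standard SLURM GRES syntax; whether it is ever needed on this cluster is an assumption, since all nodes are currently identical.

```shell
#!/bin/sh
# Split a GRES string of the form name:type:count, as printed by sinfo above.
gres="gpu:rtx2080ti:8"
gres_name="${gres%%:*}"    # text before the first colon
gres_count="${gres##*:}"   # text after the last colon
rest="${gres#*:}"          # drop the name, leaving type:count
gres_type="${rest%%:*}"    # text before the remaining colon

# Generic request: any GPU type. Typed request: name the type explicitly.
echo "--gres=$gres_name:2"
echo "--gres=$gres_name:$gres_type:2"
```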
 
 
== Job Submission and Management ==
Users should review our [[SLURM]] [[SLURM/JobSubmission | job submission]] and [[SLURM/JobStatus | job management]] documentation. 
 
A very quick way to get an interactive shell is to run the following on the submission node.  This will allocate 1 GPU with 16GB of memory (system RAM) in the default QoS for a maximum of 4 hours.  If the job exceeds either the memory allocation or the maximum time, it will be terminated immediately.
 
<pre>
srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash
</pre>
 
<pre>
[jheager2@mbrcsub00:~ ] $ srun --pty --gres=gpu:1 --mem=16G --qos=default --time=04:00:00 bash
[jheager2@mbrc00:~ ] $ nvidia-smi -L
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-4ad5c018-b9bc-e664-233a-5d9ee8ad05cb)
</pre>
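
For non-interactive work, the same resources can be requested in a batch script. This is a sketch rather than a site-provided template: the job name and the file name <code>gpu-check.sh</code> are hypothetical, while the <code>#SBATCH</code> options mirror the <code>srun</code> flags used above.

```shell
#!/bin/bash
# The job name below is a placeholder; %j in the output name expands to the job ID.
#SBATCH --job-name=gpu-check
#SBATCH --qos=default
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --output=gpu-check-%j.out

# Print the GPUs visible to this job, or a note if nvidia-smi is missing.
list_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L
    else
        echo "nvidia-smi not found on $(uname -n)"
    fi
}

list_gpus
```

Save the script as <code>gpu-check.sh</code> and submit it from the submission node with <code>sbatch gpu-check.sh</code>.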
 
=Data Storage=
 
Until the final storage investment arrives, we have made available a temporary allocation of storage.  This section is subject to change.  There are 3 types of storage available to users in the MBRC: home directories, project directories and scratch directories.
 
== Home Directories ==
Home directories in the MBRC computational infrastructure are available from the Institute's [[NFShomes]] as <code>/nfshomes/USERNAME</code> where USERNAME is your username.  These home directories have very limited storage and are intended for your personal files, configuration and source code.  Your home directory is '''not''' intended for data sets or other large scale data holdings.  Users are encouraged to utilize our [[GitLab]] infrastructure to host their code repositories.
 
'''NOTE''': To check your quota on this directory you will need to use the <code>quota -s</code> command.
 
Your home directory data is fully protected: it has snapshots and is backed up nightly.
 
== Project Directories ==
For this cluster we have decided to allocate network storage on a project-by-project basis. Jonathan Heagerty is the point of contact for allocating the requested/required storage for each project. As a whole, the MBRC Cluster has limited network storage, so there will be limits on how much network storage can be allocated and for how long.

If the requested storage size is large relative to the total allotted amount, the request will be relayed from Jonathan Heagerty to the MBRC Cluster faculty for approval. Two other situations require approval from the MBRC Cluster faculty: a request to increase a project's current storage allotment, or a request for a time extension for a project's storage.
 
When making a request for storage please provide the following information to staff@umiacs.umd.edu:
        - Name of user requesting storage:
                Example: jheager2
        - Name of project:
                Example: Foveated Rendering
        - Collaborators working on the project:
                Example: Sida Li
        - Storage size:
                Example: 1TB
        - Length of time for storage:
                Example: 6-8 months
 
== Scratch Directories ==
 
=== Local Scratch Directory===
Each computational node that a user can schedule compute jobs on has one or more local scratch directories. These are always named <code>/scratch0</code>, <code>/scratch1</code>, etc. (~3.5TB). These are almost always more performant than any other storage available to the job. However, users must stage their data in within the confines of their job and stage the data out before the end of their job.
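
The stage-in and stage-out pattern described above can be sketched as follows. This is an illustration under stated assumptions, not a site-provided script: the function name <code>stage_and_run</code>, the per-user and per-job directory layout, and the example paths are all hypothetical, and the workload is a placeholder that counts the staged files.

```shell
#!/bin/sh
# stage_and_run INPUT_DIR OUTPUT_DIR [SCRATCH_ROOT]
# Copies INPUT_DIR into a per-job directory on local scratch, runs a
# placeholder workload there, copies results to OUTPUT_DIR, and removes the
# scratch copy. The body runs in a subshell, so its variables stay local
# and "exit" only ends the function.
stage_and_run() (
    input_dir="$1"
    output_dir="$2"
    scratch_root="${3:-/scratch0}"
    # SLURM_JOB_ID is set by SLURM inside a job; "manual" is a fallback label.
    work_dir="$scratch_root/${USER:-$(id -u)}/${SLURM_JOB_ID:-manual}"

    mkdir -p "$work_dir/in" "$work_dir/out" "$output_dir" || exit 1

    # Stage in: copy the input data onto the node-local disk.
    cp -r "$input_dir/." "$work_dir/in/" || { rm -rf "$work_dir"; exit 1; }

    # Placeholder workload: count the staged files. Replace with real work.
    find "$work_dir/in" -type f | wc -l > "$work_dir/out/file_count.txt"

    # Stage out: copy results back to persistent storage before the job ends.
    cp -r "$work_dir/out/." "$output_dir/"
    status=$?

    # Clean up the scratch copy so the local disk stays free for other jobs.
    rm -rf "$work_dir"
    exit "$status"
)

# Run only inside a SLURM job; the paths below are placeholders.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    stage_and_run "$HOME/project/input" "$HOME/project/output" /scratch0
fi
```

Copying results out before removing the working directory means a failed stage-out is reported through the exit status, although the scratch copy is still removed in that case.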

Latest revision as of 14:55, 3 August 2023

The MBRC at the University of Maryland is located within the Institute for Advanced Computer Studies. The MBRC has a cluster of computational (CPU/GPU) resources that are available to be scheduled. Details on this cluster can be found here.