Nexus/Submission Node Policy

From UMIACS

Submission nodes are intended for users to submit jobs to the Nexus cluster. User interactivity is critical here, and while some of these submission nodes have a decent number of cores and a decent amount of memory, computationally intensive jobs should not be run on these nodes. We have minimal resource restrictions in place at the moment, but we may explore further CPU/memory restrictions in the future. For the time being, be a good neighbor!

In general, we expect that users on these nodes do not use more than 1-2GB of memory at a time and do not sustain >=100% CPU usage (at least 1 core fully utilized) for extended periods of time. Utilizing more than 1 core is fine for bursty workloads, but sustained high CPU utilization can lead to issues with interactivity, and these issues can be drastically amplified if there is also memory contention on the node (swapping memory requires CPU time).

Examples

Appropriate submission node uses

  • Job submission
  • SSH jump host for user workstations
  • Basic code editing in editors with minimal extensions (e.g. emacs, nano, vim, neovim without an LSP).

Appropriate submission node uses, with caveats

These kinds of workloads can be run on submission nodes, but if they are incorrectly configured, they can affect interactivity of the nodes. If we notice repeated behavior like this, we may reach out to you to ask about your environment and suggest ways to lower resource utilization.

  • IDEs can be fine, but keep in mind which extensions you install. Some extensions/behaviors may use more memory/CPU than you realize.
  • For smaller projects, code compilation can be fine. If a project takes longer than 30 seconds to compile, please ensure compilation is limited to a single thread, or consider compiling the code in a Slurm job on one of the dedicated CPU nodes (we suggest using the scavenger partition).
    • Depending on a project's file structure, code compilation can be highly I/O dependent, which can also impact user interactivity.
  • Setting up environments with user package managers (e.g. pip, npm, conda/mamba)
    • The resource usage of package managers mostly depends on the specific packages being installed. Complex environments can take a while to set up with package managers like conda/mamba because of dependency resolution. In addition, some packages are not distributed as prebuilt binaries and are instead compiled from source during installation, which is not always obvious to the user. For these reasons, we suggest that non-trivial environments be set up in a Slurm job (we suggest using the scavenger partition).
  • Running lightweight programs to interact with Slurm jobs currently running on the cluster.
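
As a rough illustration of the single-thread guidance for compilation above, the following sketch picks a build parallelism level based on where it runs. The helper name `build_jobs` is hypothetical. It relies on the `SLURM_CPUS_PER_TASK` environment variable, which Slurm sets inside a job only when `--cpus-per-task` was requested, and falls back to 1 everywhere else (including on a submission node).

```shell
# Hypothetical helper: number of parallel build jobs to use.
# Inside a Slurm job that requested --cpus-per-task, use that allocation;
# anywhere else (e.g. on a submission node), fall back to a single thread.
build_jobs() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

echo "parallel build jobs: $(build_jobs)"
```

On a submission node, a command such as `make -j "$(build_jobs)"` would then build with a single thread. Inside a job started with, for example, `srun --partition=scavenger --cpus-per-task=4 --pty bash`, the same command would use four. Any account or QoS options that the scavenger partition requires are site-specific and are not shown here.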

Inappropriate submission node uses

If tech staff notices processes with the following behavior and node interactivity is affected, we may kill these processes. If a user repeatedly causes issues, further action may be taken on their account.

  • Compiling large projects using all threads on a submission node.
  • Running nontrivial computation requiring significant CPU and/or memory resources.

Advice for monitoring/reducing CPU/memory utilization

Here is some general guidance for tracking your resource utilization on a node, as well as some practices you can follow to slightly reduce CPU usage of certain tasks.

  • `top`
    • By default, `top` sorts by %CPU, but you can sort by %MEM by pressing "M" (Shift + "m").
    • You can limit it to show only your own processes by starting it with `-u $USER`.
    • VIRT is not a useful measure of actual memory usage: programs often request far more memory than they need, and the kernel does not give them this memory until it is actually required. RES (the resident set size, or RSS) is the amount of physical memory a process is using at a given point. If there is no suffix, the reported value is in KiB.
  • `/sys/fs/cgroup/memory/user.slice/user-$UID_NUMBER.slice/memory.usage_in_bytes` (where `$UID_NUMBER` stands for your numeric user ID, as printed by `id -u`)
    • This is the current combined memory usage of all processes running under your user.
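
As a minimal sketch of how the cgroup file above could be read, the snippet below converts its byte count to MiB so it can be compared against the 1-2GB guideline. The helper name `cgroup_mem_mib` is hypothetical, and the path assumes the cgroup layout shown in the bullet above. On nodes with a different cgroup layout the file may live elsewhere, in which case the script only reports that it could not be read.

```shell
# Hypothetical helper: print the byte count stored in a cgroup memory file
# as whole MiB (fractions are truncated).
cgroup_mem_mib() {
    awk '{ printf "%d\n", $1 / 1048576 }' "$1"
}

# Path from the bullet above, with the numeric user ID filled in by `id -u`.
usage_file="/sys/fs/cgroup/memory/user.slice/user-$(id -u).slice/memory.usage_in_bytes"

if [ -r "$usage_file" ]; then
    echo "Current usage: $(cgroup_mem_mib "$usage_file") MiB"
else
    echo "No readable file at $usage_file" >&2
fi
```

A result well above roughly 2048 MiB would suggest that some of your processes are better suited to a Slurm job than to a submission node.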