Compute/DataLocality
This page covers some best practices related to data processing on UMIACS Compute resources.
Introduction
It is recommended to store data that is actively being worked on as close to the processing source as possible. In the context of a cluster job the data being processed, as well as any generated results, should be stored on a disk physically installed in the compute node itself. We'll cover how to identify local disk space later on this page.
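One quick way to check whether a path lives on local disk or a network mount is to look at the filesystem type backing it. A minimal sketch, assuming GNU coreutils; the path /tmp is only an illustrative example, as your cluster's local scratch directory may differ (check with staff):

```shell
# df reports the filesystem type backing a path: network mounts show
# types like nfs or nfs4, while node-local disks show ext4, xfs, etc.
# NOTE: /tmp is an example path, not necessarily your local scratch area.
fstype=$(df --output=fstype /tmp | tail -n 1)
echo "/tmp is on a filesystem of type: $fstype"
```

A type of nfs (or similar) means reads and writes to that path travel over the network; a local type means they stay on the node.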
General Workflow
The following is a suggested workflow for a computational job:
- Copy the data to be processed to the local compute node.
- Process the data, storing results on local disk space.
- Once processing is finished, transfer the results to a permanent storage location (e.g. a network file share).
- Clean up data and results from compute node local disk space.
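The steps above can be sketched as a small job script. This is a hedged illustration, not a site-specific recipe: the NETWORK_DIR and LOCAL_DIR paths here default to temporary directories standing in for a network share and node-local scratch space, and the "processing" step is a placeholder command. Substitute your actual share and local scratch paths.

```shell
set -euo pipefail

# Illustrative stand-ins: NETWORK_DIR plays the network file share,
# LOCAL_DIR plays the node-local disk. Override them with real paths.
NETWORK_DIR="${NETWORK_DIR:-$(mktemp -d)}"
LOCAL_DIR="${LOCAL_DIR:-$(mktemp -d)}"

# Pretend the input data already exists on the network share.
echo "input data" > "$NETWORK_DIR/input.dat"

# 1. Copy the data to be processed to the local compute node.
cp "$NETWORK_DIR/input.dat" "$LOCAL_DIR/"

# 2. Process the data, storing results on local disk space.
#    (tr is a placeholder for your real processing command.)
tr 'a-z' 'A-Z' < "$LOCAL_DIR/input.dat" > "$LOCAL_DIR/result.dat"

# 3. Transfer results back to the permanent storage location.
cp "$LOCAL_DIR/result.dat" "$NETWORK_DIR/"

# 4. Clean up data and results from local disk space.
rm -f "$LOCAL_DIR/input.dat" "$LOCAL_DIR/result.dat"
```

Note that the network share is touched exactly twice, once to fetch input and once to deposit results, while all intermediate reads and writes stay on the local disk.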
Why this matters
Similar to how too many users on a single machine can slow it down, too many users accessing a network file server can impact performance. This issue is further compounded in the context of cluster jobs, as a single user can generate hundreds if not thousands of jobs all trying to access the same network file server. By utilizing the local disks on the compute nodes, you effectively distribute the data access load and reduce the load on the central file server.
Following these best practices isn't just about being a good neighbor, however; it will also improve the performance of your jobs.
Warehouse Analogy
Analogy 1 :: Remote Warehouse
In the first analogy, a business ships packages out of its Main Office and keeps its inventory of items in a remote warehouse. Each time the company receives an order, it must go through the following process:
- Request the item from the warehouse.
- The warehouse locates the item.
- The warehouse retrieves the item.
- The warehouse delivers the item to the Main Office.
- The Main Office packages and delivers the item to the customer.
As shown above, storing the item in a remote warehouse results in a significant amount of overhead for each customer request. While the warehouse may be able to keep up when only the Main Office is making requests, it may quickly become oversubscribed should a satellite office open and start placing orders.
Analogy 2 :: Local Inventory Stock
In the second analogy, the same business starts keeping a local stock of the inventory it will need for the day. Now their workflow is divided up into two different types of operations:
- Daily Tasks
- Request the items needed for local inventory from the warehouse.
- The warehouse locates the items.
- The warehouse retrieves the items.
- The warehouse delivers the items to the Main Office.
- Per Customer Order Tasks:
- Find the item in local stock.
- Package and deliver the item to the customer.
While the warehouse is still delivering the same number of packages, it can consolidate them into fewer trips, reducing its load. In the event that a satellite office also starts making requests, the Main Office will not be impacted, as it already has all the packages it needs.
Explanation
Relating the above analogies to UMIACS Compute resources: the warehouse is the network file server, and the Main Office and satellite offices are the jobs running on the cluster.