Compute/DataLocality

From UMIACS
Revision as of 22:16, 24 May 2017 by Sabobbin


This page covers some best practices related to data processing on UMIACS Compute resources.

Introduction

It is recommended to store data that is actively being worked on as close to where it is processed as possible. In the context of a cluster job, the data being processed, as well as any generated results, should be stored on a disk physically installed in the compute node itself. We'll cover how to identify local disk space later on this page.
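As a rough illustration (mount points and device names vary from node to node, and the paths shown in the comments are hypothetical), `df` can help distinguish disks physically installed in the node from network mounts: local filesystems list a device as their source, while NFS mounts list a `host:/path` source.

```shell
# List mounted filesystems with their sources.
# Local disks show a device path (e.g. /dev/sda1);
# network shares show a host:/path source (e.g. fileserver:/export/home).
df -h

# GNU df can also exclude filesystem types, e.g. hide NFS mounts:
df -h -x nfs -x nfs4
```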

General Workflow

The following is a suggested workflow for a computational job:

  1. Copy the data to be processed to the local compute node.
  2. Process the data, storing results on local disk space.
  3. Once processing is finished, transfer the results to a permanent storage location (e.g., a network file share).
  4. Clean up data and results from compute node local disk space.
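The four steps above can be sketched as a shell script. This is a minimal, self-contained demo: on a real cluster, `NETWORK_SHARE` would be your network file share and `LOCAL_SCRATCH` a directory on the compute node's local disk (the actual mount points vary by cluster), and `wc -l` stands in for your real processing step.

```shell
#!/bin/bash
set -e

# Demo stand-ins: temp dirs play the roles of the network share and the
# compute node's local disk, with a small sample input on the "share".
NETWORK_SHARE=$(mktemp -d)
LOCAL_SCRATCH=$(mktemp -d)
mkdir "$NETWORK_SHARE/input"
printf 'record1\nrecord2\n' > "$NETWORK_SHARE/input/data.txt"

# 1. Copy the data to be processed to the local compute node.
cp -r "$NETWORK_SHARE/input" "$LOCAL_SCRATCH/"

# 2. Process the data, storing results on local disk
#    (wc -l is a placeholder for your actual job).
wc -l "$LOCAL_SCRATCH/input/data.txt" > "$LOCAL_SCRATCH/results.txt"

# 3. Transfer results to the permanent storage location.
cp "$LOCAL_SCRATCH/results.txt" "$NETWORK_SHARE/"

# 4. Clean up data and results from the local disk.
rm -rf "$LOCAL_SCRATCH"

echo "results stored in $NETWORK_SHARE/results.txt"
```

Note that steps 1 and 3 touch the network share only once each; all per-record I/O in step 2 happens against the local disk.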

Why this matters

Similar to how too many users on a single machine can slow it down, too many users accessing a network file server can impact its performance. This issue is further compounded in the context of cluster jobs, as a single user can generate hundreds if not thousands of jobs, all trying to access the same network file server. By utilizing the local disks on the compute nodes, you effectively distribute the data-access load and reduce the load on the central file server.

Following these best practices isn't just about being a good neighbor, however; it will also improve the performance of your jobs.


Warehouse Analogy

Analogy 1 :: Remote Warehouse

In the first analogy, a business ships packages out of its Main Office and keeps its inventory of items in a remote warehouse. Each time the company receives an order, it must go through the following process:

  1. Request the item from the warehouse.
  2. The warehouse locates the item.
  3. The warehouse retrieves the item.
  4. The warehouse delivers the item to the Main Office.
  5. The Main Office packages and delivers the item to the customer.

As shown above, storing the item in a remote warehouse results in a significant amount of overhead for each customer request. While the warehouse may be able to keep up when only the Main Office is making requests, it may quickly become oversubscribed should a satellite office open and start placing orders.

Analogy 2 :: Local Inventory Stock

In the second analogy, the same business starts keeping a local stock of the inventory it will need for the day. Now their workflow is divided up into two different types of operations:

  1. Daily Tasks
    • Request the items needed for local inventory from the warehouse.
    • The warehouse locates the items.
    • The warehouse retrieves the items.
    • The warehouse delivers the items to the Main Office.
  2. Per Customer Order Tasks:
    • Find the item in local stock.
    • Package and deliver item to customer.

While the warehouse still delivers the same number of packages, it can condense the number of tasks required, reducing its load. In the event that a satellite office also starts making requests, the Main Office will not be impacted, as it already has all the packages it needs.

Explanation

Relating the above analogies to UMIACS Compute resources: the warehouse is the network file server, and the Main Office and satellite offices are jobs running on the cluster.