Compute/DataLocality

Latest revision as of 15:10, 25 May 2017

This page covers some best practices related to data processing on UMIACS Compute resources.

Data Locality

It is recommended to store data that is actively being worked on as close to the processing source as possible. In the context of a cluster job, the data being processed, as well as any generated results, should be stored on a disk physically installed in the compute node itself. We'll cover how to identify local disk space later on this page.

General Workflow

The following is a suggested workflow for a computational job:

  1. Copy the data to be processed to the local compute node.
  2. Process the data, storing results on local disk space.
  3. Once processing is finished, transfer the results to a permanent storage location (e.g. a network file share).
  4. Clean up data and results from compute node local disk space.
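
The steps above can be sketched as a small shell script. Everything here is illustrative: the temporary directories stand in for the network file share and the node's local /scratch0, and the "processing" step is just a line count, since the real paths and processing command depend on your job.

```shell
set -e

SHARE=$(mktemp -d)     # stands in for the network file share
SCRATCH=$(mktemp -d)   # stands in for local disk, e.g. /scratch0/$USER

# Fake input data living on the "file share", packed as an archive.
mkdir "$SHARE/dataset"
printf 'a\nb\nc\n' > "$SHARE/dataset/input.txt"
tar -czf "$SHARE/dataset.tar.gz" -C "$SHARE" dataset

# 1. Copy the data to be processed to the local compute node.
cp "$SHARE/dataset.tar.gz" "$SCRATCH/"

# 2. Process the data, storing results on local disk space.
tar -xzf "$SCRATCH/dataset.tar.gz" -C "$SCRATCH"
wc -l < "$SCRATCH/dataset/input.txt" > "$SCRATCH/results.txt"

# 3. Transfer results to the permanent storage location.
cp "$SCRATCH/results.txt" "$SHARE/results.txt"

# 4. Clean up data and results from compute node local disk space.
rm -rf "$SCRATCH"
```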

Why this matters

Similar to how too many processes on a single machine can slow it down, too many users accessing a network file server can impact performance. This issue is further compounded in the context of cluster jobs, as a single user can generate hundreds if not thousands of jobs, all trying to access the same network file server. By utilizing the local disks on the compute nodes, you distribute the data access load and reduce the strain on the central file server.

Following these best practices isn't just about being a good neighbor, however; it will also improve the performance of your jobs.

To further illustrate this issue, consider a service like Netflix. While Netflix invests heavily in its data storage and supporting network, if it allowed customers to access that storage directly it would quickly reach capacity, degrading performance for all users. To accommodate this, Netflix distributes its data into various caching tiers that are much closer to the end user. This distribution evens the load across many different devices, increasing performance and availability for all users.

While UMIACS obviously does not operate at the same scale as Netflix, the same issues are still present within the compute infrastructure. Processing data that resides on local disk space reduces the load on the central file server and improves the performance of the process.


Data Storage

When possible, it is recommended that data be stored in an archive file.

Utilizing archive files provides the following benefits:

  • Faster data transfers
  • Reduced data size
  • Easier data management

Practically every filesystem in existence has limitations in its ability to handle large numbers of small files. By grouping a large collection of small files into a single archive file, we reduce the impact of this limitation and, when combined with techniques such as compression, improve the efficiency of data storage. Another advantage manifests when transferring data over the network: a connection to the remote location must be established and closed for each file transferred, which can add significant overhead when dealing with large numbers of files. Collecting the files into a single archive reduces the number of connections created and destroyed, so more of the time is spent streaming data.

Common utilities for creating archive files are tar and zip.
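
For example, a directory of small files can be packed, inspected, and unpacked with tar as follows. The directory and file names are made up for illustration.

```shell
cd "$(mktemp -d)"

# A sample directory containing several small files.
mkdir dataset
for i in 1 2 3; do echo "sample $i" > "dataset/file$i.txt"; done

# -c creates an archive, -z gzip-compresses it, -f names the archive file
tar -czf dataset.tar.gz dataset/

# -t lists the archive's contents without extracting anything
tar -tzf dataset.tar.gz

# -x extracts; -C picks the destination (e.g. local scratch space)
mkdir extracted
tar -xzf dataset.tar.gz -C extracted
```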


Identifying Local Disk Space

Local disk storage at UMIACS typically conforms to the following guidelines:

  • Directory name starts with /scratch
  • Almost every UMIACS supported machine has a /scratch0
  • Machines with multiple local disks will have multiple /scratchX directories, where X is a number that increases with the number of disks.
# Output shortened for brevity.
-bash-4.2$ lsblk 
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                  8:0    0 931.5G  0 disk 
└─sda2               8:2    0 930.5G  0 part 
  ├─vol00-scratch0 253:3    0   838G  0 lvm  /scratch0
sdb                  8:16   0   477G  0 disk 
└─sdb1               8:17   0   477G  0 part /scratch1
sdc                  8:32   0 953.9G  0 disk 
└─sdc1-scratch2    253:2    0 953.9G  0 lvm  /scratch2

As shown above, common utilities such as lsblk can be used to identify the specific configuration on a given node.
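
Free space on those directories can then be checked with df. A small sketch follows; the /scratch* glob matches the naming convention above, and the helper function name is our own.

```shell
# Print the df data line for each existing local scratch directory.
# The glob pattern is passed in as arguments, so the same check works
# on nodes with /scratch0, /scratch1, and so on.
report_scratch() {
    for d in "$@"; do
        if [ -d "$d" ]; then
            df -h "$d" | tail -n 1
        fi
    done
}

# On a UMIACS compute node this would typically be:
report_scratch /scratch*
```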

Note: Local data storage is considered transitory and as such is not backed up. It is not intended for long-term storage of critical/sensitive data.

If you have any questions about the available local disk storage on a given cluster, please refer to the documentation specific to that cluster, or contact the UMIACS Help Desk.