Toaster:DigArchFileSets

From Adapt

Overview

While packages are the smallest complete unit in our system, we will be performing all replication functions against larger data pools called filesets. Groups of filesets will be created that will replicate packages between all filesets in a group. By requiring that packages be stored in filesets, we can easily recover lost nodes merely by copying entire filesets from other members in a group.

Information regarding fileset groups and filesets is stored locally on nodes participating in the group. The Master may receive notification of group changes or re-harvest information from individual nodes, but should not be considered authoritative.

Fileset

A fileset is a storage device or storage area that holds packages. In the prototype, filesets will be filesystems on test nodes. It is assumed that a fileset has complete control over the filesystem on the node where it resides.

A fileset is responsible for internally validating the data it stores. When it detects a corrupt package, it is responsible for contacting other members of its group to recover that package.
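A minimal sketch of such a self-audit pass follows. The design does not say how packages are validated, so the SHA-256 checksums, the `audit_fileset` function, and the callable-per-peer interface are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical peer interface: given a package ID, return its bytes or None.
PeerFetch = Callable[[str], Optional[bytes]]


def sha256_hex(data: bytes) -> str:
    """Checksum used for validation (SHA-256 is an assumption, not from the design)."""
    return hashlib.sha256(data).hexdigest()


def audit_fileset(
    packages: Dict[str, bytes],
    expected: Dict[str, str],
    peers: Iterable[PeerFetch],
) -> List[str]:
    """Validate every expected package and repair bad ones from group peers.

    Returns the IDs of packages that no peer could supply intact.
    """
    peer_list = list(peers)
    unrecovered: List[str] = []
    for package_id, digest in expected.items():
        data = packages.get(package_id)
        if data is not None and sha256_hex(data) == digest:
            continue  # package is present and intact
        for fetch in peer_list:
            candidate = fetch(package_id)
            # Only accept a peer copy that matches the expected checksum.
            if candidate is not None and sha256_hex(candidate) == digest:
                packages[package_id] = candidate
                break
        else:
            unrecovered.append(package_id)
    return unrecovered
```

Checking the peer's copy before accepting it keeps one corrupt member from spreading bad data to the rest of the group.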

Fileset group organization

Fileset groups contain a master fileset and slave filesets. All metadata regarding a fileset group is mirrored between all members of the group. The master fileset is the only member allowed to publish metadata updates. Fileset group metadata consists of purge requests, lists of contained packages, and lists of member filesets.
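The group metadata and the master-only publishing rule could be modelled roughly as below. The class names, field names, and the use of `PermissionError` are illustrative assumptions; the design only names the three kinds of metadata and the restriction on who may publish.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class GroupMetadata:
    purge_requests: Set[str] = field(default_factory=set)  # package IDs to purge
    packages: Set[str] = field(default_factory=set)        # contained package IDs
    members: Set[str] = field(default_factory=set)         # member fileset names


@dataclass
class Member:
    name: str
    master_name: str
    metadata: GroupMetadata = field(default_factory=GroupMetadata)

    def publish(self, peers: List["Member"]) -> None:
        """Mirror this member's metadata to its peers; only the master may do so."""
        if self.name != self.master_name:
            raise PermissionError(f"{self.name} is not the master of this group")
        for peer in peers:
            # Each peer keeps its own copy, so later edits on the master
            # are not visible until the next publish.
            peer.metadata = copy.deepcopy(self.metadata)
```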

Building a fileset group

  1. On the master node, create a new fileset belonging to a new fileset group.
  2. Set the fileset to be master of the new group.
  3. On the master fileset, enter the nodes and fileset names of all member filesets. They start out marked offline.
  4. Next, on each node that will participate, create a new fileset and set its UUID to that of the group on the master node. In addition, set the master node as master of the fileset.
  5. The new fileset will contact the master and start syncing packages from it or from any available peers. The node will also update its fileset group information with all other peers involved.
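The procedure above can be sketched as a small in-memory model. Everything here is hypothetical: the `Fileset` class, the three helper functions, the "offline"/"online" status strings, and the choice to reject a join from a fileset the master has not registered. The sketch also reads step 4 as giving the new fileset the same group UUID as the master, and it syncs packages only from the master, not from peers.

```python
import uuid
from typing import Dict, Optional, Tuple

MemberKey = Tuple[str, str]  # (node name, fileset name)


class Fileset:
    def __init__(self, node: str, name: str, group_id: str) -> None:
        self.node = node
        self.name = name
        self.group_id = group_id
        self.master: Optional[MemberKey] = None
        self.members: Dict[MemberKey, str] = {}  # member -> "offline" or "online"
        self.packages: Dict[str, bytes] = {}

    @property
    def key(self) -> MemberKey:
        return (self.node, self.name)


def create_master(node: str, name: str) -> Fileset:
    """Steps 1-2: create a fileset in a new group and make it the master."""
    fileset = Fileset(node, name, str(uuid.uuid4()))
    fileset.master = fileset.key
    fileset.members[fileset.key] = "online"
    return fileset


def register_member(master: Fileset, node: str, name: str) -> None:
    """Step 3: record a future member on the master; it starts offline."""
    master.members[(node, name)] = "offline"


def join_group(master: Fileset, node: str, name: str) -> Fileset:
    """Steps 4-5: create the member fileset, point it at the master, and sync."""
    if (node, name) not in master.members:
        # Assumption: joins from unregistered filesets are refused.
        raise ValueError(f"{node}/{name} is not registered on the master")
    fileset = Fileset(node, name, master.group_id)
    fileset.master = master.key
    master.members[fileset.key] = "online"
    fileset.packages = dict(master.packages)  # sync packages from the master
    fileset.members = dict(master.members)    # mirror group membership
    return fileset
```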

Fileset metadata

Detailed list of fileset-level metadata that each node tracks:

  • peer nodes
  • master node
  • advertised size
  • fileset ID
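As one possible on-disk representation, the four fields above fit in a small record that each node could serialize locally. The `FilesetRecord` name, the JSON encoding, and the choice of bytes as the unit for the advertised size are assumptions, not part of the design.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class FilesetRecord:
    fileset_id: str
    master_node: str
    peer_nodes: List[str]
    advertised_size: int  # assumed to be in bytes

    def to_json(self) -> str:
        """Serialize the record for local storage on the node."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "FilesetRecord":
        """Rebuild a record from its JSON form."""
        return cls(**json.loads(text))
```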


Master Failure

On failure of the master, any other fileset may be promoted to master. Each member fileset will need to be manually updated to recognize the new master fileset. While all nodes mirror the master's metadata, there is still the possibility of inconsistent metadata. A utility should be created to allow manual reconciliation of metadata onto a new master. Filesets should then sync fileset group metadata with the new master.

During failure of a master, a fileset may still continue to self-audit its data and recover packages from any other member of the group. Since there is no master, no new packages will be added, and no purges will occur.
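A reconciliation utility might start by showing the operator where the surviving members' mirrored package lists disagree. The sketch below is one hypothetical helper for that step; the `compare_package_lists` name and its input shape (member name mapped to a set of package IDs) are assumptions. It deliberately makes no decision about which copy is correct, since the design calls for manual reconciliation.

```python
from typing import Dict, Set, Tuple


def compare_package_lists(
    copies: Dict[str, Set[str]],
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Compare the package lists mirrored on each surviving member.

    Returns (agreed, disputed): `agreed` holds package IDs listed by every
    member, and `disputed` maps each remaining ID to the members that list it.
    """
    if not copies:
        return set(), {}
    all_ids = set().union(*copies.values())
    agreed = set.intersection(*copies.values())
    disputed = {
        package_id: {name for name, ids in copies.items() if package_id in ids}
        for package_id in all_ids - agreed
    }
    return agreed, disputed
```

An operator could accept the agreed set as-is on the new master and review only the disputed entries by hand.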

Storage Classes

Groups of similar fileset groups can be organized into storage classes that are presented to ingesters. Storage classes exist only on the main archive manager and have no operational use beyond describing aggregations of fileset groups to ingesters.

Any fileset group that will accept data must belong to a storage class. When an ingester requests storage from the archive manager, its request contains a storage class in addition to a resource size. A fileset group may belong to more than one storage class in order to offer ingesters different types of service (e.g., storage classes based on the speed of filesets rather than the number of replicas).

For example, fileset groups containing 3 filesets with each fileset existing on slow media may be grouped into a storage class that can be requested.
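The lookup an archive manager performs for such a request might look like the sketch below. The `ArchiveManager` and `FilesetGroup` classes, the `free_bytes` field, and the first-fit selection rule are assumptions; the design states only that a request carries a storage class and a resource size.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FilesetGroup:
    name: str
    free_bytes: int  # hypothetical capacity figure for the group


class ArchiveManager:
    def __init__(self) -> None:
        # Storage class name -> fileset groups offered under that class.
        self.storage_classes: Dict[str, List[FilesetGroup]] = {}

    def add_group(self, storage_class: str, group: FilesetGroup) -> None:
        """Place a group in a storage class; a group may be added to several."""
        self.storage_classes.setdefault(storage_class, []).append(group)

    def request_storage(self, storage_class: str, size: int) -> Optional[FilesetGroup]:
        """Return the first group in the class with enough room, else None."""
        for group in self.storage_classes.get(storage_class, []):
            if group.free_bytes >= size:
                return group
        return None
```

Because the same `FilesetGroup` object can be registered under several class names, one group can be offered both as, say, a three-replica class and a slow-media class.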

Future work

  • How hard would it be to construct an archival filesystem to use as a fileset? What types of commit and integrity requirements exist?