Pawn:Xfdu Publishing
Project done for v.5 of PAWN
Overview
Until now, all digital collections submitted to PAWN have stayed within PAWN, waiting to be delivered to a designated repository. The purpose of PAWN is to act as a conduit for archiving digital content from various user machines to a particular destination for long-term storage. For the past three months we have been using the XFDU (XML Formatted Data Unit) tools developed by NASA to package the digital content and transmit it from PAWN to a specified destination. After several trials, we were able to embed collections into JAR (Java Archive) files along with an xfdumanifest.xml file that preserves the hierarchy of data within the collection.
Implementation
Digital collections are packaged using three interfaces and a factory class from the Adapt package:
1. ClientGuiPanel - provides an interface for user input before collections are archived by PAWN.
2. ConfigurationGuiPanel - provides an interface for user input when establishing Drivers via the PAWN Scheduler Administration Tool.
3. DataMover - filters metadata from each object in a collection and makes it available to developers for file manipulation.
4. SimpleResourceFactory - returns a bundle of class instances that implement the above three interfaces and integrates them into the archive functionality of PAWN.
The XFDU package containing the collections is created within the DataMover methods processObject() and endTransfer(). Every object in a collection is passed to processObject() together with an InputStream instance for the object. The object and any necessary content are stored in hashmaps for later use in endTransfer(), where the data is collected again and recorded in the manifest file of the XFDU package being built. The process works as follows:
1. The object and its metadata are sent to processObject().
2. The following objects are created from the XFDU tools:
- Data - the object type stored within the package's manifest file; a physical representation of the data to be stored. Stores only one DataObjectType object.
- DataObjectType - internal to a Data object; contains attributes such as the file size, checksum, checksum type, and bytestream information of the stored file or data. Capable of storing one or more ByteStreamType objects.
- ByteStreamType - contains either the raw data of the object or a file location (ReferenceType) referring to the object's location. Capable of storing one or more ReferenceType objects.
- ReferenceType - provides a file location tag pointing to the object's location.
3. The ReferenceType is set to point to the full file path of the data and is added to the ByteStreamType instance's collection. The ByteStreamType object is then added to the DataObjectType instance's collection. The DataObjectType object records the InputStream size, checksum, and checksum type of the data and is set as the Data object's internal content. At this point the data passed to processObject() has been wrapped in an XFDU Data object ready for storage.
4. The Data object created in step 3 is stored in a list so that it can be referenced later in endTransfer(). In addition, a hashmap stores the InputStream instance for each object. This map is used later to create the physical files stored in the jar file that represents the XFDU package.
5. Control passes to endTransfer() after all the objects in the collection have been processed. A Package instance is created, passed to an implementation of the Packager interface, and finally stored in the directory selected for storage (in our test cases, /tmp/pool-xfdu).
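The bookkeeping in steps 1 through 5 can be illustrated with a minimal sketch. It does not use the real XFDU classes: DataSketch is a hypothetical stand-in for the Data/DataObjectType/ByteStreamType/ReferenceType chain, and the method signatures of DataMoverSketch are assumptions, not the actual DataMover interface. The checksum algorithm (SHA-256) is also an assumption, since the type used in PAWN is not stated here. Because computing a checksum consumes the stream, the sketch buffers each object in memory and keeps a fresh stream over the buffered bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a wrapped XFDU Data object (not the real XFDU class).
class DataSketch {
    final String filePath;     // the location a ReferenceType would point to
    final long size;           // number of bytes read from the InputStream
    final String checksumType; // assumed to be SHA-256 in this sketch
    final String checksum;     // hex-encoded digest

    DataSketch(String filePath, long size, String checksumType, String checksum) {
        this.filePath = filePath;
        this.size = size;
        this.checksumType = checksumType;
        this.checksum = checksum;
    }
}

// Hypothetical mover; the real DataMover signatures may differ.
class DataMoverSketch {
    private final List<DataSketch> dataObjects = new ArrayList<>();      // step 4: list of wrapped objects
    private final Map<String, InputStream> streams = new HashMap<>();    // step 4: path -> content

    void processObject(String filePath, InputStream in)
            throws IOException, NoSuchAlgorithmException {
        // Buffer the content so it can be both measured and stored.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        byte[] bytes = buf.toByteArray();

        // Step 3: record size, checksum, and checksum type.
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(bytes)) {
            hex.append(String.format("%02x", b));
        }

        dataObjects.add(new DataSketch(filePath, bytes.length, "SHA-256", hex.toString()));
        streams.put(filePath, new ByteArrayInputStream(bytes));
    }

    // Step 5 would build and save the package; here we only expose the collected state.
    List<DataSketch> endTransfer() {
        return dataObjects;
    }

    Map<String, InputStream> streams() {
        return streams;
    }
}
```

Buffering whole objects in memory keeps the sketch short; a real mover handling large files would more likely spool to disk or compute the checksum while copying.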
The endTransfer() method is more complex: it uses several helper classes, hashmaps, lists, and recursion to nest DataObjectPointer instances within the package's xfdumanifest.xml file, which preserves the order of the original collection. The process is as follows:
1. The xfdumanifest.xml file is extracted from the newly created Package object.
2. An instance of a helper class we call XFDUPack is created and passed the collection of Data objects collected in processObject(). This helper class traverses the list, distinguishing between the different folders that appear in each Data object's file path. A map is created to pair each new directory path (a path discovered during the traversal) with the files contained in it. Once the map has been built, the helper class provides methods for retrieving the map and the list of directory keys that were created internally (since map objects do not keep pairs in insertion order, a list of keys is kept to maintain the order).
3. The ordered key list is traversed, and the map from the XFDUPack helper class is passed to a function that recursively converts each object in the map into a BasicContentInformation object while preserving the order of the collection. BasicContentInformation objects are used to pass the data to the InformationPackageMapType object stored within the xfdumanifest.xml file. The InformationPackageMapType object is the portion of the manifest file that contains the DataObjectPointerType instances, which map to the DataObjectType instances listed in the manifest file. In other words, the InformationPackageMapType shows how folders are nested within folders and where the Data files sit within those folders. This preserves the hierarchy of the original collection, so that if the collection had to be returned to its original form, this could be done by referring to the order in which the BasicContentInformation objects are stored.
4. A File instance is created for the jar file that will contain the XFDU package and is passed to a FileOutputStream instance. The Package instance created earlier is handed to a helper class we call XFDUPackager, which implements the Packager interface; that interface contains only two methods, save() and open(). The map storing the InputStream instances is passed to this helper class, and the Package and FileOutputStream instances are then passed as arguments to the save() method. Within save(), a JarOutputStream instance is created with the FileOutputStream as its argument, and a private method of the class is called to iterate through the InformationPackageMapType portion of the xfdumanifest.xml in order. As each ContentUnit (the unit storing the BasicContentInformation objects) is visited, a PackageTypeHandler is used to add the directory structure as an entry in the jar file. Each BasicContentInformation object is queried for its file path, which is used as a key into the map containing the InputStream instances so that the file can be physically created within its home directory.
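Steps 2 and 3 above can be sketched in a few lines. This is only an illustration under assumptions: DirectoryGrouperSketch and ContentUnitSketch are hypothetical stand-ins for XFDUPack and the XFDU content units, paths are assumed to be relative and '/'-separated, and the real classes hold XFDU objects rather than plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for XFDUPack: pairs each directory with its files and
// keeps a separate list recording the order in which directories were first seen.
class DirectoryGrouperSketch {
    final Map<String, List<String>> filesByDir = new HashMap<>();
    final List<String> dirOrder = new ArrayList<>();

    DirectoryGrouperSketch(List<String> filePaths) {
        for (String path : filePaths) {
            int slash = path.lastIndexOf('/');
            String dir = slash < 0 ? "" : path.substring(0, slash); // "" = top level
            List<String> files = filesByDir.get(dir);
            if (files == null) {
                files = new ArrayList<>();
                filesByDir.put(dir, files);
                dirOrder.add(dir); // HashMap has no order, so remember it here
            }
            files.add(path.substring(slash + 1));
        }
    }
}

// Hypothetical stand-in for a nested content unit (one per folder).
class ContentUnitSketch {
    final String name;
    final Map<String, ContentUnitSketch> children = new LinkedHashMap<>();
    final List<String> files = new ArrayList<>();

    ContentUnitSketch(String name) {
        this.name = name;
    }

    // Recursively walk parts[index..], creating nested folders as needed.
    ContentUnitSketch descend(String[] parts, int index) {
        if (index == parts.length) {
            return this;
        }
        ContentUnitSketch child = children.get(parts[index]);
        if (child == null) {
            child = new ContentUnitSketch(parts[index]);
            children.put(parts[index], child);
        }
        return child.descend(parts, index + 1);
    }

    // Build the nested structure by traversing the ordered key list.
    static ContentUnitSketch build(DirectoryGrouperSketch pack) {
        ContentUnitSketch root = new ContentUnitSketch("collection");
        for (String dir : pack.dirOrder) {
            String[] parts = dir.isEmpty() ? new String[0] : dir.split("/");
            root.descend(parts, 0).files.addAll(pack.filesByDir.get(dir));
        }
        return root;
    }

    // Render the hierarchy as indented text: files first, then subfolders.
    String render(int depth) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        StringBuilder out = new StringBuilder();
        out.append(indent).append(name).append("/\n");
        for (String f : files) {
            out.append(indent).append("  ").append(f).append('\n');
        }
        for (ContentUnitSketch c : children.values()) {
            out.append(c.render(depth + 1));
        }
        return out.toString();
    }
}
```

The separate key list mirrors the approach described in step 2. A java.util.LinkedHashMap would give the same insertion-ordered behavior in a single structure, which is what the sketch uses for the nested children.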
Finally the package is completed as a jar file.
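For readers unfamiliar with java.util.jar, the following sketch shows roughly what a save() method of this kind has to do: write the manifest, add a directory entry for each folder, and copy each InputStream into a file entry. JarPackagerSketch and its save() signature are hypothetical and are not the XFDUPackager class or the Packager interface. The sketch also assumes relative, '/'-separated paths and iterates the stream map directly instead of walking the InformationPackageMapType.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

// Hypothetical packager; only the java.util.jar calls are real API.
class JarPackagerSketch {
    static void save(String manifestXml, Map<String, InputStream> streams, OutputStream out)
            throws IOException {
        try (JarOutputStream jar = new JarOutputStream(out)) {
            // The manifest is written at the root of the jar (an assumption of this sketch).
            jar.putNextEntry(new JarEntry("xfdumanifest.xml"));
            jar.write(manifestXml.getBytes(StandardCharsets.UTF_8));
            jar.closeEntry();

            Set<String> dirsWritten = new HashSet<>();
            byte[] buf = new byte[8192];
            for (Map.Entry<String, InputStream> e : streams.entrySet()) {
                String path = e.getKey();

                // Add one directory entry (name ending in '/') per parent folder.
                int slash = path.indexOf('/');
                while (slash >= 0) {
                    String dir = path.substring(0, slash + 1);
                    if (dirsWritten.add(dir)) {
                        jar.putNextEntry(new JarEntry(dir));
                        jar.closeEntry();
                    }
                    slash = path.indexOf('/', slash + 1);
                }

                // Copy the object's content into its file entry.
                jar.putNextEntry(new JarEntry(path));
                try (InputStream in = e.getValue()) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        jar.write(buf, 0, n);
                    }
                }
                jar.closeEntry();
            }
        }
    }
}
```

Passing a java.util.LinkedHashMap as the stream map keeps the jar entries in the order the objects were added; with a plain HashMap the entry order is unspecified.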
Implementation Summary
Digital collections are processed in PAWN using a series of interfaces that take driver configurations and user specifications for data transmission. The final form of a collection is an XFDU package, which contains an xfdumanifest.xml file describing how the collection is ordered, followed by the physical content that makes up the collection. This is done by passing each item in a collection as raw data, which is turned into a physical file within the directory structure that its path name dictates. These populated directories are then stored within a jar file that represents the XFDU package containing the collection (other formats such as tar and zip will be used later). Each package contains an xfdumanifest.xml file, which holds a hierarchy of file pointers to references listed within the same file. These references in turn point to the physical files stored with the package. When the package is referenced, its xfdumanifest.xml file is retrieved and reviewed like an inventory list to show the collection(s) contained in the package.
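As an illustration of retrieving the inventory from a finished package, the sketch below opens a jar with the standard java.util.jar.JarFile class, reads the manifest text, and lists the stored files. PackageInspectorSketch is a hypothetical name, and the sketch assumes the manifest sits at the root of the jar and is UTF-8 encoded. It returns the manifest as raw text and does not interpret the XFDU schema.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical inspector; only the java.util.jar calls are real API.
class PackageInspectorSketch {
    // Return the raw text of xfdumanifest.xml from the package.
    static String readManifest(File packageFile) throws IOException {
        try (JarFile jar = new JarFile(packageFile)) {
            JarEntry entry = jar.getJarEntry("xfdumanifest.xml");
            if (entry == null) {
                throw new FileNotFoundException("no xfdumanifest.xml in " + packageFile);
            }
            try (InputStream in = jar.getInputStream(entry)) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    // List the physical files in the package, skipping the manifest and directory entries.
    static List<String> listContents(File packageFile) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(packageFile)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (!name.equals("xfdumanifest.xml") && !name.endsWith("/")) {
                    names.add(name);
                }
            }
        }
        return names;
    }
}
```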
Defaults for the package name and destination directory of each package can be set using the following configuration interface on the Producers side of the PAWN Scheduler Administration tool (Figure 1).
- Figure 1: XFDU Configuration Interface
The defaults specified in the configuration interface are passed to the client interface (Figure 2).
- Figure 2: XFDU Client Interface
The user is free to change the package name and destination before archiving the selected collection(s). Both interfaces offer a browse option for users who would rather navigate their local machine to find a destination directory than type the path in.
Additional Notes
We faced several challenges when using the XFDU package. Most of these were overcome and are documented under Comments, including several minor modifications to the XFDU package source code.
Library Mapping
Mapping of each XFDU-supplied jar to the PAWN library that contains it:
./lib/jaas.jar xfdu
./lib/namespace.jar javax.xml.bind
./lib/jaxb-xjc.jar javax.xml.bind
./lib/relaxngDatatype.jar javax.xml.bind
./lib/jconfig-2.5.jar xfdu
./lib/jta-spec1_0_1.jar xfdu
./lib/msv-20030225.jar xfdu
./lib/commons-lang-2.1.jar xfdu
./lib/providerutil.jar xfdu
./lib/commons-collections-3.1.jar xfdu
./lib/commons-digester.jar xfdu
./lib/jacksum-1.4.jar xfdu
./lib/jug-1.1.jar xfdu
./lib/jax-qname.jar javax.xml.bind
./lib/commons-collections.jar xfdu
./lib/commons-logging.jar org.apache.axis
./lib/dom4j-1.4.jar xfdu
./lib/commons-jxpath-1.2.jar xfdu
./lib/commons-beanutils.jar xfdu
./lib/mail.jar javax.mail
./lib/xsdlib.jar javax.xml.bind
./lib/commons-compress-SNAPSHOT.jar xfdu
./lib/commons-io-1.0.jar xfdu
./lib/maven-xsddoc-plugin-0.8.jar xfdu
./lib/jaxb-impl.jar javax.xml.bind
./lib/xalan-2.4.1.jar javax.xml.parsers
./lib/activation.jar javax.activation
./lib/jaxb-libs.jar javax.xml.bind
./lib/xmlParserAPIs-2.2.1.jar xfdu
./lib/commons-configuration-1.1.jar xfdu
./lib/jaxb-api.jar javax.xml.bind
./lib/xercesImpl-2.6.2.jar javax.xml.parsers
./lib/schematron.jar xfdu