During the past five years, we have done extensive work on long-term digital preservation and access, supported by NARA, NSF, the Library of Congress, and the Mellon Foundation. In this proposal, we plan to build on this experience by exploring novel approaches to the significantly more challenging issues of authenticity, tracking and monitoring services, information discovery and access, and evaluation strategies. To make the relationship between this proposal and our prior work clear, we review in this section the technologies that we developed under the main funded projects.
Transcontinental Persistent Archive Platform (TPAP) – NARA Project
This project, led by the San Diego Supercomputer Center (SDSC), is a collaboration among SDSC, the University of Maryland (UMD), and the National Archives (NARA). The goal of this project is to demonstrate support for the automation of archival processes through the use of data grid technology, which includes developing and testing technologies on a distributed prototype spanning the three main sites (NARA, SDSC, and UMD). We have made major contributions to building and testing the prototype infrastructure and have developed a number of significant technologies. Here we describe the following two technologies developed by our group:
Producer – Archive Workflow Network (PAWN): A mature software platform that flexibly implements both centralized and distributed ingestion and processing workflows. It is built on an infrastructure-independent, scalable, secure, and reliable architecture. Working with NARA, we developed several versions of the software, which demonstrated how current NARA practices could be represented in a distributed platform. Specifically, PAWN demonstrated the following capabilities:
- Approval and signature gathering that can be customized to the ingest requirements of a record schedule.
- Customizable user roles, rather than locking users into predefined roles.
- Presentation of Record Schedules to end users through easy-to-understand templates.
- Automated and manual process chains that can be defined and executed on data in PAWN, and customized to record set and template requirements.
PAWN has been used extensively on the TPAP grid for ingesting and organizing a wide variety of records, including NARA and SLAC records. Although PAWN was developed specifically to model NARA workflows, many groups have expressed interest in it, including the Library of Congress and several NDIIPP partners.
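The process chains described above can be illustrated with a minimal Python sketch. It is not PAWN code: the names `Step`, `ProcessChain`, `CHAINS`, and the record fields are hypothetical, chosen only to show how automated and manual steps might be mixed and how a different chain could be selected per record set.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Step:
    """One step in a chain (hypothetical model, not PAWN's own classes)."""
    name: str
    action: Callable[[dict], dict]
    manual: bool = False  # manual steps wait until someone approves them


@dataclass
class ProcessChain:
    steps: List[Step] = field(default_factory=list)

    def run(self, record: dict, approvals: Set[str]) -> dict:
        """Run steps in order; stop at the first unapproved manual step."""
        record = dict(record)  # work on a copy, leave the input untouched
        for step in self.steps:
            if step.manual and step.name not in approvals:
                record["status"] = "waiting:" + step.name
                return record
            record = step.action(record)
        record["status"] = "complete"
        return record


def add_checksum(record: dict) -> dict:
    """Automated step: attach a SHA-256 digest of the payload."""
    out = dict(record)
    out["sha256"] = hashlib.sha256(record["payload"]).hexdigest()
    return out


def record_signoff(record: dict) -> dict:
    """Manual step: runs only after an approver has signed off."""
    out = dict(record)
    out["signed_off"] = True
    return out


# Assumed configuration: one chain per record set or template.
CHAINS: Dict[str, ProcessChain] = {
    "default": ProcessChain([Step("checksum", add_checksum)]),
    "signed": ProcessChain([
        Step("checksum", add_checksum),
        Step("archivist_signoff", record_signoff, manual=True),
    ]),
}
```

Calling `CHAINS["signed"].run(record, set())` stops at the sign-off step, while passing `{"archivist_signoff"}` as the approvals lets the chain run to completion.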
SRB Replication Monitoring: We developed a web portal to audit the state of replication for various collections in TPAP. The portal gives a high-level overview of a collection’s status across several SRB zones. In addition, it periodically checks collections to ensure that new files are replicated to the relevant sites. The portal provides the following functionality:
- Tracks all files at a master location and periodically copies new files to replica sites.
- Tracks the state of each replica file and site.
- Logs any actions taken on a collection (replica created, new file found, etc.) and any errors that occurred during replication.
- Provides a web front end to the replica state of various collections.
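One pass of such a replication audit might look like the following Python sketch. It makes no SRB calls: plain dictionaries stand in for the master location and the replica sites, and the function name `audit_collection`, the site names, and the log messages are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Maps a file path to its checksum. A stand-in for one site's catalog.
Catalog = Dict[str, str]


def audit_collection(master: Catalog,
                     replicas: Dict[str, Catalog]) -> List[Tuple[str, str, str]]:
    """Compare each replica site to the master and log what was done.

    Missing files are "copied" (here, just recorded in the replica
    catalog); files whose checksum differs are logged as errors.
    """
    log: List[Tuple[str, str, str]] = []
    for site in sorted(replicas):
        files = replicas[site]
        for path in sorted(master):
            if path not in files:
                files[path] = master[path]  # stand-in for a real copy
                log.append((site, path, "replica created"))
            elif files[path] != master[path]:
                log.append((site, path, "error: checksum mismatch"))
    return log


def site_status(master: Catalog, replica: Catalog) -> str:
    """High-level state of one site, as a portal might display it."""
    ok = all(replica.get(path) == digest for path, digest in master.items())
    return "in sync" if ok else "needs attention"
```

Running `audit_collection` on a schedule and storing its returned log would cover the tracking, copying, and logging items in the list; `site_status` corresponds to the overview a web front end could show.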
DIGARCH Project: Robust Technologies for Automated Ingestion and Long-Term Preservation of Digital Information
Supported by NSF and the Library of Congress
The main goal of this project has been the development of advanced technologies for preparing, maintaining, and archiving digital information for long-term preservation, including the testing of these technologies on scientific, historical, and educational collections with widely different technical requirements. Main results include:
Methodologies to Ensure the Bit Integrity of Digital Archives. We have developed a software platform, called ACE (Auditing Control Environment), to ensure the bit-level integrity of digital objects. This software uses cryptographic techniques, specifically linked hashing, to maintain and continually verify the integrity of archived objects. ACE is platform independent, scalable to very large centralized or distributed archives, and cost-effective.
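The idea behind linked hashing can be shown with a generic hash chain in Python. This is a sketch of the general technique only, not ACE's actual scheme, which is not detailed here; the choice of SHA-256, the all-zero starting value `GENESIS`, and the function names are assumptions.

```python
import hashlib
from typing import List

GENESIS = b"\x00" * 32  # assumed starting value for the chain


def link(prev: bytes, data: bytes) -> bytes:
    """Hash the previous link together with the digest of this object."""
    digest = hashlib.sha256(data).digest()
    return hashlib.sha256(prev + digest).digest()


def build_chain(objects: List[bytes]) -> List[bytes]:
    """Return one link per object; each link depends on all earlier ones."""
    links: List[bytes] = []
    prev = GENESIS
    for obj in objects:
        prev = link(prev, obj)
        links.append(prev)
    return links


def verify_chain(objects: List[bytes], links: List[bytes]) -> int:
    """Return the index of the first object that fails, or -1 if all pass."""
    if len(objects) != len(links):
        raise ValueError("objects and links must have the same length")
    prev = GENESIS
    for i, (obj, expected) in enumerate(zip(objects, links)):
        if link(prev, obj) != expected:
            return i
        prev = expected
    return -1
```

Because each link folds in the previous one, altering a stored object or an earlier link without detection would require recomputing every later link, which is what makes periodic re-verification against the stored links meaningful.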
Registry of Persistent Information About Formats. Another challenging problem facing long-term digital archives is how to handle format obsolescence. Using well-proven, scalable, and secure web technologies, we have built a persistent format registry, called FOCUS (FOrmat CUration Service), which contains persistent information about each format, including its source and specifications, currently available viewers and their sources, and conversion software. FOCUS has been demonstrated at a number of conferences and workshops and has been well received by the community.
The International Children's Digital Library (ICDL) Book Builder. We created the Book Builder interface, which enables the bulk ingestion of digital objects that are already managed by a relational database. The Book Builder was used to process large collections of books from the ICDL through PAWN. This work showed how the flexibility required of PAWN for TPAP could be applied to other collections. Another extension of PAWN adds a gateway for the Dublin Core metadata standard alongside its native flat file format.
Development of an Implementation Plan to Port the ACE Software to Portico and DSpace
Supported by the Mellon Foundation
The main goal of this grant is to develop detailed implementation plans for porting ACE, our integrity checking and auditing tool, to two major and very different archiving architectures: Portico and DSpace. We have developed detailed implementation plans for both platforms and, in particular, have worked with Portico staff to ensure that ACE can meet their requirements (ingestion of 1 million files per day). In the meantime, we developed a preliminary implementation of ACE on Portico and ran some initial tests on site.
Chronopolis – Supported by the Library of Congress
The University of Maryland is collaborating with SDSC/UCSD Libraries and the National Center for Atmospheric Research (NCAR) to build the Chronopolis digital preservation environment, based on the SRB, and to ingest and preserve substantial collections from the two NDIIPP partners: California Digital Library (CDL) Web-at-Risk and the Inter-university Consortium for Political and Social Research (ICPSR) DATA-PASS. The University of Maryland will use some of the software tools developed under the TPAP project, such as the SRB Replication Monitor, to support Chronopolis. In addition, the University of Maryland will expand and support its Chronopolis grid brick to provide up to 50 TB of additional disk storage for this project, and will manage its node in the federated repositories in close coordination with SDSC and NCAR.