Webarc:Main

From Adapt

Revision as of 20:50, 12 September 2008 by Scsong (talk | contribs)

In the era of digital information, efforts to preserve the record of human activity have broadened to include documents, images, audio, and video in digital form. An unprecedented amount of information, covering almost every facet of human activity across the world, now exists as zeros and ones and is growing at an extremely fast pace. Moreover, the digital representation is often the only form in which such information is recorded.

However, the long-term preservation of digital information involves many complex issues. Digital information is generally fragile over the long term and susceptible to various threats. For example, storage media degrade over time, potentially causing random bit errors. Hardware and software failures can also lead to data loss. Technology changes render current hardware, software, and digital formats obsolete. Malicious attacks on computers and networks pose a further threat, and accidental operational errors cannot be overlooked either.

Moreover, many digital objects face another set of challenges. They are updated dynamically at unknown frequencies, and they are often interrelated, with temporal dependencies among them. For these objects, the data collection process in an archive needs to determine whether an update has occurred whenever it encounters a previously archived object. Otherwise, the archive can waste a significant amount of storage space by storing duplicate copies over and over again. An archive also needs to be aware of the links among the archived objects in order to organize and manage its holdings better, which can in turn improve access performance. For example, in a system where small objects are packaged together in containers and accesses are made on a per-container basis, placing heavily interlinked objects in the same container can greatly improve overall access speed.
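The update check described above can be illustrated with a content digest. The following is only a minimal sketch under stated assumptions, not the method developed in this research: the `Archive` class, its `ingest` method, and the in-memory `versions` dictionary are hypothetical names introduced for illustration, and SHA-256 is simply one reasonable choice of digest.

```python
import hashlib


class Archive:
    """Toy in-memory archive keyed by an object identifier (for example, a URL).

    Hypothetical illustration only: a real archive would persist its holdings
    and would likely record capture times alongside each version.
    """

    def __init__(self):
        # identifier -> list of (digest, content) pairs, oldest first
        self.versions = {}

    def ingest(self, identifier: str, content: bytes) -> bool:
        """Store content only if it differs from the latest archived version.

        Returns True when a new version was stored, and False when the
        object was unchanged and the duplicate copy was skipped.
        """
        digest = hashlib.sha256(content).hexdigest()
        history = self.versions.setdefault(identifier, [])
        if history and history[-1][0] == digest:
            return False  # unchanged since the last capture: no new copy
        history.append((digest, content))
        return True


archive = Archive()
archive.ingest("http://example.org/page", b"first capture")   # stored
archive.ingest("http://example.org/page", b"first capture")   # skipped
archive.ingest("http://example.org/page", b"second capture")  # stored
```

The sketch compares only against the most recent version, so an object that reverts to earlier content is stored again. That keeps the temporal record intact at the cost of some duplication, which is a design choice rather than a requirement.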

Another important issue in long-term preservation pertains to the discovery and delivery of the preserved contents. In essence, the major purpose of preservation is to make the preserved knowledge available to the users who will need it in the future. It is thus vital for any preservation system to provide an easy way to find and access the relevant contents. However, providing an effective yet economical method of finding the requested information is not trivial, mainly because of the large and ever-growing size of the preserved data. Preservation systems that rely solely on a relational database with well-defined schemas may allow their users to find information more easily using well-structured queries. However, fitting every type of digital object into a fixed set of schemas is often impossible. Clearly, we need a more general framework to enable effective information discovery and access to the archived contents.

In this research, we address three important problems in long-term preservation. First, we devise a preliminary methodology to ensure the authenticity of the preserved contents over the long term. Second, we devise a methodology to efficiently store and index interrelated objects. Third, we devise a methodology to discover requested information in the preserved contents. Although traditional methods have played a major role in the past, they cannot be applied directly to the long-term preservation of digital objects. To the best of our knowledge, no previous research has addressed, in any serious or comprehensive manner, long-term preservation mechanisms that can proactively ensure the authenticity of the preserved contents. Nor has any previous work dealt extensively with a methodology that can efficiently handle interrelated objects and their temporal relationships for storage and indexing in a long-term archive. Furthermore, although many access schemes have been developed for preservation systems, most of them lack convenience, effectiveness, or both.