==Overview==
<center>[[Image:web.jpg]]</center>


An unprecedented amount of information encompassing almost every facet of human activity across the world is currently available on the web, and it is growing at an extremely fast pace. In many cases, the web is the only medium where such information is recorded. However, the web is an ephemeral medium whose contents are constantly changing; new information rapidly replaces old information, so a large number of web pages disappear every day and part of our cultural and scientific heritage is permanently lost on a regular basis. A number of efforts currently underway are trying to develop methodologies and tools for capturing and archiving those web contents that are deemed critical. However, these efforts face major technical, social, and political challenges. The technical challenges include automatic tools to identify, find, and collect web contents to be archived; automatic extraction of metadata and context for such contents, including the linking structures inherent to the web; the organization and indexing of the data and metadata; and the development of preservation and access mechanisms for current and future users, all at unprecedented scale and complexity.


Leaving aside dynamic and deep contents, web contents involve a wide variety of objects, such as HTML pages, documents, multimedia files, and scripts, as well as the linking structures connecting these objects. While most individual web pages are small, the total number of web pages on a single web site can range from one to several million. For example, as of October 30, 2006, Wikipedia.org alone claims to have about 1.4 million articles, each making up a distinct web page. A critical piece of web archiving is to capture the linking structures and organize the archived pages in such a way that future generations of users will be able to access and navigate the archived web information through the original linked structure. Note that by then, the archived web contents may have migrated through several generations of hardware and software upgrades, including migration across different types of media, different file systems, and different formats.


While many challenges for archiving web contents exist, our work focuses on ''scalable'' solutions supporting ''compact storage'' and ''fast access'' for large-scale web archives. Our efforts center on answering the following two questions:


# How do we compactly store contents in a way that still allows quick retrieval?
 
# How do we enable effective information discovery and access to the archived contents?


To address the first question, we devise methodologies to package tightly related objects together [[Webarc:Packaging|[1]]] and to index their locations temporally [[Webarc:PISA|[2]]]. Using the fast location index [[Webarc:PISA|[2]]], we also design a quick duplicate detection scheme that keeps the archive storage compact. As a first step toward answering the second question, we devise a temporal text-search index [[Webarc:Temporal Text-Search Index|[3]]] using an idea similar to the one behind the fast location index.
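The duplicate detection idea can be illustrated with a small sketch (illustrative Python only; the class and function names are hypothetical stand-ins, and the real PISA structure is a persistent on-disk index, not an in-memory dictionary). A temporal location index keeps, per URL, a time-ordered list of (timestamp, content hash, location) records; a newly crawled object is stored only when its hash differs from the most recent archived version:

```python
import hashlib

class TemporalLocationIndex:
    """Maps URL -> time-ordered list of (timestamp, content_hash, location).

    Sketch of a temporal location index: lookups by URL return the
    archived versions in acquisition order.
    """
    def __init__(self):
        self._index = {}  # url -> [(timestamp, content_hash, location), ...]

    def latest(self, url):
        """Return the most recent record for `url`, or None."""
        versions = self._index.get(url)
        return versions[-1] if versions else None

    def add(self, url, timestamp, content_hash, location):
        self._index.setdefault(url, []).append((timestamp, content_hash, location))

def ingest(index, url, timestamp, content, store):
    """Store `content` only if it differs from the latest archived copy.

    `store` is the archive's writer; it returns the location (for
    example, a container id and offset) of the stored object.
    """
    digest = hashlib.sha256(content).hexdigest()
    last = index.latest(url)
    if last is not None and last[1] == digest:
        return False  # duplicate of the latest version: skip storing it
    location = store(url, timestamp, content)
    index.add(url, timestamp, digest, location)
    return True
```

Skipping the store call for unchanged pages is where the storage savings come from: a page recrawled often but updated rarely is stored only once per actual change.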
 


==Our Approaches==
* [[Webarc:Packaging|[1] Web Container Packaging]]
* [[Webarc:PISA|[2] PISA:Persistent Indexing Structure for Archives]]
* [[Webarc:Search and Access Strategies |[3] Search and Access Strategies]]
<!-- * [[Webarc:Temporal Scoring |[4] Temporal Scoring]] -->
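As a rough illustration of the container packaging idea in [[Webarc:Packaging|[1]]] (a toy sketch, not the project's actual link-analysis and graph-partitioning algorithm): a breadth-first walk over the hyperlink graph greedily fills each fixed-capacity container with closely linked objects, so that a browsing session following links tends to touch few containers:

```python
from collections import deque

def pack_into_containers(links, capacity):
    """Greedily pack hyperlinked objects into fixed-capacity containers.

    `links` maps an object id to the ids it links to. A breadth-first
    walk from each still-unassigned object fills the current container
    with closely linked neighbors, approximating the goal of minimizing
    the number of containers touched during a browsing session.
    (Toy stand-in for a real graph partitioner.)
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    assignment = {}   # object id -> container index
    containers = []   # container index -> list of object ids
    for start in sorted(nodes):
        if start in assignment:
            continue
        containers.append([])
        cid = len(containers) - 1
        queue = deque([start])
        while queue and len(containers[cid]) < capacity:
            obj = queue.popleft()
            if obj in assignment:
                continue
            assignment[obj] = cid
            containers[cid].append(obj)
            queue.extend(t for t in links.get(obj, ()) if t not in assignment)
    return containers
```

A production system would instead feed the link graph to a proper graph partitioner, but the goal is the same: co-locate objects that are likely to be accessed together, in contrast to packaging objects on a first-come-first-served crawl order.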

Latest revision as of 20:23, 24 November 2009
