Webarc:Main

Revision as of 16:23, 24 November 2009 by Scsong

Overview

An unprecedented amount of information, encompassing almost every facet of human activity across the world, is currently available on the web and is growing at an extremely fast pace. In many cases, the web is the only medium where such information is recorded. However, the web is an ephemeral medium: its contents change constantly, and new information rapidly replaces old information. As a result, a large number of web pages disappear every day, and part of our cultural and scientific heritage is permanently lost on a regular basis. A number of efforts currently underway are trying to develop methodologies and tools for capturing and archiving those web contents that are deemed critical. However, these efforts face major technical, social, and political challenges. The major technical challenges include building automatic tools to identify, find, and collect the web contents to be archived; automatically extracting metadata and context for such contents, including the linking structures that are inherent to the web; organizing and indexing the data and the metadata; and developing preservation and access mechanisms for current and future users, all at unprecedented scale and complexity.

Leaving aside dynamic and deep contents, web contents involve a wide variety of objects such as HTML pages, documents, multimedia files, and scripts, as well as the linking structures involving these objects. While most web pages are small, the total number of web pages on a single web site can range from one to several million. For example, as of Oct 30, 2006, Wikipedia.org alone claims to have about 1.4 million articles, each making up a distinct web page. A critical piece of web archiving is to capture the linking structures and organize the archived pages in such a way that future generations of users will be able to access and navigate the archived web information in the same way as in the original linked structure. Note that by then, the archived web contents may have migrated through several generations of hardware and software upgrades, including migration across different types of media, different file systems, and different formats.

While many challenges for archiving web contents exist, our work focuses on scalable solutions that support compact storage of, and fast access to, large-scale web archives. Our efforts toward this goal center on answering the following two questions:

  1. How do we store contents compactly while still being able to retrieve them quickly?
  2. How do we enable effective information discovery and access to the archived contents?

To address the first question, we devise methodologies to group tightly related objects together [1] and to index their locations temporally [2]. Using the fast location index [2], we also design a quick duplicate detection scheme that keeps the archive storage compact. As a first step toward answering the second question, we devise a temporal text-search index [3] based on an idea similar to the one behind the fast location index.
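To make the idea of a temporal location index with duplicate detection more concrete, here is a minimal in-memory sketch in Python. It is not the scheme described in [2]. The class name `TemporalLocationIndex`, its methods, the SHA-256 content digest, and the rule of comparing a new capture only against the most recent stored version of the same URL are all assumptions made for illustration.

```python
import bisect
import hashlib


class TemporalLocationIndex:
    """Hypothetical sketch: maps (url, time) to the location of the stored version."""

    def __init__(self):
        self._times = {}    # url -> sorted list of capture timestamps
        self._entries = {}  # url -> list of (digest, location), parallel to _times
        self._store = []    # stand-in for the archive storage

    def add(self, url, timestamp, content):
        """Archive a capture. Returns False if it duplicates the latest stored version."""
        digest = hashlib.sha256(content).hexdigest()
        times = self._times.setdefault(url, [])
        entries = self._entries.setdefault(url, [])
        if times and timestamp < times[-1]:
            # Assumption: captures of a given URL arrive in time order.
            raise ValueError("captures must be added in time order")
        if entries and entries[-1][0] == digest:
            return False  # unchanged since the last stored version, so skip storing
        self._store.append(content)
        times.append(timestamp)
        entries.append((digest, len(self._store) - 1))
        return True

    def lookup(self, url, timestamp):
        """Return the content that was current for url at timestamp, or None."""
        times = self._times.get(url, [])
        i = bisect.bisect_right(times, timestamp)
        if i == 0:
            return None  # no version stored at or before this time
        _, location = self._entries[url][i - 1]
        return self._store[location]

    def stored_count(self):
        """Number of distinct versions actually kept in storage."""
        return len(self._store)
```

In this sketch, an unchanged re-crawl of a page costs one digest comparison and no extra storage, and a lookup for any point in time resolves with a binary search over that URL's stored capture times.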

Our Approaches