Webarc:Search and Access Strategies

Background

  • The Web has become the main publication medium world-wide, covering almost every facet of human activity. However, the Web is an ephemeral medium.
  • Web archives offer unique opportunities for knowledge discovery due to the richness of their contents extending over significant periods of time.
  • We need effective and scalable access strategies for web archives covering significant temporal spans.

Goals

  • An effective technology for exploring, searching, and discovering information in a web archive.
  • A framework that ranks and evaluates archived web contents within a temporal context specified by the user.
  • Methods to determine the relevance of a group of web objects within the temporal context. A group can consist of a series of temporally-contiguous versions of a single URL, or of web objects archived within some time span.
  • A framework that allows effective search of large-scale web archives using keywords and time spans.

Existing Access Methods

Existing methods.gif

Problems With Existing Methods

  • Inefficient handling of time-constrained search.

Inefficient temporal handling.gif

  • Ineffective delivery of search results
    • Inadequate relevancy scoring.
      • Scoring is performed over the entire history.
    • Ungrouped search results.
      • A URL is not unique in a web archive: the same URL has different versions captured at different times.
      • Because different versions of the same URL tend to have similar contents, the first result page is highly likely to be “polluted” by multiple versions of the same URL.
      • Users may want to focus on a specific time period within the results.
    • Lack of a group-scoring methodology.
      • Without a group-scoring methodology, it is unclear which group to show at the top.

Overview of Our Approach

  • Efficient time-constrained search by maintaining a separate set of inverted lists for each time window.
  • Scoring within a temporal context by computing term weights as a function of time.
  • Grouping similar search results and scoring each group as a whole.


Basic Techniques

  • Determine a snapshot of web contents covering a time window

SC_k = { all web objects valid within the time interval [t_k, t_{k+1}) }

  • Given the SCs, we maintain a separate set of inverted lists for each SC and build either a time hierarchy or a combined multi-version tree.

Temporal index.gif
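
The per-window inverted lists described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the class name WindowedIndex, its methods, and the use of plain numbers as timestamps are all assumptions, and neither the time hierarchy nor the multi-version tree is modeled. Results are returned at window granularity, so a hit is valid somewhere in an overlapping window rather than necessarily inside the exact query span.

```python
from bisect import bisect_right
from collections import defaultdict


class WindowedIndex:
    """Hypothetical sketch: one set of inverted lists per window [t_k, t_{k+1})."""

    def __init__(self, boundaries):
        # boundaries = [t_0, t_1, ..., t_n], strictly increasing.
        self.boundaries = list(boundaries)
        self.lists = [defaultdict(set) for _ in range(len(self.boundaries) - 1)]

    def _windows(self, start, end):
        """Indices of all windows that overlap the half-open interval [start, end)."""
        b = self.boundaries
        first = max(bisect_right(b, start) - 1, 0)
        return [k for k in range(first, len(b) - 1) if b[k] < end and b[k + 1] > start]

    def add(self, doc_id, terms, valid_from, valid_to):
        # A version valid during [valid_from, valid_to) joins every snapshot it overlaps.
        for k in self._windows(valid_from, valid_to):
            for term in terms:
                self.lists[k][term].add(doc_id)

    def search(self, term, start, end):
        # Only the inverted lists of the overlapping windows are read.
        hits = set()
        for k in self._windows(start, end):
            hits |= self.lists[k].get(term, set())
        return hits


index = WindowedIndex([2000, 2003, 2006, 2009])
index.add("umd.edu@2001", ["maryland", "campus"], 2001, 2004)
index.add("umd.edu@2004", ["maryland", "archive"], 2004, 2008)
early = index.search("maryland", 2000, 2003)
late = index.search("maryland", 2006, 2009)
```

A query restricted to a time span touches only the lists of the windows it overlaps, which is what avoids scanning the entire history for a time-constrained search.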

Scoring within a Temporal Context

Temporal scoring.gif
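
One way to read "term weights as a function of time" is to compute collection statistics only over the versions valid in the query's time span. The sketch below assumes that reading; the smoothed idf formula, the function names temporal_idf and temporal_score, and the dictionary layout of a document are illustrative assumptions rather than the formula shown in the figure.

```python
import math


def temporal_idf(term, docs, start, end):
    """idf of `term` computed only over versions valid within [start, end)."""
    in_window = [d for d in docs if d["valid_from"] < end and d["valid_to"] > start]
    df = sum(1 for d in in_window if term in d["terms"])
    # Smoothed so that a term absent from the window does not divide by zero.
    return math.log((1 + len(in_window)) / (1 + df)) + 1.0


def temporal_score(query_terms, doc, docs, start, end):
    """tf-idf style score whose idf part depends on the time window."""
    return sum(
        doc["terms"].count(t) * temporal_idf(t, docs, start, end)
        for t in query_terms
    )


docs = [
    {"terms": ["election", "maryland"], "valid_from": 2000, "valid_to": 2003},
    {"terms": ["election"], "valid_from": 2000, "valid_to": 2003},
    {"terms": ["maryland"], "valid_from": 2005, "valid_to": 2008},
    {"terms": ["sports"], "valid_from": 2005, "valid_to": 2008},
]
idf_early = temporal_idf("election", docs, 2000, 2003)
idf_late = temporal_idf("election", docs, 2005, 2008)
```

In this toy collection "election" appears in every version from the early window and in none from the late one, so its weight is lower in the early window than in the late one. A single idf computed over the entire history would hide that difference.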

Search User Interface

Search maryland 5 7.jpg Search airpollution.jpg

Grouping Search Results

Grouping search results.gif
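
Grouping by URL can be sketched as below, following the earlier description of a group as a series of temporally contiguous versions of a single URL. The hit layout (url, timestamp, score) and the max_gap threshold that decides when two versions count as contiguous are assumptions made for illustration.

```python
from itertools import groupby


def group_by_url(hits, max_gap):
    """Group (url, timestamp, score) hits into runs of contiguous versions per URL."""
    groups = []
    ordered = sorted(hits, key=lambda h: (h[0], h[1]))
    for _url, versions in groupby(ordered, key=lambda h: h[0]):
        current = []
        for hit in versions:
            # Start a new group when the gap to the previous version exceeds max_gap.
            if current and hit[1] - current[-1][1] > max_gap:
                groups.append(current)
                current = []
            current.append(hit)
        groups.append(current)
    return groups


hits = [
    ("a.example", 2001, 0.9),
    ("b.example", 2001, 0.5),
    ("a.example", 2002, 0.8),
    ("a.example", 2007, 0.7),
]
groups = group_by_url(hits, max_gap=1)
```

With these inputs the two early versions of a.example collapse into one group, its 2007 version forms a second group, and b.example forms a third, so a result page would show three entries instead of four.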

Group-wide Scoring

Grouping helps, but it raises a new question: which group should be placed first on the result page?

  • Simple method: use the average or the highest score among the group's members.
  • More effective method: compute a relevancy score for the group as a whole.
    • Instead of tf(t), we use df(t), the document frequency of t in the group.
    • Instead of idf(t), we use igf(t), the inverse group frequency of t.

We extend some of the best-known IR techniques to support group ranking.
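
The substitution of df(t) for tf(t) and igf(t) for idf(t) can be sketched as a tf-idf analogue at the group level. The logarithmic form of igf, the +1.0 offset, and the representation of each document as a set of terms are assumptions chosen for illustration; the source does not give the exact formula.

```python
import math


def group_score(query_terms, group, all_groups):
    """Score a group as a unit: df(t) within the group times igf(t) across groups."""
    score = 0.0
    for t in query_terms:
        # df(t): number of member documents in this group that contain t.
        df = sum(1 for doc in group if t in doc)
        # gf(t): number of groups with at least one member containing t.
        gf = sum(1 for g in all_groups if any(t in doc for doc in g))
        if gf:
            # igf(t): assumed log form, analogous to a common idf variant.
            igf = math.log(len(all_groups) / gf) + 1.0
            score += df * igf
    return score


g1 = [{"air", "pollution"}, {"air", "pollution"}, {"air"}]
g2 = [{"air"}]
g3 = [{"weather"}]
all_groups = [g1, g2, g3]
query = ["air", "pollution"]
ranked = sorted(all_groups, key=lambda g: group_score(query, g, all_groups), reverse=True)
```

Here g1 ranks first because both query terms occur in several of its members, g2 follows with a single matching member, and g3 scores zero. Note that df(t) is used unnormalized, so larger groups tend to score higher; whether to normalize by group size is a design choice this sketch leaves open.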


Publication

Song, S. and JaJa, J. Search and Access Strategies for Web Archives. In Archiving 2009. IS&T, 2009.