Webarc:Search and Access Strategies

Revision as of 19:22, 12 May 2009

Background

  • The Web has become the main publication medium world-wide, covering almost every facet of human activity. However, the Web is an ephemeral medium.
  • Web archives offer unique opportunities for knowledge discovery due to the richness of their contents extending over significant periods of time.
  • We need effective and scalable access strategies for web archives covering significant temporal spans.

Goals

  • An effective technology for exploring, searching, and discovering information in a web archive.
  • A framework that produces ranking and evaluation of archived web contents within a temporal context as required by the user.
  • Methods to determine the relevance of a group of web objects within the temporal context. A group can consist of a series of temporally-contiguous versions of a single URL, or of web objects archived within some time span.
  • A framework that allows effective search using keywords and time spans for large scale web archives.

Existing Access Methods

  • Chronological Listing

Chronological listing.jpg

  • Directory

Directory.jpg

  • Text-Search

Text-search.jpg

  • Hybrid

Hybrid.jpg

Problems With Existing Methods

  • Inefficient handling of time-constrained search.

Inefficient temporal handling.png

  • Ineffective delivery of search results
    • Inadequate relevancy scoring.
      • Scoring is performed over the entire history.
    • Ungrouped search results.
      • A URL is not unique in a web archive: each archived version is time-dependent.
      • Because different versions of the same URL tend to have similar contents, the first result page is highly likely to be “polluted” by multiple versions of the same URL.
      • Users may want to focus on a specific time period within the results.
    • Lack of a group-scoring methodology.
      • Without a group-scoring methodology, it is unclear which group to show at the top.

Overview of Our Approach

  • Efficient time-constrained search by maintaining separate inverted lists for each time window. See Basic Techniques for details.
  • Scoring within a temporal context by computing term weights as a function of time. See Scoring within a Temporal Context for details.
  • Grouping similar search results, and scoring each group as a whole. See Grouping Search Results and Group-wide Scoring for details.


Basic Techniques

  • Determine a snapshot of web contents covering a time window

SC_k = { all web objects valid within the time interval [t_k, t_{k+1}) }

  • Given SCs, we maintain a set of inverted lists for each SC, and build a time hierarchy or a combined multi-version tree.

File:Temporal index.png
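
The per-window inverted lists described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name TimeWindowIndex, its methods, and the use of plain integers as timestamps are all assumptions made for the example, and the time hierarchy / multi-version tree is omitted.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class TimeWindowIndex:
    """Hypothetical sketch: one set of inverted lists per snapshot SC_k."""

    def __init__(self, boundaries):
        # boundaries = [t_0, t_1, ..., t_n]; window k covers [t_k, t_{k+1}).
        self.boundaries = list(boundaries)
        self.windows = [defaultdict(set) for _ in range(len(self.boundaries) - 1)]
        self.validity = {}  # doc_id -> (valid_from, valid_to)

    def _window_range(self, start, end):
        # Indices of the windows that overlap the half-open interval [start, end).
        first = max(bisect_right(self.boundaries, start) - 1, 0)
        last = min(bisect_left(self.boundaries, end), len(self.windows))
        return range(first, last)

    def add(self, doc_id, terms, valid_from, valid_to):
        # Post the object into every window in which it is valid.
        self.validity[doc_id] = (valid_from, valid_to)
        for k in self._window_range(valid_from, valid_to):
            for term in terms:
                self.windows[k][term].add(doc_id)

    def search(self, term, start, end):
        # Only the inverted lists of overlapping windows are read.
        candidates = set()
        for k in self._window_range(start, end):
            candidates |= self.windows[k].get(term, set())
        # Windows are coarse, so filter by each object's exact validity interval.
        return {
            d for d in candidates
            if self.validity[d][0] < end and start < self.validity[d][1]
        }


index = TimeWindowIndex([0, 10, 20, 30])
index.add("a@v1", ["maryland", "air"], 0, 12)
index.add("a@v2", ["maryland"], 12, 30)
index.add("b@v1", ["pollution"], 22, 30)
```

A time-constrained query then touches only the lists for the windows it overlaps, rather than one list spanning the entire archive history.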

Scoring within a Temporal Context

File:Temporal scoring.png
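
The overview describes temporal scoring as computing term weights as a function of time. One possible reading, sketched below, is to compute a term's inverse document frequency only over the objects valid in the queried time span, so that a term's weight reflects how rare it was during that period rather than over the whole archive. The function name temporal_idf, the dictionary layout of each document, and the smoothed idf formula are assumptions for illustration; the page does not state the exact weighting function.

```python
import math


def temporal_idf(term, docs, start, end):
    """Smoothed idf of `term` over documents valid within [start, end).

    Each document is a dict with hypothetical keys "from", "to", and "terms".
    """
    in_window = [d for d in docs if d["from"] < end and start < d["to"]]
    n = len(in_window)
    df = sum(1 for d in in_window if term in d["terms"])
    # Assumed smoothing: ln((n + 1) / (df + 1)) + 1, which avoids division by zero.
    return math.log((n + 1) / (df + 1)) + 1


docs = [
    {"from": 0, "to": 10, "terms": {"archive", "web"}},
    {"from": 0, "to": 10, "terms": {"web"}},
    {"from": 0, "to": 10, "terms": {"web"}},
    {"from": 10, "to": 20, "terms": {"archive", "web"}},
    {"from": 10, "to": 20, "terms": {"archive"}},
    {"from": 10, "to": 20, "terms": {"archive", "search"}},
]

early = temporal_idf("archive", docs, 0, 10)   # term is rare in this window
late = temporal_idf("archive", docs, 10, 20)   # term is common in this window
```

In this toy data the same term receives a higher weight in the early window, where it is rare, than in the late window, where every document contains it.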

Search User Interface

Search maryland 5 7.jpg

Search airpollution.jpg

Grouping Search Results

File:Grouping search results.png

Group-wide Scoring

Grouping helps, but which group should be placed first on the result page?

  • Simple method: use the average or the highest score among the group's members.
  • More effective method: compute a relevancy score for the group as a whole.
    • Instead of tf(t), we use df(t), the document frequency of t within the group.
    • Instead of idf(t), we use igf(t), the inverse group frequency of t.

We extend some of the best known IR technologies for group ranking.
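
The substitution of df(t) for tf(t) and igf(t) for idf(t) can be sketched as below. This is an illustrative assumption, not the published formula: groups are formed by URL, the score is a plain sum of df(t) × igf(t) over the query terms, and igf uses a smoothed logarithm. The function names and the result dictionary layout are hypothetical.

```python
import math
from collections import defaultdict


def group_by_url(results):
    """Group archived versions (dicts with "url" and "terms") by their URL."""
    groups = defaultdict(list)
    for r in results:
        groups[r["url"]].append(r)
    return dict(groups)


def group_scores(groups, query_terms):
    """Score each group as sum over query terms of df(t) * igf(t)."""
    n_groups = len(groups)
    scores = {}
    for key, members in groups.items():
        score = 0.0
        for t in query_terms:
            # df(t): number of group members that contain t.
            df = sum(1 for m in members if t in m["terms"])
            # gf(t): number of groups with at least one member containing t.
            gf = sum(
                1 for ms in groups.values() if any(t in m["terms"] for m in ms)
            )
            # Assumed smoothed inverse group frequency.
            igf = math.log((n_groups + 1) / (gf + 1)) + 1
            score += df * igf
        scores[key] = score
    return scores


results = [
    {"url": "A", "terms": {"air", "pollution"}},
    {"url": "A", "terms": {"air", "pollution"}},
    {"url": "A", "terms": {"air"}},
    {"url": "B", "terms": {"air"}},
]
groups = group_by_url(results)
scores = group_scores(groups, ["air", "pollution"])
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because df(t) is a raw count here, larger groups tend to score higher; dividing by the group size would be one way to offset that, though whether the original method normalizes is not stated on this page.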


Publication

Song, S. and JaJa, J. Search and Access Strategies for Web Archives. In Archiving 2009. IS&T, 2009. (pdf)