Webarc:Search and Access Strategies

Revision as of 19:17, 12 May 2009

Background

  • The Web has become the main publication medium world-wide, covering almost every facet of human activity. However, the Web is an ephemeral medium.
  • Web archives offer unique opportunities for knowledge discovery due to the richness of their contents extending over significant periods of time.
  • We need effective and scalable access strategies for web archives covering significant temporal spans.

Goals

  • An effective technology to explore and search for information in a web archive. The technology must enable effective information exploration and discovery.
  • A framework that produces ranking and evaluation of archived web contents within a temporal context as required by the user.
  • Methods to determine the relevance of a group of web objects within the temporal context. A group can consist of a series of temporally-contiguous versions of a single URL, or of web objects archived within some time span.
  • A framework that allows effective search using keywords and time spans for large-scale web archives.

Existing Access Methods

  • Chronological Listing

[Figure: Chronological listing.jpg]

  • Directory

[Figure: Directory.jpg]

  • Text-Search

[Figure: Text-search.jpg]

  • Hybrid

[Figure: Hybrid.jpg]

Problems With Existing Methods

  • Inefficient handling of time-constrained search.

[Figure: Inefficient temporal handling.png]

  • Ineffective delivery of search results.
    • Inadequate relevancy scoring.
      • Scoring is performed over the entire history.
    • Lack of search result grouping.
      • A URL is not unique in a web archive: its content is time dependent.
      • Because different versions of the same URL tend to have similar contents, the first result page is likely to be "polluted" by multiple versions of the same URL.
      • Users may want to focus on a specific time period within the results.
    • Lack of a group-scoring methodology.
      • Without a group-scoring methodology, it is unclear which group to show at the top.

Overview of Our Approach

  • Efficient time-constrained search by maintaining separate inverted lists for each time window.
  • Scoring within a temporal context by computing term weights as a function of time.
  • Grouping similar search results and scoring each group as a whole.


Basic Techniques

Determine a snapshot of web contents covering a time window:

SC_k = { all web objects valid within the time interval [t_k, t_{k+1}) }

Given the SCs, we maintain a set of inverted lists for each SC, and build a time hierarchy or a combined multi-version tree.
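The per-window inverted lists described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the project: the function names, the tuple layout of an archived object, and the reading of "valid within" as "validity interval overlaps the window" are all assumptions made for the example. The time hierarchy and multi-version tree are not covered.

```python
from collections import defaultdict


def build_window_indexes(objects, boundaries):
    """Build one inverted index per snapshot collection SC_k.

    objects:    iterable of (doc_id, valid_from, valid_to, text); each object
                is assumed valid over the half-open interval [valid_from, valid_to).
    boundaries: sorted list [t_0, t_1, ..., t_n]; SC_k covers [t_k, t_{k+1}).
    """
    indexes = [defaultdict(set) for _ in range(len(boundaries) - 1)]
    for doc_id, valid_from, valid_to, text in objects:
        for k in range(len(indexes)):
            # Assumption: an object belongs to SC_k if its validity
            # interval overlaps the window [t_k, t_{k+1}).
            if valid_from < boundaries[k + 1] and valid_to > boundaries[k]:
                for term in text.lower().split():
                    indexes[k][term].add(doc_id)
    return indexes


def search(indexes, boundaries, term, start, end):
    """Return ids of objects containing `term` in windows overlapping [start, end)."""
    hits = set()
    for k in range(len(indexes)):
        if start < boundaries[k + 1] and end > boundaries[k]:
            # Only the inverted lists of the relevant windows are read.
            hits |= indexes[k].get(term.lower(), set())
    return hits
```

In this sketch a time-constrained query touches only the inverted lists of the windows that overlap the requested span, so its cost does not depend on the full history. Results are accurate only to window granularity; a finer filter on the validity interval would be needed for exact time spans.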

Scoring within a Temporal Context

  • Determine a snapshot of web contents covering a time window: SC_k = { all web objects valid within the time interval [t_k, t_{k+1}) }
  • Given the SCs, we maintain a set of inverted lists for each SC, and build a time hierarchy or a combined multi-version tree.
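One way to read "scoring within a temporal context" is that collection statistics such as document frequency are taken from a single snapshot collection rather than from the whole archive. The sketch below assumes a plain tf-idf weight with a smoothed idf; the page does not state the exact weighting formula, and the function names and data layout are hypothetical.

```python
import math
from collections import Counter


def window_idf(docs_in_window, term):
    """Smoothed idf of `term`, computed only over one snapshot collection.

    docs_in_window: {doc_id: list of tokens} for the objects in SC_k.
    """
    n = len(docs_in_window)
    df = sum(1 for tokens in docs_in_window.values() if term in tokens)
    return math.log((n + 1) / (df + 1)) + 1.0


def score(docs_in_window, doc_id, query_terms):
    """tf-idf score of one document, using statistics of its time window only."""
    tf = Counter(docs_in_window[doc_id])
    return sum(tf[t] * window_idf(docs_in_window, t) for t in query_terms)
```

With this setup the same page content can receive different scores in different windows: a term that is common across the archive in one period but rare in another carries more weight in the period where it is rare.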

[Figure: Temporal index.png]

Search User Interface

[Figure: Search maryland 5 7.jpg] [Figure: Search airpollution.jpg]

Grouping Search Results

[Figure: Grouping search results.png]
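As a rough illustration of grouping, the sketch below collects the versions of each URL into one group and orders them by capture time, following the Goals section's description of a group as a series of versions of a single URL. The result tuple layout and the function name are assumptions for the example, and it does not check that versions are temporally contiguous.

```python
from collections import defaultdict


def group_by_url(results):
    """Group search hits by URL.

    results: iterable of (url, timestamp, score) tuples (hypothetical layout).
    Returns {url: [(timestamp, score), ...]} with versions in time order.
    """
    groups = defaultdict(list)
    for url, timestamp, score in results:
        groups[url].append((timestamp, score))
    for versions in groups.values():
        versions.sort()  # order versions of the same URL chronologically
    return dict(groups)
```

Each URL then occupies one entry on the result page, instead of one entry per archived version.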

Group-wide Scoring

Grouping helps, but which group should be placed first on the result page?

  • Simple method: use the average or the highest score among the group's members.
  • More effective method: compute a relevancy score for the group as a whole.
    • Instead of tf(t), we use df(t), the document frequency of t in the group.
    • Instead of idf(t), we use igf(t), the inverse group frequency.

We extend some of the best-known IR techniques for group ranking.
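The substitution of df(t) for tf(t) and igf(t) for idf(t) can be sketched as below. The page does not give the exact formula, so the smoothed logarithmic form of igf and the simple sum over query terms are assumptions, as are the function name and data layout.

```python
import math


def group_scores(groups, query_terms):
    """Score each group as a unit.

    groups: {group_id: [set of terms for each member document, ...]}
    df(t)  = number of members of the group that contain t
    igf(t) = smoothed inverse of the number of groups that contain t
    """
    n_groups = len(groups)
    scores = {}
    for group_id, members in groups.items():
        total = 0.0
        for t in query_terms:
            df = sum(1 for terms in members if t in terms)
            gf = sum(1 for ms in groups.values() if any(t in terms for terms in ms))
            igf = math.log((n_groups + 1) / (gf + 1)) + 1.0
            total += df * igf
        scores[group_id] = total
    return scores
```

Groups can then be listed in descending order of score. Because df(t) is a raw count here, larger groups tend to score higher; some normalization by group size may be appropriate in practice.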


Publication

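Song, S. and JaJa, J. Archiving Temporal Web Information: Organization of Web Contents for Fast Access and Compact Storage. UMIACS-TR-2008-08, University of Maryland Institute for Advanced Computer Studies, 2008.

Song, S. and JaJa, J. Search and Access Strategies for Web Archives. In Archiving 2009. IS&T, 2009.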