Webarc:Search and Access Strategies: Difference between revisions
Revision as of 19:17, 12 May 2009
Background
- The Web has become the main publication medium world-wide, covering almost every facet of human activity. However, the Web is an ephemeral medium.
- Web archives offer unique opportunities for knowledge discovery due to the richness of their contents extending over significant periods of time.
- We need effective and scalable access strategies for web archives covering significant temporal spans.
Goals
- An effective technology to explore and search for information in a web archive. The technology must enable effective information exploration and discovery.
- A framework that produces ranking and evaluation of archived web contents within a temporal context as required by the user.
- Methods to determine the relevance of a group of web objects within the temporal context. A group can consist of a series of temporally-contiguous versions of a single URL, or of web objects archived within some time span.
- A framework that allows effective search using keywords and time spans for large scale web archives.
Existing Access Methods
- Chronological Listing
- Directory
- Text-Search
- Hybrid
Problems With Existing Methods
- Inefficient handling of time-constrained search.
- Ineffective delivery of search results.
  - Inadequate relevancy scoring.
    - Scoring is performed over the entire history.
  - Lack of search result grouping.
    - A URL is not unique in a web archive; it is time-dependent.
    - Because different versions of the same URL tend to have similar contents, the first result page is highly likely to be "polluted" by multiple versions of the same URL.
    - Users may want to focus on a specific time period within the results.
  - Lack of a group-scoring methodology.
    - Without a group-scoring methodology, it is unclear which group to show at the top.
Overview of Our Approach
- Efficient time-constrained search by maintaining separate inverted lists for a given time window.
- Scoring within a temporal context by computing term weights as a function of time.
- Grouping similar search results, while scoring search results as a group.
Basic Techniques
Determine a snapshot of web contents covering a time window: SC_k = { all web objects valid within the time interval [t_k, t_{k+1}) }. Given the SCs, we maintain a set of inverted lists for each SC, and build a time hierarchy or a combined multi-version tree.
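The per-window inverted lists can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_window_indexes` and `search`, the tuple layout of a version, and whitespace tokenization are all assumptions made for illustration, and "valid within" is read here as "validity interval overlaps the window". The time hierarchy and the multi-version tree are not modeled.

```python
from collections import defaultdict


def build_window_indexes(versions, boundaries):
    """Build one inverted index per time window.

    versions:   iterable of (doc_id, start, end, text); a version is assumed
                valid over the half-open interval [start, end).
    boundaries: sorted list [t_0, ..., t_n]; window k is [t_k, t_{k+1}).
    Returns a list of dicts mapping term -> set of doc_ids, one per window.
    """
    indexes = [defaultdict(set) for _ in range(len(boundaries) - 1)]
    for doc_id, start, end, text in versions:
        for k in range(len(boundaries) - 1):
            lo, hi = boundaries[k], boundaries[k + 1]
            if start < hi and end > lo:  # validity overlaps window k
                for term in text.lower().split():
                    indexes[k][term].add(doc_id)
    return indexes


def search(indexes, boundaries, term, t_from, t_to):
    """Return doc_ids containing `term` in any window overlapping [t_from, t_to)."""
    hits = set()
    for k, index in enumerate(indexes):
        if boundaries[k] < t_to and boundaries[k + 1] > t_from:
            hits |= index.get(term, set())
    return hits
```

A time-constrained query then touches only the lists of the overlapping windows instead of a single index over the entire history. The result is a candidate set at window granularity; a stricter filter on each version's own validity interval could be applied afterwards.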
Scoring within a Temporal Context
- Determine a snapshot of web contents covering a time window:
  SC_k = { all web objects valid within the time interval [t_k, t_{k+1}) }
- Given the SCs, we maintain a set of inverted lists for each SC, and build a time hierarchy or a combined multi-version tree.
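One way to make term weights depend on time is to compute collection statistics from a single snapshot SC_k rather than from the whole archive. The sketch below assumes a plain tf-idf weighting with idf(t) = log(N / df(t)) taken over one window; the page does not state the exact weighting, so the formula and the names `window_idf` and `score` are illustrative assumptions.

```python
import math
from collections import Counter


def window_idf(window_docs):
    """Compute idf over one snapshot.

    window_docs: dict mapping doc_id -> list of terms, for a single window SC_k.
    Returns a dict term -> log(N / df), where N and df are local to the window.
    """
    n = len(window_docs)
    df = Counter()
    for terms in window_docs.values():
        df.update(set(terms))  # count each term once per document
    return {t: math.log(n / d) for t, d in df.items()}


def score(query_terms, doc_terms, idf):
    """Assumed tf-idf score of one document against window-local idf values."""
    tf = Counter(doc_terms)
    return sum(tf[t] * idf.get(t, 0.0) for t in query_terms)
```

With this setup the same document and query can receive different scores in different windows, because a term that is rare in one period may be common in another.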
Search User Interface
Grouping Search Results
File:Grouping search results.png
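As a rough illustration of grouping, the following Python sketch collects hits for the same URL into runs of temporally contiguous versions, which is one of the group definitions given under Goals. The hit layout `(url, version_no, score)` and the function name `group_hits` are assumptions; contiguity is approximated here by consecutive version numbers.

```python
from itertools import groupby


def group_hits(hits):
    """Group search hits into runs of consecutive versions of the same URL.

    hits: iterable of (url, version_no, score) tuples (hypothetical layout).
    Returns a list of groups, each a list of hits sorted by version number.
    """
    groups = []
    ordered = sorted(hits, key=lambda h: (h[0], h[1]))
    for _url, url_hits in groupby(ordered, key=lambda h: h[0]):
        run = []
        for hit in url_hits:
            # Start a new group when the version sequence has a gap.
            if run and hit[1] != run[-1][1] + 1:
                groups.append(run)
                run = []
            run.append(hit)
        groups.append(run)
    return groups
```

Showing one entry per group keeps many near-identical versions of a single URL from filling the first result page.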
Group-wide Scoring
Grouping helps, but which group should be placed first on the result page?
- Simple method: use the average or the highest score among the group's members.
- More effective method: compute a relevancy score for the group as a whole.
  - Instead of tf(t), we use df(t), the document frequency of t in the group.
  - Instead of idf(t), we use igf(t), the inverse group frequency.
We extend some of the best-known IR techniques for group ranking.
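The substitution of df(t) for tf(t) and igf(t) for idf(t) can be sketched in Python as follows. The page does not give the exact formula, so the smoothed form igf(t) = log(1 + G / gf(t)), where G is the number of groups and gf(t) is the number of groups containing t, is an assumption, as is the name `group_scores`.

```python
import math


def group_scores(groups, query_terms):
    """Score each group with an assumed df * igf weighting.

    groups: dict mapping group_id -> list of documents, each a list of terms.
    Returns a dict group_id -> score.
    """
    n_groups = len(groups)
    # gf(t): number of groups with at least one member containing t.
    gf = {
        t: sum(1 for docs in groups.values() if any(t in d for d in docs))
        for t in query_terms
    }
    scores = {}
    for gid, docs in groups.items():
        total = 0.0
        for t in query_terms:
            if gf[t] == 0:
                continue  # term appears in no group
            df = sum(1 for d in docs if t in d)  # members of this group containing t
            total += df * math.log(1 + n_groups / gf[t])
        scores[gid] = total
    return scores
```

Sorting the groups by this score in descending order gives one possible ordering for the result page; unlike the average or maximum of member scores, it rewards groups in which many members match the query.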
Publication
Song, S. and JaJa, J. Search and Access Strategies for Web Archives. In Archiving 2009. IS&T, 2009.