Webarc:Search and Access Strategies
From Adapt
Revision as of 19:33, 12 May 2009
Background
- The Web has become the main publication medium world-wide, covering almost every facet of human activity. However, the Web is an ephemeral medium.
- Web archives offer unique opportunities for knowledge discovery due to the richness of their contents extending over significant periods of time.
- We need effective and scalable access strategies for web archives covering significant temporal spans.
Goals
- An effective technology to explore and search for information in a web archive. The technology must enable effective information exploration and discovery.
- A framework that produces ranking and evaluation of archived web contents within a temporal context as required by the user.
- Methods to determine the relevance of a group of web objects within the temporal context. A group can consist of a series of temporally-contiguous versions of a single URL, or of web objects archived within some time span.
- A framework that allows effective search using keywords and time spans for large scale web archives.
Existing Access Methods
Problems With Existing Methods
- Inefficient handling of time-constrained search.
- Ineffective delivery of search results.
- Inadequate relevancy scoring.
  - Scoring is performed over the entire history rather than over the time span the user asked for.
- Ungrouped search results.
  - A URL is not unique in a web archive: it identifies different contents at different times.
  - Because different versions of the same URL tend to have similar contents, the first result page is highly likely to be "polluted" by multiple versions of the same URL.
  - Users may want to focus on a specific time period within the results.
- Lack of a group-scoring methodology.
  - Without a group-scoring methodology, it is unclear which group should be shown at the top.
Overview of Our Approach
- Efficient time-constrained search by maintaining separate inverted lists for each time window (see Basic Techniques).
- Scoring within a temporal context by computing term weights as a function of time (see Scoring within a Temporal Context).
- Grouping similar search results, and scoring each group as a whole (see Grouping Search Results and Group-wide Scoring).
Basic Techniques
- Determine a snapshot of web contents covering a time window
SC_k = { all web objects valid within the time interval [t_k, t_{k+1}) }
- Given SCs, we maintain a set of inverted lists for each SC, and build a time hierarchy or a combined multi-version tree.
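The per-window inverted lists described above can be sketched as follows. This is a minimal illustration, not the implementation used in the project: the class name `WindowedIndex`, its methods, and the flat list of windows (with no time hierarchy or multi-version tree) are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict


class WindowedIndex:
    """Hypothetical sketch: one inverted index per time window [t_k, t_{k+1})."""

    def __init__(self, boundaries):
        # boundaries = [t_0, t_1, ..., t_n], sorted ascending.
        # Window k covers the half-open interval [t_k, t_{k+1}).
        self.boundaries = list(boundaries)
        self.lists = [defaultdict(set) for _ in range(len(self.boundaries) - 1)]

    def _windows(self, start, end):
        # Indices of all windows that overlap the half-open interval [start, end).
        if start >= end:
            return []
        first = max(bisect_right(self.boundaries, start) - 1, 0)
        return [k for k in range(first, len(self.lists)) if self.boundaries[k] < end]

    def add(self, doc_id, terms, valid_from, valid_to):
        # Post the object into every window in which it is valid.
        for k in self._windows(valid_from, valid_to):
            for term in terms:
                self.lists[k][term].add(doc_id)

    def search(self, term, start, end):
        # Only the inverted lists of overlapping windows are read.
        # Results are at window granularity: an object valid anywhere in an
        # overlapping window is returned.
        hits = set()
        for k in self._windows(start, end):
            hits |= self.lists[k].get(term, set())
        return hits


index = WindowedIndex([0, 10, 20, 30])
index.add("a@v1", {"web", "archive"}, 0, 12)
index.add("a@v2", {"web", "search"}, 12, 30)
index.add("b@v1", {"archive"}, 25, 30)
print(index.search("web", 0, 10))
print(index.search("web", 20, 30))
```

A query restricted to one window never touches the postings of the others, which is the point of keeping the lists separate. The cost is that an object valid across several windows is posted several times.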
Scoring within a Temporal Context
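One way to read "term weights as a function of time" is to draw the collection statistics from the queried time window only, instead of from the entire history. The sketch below assumes a plain tf-idf variant with `log(1 + N/df)` as the inverse document frequency. Both the function name `temporal_tfidf` and that formula are illustrative assumptions, not the scoring function used in the project.

```python
import math


def temporal_tfidf(query_terms, doc_terms, window_docs):
    """Score one document using statistics from a single time window.

    doc_terms:   list of terms in the document being scored.
    window_docs: list of term lists, one per web object valid in the window.
    """
    n = len(window_docs)
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)                     # term frequency in the document
        df = sum(1 for d in window_docs if t in d)  # document frequency in the window
        if tf == 0 or df == 0:
            continue
        score += tf * math.log(1 + n / df)          # assumed idf variant
    return score


doc = ["olympics", "beijing"]
window_a = [["olympics", "beijing"], ["olympics", "torch"], ["election"]]
window_b = [["olympics", "beijing"], ["election"], ["election", "vote"]]
score_a = temporal_tfidf(["olympics"], doc, window_a)
score_b = temporal_tfidf(["olympics"], doc, window_b)
```

The same document receives a higher score in `window_b`, where the query term is rarer, so its rank depends on the temporal context of the query.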
Search User Interface
Grouping Search Results
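As stated in the goals, a group can consist of a series of versions of a single URL. A minimal sketch of that kind of grouping is shown below. The function name `group_by_url` and the `(url, timestamp, score)` result tuples are assumptions made for the example.

```python
from collections import defaultdict


def group_by_url(results):
    """Group per-version search hits by URL.

    results: iterable of (url, timestamp, score) tuples, one per archived version.
    Returns a dict mapping each URL to its versions in chronological order.
    """
    groups = defaultdict(list)
    for url, timestamp, score in results:
        groups[url].append((timestamp, score))
    return {url: sorted(versions) for url, versions in groups.items()}


hits = [
    ("http://example.org/", "2008-06-01", 0.7),
    ("http://example.org/news", "2008-03-15", 0.9),
    ("http://example.org/", "2008-01-01", 0.8),
]
grouped = group_by_url(hits)
```

Each URL then occupies one entry on the result page, with its versions listed inside the group.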
Group-wide Scoring
Grouping helps, but it raises a new question: which group should be placed first on the result page?
- Simple method: use the average or the highest score among the group's members.
- More effective method: compute a relevancy score for the group as a whole.
  - Instead of tf(t), we use df(t), the document frequency of term t within the group.
  - Instead of idf(t), we use igf(t), the inverse group frequency of term t.
We extend some of the best known IR technologies for group ranking.
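The two methods above can be sketched as follows. The page does not give the exact formula, so this example assumes a tf-idf-style product in which df(t) counts the group members containing t and igf(t) is taken as `log(1 + G/gf(t))`, with G the number of groups and gf(t) the number of groups containing t. The function names are hypothetical.

```python
import math


def simple_group_score(member_scores, how="max"):
    """Simple method: highest or average score among the members."""
    if how == "max":
        return max(member_scores)
    return sum(member_scores) / len(member_scores)


def group_score(query_terms, group, all_groups):
    """Group-wide method (assumed formula): sum over terms of df(t) * igf(t).

    group:      list of term sets, one per member of the group.
    all_groups: list of such groups, including `group` itself.
    """
    n_groups = len(all_groups)
    score = 0.0
    for t in query_terms:
        df = sum(1 for member in group if t in member)             # members containing t
        gf = sum(1 for g in all_groups if any(t in m for m in g))  # groups containing t
        if df == 0 or gf == 0:
            continue
        score += df * math.log(1 + n_groups / gf)                  # assumed igf variant
    return score


g1 = [{"web", "archive"}, {"web"}]
g2 = [{"archive"}]
g3 = [{"news"}]
groups = [g1, g2, g3]
ranked = sorted(groups, key=lambda g: group_score(["web"], g, groups), reverse=True)
```

Under this assumption, a group ranks higher when many of its members contain the query term and few other groups do.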
Publication
Song, S. and JaJa, J. Search and Access Strategies for Web Archives. In Archiving 2009. 2009: IS&T.