Webarc:Search and Access Strategies
== Background ==
* The Web has become the main publication medium world-wide, covering almost every facet of human activity. However, the Web is an ephemeral medium.
* Web archives offer unique opportunities for knowledge discovery due to the richness of their contents extending over significant periods of time.
* We need effective and scalable access strategies for web archives covering significant temporal spans.
== Goals ==
* A technology for exploring and searching a web archive that supports effective information exploration and discovery.
* A framework that ranks and evaluates archived web contents within a temporal context specified by the user.
* Methods to determine the relevance of a group of web objects within the temporal context. A group can consist of a series of temporally contiguous versions of a single URL, or of web objects archived within some time span.
* A framework that allows effective search using keywords and time spans for large-scale web archives.
== Existing Access Methods ==
[[Image:existing_methods.gif|480px]]
== Problems With Existing Methods ==
* Inefficient handling of time-constrained search.
[[Image:inefficient_temporal_handling.gif|480px]]
* Ineffective delivery of search results.
** Inadequate relevancy scoring.
*** Scoring is performed over the entire history.
** Ungrouped search results.
*** A URL alone does not identify a unique object in a web archive; its contents are time-dependent.
*** Because different versions of the same URL tend to have similar contents, the first result page is highly likely to be "polluted" by multiple versions of the same URL.
*** Users may want to focus on a specific time period within the results.
** Lack of a group-scoring methodology.
*** Without a group-scoring methodology, it is unclear which group to show at the top.
== Overview of Our Approach ==
* Efficient time-constrained search by maintaining separate inverted lists for each time window.
* Scoring within a temporal context by computing term weights as a function of time.
* Grouping similar search results, and scoring the search results as a group.
== Basic Techniques ==
* Determine a snapshot of web contents (SC) covering a time window:
 SC<sub>k</sub> = { all web objects valid within the time interval [t<sub>k</sub>, t<sub>k+1</sub>) }
* Given the SCs, we maintain a set of inverted lists for each SC and build a time hierarchy or a combined multi-version tree.
[[Image:temporal_index.gif|480px]]
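As a rough illustration of the per-window inverted lists described above, the following Python sketch builds one index per time window and answers a keyword query restricted to a time span. It is only a minimal sketch under stated assumptions: the <code>ArchivedObject</code> record, the integer time units, and the function names are hypothetical and not taken from the actual system, and the time hierarchy and multi-version tree are not modeled.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import NamedTuple


class ArchivedObject(NamedTuple):
    """Hypothetical record for one archived version of a URL."""
    url: str
    valid_from: int  # first time unit at which this version is valid
    valid_to: int    # exclusive end of the validity interval
    text: str


def build_window_indexes(objects, boundaries):
    """Build one inverted index per window [boundaries[k], boundaries[k+1])."""
    indexes = [defaultdict(set) for _ in range(len(boundaries) - 1)]
    for obj_id, obj in enumerate(objects):
        for k in range(len(boundaries) - 1):
            start, end = boundaries[k], boundaries[k + 1]
            # The object belongs to SC_k if its validity overlaps the window.
            if obj.valid_from < end and obj.valid_to > start:
                for term in set(obj.text.lower().split()):
                    indexes[k][term].add(obj_id)
    return indexes


def search(objects, indexes, boundaries, term, t_from, t_to):
    """Return ids of objects containing `term` and valid within [t_from, t_to)."""
    first = max(bisect_right(boundaries, t_from) - 1, 0)
    hits = set()
    for k in range(first, len(indexes)):
        if boundaries[k] >= t_to:
            break  # this and all later windows start after the query span
        hits |= indexes[k].get(term.lower(), set())
    # Windows can be wider than the query span, so re-check exact validity.
    return sorted(i for i in hits
                  if objects[i].valid_from < t_to and objects[i].valid_to > t_from)


objects = [
    ArchivedObject("a.org", 0, 10, "air pollution report"),
    ArchivedObject("a.org", 10, 30, "air quality report"),
    ArchivedObject("b.org", 25, 40, "pollution in maryland"),
]
boundaries = [0, 10, 20, 30, 40]
indexes = build_window_indexes(objects, boundaries)
print(search(objects, indexes, boundaries, "pollution", 20, 40))
```

Only the windows that overlap the query span are consulted, so a query over a short period does not touch the lists for the rest of the archive's history.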
== Scoring within a Temporal Context ==
[[Image:temporal_scoring.gif|320px]]
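One way to read "term weights as a function of time" is to compute the collection statistics only over the documents valid in the query's time span, rather than over the entire history. The sketch below assumes a plain tf-idf weighting with idf = log(N / df); the exact weighting used in this work is not stated here, and the document tuples and function names are hypothetical.

```python
import math


def temporal_idf(term, docs, t_from, t_to):
    """idf of `term` computed only over documents valid within [t_from, t_to).

    Each doc is a hypothetical tuple (valid_from, valid_to, tokens).
    """
    live = [tokens for (start, end, tokens) in docs
            if start < t_to and end > t_from]
    df = sum(1 for tokens in live if term in tokens)
    if df == 0:
        return 0.0
    return math.log(len(live) / df)


def temporal_score(query_terms, doc_tokens, docs, t_from, t_to):
    """Sum of tf * temporal idf over the query terms (assumed tf-idf form)."""
    return sum(doc_tokens.count(t) * temporal_idf(t, docs, t_from, t_to)
               for t in query_terms)


docs = [
    (0, 10, ["election", "news"]),
    (0, 10, ["election", "results"]),
    (10, 20, ["election", "recount"]),
    (10, 20, ["weather"]),
    (10, 20, ["sports"]),
]
# The same term gets a different weight depending on the time span queried.
print(temporal_idf("election", docs, 0, 10))
print(temporal_idf("election", docs, 10, 20))
```

In this toy data, "election" appears in every document of the first window and so carries no weight there, while in the second window it is rarer and weighs more.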
== Search User Interface ==
[[Image:search_maryland_5_7.jpg|320px]] [[Image:search_airpollution.jpg|320px]]
== Grouping Search Results ==
[[Image:grouping_search_results.gif|480px]]
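The Goals section allows a group to be a series of versions of a single URL. A minimal sketch of that simplest case, assuming each hit is a hypothetical <code>(url, capture_time, score)</code> tuple, could look like this; grouping by content similarity or by time span would need additional logic.

```python
from collections import defaultdict


def group_by_url(results):
    """Group (url, capture_time, score) hits by URL, ordered by capture time."""
    groups = defaultdict(list)
    for url, capture_time, score in results:
        groups[url].append((capture_time, score))
    return {url: sorted(versions) for url, versions in groups.items()}


results = [
    ("a.org", 2005, 0.9),
    ("b.org", 2006, 0.7),
    ("a.org", 2003, 0.8),
    ("a.org", 2004, 0.85),
]
grouped = group_by_url(results)
print(grouped["a.org"])
```

With the hits grouped this way, the three versions of <code>a.org</code> occupy a single entry on the result page instead of three.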
== Group-wide Scoring ==
Grouping helps, but which group should be placed first on the result page?
* Simple method: use the average or the highest score among the group's members.
* More effective method: compute a relevancy score for the group as a whole.
** Instead of tf(t), we use df(t), the document frequency of term t within the group.
** Instead of idf(t), we use igf(t), the inverse group frequency.
We extend some of the best-known IR techniques to group ranking.
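The substitution of df(t) for tf(t) and igf(t) for idf(t) can be sketched as follows. This assumes a tf-idf style sum and igf(t) = log(number of groups / number of groups containing t); the exact formulas are not given on this page, so both forms are assumptions, as are the data layout and the function name.

```python
import math


def group_scores(groups, query_terms):
    """Score each group as the sum over query terms of df(t, group) * igf(t).

    `groups` maps a group key to a list of documents, each a set of terms.
    """
    n_groups = len(groups)
    igf = {}
    for t in query_terms:
        # Number of groups with at least one member document containing t.
        gf = sum(1 for docs in groups.values() if any(t in d for d in docs))
        igf[t] = math.log(n_groups / gf) if gf else 0.0
    scores = {}
    for key, docs in groups.items():
        scores[key] = sum(sum(1 for d in docs if t in d) * igf[t]
                          for t in query_terms)
    return scores


groups = {
    "a.org": [{"air", "pollution"}, {"air", "quality"},
              {"air", "pollution", "report"}],
    "b.org": [{"pollution", "maryland"}],
    "c.org": [{"weather"}],
}
scores = group_scores(groups, ["pollution", "maryland"])
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy example <code>b.org</code> ranks first because it matches the rarer term "maryland", even though <code>a.org</code> has more member documents that mention "pollution".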
== Publication ==
Song, S. and JaJa, J. Search and Access Strategies for Web Archives. In Archiving 2009. 2009: IS&T. [[media:Archiving2009_submitted.pdf|pdf]]
Latest revision as of 19:38, 12 May 2009