Webarc:Tools Developed: Difference between revisions

Revision as of 22:48, 9 November 2009

Input Data Preprocessors

LRE Monthly Dumper (Java): Based on mwdumper, this tool extracts a snapshot as of the end of each month from the MediaWiki XML dump. It also filters out minor edits and redirects.
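The selection rule can be illustrated with a minimal sketch. It does not use the mwdumper API. The `Revision` record, its fields, and the `snapshot` method are hypothetical names. The sketch assumes that a month's snapshot holds, for each article revised in that month, the last revision that is neither a minor edit nor a redirect.

```java
import java.time.LocalDateTime;
import java.time.YearMonth;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MonthlySnapshotSketch {
    // Hypothetical, simplified stand-in for one revision in the XML dump.
    record Revision(String title, LocalDateTime timestamp, boolean minor, boolean redirect) {}

    // For each article title, keep the latest revision made within the given
    // month, skipping minor edits and redirects (assumed filtering rule).
    static Map<String, Revision> snapshot(List<Revision> revisions, YearMonth month) {
        Map<String, Revision> latest = new HashMap<>();
        for (Revision r : revisions) {
            boolean inMonth = YearMonth.from(r.timestamp()).equals(month);
            if (!inMonth || r.minor() || r.redirect()) {
                continue;
            }
            // Keep whichever of the two revisions is more recent.
            latest.merge(r.title(), r,
                    (a, b) -> b.timestamp().isAfter(a.timestamp()) ? b : a);
        }
        return latest;
    }
}
```

Articles with no qualifying revision in the month are absent from the result; those are the ones the later carryover step has to account for.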

MediaWiki-to-TREC Converter (Java): Also based on mwdumper, this tool converts the default MediaWiki XML dump format to the TREC-compliant format. For our experimental purposes, it also constructs a Java Berkeley DB for each month, which we call 'Fresh DB'. A 'Fresh DB' contains records corresponding to all wiki articles updated within the month. Each record in the DB has the form { docID, (revision date, file name, offset) }.

Merge DB Constructor (Java): This tool constructs a Merge DB for each month, which contains the union of the records in the Merge DB of the previous month and the Fresh DB of the current month. I.e. for month m, <math>MergeDB_m = MergeDB_{m-1} \cup FreshDB_m</math>. Since constructing a Merge DB for each month requires an existing Merge DB for the previous month, this tool needs to be run sequentially from the first month to the last month.
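The recurrence above can be sketched with in-memory maps standing in for the Berkeley DB files. This is an illustration, not the tool's code. The names `Location`, `merge`, and `mergeAll` are hypothetical. The rule that the Fresh DB entry replaces an older entry with the same docID is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MergeDbSketch {
    // Value part of a record: (revision date, file name, offset).
    record Location(String revisionDate, String fileName, long offset) {}

    // MergeDB_m = MergeDB_{m-1} union FreshDB_m, keyed by docID.
    // Assumption: on a docID collision the Fresh DB entry (newer revision) wins.
    static Map<String, Location> merge(Map<String, Location> previousMerge,
                                       Map<String, Location> fresh) {
        Map<String, Location> merged = new HashMap<>(previousMerge);
        merged.putAll(fresh);
        return merged;
    }

    // Months are processed strictly in order: each step consumes the
    // result of the previous one, so this loop cannot be parallelized.
    static List<Map<String, Location>> mergeAll(List<Map<String, Location>> freshByMonth) {
        List<Map<String, Location>> mergeByMonth = new ArrayList<>();
        Map<String, Location> previous = new HashMap<>(); // empty before the first month
        for (Map<String, Location> fresh : freshByMonth) {
            previous = merge(previous, fresh);
            mergeByMonth.add(previous);
        }
        return mergeByMonth;
    }
}
```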

Carryover DB Constructor (Java): By comparing the Fresh DB and the Merge DB for each month, this tool identifies the Wiki articles that need to be carried over from the previous month (i.e. those that appear only in the Merge DB). I.e. for month m, <math>CarryoverDB_m = MergeDB_m - FreshDB_m</math>. It constructs yet another DB (Carryover DB) for each month that contains these carryovers. This carryover identification step could also be performed during the Merge DB construction process. However, we detached this step for two reasons: 1. Keeping four indexes (one Fresh DB, two Merge DBs and one Carryover DB) open for each month requires significant system resources, and often slows down the whole process. 2. Unlike Merge DBs, Carryover DBs can be constructed in parallel, because each depends only on the Merge DB and Fresh DB of its own month.
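The set difference, and the reason it can run in parallel, can be sketched as follows. In-memory maps again stand in for the Berkeley DB files. The class and method names are hypothetical. The use of a parallel stream is one possible way to exploit the independence between months, not necessarily how the tool does it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class CarryoverDbSketch {
    // CarryoverDB_m = MergeDB_m - FreshDB_m: keep the records whose docID
    // appears in the Merge DB but not in the Fresh DB of the same month.
    static <V> Map<String, V> carryover(Map<String, V> merge, Map<String, V> fresh) {
        Map<String, V> result = new HashMap<>(merge);
        result.keySet().removeAll(fresh.keySet());
        return result;
    }

    // Each month needs only its own Merge DB and Fresh DB, so the months
    // can be processed independently (here: with a parallel stream).
    static <V> List<Map<String, V>> carryoverAll(List<Map<String, V>> mergeByMonth,
                                                 List<Map<String, V>> freshByMonth) {
        return IntStream.range(0, mergeByMonth.size())
                .parallel()
                .mapToObj(m -> carryover(mergeByMonth.get(m), freshByMonth.get(m)))
                .collect(Collectors.toList()); // encounter order (month order) is preserved
    }
}
```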

For our experiments, we ran the tools listed above in sequence to obtain monthly snapshots of Wikipedia articles from 2001 to 2007, and also to identify, for each month, the articles that have no new revisions in that month and thus need to be carried over and indexed for it.

Indexers

Lemur Indexer (modified) (C++): Based on the Lemur toolkit, we added support for an additional input source, namely Berkeley DB (for the Carryover DB). We also added extra statistics parameters for temporal scoring support. In particular, the modified index now also includes 'fresh document counts' (the number of non-carried-over documents) and 'term counts for fresh documents' (the number of terms within non-carried-over documents), not only for the entire index, but also for each individual term in the index.
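The extra statistics can be illustrated with a small sketch. It is written in Java for consistency with the preprocessing tools and does not use the Lemur API. All names are hypothetical. The per-term readings are assumptions: the per-term fresh document count is taken to be the number of fresh documents containing the term, and the per-term fresh term count is the number of occurrences of the term in fresh documents.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class FreshStatsSketch {
    // Hypothetical document: its tokenized terms and whether it was carried over.
    record Doc(List<String> terms, boolean carriedOver) {}

    int freshDocCount;   // index-wide: number of non-carried-over documents
    long freshTermCount; // index-wide: number of terms in those documents
    final Map<String, Integer> freshDocCountByTerm = new HashMap<>();
    final Map<String, Long> freshTermCountByTerm = new HashMap<>();

    static FreshStatsSketch compute(List<Doc> docs) {
        FreshStatsSketch stats = new FreshStatsSketch();
        for (Doc d : docs) {
            if (d.carriedOver()) {
                continue; // carried-over documents do not count as fresh
            }
            stats.freshDocCount++;
            stats.freshTermCount += d.terms().size();
            // Every occurrence counts towards the per-term term count.
            for (String t : d.terms()) {
                stats.freshTermCountByTerm.merge(t, 1L, Long::sum);
            }
            // Each document counts at most once per distinct term.
            for (String t : new HashSet<>(d.terms())) {
                stats.freshDocCountByTerm.merge(t, 1, Integer::sum);
            }
        }
        return stats;
    }
}
```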

Retrievers

Temporal Okapi Retrieval Method (C++): Based on the Okapi Retrieval Method implemented as a part of the Lemur toolkit, we added a Temporal Okapi Retrieval Method that takes the extra statistics parameters (the ones we included in the modified version of Lemur Indexer) into account for temporally-anchored scoring.

Temporal KL Retrieval Method (C++): Based on the Simple KL Retrieval Method implemented as a part of the Lemur toolkit, the Temporal KL Retrieval Method uses the extra statistics parameters (the ones we included in the modified version of Lemur Indexer) for temporally-anchored scoring.

Miscellaneous

Berkeley DB Wrapper for Carryover DB (Java): A simple wrapper for any C/C++ implementation that needs to access the Carryover DB (which is based on Java Berkeley DB) via JNI.

@@ Line 6: / Line 6: @@
 *[[Webarc:Merge DB Constructor|Merge DB Constructor]] (Java): This tool constructs Merge DB for each month, which contains the union set of records between Merge DB of the previous month and Fresh DB of the current month. I.e. for month m, <math>MergeDB_m = MergeDB_{m-1} \cup FreshDB_m</math>. Since constructing a Merge DB for each month requires an existing Merge DB for the previous month, this tool needs to be run sequentially from the first month to the last month.
-*[[Webarc:Carryover DB Constructor|Carryover DB Constructor]] (Java): By comparing Fresh DB and Merge DB for each month, this tool identifies articles the Wiki articles that were carried over from the previous month (i.e. those that only appear in Merge DB). I.e. for month m, <math>CarryoverDB_m = MergeDB_m - FreshDB_m</math>. It constructs yet another DB (Carryover DB) for each month that contains these carryovers. This carryover identification step could also be performed during the Merge DB Construction process. However, we detached this step for two reasons: 1. Having four indexes (one Fresh DB and two Merge DBs and one Carryover DB) for each month requires a significant system resource, and often slows down the whole process. 2. Unlike the Merge DB Construction process, Carryover DBs can be constructed in parallel.
+*[[Webarc:Carryover DB Constructor|Carryover DB Constructor]] (Java): By comparing Fresh DB and Merge DB for each month, this tool identifies the Wiki articles that need to be carried over from the previous month (i.e. those that only appear in Merge DB). I.e. for month m, <math>CarryoverDB_m = MergeDB_m - FreshDB_m</math>. It constructs yet another DB (Carryover DB) for each month that contains these carryovers. This carryover identification step could also be performed during the Merge DB Construction process. However, we detached this step for two reasons: 1. Having four indexes (one Fresh DB and two Merge DBs and one Carryover DB) for each month requires a significant system resource, and often slows down the whole process. 2. Unlike the Merge DB Construction process, Carryover DBs can be constructed in parallel.
 For our experiments, we have run the tools listed above sequentially to obtain monthly snapshots of Wikipedia articles from 2001 to 2007, and also to identify, for each month, the articles that do not have new revisions in the current month, thus need to be carried over and indexed for the current month.
