Webarc:Tools Developed
Input Data Preprocessors
- LRE Monthly Dumper (Java): Based on mwdumper, this tool extracts from the MediaWiki XML dump one snapshot per month, taken at the end of that month. It also filters out minor edits and redirects.
- MediaWiki-to-TREC Converter (Java): Also based on mwdumper, this tool converts the default MediaWiki XML dump format to the TREC-compliant format. For our experimental purposes, it also constructs a Java Berkeley DB for each month, in which each record contains { docID, (revision date, file name, offset) } for every Wiki article within the snapshot. We call this database 'Fresh DB'.
- Merge DB Constructor (Java): This tool constructs a Merge DB for each month, which contains the union of the records in the previous month's Merge DB and the current month's Fresh DB, i.e. Merge DB (month k) = Merge DB (month k-1) ∪ Fresh DB (month k). Since constructing the Merge DB for a month requires the Merge DB of the previous month, this tool must be run sequentially from the first month to the last.
- Carryover DB Constructor (Java): By comparing Fresh DB and Merge DB for each month, this tool identifies the Wiki articles that were carried over from the previous month (i.e. those that appear only in Merge DB). It constructs yet another DB (Carryover DB) for each month that contains these carryovers. This carryover identification step could also be performed during the Merge DB construction process. However, we detached it for two reasons: (1) having four indexes (one Fresh DB, two Merge DBs, and one Carryover DB) for each month requires significant system resources and often slows down the whole process; (2) unlike Merge DBs, Carryover DBs can be constructed in parallel.
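The Merge DB and Carryover DB relationships described above can be sketched with in-memory maps. This is only an illustration, not the actual tools: the class `SnapshotDbSketch`, the `Location` type, and the method names are hypothetical, plain Java maps stand in for the Berkeley DB files, and the sketch assumes that when a docID appears in both inputs the Fresh DB record (the newer revision) replaces the older one.

```java
import java.util.Map;
import java.util.TreeMap;

final class SnapshotDbSketch {
    /** Hypothetical record value mirroring (revision date, file name, offset). */
    static final class Location {
        final String revisionDate;
        final String fileName;
        final long offset;

        Location(String revisionDate, String fileName, long offset) {
            this.revisionDate = revisionDate;
            this.fileName = fileName;
            this.offset = offset;
        }
    }

    /** Merge DB (month k) = Merge DB (month k-1) ∪ Fresh DB (month k). */
    static Map<String, Location> merge(Map<String, Location> previousMerge,
                                       Map<String, Location> fresh) {
        Map<String, Location> merged = new TreeMap<>(previousMerge);
        // Assumption: a fresh record overrides the older record with the same docID.
        merged.putAll(fresh);
        return merged;
    }

    /** Carryover DB (month k) = records that appear in Merge DB but not in Fresh DB. */
    static Map<String, Location> carryover(Map<String, Location> merged,
                                           Map<String, Location> fresh) {
        Map<String, Location> carried = new TreeMap<>(merged);
        carried.keySet().removeAll(fresh.keySet());
        return carried;
    }
}
```

In this sketch `merge` for month k needs the result for month k-1, which is why that step is sequential, whereas `carryover` for a month depends only on that month's two inputs and could be computed for many months at once.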
For our experiments, we ran the tools listed above in sequence to obtain monthly snapshots of Wikipedia articles from 2001 to 2007, and to identify, for each month, the articles that have no new revisions in that month and therefore need to be carried over and indexed for that month.
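The monthly snapshot selection can be pictured roughly as follows. This is a sketch under stated assumptions, not the mwdumper-based implementation: the `MonthlySnapshotSketch` class and `Revision` type are hypothetical, revisions are assumed to be already parsed from the XML dump, and the snapshot is assumed to keep, for each article, the latest non-minor, non-redirect revision made within the month (articles without such a revision are the ones later treated as carryovers).

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MonthlySnapshotSketch {
    /** Hypothetical, already-parsed view of one revision in the dump. */
    static final class Revision {
        final String title;
        final LocalDate date;
        final boolean minor;
        final boolean redirect;

        Revision(String title, LocalDate date, boolean minor, boolean redirect) {
            this.title = title;
            this.date = date;
            this.minor = minor;
            this.redirect = redirect;
        }
    }

    /** Returns, per article title, the latest qualifying revision within the given month. */
    static Map<String, Revision> snapshot(List<Revision> revisions, YearMonth month) {
        Map<String, Revision> latest = new HashMap<>();
        for (Revision r : revisions) {
            if (r.minor || r.redirect) {
                continue; // filtered out, as the dumper does for minor edits and redirects
            }
            if (!YearMonth.from(r.date).equals(month)) {
                continue; // assumption: only revisions made in this month are considered
            }
            Revision current = latest.get(r.title);
            if (current == null || r.date.isAfter(current.date)) {
                latest.put(r.title, r);
            }
        }
        return latest;
    }
}
```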
Indexers
- Lemur Indexer (modified) (C++): Starting from the indexer in the Lemur toolkit, we added support for an additional input source, namely Berkeley DB (for the Carryover DB). We also added extra statistics parameters to support temporal scoring. In particular, the modified index now also includes 'fresh document counts' (the number of non-carried-over documents) and 'term counts for fresh documents' (the number of terms within non-carried-over documents), not only for the entire index but also for each individual term in the index.
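The extra statistics can be illustrated with a small accumulator. The actual indexer is a C++ modification of Lemur; the Java sketch below is not its API, and the names `FreshStatsSketch`, `Counts`, and `addDocument` are hypothetical. It assumes that the per-term 'fresh document count' is the number of non-carried-over documents containing the term, and that the per-term 'term count for fresh documents' is the number of occurrences of the term within those documents.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FreshStatsSketch {
    /** Counters kept both for the whole index and for each term. */
    static final class Counts {
        long docCount;        // documents (containing the term, for per-term counts)
        long freshDocCount;   // the same, restricted to non-carried-over documents
        long termCount;       // term occurrences
        long freshTermCount;  // term occurrences within non-carried-over documents
    }

    final Counts collection = new Counts();
    final Map<String, Counts> perTerm = new HashMap<>();

    /** Adds one tokenized document; carriedOver marks documents taken from the Carryover DB. */
    void addDocument(List<String> terms, boolean carriedOver) {
        boolean fresh = !carriedOver;
        collection.docCount++;
        collection.termCount += terms.size();
        if (fresh) {
            collection.freshDocCount++;
            collection.freshTermCount += terms.size();
        }
        Set<String> seen = new HashSet<>();
        for (String term : terms) {
            Counts c = perTerm.computeIfAbsent(term, t -> new Counts());
            c.termCount++;
            if (fresh) {
                c.freshTermCount++;
            }
            if (seen.add(term)) { // count each document at most once per term
                c.docCount++;
                if (fresh) {
                    c.freshDocCount++;
                }
            }
        }
    }
}
```

A retrieval method could then read either the full counts or the fresh-only counts, depending on how it weighs carried-over documents.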
Retrievers
- Temporal Okapi Retrieval Method (C++): Based on the Okapi Retrieval Method implemented as part of the Lemur toolkit, this method takes the extra statistics parameters (the ones added in the modified Lemur Indexer) into account for temporally-anchored scoring.
- Temporal KL Retrieval Method (C++): Based on the Simple KL Retrieval Method implemented as part of the Lemur toolkit, this method uses the extra statistics parameters (the ones added in the modified Lemur Indexer) for temporally-anchored scoring.
Miscellaneous
- Berkeley DB Wrapper for Carryover DB (Java): A simple wrapper that lets C/C++ implementations access Carryover DB (which is based on Java Berkeley DB) via JNI (Java Native Interface).