Webarc:Tools Developed
Input Data Preprocessors
- mwdumper: LRE Monthly Dumper (Java): Based on mwdumper, this tool extracts from the MediaWiki XML dump a snapshot of the wiki as of the end of each month. It also filters out minor edits and redirects.
- mwprep: MediaWiki-to-TREC Converter (Java): Also based on mwdumper, this tool converts the default MediaWiki XML dump format to the TREC-compliant format. For our experimental purposes, it also constructs a Java Berkeley DB for each month, which we call the 'Fresh DB'. A Fresh DB contains records for all wiki articles updated within the month. Each record in the DB has the form { docID, (revision date, file name, offset) }.
- colstate: Merge DB Constructor (Java): This tool constructs a Merge DB for each month, which contains the union of the records in the previous month's Merge DB and those in the current month's Fresh DB. I.e. for month m, <math>MergeDB_m = MergeDB_{m-1} \cup FreshDB_m</math>. Since constructing the Merge DB for a month requires the Merge DB of the previous month, this tool must be run sequentially from the first month to the last.
- codb: Carryover DB Constructor (Java): By comparing the Fresh DB and the Merge DB for each month, this tool identifies the wiki articles that need to be carried over from the previous month (i.e. those that appear only in the Merge DB, not in the monthly snapshot). It constructs a third DB (the Carryover DB) for each month that contains these carryovers. I.e. for month m, <math>CarryoverDB_m = MergeDB_m - FreshDB_m</math>. A sketch of these monthly set operations appears below.
For our experiments, we ran the tools listed above sequentially to obtain monthly snapshots of Wikipedia articles from 2001 to 2007, and to identify, for each month, the articles that have no new revision in that month and thus need to be carried over and indexed for it.

[Preprocessing Flowchart (preprocess.png)]
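The sketch below illustrates the monthly bookkeeping these tools perform, using plain in-memory maps in place of the Java Berkeley DBs; all class and method names here are hypothetical, not part of the actual tools.

<syntaxhighlight lang="java">
import java.util.HashMap;
import java.util.Map;

// Illustration only: the real tools persist these maps in Java Berkeley DB.
public class MonthlyBookkeeping {

    // One Fresh DB record: docID -> (revision date, file name, offset).
    record Location(String revisionDate, String fileName, long offset) {}

    // MergeDB_m = MergeDB_{m-1} union FreshDB_m; when a document appears in
    // both, the fresh record (this month's revision) replaces the older one.
    static Map<String, Location> mergeDb(Map<String, Location> prevMerge,
                                         Map<String, Location> fresh) {
        Map<String, Location> merged = new HashMap<>(prevMerge);
        merged.putAll(fresh);
        return merged;
    }

    // CarryoverDB_m = MergeDB_m - FreshDB_m: documents with no new revision
    // this month, which must be carried over into this month's index.
    static Map<String, Location> carryoverDb(Map<String, Location> merge,
                                             Map<String, Location> fresh) {
        Map<String, Location> carryover = new HashMap<>(merge);
        carryover.keySet().removeAll(fresh.keySet());
        return carryover;
    }
}
</syntaxhighlight>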
Indexers
- IndriBuildIndex: Lemur Indexer (modified) (C++): Starting from the Lemur toolkit's indexer, we added support for an additional input source, namely Berkeley DB (for the Carryover DB). We also added extra statistics parameters for temporal scoring support. In particular, the modified index now also records 'fresh document counts' (the number of non-carried-over documents) and 'term counts for fresh documents' (the number of term occurrences within non-carried-over documents), both for the entire index and for each individual term in the index; a sketch of these statistics follows.

[Indexing Flowchart (index.png)]
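The internal layout of these statistics in the modified index is not documented on this page; the following sketch, with hypothetical names, shows the kind of per-term and index-wide bookkeeping described above.

<syntaxhighlight lang="java">
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the extra statistics the modified indexer keeps.
// 'Fresh' documents are those indexed from the monthly snapshot itself,
// as opposed to documents pulled in from the Carryover DB.
public class TemporalIndexStats {

    static class TermStats {
        long docCount;       // documents containing the term (standard)
        long termCount;      // total occurrences of the term (standard)
        long freshDocCount;  // fresh documents containing the term (added)
        long freshTermCount; // occurrences within fresh documents (added)
    }

    long totalDocs, totalFreshDocs;     // index-wide counterparts
    final Map<String, TermStats> terms = new HashMap<>();

    // Called once per (document, term) pair during indexing, with the
    // term's frequency in that document.
    void addPosting(String term, int freq, boolean isFresh) {
        TermStats s = terms.computeIfAbsent(term, t -> new TermStats());
        s.docCount++;
        s.termCount += freq;
        if (isFresh) {
            s.freshDocCount++;
            s.freshTermCount += freq;
        }
    }
}
</syntaxhighlight>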
Distributed Index Support
- PBS Jobs Scripts for Indexing (Bash): These scripts allow indexers to be deployed over multiple nodes in the Chimera cluster.
Retrievers
- Temporal Okapi Retrieval Method (C++): Based on the Okapi Retrieval Method implemented as part of the Lemur toolkit, we added a Temporal Okapi Retrieval Method that takes the extra statistics parameters (those we included in the modified Lemur Indexer) into account for temporally-anchored scoring.
- Temporal KL Retrieval Method (C++): Based on the Simple KL Retrieval Method implemented as part of the Lemur toolkit, the Temporal KL Retrieval Method uses the same extra statistics parameters for temporally-anchored scoring. A sketch of one possible use of these statistics appears after this list.
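The exact temporally-anchored scoring formulas are not reproduced on this page. Purely as an illustration of how the fresh statistics can enter a score, the sketch below interpolates a BM25-style IDF computed from fresh document counts with the usual collection-wide IDF; this is an assumed variant, not the actual Temporal Okapi or Temporal KL formula.

<syntaxhighlight lang="java">
// Hypothetical sketch of temporally-anchored scoring. It is NOT the exact
// formula used in our temporal retrieval methods; it merely shows how the
// fresh statistics could replace collection-wide statistics in a BM25 IDF.
public class TemporalOkapiSketch {

    // Standard Okapi/BM25 IDF computed from all documents in the index.
    static double idf(long totalDocs, long docCount) {
        return Math.log((totalDocs - docCount + 0.5) / (docCount + 0.5));
    }

    // Temporal variant: weight terms by how they are distributed over
    // documents that are fresh in the target month.
    static double freshIdf(long totalFreshDocs, long freshDocCount) {
        return Math.log((totalFreshDocs - freshDocCount + 0.5)
                        / (freshDocCount + 0.5));
    }

    // BM25 term score, interpolating the two IDFs with weight lambda.
    static double score(int tf, double docLen, double avgDocLen,
                        double idfAll, double idfFresh, double lambda) {
        double k1 = 1.2, b = 0.75;
        double norm = tf * (k1 + 1)
                      / (tf + k1 * (1 - b + b * docLen / avgDocLen));
        return norm * (lambda * idfFresh + (1 - lambda) * idfAll);
    }
}
</syntaxhighlight>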
Distributed Search Support
- tsearch: Temporal Search Client/Server (C): Using RPC, a search server provides two remote procedures to clients: getstats and search. Getstats returns statistics (such as doc count, term count, fresh doc count, ...) for a provisional query, while search returns the actual search results for a query. Each search server is multithreaded and can serve multiple indexes simultaneously; each search client is also multithreaded and issues search requests to multiple servers simultaneously.
- PBS Jobs Scripts for Searching (Bash): These scripts allow search servers and clients to be deployed over multiple nodes in the Chimera cluster. For n nodes and i indexes, each node is made responsible for i/n indexes. Indexes are distributed in an interlaced fashion: node 1 gets indexes 1, n+1, 2n+1, ..., node 2 gets indexes 2, n+2, 2n+2, ..., and so on; the assignment and the client fan-out are sketched below.

[Distributed Search (search.png)]
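The sketch below shows the interlaced index assignment and the client-side fan-out. The real tsearch client and server are written in C on top of RPC; the SearchServer interface here is a hypothetical stand-in for the RPC stubs, and result merging is simplified to concatenation.

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Hypothetical illustration of the distributed search layout.
public class DistributedSearchSketch {

    interface SearchServer { List<String> search(String query); }

    // Interlaced distribution: with n nodes, index k (1-based) goes to
    // node ((k - 1) % n) + 1, so node 1 holds indexes 1, n+1, 2n+1, ...
    static int nodeForIndex(int indexId, int numNodes) {
        return ((indexId - 1) % numNodes) + 1;
    }

    // The client queries all servers concurrently and gathers their
    // results; the real client would merge ranked lists by score.
    static List<String> fanOut(List<SearchServer> servers, String query)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(servers.size());
        List<Future<List<String>>> futures = new ArrayList<>();
        for (SearchServer s : servers)
            futures.add(pool.submit(() -> s.search(query)));
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : futures)
            merged.addAll(f.get());
        pool.shutdown();
        return merged;
    }
}
</syntaxhighlight>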
Result Analysis
- Trec_Eval (modified) (C): We added Kendall's Tau support to the existing Trec_Eval tool; a sketch of the statistic follows.
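For n items with c concordant and d discordant pairs, Kendall's Tau is <math>\tau = (c - d) / \binom{n}{2}</math>. The sketch below computes it for two rankings of the same items; it is an illustration, not the code added to Trec_Eval (which is in C).

<syntaxhighlight lang="java">
// Minimal sketch of Kendall's Tau over two rankings of the same n items;
// rankA[i] and rankB[i] are the ranks assigned to item i by each ranking.
// Tied pairs count as neither concordant nor discordant here.
public class KendallTau {
    static double tau(int[] rankA, int[] rankB) {
        int n = rankA.length;
        int concordant = 0, discordant = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                int a = Integer.compare(rankA[i], rankA[j]);
                int b = Integer.compare(rankB[i], rankB[j]);
                if (a * b > 0) concordant++;      // ordered the same way
                else if (a * b < 0) discordant++; // ordered oppositely
            }
        }
        return (concordant - discordant) / (0.5 * n * (n - 1));
    }
}
</syntaxhighlight>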
Miscellanea
- Berkeley DB Wrapper for Carryover DB (Java): A simple wrapper that lets C/C++ implementations access the Carryover DB (which is based on Java Berkeley DB) via JNI; a sketch of the wrapper's shape follows.
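A sketch of what such a JNI-friendly wrapper can look like, assuming Berkeley DB Java Edition (com.sleepycat.je) and using static methods with flat String arguments so the native side stays simple; the method and database names are illustrative, not the wrapper's actual API.

<syntaxhighlight lang="java">
import com.sleepycat.je.*;
import java.io.File;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper shape: a flat, static API that C/C++ code can call
// through the JNI invocation interface without holding DB object handles.
public class CarryoverDbWrapper {
    private static Environment env;
    private static Database db;

    public static void open(String envDir) throws DatabaseException {
        EnvironmentConfig cfg = new EnvironmentConfig();
        cfg.setReadOnly(true);
        env = new Environment(new File(envDir), cfg);
        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setReadOnly(true);
        db = env.openDatabase(null, "carryover", dbCfg);
    }

    // Returns the stored record for a docID, or null when absent. Returning
    // a plain String keeps the native side to a single GetStringUTFChars.
    public static String lookup(String docId) throws DatabaseException {
        DatabaseEntry key =
            new DatabaseEntry(docId.getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry();
        if (db.get(null, key, value, LockMode.DEFAULT)
                == OperationStatus.SUCCESS)
            return new String(value.getData(), StandardCharsets.UTF_8);
        return null;
    }

    public static void close() throws DatabaseException {
        db.close();
        env.close();
    }
}
</syntaxhighlight>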