Webarc:Lemur Indexer (modified)
What It Does
Given an input source, the modified Lemur/Indri Indexer either builds a new index or incrementally adds new documents to an existing index. The features listed below were added on top of the original Lemur/Indri toolkit, which can be found at http://www.lemurproject.org/.
- New statistics are stored in the index. They are:
  - fresh document count: the number of wiki revisions in the index that belong to the monthly snapshot, not including those carried over from the previous snapshot.
  - term count in fresh documents: the number of terms appearing in the monthly snapshot, not including those carried over from the previous snapshot.
  - fresh document count for each term: the number of wiki revisions in the monthly snapshot that contain a given term, not including those carried over from the previous snapshot.
  - term count in fresh documents for each term: the number of occurrences of a given term in the monthly snapshot, not including those carried over from the previous snapshot.
  Note that the new index also contains the original statistics, which count documents and terms over the entire collection (monthly snapshot + carryover revisions).
- Berkeley DB support as a new input source: a new class, indri.parse.BDBTaggedDocumentIterator, was added, along with minor changes to a number of related classes.
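To make the difference between the "fresh" statistics and the collection-wide statistics concrete, here is a minimal Python sketch. It is not the indexer's C++ implementation. The `Revision` type, the `fresh` flag, and the `collect_statistics` function are hypothetical names introduced only for illustration, and the sketch assumes that "term count" means the number of term occurrences (tokens).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Revision:
    """One wiki revision (hypothetical type, for illustration only)."""
    tokens: List[str]
    fresh: bool  # True: part of the monthly snapshot; False: carried over


def new_entry() -> Dict[str, int]:
    """Counters kept both collection-wide and per term."""
    return {
        "document_count": 0,        # all revisions (snapshot + carryover)
        "term_count": 0,            # all term occurrences
        "fresh_document_count": 0,  # snapshot revisions only
        "fresh_term_count": 0,      # term occurrences in snapshot revisions only
    }


def collect_statistics(
    revisions: Iterable[Revision],
) -> Tuple[Dict[str, int], Dict[str, Dict[str, int]]]:
    """Return (collection-wide counters, per-term counters)."""
    totals = new_entry()
    per_term: Dict[str, Dict[str, int]] = {}
    for rev in revisions:
        totals["document_count"] += 1
        totals["term_count"] += len(rev.tokens)
        if rev.fresh:
            totals["fresh_document_count"] += 1
            totals["fresh_term_count"] += len(rev.tokens)
        # Count each distinct term once per revision for document counts,
        # and by its number of occurrences for term counts.
        for term, occurrences in Counter(rev.tokens).items():
            entry = per_term.setdefault(term, new_entry())
            entry["document_count"] += 1
            entry["term_count"] += occurrences
            if rev.fresh:
                entry["fresh_document_count"] += 1
                entry["fresh_term_count"] += occurrences
    return totals, per_term
```

Carried-over revisions contribute only to the collection-wide counters, so each fresh count is always less than or equal to its collection-wide counterpart.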
How To Build
Build it the same way as the original Lemur/Indri toolkit, i.e., from the toolkit directory:
make; make install
How To Run
Run it the same way as the original IndriBuildIndex (http://www.lemurproject.org/tutorials/interm_indexing-2.php):
./IndriBuildIndex <Indri indexer parameter file>
Input File
The input is mostly the same as the original Indri indexer parameter file format (http://www.lemurproject.org/lemur/indexing.php#IndriBuildIndex), with additional support for Berkeley DB as an input source.
Example parameter file for indexing carryovers via Berkeley DB:
<parameters>
  <index>/fs/webarc3/data/wikipedia/lemur_index/monthly/month-003</index>
  <indexType>indri</indexType>
  <corpus>
    <path>/fs/webarc3/data/wikipedia/bdb-monthly/month-003-co</path>
    <class>trectext_from_bdb</class>
  </corpus>
</parameters>

Example parameter file for indexing a monthly snapshot.

<parameters>
  <index>/fs/webarc3/data/wikipedia/lemur_index/monthly/month-003</index>
  <indexType>indri</indexType>
  <corpus>
    <path>/fs/webarc3/data/wikipedia/preprocessed-monthly/trec-month-003.xml</path>
    <class>trectext</class>
  </corpus>
</parameters>
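The two example parameter files differ only in the corpus path and the corpus class, so they could be generated by a small script instead of being edited by hand for each month. The following Python sketch shows one way to do that with the standard library. `build_parameters` is a hypothetical helper, not part of the toolkit, and it emits only the elements that appear in the examples above.

```python
import xml.etree.ElementTree as ET


def build_parameters(index_dir: str, corpus_path: str, corpus_class: str) -> str:
    """Return the text of a parameter file with one <corpus> entry."""
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_dir
    ET.SubElement(root, "indexType").text = "indri"
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = corpus_class
    # ET.indent is available from Python 3.9; older versions emit one line.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Paths copied from the carryover example; adjust for other months.
    print(build_parameters(
        "/fs/webarc3/data/wikipedia/lemur_index/monthly/month-003",
        "/fs/webarc3/data/wikipedia/bdb-monthly/month-003-co",
        "trectext_from_bdb",
    ))
```

The returned text can be written to a file and passed to IndriBuildIndex as described under How To Run.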
Output Files
An Indri index directory at the location specified in the input parameter file (<index>...</index>).
Notes
- Because the new index format contains extra statistics, it is not compatible with the original Indri index format, so an index built with this indexer cannot be read by the original Lemur toolkit.
Source Code
svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/lemur-4.10