Webarc:Lemur Indexer (modified)

What It Does

Given an input source, the modified Lemur/Indri indexer either builds a new index or incrementally adds new documents to an existing index. The features listed below were added on top of the original Lemur/Indri toolkit, which can be found at http://www.lemurproject.org/.

  • New statistics parameters are included in the index. They are:
    • fresh document count: the number of wiki revisions in the index that belong to the monthly snapshot, not including those carried over from the previous snapshot.
    • term count in fresh documents: the number of terms appearing in the monthly snapshot, not including those carried over from the previous snapshot.
    • fresh document count for each term: the number of wiki revisions in the monthly snapshot that contain a given term, not including those carried over from the previous snapshot.
    • term count in fresh documents for each term: the number of occurrences of a given term in the monthly snapshot, not including those carried over from the previous snapshot.

Note that the new index also contains the previous statistics, which count documents and terms over the entire collection (monthly snapshot + carryover revisions).

  • Berkeley DB support as a new input source: a new class, indri.parse.BDBTaggedDocumentIterator, has been added, along with minor changes to a number of related classes.
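
The relationship between the fresh statistics and the collection-wide statistics described above can be illustrated with a small sketch. This is not the indexer's code: the function name, the (terms, is_fresh) input shape, and the dictionary keys are all hypothetical, and it assumes that "term count" means the number of term occurrences (tokens) rather than the number of distinct terms.

```python
from collections import Counter


def index_statistics(revisions):
    """Compute collection-wide and fresh-only counts.

    revisions: iterable of (terms, is_fresh) pairs, where terms is a list of
    tokens and is_fresh is True for revisions new in the monthly snapshot and
    False for revisions carried over from the previous snapshot.
    (Hypothetical input shape, chosen for illustration only.)
    """
    stats = {
        "document_count": 0,
        "term_count": 0,
        "fresh_document_count": 0,
        "fresh_term_count": 0,
    }
    per_term = {
        "document_count": Counter(),
        "term_count": Counter(),
        "fresh_document_count": Counter(),
        "fresh_term_count": Counter(),
    }
    for terms, is_fresh in revisions:
        occurrences = Counter(terms)
        # Every revision counts toward the collection-wide statistics;
        # only fresh revisions also count toward the fresh statistics.
        prefixes = ("", "fresh_") if is_fresh else ("",)
        for prefix in prefixes:
            stats[prefix + "document_count"] += 1
            stats[prefix + "term_count"] += len(terms)
            # One document per distinct term in this revision.
            per_term[prefix + "document_count"].update(occurrences.keys())
            # All occurrences of each term in this revision.
            per_term[prefix + "term_count"].update(occurrences)
    return stats, per_term


if __name__ == "__main__":
    toy = [
        (["wiki", "page", "wiki"], True),   # fresh revision
        (["page", "edit"], True),           # fresh revision
        (["wiki", "old"], False),           # carryover revision
    ]
    stats, per_term = index_statistics(toy)
    print(stats)
    print(per_term["fresh_document_count"]["wiki"])
```

In this toy collection the carryover revision contributes to the collection-wide counts but not to any of the fresh counts, which is the distinction the new statistics are meant to capture.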


How To Build

Build it the same way as the original Lemur/Indri toolkit, i.e., from the toolkit directory:

make; make install

How To Run

Run it the same way as the original IndriBuildIndex (http://www.lemurproject.org/tutorials/interm_indexing-2.php):

./IndriBuildIndex <Indri indexer parameter file>

Input File

The input is mostly the same as the original Indri indexer parameter file format (http://www.lemurproject.org/lemur/indexing.php#IndriBuildIndex), with additional support for Berkeley DB as an input source.

Example parameter file for indexing carryovers via Berkeley DB:

<parameters>
  <index>/fs/webarc3/data/wikipedia/lemur_index/monthly/month-003</index>
  <indexType>indri</indexType>
  <corpus>
    <path>/fs/webarc3/data/wikipedia/bdb-monthly/month-003-co</path>
    <class>trectext_from_bdb</class>
  </corpus>
</parameters>

Example parameter file for indexing a monthly snapshot:

<parameters>
  <index>/fs/webarc3/data/wikipedia/lemur_index/monthly/month-003</index>
  <indexType>indri</indexType>
  <corpus>
    <path>/fs/webarc3/data/wikipedia/preprocessed-monthly/trec-month-003.xml</path>
    <class>trectext</class>
  </corpus>
</parameters>
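
Since the two example files differ only in the corpus path and class, they could be generated per month rather than written by hand. The following is a minimal sketch using Python's standard library; the function name build_parameters is hypothetical, and only the element names and values shown in the examples above are used.

```python
import xml.etree.ElementTree as ET


def build_parameters(index_dir, corpus_path, corpus_class):
    """Return an Indri parameter file as a string.

    Hypothetical helper: it only reproduces the elements that appear in the
    example parameter files (index, indexType, corpus/path, corpus/class).
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_dir
    ET.SubElement(root, "indexType").text = "indri"
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = corpus_class
    ET.indent(root, space="  ")  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    index_dir = "/fs/webarc3/data/wikipedia/lemur_index/monthly/month-003"
    # Carryovers read from Berkeley DB.
    print(build_parameters(
        index_dir,
        "/fs/webarc3/data/wikipedia/bdb-monthly/month-003-co",
        "trectext_from_bdb",
    ))
    # Monthly snapshot read from a TREC text file.
    print(build_parameters(
        index_dir,
        "/fs/webarc3/data/wikipedia/preprocessed-monthly/trec-month-003.xml",
        "trectext",
    ))
```

Each generated string can be saved to a file and passed to IndriBuildIndex as shown under How To Run. Both calls use the same index directory, as in the examples above.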

Output Files

An Indri index directory at the location specified in the input parameter file (<index>...</index>).

Notes

  • Because the new index format contains extra statistics, it is not compatible with the original Indri index format, so the original Lemur toolkit cannot read it.

Source Code

svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/lemur-4.10