Webarc:Lemur Indexer (modified)

From Adapt

Latest revision as of 16:37, 4 December 2009

What It Does

Given an input source, the Lemur/Indri Indexer (modified) either builds a new index or incrementally adds new documents to an existing index. The features we added on top of the original Lemur/Indri toolkit are listed below. The original Lemur/Indri toolkit can be found at http://www.lemurproject.org/.

  • New statistics parameters are included in the index. They are:
    • fresh document count: the number of wiki revisions in the index that belong to the monthly snapshot, not including those carried over from the previous snapshot.
    • term count in fresh documents: the number of terms appearing in the monthly snapshot, not including those carried over from the previous snapshot.
    • fresh document count for each term: the number of wiki revisions that contain a given term in the monthly snapshot, not including those carried over from the previous snapshot.
    • term count in fresh documents for each term: the number of occurrences of a given term in the monthly snapshot, not including those carried over from the previous snapshot.

Note that the new index also contains the previous statistics, which count documents and terms over the entire collection (monthly snapshot + carryover revisions).
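The difference between the fresh statistics and the whole-collection statistics can be illustrated with a small Python sketch. This is not the indexer's implementation: the toy revisions, the `is_fresh` flag, and the `collection_stats` helper are all hypothetical and only mirror the definitions above.

```python
from collections import Counter

# Hypothetical toy collection: (revision id, is_fresh, tokens).
# is_fresh=False marks a revision carried over from the previous snapshot.
revisions = [
    ("r1", True, ["indri", "index", "index"]),
    ("r2", True, ["lemur", "index"]),
    ("r3", False, ["indri", "toolkit"]),
]


def collection_stats(revs):
    """Return (document count, term count, per-term document count,
    per-term occurrence count) for the given revisions."""
    doc_count = 0
    term_count = 0
    doc_freq = Counter()   # number of revisions containing each term
    term_freq = Counter()  # number of occurrences of each term
    for _rev_id, _is_fresh, tokens in revs:
        doc_count += 1
        term_count += len(tokens)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per revision
    return doc_count, term_count, doc_freq, term_freq


# Statistics over the entire collection (snapshot + carryover revisions).
total_stats = collection_stats(revisions)

# Fresh statistics: carried-over revisions are excluded.
fresh_stats = collection_stats([r for r in revisions if r[1]])

print("total:", total_stats[0], "documents,", total_stats[1], "terms")
print("fresh:", fresh_stats[0], "documents,", fresh_stats[1], "terms")
```

In this toy data the term "indri" occurs in two revisions overall but in only one fresh revision, which is the kind of distinction the extra per-term statistics record.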

  • Berkeley DB support as a new input source: a new class, indri.parse.BDBTaggedDocumentIterator, is added, with minor changes to a number of related classes.


How To Build

Build it the same way as the original Lemur/Indri toolkit, i.e. under the toolkit directory:

make; make install

How To Run

Run it the same way as the original IndriBuildIndex (http://www.lemurproject.org/tutorials/interm_indexing-2.php):

./IndriBuildIndex <Indri indexer parameter file>

Input File

Mostly the same as the original Indri indexer parameter file format (http://www.lemurproject.org/lemur/indexing.php#IndriBuildIndex), with additional support for Berkeley DB as an input source.

Example parameter file for indexing carryovers via Berkeley DB.

<parameters>
  <index>/fs/webarc3/data/wikipedia/lemur_index/monthly/month-003</index>
  <indexType>indri</indexType>
  <corpus>
    <path>/fs/webarc3/data/wikipedia/bdb-monthly/month-003-co</path>
    <class>trectext_from_bdb</class>
  </corpus>
</parameters>

Example parameter file for indexing a monthly snapshot.

<parameters>
  <index>/fs/webarc3/data/wikipedia/lemur_index/monthly/month-003</index>
  <indexType>indri</indexType>
  <corpus>
    <path>/fs/webarc3/data/wikipedia/preprocessed-monthly/trec-month-003.xml</path>
    <class>trectext</class>
  </corpus>
</parameters>
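The two example files differ only in the corpus path and class, so they can be generated from one template. Below is a minimal Python sketch using the standard library; the `build_parameters` helper is hypothetical and not part of the toolkit, and the paths are simply the ones from the examples above.

```python
import xml.etree.ElementTree as ET


def build_parameters(index_path, corpus_path, corpus_class):
    """Return the text of a parameter file shaped like the examples above.

    Hypothetical helper for illustration; it only assembles the XML and
    does not run the indexer.
    """
    params = ET.Element("parameters")
    ET.SubElement(params, "index").text = index_path
    ET.SubElement(params, "indexType").text = "indri"
    corpus = ET.SubElement(params, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = corpus_class
    if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
        ET.indent(params, space="  ")
    return ET.tostring(params, encoding="unicode")


INDEX = "/fs/webarc3/data/wikipedia/lemur_index/monthly/month-003"

# Carryover revisions read from Berkeley DB.
carryover_xml = build_parameters(
    INDEX,
    "/fs/webarc3/data/wikipedia/bdb-monthly/month-003-co",
    "trectext_from_bdb",
)

# Monthly snapshot read from a TREC text file.
snapshot_xml = build_parameters(
    INDEX,
    "/fs/webarc3/data/wikipedia/preprocessed-monthly/trec-month-003.xml",
    "trectext",
)

print(carryover_xml)
print(snapshot_xml)
```

Each returned string can be written to a file and passed to ./IndriBuildIndex as the parameter file.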

Output Files

An Indri index directory at the location specified in the input parameter file (<index>...</index>).

Notes

  • Because the new index format contains extra statistics, the new index is not compatible with the original Indri index format and cannot be read by the original Lemur toolkit.

Source Code

svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/lemur-4.10