Webarc:Temporal Okapi Retrieval Method

From Adapt

Revision as of 15:10, 11 November 2009 by Scsong (talk | contribs)

What It Does

Temporal Okapi is implemented as a new retrieval method within the Lemur toolkit (see http://lemurproject.org/lemur/progexamples.php#exam-newmethod for details on adding a retrieval method). It is based on Okapi BM25 and uses the new statistics parameters added to the index when scoring.
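To illustrate why the collection statistics matter to an Okapi-style score, here is a minimal, self-contained sketch of the textbook BM25 term weight in which the statistics are passed in explicitly. It is not the code of TempOkapiRetMethod: the names CollectionStats and bm25TermScore, the default values of k1 and b, and the exact form of the IDF are assumptions made for illustration, and the actual implementation may differ.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the index-wide statistics the caller supplies.
struct CollectionStats {
    double docCount;      // number of documents in the collection (N)
    double avgDocLength;  // average document length in terms
};

// Textbook BM25 weight of a single query term in a single document.
// tf        : occurrences of the term in the document
// docLength : length of the document in terms
// docFreq   : number of documents containing the term
// k1, b     : usual BM25 tuning constants (default values are assumptions)
double bm25TermScore(double tf, double docLength, double docFreq,
                     const CollectionStats &stats,
                     double k1 = 1.2, double b = 0.75)
{
    assert(stats.docCount > 0 && stats.avgDocLength > 0);
    // inverse document frequency, computed from the supplied statistics
    double idf = std::log((stats.docCount - docFreq + 0.5) / (docFreq + 0.5));
    // document length normalization relative to the supplied average length
    double norm = k1 * ((1.0 - b) + b * docLength / stats.avgDocLength);
    return idf * (tf * (k1 + 1.0)) / (tf + norm);
}
```

Because docCount, avgDocLength and docFreq are plain arguments here, the same document receives a different score when the caller substitutes statistics gathered elsewhere for those of the local index.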

How To Use

Usage is similar to calling the original Lemur/Indri toolkit APIs, as described at http://lemurproject.org/lemur/progexamples.php. There are three main differences: instantiate TempOkapiRetMethod as your RetrievalMethod object, set the index-wide statistics before calling scoreCollection, and pass the term-wide statistics as a parameter when calling scoreCollection.

Example:

void runQuery(void)
{
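   // open index (backslashes in a Windows path must be escaped in a C++ string literal)
   const char *strIndex = "Y:\\data\\wikipedia\\lemur_index\\monthly\\month-003";
   lemur::api::Index *idx = new lemur::index::LemurIndriIndex(); 
   if (!(idx->open(strIndex))) {
         printf("Open Index Failed: %s\n", strIndex);
         fflush(stdout);
         delete idx;
         return;
   }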
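   // score accumulator sized to the number of documents in the index
   lemur::retrieval::ArrayAccumulator *accu = new lemur::retrieval::ArrayAccumulator(idx->docCount());
   // keep the concrete type: setCollectionStats and the scoreCollection overload
   // used below are assumed to be declared on TempOkapiRetMethod
   lemur::retrieval::TempOkapiRetMethod *retMethod = new lemur::retrieval::TempOkapiRetMethod(*idx, *accu);
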
   // obtain statistics first. With a single index, as in this example, this
   // step may seem redundant, but in real experiments these values are needed
   // to override the local statistics found in the current index.
   int docCount = idx->docCount();
   long colTermCount = idx->termCount();
   double docAverageLength = idx->docLengthAvg();   // kept as floating point to avoid truncating the average
   int termCountUnique = idx->termCountUnique();

   map<lemur::api::TERMID_T, double> docCount_t;
   lemur::parse::StringQuery *qryterms = new lemur::parse::StringQuery();

   // construct query and term statistics
   lemur::api::Term tt;
   tt.spelling("university");
   qryterms->add("university");
   lemur::api::TERMID_T ti = retMethod->getIndex()->term(tt.spelling());
   docCount_t.insert(pair<lemur::api::TERMID_T, double>(ti, idx->docCount(ti)));

   // reuse tt and ti for the second term (redeclaring them in the same scope does not compile)
   tt.spelling("maryland");
   qryterms->add("maryland");
   ti = retMethod->getIndex()->term(tt.spelling());
   docCount_t.insert(pair<lemur::api::TERMID_T, double>(ti, idx->docCount(ti)));

   // set index-wide stats
   retMethod->setCollectionStats(docAverageLength, docCount, colTermCount, termCountUnique);

   // build the query representation, then score docs against the query
   lemur::api::QueryRep *qr = retMethod->computeQueryRep(*qryterms);
   lemur::api::IndexedRealVector results;

   retMethod->scoreCollection(*qr, docCount_t, results);

   // results are returned in 'results'

   return;
}
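The comment in the example notes that, in real experiments, the statistics passed to setCollectionStats override the local ones. The page does not say how those values are produced, so the following is only a sketch under stated assumptions: it supposes the statistics come from several index partitions that share no documents (for example, monthly indexes), and it aggregates them with plain C++. The names IndexStats, CombinedStats and combineStats are hypothetical and not part of Lemur or of this project.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-partition statistics, e.g. read from one monthly index.
struct IndexStats {
    long docCount;   // documents in this partition
    long termCount;  // total term occurrences in this partition
};

// Hypothetical aggregated statistics for the whole set of partitions.
struct CombinedStats {
    long docCount;
    long termCount;
    double avgDocLength;
};

// Sum the counts over all partitions and derive the average document length
// from the totals (not by averaging the per-partition averages).
// Assumes the partitions do not share documents.
CombinedStats combineStats(const std::vector<IndexStats> &parts)
{
    CombinedStats total = {0, 0, 0.0};
    for (const IndexStats &part : parts) {
        assert(part.docCount >= 0 && part.termCount >= 0);
        total.docCount += part.docCount;
        total.termCount += part.termCount;
    }
    if (total.docCount > 0) {
        total.avgDocLength =
            static_cast<double>(total.termCount) / total.docCount;
    }
    return total;
}
```

The number of unique terms is deliberately left out: vocabularies of different partitions overlap, so it cannot be obtained by summing per-partition values and would need a merged vocabulary instead.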

Notes

N/A

Source Code

svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/lemur-4.10