Webarc:Temporal Okapi Retrieval Method

Revision as of 01:02, 10 November 2009 by Scsong (talk | contribs)

What It Does

TempOkapiRetMethod is implemented as a new retrieval method within the Lemur toolkit (see http://lemurproject.org/lemur/progexamples.php#exam-newmethod for details on adding a method). It is based on Okapi BM25 and scores documents using the new statistics parameters added to the index. The caller supplies these statistics explicitly, so they can differ from the local statistics of the index being searched.
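To illustrate why caller-supplied statistics matter, here is a minimal standalone sketch of a textbook BM25 term weight in which the collection statistics are passed in rather than read from an index. It is not the TempOkapiRetMethod source: the struct CollectionStats, the function bm25TermScore, and the defaults k1 = 1.2 and b = 0.75 are assumptions made for illustration, and the exact formula used in the toolkit may differ.

```cpp
#include <cassert>
#include <cmath>

// Collection-wide statistics supplied by the caller (hypothetical type,
// not part of Lemur). They need not match the index being scored.
struct CollectionStats {
    double docCount;      // N: number of documents in the collection
    double avgDocLength;  // average document length, in terms
};

// Textbook BM25 weight of one query term in one document.
//   tf        : occurrences of the term in the document
//   docLength : length of the document, in terms
//   docFreq   : number of documents containing the term (term-wide stat)
// k1 and b are the usual BM25 tuning constants (assumed defaults).
double bm25TermScore(double tf, double docLength, double docFreq,
                     const CollectionStats &stats,
                     double k1 = 1.2, double b = 0.75) {
    assert(stats.avgDocLength > 0.0);
    assert(docFreq >= 0.0 && docFreq <= stats.docCount);
    // Inverse document frequency from the supplied statistics.
    double idf = std::log((stats.docCount - docFreq + 0.5) / (docFreq + 0.5));
    // Length normalization relative to the supplied average length.
    double norm = k1 * ((1.0 - b) + b * docLength / stats.avgDocLength);
    return idf * (tf * (k1 + 1.0)) / (tf + norm);
}
```

Calling bm25TermScore with the same tf, docLength, and docFreq but a larger docCount yields a larger weight, which is the effect of overriding local statistics with collection-wide ones.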

How To Use

Usage is similar to calling the original Lemur/Indri toolkit API, as described in http://lemurproject.org/lemur/progexamples.php. The main differences are that you instantiate TempOkapiRetMethod as your RetrievalMethod object, set index-wide stats before calling scoreCollection, and pass term-wide stats as a parameter when calling scoreCollection.

Example:

void runQuery(void)
{
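   // open index (backslashes in a Windows path must be escaped in a C++ string literal)
   const char *strIndex = "Y:\\data\\wikipedia\\lemur_index\\monthly\\month-003";
   lemur::api::Index *idx = new lemur::index::LemurIndriIndex();
   if (!(idx->open(strIndex))) {
         printf("Open Index Failed: %s\n", strIndex);
         fflush(stdout);
         delete idx;
         return;
   }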
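   lemur::retrieval::ArrayAccumulator *accu = new lemur::retrieval::ArrayAccumulator(idx->docCount());
   // keep the concrete type: setCollectionStats and the three-argument scoreCollection
   // are not part of the base lemur::api::RetrievalMethod interface
   lemur::retrieval::TempOkapiRetMethod *retMethod = new lemur::retrieval::TempOkapiRetMethod(*idx, *accu);
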
   // Obtain statistics first. With a single index this step looks redundant, but in real
   // experiments these values override the local statistics found in the current index.
   int docCount = idx->docCount();
   long colTermCount = idx->termCount();
   double docAverageLength = idx->docLengthAvg();  // the average is fractional; do not truncate to int
   int termCountUnique = idx->termCountUnique();

   map<lemur::api::TERMID_T, double> docCount_t;
   lemur::parse::StringQuery *qryterms = new lemur::parse::StringQuery();

   // construct query and term statistics
   lemur::api::Term tt;
   tt.spelling("university");
   qryterms->add("university");
   lemur::api::TERMID_T ti = retMethod->getIndex()->term(tt.spelling());
   docCount_t.insert(pair<lemur::api::TERMID_T, double>(ti, idx->docCount(ti)));

   // reuse tt and ti for the second term (redeclaring them in the same scope would not compile)
   tt.spelling("maryland");
   qryterms->add("maryland");
   ti = retMethod->getIndex()->term(tt.spelling());
   docCount_t.insert(pair<lemur::api::TERMID_T, double>(ti, idx->docCount(ti)));

   // set index-wide stats
   retMethod->setCollectionStats(docAverageLength, docCount, colTermCount, termCountUnique);

   // build the query representation, then score docs against the query
   lemur::api::QueryRep *qr = retMethod->computeQueryRep(*qryterms);
   lemur::api::IndexedRealVector results;

   retMethod->scoreCollection(*qr, docCount_t, results);

   // results are returned in 'results'

   return;
}
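The example above reads its statistics from a single index. The page does not say how the overriding statistics are produced in real experiments, so the following standalone sketch only shows one plausible approach: summing per-index statistics (for instance, one index per month) into collection-wide values. The types IndexStats and MergedStats and the function mergeStats are hypothetical and not part of Lemur. Document frequencies are keyed by term spelling here because term IDs are local to each index.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Statistics taken from one index (hypothetical type). In the Lemur example
// these correspond to idx->docCount(), idx->termCount(), and idx->docCount(ti).
struct IndexStats {
    long docCount;                         // documents in this index
    long termCount;                        // total term occurrences in this index
    std::map<std::string, long> docFreq;   // term spelling -> documents containing it
};

// Collection-wide statistics accumulated over several indexes.
struct MergedStats {
    long docCount = 0;
    long termCount = 0;
    double avgDocLength = 0.0;
    std::map<std::string, double> docFreq;
};

// Sum the counts of every index and derive the overall average length.
// Assumes the indexes cover disjoint sets of documents.
MergedStats mergeStats(const std::vector<IndexStats> &parts) {
    MergedStats merged;
    for (const IndexStats &part : parts) {
        assert(part.docCount >= 0 && part.termCount >= 0);
        merged.docCount += part.docCount;
        merged.termCount += part.termCount;
        for (const auto &entry : part.docFreq) {
            merged.docFreq[entry.first] += entry.second;
        }
    }
    if (merged.docCount > 0) {
        merged.avgDocLength =
            static_cast<double>(merged.termCount) / merged.docCount;
    }
    return merged;
}
```

Before passing merged document frequencies to scoreCollection, each spelling would still have to be mapped to the TERMID_T of the index being scored, as the example does with term(tt.spelling()).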

Notes

N/A

Source Code

svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/lemur-4.10