Webarc:Temporal Search Client/Server

What It Does

This is an RPC server/client implementation for our temporal search experiments. The server provides two remote procedures for clients to call: getstats and search. getstats returns statistics (such as document count, term count, fresh document count, ...) for a provisional (probe) query, while search returns the actual search results for a query. Each search server is multithreaded and can serve multiple indexes simultaneously. Each search client is also multithreaded and issues each search request to multiple servers simultaneously.
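The client-side fan-out described above can be sketched as follows. This is a minimal illustration in Python, not the tsearch code: fan_out and merge_hits are hypothetical names, the RPC call is replaced by an arbitrary callable, and merging by descending score is an assumption, since this page does not say how the client combines results.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(servers, call):
    """Invoke call(server) on every server concurrently; return {server: result}."""
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        # pool.map preserves the order of the input servers.
        results = list(pool.map(call, servers))
    return dict(zip(servers, results))


def merge_hits(per_server_hits, max_hits):
    """Merge (doc_id, score) lists from all servers, best score first (assumed policy)."""
    merged = [hit for hits in per_server_hits for hit in hits]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:max_hits]
```

Here call stands in for whatever issues the search procedure against one server; max_hits plays the role of maxNoHits in the interface definition.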

RPC interface definition

struct searchhits_struct {
   string docId<>;
   double score;
   searchhits_struct *next;
};
typedef struct searchhits_struct searchhits;

struct terms_counts_struct {
   string term<>;
   unsigned int docCount;
   unsigned int termCount_high32;
   unsigned int termCount_low32;
   terms_counts_struct *next;
}; 
typedef struct terms_counts_struct terms_counts;

struct terms_no_counts_struct {
   string term<>;
   terms_no_counts_struct *next;
}; 
typedef struct terms_no_counts_struct terms_no_counts;

struct col_stats_struct {
   unsigned int docCount;
   float docAverageLength;
   unsigned int termCountUnique;
   unsigned int termCount_high32;
   unsigned int termCount_low32;
};
typedef col_stats_struct col_stats;
   
struct final_query_struct {
   terms_counts terms;
   unsigned int beginTime;
   unsigned int endTime;
   unsigned int maxNoHits;
   col_stats colStat;
};
typedef struct final_query_struct final_query;

struct probe_query_struct {
   terms_no_counts terms;
   unsigned int beginTime;
   unsigned int endTime;
};
typedef struct probe_query_struct probe_query;

program tsearch {
   version v1 {
      searchhits search(final_query) = 1;
      final_query getstats(probe_query) = 2;
   }=1;
}=0x00001000;
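The interface suggests a two-step exchange: a client sends a probe_query to getstats, then passes a final_query (terms with counts plus collection statistics) to search. How statistics from several servers are reconciled is not described on this page. The Python sketch below assumes the client simply sums the per-server figures and weights the average document length by document count; combine_col_stats is a hypothetical helper, and termCount stands for the 64-bit value already joined from its two 32-bit halves.

```python
def combine_col_stats(replies):
    """Combine per-server collection statistics into one global view.

    Each reply is a dict with docCount, docAverageLength and termCount.
    Summation is an assumed policy, not taken from the tsearch sources.
    """
    doc_count = sum(r["docCount"] for r in replies)
    term_count = sum(r["termCount"] for r in replies)
    # Weight each server's average length by the number of documents it holds.
    total_length = sum(r["docCount"] * r["docAverageLength"] for r in replies)
    average = total_length / doc_count if doc_count else 0.0
    return {
        "docCount": doc_count,
        "docAverageLength": average,
        "termCount": term_count,
    }
```

termCountUnique is left out on purpose: unique-term counts from different indexes overlap, so they cannot be added together.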

How To Build

In the tsearch directory, run:

make


How To Run

Run Search Server

cd bin
./tsearchsvr <timewindow configuration file>

Run Search Client

cd bin
./tsearchcli <rpc server list> <query terms list> <query spans list> <time windows list>

Input File

Search Server

<timewindow configuration file>: A file that lists the locations of the indexes and the retrieval method to be used.

Example contents in a list file:

1 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-001 TEMP_OKAPI
2 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-002 TEMP_OKAPI
3 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-003 TEMP_OKAPI
4 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-004 TEMP_OKAPI
5 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-005 TEMP_OKAPI
6 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-006 TEMP_OKAPI
7 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-007 TEMP_OKAPI
8 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-008 TEMP_OKAPI
9 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-009 TEMP_OKAPI
10 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-010 TEMP_OKAPI
11 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-011 TEMP_OKAPI
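Each line in the example has three whitespace-separated fields. A minimal Python sketch of a reader for this layout follows; parse_timewindow_config is a hypothetical name, and reading the first column as a time window number is an assumption based on the file's name. Index paths are assumed to contain no spaces.

```python
def parse_timewindow_config(text):
    """Parse lines of the form '<number> <index path> <retrieval method>'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        number, index_path, method = line.split()
        entries.append((int(number), index_path, method))
    return entries
```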

Search Client

<rpc server list>: A file that lists the RPC search servers to connect to. Example:

chimera00
chimera02
chimera03
chimera04
chimera05
chimera06
chimera07
chimera08

<query terms list>: A file that lists search phrases to be used. Example:

1901 uk census
2004 taxes income from business partnership deductions

<query spans list>: A file that lists the {begin time - end time} pairs to be used for each search phrase listed in <query terms list>. Example:

1-64
2-65
3-66
4-67
5-68
6-69
7-70
8-71
9-72
10-73
11-74
12-75
13-76
14-77
15-78
16-79
17-80
18-81
19-82
20-83
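As a rough Python illustration of how these two client files relate, the sketch below parses the span lines and pairs every phrase with every span. parse_spans and query_jobs are hypothetical names, and the full cross product is one reading of "for each search phrase", not something confirmed by the tsearch sources.

```python
from itertools import product


def parse_spans(text):
    """Parse 'begin-end' lines into (begin, end) integer pairs."""
    spans = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        begin, end = line.split("-")
        spans.append((int(begin), int(end)))
    return spans


def query_jobs(phrases, spans):
    """Pair every search phrase with every time span (assumed behaviour)."""
    return list(product(phrases, spans))
```

Under this reading, the 2 example phrases and 20 example spans would produce 40 queries.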

<time windows list>: A file that lists the temporal points where a new time window begins. The last number in the list is when the last time window ends. Example:

0
8
16
24
32
40
48
56
64
72
80
83
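The 12 boundary points in the example define 11 time windows. The Python sketch below pairs consecutive points into windows and finds the windows that a query span touches. The function names are hypothetical, and treating both windows and spans as half-open intervals, as well as numbering windows from 1, are assumptions; this page does not state whether the end points are inclusive.

```python
def windows_from_boundaries(points):
    """Pair consecutive boundary points into (begin, end) windows."""
    return list(zip(points[:-1], points[1:]))


def overlapping_windows(windows, begin, end):
    """Return the 1-based numbers of windows that intersect [begin, end)."""
    return [
        number
        for number, (w_begin, w_end) in enumerate(windows, start=1)
        if w_begin < end and begin < w_end
    ]
```

With the example boundaries, the span 1-64 would touch windows 1 through 8 under these assumptions.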

Output File

The search server writes no output file. The search client creates a Trec_Eval-compliant output file named <query spans list>_<time windows list> in the current directory.
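For reference, the standard trec_eval results format has six whitespace-separated columns per line: query id, the literal Q0, document id, rank, score, and run tag. A minimal Python sketch of producing such lines follows; trec_lines and the run tag "tsearch" are hypothetical, and the exact score formatting used by the client is not known from this page.

```python
def trec_lines(query_id, hits, run_tag="tsearch"):
    """Format ranked (doc_id, score) hits as trec_eval result lines."""
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.6f} {run_tag}"
        for rank, (doc_id, score) in enumerate(hits, start=1)
    ]
```

The hits are expected to be sorted by descending score already, so that rank 1 is the best match.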

Notes

  • When running tsearchcli, place <query spans list> and <time windows list> in the current directory, and use bare file names without any preceding directory component (i.e. 'qts.def' rather than './qts.def' or '/tmp/qts.def').
  • The current RPC implementation in Linux has the following limitations:
    • Although an individual procedure can spawn multiple threads to complete its job more quickly, the procedure itself cannot be multithreaded (i.e. multiple calls to the same procedure are processed sequentially).
    • There is no 8-byte data type support. Even the 'long' type on a 64-bit machine is not properly parsed by the RPC implementation. We worked around this issue by defining two 4-byte integer variables that hold the upper and lower halves of the 64-bit value (the termCount_high32 and termCount_low32 fields in the interface definition).
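The 64-bit workaround amounts to a shift and a mask on each side of the RPC call. The Python sketch below shows the arithmetic only; split_u64 and join_u64 are hypothetical names, not functions from the tsearch sources.

```python
MASK32 = 0xFFFFFFFF


def split_u64(value):
    """Split an unsigned 64-bit value into (high32, low32) halves."""
    if not 0 <= value < (1 << 64):
        raise ValueError("value does not fit in 64 bits")
    return (value >> 32) & MASK32, value & MASK32


def join_u64(high32, low32):
    """Rebuild the 64-bit value from its two 32-bit halves."""
    return (high32 << 32) | low32
```

For example, a term count of 5,000,000,000 does not fit in one unsigned 32-bit field, but splits into a high half of 1 and a low half of 705,032,704.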

Source Code

svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/tsearch