Webarc:Temporal Search Client/Server
What It Does
This is an RPC server/client implementation for our temporal search experiments. The server provides two remote procedures for clients to call: getstats and search. getstats takes a provisional (probe) query and returns statistics for it (such as doc count, term count, fresh doc count, ...), while search returns the actual search results for a query. Each search server is multithreaded and can handle multiple indexes simultaneously. Each search client is also multithreaded and issues search requests to multiple servers simultaneously.
RPC interface definition
struct searchhits_struct {
    string docId<>;
    double score;
    searchhits_struct *next;
};
typedef struct searchhits_struct searchhits;

struct terms_counts_struct {
    string term<>;
    unsigned int docCount;
    unsigned int termCount_high32;
    unsigned int termCount_low32;
    terms_counts_struct *next;
};
typedef struct terms_counts_struct terms_counts;

struct terms_no_counts_struct {
    string term<>;
    terms_no_counts_struct *next;
};
typedef struct terms_no_counts_struct terms_no_counts;

struct col_stats_struct {
    unsigned int docCount;
    float docAverageLength;
    unsigned int termCountUnique;
    unsigned int termCount_high32;
    unsigned int termCount_low32;
};
typedef col_stats_struct col_stats;

struct final_query_struct {
    terms_counts terms;
    unsigned int beginTime;
    unsigned int endTime;
    unsigned int maxNoHits;
    col_stats colStat;
};
typedef struct final_query_struct final_query;

struct probe_query_struct {
    terms_no_counts terms;
    unsigned int beginTime;
    unsigned int endTime;
};
typedef struct probe_query_struct probe_query;

program tsearch {
    version v1 {
        searchhits search(final_query) = 1;
        final_query getstats(probe_query) = 2;
    } = 1;
} = 0x00001000;
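The search procedure returns its results as a linked list of searchhits nodes. The following is a minimal sketch of how a client might walk such a list. The C struct is a hand-written mirror of what rpcgen typically generates for searchhits_struct (an assumption; the real generated header may differ), and count_hits and best_hit are hypothetical helper names, not part of tsearch.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written mirror of the C struct that rpcgen would typically emit
 * for searchhits_struct (assumption; not copied from the tsearch sources). */
struct searchhits_struct {
    char *docId;                     /* XDR "string docId<>" maps to char * */
    double score;
    struct searchhits_struct *next;  /* NULL marks the end of the list */
};
typedef struct searchhits_struct searchhits;

/* Hypothetical helper: count the nodes in a result list. */
size_t count_hits(const searchhits *head)
{
    size_t n = 0;
    for (const searchhits *h = head; h != NULL; h = h->next)
        n++;
    return n;
}

/* Hypothetical helper: return the node with the highest score.
 * The caller must pass a list with at least one node. */
const searchhits *best_hit(const searchhits *head)
{
    assert(head != NULL);
    const searchhits *best = head;
    for (const searchhits *h = head->next; h != NULL; h = h->next) {
        if (h->score > best->score)
            best = h;
    }
    return best;
}
```

How an empty result set is signalled is not specified in the interface above, so the sketch only covers lists that contain at least one hit.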
How To Build
In the tsearch directory, run:
make
How To Run
Run Search Server
cd bin
./tsearchsvr <timewindow configuration file>
Run Search Client
cd bin
./tsearchcli <rpc server list> <query terms list> <query spans list> <time windows list>
Input File
Search Server
<timewindow configuration file>: A file that lists the locations of the indexes and the retrieval method to be used.
Example contents in a list file:
1 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-001 TEMP_OKAPI
2 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-002 TEMP_OKAPI
3 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-003 TEMP_OKAPI
4 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-004 TEMP_OKAPI
5 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-005 TEMP_OKAPI
6 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-006 TEMP_OKAPI
7 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-007 TEMP_OKAPI
8 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-008 TEMP_OKAPI
9 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-009 TEMP_OKAPI
10 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-010 TEMP_OKAPI
11 /fs/webarc3/data/wikipedia/lemur_index/monthly/month-011 TEMP_OKAPI
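As an illustration of the configuration format, the sketch below parses one entry into its three parts. It assumes that each entry consists of three whitespace-separated fields (a number, an index path without spaces, and a retrieval method name), which matches the example above but is not confirmed by the tsearch sources. The names window_entry and parse_window_line are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one configuration entry. */
typedef struct {
    unsigned int id;        /* leading number of the entry */
    char index_path[512];   /* location of the index */
    char method[32];        /* retrieval method, e.g. TEMP_OKAPI */
} window_entry;

/* Parse one entry of the form "<number> <index path> <method>".
 * Returns 0 on success and -1 if the entry does not have three fields. */
int parse_window_line(const char *line, window_entry *out)
{
    assert(line != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    /* Field widths leave room for the terminating NUL byte. */
    int n = sscanf(line, "%u %511s %31s",
                   &out->id, out->index_path, out->method);
    return (n == 3) ? 0 : -1;
}
```

Fixed-size buffers keep the sketch short; paths longer than 511 characters would be truncated, so a real parser may want dynamic allocation.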
Search Client
<rpc server list>: A file that lists the locations of the RPC search servers to connect to. Example:
chimera00 chimera02 chimera03 chimera04 chimera05 chimera06 chimera07 chimera08
<query terms list>: A file that lists search phrases to be used. Example:
1901 uk census 2004 taxes income from business partnership deductions
<query spans list>: A file that lists {begin time - end time} pairs to be used for each search phrase listed in <query terms list>. Example:
1-64 2-65 3-66 4-67 5-68 6-69 7-70 8-71 9-72 10-73 11-74 12-75 13-76 14-77 15-78 16-79 17-80 18-81 19-82 20-83
<time windows list>: A file that lists the temporal points where a new time window begins. The last number in the list is when the last time window ends. Example:
0 8 16 24 32 40 48 56 64 72 80 83
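To make the relationship between query spans and time windows concrete, the sketch below finds which windows a span touches. It rests on two assumptions that the page does not state: window i covers the half-open range from boundary i up to (but not including) boundary i+1, and a query span includes both of its endpoints. The function name windows_for_span is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Find the time windows that overlap the span [begin, end].
 * bounds holds the boundary points (e.g. 0 8 16 ... 80 83), so there are
 * n_bounds - 1 windows, and window i is assumed to cover
 * [bounds[i], bounds[i+1]).
 * Returns the number of overlapping windows; when it is non-zero,
 * *first and *last receive the first and last overlapping window index. */
size_t windows_for_span(const unsigned *bounds, size_t n_bounds,
                        unsigned begin, unsigned end,
                        size_t *first, size_t *last)
{
    assert(bounds != NULL && n_bounds >= 2 && begin <= end);
    assert(first != NULL && last != NULL);
    size_t count = 0;
    for (size_t i = 0; i + 1 < n_bounds; i++) {
        unsigned lo = bounds[i];
        unsigned hi = bounds[i + 1];
        /* Overlap test: the span starts before the window ends and
         * ends at or after the window starts. */
        if (begin < hi && end >= lo) {
            if (count == 0)
                *first = i;
            *last = i;
            count++;
        }
    }
    return count;
}
```

Under these assumptions, the span 1-64 with the boundaries shown above touches windows 0 through 8. If spans are meant to exclude their end point, the second comparison would become `end > lo`.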
Output File
The search server produces no output file. The search client creates a Trec_Eval-compliant output file named <query spans list>_ in the current directory.
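For reference, a result line in the format read by trec_eval has six whitespace-separated fields: query id, the literal Q0, document id, rank, score, and run tag. The sketch below formats one such line. How tsearchcli chooses the query id and run tag is not documented here, so those parameters and the function name format_trec_line are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format one result line as "query_id Q0 doc_id rank score run_tag\n".
 * Returns the number of characters written, or -1 if a field contains
 * whitespace (which would break the column layout) or the buffer is
 * too small. */
int format_trec_line(char *buf, size_t buflen,
                     const char *query_id, const char *doc_id,
                     unsigned rank, double score, const char *run_tag)
{
    assert(buf != NULL && buflen > 0);
    if (strpbrk(query_id, " \t\n") != NULL ||
        strpbrk(doc_id, " \t\n") != NULL ||
        strpbrk(run_tag, " \t\n") != NULL)
        return -1;
    int n = snprintf(buf, buflen, "%s Q0 %s %u %.6f %s\n",
                     query_id, doc_id, rank, score, run_tag);
    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}
```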
Notes
- When running tsearchcli, place <query spans list> and <time windows list> in the current directory, and use bare file names without any preceding directory component (i.e., 'qts.def' rather than './qts.def' or '/tmp/qts.def').
- The current RPC implementation in Linux has the following limitations:
- Although an individual procedure can spawn multiple threads to complete its job more quickly, the procedure itself cannot be multithreaded (i.e., multiple calls to the same procedure are processed sequentially).
- There is no 8-byte data type support. Even the 'long' type on a 64-bit machine is not parsed properly by the RPC implementation. We worked around this issue by defining two 4-byte integer variables that hold the lower and upper halves of the 64-bit value (for example, termCount_low32 and termCount_high32).
Source Code
svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/tsearch