
Webarc:MediaWiki-to-TREC Converter

From Adapt

Revision as of 22:02, 9 November 2009 by Scsong (talk | contribs)

What it does

This tool converts the default MediaWiki XML dump format to the TREC-compliant format. For our experimental purposes, it also constructs a Java Berkeley DB for each month, which we call a 'Fresh DB'. A 'Fresh DB' contains records for all wiki articles updated within that month. Each record in the DB has the form { docID, (revision date, file name, offset) }.
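The record layout above can be sketched as a small Java value type. This is only an illustration: the class name FreshDbEntry, the field types, and the byte encoding are assumptions, not the tool's actual storage format, and the sketch deliberately does not call the Berkeley DB API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value part of a Fresh DB record, stored under the docID key.
// Field names, types, and encoding are assumptions made for illustration.
final class FreshDbEntry {
    final long revisionDateMillis; // revision date as epoch milliseconds (assumed)
    final String fileName;         // file that holds the article
    final long offset;             // byte offset of the article within that file

    FreshDbEntry(long revisionDateMillis, String fileName, long offset) {
        this.revisionDateMillis = revisionDateMillis;
        this.fileName = fileName;
        this.offset = offset;
    }

    // Encode the three fields as bytes so they can be stored as a DB value.
    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeLong(revisionDateMillis);
            out.writeUTF(fileName);
            out.writeLong(offset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Decode a value previously produced by toBytes().
    static FreshDbEntry fromBytes(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            long date = in.readLong();
            String name = in.readUTF();
            long off = in.readLong();
            return new FreshDbEntry(date, name, off);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a layout like this, looking up a docID would yield the file and offset at which to seek for the article's most recent revision in that month.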

Usage

In Eclipse, configure a run:

  1. Right-click on 'mwprep' in Package Explorer and select 'Run As --> Run Configurations...'.
  2. In the left pane, right-click on 'Java Application' and select 'New'.
  3. Enter 'mwprep' in the Name field in the right pane.
  4. Select 'mwprep' in the Project field.
  5. Select 'edu.umd.umiacs.mw.mwprep' in the Main class field.
  6. Click 'Apply'.


Export mwprep as a runnable JAR:

  1. Right-click on 'mwprep' in Package Explorer and select 'Export'.
  2. Select 'Java --> Runnable JAR file' and click 'Next'.
  3. Select 'lremonthlydumper20012007 - mwdumper' as the Launch configuration.
  4. Enter the path to lremonthlydumper.jar as the Export destination.
  5. Select 'Package required libraries into generated JAR'.
  6. Click 'Finish'.

In a shell terminal (or a command prompt on Windows), change to the directory where lremonthlydumper.jar is located, then run:

mkdir extracted-monthly
java -jar lremonthlydumper.jar <WikiDump.lst>

Input

<WikiDump.lst>: A file that lists the locations of the MediaWiki XML dump files, one path per line. Example contents of a list file:

/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-00
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-01
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-02
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-03
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-04
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-05
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-06
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-07
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-08
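A list file in this format could be read with a few lines of Java. The sketch below is not taken from the tool's source: the class name DumpListReader is hypothetical, and skipping blank lines and trimming whitespace are assumptions about how the list should be treated.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that turns a list file into a list of dump-file paths.
final class DumpListReader {
    // Keep one path per non-blank line, with surrounding whitespace removed.
    static List<String> parse(List<String> lines) {
        List<String> paths = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    // Read the list file from disk (UTF-8) and parse it.
    static List<String> read(String listFile) throws IOException {
        return parse(Files.readAllLines(Paths.get(listFile)));
    }
}
```

For example, DumpListReader.read("WikiDump.lst") would return the dump-file paths in the order they appear in the file.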


Output

Under the 'extracted-monthly' directory, MediaWiki XML dump files are generated, one for each month. The file month-<k>.xml contains the wiki articles extracted for the k-th month.
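The mapping from a calendar month to an output file name might look like the sketch below. It assumes that months are counted from January 2001 (the start of the hard-coded range) and that k starts at 0; neither is confirmed by this page, so check the source before relying on it. The class name MonthIndex is hypothetical.

```java
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

// Hypothetical mapping from a calendar month to the month-<k>.xml file name.
final class MonthIndex {
    // Assumed origin of the numbering: the first month of the hard-coded range.
    static final YearMonth FIRST_MONTH = YearMonth.of(2001, 1);

    // Number of whole months between FIRST_MONTH and the given month (0-based, assumed).
    static long indexOf(YearMonth month) {
        return ChronoUnit.MONTHS.between(FIRST_MONTH, month);
    }

    // Output file name for the given month.
    static String fileNameFor(YearMonth month) {
        return "month-" + indexOf(month) + ".xml";
    }
}
```

Under these assumptions, the range January 2001 to December 2007 spans 84 months, with indices 0 through 83.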

Notes

The range of months (January 2001 to December 2007) is hard-coded in org.mediawiki.dumper.LREMonthlyDumper20012007, as is the output directory. Change these as necessary or, even better, parameterize them.
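One possible way to parameterize these values is to accept them as optional command-line arguments and fall back to the current hard-coded defaults. The sketch below is a suggestion only: the class DumperOptions and the argument order are hypothetical and do not exist in the current source.

```java
import java.time.YearMonth;

// Hypothetical options holder; the existing dumper hard-codes these values.
final class DumperOptions {
    final YearMonth firstMonth;
    final YearMonth lastMonth;
    final String outputDir;

    DumperOptions(YearMonth firstMonth, YearMonth lastMonth, String outputDir) {
        this.firstMonth = firstMonth;
        this.lastMonth = lastMonth;
        this.outputDir = outputDir;
    }

    // Assumed argument order: <WikiDump.lst> [first yyyy-MM] [last yyyy-MM] [outputDir]
    // Missing arguments fall back to the values that are hard-coded today.
    static DumperOptions fromArgs(String[] args) {
        YearMonth first = args.length > 1 ? YearMonth.parse(args[1]) : YearMonth.of(2001, 1);
        YearMonth last = args.length > 2 ? YearMonth.parse(args[2]) : YearMonth.of(2007, 12);
        String dir = args.length > 3 ? args[3] : "extracted-monthly";
        if (last.isBefore(first)) {
            throw new IllegalArgumentException("last month is before first month");
        }
        return new DumperOptions(first, last, dir);
    }
}
```

With this arrangement, running the JAR with only the list file would behave as it does now, while extra arguments such as 2008-01 2008-12 out-2008 would select a different range and directory.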

Source Code

svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/mwdumper