Webarc:MediaWiki-to-TREC Converter
What it does
This tool converts the default MediaWiki XML dump format to the TREC-compliant format. For our experimental purposes, it also builds a Java Berkeley DB for each month, which we call a 'Fresh DB'. A Fresh DB contains records for all wiki articles updated within that month. Each record in the DB has the form { docID, (revision date, file name, offset) }.
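The record layout described above can be pictured with a small in-memory stand-in. This is only a sketch: the class names FreshRecord and FreshDbSketch, the field types, the example file name, and the use of a TreeMap in place of the real Berkeley DB are assumptions made for illustration, not the tool's actual code.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical value part of a Fresh DB record: (revision date, file name, offset). */
final class FreshRecord {
    final String revisionDate; // assumed to be an ISO 8601 timestamp string
    final String fileName;     // file that holds the converted article
    final long offset;         // assumed to be a byte offset within that file

    FreshRecord(String revisionDate, String fileName, long offset) {
        this.revisionDate = revisionDate;
        this.fileName = fileName;
        this.offset = offset;
    }

    @Override
    public String toString() {
        return "(" + revisionDate + ", " + fileName + ", " + offset + ")";
    }
}

/** In-memory stand-in for one month's Fresh DB, keyed by docID. */
final class FreshDbSketch {
    private final Map<String, FreshRecord> records = new TreeMap<>();

    /** Stores a record; a later put for the same docID replaces the earlier one (an assumption). */
    void put(String docId, FreshRecord rec) {
        records.put(docId, rec);
    }

    FreshRecord get(String docId) {
        return records.get(docId);
    }

    int size() {
        return records.size();
    }
}
```

A lookup by docID then yields the file and offset at which the article can be read.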
Usage
In Eclipse, create a run configuration.
- Right-click on 'mwprep' in Package Explorer, select 'Run As --> Run Configurations'.
- In the left pane, right-click on 'Java Application' and select 'New'.
- Enter 'mwprep' in the Name field on the right pane.
- Select 'mwprep' in the Project field.
- Select 'edu.umd.umiacs.mw.mwprep' in the Main class field
- Click 'Apply'
Export mwprep as a runnable JAR.
- Right-click on 'mwprep' in Package Explorer, select 'export'.
- Select 'lremonthlydumper20012007 - mwdumper' as Launch configuration.
- Enter lremonthlydumper.jar as the Export destination.
- Select 'Package required libraries into generated JAR'
- Click 'Finish'
In a shell terminal (or a command line prompt in Windows), change directory to where lremonthlydumper.jar is located.
mkdir extracted-monthly
java -jar lremonthlydumper.jar <WikiDump.lst>
Input
<WikiDump.lst>: A file that lists the locations of the MediaWiki XML dump files.
Example contents of a list file:
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-00
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-01
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-02
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-03
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-04
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-05
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-06
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-07
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-08
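For illustration, a list file like the example above could be read as follows. This is a minimal sketch, not the tool's own parser: the class name DumpListSketch is hypothetical, and it assumes that paths contain no whitespace, so that splitting on whitespace handles both one path per line and several paths on a line.

```java
import java.util.ArrayList;
import java.util.List;

final class DumpListSketch {
    /** Splits list-file contents on any run of whitespace and returns the paths in order. */
    static List<String> parse(String contents) {
        List<String> paths = new ArrayList<>();
        for (String token : contents.trim().split("\\s+")) {
            if (!token.isEmpty()) { // an empty file yields one empty token; skip it
                paths.add(token);
            }
        }
        return paths;
    }
}
```

The contents can be loaded with, for example, new String(Files.readAllBytes(Paths.get("WikiDump.lst")), StandardCharsets.UTF_8) before calling DumpListSketch.parse.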
Output
Under the 'extracted-monthly' directory, one MediaWiki XML dump file is generated for each month. The file month-<k>.xml contains the wiki articles extracted for the k-th month.
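One plausible way to map a revision timestamp to the index k is sketched here. It assumes that k is 1-based and counts from January 2001 (the start of the range given in the Notes), and that timestamps are ISO 8601 UTC strings as found in MediaWiki dumps; the class and method names are hypothetical, and the actual numbering used by the tool may differ.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

final class MonthIndexSketch {
    // First month of the range; January 2001 is taken from the Notes section.
    static final YearMonth START = YearMonth.of(2001, 1);

    /** Returns k for a revision timestamp, assuming 1-based numbering from START. */
    static long monthIndex(String isoTimestamp) {
        YearMonth ym = YearMonth.from(Instant.parse(isoTimestamp).atZone(ZoneOffset.UTC));
        return ChronoUnit.MONTHS.between(START, ym) + 1;
    }

    /** Builds the output file name for the month that contains the timestamp. */
    static String fileNameFor(String isoTimestamp) {
        return "month-" + monthIndex(isoTimestamp) + ".xml";
    }
}
```

Under these assumptions, a revision from January 2001 lands in month-1.xml and one from December 2007 in month-84.xml.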
Notes
The range of months is hard-coded as January 2001 to December 2007 in org.mediawiki.dumper.LREMonthlyDumper20012007, and so is the output directory. Change these as necessary or, better still, parameterize them.
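As a rough idea of what such parameterization might look like, the sketch below reads the month range and output directory from the command line and falls back to the current values. The class name DumperConfigSketch, the argument order, and the yyyy-MM month format are assumptions; the default 'extracted-monthly' is inferred from the Usage and Output sections rather than from the source code.

```java
import java.time.YearMonth;

/** Hypothetical settings holder; the defaults mirror the hard-coded values described above. */
final class DumperConfigSketch {
    final String listFile;
    final String outputDir;
    final YearMonth firstMonth;
    final YearMonth lastMonth;

    DumperConfigSketch(String listFile, String outputDir, YearMonth firstMonth, YearMonth lastMonth) {
        if (lastMonth.isBefore(firstMonth)) {
            throw new IllegalArgumentException("last month precedes first month");
        }
        this.listFile = listFile;
        this.outputDir = outputDir;
        this.firstMonth = firstMonth;
        this.lastMonth = lastMonth;
    }

    /** Parses: <WikiDump.lst> [outputDir] [firstMonth yyyy-MM] [lastMonth yyyy-MM]. */
    static DumperConfigSketch fromArgs(String[] args) {
        if (args.length < 1 || args.length > 4) {
            throw new IllegalArgumentException(
                "usage: <WikiDump.lst> [outputDir] [firstMonth] [lastMonth]");
        }
        // Omitted arguments fall back to the values that are hard-coded today.
        String outputDir = args.length > 1 ? args[1] : "extracted-monthly";
        YearMonth first = args.length > 2 ? YearMonth.parse(args[2]) : YearMonth.of(2001, 1);
        YearMonth last = args.length > 3 ? YearMonth.parse(args[3]) : YearMonth.of(2007, 12);
        return new DumperConfigSketch(args[0], outputDir, first, last);
    }
}
```

With this shape, an invocation that passes only the list file would behave as it does now, while extra arguments such as out 2002-03 2002-05 would narrow the range and redirect the output.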
Source Code
svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/mwdumper