Webarc:LRE Monthly Dumper
What it does
From the source Wikimedia XML dump files, it extracts, for each article, the latest revision valid at the end of each month (that is, as of 11:59:59.99 pm on January 31, February 28, March 31, and so on). It also filters out minor edits and redirects.
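The month-end selection described above can be sketched as follows. This is a minimal illustration, not the code in mwdumper: the `Rev` class, the `cutoff` and `latestAsOf` methods, and the choice of UTC for the month boundary are all assumptions made for the example. It also assumes that "filtering out minor edits" means minor revisions are skipped when picking the latest one; redirect filtering is not shown.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Optional;

public class MonthEndSelection {

    /** Hypothetical stand-in for a parsed revision; not an mwdumper class. */
    static final class Rev {
        final Instant timestamp;
        final boolean minor;
        final String label;

        Rev(Instant timestamp, boolean minor, String label) {
            this.timestamp = timestamp;
            this.minor = minor;
            this.label = label;
        }
    }

    /** Exclusive cutoff for a month: the first instant of the next month (UTC assumed). */
    static Instant cutoff(YearMonth month) {
        return month.plusMonths(1).atDay(1).atStartOfDay(ZoneOffset.UTC).toInstant();
    }

    /** Latest non-minor revision strictly before the month's cutoff, if any. */
    static Optional<Rev> latestAsOf(List<Rev> revisions, YearMonth month) {
        Instant end = cutoff(month);
        Rev best = null;
        for (Rev r : revisions) {
            if (r.minor || !r.timestamp.isBefore(end)) {
                continue; // skip minor edits and revisions made after the month ended
            }
            if (best == null || r.timestamp.isAfter(best.timestamp)) {
                best = r;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        List<Rev> revisions = List.of(
            new Rev(Instant.parse("2001-01-15T10:00:00Z"), false, "r1"),
            new Rev(Instant.parse("2001-01-31T23:59:59Z"), false, "r2"),
            new Rev(Instant.parse("2001-02-10T08:30:00Z"), true, "r3"));
        for (YearMonth m : List.of(YearMonth.of(2001, 1), YearMonth.of(2001, 2))) {
            String picked = latestAsOf(revisions, m).map(r -> r.label).orElse("(none)");
            System.out.println(m + " -> " + picked);
        }
    }
}
```

Using an exclusive cutoff at the start of the following month avoids hard-coding month lengths, so leap-year Februaries need no special case.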
Usage
In Eclipse, export mwdumper as a runnable JAR.
1. Right-click on 'mwdumper' in Package Explorer, select 'export'.
2. Select 'lremonthlydumper20012007 - mwdumper' as Launch configuration.
3. Put lremonthlydumper.jar in Export destination.
4. Select 'Package required libraries into generated JAR'.
5. Click 'Finish'.
In a shell terminal (or a command line prompt in Windows), change directory to where lremonthlydumper.jar is located.
mkdir extracted-monthly
java -jar lremonthlydumper.jar <WikiDump.lst>
Input
<WikiDump.lst>: A file that lists the locations of Wikimedia XML dump files.
Example contents of a list file:
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-00
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-01
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-02
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-03
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-04
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-05
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-06
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-07
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-08
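Because a run over the full history can take a long time, it may help to check the list file before starting. The sketch below is a standalone pre-flight check, not part of mwdumper: the class name `DumpListReader` and its `parse` method are hypothetical, and it assumes the list file holds one dump path per line, with blank lines ignored.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class DumpListReader {

    /** Turns list-file lines into paths, ignoring blank lines and surrounding whitespace. */
    static List<Path> parse(List<String> lines) {
        List<Path> paths = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                paths.add(Paths.get(trimmed));
            }
        }
        return paths;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DumpListReader <WikiDump.lst>");
            return;
        }
        for (Path dump : parse(Files.readAllLines(Paths.get(args[0])))) {
            // Report missing dump files before a long extraction run starts.
            System.out.println((Files.isReadable(dump) ? "ok       " : "MISSING  ") + dump);
        }
    }
}
```

Run it with the same list file that would be passed to lremonthlydumper.jar; any line marked MISSING points at a path that is not readable from the current machine.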
Output
The 'extracted-monthly' directory contains the output Wikimedia XML dump files, one for each month. month-<k>.xml is the output file for the kth month.
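To relate an output file to a calendar month, a small helper like the one below can be used. It is only a sketch under two assumptions that this page does not confirm: that k is counted from 1, and that the first month of the range is January 2001 (guessed from the class name LREMonthlyDumper20012007). The class `MonthlyOutputNames` and its methods are hypothetical; check the hard-coded range in the source before relying on the mapping.

```java
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

public class MonthlyOutputNames {

    // Assumption: the hard-coded range starts in January 2001.
    static final YearMonth ASSUMED_START = YearMonth.of(2001, 1);

    /** File name for the kth month, following the month-<k>.xml pattern. */
    static String fileName(int k) {
        return "month-" + k + ".xml";
    }

    /** Calendar month for index k, assuming k = 1 is the first month of the range. */
    static YearMonth monthFor(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1");
        }
        return ASSUMED_START.plusMonths(k - 1);
    }

    /** Inverse of monthFor: the index k of a given calendar month. */
    static int indexFor(YearMonth month) {
        return (int) ASSUMED_START.until(month, ChronoUnit.MONTHS) + 1;
    }

    public static void main(String[] args) {
        YearMonth wanted = YearMonth.of(2004, 6);
        int k = indexFor(wanted);
        System.out.println(wanted + " -> extracted-monthly/" + fileName(k));
    }
}
```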
Notes
The range of months and the output directory name are both hard-coded in org.mediawiki.dumper.LREMonthlyDumper20012007; both can be easily changed in the source.
Source Code
svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/mwdumper