Webarc:LRE Monthly Dumper
What It Does
From the source Wikimedia XML dump files, it extracts, for each article, the latest revision that was current at the end of each month (that is, at 11:59:59.99 pm on January 31, February 28, March 31, and so on). It also filters out minor edits and redirects.
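The month-end selection described above can be sketched as follows. This is a minimal illustration, not the mwdumper implementation: the `MonthEndSnapshot` class and its `Rev` type are hypothetical, timestamps are assumed to be UTC, and the filtering policy (skip minor revisions as candidates, drop the article for that month if the selected revision is a redirect) is one possible reading of "filters out minor edits and redirects".

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Optional;

public class MonthEndSnapshot {
    /** Hypothetical, simplified revision; not a class from mwdumper. */
    public static final class Rev {
        public final Instant timestamp;
        public final boolean minor;
        public final String text;

        public Rev(String isoTimestamp, boolean minor, String text) {
            this.timestamp = Instant.parse(isoTimestamp);
            this.minor = minor;
            this.text = text;
        }

        /** MediaWiki redirect pages start with "#REDIRECT" (case-insensitive). */
        public boolean isRedirect() {
            return text.trim().regionMatches(true, 0, "#REDIRECT", 0, 9);
        }
    }

    /** First instant of the following month in UTC, used as an exclusive cutoff. */
    public static Instant cutoff(YearMonth month) {
        return month.plusMonths(1).atDay(1).atStartOfDay(ZoneOffset.UTC).toInstant();
    }

    /**
     * Assumed policy: pick the latest non-minor revision made before the
     * cutoff, then discard the result if that revision is a redirect.
     */
    public static Optional<Rev> snapshot(List<Rev> history, YearMonth month) {
        Instant end = cutoff(month);
        Rev latest = null;
        for (Rev r : history) {
            if (r.minor || !r.timestamp.isBefore(end)) {
                continue; // minor edit, or made after the month ended
            }
            if (latest == null || r.timestamp.isAfter(latest.timestamp)) {
                latest = r;
            }
        }
        return Optional.ofNullable(latest).filter(r -> !r.isRedirect());
    }
}
```

Comparing against the first instant of the next month, rather than a literal 11:59:59.99 pm timestamp, includes every revision made on the last day and needs no special case for leap years.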
How To Build
In Eclipse, configure a run
- Right-click on 'mwdumper' in Package Explorer, select 'Run As --> Run Configurations...'.
- In the left pane, right-click on 'Java Application' and select 'New'.
- Enter 'mwdumper' in the Name field on the right pane.
- Select 'mwdumper' in the Project field.
- Select 'org.mediawiki.dumper.LREMonthlyDumper20012007p' in the Main class field
- Click 'Apply'
- Click 'Close'
In Eclipse, export mwdumper as a runnable JAR.
- Right-click on 'mwdumper' in Package Explorer, select 'Export...'.
- Select 'Java --> Runnable JAR file' and click 'Next'.
- Select 'mwdumper - mwdumper' as Launch configuration.
- Put <your directory>/mwdumper.jar in Export destination
- Select 'Package required libraries into generated JAR'
- Click 'Finish'
How To Run
In a shell terminal (or a command prompt on Windows), change directory to where mwdumper.jar is located (<your directory> above).
mkdir extracted-monthly
java -jar mwdumper.jar <WikiDump.lst>
Input File
<WikiDump.lst>: A file that lists the locations of Wikimedia XML dump files.
Example contents of a list file:
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-00
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-01
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-02
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-03
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-04
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-05
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-06
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-07
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-08
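Because a run over the full history is long, it may be worth checking the list file before starting. The following is a minimal sketch of such a check, assuming one path per line and that blank lines can be ignored. The `DumpList` class is hypothetical and is not part of mwdumper.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class DumpList {
    /** Turns the lines of a list file into paths, skipping blank lines (assumed format). */
    public static List<Path> parse(List<String> lines) {
        List<Path> dumps = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                dumps.add(Paths.get(trimmed));
            }
        }
        return dumps;
    }

    /** Reads the list file (as UTF-8) and parses it. */
    public static List<Path> read(Path listFile) throws IOException {
        return parse(Files.readAllLines(listFile));
    }

    /** Prints each listed dump file with a marker showing whether it is readable. */
    public static void main(String[] args) throws IOException {
        for (Path dump : read(Paths.get(args[0]))) {
            System.out.println((Files.isReadable(dump) ? "ok       " : "MISSING  ") + dump);
        }
    }
}
```

Run it with the list file as the only argument, for example `java DumpList WikiDump.lst`.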
Output Files
Under the 'extracted-monthly' directory, one Wikimedia XML dump file is generated for each month. The file month-<k>.xml contains the wiki articles extracted for the kth month.
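To find which calendar month a given output file covers, the index k can be converted as sketched below. The sketch assumes that k counts from 1 and that month-1.xml is January 2001. The page does not state the starting index, so check it against the dumper source before relying on it. The `MonthIndex` class is hypothetical.

```java
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

public class MonthIndex {
    // Assumption: k = 1 is January 2001; the real dumper may count from 0.
    public static final YearMonth FIRST = YearMonth.of(2001, 1);

    /** Calendar month covered by month-<k>.xml under the 1-based assumption. */
    public static YearMonth monthOf(int k) {
        return FIRST.plusMonths(k - 1L);
    }

    /** Inverse mapping: index k of a calendar month. */
    public static int indexOf(YearMonth month) {
        return (int) FIRST.until(month, ChronoUnit.MONTHS) + 1;
    }

    /** Output file name for index k, following the month-<k>.xml pattern. */
    public static String fileName(int k) {
        return "month-" + k + ".xml";
    }
}
```

Under this assumption the range January 2001 to December 2007 spans 84 files, month-1.xml to month-84.xml.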
Notes
The range of months is hard-coded as January 2001 through December 2007 in org.mediawiki.dumper.LREMonthlyDumper20012007, and so is the output directory. Change these as necessary or, even better, parameterize them.
Source Code
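One way to parameterize these values is to read them from optional command-line arguments and fall back to the current hard-coded values. The sketch below is only a suggestion: the `DumperOptions` class, the argument order and the yyyy-MM month format are all assumptions, and nothing like it exists in the mwdumper source described here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.YearMonth;

public class DumperOptions {
    public final String listFile;
    public final YearMonth first;
    public final YearMonth last;
    public final Path outputDir;

    private DumperOptions(String listFile, YearMonth first, YearMonth last, Path outputDir) {
        this.listFile = listFile;
        this.first = first;
        this.last = last;
        this.outputDir = outputDir;
    }

    /**
     * Hypothetical argument layout:
     *   <WikiDump.lst> [first month yyyy-MM] [last month yyyy-MM] [output dir]
     * Omitted arguments default to the values that are hard-coded today.
     */
    public static DumperOptions parse(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException(
                "usage: <WikiDump.lst> [first yyyy-MM] [last yyyy-MM] [output dir]");
        }
        YearMonth first = args.length > 1 ? YearMonth.parse(args[1]) : YearMonth.of(2001, 1);
        YearMonth last = args.length > 2 ? YearMonth.parse(args[2]) : YearMonth.of(2007, 12);
        if (last.isBefore(first)) {
            throw new IllegalArgumentException("last month is before first month");
        }
        Path outputDir = Paths.get(args.length > 3 ? args[3] : "extracted-monthly");
        return new DumperOptions(args[0], first, last, outputDir);
    }
}
```

With this layout, `java -jar mwdumper.jar WikiDump.lst` would behave as it does now, while `java -jar mwdumper.jar WikiDump.lst 2005-01 2005-12 out-2005` would restrict the run to one year and write to a different directory.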
svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/mwdumper