Webarc:LRE Monthly Dumper
From Adapt
Revision as of 21:42, 9 November 2009
What it does
From the source WikiMedia XML dump files, it extracts, for each article, the latest revision valid at the end of each month (that is, as of 11:59:59.99 pm on January 31, February 28, March 31, and so on). It also filters out minor edits and redirects.
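The selection rule described above can be sketched roughly as follows. This is an illustration only, not the code in org.mediawiki.dumper.LREMonthlyDumper20012007: the `Rev` class, the `cutoff` and `snapshot` helpers, and the order in which minor edits and redirects are filtered are all assumptions. The cutoff is modeled as the first instant of the following month, which also covers February 29; how the actual tool treats leap years is not stated here.

```java
import java.time.LocalDateTime;
import java.time.YearMonth;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class MonthEndSnapshotSketch {

    /** Hypothetical, simplified revision record; not one of mwdumper's own classes. */
    static final class Rev {
        final LocalDateTime timestamp;
        final boolean minor;
        final boolean redirect;
        final String text;

        Rev(LocalDateTime timestamp, boolean minor, boolean redirect, String text) {
            this.timestamp = timestamp;
            this.minor = minor;
            this.redirect = redirect;
            this.text = text;
        }
    }

    /** First instant of the following month; a revision counts for `month` if it is strictly earlier. */
    static LocalDateTime cutoff(YearMonth month) {
        return month.plusMonths(1).atDay(1).atStartOfDay();
    }

    /**
     * Latest non-minor revision of a single article made before the cutoff.
     * Assumption: minor edits are skipped first, and an article whose selected
     * revision is a redirect is left out of that month entirely.
     */
    static Optional<Rev> snapshot(List<Rev> history, YearMonth month) {
        LocalDateTime end = cutoff(month);
        Rev latest = null;
        for (Rev r : history) {
            if (r.minor || !r.timestamp.isBefore(end)) {
                continue;
            }
            if (latest == null || r.timestamp.isAfter(latest.timestamp)) {
                latest = r;
            }
        }
        if (latest == null || latest.redirect) {
            return Optional.empty();
        }
        return Optional.of(latest);
    }

    public static void main(String[] args) {
        List<Rev> history = Arrays.asList(
            new Rev(LocalDateTime.of(2004, 2, 10, 9, 0), false, false, "first draft"),
            new Rev(LocalDateTime.of(2004, 2, 29, 23, 59), true, false, "typo fix"),
            new Rev(LocalDateTime.of(2004, 3, 1, 0, 0), false, false, "march rewrite"));
        // The minor edit is skipped and the March revision falls after the cutoff.
        System.out.println(
            snapshot(history, YearMonth.of(2004, 2)).map(r -> r.text).orElse("(none)"));
    }
}
```

In this sketch an article with no qualifying revision in a month simply yields an empty result; whether the real tool omits such articles or carries an earlier revision forward is determined by its own source code.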
Usage
In Eclipse, export mwdumper as a runnable JAR.
- Right-click on 'mwdumper' in Package Explorer, select 'export'.
- Select 'lremonthlydumper20012007 - mwdumper' as Launch configuration.
- Enter lremonthlydumper.jar as the Export destination.
- Select 'Package required libraries into generated JAR'
- Click 'Finish'
In a shell terminal (or a command line prompt in Windows), change directory to where lremonthlydumper.jar is located.
mkdir extracted-monthly
java -jar lremonthlydumper.jar <WikiDump.lst>
Input
<WikiDump.lst>: A file that lists the locations of the WikiMedia XML dump files. Example contents of a list file:
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-00
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-01
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-02
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-03
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-04
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-05
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-06
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-07
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-08
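Because a long run over many dump files is costly to restart, it may help to check the list file before launching the dumper. The following is a minimal sketch of such a pre-flight check; the class `DumpListReaderSketch` is hypothetical and not part of mwdumper, and it assumes that the listed locations contain no spaces, so splitting on whitespace is safe. It does not claim to mirror how the tool itself parses the list.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class DumpListReaderSketch {

    /** Splits list-file content on whitespace, so one-per-line and space-separated entries both work. */
    static List<String> parseLocations(String content) {
        List<String> locations = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                locations.add(token);
            }
        }
        return locations;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DumpListReaderSketch <WikiDump.lst>");
            return;
        }
        String content = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        // Report each listed location and whether it can be read from this machine.
        for (String location : parseLocations(content)) {
            Path p = Paths.get(location);
            System.out.println((Files.isReadable(p) ? "ok      " : "missing ") + location);
        }
    }
}
```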
Output
The tool writes WikiMedia XML dump files, one for each month, under the 'extracted-monthly' directory. month-<k>.xml is the output file for the kth month.
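When post-processing the output, note that a plain lexical sort places month-10.xml before month-2.xml. The sketch below orders file names by their numeric k instead. The class and method names are hypothetical, and it assumes only the month-<k>.xml naming pattern stated above; it makes no assumption about whether k starts at 0 or 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MonthFileOrderSketch {

    // Matches names such as month-7.xml and captures the numeric index k.
    private static final Pattern MONTH_FILE = Pattern.compile("month-(\\d{1,9})\\.xml");

    /** Extracts k from a name of the form month-<k>.xml, or empty if the name does not match. */
    static OptionalInt monthIndex(String fileName) {
        Matcher m = MONTH_FILE.matcher(fileName);
        return m.matches() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    /** Keeps only month-<k>.xml names and returns them in ascending order of k. */
    static List<String> sortByMonth(List<String> fileNames) {
        List<String> result = new ArrayList<>();
        for (String name : fileNames) {
            if (monthIndex(name).isPresent()) {
                result.add(name);
            }
        }
        result.sort(Comparator.comparingInt((String name) -> monthIndex(name).getAsInt()));
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("month-10.xml", "month-2.xml", "notes.txt", "month-1.xml");
        for (String name : sortByMonth(names)) {
            System.out.println(name);
        }
    }
}
```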
Notes
The range of months is hard-coded as January 2001 to December 2007 in org.mediawiki.dumper.LREMonthlyDumper20012007, and so is the output directory. Change these as necessary, or even better, parameterize them.
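One possible shape for such a parameterization is sketched below. Everything in it is an assumption for illustration: the class `DumperConfigSketch`, the argument order, and the yyyy-MM format are not taken from the existing code, and wiring the values into LREMonthlyDumper20012007 is left open.

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

class DumperConfigSketch {
    final YearMonth first;
    final YearMonth last;
    final String outputDir;
    final String listFile;

    DumperConfigSketch(YearMonth first, YearMonth last, String outputDir, String listFile) {
        if (last.isBefore(first)) {
            throw new IllegalArgumentException("last month must not precede first month");
        }
        this.first = first;
        this.last = last;
        this.outputDir = outputDir;
        this.listFile = listFile;
    }

    /** Hypothetical argument order: <first yyyy-MM> <last yyyy-MM> <outputDir> <WikiDump.lst>. */
    static DumperConfigSketch fromArgs(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                "usage: <first yyyy-MM> <last yyyy-MM> <outputDir> <WikiDump.lst>");
        }
        // YearMonth.parse accepts ISO-style values such as 2001-01.
        return new DumperConfigSketch(
            YearMonth.parse(args[0]), YearMonth.parse(args[1]), args[2], args[3]);
    }

    /** Every month from first to last, inclusive, in chronological order. */
    List<YearMonth> months() {
        List<YearMonth> months = new ArrayList<>();
        for (YearMonth m = first; !m.isAfter(last); m = m.plusMonths(1)) {
            months.add(m);
        }
        return months;
    }

    public static void main(String[] args) {
        // Defaults reproduce the currently hard-coded range and output directory.
        String[] effective = args.length == 4
            ? args
            : new String[] {"2001-01", "2007-12", "extracted-monthly", "WikiDump.lst"};
        DumperConfigSketch config = fromArgs(effective);
        System.out.println(config.months().size() + " months -> " + config.outputDir);
    }
}
```

With the defaults shown, the range January 2001 through December 2007 covers 84 months.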
Source Code
svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/mwdumper