Webarc:LRE Monthly Dumper

From Adapt

Revision as of 21:43, 9 November 2009

What it does

From the source MediaWiki XML dump files, it extracts, for each article, the latest revision that was current at the end of each month (that is, at 11:59:59.99 pm on January 31, February 28, March 31, and so on). It also filters out minor edits and redirects.
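The month-end selection described above can be sketched as follows. This is only an illustration, not the code in mwdumper: the class MonthEndCutoffs and its methods are hypothetical names, and it uses java.time (Java 8 or later), which the original tool may not use. It also takes the first instant of the following month as an exclusive cutoff, an assumption that sidesteps the 11:59:59.99 pm rounding and leap-year Februaries.

```java
import java.time.LocalDateTime;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of mwdumper. */
final class MonthEndCutoffs {

    /**
     * For each month in [first, last], returns the first instant of the
     * following month. A revision belongs to a month's snapshot if its
     * timestamp is strictly before that month's cutoff.
     */
    static List<LocalDateTime> exclusiveCutoffs(YearMonth first, YearMonth last) {
        List<LocalDateTime> cutoffs = new ArrayList<>();
        for (YearMonth m = first; !m.isAfter(last); m = m.plusMonths(1)) {
            cutoffs.add(m.plusMonths(1).atDay(1).atStartOfDay());
        }
        return cutoffs;
    }

    /** Returns the latest timestamp strictly before the cutoff, or null if there is none. */
    static LocalDateTime latestBefore(List<LocalDateTime> revisionTimes, LocalDateTime cutoff) {
        LocalDateTime best = null;
        for (LocalDateTime t : revisionTimes) {
            if (t.isBefore(cutoff) && (best == null || t.isAfter(best))) {
                best = t;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up revision timestamps for a single article.
        List<LocalDateTime> revisions = Arrays.asList(
                LocalDateTime.of(2001, 1, 15, 10, 0),
                LocalDateTime.of(2001, 1, 31, 23, 59, 59),
                LocalDateTime.of(2001, 2, 1, 0, 0));
        for (LocalDateTime cutoff : exclusiveCutoffs(YearMonth.of(2001, 1), YearMonth.of(2001, 3))) {
            System.out.println(cutoff + " -> " + latestBefore(revisions, cutoff));
        }
    }
}
```

The sketch selects by timestamp only; the filtering of minor edits and redirects is left out.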

Usage

In Eclipse, export mwdumper as a runnable JAR.

  1. Right-click on 'mwdumper' in Package Explorer, select 'Export', then choose 'Java > Runnable JAR file'.
  2. Select 'lremonthlydumper20012007 - mwdumper' as Launch configuration.
  3. Enter lremonthlydumper.jar as Export destination.
  4. Select 'Package required libraries into generated JAR'.
  5. Click 'Finish'.

In a shell terminal (or a command prompt on Windows), change to the directory that contains lremonthlydumper.jar, then create the output directory and run the tool:

mkdir extracted-monthly
java -jar lremonthlydumper.jar <WikiDump.lst>

Input

<WikiDump.lst>: A file that lists the locations of the MediaWiki XML dump files, one path per line. Example contents of a list file:

/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-00
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-01
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-02
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-03
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-04
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-05
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-06
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-07
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-08
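
Because a run over a full history dump can take a long time, it may be worth checking the list file before starting. The following is a minimal sketch of such a check. The class DumpListReader is a hypothetical name and not part of mwdumper, and skipping blank lines is an assumption made here, not documented behaviour of the tool.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for a dump list file; not part of mwdumper. */
final class DumpListReader {

    /** Turns the lines of a list file into paths, ignoring blank lines and surrounding whitespace. */
    static List<Path> parse(List<String> lines) {
        List<Path> paths = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                paths.add(Paths.get(trimmed));
            }
        }
        return paths;
    }

    /** Returns the listed paths that do not exist on this machine. */
    static List<Path> missing(List<Path> paths) {
        List<Path> result = new ArrayList<>();
        for (Path p : paths) {
            if (!Files.exists(p)) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DumpListReader <WikiDump.lst>");
            return;
        }
        List<Path> paths = parse(Files.readAllLines(Paths.get(args[0])));
        System.out.println(paths.size() + " dump files listed");
        for (Path p : missing(paths)) {
            System.out.println("missing: " + p);
        }
    }
}
```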


Output

MediaWiki XML dump files, one for each month, are written under the 'extracted-monthly' directory. month-<k>.xml is the output file for the kth month.

Notes

The range of months (January 2001 to December 2007) is hard-coded in org.mediawiki.dumper.LREMonthlyDumper20012007, and so is the output directory. Change these as necessary or, even better, parameterize them.
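
One possible way to parameterize these values is to read them from optional command-line arguments, falling back to the current hard-coded values. The sketch below is only a suggestion: the class DumperOptions, the argument order, and the yyyy-MM month format are all assumptions and do not exist in mwdumper.

```java
import java.time.YearMonth;

/** Hypothetical option holder; the argument layout is an assumption, not mwdumper's. */
final class DumperOptions {
    final String listFile;
    final YearMonth firstMonth;
    final YearMonth lastMonth;
    final String outputDir;

    private DumperOptions(String listFile, YearMonth firstMonth, YearMonth lastMonth, String outputDir) {
        this.listFile = listFile;
        this.firstMonth = firstMonth;
        this.lastMonth = lastMonth;
        this.outputDir = outputDir;
    }

    /** Expects: listFile [firstMonth [lastMonth [outputDir]]], with months written as yyyy-MM. */
    static DumperOptions parse(String[] args) {
        if (args.length < 1 || args.length > 4) {
            throw new IllegalArgumentException(
                    "usage: <WikiDump.lst> [firstMonth [lastMonth [outputDir]]]");
        }
        // Defaults mirror the values that are currently hard-coded.
        YearMonth first = args.length > 1 ? YearMonth.parse(args[1]) : YearMonth.of(2001, 1);
        YearMonth last = args.length > 2 ? YearMonth.parse(args[2]) : YearMonth.of(2007, 12);
        String out = args.length > 3 ? args[3] : "extracted-monthly";
        if (last.isBefore(first)) {
            throw new IllegalArgumentException("lastMonth must not be before firstMonth");
        }
        return new DumperOptions(args[0], first, last, out);
    }

    public static void main(String[] args) {
        DumperOptions o = parse(args);
        System.out.println(o.listFile + ": " + o.firstMonth + " to " + o.lastMonth
                + ", output in " + o.outputDir);
    }
}
```

With this layout, passing only the list file keeps today's behaviour, so existing invocations would not need to change.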

Source Code

svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/mwdumper