Webarc:LRE Monthly Dumper

Latest revision as of 18:45, 10 November 2009

What It Does

From the source Wikimedia XML dump files, the dumper extracts, for each article, the latest revision valid at the end of each month (that is, at 11:59:59.99 pm on January 31, February 28, March 31, and so on). It also filters out minor edits and redirects.
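
The selection rule above can be sketched in a few lines of Java. This is only an illustration, not the dumper's actual code: the class MonthEndSnapshot, the Revision type, and both method names are hypothetical, and the sketch assumes that minor edits and redirects are discarded before the latest revision is picked (the real tool may filter in a different order). Note that YearMonth.atEndOfMonth() returns February 29 in leap years.

```java
import java.time.LocalDateTime;
import java.time.YearMonth;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class MonthEndSnapshot {

    /** Hypothetical stand-in for one revision of one article. */
    static final class Revision {
        final LocalDateTime timestamp;
        final boolean minor;
        final boolean redirect;

        Revision(LocalDateTime timestamp, boolean minor, boolean redirect) {
            this.timestamp = timestamp;
            this.minor = minor;
            this.redirect = redirect;
        }
    }

    /** Month-end cutoff: 23:59:59.99 on the last day of the given month. */
    static LocalDateTime cutoff(YearMonth month) {
        return month.atEndOfMonth().atTime(23, 59, 59, 990_000_000);
    }

    /**
     * Latest revision at or before the cutoff, ignoring minor edits and
     * redirects. Filtering before selecting is an assumption of this sketch.
     */
    static Optional<Revision> latestAtEndOf(YearMonth month, List<Revision> revisions) {
        LocalDateTime limit = cutoff(month);
        return revisions.stream()
                .filter(r -> !r.minor && !r.redirect)
                .filter(r -> !r.timestamp.isAfter(limit))
                .max(Comparator.comparing((Revision r) -> r.timestamp));
    }

    public static void main(String[] args) {
        List<Revision> history = Arrays.asList(
                new Revision(LocalDateTime.of(2001, 1, 15, 10, 0), false, false),
                new Revision(LocalDateTime.of(2001, 1, 30, 9, 0), true, false),
                new Revision(LocalDateTime.of(2001, 2, 2, 8, 0), false, false));
        // Prints the timestamp of the revision chosen for January 2001, if any.
        latestAtEndOf(YearMonth.of(2001, 1), history)
                .ifPresent(r -> System.out.println(r.timestamp));
    }
}
```

In this example the January 30 revision is skipped because it is a minor edit, and the February 2 revision falls after the January cutoff.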

How To Build

In Eclipse, configure a run

  1. Right-click on mwdumper in Package Explorer, select Run As... --> Run Configurations.
  2. In the left pane, right-click on Java Application --> New.
  3. Enter mwdumper in the Name field in the right pane.
  4. Select mwdumper in the Project field.
  5. Select org.mediawiki.dumper.LREMonthlyDumper20012007 in the Main class field.
  6. Click Apply.
  7. Click Close.

In Eclipse, export mwdumper as a runnable JAR.

  1. Right-click on 'mwdumper' in Package Explorer, select 'Export'.
  2. Select 'Runnable JAR file' and click 'Next'.
  3. Select 'mwdumper - mwdumper' as Launch configuration.
  4. Put <your directory>/mwdumper.jar in Export destination.
  5. Select 'Package required libraries into generated JAR'.
  6. Click 'Finish'.

How To Run

In a shell terminal (or a command prompt in Windows), change directory to where mwdumper.jar is located (<your directory> above), then create the output directory and run the JAR:

mkdir extracted-monthly
java -jar mwdumper.jar <WikiDump.lst>

Input File

<WikiDump.lst>: A file that lists the locations of the Wikimedia XML dump files, one path per line. Example contents of a list file:

/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-00
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-01
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-02
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-03
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-04
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-05
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-06
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-07
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-08
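
As an illustration of how such a list file can be consumed, the following Java sketch reads it into a list of paths. The class DumpListReader and its methods are hypothetical and not taken from the mwdumper sources; skipping blank lines and trimming whitespace are assumptions of the sketch, not documented behaviour of the tool.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class DumpListReader {

    /** Turns the lines of a list file into paths, skipping blank lines. */
    static List<Path> parse(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(line -> Paths.get(line))
                .collect(Collectors.toList());
    }

    /** Reads the list file from disk, one dump location per line. */
    static List<Path> read(Path listFile) throws IOException {
        return parse(Files.readAllLines(listFile));
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the list file, e.g. WikiDump.lst
        for (Path dump : read(Paths.get(args[0]))) {
            System.out.println(dump);
        }
    }
}
```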


Output Files

One Wikimedia XML dump file per month is generated under the 'extracted-monthly' directory. The file month-<k>.xml contains the extracted wiki articles for the kth month.

Notes

The range of months is hard-coded as January 2001 through December 2007 in org.mediawiki.dumper.LREMonthlyDumper20012007, and so is the output directory. Change these as necessary or, even better, parameterize them.
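
One possible way to parameterize these values is sketched below. Everything in it is hypothetical: the class DumperConfig does not exist in the mwdumper sources, the argument layout (list file, first month, last month, output directory) is invented for illustration, and numbering the output files from k = 1 is an assumption, since this page does not say where k starts. The defaults mirror the hard-coded values described above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

public class DumperConfig {
    final YearMonth first;
    final YearMonth last;
    final Path outputDir;

    DumperConfig(YearMonth first, YearMonth last, Path outputDir) {
        if (last.isBefore(first)) {
            throw new IllegalArgumentException("last month precedes first month");
        }
        this.first = first;
        this.last = last;
        this.outputDir = outputDir;
    }

    /**
     * Hypothetical argument layout:
     *   <WikiDump.lst> [firstMonth [lastMonth [outputDir]]]
     * Months use the yyyy-MM form, e.g. 2001-01.
     */
    static DumperConfig fromArgs(String[] args) {
        YearMonth first = args.length > 1 ? YearMonth.parse(args[1]) : YearMonth.of(2001, 1);
        YearMonth last = args.length > 2 ? YearMonth.parse(args[2]) : YearMonth.of(2007, 12);
        Path out = Paths.get(args.length > 3 ? args[3] : "extracted-monthly");
        return new DumperConfig(first, last, out);
    }

    /** Number of months in the range, both ends included. */
    long monthCount() {
        return ChronoUnit.MONTHS.between(first, last) + 1;
    }

    /** Output file for the kth month; k starting at 1 is an assumption. */
    Path outputFile(int k) {
        return outputDir.resolve("month-" + k + ".xml");
    }

    public static void main(String[] args) {
        DumperConfig config = fromArgs(args);
        System.out.println(config.monthCount() + " months, first file: " + config.outputFile(1));
    }
}
```

With no optional arguments the sketch falls back to the current hard-coded range and directory, so existing invocations would keep working.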

Source Code

svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/mwdumper