Webarc:MediaWiki-to-TREC Converter

Revision as of 22:15, 9 November 2009

== What It Does ==
This tool converts the default MediaWiki XML dump format to the TREC-compliant format. For our experimental purposes, it also constructs a Java Berkeley DB for each month, which we call a 'Fresh DB'. A 'Fresh DB' contains records for all wiki articles updated within that month. Each record in the DB has the form { docID, (revision date, file name, offset) }.
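
For illustration, here is a minimal sketch of what one such record carries, written as a plain Java value class; the class and field names are made up for this sketch and are not the actual mwprep types, and how the value is serialized into the Berkeley DB is left to mwprep.
<pre>
// Hypothetical sketch of one Fresh DB record: the key is the article's docID,
// and the value locates that article's latest revision within the month.
import java.util.Date;

public class FreshRecord {
    public final String docId;        // key: document identifier of the wiki article
    public final Date revisionDate;   // when the article was revised within the month
    public final String fileName;     // which converted TREC file contains the article
    public final long offset;         // byte offset of the document inside that file

    public FreshRecord(String docId, Date revisionDate, String fileName, long offset) {
        this.docId = docId;
        this.revisionDate = revisionDate;
        this.fileName = fileName;
        this.offset = offset;
    }
}
</pre>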


== How To Build ==
In Eclipse, configure a run
# Right-click on 'mwprep' in Package Explorer, select 'Run As.. --> Run Configurations'.
# On the left pane, right-click on 'Java Application --> New'
# Enter 'mwprep' in the Name field on the right pane.
# Select 'mwprep' in the Project field.
# Select 'edu.umd.umiacs.mw.mwprep' in the Main class field
# Click 'Apply'
# Click 'Close'


 
In Eclipse, export mwprep as a runnable JAR.
# Right-click on 'mwprep' in Package Explorer, select 'export'.
# Select 'mwprep - mwprep' as Launch configuration.
# Put mwprep.jar in Export destination
# Select 'Package required libraries into generated JAR'
# Click 'Finish'

In a shell terminal (or a command line prompt in Windows), change directory to where mwprep.jar is located.

<pre>
mkdir preprocessed-monthly
java -jar mwprep.jar <WikiDump.lst>
</pre>


== Input ==
''<WikiDump.lst>'': A file that lists the locations of the monthly snapshots extracted from the MediaWiki XML dump. You probably want to break the full list into several smaller files and run multiple instances of mwprep in parallel (a sketch for splitting the list follows the example below).
 
Example contents in a list file:
<pre>
/fs/webarc3/data/wikipedia/extracted-monthly/month-000.xml
/fs/webarc3/data/wikipedia/extracted-monthly/month-001.xml
/fs/webarc3/data/wikipedia/extracted-monthly/month-002.xml
/fs/webarc3/data/wikipedia/extracted-monthly/month-003.xml
/fs/webarc3/data/wikipedia/extracted-monthly/month-004.xml
</pre>
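
If you do split the list for parallel runs, the sketch below shows one way to break a list file into smaller parts with a round-robin split; the input and output file names are hypothetical, and each resulting part can be passed to a separate 'java -jar mwprep.jar' invocation.
<pre>
// One way to split a large WikiDump.lst into N smaller list files
// (WikiDump-part0.lst, WikiDump-part1.lst, ...), one per parallel mwprep run.
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class SplitList {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("WikiDump.lst"), StandardCharsets.UTF_8);
        int parts = 4;  // number of parallel mwprep instances you plan to run
        for (int p = 0; p < parts; p++) {
            try (PrintWriter out = new PrintWriter("WikiDump-part" + p + ".lst", "UTF-8")) {
                // Round-robin assignment keeps the parts roughly the same size.
                for (int i = p; i < lines.size(); i += parts) {
                    out.println(lines.get(i));
                }
            }
        }
    }
}
</pre>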




== Output ==
# Under the 'preprocessed-monthly' directory, TREC-compliant files are generated. The converted file for the original XML file <filename>.xml is named trec-<filename>.xml.
# Under the 'bdb-monthly' directory, one directory is generated for each Fresh DB. Each directory corresponds to the DB for a single month and is named '<filename>-fresh'.
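
As a rough illustration of what ends up in a Fresh DB directory, the sketch below opens one of the generated '<filename>-fresh' environments read-only with Berkeley DB Java Edition and counts its records. It does not assume how mwprep encodes the key and value bytes; see edu.umd.umiacs.mw.mwprep for the actual record layout.
<pre>
// Minimal sketch: open a monthly Fresh DB environment read-only and count its
// records. Pass the environment directory (e.g. bdb-monthly/<filename>-fresh)
// as the first argument.
import java.io.File;
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class InspectFreshDb {
    public static void main(String[] args) throws Exception {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setReadOnly(true);
        Environment env = new Environment(new File(args[0]), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setReadOnly(true);
        // Open the first database stored in the environment rather than
        // assuming the name mwprep gives it.
        String dbName = env.getDatabaseNames().get(0);
        Database db = env.openDatabase(null, dbName, dbConfig);

        DatabaseEntry key = new DatabaseEntry();
        DatabaseEntry value = new DatabaseEntry();
        long count = 0;
        Cursor cursor = db.openCursor(null, null);
        while (cursor.getNext(key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            count++;
        }
        cursor.close();
        db.close();
        env.close();
        System.out.println(dbName + ": " + count + " records");
    }
}
</pre>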


== Notes ==
The names of the output directories ('preprocessed-monthly' and 'bdb-monthly') are hard-coded in edu.umd.umiacs.mw.mwprep. Change them as necessary, or even better, parameterize them.
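
If you do parameterize them, one lightweight option is JVM system properties with the current names as defaults; the property names in this sketch are made up.
<pre>
// Illustrative only: read the output directory names from -D system properties
// (e.g. -Dmwprep.trecDir=out-trec -Dmwprep.bdbDir=out-bdb) instead of
// hard-coding them, falling back to the current defaults.
public class OutputDirs {
    static final String TREC_DIR = System.getProperty("mwprep.trecDir", "preprocessed-monthly");
    static final String BDB_DIR  = System.getProperty("mwprep.bdbDir", "bdb-monthly");
}
</pre>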


== Source Code ==
svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/mwprep
