Webarc:MediaWiki-to-TREC Converter
Latest revision as of 23:33, 9 November 2009
What It Does
This tool converts the default MediaWiki XML dump format to a TREC-compliant format. For our experimental purposes, it also constructs a Java Berkeley DB for each month, which we call a 'Fresh DB'. A Fresh DB contains one record for every wiki article updated within that month. Each record in the DB has the form { docID, (revision date, file name, offset) }.
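To make the record layout concrete, here is a minimal sketch of a Fresh DB record as a Java value type. Berkeley DB stores keys and values as byte arrays, so the sketch uses docID as the key and packs the remaining fields into the value. The class name FreshRecord, the field names, and the tab-separated encoding are all assumptions made for illustration; the actual serialization used by mwprep is not documented here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical value type for one Fresh DB record:
// { docID, (revision date, file name, offset) }.
// Names and encoding are illustrative assumptions, not mwprep's real code.
final class FreshRecord {
    final String docId;        // key: the article's document ID
    final String revisionDate; // date of the revision stored for this month
    final String fileName;     // TREC file that contains the article
    final long offset;         // byte offset of the article inside that file

    FreshRecord(String docId, String revisionDate, String fileName, long offset) {
        this.docId = docId;
        this.revisionDate = revisionDate;
        this.fileName = fileName;
        this.offset = offset;
    }

    // Key bytes: the docID encoded as UTF-8.
    byte[] keyBytes() {
        return docId.getBytes(StandardCharsets.UTF_8);
    }

    // Value bytes: the three remaining fields joined with tabs (assumed encoding).
    byte[] valueBytes() {
        String joined = revisionDate + "\t" + fileName + "\t" + offset;
        return joined.getBytes(StandardCharsets.UTF_8);
    }

    // Rebuild a record from a key/value pair produced by the methods above.
    static FreshRecord fromBytes(byte[] key, byte[] value) {
        String[] parts = new String(value, StandardCharsets.UTF_8).split("\t", 3);
        return new FreshRecord(
                new String(key, StandardCharsets.UTF_8),
                parts[0],
                parts[1],
                Long.parseLong(parts[2]));
    }
}
```

With a layout like this, looking up a docID yields the file and offset needed to seek directly to the article's latest revision for that month.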
How To Build
In Eclipse, create a run configuration.
- Right-click on 'mwprep' in Package Explorer, select 'Run As --> Run Configurations...'.
- In the left pane, right-click on 'Java Application' and select 'New'.
- Enter 'mwprep' in the Name field on the right pane.
- Select 'mwprep' in the Project field.
- Select 'edu.umd.umiacs.mw.mwprep' in the Main class field
- Click 'Apply'
- Click 'Close'
In Eclipse, export mwprep as a runnable JAR.
- Right-click on 'mwprep' in Package Explorer, select 'export'.
- Select 'Runnable JAR file' and click 'Next'.
- Select 'mwprep - mwprep' as Launch configuration.
- Put mwprep.jar in Export destination.
- Select 'Package required libraries into generated JAR'
- Click 'Finish'
How To Run
In a shell terminal (or a command prompt in Windows), change directory to where mwprep.jar is located (the export destination chosen above).
mkdir preprocessed-monthly
mkdir bdb-monthly
java -jar mwprep.jar <snapshots.lst>
Input File
<snapshots.lst>: A file that lists the locations of the monthly snapshots extracted from the Wikimedia XML dump, one path per line. You will probably want to split the full list into several smaller files and run multiple instances of mwprep in parallel.
Example contents in a list file:
/fs/webarc3/data/wikipedia/extracted-monthly/month-000.xml
/fs/webarc3/data/wikipedia/extracted-monthly/month-001.xml
/fs/webarc3/data/wikipedia/extracted-monthly/month-002.xml
/fs/webarc3/data/wikipedia/extracted-monthly/month-003.xml
/fs/webarc3/data/wikipedia/extracted-monthly/month-004.xml
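Splitting the snapshot list for parallel runs can be sketched as follows. The helper distributes the paths round-robin so that each chunk gets a similar number of months. The class name SnapshotListSplitter and its split method are hypothetical and are not part of mwprep.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of mwprep): divides a snapshot list into
// a fixed number of smaller lists, one per parallel mwprep run.
final class SnapshotListSplitter {

    // Distributes paths round-robin over `parts` chunks.
    static List<List<String>> split(List<String> paths, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            chunks.add(new ArrayList<>());
        }
        for (int i = 0; i < paths.size(); i++) {
            chunks.get(i % parts).add(paths.get(i));
        }
        return chunks;
    }
}
```

Each chunk can then be written to its own list file (for example with java.nio.file.Files.write) and passed to a separate 'java -jar mwprep.jar' invocation.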
Output Files
- Under the 'preprocessed-monthly' directory: the TREC-compliant files. The converted file for the original XML file <filename>.xml is named trec-<filename>.xml.
- Under the 'bdb-monthly' directory: the Fresh DB directories. Each directory holds one DB, which in turn corresponds to a single month. The DB directories are named '<filename>-fresh'.
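The naming rules above can be expressed as a small helper that maps an input snapshot path to its two output locations. This is a sketch of the convention only; the class OutputNames and its methods are hypothetical and may not match how mwprep builds these paths internally.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper illustrating the output naming convention.
final class OutputNames {

    // Strips the directory and a trailing ".xml" from the input path.
    static String baseName(String inputPath) {
        String name = Paths.get(inputPath).getFileName().toString();
        return name.endsWith(".xml") ? name.substring(0, name.length() - 4) : name;
    }

    // preprocessed-monthly/trec-<filename>.xml
    static Path trecFile(String inputPath) {
        return Paths.get("preprocessed-monthly", "trec-" + baseName(inputPath) + ".xml");
    }

    // bdb-monthly/<filename>-fresh
    static Path freshDbDir(String inputPath) {
        return Paths.get("bdb-monthly", baseName(inputPath) + "-fresh");
    }
}
```

For the input month-000.xml, this gives trec-month-000.xml under 'preprocessed-monthly' and the directory month-000-fresh under 'bdb-monthly'.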
Notes
The names of the output directories (preprocessed-monthly and bdb-monthly) are hard-coded in edu.umd.umiacs.mw.mwprep. Change these as necessary, or even better, parameterize them.
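One possible way to parameterize the directory names is to read them from Java system properties and fall back to the current hard-coded values. The sketch below assumes the source is modified to call these methods; the class OutputDirs and the property names mwprep.trecDir and mwprep.bdbDir are invented for illustration and do not exist in mwprep today.

```java
// Hypothetical replacement for the hard-coded directory names.
// The property names are assumptions; mwprep does not currently read them.
final class OutputDirs {

    // Directory for TREC files; defaults to the current hard-coded name.
    static String trecDir() {
        return System.getProperty("mwprep.trecDir", "preprocessed-monthly");
    }

    // Directory for Fresh DBs; defaults to the current hard-coded name.
    static String bdbDir() {
        return System.getProperty("mwprep.bdbDir", "bdb-monthly");
    }
}
```

After such a change, a run could override a directory with the standard JVM option, e.g. 'java -Dmwprep.trecDir=/tmp/trec -jar mwprep.jar <snapshots.lst>', while runs without the option would behave as before.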
Source Code
svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/mwprep