Webarc:LRE Monthly Dumper
From Adapt
Latest revision as of 18:45, 10 November 2009
What It Does
From the source Wikimedia XML dump files, the tool extracts, for each article, the latest revision valid at the end of each month (that is, at 11:59:59.99 pm on January 31, February 28, March 31, and so on). It also filters out minor edits and redirects.
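The month-end selection described above can be sketched as follows. This is a minimal illustration, not code from mwdumper: the class MonthEndCutoffs and its methods are hypothetical names, and the sketch assumes the cutoff falls on the actual last day of each month (so February 29 in leap years), which the text above does not confirm. It does not model the filtering of minor edits and redirects.

```java
import java.time.LocalDateTime;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of mwdumper. */
final class MonthEndCutoffs {

    /** Cutoff for a month: 23:59:59.99 on its last calendar day. */
    static LocalDateTime cutoff(YearMonth month) {
        // atEndOfMonth() picks the real last day, including Feb 29 in leap years (assumption).
        return month.atEndOfMonth().atTime(23, 59, 59, 990_000_000);
    }

    /** One cutoff per month from first to last, both inclusive. */
    static List<LocalDateTime> cutoffs(YearMonth first, YearMonth last) {
        List<LocalDateTime> result = new ArrayList<>();
        for (YearMonth m = first; !m.isAfter(last); m = m.plusMonths(1)) {
            result.add(cutoff(m));
        }
        return result;
    }

    /** Latest revision timestamp not after the cutoff, or null if there is none. */
    static LocalDateTime latestAtOrBefore(List<LocalDateTime> revisionTimes, LocalDateTime cutoff) {
        LocalDateTime best = null;
        for (LocalDateTime t : revisionTimes) {
            if (!t.isAfter(cutoff) && (best == null || t.isAfter(best))) {
                best = t;
            }
        }
        return best;
    }
}
```

For the hard-coded range mentioned in the Notes (January 2001 to December 2007), cutoffs(...) would yield 84 entries, one per month.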
How To Build
In Eclipse, configure a run configuration:
- Right-click on mwdumper in Package Explorer, select Run As --> Run Configurations...
- In the left pane, right-click on Java Application and select New.
- Enter mwdumper in the Name field in the right pane.
- Select mwdumper in the Project field.
- Select org.mediawiki.dumper.LREMonthlyDumper20012007 in the Main class field.
- Click Apply.
- Click Close.
In Eclipse, export mwdumper as a runnable JAR:
- Right-click on 'mwdumper' in Package Explorer, select 'Export'.
- Select 'Runnable JAR file' and click 'Next'.
- Select 'mwdumper - mwdumper' as Launch configuration.
- Put <your directory>/mwdumper.jar in Export destination.
- Select 'Package required libraries into generated JAR'.
- Click 'Finish'.
How To Run
In a shell terminal (or a command line prompt in Windows), change directory to where mwdumper.jar is located (<your directory> above), then create the output directory and run the dumper:
mkdir extracted-monthly
java -jar mwdumper.jar <WikiDump.lst>
Input File
<WikiDump.lst>: A file that lists the locations of Wikimedia XML dump files.
Example contents of a list file:
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-00
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-01
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-02
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-03
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-04
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-05
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-06
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-07
/fs/webarc4/data/wikipedia/enwiki-20080103-pages-meta-history-08
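As an illustration of how such a list file might be read, here is a minimal sketch. It assumes one dump location per line and skips blank lines; the exact format accepted by the tool is not stated beyond the example above, and DumpListReader is a hypothetical name that does not come from mwdumper.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for a dump list file; not part of mwdumper. */
final class DumpListReader {

    /** Keeps one trimmed location per non-blank line (assumed format). */
    static List<String> parse(List<String> lines) {
        List<String> locations = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                locations.add(trimmed);
            }
        }
        return locations;
    }

    /** Reads the list file from disk and parses it. */
    static List<String> read(Path listFile) throws IOException {
        return parse(Files.readAllLines(listFile));
    }
}
```

Calling DumpListReader.read(Paths.get("WikiDump.lst")) on the example above would return the nine dump locations in file order.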
Output Files
Wikimedia XML dump files, one for each month, are generated under the 'extracted-monthly' directory. The file month-<k>.xml contains the wiki articles extracted for the kth month.
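The relationship between k and a calendar month could be expressed as in the sketch below. It assumes that k starts at 1 for January 2001, the first month of the hard-coded range; the page does not say whether numbering starts at 0 or 1, so check the generated files before relying on it. MonthIndex is a hypothetical name.

```java
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

/** Hypothetical mapping between month index k and calendar month; not part of mwdumper. */
final class MonthIndex {

    /** First month of the range; k = 1 is assumed to map to it. */
    static final YearMonth FIRST = YearMonth.of(2001, 1);

    /** Calendar month for a 1-based index k. */
    static YearMonth monthOf(int k) {
        return FIRST.plusMonths(k - 1);
    }

    /** Output file name for a calendar month, under the 1-based assumption. */
    static String fileName(YearMonth month) {
        long k = ChronoUnit.MONTHS.between(FIRST, month) + 1;
        return "month-" + k + ".xml";
    }
}
```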
Notes
The range of months is hard-coded as January 2001 to December 2007 in org.mediawiki.dumper.LREMonthlyDumper20012007, and so is the output directory. Change these as necessary or, even better, parameterize them.
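One possible way to parameterize these values is sketched below. Everything here is an assumption for illustration: the DumperOptions class, the argument order, and the yyyy-MM month format are not part of mwdumper. The defaults reproduce the hard-coded range and the output directory used in How To Run.

```java
import java.time.YearMonth;

/** Hypothetical command-line options holder; not part of mwdumper. */
final class DumperOptions {
    final String listFile;
    final YearMonth firstMonth;
    final YearMonth lastMonth;
    final String outputDir;

    private DumperOptions(String listFile, YearMonth firstMonth, YearMonth lastMonth, String outputDir) {
        this.listFile = listFile;
        this.firstMonth = firstMonth;
        this.lastMonth = lastMonth;
        this.outputDir = outputDir;
    }

    /** Parses: <WikiDump.lst> [firstMonth] [lastMonth] [outputDir], with months given as yyyy-MM. */
    static DumperOptions parse(String[] args) {
        if (args.length < 1 || args.length > 4) {
            throw new IllegalArgumentException(
                "usage: <WikiDump.lst> [firstMonth yyyy-MM] [lastMonth yyyy-MM] [outputDir]");
        }
        // Defaults mirror the values currently hard-coded in the dumper.
        YearMonth first = args.length > 1 ? YearMonth.parse(args[1]) : YearMonth.of(2001, 1);
        YearMonth last = args.length > 2 ? YearMonth.parse(args[2]) : YearMonth.of(2007, 12);
        String outputDir = args.length > 3 ? args[3] : "extracted-monthly";
        if (last.isBefore(first)) {
            throw new IllegalArgumentException("lastMonth must not be before firstMonth");
        }
        return new DumperOptions(args[0], first, last, outputDir);
    }
}
```

With this shape, running the tool with only the list file would behave as it does today, while extra arguments would override the range and the output directory.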
Source Code
svn co http://narasvn.umiacs.umd.edu/repository/src/webarc/mwdumper