WarcManager

From Adapt

Revision as of 14:37, 27 April 2011 by Toaster

Overview

The Warc Manager is a tool that helps archives quickly browse, search, and analyze collections of web crawl data. It is a lightweight, database-backed web application that indexes a collection of WARC data and provides a browsing interface to it.


Installation

The installation consists of a few simple steps.

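Before you begin, you will need a running MySQL server and the files used in the steps below: apache-tomcat-6.0.32.tar.gz, mysql-connector-java-5.0.7-bin.jar, warc-webapp-1.0-SNAPSHOT.war, schema.sql, and context.xml.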

1. Create the database and set up permissions.

mysql> create database webarc;
mysql> grant all on webarc.* to webarc@localhost identified by 'PASSWORD';
mysql> use webarc;
mysql> source schema.sql;

2. Install Tomcat and the JDBC driver

$ tar -xf apache-tomcat-6.0.32.tar.gz
$ cp mysql-connector-java-5.0.7-bin.jar apache-tomcat-6.0.32/lib

3. Install and configure the Warc Manager

$ cp warc-webapp-1.0-SNAPSHOT.war apache-tomcat-6.0.32/webapps/warc.war
$ mkdir -p apache-tomcat-6.0.32/conf/Catalina/localhost
$ cp context.xml apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml 
  • Edit apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml and make sure the password in the Resource element matches the one you specified in step 1:
 <Resource auth="Container" driverClassName="com.mysql.jdbc.Driver" maxActive="20" 
maxIdle="10" maxWait="-1" name="jdbc/warcdb" password="PASSWORD" testOnBorrow="true"
type="javax.sql.DataSource" url="jdbc:mysql://localhost/webarc" username="webarc"
validationQuery="SELECT 1"/>
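
As a quick check for this step, the configured password can be read back from the context file. The following is only a sketch: the helper name context_password is hypothetical, and it assumes the password attribute is double-quoted as in the Resource element above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first password="..." attribute
# found in a Tomcat context file.
# Assumes the attribute value is double-quoted and contains no '"' character.
context_password() {
    sed -n 's/.*password="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Example (path taken from step 3):
#   context_password apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml
```

The printed value should be the same password used in the grant statement in step 1.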

4. Start Tomcat

$ apache-tomcat-6.0.32/bin/startup.sh
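
Tomcat can take a few seconds to deploy the application after startup.sh returns. The sketch below polls a URL with curl until it responds. The function name wait_for_url is hypothetical, and the example URL assumes Tomcat's default HTTP port (8080) and the /warc context path implied by the warc.war file name; adjust both if your installation differs.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
    url=$1
    attempts=${2:-30}
    delay=${3:-2}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -s: no progress output, -f: treat HTTP 4xx/5xx as failure,
        # -o /dev/null: discard the response body.
        if curl -sf -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (assumed port and context path):
#   wait_for_url http://localhost:8080/warc/ && echo "Warc Manager is up"
```

If the check keeps failing, the Tomcat log at apache-tomcat-6.0.32/logs/catalina.out is the usual place to look for deployment errors.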

Indexing Web Content

The Warc Manager can index WARC data stored on the same server that the manager runs on. To index WARC files, you need to know the directory on the server where they reside.

1. From the main screen, click 'Ingest Files'. You will see the ingest file screen.

[Screenshot: Ingestfiles.png]

2. Enter the following information:

  • Server Directory - The directory on the Warc Manager server where the WARC files are stored. The scan recursively searches this directory for any files ending in 'warc.gz'.
  • Select Collection / New Collection - To load these files into an existing collection, select it from the drop-down box. Otherwise, type the name of a new collection or label for these WARC files.

3. Click 'Scan Directory' when you are ready to index the WARC files. You can return to the scan page at any time to view the progress of the scan.
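
Before starting a scan, it can help to preview which files a recursive search for names ending in 'warc.gz' would find. The sketch below uses find to approximate that behavior. The function name list_warc_files is hypothetical, and the manager's own scanner may treat symbolic links or letter case differently.

```shell
#!/bin/sh
# Hypothetical helper: recursively list regular files under a directory
# whose names end in 'warc.gz', sorted by path.
list_warc_files() {
    find "$1" -type f -name '*warc.gz' | sort
}

# Example (the directory is a placeholder, not a path from this page):
#   list_warc_files /path/to/crawl/data
```

An empty result usually means the directory is wrong or the files use a different extension.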

Searching

The main page offers two ways to search the indexed collections.

1. If you know the page you want to view, start typing its full URL into the search box, beginning with 'http'. A drop-down box lists the URLs that match what you have typed. Clicking any of them loads the details for that URL.

2. Type a search string and click 'Search'. You can optionally add wildcards to your search. For example, entering '*.html' searches for all URLs ending in .html. You can search for a specific site by entering 'http://www.site.com*' (e.g., 'http://www.cnn.com/*').
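
To get a feel for how such wildcard patterns select URLs, the sketch below uses the shell's own pattern matching, where '*' matches any run of characters. This is only an analogy: the function name url_matches is hypothetical, and the manager's actual matching rules are not documented on this page and may differ.

```shell
#!/bin/sh
# Hypothetical helper: succeed if URL ($1) matches PATTERN ($2).
# '*' in the pattern matches any sequence of characters. Note that the shell
# also treats '?' and '[...]' in the pattern as wildcards.
url_matches() {
    case "$1" in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

# The two patterns from the text above:
url_matches "http://www.cnn.com/index.html" "*.html" && echo "matches *.html"
url_matches "http://www.cnn.com/world/" "http://www.cnn.com/*" && echo "matches site prefix"
```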