Revision as of 13:18, 21 June 2011

Overview

The Warc Manager is a tool to help archives quickly browse, search, and analyze archives of web crawl data. The manager is a lightweight, database-backed web application that indexes a collection of warc data and provides a browsing interface to it.

Source Code Repository: https://subversion.umiacs.umd.edu/warc-utils/
Nightly builds: http://adaptvm01.umiacs.umd.edu:8080/jenkins/job/Warc%20Manager%202/

Searching

The Warc Manager offers two ways to search collections.

1. If you know the page you want to view, start typing the full URL into the search box, starting with 'http'. You will see a drop-down box containing URLs that match what you have typed. Clicking on any of them will load details for that URL.

2. Type a search string and click 'Search'. Optionally, you can add wildcards to your search. For example, entering '*.html' will search for all URLs ending in '.html'. Some common searches are listed below:

*.html - search for all URLs ending in '.html'
http://www.somesite.com* - view all URLs for www.somesite.com
http://www.somesite.com/*.html - view all '.html' files on www.somesite.com
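
The wildcard behaviour described above can be illustrated with a minimal Python sketch. It is not the Warc Manager's own search code (which is not shown here); it only assumes that '*' matches any run of characters, and it uses the standard-library fnmatch module as a stand-in. The search() helper and the sample URLs are hypothetical.

```python
from fnmatch import fnmatchcase


def search(pattern, urls):
    """Return the URLs that match a '*' wildcard pattern (case-sensitive).

    Assumption: '*' matches any run of characters, including '/'.
    Note that fnmatch also treats '?' and '[...]' as wildcards, which
    the Warc Manager search may not do.
    """
    return [url for url in urls if fnmatchcase(url, pattern)]


# Hypothetical sample data, for illustration only.
sample_urls = [
    "http://www.somesite.com/index.html",
    "http://www.somesite.com/images/logo.png",
    "http://www.othersite.org/about.html",
]

for pattern in ("*.html", "http://www.somesite.com*", "http://www.somesite.com/*.html"):
    print(pattern, "->", search(pattern, sample_urls))
```

Because the manager is a database application, the real search may well translate wildcards into a database query instead; the sketch only shows which URLs each pattern is expected to select.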

Search Results

The search results show all matching URLs along with some statistics about each URL. Double-clicking on any row will load details for that URL.

First Crawl - The earliest crawl date for the URL.
Last Crawl - The most recent crawl date for the URL.
Harvested - The total number of times the page appears in the archive. This includes duplicates and revisits.
Unique - The number of unique copies of the page that exist in the archive.
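
To make the difference between 'Harvested' and 'Unique' concrete, here is a small Python sketch that derives the four columns from a list of captures of one URL. The Capture record, the summarize() function, and the use of a content digest to decide uniqueness are assumptions made for illustration; the Warc Manager's actual schema and duplicate detection are not described on this page.

```python
from collections import namedtuple

# Hypothetical record for one capture of a URL. 'digest' stands for a
# content hash; using it to detect duplicates is an assumption.
Capture = namedtuple("Capture", ["url", "crawl_date", "digest"])


def summarize(captures):
    """Compute the search-result statistics for the captures of one URL."""
    dates = [c.crawl_date for c in captures]
    return {
        "first_crawl": min(dates),   # earliest crawl date
        "last_crawl": max(dates),    # most recent crawl date
        "harvested": len(captures),  # every capture, duplicates included
        "unique": len({c.digest for c in captures}),  # distinct contents
    }


# ISO-formatted date strings sort chronologically, so min/max work on them.
captures = [
    Capture("http://www.somesite.com/", "2011-01-05", "aaa"),
    Capture("http://www.somesite.com/", "2011-03-10", "aaa"),  # unchanged copy
    Capture("http://www.somesite.com/", "2011-06-21", "bbb"),  # changed page
]
print(summarize(captures))
```

In this example the page was harvested three times but only two distinct versions exist, so 'Harvested' is 3 and 'Unique' is 2.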

URL Details

The URL details page shows detailed information for any URL in the collection. If there are multiple copies or different versions of a URL, they will be listed in the left-hand column. Clicking on any entry will load details for that entry.

If you are accessing a URL that contains an HTML page (as opposed to an image), the Warc Manager will attempt to extract all links contained in that page and populate the table on the lower right. Links that appear in the archive are shown in black; missing links are shown in red. You can double-click on any page in the table to view details for that URL. If the URL is not an HTML file (e.g., an image), the link table will remain empty.
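
The link table can be thought of as the result of the following sketch, written in Python with the standard-library HTML parser. It is an illustration under assumptions, not the Warc Manager's implementation: the LinkCollector class, the classify_links() function, the restriction to <a href> links, and the exact-match lookup against a set of archived URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into absolute URLs.
                    self.links.append(urljoin(self.base_url, value))


def classify_links(html, base_url, archived_urls):
    """Return (link, in_archive) pairs; in_archive=False means 'missing'."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [(link, link in archived_urls) for link in collector.links]


page = '<a href="/a.html">A</a> <a href="http://www.othersite.org/b.html">B</a>'
archived = {"http://www.somesite.com/a.html"}
for link, present in classify_links(page, "http://www.somesite.com/index.html", archived):
    print("black" if present else "red", link)
```

A page with no links, or a non-HTML resource that is never parsed, yields an empty list, which corresponds to the empty link table described above.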

Clicking on the 'Download File' link will allow you to download the selected page from the archive.

Installation

The installation consists of a few simple steps.

Before you begin, you will need to download:

Apache Tomcat: http://tomcat.apache.org
MySQL jdbc connector: http://dev.mysql.com/downloads/connector/j/
context.xml, schema.sql, and the warc webapp from: http://adaptvm01.umiacs.umd.edu:8080/jenkins/job/Warc%20Manager%202/

1. Create the database and set up permissions.

mysql> create database webarc;
mysql> grant all on webarc.* to webarc@localhost identified by 'PASSWORD';
mysql> use webarc;
mysql> source schema.sql;

2. Install Tomcat and the JDBC driver

$ tar -xf apache-tomcat-6.0.32.tar.gz
$ cp mysql-connector-java-5.0.7-bin.jar apache-tomcat-6.0.32/lib

3. Install and configure the Warc Manager

$ cp warc-webapp-1.0-SNAPSHOT.war apache-tomcat-6.0.32/webapps/warc.war
$ mkdir -p apache-tomcat-6.0.32/conf/Catalina/localhost
$ cp context.xml apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml

Edit apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml and make sure the password in the Resource line is the same as the one you specified when creating the database:

 <Resource auth="Container" driverClassName="com.mysql.jdbc.Driver" maxActive="20" 
maxIdle="10" maxWait="-1" name="jdbc/warcdb" password="PASSWORD" testOnBorrow="true"
type="javax.sql.DataSource" url="jdbc:mysql://localhost/webarc" username="webarc"
validationQuery="SELECT 1"/>

Add the following line after the Resource line:

<Parameter name="ingest.url" value="http://localhost:8080/warc/rest/ingest"/>

4. Start Tomcat

$ apache-tomcat-6.0.32/bin/startup.sh

Indexing Web Content

The Warc Manager is able to index warc data stored on the same server that the manager is running on. To index warc files, you will need to know the directory on the server where they reside.

1. From the main screen, click 'Ingest Files'. You will see the ingest file screen.

2. Enter the following information:

Server Directory - The directory on the Warc Manager server where the warc files are stored. The scan will recursively search for any files ending in 'warc.gz'.
Select Collection / New Collection - If you want to load these files into an existing collection, select it from the drop-down box. Otherwise, type in the name of a new collection or label for these warc files.

3. Click 'Scan Directory' when you are ready to index the warc files. You can return to the scan page at any time to view the progress of the scan.
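
Before starting a scan, it can be useful to preview which files a recursive search of the server directory would pick up. The Python sketch below mirrors the rule stated above (any file whose name ends in 'warc.gz', at any depth). The find_warc_files() function and the example path are hypothetical; this is not the scanner used by the Warc Manager itself.

```python
import os


def find_warc_files(server_directory):
    """Recursively list files under server_directory ending in 'warc.gz'."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(server_directory):
        for name in filenames:
            if name.endswith("warc.gz"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    # '/data/crawls' is a placeholder; use your own server directory.
    for path in find_warc_files("/data/crawls"):
        print(path)
```

If the directory does not exist or contains no matching files, the function returns an empty list.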
