WarcManager
Overview
The Warc Manager is a tool that helps archives quickly browse, search, and analyze collections of web crawl data. It is a lightweight, database-backed web application that indexes a collection of WARC data and provides a browsing interface to it.
- Source Code Repository: https://subversion.umiacs.umd.edu/warc-utils/
- Nightly builds: http://adaptvm01.umiacs.umd.edu:8080/jenkins/job/Warc%20Manager%202/
Installation
Installation consists of three steps.
1. Create the database and set up permissions.
- mysql> create database webarc;
- mysql> grant all on webarc.* to webarc@localhost identified by 'PASSWORD';
- mysql> use webarc;
- mysql> source schema.sql;
2. Install Tomcat and the JDBC driver.
- $ tar -xf apache-tomcat-6.0.32.tar.gz
- $ cp mysql-connector-java-5.0.7-bin.jar apache-tomcat-6.0.32/lib
3. Install and configure the Warc Manager
- $ cp warc-webapp-1.0-SNAPSHOT.war apache-tomcat-6.0.32/webapps/warc.war
- $ mkdir -p apache-tomcat-6.0.32/conf/Catalina/localhost
- $ cp context.xml apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml
- Edit apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml and make sure the password attribute of the Resource element matches the password you specified in step 1:
<Resource auth="Container" driverClassName="com.mysql.jdbc.Driver" maxActive="20" maxIdle="10" maxWait="-1" name="jdbc/warcdb" password="PASSWORD" testOnBorrow="true" type="javax.sql.DataSource" url="jdbc:mysql://localhost/webarc" username="webarc" validationQuery="SELECT 1"/>
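A mismatch between the database password and the one in warc.xml is an easy mistake to make, so it can help to check the file before starting Tomcat. The following is a minimal sketch, not part of the Warc Manager distribution: the helper names resource_attr and check_password are hypothetical, and the sed expression assumes the Resource element sits on a single line with double-quoted attribute values, as in the example above.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a Tomcat context file.
# Assumption: the <Resource> element is on one line and every
# attribute value is double-quoted, as in the example above.

# Print the value of one attribute of the <Resource> element.
#   $1 = path to the context file (e.g. warc.xml)
#   $2 = attribute name (e.g. password, username, url)
resource_attr() {
    file=$1
    attr=$2
    # The leading whitespace match keeps "name" from matching "username".
    sed -n "/<Resource /s/.*[[:space:]]${attr}=\"\([^\"]*\)\".*/\1/p" "$file"
}

# Succeed (exit status 0) only if the password stored in the context
# file equals the expected password.
#   $1 = path to the context file
#   $2 = password given in the GRANT statement
check_password() {
    [ "$(resource_attr "$1" password)" = "$2" ]
}
```

With these functions loaded, check_password apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml 'PASSWORD' returns status 0 when the two values agree, and resource_attr can print the url or username attributes for a quick visual check.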