WarcManager

Revision as of 14:30, 27 April 2011

Overview

The Warc Manager is a tool that helps archives quickly browse, search, and analyze collections of web crawl data. It is a lightweight, database-backed web application that indexes a collection of warc data and provides a browsing interface to it.


Installation

The installation consists of a few simple steps.

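Before you begin, you will need a running MySQL server and the files used in the steps below: schema.sql, apache-tomcat-6.0.32.tar.gz, mysql-connector-java-5.0.7-bin.jar, warc-webapp-1.0-SNAPSHOT.war, and context.xml.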

1. Create the database and set up permissions.

mysql> create database webarc;
mysql> grant all on webarc.* to webarc@localhost identified by 'PASSWORD';
mysql> use webarc;
mysql> source schema.sql;

2. Install Tomcat and the JDBC driver

$ tar -xf apache-tomcat-6.0.32.tar.gz
$ cp mysql-connector-java-5.0.7-bin.jar apache-tomcat-6.0.32/lib

3. Install and configure the Warc Manager

$ cp warc-webapp-1.0-SNAPSHOT.war apache-tomcat-6.0.32/webapps/warc.war
$ mkdir -p apache-tomcat-6.0.32/conf/Catalina/localhost
$ cp context.xml apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml 
  • Edit apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml and make sure the password attribute of the Resource element matches the password you specified in step 1:
 <Resource auth="Container" driverClassName="com.mysql.jdbc.Driver" maxActive="20"
   maxIdle="10" maxWait="-1" name="jdbc/warcdb" password="PASSWORD" testOnBorrow="true"
   type="javax.sql.DataSource" url="jdbc:mysql://localhost/webarc" username="webarc"
   validationQuery="SELECT 1"/>
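
A mismatch between the database password and the one in warc.xml is an easy mistake to make, so it can help to print the configured value before starting Tomcat. The following is a minimal sketch, not part of the Warc Manager distribution; the function name resource_password is hypothetical, and it assumes the password attribute is written with double quotes as in the Resource element above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the password="..." attribute
# found in a Tomcat context file such as warc.xml.
# Assumes the attribute uses double quotes, as in the example above.
resource_password() {
    sed -n 's/.*password="\([^"]*\)".*/\1/p' "$1"
}

# Example (path taken from the step above):
#   resource_password apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml
```

Compare the printed value with the password used in the grant statement in step 1.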

4. Start Tomcat

$ apache-tomcat-6.0.32/bin/startup.sh
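
Tomcat can take a few seconds to deploy the application after startup.sh returns. The sketch below polls until the server answers. It rests on two assumptions that this page does not state: that Tomcat listens on its stock HTTP port 8080, and that the application is served under /warc, the context path Tomcat derives from the file name warc.war. The function names warc_url and wait_for_warc are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build the URL where the web application is
# expected to answer. Defaults are assumptions: host localhost, port 8080.
warc_url() {
    host="${1:-localhost}"
    port="${2:-8080}"
    printf 'http://%s:%s/warc/\n' "$host" "$port"
}

# Hypothetical helper: poll the URL until Tomcat returns any HTTP
# response, giving up after 30 attempts spaced 2 seconds apart.
wait_for_warc() {
    url="$(warc_url "$@")"
    attempts=30
    while [ "$attempts" -gt 0 ]; do
        if curl -s -o /dev/null "$url"; then
            echo "Tomcat is answering at $url"
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 2
    done
    echo "no response from $url" >&2
    return 1
}

# Example:
#   wait_for_warc              # checks http://localhost:8080/warc/
#   wait_for_warc myhost 8180  # checks a different host and port
```

A successful check only shows that Tomcat accepts connections; if the page itself reports an error, apache-tomcat-6.0.32/logs/catalina.out is the first place to look.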

Indexing Web Content

The Warc Manager can index warc data stored on the server it runs on. To index warc files, you need to know the server directory where they reside.

1. From the main screen, click 'Ingest Files'. You will see the ingest file screen.

2. Enter the following information:

  • Server Directory: The directory on the Warc Manager server where the warc files are stored. The scan recursively searches for any files ending in 'warc.gz'.
  • Select Collection / New Collection: To load these files into an existing collection, select it from the drop-down box. Otherwise, type the name of a new collection or label for these warc files.

3. Click 'Scan Directory' when you are ready to index the warc files. You can return to the scan page at any time to view the progress of the indexing.
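
Before starting a scan, it can be useful to preview which files a directory contains. The sketch below approximates the rule described above (a recursive search for names ending in 'warc.gz') with find. It is only an approximation run from the shell, not the Warc Manager's own scanner, and the function name list_warc_files is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under a directory whose names
# end in 'warc.gz', searching subdirectories recursively.
list_warc_files() {
    find "$1" -type f -name '*warc.gz' | sort
}

# Example: count the candidate files under a directory of your choice.
#   list_warc_files /path/to/warcs | wc -l
```

Run it as a user with the same read permissions as the Tomcat process, since files that Tomcat cannot read are unlikely to be indexed.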

Searching