WarcManager: Difference between revisions
From Adapt
Revision as of 14:42, 27 April 2011
Overview
The Warc Manager is a tool to help archives quickly browse, search, and analyze archives of web crawl data. The manager is a lightweight, database-backed web application that indexes a collection of warc data and provides a browsing interface to it.
- Source Code Repository: https://subversion.umiacs.umd.edu/warc-utils/
- Nightly builds: http://adaptvm01.umiacs.umd.edu:8080/jenkins/job/Warc%20Manager%202/
Installation
The installation consists of a few simple steps.
Before you begin, you will need
1. Create the database and set up permissions.
mysql> create database webarc;
mysql> grant all on webarc.* to webarc@localhost identified by 'PASSWORD';
mysql> use webarc;
mysql> source schema.sql;
2. Install Tomcat and the JDBC driver
$ tar -xf apache-tomcat-6.0.32.tar.gz
$ cp mysql-connector-java-5.0.7-bin.jar apache-tomcat-6.0.32/lib
3. Install and configure the Warc Manager
$ cp warc-webapp-1.0-SNAPSHOT.war apache-tomcat-6.0.32/webapps/warc.war
$ mkdir -p apache-tomcat-6.0.32/conf/Catalina/localhost
$ cp context.xml apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml
- Edit apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml and make sure the password on the Resource line matches the one you specified when creating the database in step 1:
<Resource auth="Container"
          driverClassName="com.mysql.jdbc.Driver"
          maxActive="20" maxIdle="10" maxWait="-1"
          name="jdbc/warcdb"
          password="PASSWORD"
          testOnBorrow="true"
          type="javax.sql.DataSource"
          url="jdbc:mysql://localhost/webarc"
          username="webarc"
          validationQuery="SELECT 1"/>
4. Start Tomcat
$ apache-tomcat-6.0.32/bin/startup.sh
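If the application cannot reach the database after startup, a mismatch between the Resource entry from step 3 and the grant from step 1 is a likely cause. The following is a minimal sketch of a sanity check for that entry. It is not part of the Warc Manager: the function names `read_resource` and `check_resource` are hypothetical, and the sketch assumes that warc.xml is a standard Tomcat context file with the Resource element somewhere under a Context root.

```python
import xml.etree.ElementTree as ET

# Values taken from the installation steps above.
EXPECTED = {
    "username": "webarc",
    "url": "jdbc:mysql://localhost/webarc",
    "driverClassName": "com.mysql.jdbc.Driver",
}


def read_resource(context_xml_text, resource_name="jdbc/warcdb"):
    """Return the attributes of the named Resource element, or None if absent."""
    root = ET.fromstring(context_xml_text)
    # iter() also visits the root itself, so a bare <Resource/> file works too.
    for elem in root.iter("Resource"):
        if elem.get("name") == resource_name:
            return dict(elem.attrib)
    return None


def check_resource(attrs, expected_password):
    """Return a list of problems found; an empty list means nothing looked wrong."""
    if attrs is None:
        return ["no Resource named jdbc/warcdb found"]
    problems = []
    if attrs.get("password") != expected_password:
        problems.append("password differs from the one used in the grant statement")
    for key, value in EXPECTED.items():
        if attrs.get(key) != value:
            problems.append(f"{key} is {attrs.get(key)!r}, expected {value!r}")
    return problems


if __name__ == "__main__":
    path = "apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml"
    with open(path, encoding="utf-8") as handle:
        attrs = read_resource(handle.read())
    for problem in check_resource(attrs, "PASSWORD"):
        print("warning:", problem)
```

Replace 'PASSWORD' in the last lines with the password chosen in step 1. The check only compares text in the file; it does not open a database connection.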
Indexing Web Content
The Warc Manager is able to index warc data stored on the same server that the manager is running on. To index warc files, you will need to know the directory on the server where they reside.
1. From the main screen, click 'Ingest Files'. You will see the ingest file screen.
2. Enter the following information:
- Server Directory - The directory on the Warc Manager server where warc files are stored. The scan will recursively search for any files ending in 'warc.gz'.
- Select Collection / New Collection - If you want to load these files into an existing collection, select it from the drop-down box. Otherwise, type in the name of a new collection or label for these warc files.
3. Click 'Scan Directory' when you are ready to index warc files. You can return to the scan page at any time to view the progress of your upload.
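To preview which files a scan of a given directory would pick up, the recursive search described above can be approximated outside the application. This is a minimal sketch, not the Warc Manager's own scanner: the function name `find_warc_files` is hypothetical, and it assumes the rule is exactly "file name ends in 'warc.gz'", as stated above.

```python
import os


def find_warc_files(server_directory):
    """Recursively collect paths of files whose names end in 'warc.gz'."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(server_directory):
        for name in filenames:
            if name.endswith("warc.gz"):
                matches.append(os.path.join(dirpath, name))
    # Sort so that the listing is stable between runs.
    return sorted(matches)


if __name__ == "__main__":
    import sys

    # Usage: python find_warcs.py /path/to/warc/directory
    for path in find_warc_files(sys.argv[1]):
        print(path)
```

Running it against the directory you plan to enter as the Server Directory gives a quick count of the files to expect in the collection.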
Searching
The main page offers two ways to search for collections.
1. If you know the page you want to view, start typing the full URL into the search box, starting with 'http'. You will see a drop-down box containing URLs that match what you have typed. Clicking on any of them will load details for the URL.
2. Type a search string and click 'Search'. Optionally, you can add wildcards to your search. For example, entering '*.html' will search for all URLs ending in .html. Some common searches are listed below:
- *.html - search for all files ending in '.html'
- http://www.somesite.com* - view all URLs for www.somesite.com
- http://www.somesite.com/*.html - view all '.html' files on somesite.com.
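The wildcard examples above can be illustrated with a small sketch. It assumes that '*' stands for any run of characters (including none) and that the pattern must match the whole URL. This is inferred from the examples and is not taken from the Warc Manager source. The names `wildcard_to_regex` and `search` are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a '*' wildcard pattern into a compiled regular expression."""
    # Escape each literal piece so characters such as '.' and '?' match themselves.
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(pieces), re.DOTALL)


def search(urls, pattern):
    """Return the URLs that match the wildcard pattern in full."""
    regex = wildcard_to_regex(pattern)
    return [url for url in urls if regex.fullmatch(url)]


if __name__ == "__main__":
    urls = [
        "http://www.somesite.com/index.html",
        "http://www.somesite.com/img/logo.png",
        "http://other.org/a.html",
    ]
    for pattern in ("*.html", "http://www.somesite.com*", "http://www.somesite.com/*.html"):
        print(pattern, "->", search(urls, pattern))
```

Under these assumptions, a pattern without any '*' matches only a URL that is identical to it.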