WarcManager
Latest revision as of 19:07, 29 September 2016
Overview
The Warc Manager is a tool that helps archives quickly browse, search, and analyze collections of web crawl data. It is a lightweight, database-backed web application that indexes a collection of WARC data and provides a browsing interface to it.
- Source Code Repository: https://gitlab.umiacs.umd.edu/adapt/warc-utils
- Nightly builds (this link is struck out in the latest revision and is likely retired): http://adaptci01.umiacs.umd.edu:8080/job/WARC%20Manager%202/
Searching
The Warc Manager offers two ways to search collections.
1. If you know the page you want to view, start typing its full URL into the search box, beginning with 'http'. A drop-down box lists the URLs that match what you have typed. Clicking any of them loads the details for that URL.
2. Type a search string and click 'Search'. You can optionally add wildcards to your search. For example, entering '*.html' searches for all URLs ending in '.html'. Some common searches are listed below:
- *.html - search for all files ending in '.html'
- http://www.somesite.com* - view all URLs for www.somesite.com
- http://www.somesite.com/*.html - view all '.html' files on www.somesite.com
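The wildcard searches above can be illustrated with a minimal sketch. It assumes that '*' stands for any run of characters and that the whole URL must match the pattern; the function name `wildcard_match` is hypothetical, and the Warc Manager's actual matching (for example, whether it is case-sensitive) may differ.

```python
import re


def wildcard_match(pattern: str, url: str) -> bool:
    """Return True if url matches pattern, where '*' matches any run of characters.

    Illustrative only: this is not the Warc Manager's own search code.
    """
    # Escape each literal piece, then join the pieces with '.*' so that only
    # '*' acts as a wildcard (characters such as '?' or '.' stay literal).
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url) is not None


# The three common searches, applied to some sample URLs.
urls = [
    "http://www.somesite.com/index.html",
    "http://www.somesite.com/images/logo.png",
    "http://www.othersite.org/about.html",
]
html_anywhere = [u for u in urls if wildcard_match("*.html", u)]
on_somesite = [u for u in urls if wildcard_match("http://www.somesite.com*", u)]
html_on_somesite = [u for u in urls if wildcard_match("http://www.somesite.com/*.html", u)]
```

Escaping the literal pieces matters for URLs, because characters such as '?' and '.' are common in them and should not be treated as wildcards.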
Search Results
The search results show all matching URLs along with some statistics about each URL. Double-clicking any row loads the details for that URL.
- First Crawl - The earliest crawl date for the URL.
- Last Crawl - The most recent crawl date for the URL.
- Harvested - The total number of times a page appears in the archive. This includes duplicates and revisits.
- Unique - The number of unique copies of the page that exist in the archive.
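As a rough illustration of how the four statistics relate to one another, the sketch below derives them from a list of capture records for a single URL. The `Capture` record and the use of a content digest to decide uniqueness are assumptions made for this example; the Warc Manager's own schema and duplicate detection are not described here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class Capture:
    """One appearance of a URL in the archive (hypothetical record shape)."""
    crawled_at: datetime
    digest: str  # content hash; assumed to identify identical copies


def summarize(captures: List[Capture]) -> Dict[str, object]:
    """Compute First Crawl, Last Crawl, Harvested and Unique for one URL."""
    dates = [c.crawled_at for c in captures]
    return {
        "first_crawl": min(dates),
        "last_crawl": max(dates),
        "harvested": len(captures),                   # every appearance counts
        "unique": len({c.digest for c in captures}),  # distinct contents only
    }


captures = [
    Capture(datetime(2011, 3, 1), "aaa"),
    Capture(datetime(2011, 4, 1), "aaa"),  # unchanged copy of the page
    Capture(datetime(2011, 5, 1), "bbb"),  # page content changed
]
stats = summarize(captures)
```

With these three sample captures, Harvested is 3 and Unique is 2, because two of the captures share the same digest.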
URL Details
The URL details page shows detailed information for any URL in the collection. If there are multiple copies or different versions of a URL, they are listed in the left-hand column. Clicking any entry loads the details for that copy.
If the URL you are viewing is an HTML page (as opposed to an image, for example), the Warc Manager attempts to extract all links contained in that page and populates the table on the lower right. Links that appear in the archive are shown in black; missing links are shown in red. You can double-click any page in the table to view the details for that URL. If the URL is not an HTML file (e.g., an image), the link table remains empty.
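The link table described above can be approximated with the following sketch, which uses only the Python standard library. It assumes that links come from 'href' and 'src' attributes and that a link counts as archived when its absolute URL is in a known set; the names `LinkCollector` and `classify_links` are hypothetical and do not come from the Warc Manager source.

```python
from html.parser import HTMLParser
from typing import List, Set, Tuple
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from href/src attributes of an HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def classify_links(html: str, base_url: str, archived: Set[str]) -> List[Tuple[str, bool]]:
    """Return (link, in_archive) pairs in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [(link, link in archived) for link in collector.links]


page = '<a href="/about.html">About</a> <img src="logo.png">'
rows = classify_links(
    page,
    "http://www.somesite.com/index.html",
    archived={"http://www.somesite.com/about.html"},
)
```

Each row pairs a link with a flag: True corresponds to a link shown in black (present in the archive) and False to one shown in red (missing).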
Clicking on the 'Download File' link will allow you to download the selected page from the archive.
Installation
The installation consists of a few simple steps.
Before you begin, you will need to download:
- Apache Tomcat: http://tomcat.apache.org
- MySQL JDBC connector: http://dev.mysql.com/downloads/connector/j/
- context.xml, schema.sql, and the warc webapp from: http://adaptvm01.umiacs.umd.edu:8080/jenkins/job/Warc%20Manager%202/
1. Create the database and set up permissions.
mysql> create database webarc;
mysql> grant all on webarc.* to webarc@localhost identified by 'PASSWORD';
mysql> use webarc;
mysql> source schema.sql;
2. Install Tomcat and the JDBC driver.
$ tar -xf apache-tomcat-6.0.32.tar.gz
$ cp mysql-connector-java-5.0.7-bin.jar apache-tomcat-6.0.32/lib
3. Install and configure the Warc Manager
$ cp warc-webapp-1.0-SNAPSHOT.war apache-tomcat-6.0.32/webapps/warc.war
$ mkdir -p apache-tomcat-6.0.32/conf/Catalina/localhost
$ cp context.xml apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml
- Edit apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml and make sure the password in the Resource line matches the one you specified when creating the database:
<Resource auth="Container" driverClassName="com.mysql.jdbc.Driver" maxActive="20"
          maxIdle="10" maxWait="-1" name="jdbc/warcdb" password="PASSWORD" testOnBorrow="true"
          type="javax.sql.DataSource" url="jdbc:mysql://localhost/webarc" username="webarc"
          validationQuery="SELECT 1"/>
- Add the following line after the Resource line:
<Parameter name="ingest.url" value="http://localhost:8080/warc/rest/ingest"/>
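Before starting Tomcat, it can help to confirm that warc.xml contains the values you expect. The sketch below parses a context file and reads back the data source settings and the ingest URL. It assumes the Resource and Parameter elements sit inside a `<Context>` root element, which is the usual Tomcat layout but is not shown in this page; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def read_datasource(context_xml: str, name: str = "jdbc/warcdb") -> Dict[str, Optional[str]]:
    """Return url, username and password of the named Resource element."""
    root = ET.fromstring(context_xml)
    for resource in root.iter("Resource"):
        if resource.get("name") == name:
            return {key: resource.get(key) for key in ("url", "username", "password")}
    raise LookupError("no Resource named %r" % name)


def read_parameter(context_xml: str, name: str) -> Optional[str]:
    """Return the value of the named Parameter element, or None if absent."""
    root = ET.fromstring(context_xml)
    for parameter in root.iter("Parameter"):
        if parameter.get("name") == name:
            return parameter.get("value")
    return None


# Sample content mirroring the two lines above; the <Context> wrapper is assumed.
SAMPLE = """<Context>
  <Resource auth="Container" driverClassName="com.mysql.jdbc.Driver"
            name="jdbc/warcdb" password="PASSWORD" type="javax.sql.DataSource"
            url="jdbc:mysql://localhost/webarc" username="webarc"/>
  <Parameter name="ingest.url" value="http://localhost:8080/warc/rest/ingest"/>
</Context>"""

datasource = read_datasource(SAMPLE)
ingest_url = read_parameter(SAMPLE, "ingest.url")
```

To check a real installation, read the contents of apache-tomcat-6.0.32/conf/Catalina/localhost/warc.xml into a string and pass it in place of `SAMPLE`.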
4. Start Tomcat
$ apache-tomcat-6.0.32/bin/startup.sh
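Once Tomcat has started, a quick request can confirm that the web application answers. The following is a minimal sketch; it assumes the application is served at http://localhost:8080/warc/, which is inferred from the warc.war file name and the ingest URL above rather than stated in this page, and the function name `http_status` is hypothetical.

```python
import urllib.error
import urllib.request
from typing import Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no server answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except OSError:
        # Connection refused, timeout, DNS failure and similar problems.
        return None


# Example call (assumed URL):
#   http_status("http://localhost:8080/warc/")
# A result of None suggests Tomcat is not running or not listening on that port.
```

A 404 response would suggest that Tomcat is up but the warc application was not deployed under the expected name.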
Indexing Web Content
The Warc Manager can index WARC data stored on the same server that it runs on. To index WARC files, you need to know the directory on the server where they reside.
1. From the main screen, click 'Ingest Files'. The ingest file screen appears.
2. Enter the following information:
- Server Directory - The directory on the Warc Manager server where the WARC files are stored. The scan recursively searches for any files ending in 'warc.gz'.
- Select Collection / New Collection - To load these files into an existing collection, select it from the drop-down box. Otherwise, type the name of a new collection or a label for these WARC files.
3. Click 'Scan Directory' when you are ready to index the WARC files. You can return to the scan page at any time to view the progress of your upload.
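To preview which files a scan of a given Server Directory would pick up, the recursive search for names ending in 'warc.gz' can be sketched as follows. This is an illustration that runs outside the Warc Manager and is not its scanning code; the function name `find_warc_files` is hypothetical.

```python
import os
from typing import List


def find_warc_files(server_directory: str) -> List[str]:
    """Recursively list files under server_directory whose names end in 'warc.gz'."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(server_directory):
        for filename in filenames:
            if filename.endswith("warc.gz"):
                matches.append(os.path.join(dirpath, filename))
    # Sort so the listing is stable from run to run.
    return sorted(matches)
```

Running this against the directory you plan to enter in the ingest screen gives a rough idea of how many files the scan will need to index.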