== Overview ==

When archived web pages are stored in containers, an external index is set up and managed that maps a URL and a crawl time to the ID of the container and the offset at which the corresponding information is archived. Without such an index, a sequential scan through all the containers in search of a particular piece of web information would take an enormous amount of time for any web archive of significant size.

Previous indexing schemes construct a B<sup>+</sup>-Tree in which each key entry consists of the concatenation of a URL and a crawl time. Based on the concatenated key, a search through the B<sup>+</sup>-Tree leads to a container ID and an offset where the corresponding content is stored. Clearly, the B<sup>+</sup>-Tree will have separate entries for the same URL with different crawl times, and the corresponding web content will be stored separately for every crawl, even if the content has not changed over the past several crawls. To avoid storing any duplicate web content while maintaining the ability to recover the content corresponding to a URL and a date, we use techniques similar to those introduced for persistent or multi-version data structures.
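
To make the concatenated-key scheme concrete, here is a minimal Python sketch in which a sorted list stands in for the B<sup>+</sup>-Tree; all names are illustrative, and this models the baseline scheme, not PISA itself:

<pre>
import bisect

class ConcatenatedKeyIndex:
    """Toy stand-in for a B+-Tree keyed on the concatenation (URL, crawl time)."""

    def __init__(self):
        self._keys = []   # sorted (url, crawl_time) pairs
        self._locs = []   # parallel (container_id, offset) locations

    def insert(self, url, crawl_time, container_id, offset):
        key = (url, crawl_time)
        i = bisect.bisect_left(self._keys, key)
        self._keys.insert(i, key)
        self._locs.insert(i, (container_id, offset))

    def query(self, url, time):
        """Location of the version of `url` valid at `time`, i.e. the entry
        with the largest crawl time <= time, if any."""
        i = bisect.bisect_right(self._keys, (url, time))
        if i > 0 and self._keys[i - 1][0] == url:
            return self._locs[i - 1]
        return None

idx = ConcatenatedKeyIndex()
idx.insert("http://example.org/", 20080901, container_id=7, offset=4096)
idx.insert("http://example.org/", 20080915, container_id=9, offset=0)
print(idx.query("http://example.org/", 20080910))   # -> (7, 4096)
</pre>

Note that every crawl inserts a fresh entry and stores the content again; this per-crawl redundancy is exactly what the multi-version approach avoids.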


The following are some of the notable operations that PISA supports (a hypothetical interface sketch follows the list):


* '''(URL, time) Insert''': Given an object defined by a URL and a crawl time, insert the information regarding the object into the indexing structure with a pointer to its container and the offset within the container.
* '''(URL, time) Query''': Given a URL and a time, determine the location of the content of the URL that was valid at the specified time. The location refers to the container ID and the offset within the container. The (URL, ''now'') Query is a special case of this query type, which determines the location of the most recent content of the URL.
* '''(URL, time span) Query''': Given a URL and a time span, determine the locations of the contents of the URL that were valid during the specified time span.
* '''(URL, ---) Query''': Given a URL, determine all the locations and crawl times corresponding to the URL.
* '''(----, time span) Query''': Given a time span, determine all the URLs, and their corresponding locations in the archive, that were valid during the specified time span.
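
A hypothetical Python interface mirroring these operations might look as follows; the class name, method names, and signatures are illustrative only, not PISA's actual API:

<pre>
from typing import List, Optional, Tuple

Location = Tuple[int, int]   # (container ID, offset within the container)
NOW = 2**63 - 1              # sentinel time meaning "the present" (assumption)

class PisaIndex:
    def insert(self, url: str, time: int, loc: Location) -> None:
        """(URL, time) Insert: record where the object crawled at `time` is stored."""
        raise NotImplementedError

    def query(self, url: str, time: int) -> Optional[Location]:
        """(URL, time) Query; query(url, NOW) is the most-recent special case."""
        raise NotImplementedError

    def query_span(self, url: str, t1: int, t2: int) -> List[Location]:
        """(URL, time span) Query: versions of the URL valid during [t1, t2]."""
        raise NotImplementedError

    def query_all(self, url: str) -> List[Tuple[int, Location]]:
        """(URL, ---) Query: all (crawl time, location) pairs for the URL."""
        raise NotImplementedError

    def query_all_in_span(self, t1: int, t2: int) -> List[Tuple[str, Location]]:
        """(----, time span) Query: all URLs and locations valid during [t1, t2]."""
        raise NotImplementedError
</pre>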

Our PISA indexing structure is somewhat simpler and more practical than those reported in the literature, yet it achieves the same asymptotic performance for the dynamic operations as the best known schemes.


{| class="wikitable" border="1" align="center"
|+ PISA Performance
|-
! (URL, time) Insert
! (URL, ''now'') Query
! (URL, time) Query
! (URL, time span) Query
! (URL, ---) Query
! (---, time span) Query
! Space
|-
| align="center" | <math> \textstyle O(\log_B M) </math>
| align="center" | <math> \textstyle O(\log_B M) </math>
| align="center" | <math> \textstyle O(\log_B N) </math>
| align="center" | <math> \textstyle O((\log_B N)+\frac{R}{B}) </math>
| align="center" | <math> \textstyle O((\log_B N)+\frac{R}{B}) </math>
| align="center" | <math> \textstyle O( ) </math>
| align="center" | <math> \textstyle O(\frac{R}{B}) </math>
|}

'''B''': block size
'''N''': the number of all entries
'''M''': the number of live entries
'''R''': the number of results


== Structure ==

PISA is essentially a tree-like hierarchical structure. However, it differs from an ordinary tree in that two different parent nodes can point to the same child node. In PISA, we call the internal nodes, including the root, ''index blocks'', and the leaf nodes ''data blocks''. Likewise, we call the data items in an index block ''index entries'', and those in a data block ''data entries''. We use the term ''block'' to mean either an index block or a data block, and ''entry'' to mean either an index entry or a data entry, whenever the distinction is not necessary.


Each index block is composed of header entries and index entries. The header entries record the number of all entries in the block, the number of live entries in the block, a pointer to the previous version, and the key with which the upper-level block indexes the current block. Each index entry contains an index key, a birth time, a death time, and a pointer to a block in the child level. We use {key, t_birth, t_death, ptr} to denote these four fields.


Similarly, a data block includes the same set of header entries, together with data entries. Each data entry is composed of four fields, {key, t_birth, t_death, loc}: a data key (URL), a birth time, a death time, and the location where the actual content for the data key resides in the archive. Note that the location information typically consists of a container ID (or filename) and an offset.
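
The entry layouts translate directly into record types. The following sketch assumes integer timestamps and a sentinel death time for live entries; the field names follow the text, and everything else is illustrative:

<pre>
from dataclasses import dataclass

NOW = 2**63 - 1   # assumed sentinel death time for an entry that is still live

@dataclass
class IndexEntry:          # {key, t_birth, t_death, ptr}
    key: str               # index key
    t_birth: int           # birth time
    t_death: int           # death time (NOW while the entry is live)
    ptr: int               # ID of the block in the child level

@dataclass
class DataEntry:           # {key, t_birth, t_death, loc}
    key: str               # data key (URL)
    t_birth: int           # birth time
    t_death: int           # death time (NOW while the entry is live)
    container_id: int      # loc, part 1: container ID (or filename)
    offset: int            # loc, part 2: offset within the container

def valid_at(entry, t):
    """An entry answers a query at time t when t lies in [t_birth, t_death)."""
    return entry.t_birth <= t < entry.t_death
</pre>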


== Duplicate Detection ==
Besides its main role as a location index, PISA can also be useful for duplicate detection. Duplicate detection can be done by storing, in addition to the location information, a hash of the content in the leaf nodes of PISA, and comparing the hash of the newly crawled content with the hash stored in PISA for the most recent version. Since a query for the most recent version takes only <math>O(\log_B M)</math> time, it can be performed very quickly, independent of the total data size, which keeps growing over time.
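
As a sketch of this check, suppose the index exposes a most-recent-version query that also returns the stored hash; the <code>query_now</code>, <code>insert</code>, and <code>append</code> calls below are assumed helpers, not PISA's actual interface:

<pre>
import hashlib

def archive(index, store, url, crawl_time, content):
    """Store `content` only if it differs from the most recent version of `url`."""
    digest = hashlib.sha256(content).digest()
    latest = index.query_now(url)        # O(log_B M): touches only live entries
    if latest is not None and latest.content_hash == digest:
        loc = latest.loc                 # duplicate: point to the existing copy
    else:
        loc = store.append(content)      # new content: write it to a container
    index.insert(url, crawl_time, loc, digest)
    return loc
</pre>
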
We have performed duplicate detection experiments on two different datasets: News Websites and Governors Websites. News Websites comes from a Stanford event-based collection pertaining to the 2006 US House of Representatives election, originally crawled from October 25, 2006 to November 11, 2006. Since news websites tend to be updated very frequently (several times a day), an archive collection obtained by a series of daily crawls is less likely to contain duplicate web pages. Governors Websites comes from a Stanford event-based collection pertaining to Hurricane Katrina, originally crawled from September 3, 2005 to October 29, 2005 over eight state governors' websites (Alabama, Mississippi, North Dakota, New Mexico, Oklahoma, Pennsylvania, Texas, and Louisiana). Since updates to government websites are usually slow-paced, this dataset is representative of an archiving scenario over slowly changing websites, where duplicates are expected to occur more frequently.
{| class="wikitable" border="1" align="center"
|-
!
! Days of Crawls
! Crawl Dates
! Avg # of Pages / Crawl
! Total # of Pages
! Avg Size / Crawl
! Total Size
|-
! align="left" | News Websites
| align="right" | 8
| align="right" | 10/25/06 ~ 11/02/06
| align="right" | 298,269
| align="right" | 2,335,600
| align="right" | 9.32GB
| align="right" | 73.38GB
|-
! align="left" | Governors Websites
| align="right" | 28
| align="right" | 09/08/05 ~ 10/07/05
| align="right" | 10,055
| align="right" | 273,717
| align="right" | 228MB
| align="right" | 5.98GB
|}


The following two graphs show the percentage of duplicate web pages (by page count) detected in News Websites and Governors Websites, respectively.


[[Image:dup_nw_pages.png|320px]][[Image:dup_gw_pages.png|320px]]


The following two graphs show the percentage of duplicate content, measured in bytes, in News Websites and Governors Websites, respectively.


[[Image:dup_nw_size.png|320px]][[Image:dup_gw_size.png|320px]]

The time overhead to process the two datasets is shown in the following table.

{| class="wikitable" border="1" align="center"
|-
!
! Total
! Per Page (Average)
!
|-
! align="left" | News Websites
| align="right" | 13m 52s
| align="right" | 0.356ms
| align="right" | 11.076ms
|-
! align="left" | Governors Websites
| align="right" | 27s
| align="right" | 0.098ms
| align="right" | 4.444ms
|}
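
As a consistency check, the per-page averages follow from the totals and the page counts in the dataset table above: <math>\textstyle 13\mathrm{m}\,52\mathrm{s} = 832{,}000\,\mathrm{ms}</math>, and <math>\textstyle 832{,}000 / 2{,}335{,}600 \approx 0.356\,\mathrm{ms}</math> per page for News Websites; similarly, <math>\textstyle 27{,}000 / 273{,}717 \approx 0.098\,\mathrm{ms}</math> per page for Governors Websites.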
