Webarc:PISA: Difference between revisions
From Adapt
No edit summary |
No edit summary |
||
Line 2: | Line 2: | ||
When archived web pages are stored in containers, an external index is set up and managed that maps a URL and crawl time to the ID of the container and the offset where the corresponding information is archived. Without an external index, a sequential look up through all containers to search for the web information will take an enormous amount of time for any significant size web archive. | When archived web pages are stored in containers, an external index is set up and managed that maps a URL and crawl time to the ID of the container and the offset where the corresponding information is archived. Without an external index, a sequential look up through all containers to search for the web information will take an enormous amount of time for any significant size web archive. | ||
Previous indexing schemes construct a B<sup>+</sup>-Tree in which each key entry consists of the concatenation of a URL and crawl time. Based on the concatenated key, a search through the B<sup>+</sup>-Tree leads to a container ID and an offset where the corresponding content has been stored. Clearly the B<sup>+</sup>-Tree will have separate entries for the same URL with different crawl times, and the corresponding web content will be stored separately for every crawl even though the content may not change over the past several crawls. To avoid storing any duplicate web content and yet maintain the ability to recover the contents corresponding to a URL and a date, we use techniques that are similar to those introduced for persistent or multi-version data structures . Our PISA indexing structure is somewhat simpler and more practical than those reported in the literature, but achieves the same asymptotic performance for our dynamic operations as the best known schemes. | Previous indexing schemes construct a B<sup>+</sup>-Tree in which each key entry consists of the concatenation of a URL and crawl time. Based on the concatenated key, a search through the B<sup>+</sup>-Tree leads to a container ID and an offset where the corresponding content has been stored. Clearly the B<sup>+</sup>-Tree will have separate entries for the same URL with different crawl times, and the corresponding web content will be stored separately for every crawl even though the content may not change over the past several crawls. To avoid storing any duplicate web content and yet maintain the ability to recover the contents corresponding to a URL and a date, we use techniques that are similar to those introduced for persistent or multi-version data structures . | ||
PISA supports the following operations. | |||
* '''(URL, time) Insert''': Given an object defined by a URL and a crawl time, insert the information regarding the object into the indexing structure with a pointer to its container and the offset within the container. | |||
* '''(URL, time) Query''': Given a URL and a time, determine the location of the content of the URL that was valid at the specified time. The location refers to the container ID and the offset within the container. | |||
* '''(URL, time span) Query''': Given a URL and a time span, determine the locations of the contents of the URL that were valid during the specified time span. | |||
* '''(URL, ---) Query''': Given a URL, determine all the locations and crawl times corresponding to the URL. | |||
* '''(----, time span) Query''': Given a time span, determine all the URLs and their corresponding locations in the archive, which were valid at the specified time span. | |||
Our PISA indexing structure is somewhat simpler and more practical than those reported in the literature, but achieves the same asymptotic performance for our dynamic operations as the best known schemes. | |||
{| class="wikitable" border="1" align="center" | {| class="wikitable" border="1" align="center" | ||
|+ PISA Performance | |+ PISA Performance | ||
|- | |- | ||
! Insert | ! (URL, time) Insert | ||
! URL | ! (URL, time) Query | ||
! URL- | ! (URL, time span) Query | ||
! | ! (URL, ---) Query | ||
! (---, time span) Query | |||
! Space | ! Space | ||
|- | |- | ||
Line 17: | Line 29: | ||
| align="center" | <math> \textstyle O((log_B N)+\frac{R}{B}) </math> | | align="center" | <math> \textstyle O((log_B N)+\frac{R}{B}) </math> | ||
| align="center" | <math> \textstyle O((log_B N)+\frac{R}{B}) </math> | | align="center" | <math> \textstyle O((log_B N)+\frac{R}{B}) </math> | ||
| align="center" | <math> O(\frac{R}{B) </math> | | align="center" | <math> \textstyle O( ) </math> | ||
| align="center" | <math> O(\frac{R}{B}) </math> | |||
|} | |} | ||
'''B''': block size | |||
'''N''': the number of all entries | |||
'''M''': the number of live entries | |||
'''R''': the number of results | |||
PISA is essentially a tree-like hierarchical structure. However, it differs from the tree structure, since two different parent nodes can point to the same child node. In PISA, we call internal nodes including the root ''index blocks'', while we call leaf nodes ''data blocks''. Also, we call data items in an index block as ''index entries'', and those in a data block ''data entries''. We use the term ''block'' to mean both index block and data block, and ''entry'' to mean both index entry and data entry, whenever the distinction is not necessary. | PISA is essentially a tree-like hierarchical structure. However, it differs from the tree structure, since two different parent nodes can point to the same child node. In PISA, we call internal nodes including the root ''index blocks'', while we call leaf nodes ''data blocks''. Also, we call data items in an index block as ''index entries'', and those in a data block ''data entries''. We use the term ''block'' to mean both index block and data block, and ''entry'' to mean both index entry and data entry, whenever the distinction is not necessary. |
Revision as of 21:12, 24 September 2008
Overview
When archived web pages are stored in containers, an external index is set up and managed that maps a URL and crawl time to the ID of the container and the offset where the corresponding information is archived. Without an external index, a sequential look up through all containers to search for the web information will take an enormous amount of time for any significant size web archive.
Previous indexing schemes construct a B+-Tree in which each key entry consists of the concatenation of a URL and crawl time. Based on the concatenated key, a search through the B+-Tree leads to a container ID and an offset where the corresponding content has been stored. Clearly the B+-Tree will have separate entries for the same URL with different crawl times, and the corresponding web content will be stored separately for every crawl even though the content may not change over the past several crawls. To avoid storing any duplicate web content and yet maintain the ability to recover the contents corresponding to a URL and a date, we use techniques that are similar to those introduced for persistent or multi-version data structures .
PISA supports the following operations.
- (URL, time) Insert: Given an object defined by a URL and a crawl time, insert the information regarding the object into the indexing structure with a pointer to its container and the offset within the container.
- (URL, time) Query: Given a URL and a time, determine the location of the content of the URL that was valid at the specified time. The location refers to the container ID and the offset within the container.
- (URL, time span) Query: Given a URL and a time span, determine the locations of the contents of the URL that were valid during the specified time span.
- (URL, ---) Query: Given a URL, determine all the locations and crawl times corresponding to the URL.
- (----, time span) Query: Given a time span, determine all the URLs and their corresponding locations in the archive, which were valid at the specified time span.
Our PISA indexing structure is somewhat simpler and more practical than those reported in the literature, but achieves the same asymptotic performance for our dynamic operations as the best known schemes.
(URL, time) Insert | (URL, time) Query | (URL, time span) Query | (URL, ---) Query | (---, time span) Query | Space |
---|---|---|---|---|---|
<math> O(log_B M) </math> | <math> O(log_B N) </math> | <math> \textstyle O((log_B N)+\frac{R}{B}) </math> | <math> \textstyle O((log_B N)+\frac{R}{B}) </math> | <math> \textstyle O( ) </math> | <math> O(\frac{R}{B}) </math> |
B: block size N: the number of all entries M: the number of live entries R: the number of results
PISA is essentially a tree-like hierarchical structure. However, it differs from the tree structure, since two different parent nodes can point to the same child node. In PISA, we call internal nodes including the root index blocks, while we call leaf nodes data blocks. Also, we call data items in an index block as index entries, and those in a data block data entries. We use the term block to mean both index block and data block, and entry to mean both index entry and data entry, whenever the distinction is not necessary.
Each index block is composed of header entries and index entries. Inside the header entries are the number of all entries in the block, the number of live entries in the block, the pointer to the previous version, and the key with which the upper level block indexes the current block. Each index entry contains an index key, birth time, death time, and a pointer to a block in the child level. We use {key, t_birth, t_death, ptr} to indicate these four fields.
Similarly, a data block includes the same set of header entries, and data entries. Each data entry is composed of four fields, {key, t_birth, t_death, loc}, that are a data key (URL), birth time, death time, and the location where the actual contents with the data key resides in the archive. Note that the location information typically has a container ID (or filename) and an offset.