Webarc:PISA
From Adapt
Overview
When archived web pages are stored in containers, an external index is set up and managed that maps a URL and crawl time to the ID of the container and the offset where the corresponding information is archived. Without an external index, a sequential look up through all containers to search for the web information will take an enormous amount of time for any significant size web archive.
Previous indexing schemes construct a B+-Tree in which each key entry consists of the concatenation of a URL and crawl time. Based on the concatenated key, a search through the B+-Tree leads to a container ID and an offset where the corresponding content has been stored. Clearly the B+-Tree will have separate entries for the same URL with different crawl times, and the corresponding web content will be stored separately for every crawl even though the content may not change over the past several crawls. To avoid storing any duplicate web content and yet maintain the ability to recover the contents corresponding to a URL and a date, we use techniques that are similar to those introduced for persistent or multi-version data structures . Our PISA indexing structure is somewhat simpler and more practical than those reported in the literature, but achieves the same asymptotic performance for our dynamic operations as the best known schemes.
Insert | URL-Time Query | URL-Timespan Query | Timeslice Query | Space |
---|---|---|---|---|
<math> O(log_B M) </math> | <math> O(log_B N) </math> | <math> \textstyle O((log_B N)+\frac{R}{B}) </math> | <math> \textstyle O((log_B N)+\frac{R}{B}) </math> | <math> O(\frac{R}{B) </math> |
PISA is essentially a tree-like hierarchical structure. However, it differs from the tree structure, since two different parent nodes can point to the same child node. In PISA, we call internal nodes including the root index blocks, while we call leaf nodes data blocks. Also, we call data items in an index block as index entries, and those in a data block data entries. We use the term block to mean both index block and data block, and entry to mean both index entry and data entry, whenever the distinction is not necessary.
Each index block is composed of header entries and index entries. Inside the header entries are the number of all entries in the block, the number of live entries in the block, the pointer to the previous version, and the key with which the upper level block indexes the current block. Each index entry contains an index key, birth time, death time, and a pointer to a block in the child level. We use {key, t_birth, t_death, ptr} to indicate these four fields.
Similarly, a data block includes the same set of header entries, and data entries. Each data entry is composed of four fields, {key, t_birth, t_death, loc}, that are a data key (URL), birth time, death time, and the location where the actual contents with the data key resides in the archive. Note that the location information typically has a container ID (or filename) and an offset.