Personal tools

Webarc:Packaging

From Adapt

Revision as of 15:23, 24 September 2008 by Scsong (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search
    • Background
      • Web Graph

Web can be considered as a weighted, directed graph where web pages are vertices and hyperlinks are edges in the graph. The size of a page is the weight of the corresponding vertex. File:Webgraph.jpg

      • Web Containers

There are many different ways to archive the web. One of the most popular methods is through use of containers. In this container method, a multitude of web pages are aggregated into a container. An emerging standard for such containers is the WARC format, which evolved from the ARC container format developed by the Internet Archive, currently the world’s largest internet archive. File:Webcontainers.jpg

    • Motivation
      • Conventional Container Packaging

The conventional way to construct containers is rather naive. Most web crawls are based on Breadth First Search or Depth First Search, and throughout the crawl session, first seen pages are put into a container until the container gets full. When the container gets full, a new container is created for the pages to be seen next. The process repeats itself until the crawl session ends.

      • Problem with the Conventional Container Packaging

Although simple and straightforward, one problem with this simple packaging approach appears at access time. That is, tightly related pages can end up in different containers, and more number of containers will be necessary in a random browsing session. File:Conv container packaging.jpg

    • Our Goal

We aim to develop a methodology to minimize the probability that a user jumps often between containers when accessing the web archive. File:Ideal container packaging.jpg

    • Our Approach
      • Graph Analysis Technique

Based on PageRank, we compute EdgeRank for each edge that represents the global probability of the edge to be taken.

EdgeRank <math> ER = \frac{\frac{1-d}{N} + d\sum_{v \in I_u} p_{vu} PR(v)}{outdegree(v)}</math>

      • Graph Partitioning Technique