Webarc:Combining Statistics in Multiple Time Windows

From Adapt

Revision as of 22:34, 20 November 2009 by Scsong (talk | contribs)

We maintain a separate index for each time window. In each index, we store four term-dependent statistics and four term-independent (index-wide) statistics, as follows:

  • Term-dependent
    • <math>df(t, tw_k)</math>: the number of all documents containing term t within time window k.
    • <math>df(t, tw_k)'</math>: the number of fresh documents containing term t within time window k.
    • <math>tf(t, tw_k)</math>: the number of occurrences of term t in all documents within time window k.
    • <math>tf(t, tw_k)'</math>: the number of occurrences of term t in fresh documents within time window k.
  • Term-independent (index-wide)
    • <math>df(tw_k)</math>: the number of all documents within time window k.
    • <math>df(tw_k)'</math>: the number of fresh documents within time window k.
    • <math>tf(tw_k)</math>: the number of all terms in all documents within time window k.
    • <math>tf(tw_k)'</math>: the number of all terms in fresh documents within time window k.

By fresh documents, we mean the documents that first appeared or were updated in the given time window. Thus, the fresh documents do not appear in any previous time window.
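The eight statistics above can be sketched as a small per-window record. This is only an illustrative sketch: the names `WindowStats`, `build_window_stats`, `docs`, and `fresh_ids` are hypothetical and do not come from the actual index implementation, and documents are assumed to be already tokenized into lists of terms.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WindowStats:
    """Statistics stored for one time window k (hypothetical layout)."""
    df: Counter = field(default_factory=Counter)        # df(t, tw_k)
    df_fresh: Counter = field(default_factory=Counter)  # df(t, tw_k)'
    tf: Counter = field(default_factory=Counter)        # tf(t, tw_k)
    tf_fresh: Counter = field(default_factory=Counter)  # tf(t, tw_k)'
    n_docs: int = 0         # df(tw_k)
    n_fresh_docs: int = 0   # df(tw_k)'
    n_terms: int = 0        # tf(tw_k)
    n_fresh_terms: int = 0  # tf(tw_k)'


def build_window_stats(docs, fresh_ids):
    """Compute the statistics of one window.

    docs      -- mapping doc_id -> list of terms, for all documents in the window
    fresh_ids -- set of doc_ids that are fresh in this window (assumed given)
    """
    stats = WindowStats()
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        # Every document contributes to the "all documents" statistics.
        stats.n_docs += 1
        stats.n_terms += len(tokens)
        for term, n in counts.items():
            stats.df[term] += 1
            stats.tf[term] += n
        # Fresh documents additionally contribute to the primed statistics.
        if doc_id in fresh_ids:
            stats.n_fresh_docs += 1
            stats.n_fresh_terms += len(tokens)
            for term, n in counts.items():
                stats.df_fresh[term] += 1
                stats.tf_fresh[term] += n
    return stats


# Toy window: document "a" was carried over, document "b" is fresh.
docs = {"a": ["web", "archive", "web"], "b": ["web", "search"]}
stats = build_window_stats(docs, fresh_ids={"b"})
print(stats.df["web"], stats.df_fresh["web"], stats.tf["web"], stats.tf_fresh["web"])
```

In this toy window, "web" occurs in both documents but in only one fresh document, so the primed counts are smaller than the unprimed ones.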

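We combine the statistics for time windows i through j (written i~j) by taking the full statistics of the first window i and adding only the fresh-document statistics of each later window, so that documents carried over from an earlier window are not counted again: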

  • <math>df(t, tw_{i\sim j}) = df(t, tw_i) + \sum_{k=i+1}^j df(t, tw_k)'</math>
  • <math>tf(t, tw_{i\sim j}) = tf(t, tw_i) + \sum_{k=i+1}^j tf(t, tw_k)'</math>
  • <math>df(tw_{i\sim j}) = df(tw_i) + \sum_{k=i+1}^j df(tw_k)'</math>
  • <math>tf(tw_{i\sim j}) = tf(tw_i) + \sum_{k=i+1}^j tf(tw_k)'</math>
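
The four combination formulas can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the function name `combine` and the dictionary keys (`"df"`, `"df_fresh"`, `"tf"`, `"tf_fresh"`, `"n_docs"`, `"n_fresh_docs"`, `"n_terms"`, `"n_fresh_terms"`) are hypothetical, and the toy counts are made up for the example.

```python
def combine(windows, i, j, term):
    """Combine statistics over windows i..j (inclusive) for one term.

    Each element of `windows` is a dict with term -> count mappings under
    "df", "df_fresh", "tf", "tf_fresh" and integer totals under
    "n_docs", "n_fresh_docs", "n_terms", "n_fresh_terms".
    """
    first = windows[i]            # full statistics of window i
    later = windows[i + 1:j + 1]  # only fresh statistics of windows i+1..j
    return {
        # df(t, tw_{i~j})
        "df": first["df"].get(term, 0)
              + sum(w["df_fresh"].get(term, 0) for w in later),
        # tf(t, tw_{i~j})
        "tf": first["tf"].get(term, 0)
              + sum(w["tf_fresh"].get(term, 0) for w in later),
        # df(tw_{i~j})
        "n_docs": first["n_docs"] + sum(w["n_fresh_docs"] for w in later),
        # tf(tw_{i~j})
        "n_terms": first["n_terms"] + sum(w["n_fresh_terms"] for w in later),
    }


# Two toy windows: window 0 holds two documents; window 1 holds three
# documents, of which only one is fresh.
windows = [
    {"df": {"web": 2}, "df_fresh": {"web": 2},
     "tf": {"web": 3}, "tf_fresh": {"web": 3},
     "n_docs": 2, "n_fresh_docs": 2, "n_terms": 5, "n_fresh_terms": 5},
    {"df": {"web": 3}, "df_fresh": {"web": 1},
     "tf": {"web": 5}, "tf_fresh": {"web": 2},
     "n_docs": 3, "n_fresh_docs": 1, "n_terms": 9, "n_fresh_terms": 4},
]
combined = combine(windows, 0, 1, "web")
print(combined)
```

With i = j the sums are empty, so the function simply returns the statistics of the single window i.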