Webarc:Combining Statistics in Multiple Time Windows

From Adapt

Latest revision as of 22:35, 20 November 2009

We maintain an index for each time window. In each index, we store four term-dependent statistics parameters and four term-independent (index-wide) statistics parameters as follows:

  • Term-dependent
    • <math>df(t, tw_k)</math>: the number of all documents containing term t within time window k.
    • <math>df(t, tw_k)'</math>: the number of fresh documents containing term t within time window k.
    • <math>tf(t, tw_k)</math>: the number of occurrences of term t in all documents within time window k.
    • <math>tf(t, tw_k)'</math>: the number of occurrences of term t in fresh documents within time window k.
  • Term-independent (index-wide)
    • <math>df(tw_k)</math>: the number of all documents within time window k.
    • <math>df(tw_k)'</math>: the number of fresh documents within time window k.
    • <math>tf(tw_k)</math>: the number of all terms in all documents within time window k.
    • <math>tf(tw_k)'</math>: the number of all terms in fresh documents within time window k.

By fresh documents, we mean the documents that were newly updated or first appeared in the given time window. Thus, the fresh documents do not appear in any previous time window.
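As an illustration of the eight parameters, the following minimal Python sketch computes them for a single time window. The names `WindowStats` and `build_window_stats`, and the representation of a document as a `(tokens, is_fresh)` pair, are assumptions made for this example and are not part of the actual index implementation.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WindowStats:
    """Hypothetical container for the statistics of one time window k."""
    df_t: Counter = field(default_factory=Counter)        # df(t, tw_k)
    df_t_fresh: Counter = field(default_factory=Counter)  # df(t, tw_k)'
    tf_t: Counter = field(default_factory=Counter)        # tf(t, tw_k)
    tf_t_fresh: Counter = field(default_factory=Counter)  # tf(t, tw_k)'
    df: int = 0        # df(tw_k)
    df_fresh: int = 0  # df(tw_k)'
    tf: int = 0        # tf(tw_k)
    tf_fresh: int = 0  # tf(tw_k)'


def build_window_stats(docs):
    """Compute the statistics of one window.

    docs: iterable of (tokens, is_fresh) pairs, where tokens is a list of
    terms and is_fresh marks documents that are new or updated in this window
    (an assumed input format).
    """
    stats = WindowStats()
    for tokens, is_fresh in docs:
        counts = Counter(tokens)
        # Every document contributes to the "all documents" statistics.
        stats.df += 1
        stats.tf += len(tokens)
        stats.df_t.update(counts.keys())  # one per distinct term
        stats.tf_t.update(counts)         # one per occurrence
        if is_fresh:
            # Fresh documents additionally contribute to the primed statistics.
            stats.df_fresh += 1
            stats.tf_fresh += len(tokens)
            stats.df_t_fresh.update(counts.keys())
            stats.tf_t_fresh.update(counts)
    return stats
```

In this sketch the primed (fresh) counters are always a subset of the unprimed ones, which matches the definitions above.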

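We combine statistics for time windows i~j (window i through window j) as follows. Window i contributes its counts over all documents, and each later window contributes only its fresh-document counts, so documents that persist unchanged from an earlier window are not counted again: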

  • <math>df(t, tw_{i\sim j}) = df(t, tw_i) + \sum_{k=i+1}^j df(t, tw_k)'</math>
  • <math>tf(t, tw_{i\sim j}) = tf(t, tw_i) + \sum_{k=i+1}^j tf(t, tw_k)'</math>
  • <math>df(tw_{i\sim j}) = df(tw_i) + \sum_{k=i+1}^j df(tw_k)'</math>
  • <math>tf(tw_{i\sim j}) = tf(tw_i) + \sum_{k=i+1}^j tf(tw_k)'</math>
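The four formulas above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `combine_windows`, the dictionary keys, and the zero-based window positions are hypothetical and do not describe the actual index format.

```python
from collections import Counter


def combine_windows(windows, i, j):
    """Combine statistics for time windows i..j (inclusive, zero-based).

    windows: list of dicts (assumed layout) with Counter values under
    'df_t', 'df_t_fresh', 'tf_t', 'tf_t_fresh' and int values under
    'df', 'df_fresh', 'tf', 'tf_fresh'.
    """
    if not 0 <= i <= j < len(windows):
        raise ValueError("expected 0 <= i <= j < len(windows)")

    first = windows[i]
    # Window i contributes its statistics over all documents.
    combined = {
        "df_t": Counter(first["df_t"]),  # copy, so the input is not mutated
        "tf_t": Counter(first["tf_t"]),
        "df": first["df"],
        "tf": first["tf"],
    }
    # Windows i+1..j contribute only their fresh-document statistics.
    for k in range(i + 1, j + 1):
        w = windows[k]
        combined["df_t"].update(w["df_t_fresh"])
        combined["tf_t"].update(w["tf_t_fresh"])
        combined["df"] += w["df_fresh"]
        combined["tf"] += w["tf_fresh"]
    return combined
```

With i equal to j, the loop body never runs and the result is simply the unprimed statistics of window i, in agreement with the formulas, whose sums are then empty.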