Webarc:Combining Statistics in Multiple Time Windows

From Adapt

Revision as of 22:33, 20 November 2009

We maintain an index for each time window. In each index, we store four term-dependent statistics and four term-independent (index-wide) statistics, as follows:

  • Term-dependent
    • <math>df(t, tw_k)</math>: the number of all documents containing term t within time window k.
    • <math>df(t, tw_k)'</math>: the number of fresh documents containing term t within time window k.
    • <math>tf(t, tw_k)</math>: the number of occurrences of term t in all documents within time window k.
    • <math>tf(t, tw_k)'</math>: the number of occurrences of term t in fresh documents within time window k.
  • Term-independent (index-wide)
    • <math>df(tw_k)</math>: the number of all documents within time window k.
    • <math>df(tw_k)'</math>: the number of fresh documents within time window k.
    • <math>tf(tw_k)</math>: the number of all terms in all documents within time window k.
    • <math>tf(tw_k)'</math>: the number of all terms in fresh documents within time window k.

By fresh documents, we mean the documents that were updated or first appeared in the given time window. Thus, fresh documents do not appear in any previous time window.
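The eight statistics above could be gathered for one time window roughly as follows. This is a minimal sketch, not the actual index code: the names `WindowStats` and `build_window_stats` are hypothetical, and it assumes each document arrives as a list of tokens together with a flag saying whether it is fresh in this window.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WindowStats:
    """Hypothetical container for the statistics of one time window k."""
    # Term-dependent statistics, keyed by term t.
    df: Counter = field(default_factory=Counter)        # df(t, tw_k)
    df_fresh: Counter = field(default_factory=Counter)  # df(t, tw_k)'
    tf: Counter = field(default_factory=Counter)        # tf(t, tw_k)
    tf_fresh: Counter = field(default_factory=Counter)  # tf(t, tw_k)'
    # Term-independent (index-wide) statistics.
    n_docs: int = 0         # df(tw_k)
    n_docs_fresh: int = 0   # df(tw_k)'
    n_terms: int = 0        # tf(tw_k)
    n_terms_fresh: int = 0  # tf(tw_k)'


def build_window_stats(docs):
    """Build statistics from (tokens, is_fresh) pairs of one time window.

    The (tokens, is_fresh) input format is an assumption for illustration.
    """
    stats = WindowStats()
    for tokens, is_fresh in docs:
        counts = Counter(tokens)        # occurrences of each term in this document
        stats.n_docs += 1
        stats.n_terms += len(tokens)
        stats.tf.update(counts)         # add per-term occurrence counts
        stats.df.update(counts.keys())  # each distinct term counts this document once
        if is_fresh:
            stats.n_docs_fresh += 1
            stats.n_terms_fresh += len(tokens)
            stats.tf_fresh.update(counts)
            stats.df_fresh.update(counts.keys())
    return stats
```

Fresh documents contribute to both the "all" and the "fresh" counters, so every fresh statistic is at most the corresponding overall statistic.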

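We combine statistics for time windows i through j (written i~j) by taking the full statistics of window i and adding only the fresh statistics of each later window: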

  • <math>df(t, tw_{i\sim j}) = df(t, tw_i) + \sum_{k=i+1}^j df(t, tw_k)'</math>
  • <math>tf(t, tw_{i\sim j}) = tf(t, tw_i) + \sum_{k=i+1}^j tf(t, tw_k)'</math>
  • <math>df(tw_{i\sim j}) = df(tw_i) + \sum_{k=i+1}^j df(tw_k)'</math>
  • <math>tf(tw_{i\sim j}) = tf(tw_i) + \sum_{k=i+1}^j tf(tw_k)'</math>
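
All four formulas have the same shape, so one helper can serve for each of them. The sketch below is illustrative only: the function name `combine` and the layout of the inputs (one mapping from window number to the "all" value, one to the "fresh" value) are assumptions, and the numbers are made up. Term-dependent statistics are modelled as `Counter` objects keyed by term, index-wide statistics as plain integers.

```python
from collections import Counter


def combine(all_by_window, fresh_by_window, i, j):
    """Combine one statistic over time windows i..j (inclusive).

    Returns all_by_window[i] plus fresh_by_window[k] for k = i+1 .. j.
    Values may be ints (index-wide statistics) or Counters keyed by term
    (term-dependent statistics); both support `+` without mutating inputs.
    """
    if i > j:
        raise ValueError("expected i <= j")
    total = all_by_window[i]
    for k in range(i + 1, j + 1):
        total = total + fresh_by_window[k]
    return total


# Made-up term-dependent example: df(t, tw_k) and df(t, tw_k)' for k = 1..3.
df_all = {1: Counter(a=3, b=1), 2: Counter(a=4, b=2), 3: Counter(a=5, b=2, c=1)}
df_fresh = {1: Counter(a=3, b=1), 2: Counter(a=1, b=1), 3: Counter(a=1, c=1)}
df_1_to_3 = combine(df_all, df_fresh, 1, 3)  # a: 3+1+1, b: 1+1, c: 1

# Made-up index-wide example: df(tw_k) and df(tw_k)' for k = 1..3.
n_docs_all = {1: 10, 2: 12, 3: 15}
n_docs_fresh = {1: 10, 2: 2, 3: 3}
n_docs_1_to_3 = combine(n_docs_all, n_docs_fresh, 1, 3)  # 10 + 2 + 3
```

When i equals j the sum is empty and the result is simply the statistic of window i.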