Webarc:Combining Statistics in Multiple Time Windows: Difference between revisions

Revision as of 22:34, 20 November 2009

We maintain one index per time window. Each index stores four term-dependent statistics and four term-independent (index-wide) statistics:

Term-dependent
- <math>df(t, tw_k)</math>: the number of all documents containing term t within time window k.
- <math>df(t, tw_k)'</math>: the number of fresh documents containing term t within time window k.
- <math>tf(t, tw_k)</math>: the number of occurrences of term t in all documents within time window k.
- <math>tf(t, tw_k)'</math>: the number of occurrences of term t in fresh documents within time window k.

Term-independent (index-wide)
- <math>df(tw_k)</math>: the number of all documents within time window k.
- <math>df(tw_k)'</math>: the number of fresh documents within time window k.
- <math>tf(tw_k)</math>: the number of all terms in all documents within time window k.
- <math>tf(tw_k)'</math>: the number of all terms in fresh documents within time window k.

By fresh documents, we mean the documents that were newly updated or first appeared in the given time window. Thus, the fresh documents of a time window do not appear in any previous time window.
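The eight statistics above could be gathered for a single time window roughly as follows. This is a minimal sketch, not the project's actual indexing code: the `WindowStats` class, its field names, the `build_window_stats` function, and the representation of a document as a `(terms, is_fresh)` pair are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WindowStats:
    """The eight statistics of one time window (hypothetical field names)."""
    df_t: Counter = field(default_factory=Counter)        # df(t, tw_k)
    df_t_fresh: Counter = field(default_factory=Counter)  # df(t, tw_k)'
    tf_t: Counter = field(default_factory=Counter)        # tf(t, tw_k)
    tf_t_fresh: Counter = field(default_factory=Counter)  # tf(t, tw_k)'
    df: int = 0        # df(tw_k)
    df_fresh: int = 0  # df(tw_k)'
    tf: int = 0        # tf(tw_k)
    tf_fresh: int = 0  # tf(tw_k)'


def build_window_stats(docs):
    """Compute the statistics of one window.

    docs is an iterable of (terms, is_fresh) pairs, where terms is the
    list of tokens of a document and is_fresh says whether the document
    was newly updated or first appeared in this window (assumed input).
    """
    stats = WindowStats()
    for terms, is_fresh in docs:
        counts = Counter(terms)
        # Statistics over all documents in the window.
        stats.df += 1
        stats.tf += len(terms)
        stats.df_t.update(counts.keys())  # each distinct term counts once
        stats.tf_t.update(counts)         # each occurrence counts
        if is_fresh:
            # The same statistics restricted to fresh documents.
            stats.df_fresh += 1
            stats.tf_fresh += len(terms)
            stats.df_t_fresh.update(counts.keys())
            stats.tf_t_fresh.update(counts)
    return stats
```

For a window holding one fresh document `["web", "archive", "web"]` and one non-fresh document `["archive"]`, the term "archive" has a document frequency of 2 over all documents but only 1 over fresh documents.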

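We combine the statistics for the consecutive time windows i through j (written i~j) by taking the full counts of window i and adding only the fresh counts (the primed statistics) of each later window, so that documents already counted in an earlier window are not counted again: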

<math>df(t, tw_{i\sim j}) = df(t, tw_i) + \sum_{k=i+1}^{j} df(t, tw_k)'</math>
<math>tf(t, tw_{i\sim j}) = tf(t, tw_i) + \sum_{k=i+1}^{j} tf(t, tw_k)'</math>
<math>df(tw_{i\sim j}) = df(tw_i) + \sum_{k=i+1}^{j} df(tw_k)'</math>
<math>tf(tw_{i\sim j}) = tf(tw_i) + \sum_{k=i+1}^{j} tf(tw_k)'</math>
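
The four sums can be sketched in code as below. This is only an illustration under stated assumptions: the function name `combine_windows`, the zero-based window numbering, and the dictionary layout of each window (keys `df_t`, `tf_t`, `df_t_fresh`, `tf_t_fresh`, `df`, `tf`, `df_fresh`, `tf_fresh`) are hypothetical and are not taken from the project's code.

```python
from collections import Counter


def combine_windows(windows, i, j):
    """Combine the statistics of windows i..j (inclusive, zero-based).

    windows[k] is assumed to be a dict with:
      "df_t", "tf_t"             Counter, term -> count over all documents
      "df_t_fresh", "tf_t_fresh" Counter, term -> count over fresh documents
      "df", "tf"                 int, index-wide counts over all documents
      "df_fresh", "tf_fresh"     int, index-wide counts over fresh documents
    """
    if not 0 <= i <= j < len(windows):
        raise ValueError("need 0 <= i <= j < len(windows)")

    first = windows[i]
    # Start from the full (unprimed) statistics of window i.
    combined = {
        "df_t": Counter(first["df_t"]),
        "tf_t": Counter(first["tf_t"]),
        "df": first["df"],
        "tf": first["tf"],
    }
    # Add only the fresh (primed) statistics of windows i+1..j.
    for k in range(i + 1, j + 1):
        w = windows[k]
        combined["df_t"].update(w["df_t_fresh"])
        combined["tf_t"].update(w["tf_t_fresh"])
        combined["df"] += w["df_fresh"]
        combined["tf"] += w["tf_fresh"]
    return combined
```

For example, if window 0 contains 3 documents in total and window 1 contains 4 documents of which 1 is fresh, the combined document count for windows 0~1 is 3 + 1 = 4, not 3 + 4 = 7.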

@@ Line 3: / Line 3: @@
 * Term-dependent
 ** <font size="4"><math>df(t, tw_k)</math></font>: the number of all documents  containing term t within time window k.
-** <font size="4"><math>df(t, tw_k)'</math></font>: the number of fresh documents  containing term t within time window k.
+** <font size="4"><math>df(t, tw_k)'</math></font>: the number of '''fresh''' documents  containing term t within time window k.
 ** <font size="4"><math>tf(t, tw_k)</math></font>: the number of occurrences of term t in all documents within time window k.
-** <font size="4"><math>tf(t, tw_k)'</math></font>: the number of occurrences of term t in fresh documents within time window k.
+** <font size="4"><math>tf(t, tw_k)'</math></font>: the number of occurrences of term t in '''fresh''' documents within time window k.
 * Term-independent (index-wide)
 ** <font size="4"><math>df(tw_k)</math></font>: the number of all documents  within time window k.
-** <font size="4"><math>df(tw_k)'</math></font>: the number of fresh documents  within time window k.
+** <font size="4"><math>df(tw_k)'</math></font>: the number of '''fresh''' documents  within time window k.
 ** <font size="4"><math>tf(tw_k)</math></font>: the number of all terms in all documents within time window k.
-** <font size="4"><math>tf(tw_k)'</math></font>: the number of all terms in fresh documents within time window k.
+** <font size="4"><math>tf(tw_k)'</math></font>: the number of all terms in '''fresh''' documents within time window k.
 By ''fresh'' documents, we mean the documents that were newly updated or appeared in the given time window. Thus, the fresh documents do not appear any previous time windows.

This page was last edited on 20 November 2009, at 22:34.