Webarc:Combining Statistics in Multiple Time Windows: Difference between revisions
From Adapt
Revision as of 22:34, 20 November 2009
We maintain one index per time window. Each index stores four term-dependent statistics and four term-independent (index-wide) statistics:
- Term-dependent
  - <math>df(t, tw_k)</math>: the number of all documents containing term t within time window k.
  - <math>df(t, tw_k)'</math>: the number of fresh documents containing term t within time window k.
  - <math>tf(t, tw_k)</math>: the number of occurrences of term t in all documents within time window k.
  - <math>tf(t, tw_k)'</math>: the number of occurrences of term t in fresh documents within time window k.
- Term-independent (index-wide)
  - <math>df(tw_k)</math>: the number of all documents within time window k.
  - <math>df(tw_k)'</math>: the number of fresh documents within time window k.
  - <math>tf(tw_k)</math>: the number of all terms in all documents within time window k.
  - <math>tf(tw_k)'</math>: the number of all terms in fresh documents within time window k.
By fresh documents, we mean the documents that newly appeared or were updated in the given time window. Thus, the fresh documents do not appear in any previous time window.
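As a rough illustration of what one per-window index holds, the sketch below computes the eight statistics from a toy list of tokenized documents. The names `WindowStats` and `build_window_stats`, and the `(tokens, is_fresh)` input shape, are assumptions made for this example only; they are not taken from the actual index implementation.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WindowStats:
    """Hypothetical container for the eight statistics of one time window."""
    df_t: Counter = field(default_factory=Counter)        # df(t, tw_k)
    df_t_fresh: Counter = field(default_factory=Counter)  # df(t, tw_k)'
    tf_t: Counter = field(default_factory=Counter)        # tf(t, tw_k)
    tf_t_fresh: Counter = field(default_factory=Counter)  # tf(t, tw_k)'
    df: int = 0        # df(tw_k)
    df_fresh: int = 0  # df(tw_k)'
    tf: int = 0        # tf(tw_k)
    tf_fresh: int = 0  # tf(tw_k)'


def build_window_stats(docs):
    """Compute the statistics for one window.

    docs: iterable of (tokens, is_fresh) pairs, where tokens is a list of
    terms and is_fresh marks documents new or updated in this window
    (an assumed input shape for this sketch).
    """
    stats = WindowStats()
    for tokens, is_fresh in docs:
        counts = Counter(tokens)
        # Statistics over all documents in the window.
        stats.df += 1
        stats.tf += len(tokens)
        stats.tf_t.update(counts)         # adds per-term occurrence counts
        stats.df_t.update(counts.keys())  # adds 1 per distinct term
        if is_fresh:
            # The primed statistics only see fresh documents.
            stats.df_fresh += 1
            stats.tf_fresh += len(tokens)
            stats.tf_t_fresh.update(counts)
            stats.df_t_fresh.update(counts.keys())
    return stats


# Toy window: one carried-over document and one fresh document.
example = build_window_stats([
    (["web", "archive", "web"], False),
    (["web", "index"], True),
])
```

Here `example.df_t["web"]` counts both documents, while `example.df_t_fresh["web"]` counts only the fresh one.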
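We combine statistics for time windows i through j (written i~j) by taking the full statistics of window i and adding only the fresh-document statistics of each later window, so that documents carried over unchanged from an earlier window are not counted twice: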
- <math>df(t, tw_{i\sim j}) = df(t, tw_i) + \sum_{k=i+1}^j df(t, tw_k)'</math>
- <math>tf(t, tw_{i\sim j}) = tf(t, tw_i) + \sum_{k=i+1}^j tf(t, tw_k)'</math>
- <math>df(tw_{i\sim j}) = df(tw_i) + \sum_{k=i+1}^j df(tw_k)'</math>
- <math>tf(tw_{i\sim j}) = tf(tw_i) + \sum_{k=i+1}^j tf(tw_k)'</math>
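All four formulas share one shape: the value over all documents in window i, plus the fresh-document values of windows i+1 through j. A minimal sketch of that combination follows. The function names, the list-per-window layout, and the toy numbers are assumptions for illustration, and windows are addressed by 0-based list position rather than by any real window identifier.

```python
from collections import Counter


def combine_index_wide(all_stats, fresh_stats, i, j):
    """Combine an index-wide statistic (df or tf) over windows i..j inclusive.

    all_stats[k]   : value over all documents in window k, e.g. df(tw_k)
    fresh_stats[k] : value over fresh documents in window k, e.g. df(tw_k)'
    """
    if not 0 <= i <= j < len(all_stats):
        raise ValueError("need 0 <= i <= j < number of windows")
    # Full count for the first window, fresh-only counts for the rest.
    return all_stats[i] + sum(fresh_stats[i + 1 : j + 1])


def combine_term(all_stats, fresh_stats, term, i, j):
    """Combine a term-dependent statistic over windows i..j inclusive.

    all_stats[k] and fresh_stats[k] are per-term Counters for window k,
    e.g. df(t, tw_k) and df(t, tw_k)'. Missing terms count as 0.
    """
    if not 0 <= i <= j < len(all_stats):
        raise ValueError("need 0 <= i <= j < number of windows")
    return all_stats[i][term] + sum(
        fresh_stats[k][term] for k in range(i + 1, j + 1)
    )


# Toy data for three windows (hypothetical numbers).
df_all = [2, 3, 3]    # df(tw_k)
df_fresh = [2, 1, 1]  # df(tw_k)'
df_t_all = [Counter(web=1), Counter(web=2), Counter(web=2)]  # df(t, tw_k)
df_t_fresh = [Counter(web=1), Counter(web=1), Counter()]     # df(t, tw_k)'

combined_df = combine_index_wide(df_all, df_fresh, 0, 2)
combined_df_web = combine_term(df_t_all, df_t_fresh, "web", 0, 2)
```

When i equals j the sum is empty, so the combined value reduces to the stored statistic of that single window.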