
Abstract

In an asynchronous data stream, the data items may be out of order with respect to their original timestamps. This paper studies the space complexity required by a data structure to maintain such a data stream so that it can approximate the set of frequent items over a sliding time window with sufficient accuracy. Prior to our work, the best solution was given by Cormode et al. [1], who gave an O((1/∊) log W log(∊B/log W) min{log W, 1/∊} log |U|)-space data structure that can approximate the frequent items within an ∊ error bound, where W and B are parameters of the sliding window, and U is the set of all possible item names. We give a more space-efficient data structure that only requires O((1/∊) log W log(∊B/log W) log log W) space.

1. Introduction

Identifying frequent items in a massive data stream has many applications in data mining and network monitoring, and the problem has been studied extensively [2-5]. Recent interest has shifted from the statistics of the whole data stream to those of a sliding window of recent data [6-9]. In most applications, the amount of data in a window is gigantic compared with the amount of memory available in the processing units. It is impossible to store all the data and then find the exact frequent items. Existing research has therefore focused on designing space-efficient data structures to support finding approximate frequent items. The key concern is how to minimize the space needed to achieve a required level of accuracy.

1.1. Asynchronous Data Stream

Most of the previous work on data streams assumes that items in a data stream are synchronous, in the sense that the order of their arrivals is the same as the order of their creations. This synchronous model is, however, not suitable for applications that are distributed in nature. For example, in a sensor network, the sink collects data transmitted from sensors over a large area, and the data transmitted from different sensors may suffer different delays. It is possible that an item created at time t at one sensor arrives at the sink later than an item created after t at another sensor. From the sink's viewpoint, items in the data stream are out of order with respect to their creation times. Yet the statistics to be computed are usually based on the creation times. More specifically, an asynchronous data stream (a.k.a. out-of-order data stream) [1,10,11] can be considered as a sequence (a1, t1), (a2, t2), (a3, t3), …, where ai is the name of a data item chosen from a fixed universe U, and ti is an integer timestamp recording the creation time of this item. Items arrive in arbitrary order with respect to their timestamps, and it is possible that more than one data item has the same timestamp.
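As an illustration (not part of the paper's data structures), the asynchronous model is easy to simulate: items arrive as (name, timestamp) pairs in arbitrary order, and window statistics are defined by the timestamps alone. The function name and the toy stream below are our own.

```python
from collections import Counter

def window_counts(stream, tau_cur, W):
    """Exact per-item counts over the sliding window [tau_cur - W + 1, tau_cur].

    `stream` is a list of (name, timestamp) pairs in arrival order, which may
    be out of order with respect to the timestamps (the asynchronous model).
    """
    lo = tau_cur - W + 1
    return Counter(a for (a, t) in stream if lo <= t <= tau_cur)

# Items arrive out of order: (b, 5) arrives before (a, 3).
stream = [("a", 1), ("b", 5), ("a", 3), ("c", 2), ("a", 6)]
counts = window_counts(stream, tau_cur=6, W=4)  # window is [3, 6]
```

Here only the items with timestamps in [3, 6] are counted, regardless of arrival order.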

1.2. Previous Work on Approximating Frequent Items

Consider a data stream and, in particular, those data items whose timestamps fall into the last W time units (W is the size of the sliding window). An item (or precisely, an item name) is said to be a frequent item if its count (i.e., the number of occurrences) exceeds a certain required threshold of the total item count. Arasu and Manku [6] were the first to study approximating frequent items over a sliding window under the synchronous model, in which data items arrive in non-decreasing order of timestamps. The space complexity of their data structure is O((1/∊)(log(1/∊))² log(∊B)), where ∊ is a user-specified error bound and B is the maximum number of items with timestamps falling into the same sliding window. Their work was later improved by Lee and Ting [7] to O((1/∊) log(∊B)) space. Recently, Cormode et al. [1] initiated the study of frequent items under the asynchronous model, and gave a solution with space complexity O((1/∊) log W log(∊B/log W) min{log W, 1/∊} log |U|), where U is the set of possible item names. Later, Cormode et al. [12] gave a hashing-based randomized solution using O((1/∊²) log |U|) space. Its space complexity is quadratic in 1/∊, which is less preferred, but it is a general solution that can also solve other problems such as finding the sum and quantiles.

The earlier work on asynchronous data streams focused on a simpler problem called ∊-approximate basic counting [10,11]. Cormode et al. [1] improved the space complexity of basic counting to O((1/∊) log W log(∊B/log W)). Notice that under the synchronous model, the best data structure requires O((1/∊) log(∊B)) space [9]. It is believed that there is roughly a gap of log W between the synchronous and the asynchronous models. Yet, for frequent items, the asynchronous result of Cormode et al. [1] has space complexity substantially larger than that of the best synchronous result, which is O((1/∊) log(∊B)) [7]. This motivates us to study more space-efficient solutions for approximating frequent items in the asynchronous model.

1.3. Formal Definition of Approximate Frequent Item Set

For any time interval I and any data item a, let fa(I) denote the frequency of item a in interval I, i.e., the number of arrived items named a with timestamps falling into I. Define f*(I) = Σa∈Ufa(I) to be the total number of all arrived items with timestamps within I.

Given a user-specified error bound ∊ and a window size W, we want to maintain a data structure to answer any ∊-approximate frequent item set query for any sub-window (specified at query time), which is in the form (ϕ, W′) where ϕ ∈ [∊, 1] is the required threshold and W′ ≤ W is the sub-window size. Suppose that τcur is the current time. The answer to such a query is a set S of item names satisfying the following two conditions:

(C1) For any item a with fa(I) ≥ ϕf*(I), where I = [τcur − W′ + 1, τcur], the item a is in S; i.e., S contains every item whose frequency in I is at least ϕf*(I).

(C2) For any item a in S, its frequency in interval I is at least (ϕ − ∊)f*(I), i.e., fa(I) ≥ (ϕ − ∊)f*(I).

The set S is also called an ∊-approximate ϕ-frequent item set. For example, assume ∊ = 1%; then the query (10%, 10000) would return all items whose frequencies in the last 10000 time units are each at least 10% of the total item count, plus possibly some other items with frequency at least 9% of the total count.
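On small inputs the two conditions can be checked by brute force. The sketch below is our own illustration (the helper names are ours): it computes the exact ϕ-frequent set over a sub-window and verifies that a candidate answer satisfies (C1) and (C2).

```python
from collections import Counter

def exact_frequent(stream, tau_cur, w, phi):
    """Exact phi-frequent item set over the sub-window [tau_cur - w + 1, tau_cur]."""
    lo = tau_cur - w + 1
    window = [a for (a, t) in stream if lo <= t <= tau_cur]
    counts = Counter(window)
    return {a for a, c in counts.items() if c >= phi * len(window)}

def is_valid_answer(S, stream, tau_cur, w, phi, eps):
    """Check (C1): S contains every phi-frequent item, and
       (C2): every item in S has frequency at least (phi - eps) * f*(I)."""
    lo = tau_cur - w + 1
    window = [a for (a, t) in stream if lo <= t <= tau_cur]
    counts = Counter(window)
    total = len(window)
    c1 = all(a in S for a, c in counts.items() if c >= phi * total)
    c2 = all(counts[x] >= (phi - eps) * total for x in S)
    return c1 and c2
```

The exact set always satisfies both conditions; the point of the paper is to produce a valid S using far less space than storing the window.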

1.4. Our Contribution

This paper gives a more space-efficient data structure for answering any ∊-approximate frequent item set query. Our data structure uses O((1/∊) log W log(∊B/log W) log log W) words, which is significantly smaller than the one given by Cormode et al. [1] (see Table 1). Furthermore, this space complexity is larger than that of the best synchronous solution by only a factor of O(log W log log W), which is close to the expected gap of O(log W). Similar to existing data structures for this problem, ours takes time linear in its size to answer an ∊-approximate frequent item set query. Furthermore, it takes O(log(∊B/log W) · (log(1/∊) + log log W)) time to update the data structure for a new data item. Occasionally, we might need to clean up some old data items that are no longer significant to the approximation; in the worst case, this takes time linear in the size of the data structure, and thus is no larger than the query time. As a remark, the solution of Cormode et al. [1] requires O(log(∊B/log W) log W log log |U|) time for an update.

In the asynchronous model, if a data item has a delay of more than W time units, it can be discarded immediately when it arrives. In many applications, the delay is usually small. This motivates us to extend the asynchronous model to consider data items that have a bounded delay. We say that an asynchronous data stream has tardiness dmax if a data item created at time t must arrive at the stream no later than time t + dmax. If we set dmax = 0, the model becomes the synchronous model. If we allow dmax ≥ W, this is in essence the asynchronous model studied above. We adapt our data structure to take advantage of small tardiness, such that when dmax is small, it uses less space (see Table 1) and supports faster update time, namely O(log(∊B/log dmax) · (log(1/∊) + log log dmax)). In particular, when dmax = Θ(1), the size and update time of our data structure match those of the best data structure for synchronous data streams.

Remark

This paper is a corrected version of a paper with the same title in WAOA 2009 [13]; in particular, the error bound on the estimates was given incorrectly before and is fixed in this version.

1.5. Technical Digest

To solve the frequent item set problem, we need to estimate the frequency of any item with error ∊f*(I), where I = [τcur − W + 1, τcur] is the interval covered by the sliding window. To this end, we first propose a simple data structure for estimating the frequency of a fixed item over the sliding window. Then, we adapt a technique of Misra and Gries [14] to extend our data structure to handle any item. The result is an O(f*(I)/λ)-space data structure that allows us to obtain an estimate for any item with an error bound of about λ log W. Here λ is a design parameter. To ensure λ log W is no greater than ∊f*(I), we should set λ ≤ ∊f*(I)/log W. Since f*(I) can be as small as Θ((1/∊) log W) (the case of smaller f*(I) can be handled by brute force), we need to be conservative and set λ to some constant. But then the size of the data structure can be Θ(B), because f*(I) can be as large as B. To reduce space, we introduce a multi-resolution approach. Instead of using one single data structure, we maintain a collection of O(log B) copies of our data structure, each using a distinct, carefully chosen parameter λ so that it can estimate the frequent item set with sufficient accuracy when f*(I) is in a particular range. The resulting data structure uses O((1/∊) log W log B) space.

Unfortunately, a careful analysis of our data structure reveals that in the worst case, it can only guarantee estimates with an error bound of ∊f*(H ∪ I), where H = [τcur − 2W + 1, τcur − W], not the required ∊f*(I). The reason is that the error of its estimates over I depends on the number of updates made during I, and unlike in a synchronous data stream, this number can be significantly larger than f*(I) in an asynchronous data stream. For example, at time τcur − W + 1, there may still be many new items (a, u) with timestamps u ∈ H, for which we must update our data structure to get good estimates when the sliding window is at earlier positions. Indeed, the number of updates during I can be as large as f*(H ∪ I), and this gives an error bound of ∊f*(H ∪ I).

To reduce the error bound to ∊f*(I), we introduce a novel algorithm that splits the data structure into independent smaller ones at appropriate times. For example, at time τcur − W + 1, we can split our data structure into two smaller ones, DH and DI, and we will only update DH for items (a, u) with u ∈ H and update DI for those with u ∈ I. Then, when we need to find an estimate on I at time τcur, we only need to consult DI, and the number of updates made to it is f*(I). In this paper, we develop sophisticated procedures to decide when and how to split the data structure so as to enable us to get good enough estimates as the sliding window moves continuously. The resulting data structure has size O((1/∊)(log W)² log(∊B/log W)). Then, we further make the data structure adaptive to the input size, allowing us to reduce the space to O((1/∊) log W log(∊B/log W) log log W).

2. Preliminaries

Our data structure for the frequent item set problem depends on data structures for the following two related data stream problems. Let 0 < ∊ < 1 be any real number, and let τcur be the current time.

The ∊-approximate basic counting problem asks for a data structure that allows us to obtain, for any interval I = [τcur − W′ + 1, τcur] where W′ ≤ W, an estimate f̂*(I) of f*(I) such that |f̂*(I) − f*(I)| ≤ ∊f*(I).

The ∊-approximate counting problem asks for a data structure that allows us to obtain, for any item a and any interval I = [τcur − W′ + 1, τcur] where W′ ≤ W, an estimate f̂a(I) of fa(I) such that |f̂a(I) − fa(I)| ≤ ∊f*(I).

As mentioned in Section 1, Cormode et al. [1] gave an O((1/∊) log W log(∊B/log W))-space data structure B∊ for solving the ∊-approximate basic counting problem. In this paper, we give an O((1/∊) log W log(∊B/log W) log log W)-space data structure D∊ for solving the harder ∊-approximate counting problem. The theorem below shows how to use these two data structures to answer any ∊-approximate frequent item set query.

Theorem 1

Let ∊0 = ∊/4. Given B∊0 and D∊0, we can answer any ∊-approximate frequent item set query. The total space required is O((1/∊) log W log(∊B/log W) log log W).

Proof

The space requirement is obvious. Consider any ∊-approximate frequent item set query (ϕ, W′) where ∊ ≤ ϕ ≤ 1 and W′ ≤ W. Let I = [τcur − W′ + 1, τcur]. Since ∊0 = ∊/4, the estimate given by B∊0 satisfies |f̂*(I) − f*(I)| ≤ (∊/4)f*(I), and for any item a, the estimate given by D∊0 satisfies |f̂a(I) − fa(I)| ≤ (∊/4)f*(I). To answer the query (ϕ, W′), we return the set

Sϕ = { a | f̂a(I) ≥ (ϕ − ∊/2) f̂*(I) }

which satisfies the required conditions (C1) and (C2) because:

for any item a with fa(I) ≥ ϕf*(I), we have f̂a(I) ≥ fa(I) − (∊/4)f*(I) ≥ (ϕ − ∊/4)f*(I) ≥ (ϕ − ∊/4)(1/(1 + ∊/4))f̂*(I) ≥ (ϕ − ∊/4)(1 − ∊/4)f̂*(I) ≥ (ϕ − ∊/2)f̂*(I), and hence a ∈ Sϕ; thus (C1) is satisfied, and

for every a ∈ Sϕ, we have fa(I) ≥ f̂a(I) − (∊/4)f*(I) ≥ (ϕ − ∊/2)f̂*(I) − (∊/4)f*(I) ≥ (ϕ − ∊/2)(1 − ∊/4)f*(I) − (∊/4)f*(I) ≥ (ϕ − ∊)f*(I); thus (C2) is satisfied.
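In code, the selection rule in the proof is a single thresholding pass. The sketch below is our own illustration, with the two estimators passed in as black boxes; it returns Sϕ exactly as defined above.

```python
def frequent_item_set(est_total, est_count, universe, phi, eps):
    """Return S_phi = { a : est_count(a) >= (phi - eps/2) * est_total }, where
    est_total approximates f*(I) and est_count(a) approximates fa(I)."""
    return {a for a in universe if est_count(a) >= (phi - eps / 2.0) * est_total}
```

With exact counts as estimators, Sϕ is just the set of ϕ-frequent items; with (∊/4)-accurate estimators, the proof above shows the output satisfies (C1) and (C2).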

The building block of D∊ is a data structure that counts items over some fixed interval (instead of the sliding window). For any interval I = [ℓI, rI] of size W, Theorem 4 in Section 4 gives a data structure DI,∊ that uses O((1/∊) log W log(∊B/log W) log log W) space, supports O(log(∊B/log W) · (log(1/∊) + log log W)) update time, and enables us to obtain, for any item a and any time t ∈ I, an estimate f̂a([t, rI]) of fa([t, rI]) such that

|f̂a([t, rI]) − fa([t, rI])| ≤ ∊f*([t, rI])
(3)

Partition the time axis into consecutive size-W intervals I1, I2, …. Observe that any item that arrives at or before the current time τcur must have timestamp no greater than τcur; hence fa([iW + 1, (i + 1)W]) = fa([iW + 1, τcur]) and f*([iW + 1, (i + 1)W]) = f*([iW + 1, τcur]), and Equation (3) is equivalent to

|f̂a([iW + 1, τcur]) − fa([iW + 1, τcur])| ≤ ∊f*([iW + 1, τcur])
(4)

Our data structure D∊ is just the collection of DI1,∊, DI2,∊, …. Note that we only need to physically store in D∊ the two data structures DIi,∊ and DIi+1,∊ with [τcur − W + 1, τcur] ⊆ Ii ∪ Ii+1. The intervals of the earlier ones are no longer covered by the sliding window, and the corresponding DI,∊'s can be thrown away. Together with Theorem 4, we have the following theorem.

Theorem 2

The data structure D∊ solves the ∊-approximate counting problem. The space usage is O((1/∊) log W log(∊B/log W) log log W), and it supports O(log(∊B/log W) · (log(1/∊) + log log W)) update time.

3. A Simple Data Structure For Frequency Estimation

Let I = [ℓI, rI] be any interval of size W. To simplify notation, we assume that W is a power of 2 so that log W is an integer and we can avoid floor and ceiling functions. In this section, we describe a simple data structure CI,λ,κ that enables us to obtain, for any item a, a good estimate of a's frequency over I. The parameters λ and κ determine its accuracy and space usage. However, its accuracy is not enough for answering an ∊-approximate frequent item set query. We will explain how to improve the accuracy in the next section.

Roughly speaking, CI,λ,κ is a set of queues QI,λa, one for each item a, i.e., CI,λ,κ = {QI,λa | a ∈ U}. For an item a, the queue QI,λa keeps track of the occurrences of a in I. Each node N in QI,λa is associated with an interval i(N), a value v(N), and a debit d(N); v(N) counts the number of arrived items (a, u) with u ∈ i(N), and d(N) is for implementing a space reduction technique. Initially, QI,λa has only one node N with i(N) = I and v(N) = d(N) = 0. In general, QI,λa is a queue 〈N1, N2, …, Nk〉 of nodes whose intervals form a partition of I, i.e., the i(Nj)'s are disjoint, their union is I, and i(Nj) is to the left of i(Nj+1) for 1 ≤ j < k. When an item (a, u) with u ∈ I arrives, QI,λa is updated as follows.

QI,λa.Update((a, u))
1: let N be the node with u ∈ i(N), and let J = [p, q] = i(N);
2: v(N) = v(N) + 1;
3: if (|J| > 1 and λ units have been added to v(N) since J was assigned to i(N)) then
4:   /* refine J */
5:   create a new node N′ and insert it to the left of N;
6:   let i(N′) = [p, m], i(N) = [m + 1, q] where m = ⌊(p + q)/2⌋;
7:   let v(N′) = 0 and d(N′) = 0;
8:   /* we make no change to v(N) and d(N) */
9: end if

Figure 1 gives an example of how QI,λa is updated using this procedure.
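The update procedure can be prototyped in a few lines. The sketch below is our own (not the paper's implementation): it maintains one queue for a fixed item, refines a node's interval after every λ additions, and computes the estimate f̂a([t, rI]) as the value sum of nodes covering or to the right of t.

```python
class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi   # i(N) = [lo, hi]
        self.v = 0                  # value v(N)
        self.d = 0                  # debit d(N), unused in this sketch
        self.added = 0              # units added since i(N) was last assigned

class ItemQueue:
    """A single queue Q (for one fixed item) over the interval [lo, hi]."""
    def __init__(self, lo, hi, lam):
        self.lam = lam
        self.nodes = [Node(lo, hi)]   # node intervals always partition [lo, hi]

    def update(self, u):
        # Find the node N whose interval covers the timestamp u and count the item.
        k = next(i for i, n in enumerate(self.nodes) if n.lo <= u <= n.hi)
        n = self.nodes[k]
        n.v += 1
        n.added += 1
        # Refine: after lam units have been added to v(N) since i(N) was assigned,
        # split off the left half into a fresh empty node; N keeps the right half
        # and its value (the procedure makes no change to v(N)).
        if n.hi > n.lo and n.added >= self.lam:
            m = (n.lo + n.hi) // 2
            self.nodes.insert(k, Node(n.lo, m))
            n.lo, n.added = m + 1, 0

    def estimate(self, t):
        # Estimate of the count in [t, hi]: value sum of nodes whose interval
        # covers or is to the right of t.
        return sum(n.v for n in self.nodes if n.hi >= t)
```

Note that refinement deliberately leaves v(N) on the right-half node, which is exactly why the estimate can overshoot by up to λ per interval on the refinement path (the λ log W term analyzed later).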

Obviously, a direct implementation of CI,λ,κ uses too much space. We now extend a technique of Misra and Gries [14] to reduce the space requirement. We say that a queue QI,λa is trivial if it contains only a single node N with (i) i(N) = I, and (ii) v(N) = d(N) = 0. Every queue in CI,λ,κ is trivial initially. The key to reducing the space complexity of CI,λ,κ is to maintain the following invariant throughout the execution:

(*) There are at most κ non-trivial queues in CI,λ,κ.

We call κ the capacity of CI,λ,κ. The invariant helps us save space because we do not need to store trivial queues physically in memory. To maintain (*), each queue QI,λa supports the following procedure, which is called only when v(QI,λa), the total value of the nodes in QI,λa, is strictly greater than d(QI,λa), the total debit of the nodes in QI,λa.

QI,λa.Debit( )
1: if (v(QI,λa) ≤ d(QI,λa)) then
2:   return error;
3: else
4:   find an arbitrary node N of QI,λa with v(N) > d(N);
5:   /* such a node must exist because v(QI,λa) > d(QI,λa) */
6:   d(N) = d(N) + 1;
7: end if

Note from the implementation of Debit( ) that v(QI,λa) is always no smaller than d(QI,λa), and for each node N of QI,λa, v(N) ≥ d(N). Furthermore, if v(QI,λa) = d(QI,λa), then v(N) = d(N) for every node N in QI,λa. To maintain (*), CI,λ,κ processes a newly arrived item (a, u) with u ∈ I as follows.

CI,λ,κ.Process((a, u))
1: update QI,λa by calling QI,λa.Update((a, u));
2: if (after the update the number of non-trivial queues becomes κ) then
3:   for each non-trivial queue QI,λx with v(QI,λx) > d(QI,λx) do QI,λx.Debit( );
4:   for each non-trivial queue QI,λx with v(QI,λx) = d(QI,λx) do
5:     delete all nodes of QI,λx and make it a trivial queue;
6:   /* Note that each deleted node N satisfies v(N) = d(N). */
7: end if

It is easy to see that Invariant (*) always holds: initially the number m of non-trivial queues is zero, and m increases only when Process((a, u)) is called on some trivial QI,λa; in that case v(QI,λa) becomes 1 and d(QI,λa) remains 0. If m becomes κ after this increase, we will debit, among other queues, QI,λa, and its d(QI,λa) becomes 1 too. It follows that v(QI,λa) = d(QI,λa), and Lines 4–5 will make QI,λa trivial, so m becomes less than κ again.
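Stripped of the per-node intervals, the Debit( )/Process( ) interplay is the classic Misra–Gries counter decrement. The sketch below is our own simplification (one counter per item instead of a queue of nodes): when the number of stored counters reaches κ, every counter with v > d is debited and fully debited counters are discarded, so at most κ counters are ever stored.

```python
from collections import defaultdict

def process_stream(items, kappa):
    """Misra-Gries-style space bounding: v[x] counts item x, d[x] is its debit.
    When the number of stored (non-trivial) counters reaches kappa, every
    counter is debited once and counters with v == d are deleted (trivialized)."""
    v = defaultdict(int)
    d = defaultdict(int)
    for a in items:
        v[a] += 1
        if len(v) == kappa:
            for x in list(v):
                if v[x] > d[x]:
                    d[x] += 1          # Debit( )
            for x in list(v):
                if v[x] == d[x]:       # fully debited: make it trivial
                    del v[x], d[x]
    return dict(v), dict(d)
```

Since every debit to one counter is accompanied by debits to the other stored counters, an item's stored value v[x] underestimates its true count by at most (total items)/κ, mirroring the f*(I)/κ term in Lemma 3.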

We are now ready to define CI,λ,κ's estimate f̂a([t, rI]) of fa([t, rI]) and analyze its accuracy. We need some definitions. For any interval J = [p, q] and any t ∈ I, we say that J covers t if t ∈ [p, q], is to the right of t if t < p, and is to the left of t otherwise. For any item a and any t ∈ I = [ℓI, rI], CI,λ,κ's estimate of fa([t, rI]) is

f̂a([t, rI]) = the value sum of the nodes N currently in QI,λa whose i(N) covers or is to the right of t.

For example, in Figure 1, after the update of the last item (a, 1), we can obtain the estimate f̂a([2, 8]) = 0 + 4 + 5 = 9.

Given any node N of QI,λa, we say that N is monitoring a over J, or simply N is monitoring J, if i(N) = J. Note that a node may monitor different intervals during different periods of the execution, and the sizes of these intervals are monotonically decreasing. Observe that although there are about W²/2 possible sub-intervals of the size-W interval I, only about 2W of them can be monitored by some node: there is only one such interval of size W, namely I = [ℓI, rI], which gives birth to two such intervals of size W/2, namely [ℓI, m] and [m + 1, rI] where m = ⌊(ℓI + rI)/2⌋, and so on. We call these O(W) intervals interesting intervals. For any two interesting intervals J and H such that J ⊂ H, we say that J is a descendant of H, and H is an ancestor of J. Figure 2 shows all the interesting intervals for I = [1, 8], as well as their ancestor–descendant relationship. The following important fact is easy to verify by induction.

Fact 1

Any two interesting intervals J and H do not cross, although one can contain another, i.e., either J ⊂ H, or H ⊂ J, or J ∩ H = ∅. Furthermore, any interesting interval has at most logW ancestors.
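The interesting intervals are exactly the dyadic sub-intervals of I, which makes Fact 1 easy to check mechanically; the enumeration below is our own illustration for W = 8.

```python
def interesting_intervals(lo, hi):
    """All interesting (dyadic) sub-intervals of [lo, hi]; |[lo, hi]| a power of 2."""
    out = [(lo, hi)]
    if hi > lo:
        m = (lo + hi) // 2
        out += interesting_intervals(lo, m) + interesting_intervals(m + 1, hi)
    return out

ivals = interesting_intervals(1, 8)
# Ancestors of [5, 5]: the interesting intervals strictly containing it.
ancestors_of_5 = [J for J in ivals if J[0] <= 5 <= J[1] and J != (5, 5)]
```

For I = [1, 8] there are 2W − 1 = 15 interesting intervals, and [5, 5] has exactly log W = 3 ancestors: [1, 8], [5, 8] and [5, 6].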

For any node N, let int(N) denote the set of intervals that have been monitored by N so far. The following fact can be verified from the update procedure.

Fact 2

Consider a node N in QI,λa, where i(N) = J.

If J covers or is to the right of t, then all intervals in int(N) cover or are to the right of t.

If J is to the left of t, then all intervals in int(N) are to the left of t.

We say that N covers or is to the right of t if the intervals in int(N) cover or are to the right of t; otherwise, N is to the left of t. For any queue QI,λa, let alive(QI,λa) be the set of nodes currently in QI,λa, let dead(QI,λa) be the set of nodes of QI,λa that have already been deleted (because of Line 5 of the procedure Process( )), and let node(QI,λa) = alive(QI,λa) ∪ dead(QI,λa). Note that the estimate f̂a([t, rI]) is the value sum of the nodes in alive(QI,λa) that cover or are to the right of t. For the analysis, we need to express it more succinctly. Let

alive(CI,λ,κ) = ∪ { alive(QI,λa) | QI,λa ∈ CI,λ,κ }

be the set of nodes currently in CI,λ,κ. Define dead(CI,λ,κ) and node(CI,λ,κ) similarly. For any item a and any subset X ⊆ node(CI,λ,κ), let Xa be the set of nodes in X that are monitoring a (and thus are nodes from QI,λa). For any t ∈ I, let X≥t denote the set of nodes in X that cover or are to the right of t. Define v(X) = ΣN∈X v(N) and d(X) = ΣN∈X d(N). Then, f̂a([t, rI]) can be expressed as follows:

f̂a([t, rI]) = v(alive(QI,λa)≥t) = v(alive(CI,λ,κ)a≥t)

The following lemma analyzes the accuracy of this estimate, as well as the size of CI,λ,κ.

Lemma 3

For any item a and any t ∈ I, the estimate satisfies fa([t, rI]) − f*(I)/κ ≤ f̂a([t, rI]) ≤ fa([t, rI]) + λ log W. Furthermore, CI,λ,κ uses O(f*(I)/λ + κ) space.
Proof

Recall that f̂a([t, rI]) = v(alive(QI,λa)≥t). Consider any node N ∈ alive(QI,λa)≥t. Note that v(N) = ΣJ∈int(N) vadd(N, J), where vadd(N, J) is the value added to v(N) during the period when i(N) = J. By Fact 2, we can divide it as v(N) = Σ{vadd(N, J) | J covers t} + Σ{vadd(N, J) | J is to the right of t}. It follows that

v(alive(QI,λa)≥t) = ΣN∈alive(QI,λa)≥t Σ{vadd(N, J) | J covers t} + ΣN∈alive(QI,λa)≥t Σ{vadd(N, J) | J is to the right of t}

(5)

Note that ΣN∈alive(QI,λa)≥t Σ{vadd(N, J) | J is to the right of t} ≤ fa([t, rI]), because if an arrived item (a, u) causes an increase of vadd(N, J) for some J that is to the right of t, then u must be in [t, rI]. By Equation (5), to show the second inequality of the lemma, it suffices to show that

So = ΣN∈alive(QI,λa)≥t Σ{vadd(N, J) | J covers t} = vadd(N1, J1) + vadd(N2, J2) + ⋯ + vadd(Nk, Jk)

is no greater than λ log W, as follows.

Without loss of generality, suppose |J1| ≥ |J2| ≥ ⋯ ≥ |Jk|. It can be verified that once an interval J is assigned to a node, it will not be assigned to any other node; thus the Ji's are distinct. Furthermore, note that for 1 ≤ i < k, Jk ⊂ Ji, because (i) t is in both Ji and Jk; (ii) Jk is the smallest interval; and (iii) interesting intervals do not cross; thus Jk is a descendant of every other Ji, and together with Fact 1, k ≤ log W. By Line 3 of the procedure Update( ), vadd(Ni, Ji) ≤ λ for 1 ≤ i ≤ k. It follows that So ≤ λ log W.

For the first inequality of the lemma, it is clearer to use f̂a([t, rI]) = v(alive(CI,λ,κ)a≥t). Note that every arrived item (a, u) with u ∈ [t, rI] increments the value of some node in node(CI,λ,κ)a≥t; thus

fa([t, rI]) ≤ v(node(CI,λ,κ)a≥t) = v(alive(CI,λ,κ)a≥t) + v(dead(CI,λ,κ)a≥t)

From Lines 4–5 of the procedure Process( ), when we delete a node N, v(N) = d(N). Hence, v(dead(CI,λ,κ)a≥t) = d(dead(CI,λ,κ)a≥t), which is equal to the total number of debit operations made to these dead nodes. Since whenever we make a debit operation to QI,λa, we also make a debit operation to κ − 1 other queues,

κ · d(dead(CI,λ,κ)a≥t) ≤ d(node(CI,λ,κ)) ≤ v(node(CI,λ,κ)) = f*(I)

(6)

In summary, we have fa([t, rI]) − f̂a([t, rI]) = fa([t, rI]) − v(alive(CI,λ,κ)a≥t) ≤ v(dead(CI,λ,κ)a≥t) = d(dead(CI,λ,κ)a≥t) ≤ f*(I)/κ, and the first inequality of the lemma follows.

For the space bound, we say that a node is born-rich if it is created because of Line 5 of the procedure Update( ) (and thus has λ items under its belt); otherwise it is born-poor. Obviously, there are at most f*(I)/λ born-rich nodes. For born-poor nodes, we need to store at most κ of them, because every queue has exactly one born-poor node (the rightmost one), and we only need to store at most κ non-trivial queues; the space bound follows.

If we set λ = λi = ∊2^i/log W and κ = 1/∊, then Lemma 3 asserts that CI,λi,1/∊ is an O(f*(I) log W/(∊2^i) + 1/∊)-space data structure that enables us to obtain, for any item a ∈ U and any timestamp t ∈ I, an estimate f̂a([t, rI]) that satisfies

fa([t, rI]) − ∊f*(I) ≤ f̂a([t, rI]) ≤ fa([t, rI]) + ∊2^i

If f*(I) does not vary too much, we can determine the i such that f*(I) ≈ 2^i, and CI,λi,1/∊ is an O((1/∊) log W)-space data structure that guarantees an error bound of O(∊f*(I)). However, this approach has two obvious shortcomings:

f*(I) may vary from some small value to a value as large as B, the maximum number of items falling in a window of size W; hence, there may not be any fixed i that always satisfies f*(I) ≈ 2^i.

To estimate fa([t, rI]), we need an error bound of ∊f*([t, rI]), not ∊f*(I).

We will explain how to overcome these two shortcomings in the next section.

4. Our Data Structure for ∊-approximate Counting

The first shortcoming of the approach given in Section 3 is easy to overcome: a natural idea is to maintain CI,λi,1/∊ for different λi to handle the different possible values of f*(I). The second shortcoming is more fundamental. To overcome it, we need to modify CI,λ,κ substantially. The result is a new and more complicated data structure DI,∊Y, where Y is an integer determining the accuracy. As asserted in Theorem 7 below, this data structure uses O((1/∊) log W log log W) space, supports O(log(1/∊) + log log W) update time, and for any t ∈ I, it offers the following special guarantee:

When f*([t, rI]) ≤ Y, DI,∊Y can return, for any item a, an estimate f̂a([t, rI]) of fa([t, rI]) such that |f̂a([t, rI]) − fa([t, rI])| ≤ ∊Y.

When f*([t, rI]) > Y, DI,∊Y does not guarantee any error bound on its estimate f̂a([t, rI]).

Before giving the details of DI,∊Y, let us explain how to use it to build the data structure DI,∊ mentioned in Section 2 for the ∊-approximate counting problem. To build DI,∊, we need another O((1/∊) log W log(∊B/log W))-space data structure BI,∊, which is a simple adaptation of the data structure B∊ of Cormode et al. [1] for the ∊-approximate basic counting problem; BI,∊ enables us to find, for any t ∈ I, an estimate f̂*([t, rI]) of f*([t, rI]) such that

f*([t, rI]) ≤ f̂*([t, rI]) ≤ (1 + ∊)f*([t, rI])

(7)

BI,∊ is implemented as follows. During execution, we maintain the data structure B∊/4 of Cormode et al. to count the items in the sliding window. When τcur = rI, we duplicate B∊/4 and get B′. Then, B′ is updated as if τcur were fixed at rI. To get the estimate f̂*([t, rI]), we first obtain an estimate f′ of f*([t, rI]) from B′, which satisfies |f′ − f*([t, rI])| ≤ (∊/4)f*([t, rI]). Then, f̂*([t, rI]) = (1/(1 − ∊/4))f′. It can be verified that f̂*([t, rI]) satisfies Equation (7). Our data structure DI,∊ is composed of (i) BI,∊, and (ii) the data structure DI,∊/4 with parameter Y = 2^i for each integer i from log((1/∊) log W) + 1 to log B. It also maintains a brute-force O((1/∊) log W)-space data structure that remembers the (1/∊) log W items (a, u) with the largest u ∈ I; this brute-force data structure is used for finding f̂a([t, rI]) only when f*([t, rI]) ≤ (1/∊) log W.
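The composition rule can be made concrete: given the basic-counting estimate, either fall back to the brute-force store or route the query to the copy with Y = 2^i. The function below is our own sketch of that dispatch.

```python
import math

def pick_level(est_total, eps, W):
    """Return None if the brute-force store should answer (est_total small),
    otherwise the i with 2**(i-1) < est_total <= 2**i, i.e. the copy with Y = 2**i."""
    if est_total <= (1.0 / eps) * math.log2(W):
        return None                      # f*([t, rI]) is small: answer exactly
    return math.ceil(math.log2(est_total))
```

Since f̂*([t, rI]) ≤ 2^i ≤ 2(1 + ∊)f*([t, rI]), the chosen copy's additive error (∊/4)2^i translates into the required relative bound ∊f*([t, rI]).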

Theorem 4

Given DI,∊: (i) the space usage is O((1/∊) log W log(∊B/log W) log log W) and the update time is O(log(∊B/log W) · (log(1/∊) + log log W)); and (ii) we can find, for any a ∈ U and t ∈ I, an estimate f̂a([t, rI]) of fa([t, rI]) such that |f̂a([t, rI]) − fa([t, rI])| ≤ ∊f*([t, rI]).

Proof

Statement (i) is straightforward, because there are log B − log((1/∊) log W) different DI,∊Y's, each of size O((1/∊) log W log log W) and taking O(log(1/∊) + log log W) time per update. For Statement (ii), we describe how to get the estimate and analyze its accuracy.

First, we use BI,∊ to get the estimate f̂*([t, rI]). If f̂*([t, rI]) ≤ (1/∊) log W, then f*([t, rI]) ≤ f̂*([t, rI]) ≤ (1/∊) log W, and we can use the brute-force data structure to find fa([t, rI]) exactly. Otherwise, we determine the i with 2^(i−1) < f̂*([t, rI]) ≤ 2^i. Note that

i ≥ log((1/∊) log W) + 1, and hence we have the data structure DI,∊/4 with Y = 2^i, and

f*([t, rI]) ≤ f̂*([t, rI]) ≤ 2^i.

We use DI,∊/4 with Y = 2^i to obtain an estimate f̂a([t, rI]) with |f̂a([t, rI]) − fa([t, rI])| ≤ (∊/4)2^i. By Equation (7), 2^(i−1) < f̂*([t, rI]) ≤ (1 + ∊)f*([t, rI]). Combining the two inequalities, we have |f̂a([t, rI]) − fa([t, rI])| ≤ (∊/4)2^i < (∊/2)(1 + ∊)f*([t, rI]) ≤ ∊f*([t, rI]).

We now describe the construction of DI,∊Y. First, we describe an O((1/∊)(log W)²)-space version of the data structure. Then, we show in the next section how to reduce the space to O((1/∊) log W log log W). In our discussion, we fix λ = ∊Y/log W and κ = (4/∊) log W.

Initially, DI,∊Y is just the data structure CI,λ,κ. By Lemma 3, its size is O(f*(I)/λ + κ) = O(f*(I) log W/(∊Y) + (1/∊) log W), which is O((1/∊) log W) when f*(I) ≤ Y. However, it is much larger than (1/∊) log W when f*(I) ≫ Y, and to maintain small space usage in that case, we trim CI,λ,κ by throwing away a significant number of nodes. This is acceptable because CI,λ,κ only needs to guarantee good estimates for those t ∈ I with f*([t, rI]) ≤ Y. The trimming process is rather tricky. The natural idea of throwing away all the nodes to the left of t when we find f*([t, rI]) > Y does not work, because the resulting data structure may return estimates with error larger than the required ∊Y bound. For example, let I = [1, W]. For each item ai ∈ {a1, a2, …, aκ−1}, m = Y/κ copies of (ai, t + 1) arrive at time W + t for every t ∈ [0, W − 1]. Also, m copies of (a, W) arrive at time W + t for every t ∈ [0, W − 1]. Hence, at each time W + t, there arrive mκ = Y items with timestamps in [t, W], namely m items for each of the κ item names in {a, a1, …, aκ−1}. We are interested in the accuracy of the estimate f̂a([W, W]). It can be verified that at each time W + t, Lines 4–5 of the procedure Process( ) will eventually trivialize QI,λa, and thus f̂a([W, W]) = 0. Since fa([W, W]) = (t + 1)m, we have |f̂a([W, W]) − fa([W, W])| = (t + 1)m. When t = 2∊Y/m − 1, the absolute error is 2∊Y, which is larger than the required error bound ∊Y.

To describe the right trimming procedure, we need some basic operations. Consider any CJ,λ,κ where J = [p, q]. The following operation splits CJ,λ,κ into two smaller data structures CJℓ,λ,κ and CJr,λ,κ, where Jℓ = [p, m] and Jr = [m + 1, q] with m = ⌊(p + q)/2⌋.

DI,∊Y.Split(CJ,λ,κ)
1: for each non-trivial queue QJ,λa ∈ CJ,λ,κ do
2:   if (QJ,λa has only one node N monitoring the whole interval J) then
3:     /* refine J */
4:     insert a new node N′ immediately to the left of N with v(N′) = d(N′) = 0;
5:     i(N′) = Jℓ, and i(N) = Jr;
6:   end if
7:   divide QJ,λa into two sub-queues QJℓ,λa and QJr,λa, where
8:     QJℓ,λa contains the nodes monitoring sub-intervals of Jℓ, and
9:     QJr,λa contains those monitoring sub-intervals of Jr;
10:  put QJℓ,λa in CJℓ,λ,κ and QJr,λa in CJr,λ,κ;
11: end for
12: /* For a trivial QJ,λa, its two children in CJℓ,λ,κ and CJr,λ,κ are also trivial. */

We say that CJℓ,λ,κ and CJr,λ,κ are the left and right children of CJ,λ,κ, respectively. Figure 3 gives an example of Split(C[1,8],λ,κ), the split of C[1,8],λ,κ, which has three non-trivial queues Q[1,8],λa, Q[1,8],λb and Q[1,8],λc, into C[1,4],λ,κ and C[5,8],λ,κ. Note that the queues for b and c in C[1,4],λ,κ are trivial, and we do not store them.

Using Split( ), we can trim, for example, C[p,p+1],λ,κ into C[p+1,p+1],λ,κ as follows: split C[p,p+1],λ,κ into C[p,p],λ,κ and C[p+1,p+1],λ,κ, and throw away C[p,p],λ,κ. The following recursive procedure LeftRefine( ) generalizes this idea to larger J: given CJ,λ,κ = C[p,q],λ,κ, it returns a list 〈CJ0,λ,κ, CJ1,λ,κ, …, CJm,λ,κ〉 where the Ji's form a partition of [p, q] and J0 = [p, p]. Throwing away CJ0,λ,κ, the remaining CJi,λ,κ's all together monitor [p + 1, q].

DI,∊Y.LeftRefine(C[p,q],λ,κ)
1: if (|[p, q]| = 1) then
2:   return 〈C[p,p],λ,κ〉;
3: else
4:   split C[p,q],λ,κ into its left child C[p,m],λ,κ and right child C[m+1,q],λ,κ;
5:   /* where m = ⌊(p + q)/2⌋ */
6:   L = LeftRefine(C[p,m],λ,κ);
7:   suppose L = 〈CJ0,λ,κ, CJ1,λ,κ, …, CJk,λ,κ〉;
8:   return 〈CJ0,λ,κ, …, CJk,λ,κ, C[m+1,q],λ,κ〉;
9: end if

For example, LeftRefine(C[1,8],λ,κ) gives us the list 〈C[1,1],λ,κ, C[2,2],λ,κ, C[3,4],λ,κ, C[5,8],λ,κ〉. Note that J0 = [p, p] because the recursion stops only when |[p, q]| = 1. The list returned by LeftRefine(C[p,q],λ,κ) has another useful property, which we describe below.

Given L = 〈CZ1,λ,κ, …, CZk,λ,κ〉, we say that L is an interesting-partition covering the interval J if (i) the Zi's are all interesting intervals and form a partition of J; and (ii) for 1 ≤ i < k, Zi is to the left of Zi+1, and |Zi| ≤ ½|Zi+1|. The fact below can be verified by induction on the length of the list returned by LeftRefine( ).

Fact 3

Let J = [p, q] be an interesting interval, and L = 〈CJ0,λ,κ, …, CJm,λ,κ〉 be the list returned by LeftRefine(CJ,λ,κ). Then, the list 〈CJ1,λ,κ, …, CJm,λ,κ〉 (i.e., the list obtained by throwing away the head CJ0,λ,κ of L) is an interesting-partition covering [p + 1, q].

For example, if [1, 8] is an interesting interval, then the list 〈C[2,2],λ,κ, C[3,4],λ,κ, C[5,8],λ,κ〉 obtained by throwing away the first element C[1,1],λ,κ from LeftRefine(C[1,8],λ,κ) is an interesting-partition covering [2, 8].
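The recursion is easy to check on plain intervals. The sketch below (illustrative only; it ignores the attached queues) computes the interval list produced by LeftRefine and verifies the interesting-partition property of Fact 3:

```python
def left_refine(p, q):
    """Intervals produced by LeftRefine on [p, q]: the head [p, p] followed
    by a partition of [p + 1, q] in which each interval has at most half
    the size of the next one."""
    if p == q:
        return [(p, p)]
    m = (p + q) // 2
    return left_refine(p, m) + [(m + 1, q)]

# LeftRefine([1, 8]) -> [1,1], [2,2], [3,4], [5,8]
parts = left_refine(1, 8)

# Fact 3: after dropping the head, |Z_i| <= |Z_{i+1}| / 2 for consecutive pairs.
sizes = [hi - lo + 1 for lo, hi in parts[1:]]
assert all(2 * sizes[i] <= sizes[i + 1] for i in range(len(sizes) - 1))
```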

We now give details of DI,∊Y. Initially, it is the interesting-partition 〈CI,λ,κ〉 covering the whole interval I = [ℓI, rI]. Throughout the execution, we maintain the following invariant:

(**)
DI,∊Y is an interesting-partition covering some [p, rI] ⊆ I.

When DI,∊Y = 〈CJ1,λ,κ, …, CJm,λ,κ〉 is covering [p, rI], it only guarantees good estimates of fa([t, rI]) for t ∈ [p, rI], and this estimate is obtained by

f̂a([t, rI]) = v(alive(CJh,λ,κ)≥ta) + Σh+1≤i≤m v(alive(CJi,λ,κ)a)

(or equivalently, f̂a([t, rI]) = v(alive(QJh,λa)≥t) + Σh+1≤i≤m v(alive(QJi,λa))), where Jh is the interval in {J1, J2, …, Jm} that covers t. When an item (a, u) with u ∈ [p, rI] arrives, we find the unique CJi,λ,κ in DI,∊Y where u ∈ Ji, and update it by calling CJi,λ,κ.Process((a, u)). Note that this update has no effect on the other CJ,λ,κ's in DI,∊Y.

During execution, we also keep track of the largest timestamp pmax ∈ I such that the estimate f̂*([pmax, rI]) given by I,∊ is greater than (1 + ∊)Y (which implies f*([pmax, rI]) > Y because of Equation (7)). As soon as pmax falls in the interval covered by DI,∊Y, we use the procedure Trim( ) to trim DI,∊Y so that it covers the smaller interval [pmax + 1, rI].

For example, Figure 4 shows that when DI,∊Y = 〈C[2,2],λ,κ, C[3,4],λ,κ, C[5,8],λ,κ〉, Trim(DI,∊Y, 3) returns 〈C[4,4],λ,κ, C[5,8],λ,κ〉. Based on Fact 3, it can be verified inductively that after DI,∊Y ← Trim(DI,∊Y, pmax), the new DI,∊Y is an interesting-partition covering [pmax + 1, rI]; Invariant (**) is preserved. In the rest of this section, we analyze the size of DI,∊Y and the accuracy of its estimates.
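Trim itself is easy to sketch on plain intervals (again illustrative and queue-free): intervals that are entirely expired are discarded outright, and the one straddling pmax is repeatedly refined with the LeftRefine recursion until the boundary is exact:

```python
def left_refine(p, q):
    """Interval lists produced by the LeftRefine recursion on [p, q]."""
    if p == q:
        return [(p, p)]
    m = (p + q) // 2
    return left_refine(p, m) + [(m + 1, q)]

def trim(partition, pmax):
    """Trim a left-to-right list of (lo, hi) intervals so that the result
    covers [pmax + 1, r_I]."""
    out = []
    for lo, hi in partition:
        out.extend(_trim_one(lo, hi, pmax))
    return out

def _trim_one(lo, hi, pmax):
    if hi <= pmax:
        return []                     # entirely expired: throw away
    if lo > pmax:
        return [(lo, hi)]             # entirely alive: keep as is
    pieces = []                       # pmax falls inside [lo, hi]: refine
    for a, b in left_refine(lo, hi):
        pieces.extend(_trim_one(a, b, pmax))
    return pieces
```

On the example above, trim([(2, 2), (3, 4), (5, 8)], 3) yields [(4, 4), (5, 8)].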

Let All be the set of all CJ,λ,κ's that ever exist, i.e., if CJ,λ,κ ∈ All, then either (i) it is currently in DI,∊Y, or (ii) it has been in DI,∊Y some time earlier in the execution, but was thrown away during some trimming of DI,∊Y. For any p ∈ I, define

All≥p = {CJ,λ,κ ∣ CJ,λ,κ ∈ All, and J covers or is to the right of p}.

Let vadd(CJ,λ,κ) be the total value added to the nodes of CJ,λ,κ during its lifespan. We now derive an upper bound on Σ{vadd(CJ,λ,κ) ∣ CJ,λ,κ ∈ All≥p}, which is crucial for getting a tight error bound on the accuracy of DI,∊Y's estimates.

Recall that initially DI,∊Y = 〈CI,λ,κ〉 and thus CI,λ,κ ∈ All. Any other CJ,λ,κ ∈ All must be a child of some CH,λ,κ ∈ All (i.e., CJ,λ,κ is obtained from Split(CH,λ,κ)). Given CJ,λ,κ and CH,λ,κ, we say that CJ,λ,κ is a descendant of CH,λ,κ, and CH,λ,κ is an ancestor of CJ,λ,κ, if either (i) CJ,λ,κ is a child of CH,λ,κ, or (ii) it is a child of some of CH,λ,κ's descendants. Note that the original CI,λ,κ is an ancestor of every CJ,λ,κ ∈ All, and in general, any CH,λ,κ ∈ All is an ancestor of every CJ,λ,κ ∈ All with J ⊂ H. We have the following lemma. (Note that we are abusing notation here and regard DI,∊Y as a set.)

Lemma 5

Suppose that DI,∊Y = 〈CJ1,λ,κ, …, CJm,λ,κ〉 is covering [p, rI]. Let anc(DI,∊Y) = anc(〈CJ1,λ,κ, …, CJm,λ,κ〉) be the set {CH,λ,κ ∣ CH,λ,κ is an ancestor of some CJi,λ,κ ∈ DI,∊Y}. Then,

(1) All≥p ⊆ DI,∊Y ∪ anc(DI,∊Y),

(2) vadd(CJ,λ,κ) ≤ (1 + ∊)Y for any CJ,λ,κ ∈ All, and

(3) |DI,∊Y ∪ anc(DI,∊Y)| ≤ 2 logW.

Therefore, Σ{vadd(CJ,λ,κ) ∣ CJ,λ,κ ∈ All≥p} ≤ 2(1 + ∊)Y logW.

Proof

For (1), it suffices to prove that for any CJ,λ,κ ∈ All≥p, CJ,λ,κ ∈ DI,∊Y ∪ anc(DI,∊Y). By definition, J covers or is to the right of p; thus J ∩ (J1 ∪ ⋯ ∪ Jm) = J ∩ [p, rI] ≠ ∅. Since the intervals are interesting and do not cross, there is some 1 ≤ i ≤ m such that either (i) J = Ji, and thus CJ,λ,κ ∈ DI,∊Y, or (ii) Ji ⊂ J, which implies CJ,λ,κ is an ancestor of CJi,λ,κ, i.e., CJ,λ,κ ∈ anc(DI,∊Y). (It is not possible that J ⊂ Ji; otherwise CJi,λ,κ would have been split and would not be in the current DI,∊Y.) Hence, CJ,λ,κ ∈ DI,∊Y ∪ anc(DI,∊Y).

To prove (2), suppose that J = [x, y] and vadd(CJ,λ,κ) has just reached (1 + ∊)Y. This implies f*([x, rI]) ≥ (1 + ∊)Y, and so does its estimate f̂*([x, rI]) given by I,∊ (as f*([x, rI]) ≤ f̂*([x, rI]), by Equation (7)). Then, the procedure Trim( ) will be called and CJ,λ,κ will be either thrown away or split, and no more value can be added to CJ,λ,κ. It follows that vadd(CJ,λ,κ) ≤ (1 + ∊)Y.

For (3), recall that DI,∊Y = 〈CJ1,λ,κ, CJ2,λ,κ, …, CJm,λ,κ〉. Among the intervals J1, …, Jm, interval J1 is the leftmost and its left boundary ℓJ1 = p. We now prove that DI,∊Y ∪ anc(DI,∊Y) = DI,∊Y ∪ anc(CJ1,λ,κ), where anc(CJ1,λ,κ) is the set of ancestors of CJ1,λ,κ. Then, together with the facts that |DI,∊Y| ≤ logW (by Property (ii) of interesting-partition) and |anc(CJ1,λ,κ)| ≤ logW (as each Split operation reduces the size of the interval by half), we have

|DI,∊Y ∪ anc(DI,∊Y)| = |DI,∊Y ∪ anc(CJ1,λ,κ)| ≤ |DI,∊Y| + |anc(CJ1,λ,κ)| ≤ 2 logW.

To show DI,∊Y ∪ anc(DI,∊Y) = DI,∊Y ∪ anc(CJ1,λ,κ), it suffices to show that for any CH,λ,κ ∈ anc(DI,∊Y), CH,λ,κ ∈ anc(CJ1,λ,κ). Since CH,λ,κ ∈ anc(DI,∊Y), it is the ancestor of some CJi,λ,κ ∈ DI,∊Y. Thus Ji = [ℓJi, rJi] ⊂ H = [ℓH, rH]. Since CH,λ,κ is already an ancestor, it no longer exists, and all the CJ,λ,κ to its left have been thrown away. Thus, DI,∊Y has no CJ,λ,κ where J is to the left of ℓH. This implies ℓH ≤ p = ℓJ1 and ℓH ≤ ℓJ1 ≤ rJ1 ≤ rJi ≤ rH. It follows that J1 ⊂ H and CH,λ,κ is an ancestor of CJ1,λ,κ, i.e., CH,λ,κ ∈ anc(CJ1,λ,κ).

Proof

Let alive(DI,∊Y) be the set of nodes currently in DI,∊Y, dead(DI,∊Y) the set of those that were in DI,∊Y earlier in the execution but have been deleted, and node(DI,∊Y) = alive(DI,∊Y) ∪ dead(DI,∊Y). It can be verified that f̂a([t, rI]) = v(alive(DI,∊Y)≥ta). Below, we prove that

fa([t, rI]) − ∊Y ≤ f̂a([t, rI]) ≤ fa([t, rI]) + ∊Y. (8)

The proof of the second inequality of Equation (8) is identical to that of Lemma 3, except that we replace all occurrences of CJ,λ,κ by DI,∊Y. The proof of the first inequality is also similar. We still have

fa([t, rI]) − f̂a([t, rI]) ≤ v(dead(DI,∊Y)≥ta),

which equals d(dead(DI,∊Y)≥ta). As in Lemma 3, we can derive the bound d(dead(DI,∊Y)≥ta) ≤ (1/κ) v(node(DI,∊Y)) = (1/κ) f*(I), but we can do better here.

Observe that any node N ∈ dead(DI,∊Y)≥ta can only be in those CJ,λ,κ ∈ All≥p (because t ∈ [p, rI]), and when we debit N, if it is in CJ,λ,κ, then we also debit κ − 1 other nodes in CJ,λ,κ monitoring κ − 1 items other than a. Thus, κ · d(dead(DI,∊Y)≥ta) is no more than the total value available in the CJ,λ,κ ∈ All≥p, which is Σ{vadd(CJ,λ,κ) ∣ CJ,λ,κ ∈ All≥p}. Together with Lemma 5 we conclude

d(dead(DI,∊Y)≥ta) ≤ 2(1 + ∊)Y logW/κ.

For the size of DI,∊Y, similar to the proof of Lemma 3, we can argue that the number of born-rich nodes is only O(Y/λ) = O((1/∊) logW), but the number of born-poor nodes can be much larger. A born-poor node of a non-trivial queue is created either when we increase the value of a trivial queue, or when we execute Lines 2–6 of procedure Split. It can be verified that every queue QJ,λa has at most one born-poor node, which is the rightmost node in QJ,λa. Since there are O(logW) CJ,λ,κ's in DI,∊Y and each has at most κ non-trivial queues, the number of born-poor nodes, and hence the size of DI,∊Y, is O(κ logW) = O((1/∊)(logW)²).

To reduce DI,∊Y's size from O((1/∊)(logW)²) to O((1/∊)(loglogW) logW), we need to reduce the number of born-poor nodes, or equivalently, the number of non-trivial queues in DI,∊Y. In the next section, we give a simple idea to reduce the number of non-trivial queues and hence the size of DI,∊Y to O((1/∊)(loglogW) logW). In Section 6, we show how to further reduce the size by taking advantage of the tardiness of the data stream.

5. Reducing the Size of DI,∊Y

Our idea for reducing the size is simple: for every CJ,λ,κ ∈ DI,∊Y, its capacity is no longer fixed at κ = (4/∊) logW; instead, we start with a much smaller capacity, namely (4/∊) loglogW, which is allowed to increase gradually during execution. To determine CJ,λ,κ's capacity, we use a variable to keep track of the number f̄*(J) of items (a, u) with u ∈ J that have arrived since CJ,λ,κ's creation. Let vJ be the total value of the nodes in CJ,λ,κ when it is created (vJ may not be zero if CJ,λ,κ results from the splitting of its parent). The capacity of CJ,λ,κ is determined as follows.

When (c − 1)Y/logW ≤ vJ + f̄*(J) < cY/logW for some integer c ≥ 1, the capacity of CJ,λ,κ is κ(c) = (4c/∊) loglogW, i.e., we set κ = κ(c) and allow κ(c) non-trivial queues in CJ,λ,κ.
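In code, the capacity schedule amounts to reading off c from the current load vJ + f̄*(J). The sketch below uses illustrative names and base-2 logarithms, matching the rule above:

```python
import math

def capacity(eps, W, load, Y):
    """kappa(c) = (4c/eps) * loglog W, where c >= 1 is the unique integer
    with (c - 1) * Y / log W <= load < c * Y / log W, and
    load stands for v_J + fbar*(J)."""
    log_w = math.log2(W)
    c = max(1, math.floor(load * log_w / Y) + 1)
    return (4 * c / eps) * math.log2(log_w)
```

For instance, with ∊ = 0.5, W = 2^16 and Y = 160, a freshly created CJ,λ,κ (load 0) gets capacity (4/∊) loglogW = 32, and the capacity doubles to 64 once the load reaches Y/logW = 10.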

Note that when we increase the capacity of CJ,λ,κ to κ(c), we do not need to do anything, except that we allow more non-trivial queues (up to κ(c)) in the data structure. Also note that when CJ,λ,κ is created during the trimming process, its inherited capacity may be larger than the supposed capacity κ(c); in such a case, we simply debit every non-trivial queue until some queue QJ,λa has v(QJ,λa) = d(QJ,λa), and we execute Lines 4 and 5 of the procedure Process( ) to make this queue trivial. We repeat the process until the number of non-trivial queues is at most κ(c). The following theorem asserts that DI,∊Y maintains the accuracy of its estimates under this new implementation, and gives the revised size and update time.

Theorem 7

Suppose that DI,∊Y is currently covering [p, rI].

(1) For any item a ∈ U and any timestamp t ∈ [p, rI], the estimate f̂a([t, rI]) of fa([t, rI]) obtained by the new DI,∊Y satisfies |f̂a([t, rI]) − fa([t, rI])| ≤ ∊Y.

(2) The size of the new DI,∊Y is O((1/∊)(loglogW) logW).

Proof

Suppose that DI,∊Y = 〈CJ1,λ,κ(c1), …, CJm,λ,κ(cm)〉. From the fact that we are using CJi,λ,κ(ci) to monitor Ji, we conclude (ci − 1)Y/logW ≤ vJi + f̄*(Ji). It follows that Σ1≤i≤m ciY/logW ≤ Σ1≤i≤m (vJi + f̄*(Ji)) + Σ1≤i≤m Y/logW, which is O(Y) because (i) |DI,∊Y| = m = O(logW) and (ii) Σ1≤i≤m (vJi + f̄*(Ji)) = O(Y) (otherwise DI,∊Y would have been trimmed). Thus,

Σ1≤i≤m ci = O(logW).

(9)

For Statement (1), the analysis of the accuracy of f̂a([t, rI]) is very similar to that of Theorem 6, except for the following difference: In the proof of Theorem 6, we show that d(dead(DI,∊Y)≥pa) ≤ 2(1 + ∊)Y logW/κ, and since κ is fixed at (4/∊) logW, d(dead(DI,∊Y)≥pa) ≤ ∊Y. Here, we also prove that d(dead(DI,∊Y)≥pa) ≤ ∊Y, but we have to prove it differently because the capacities are no longer fixed.

As argued previously, any node in dead(DI,∊Y)≥pa is in some CJ,λ,κ ∈ All≥p. Below, we show that for any CJ,λ,κ ∈ All≥p, at most ∊Y/(2 logW) debit operations can be made to the queue QJ,λa of CJ,λ,κ during its lifespan. Together with the fact that |All≥p| ≤ 2 logW, we have d(dead(DI,∊Y)≥pa) ≤ ∊Y.

Consider any CJ,λ,κ ∈ All≥p. Note that the smaller its capacity, the larger the number of debit operations that can be made to the queue QJ,λa of CJ,λ,κ. To maximize the number of debit operations made to QJ,λa, suppose that vJ = 0 and thus CJ,λ,κ has the smallest capacity κ(1) when it is created. Before increasing its capacity to κ(2), CJ,λ,κ can make at most (1/κ(1)) · Y/logW debit operations to QJ,λa. Then, during the next Y/logW arrivals of items (a, u) with u ∈ J, when Y/logW ≤ vJ + f̄*(J) < 2Y/logW, the capacity is κ(2), and at most (1/κ(2)) · Y/logW debit operations can be made to QJ,λa. In general, during the period when (c − 1)Y/logW ≤ vJ + f̄*(J) < cY/logW, at most (1/κ(c)) · Y/logW debit operations can be made to QJ,λa. If the largest capacity is κ(cmax), the total number of debit operations made to QJ,λa is at most

Σ1≤c≤cmax (1/κ(c)) · Y/logW = (∊Y/(4(loglogW) logW)) Σ1≤c≤cmax 1/c ≤ (∊Y/(4(loglogW) logW)) (ln(cmax) + 1),

which is smaller than ∊Y/(2 logW) because by Equation (9), cmax = O(logW), which implies ln(cmax) + 1 ≤ 2 loglogW (suppose that W is larger than some constant).
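A quick numeric check of this harmonic-sum bound, with illustrative parameter values (the bound only has to hold for W larger than some constant):

```python
import math

def max_debits(eps, W, Y, c_max):
    """sum_{c=1}^{c_max} (1/kappa(c)) * (Y / log W),
    with kappa(c) = (4c/eps) * loglog W (logs base 2)."""
    log_w = math.log2(W)
    loglog_w = math.log2(log_w)
    return sum((eps / (4 * c * loglog_w)) * (Y / log_w)
               for c in range(1, c_max + 1))

eps, W, Y = 0.1, 2.0**64, 1000.0
log_w = math.log2(W)
# c_max = O(log W) by Equation (9); the claimed bound eps*Y/(2 log W) dominates:
assert max_debits(eps, W, Y, int(log_w)) <= eps * Y / (2 * log_w)
```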

We now prove (2). Note that the total number of non-trivial queues in DI,∊Y, and hence the number of born-poor nodes, is at most Σ1≤i≤m κ(ci) = Σ1≤i≤m (4ci/∊) loglogW. By Equation (9), Σ1≤i≤m ci = O(logW), and it follows that the size of DI,∊Y is O((1/∊)(loglogW) logW).

For the update time, suppose that an item (a, u) arrives. We can find the CJi,λ,κ in DI,∊Y = 〈CJ1,λ,κ, …, CJm,λ,κ〉 with u ∈ Ji in O(log m) = O(loglogW) time by querying a balanced search tree storing the Ji's. By hashing (e.g., Cuckoo hashing [15], which supports constant update and query time), we can locate the queue QJi,λa ∈ CJi,λ,κ in constant time. Then, by consulting an auxiliary balanced search tree on the intervals monitored by the nodes of QJi,λa, we can find and update the node N of QJi,λa with u ∈ i(N) in O(log(Y/λ)) = O(log(1/∊) + loglogW) time. At times we may also need to execute Lines 3 and 4 of the procedure Process( ), which debits all the non-trivial queues in CJi,λ,κ. Using the de-amortizing technique given in [16], this step takes constant time.

Note that occasionally, we may also need to clean up DI,∊Y by calling Trim( ); this step takes time linear in the size of DI,∊Y, which is O((1/∊)(loglogW) logW).

6. Further Reducing the Size of DI,∊Y for Streams with Small Tardiness

Recall that in an out-of-order data stream with tardiness dmax ∈ [0, W], any item (a, u) arriving at time τcur satisfies u ≥ τcur − dmax; in other words, the delay of any item is guaranteed to be at most dmax. This section extends DI,∊Y to a data structure ℰI,∊Y that takes advantage of this maximum delay guarantee to reduce the space usage. The idea is as follows. Since there is no new item with timestamp smaller than τcur − dmax, we will not make any further change to those nodes to the left of τcur − dmax, and hence can consolidate these nodes to reduce space substantially. To handle those nodes with timestamps in [τcur − dmax, τcur], we use the data structure given in Section 5; since it is monitoring an interval of size dmax instead of W, its size is O((1/∊)(loglog dmax) log dmax) instead of O((1/∊)(loglogW) logW).

To implement ℰI,∊Y, we need a new operation called consolidate. Consider any list of queues 〈QJ1,λa, QJ2,λa, …, QJm,λa〉, where J1, J2, …, Jm are ordered from left to right and form a partition of the interval J1‥m = J1 ∪ ⋯ ∪ Jm. We consolidate them into a single queue QJ1‥m,λa as follows:

Concatenate the queues into a single queue, in which the nodes preserve the left-right order.

Starting from the leftmost node, check from left to right every node N in the queue; if N is not the rightmost node and v(N) < λ, merge it with the node N′ immediately to its right, i.e., delete N, and set v(N′) = v(N) + v(N′), d(N′) = d(N) + d(N′) and i(N′) = i(N) ∪ i(N′).

Note that after the consolidation, the resulting queue QJ1‥m,λa has at most one node (the rightmost one) with value smaller than λ.
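The two steps translate directly into a sketch (illustrative Node/consolidate names; a stack makes the left-to-right merging linear-time, and the input nodes are mutated in place):

```python
from dataclasses import dataclass

@dataclass
class Node:
    lo: int       # i(N) = [lo, hi], the interval this node monitors
    hi: int
    v: int = 0    # value counter v(N)
    d: int = 0    # debit counter d(N)

def consolidate(queues, lam):
    """Concatenate a left-to-right list of queues (lists of Nodes), then
    merge every non-rightmost node N with v(N) < lam into its right
    neighbour, accumulating v, d and the monitored interval."""
    out = []
    for node in (n for Q in queues for n in Q):   # step 1: concatenate
        out.append(node)
        # step 2: while the node just left of the tail is poor, fold it in
        while len(out) >= 2 and out[-2].v < lam:
            left = out.pop(-2)
            out[-1].lo = left.lo                  # i(N') = i(N) ∪ i(N')
            out[-1].v += left.v
            out[-1].d += left.d
    return out
```

After consolidation only the rightmost node may carry a value below λ, matching the remark above.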

Given the list 〈CJ1,λ,κ(c1), …, CJm,λ,κ(cm)〉, we consolidate them into CJ1‥m,λ,1/∊ by first consolidating, for each item a, the queues QJ1,λa, …, QJm,λa in CJ1,λ,κ(c1), …, CJm,λ,κ(cm) into the queue QJ1‥m,λa and putting it in CJ1‥m,λ,1/∊. Then, we apply Lines 3–5 of procedure Process( ) repeatedly to reduce the number of non-trivial queues in the data structure to 1/∊.

We are now ready to describe how to extend DI,∊Y to ℰI,∊Y. In our discussion, we fix λ = ∊Y/log dmax, and without loss of generality, we assume that I = [1, W]. Recall that pmax denotes the largest timestamp in I such that f̂*([pmax, rI]) > (1 + ∊)Y (which implies f*([pmax, rI]) > Y). We partition I into sub-windows I1, I2, …, Im, each of size dmax (i.e., Ii = [(i − 1)dmax + 1, idmax]). We divide the execution into different periods according to τcur, the current time. During the 1st period, when τcur ∈ I1, ℰI,∊Y maintains DI1,∊Y.

During the 2nd period, when τcur ∈ I2, ℰI,∊Y maintains DI2,∊Y in addition to DI1,∊Y.

During the 3rd period, when τcur ∈ I3, ℰI,∊Y maintains DI3,∊Y in addition to DI2,∊Y. Also, DI1,∊Y = 〈CJ1,λ,κ(c1), …, CJm,λ,κ(cm)〉 is consolidated into CI1,λ,1/∊.

In general, during the ith period, when τcur ∈ [(i − 1)dmax + 1, idmax] = Ii, ℰI,∊Y maintains DIi−1,∊Y and DIi,∊Y, and also CI1‥i−2,λ,1/∊, where I1‥i−2 = I1 ∪ I2 ∪ ⋯ ∪ Ii−2. Observe that in this period, no item (a, u) with u ∈ I1‥i−2 arrives (because the tardiness is dmax), and thus we do not need to update CI1‥i−2,λ,1/∊. However, we keep throwing away any node N in CI1‥i−2,λ,1/∊ as soon as we know i(N) is to the left of pmax + 1.

When entering the (i + 1)st period, we do the following: keep DIi,∊Y, create DIi+1,∊Y, merge CI1‥i−2,λ,1/∊ with DIi−1,∊Y = 〈CJ1,λ,κ(c1), …, CJm,λ,κ(cm)〉, and then get CI1‥i−1,λ,1/∊ by consolidating 〈CI1‥i−2,λ,1/∊, CJ1,λ,κ(c1), …, CJm,λ,κ(cm)〉.

Given any t ∈ [pmax + 1, rI], the estimate of fa([t, rI]) given by ℰI,∊Y is

f̂a([t, rI]) = v(alive(ℰI,∊Y)≥ta).

The following theorem gives the accuracy of f̂a([t, rI]), and ℰI,∊Y's size and update time.

Theorem 8

(1) For any t ∈ [pmax + 1, rI], the estimate f̂a([t, rI]) given by ℰI,∊Y satisfies |f̂a([t, rI]) − fa([t, rI])| ≤ 2∊Y.

(2) The size of ℰI,∊Y is O((1/∊)(loglog dmax) log dmax).

Proof

Recall that I is partitioned into sub-intervals I1, I2, …, Im. Suppose that t ∈ Iκ. Note that if we had not performed any consolidation,

v(alive(ℰI,∊Y)≥ta) = v(alive(DIκ,∊Y)≥ta) + Σκ+1≤i≤m v(alive(DIi,∊Y)a). (10)

Note that for κ + 1 ≤ i ≤ m, v(alive(DIi,∊Y)a) ≤ fa(Ii), and for v(alive(DIκ,∊Y)≥ta), since |Iκ| = dmax, the same argument used in the proof of Lemma 3 gives us v(alive(DIκ,∊Y)≥ta) ≤ fa([t, rIκ]) + λ log dmax. Hence

v(alive(ℰI,∊Y)≥ta) ≤ fa([t, rI]) + λ log dmax.

The consolidation step may add errors to v(alive(ℰI,∊Y)≥ta). To get a bound on them, let N1, N2, … be the nodes for a in ℰI,∊Y, ordered from left to right. Suppose that t ∈ i(Nh). Note that

the consolidation step will add at most λ units to v(Nh) before we move on to consider the node immediately to its right, and

for any node Ni with i ≥ h + 1, any node N that has been merged into Ni must be to the right of Nh, and thus is to the right of t; it follows that N is contributing v(N) to v(alive(ℰI,∊Y)≥ta) in Equation (10), and its merging will not make any change.

In conclusion, the consolidation steps introduce at most λ extra error, and Equation (10) becomes v(alive(ℰI,∊Y)≥ta) ≤ fa([t, rI]) + λ log dmax + λ ≤ fa([t, rI]) + 2∊Y, which is the second inequality of the theorem.

To prove the first inequality, suppose that we ask for the estimate f̂a([t, rI]) during the ith period, when we have CI1‥i−2,λ,1/∊, DIi−1,∊Y and DIi,∊Y. Recall that CI1‥i−2,λ,1/∊ comes from consolidating DI1,∊Y, DI2,∊Y, …, DIi−2,∊Y. As in all our previous analyses, we have

fa([t, rI]) − f̂a([t, rI]) ≤ d(dead(ℰI,∊Y)≥ta).

(Note that the merging of nodes during consolidations does not take away any value.) To get a bound on d(dead(ℰI,∊Y)≥ta), suppose that pmax ∈ Iκ. Then, all the nodes to the left of Iκ have been thrown away. Among DIκ,∊Y, DIκ+1,∊Y, …, DIm,∊Y, only DIκ,∊Y may have been trimmed. Note that

d(dead(ℰI,∊Y)≥ta) ≤ d(dead(DIκ,∊Y)≥pmaxa) + Σκ+1≤ℓ≤m d(dead(DIℓ,∊Y)a);

as in the proof of Theorem 7, we can argue that d(dead(DIκ,∊Y)≥pmaxa) ≤ ∊Y, and for the other DIℓ,∊Y, since their capacity is at least 1/∊,

Σκ+1≤ℓ≤m d(dead(DIℓ,∊Y)a) ≤ Σκ+1≤ℓ≤m f*(Iℓ)/(1/∊) ≤ ∊ f*([pmax + 1, rI]) ≤ ∊Y.

Thus, d(dead(ℰI,∊Y)≥ta) ≤ 2∊Y, and the first inequality follows.

For Statement (2), note that both DIi−1,∊Y and DIi,∊Y have size O((1/∊)(loglog dmax) log dmax) (by Theorem 7, and |Ii−1| = |Ii| = dmax), and CI1‥i−2,λ,1/∊ has size O(Y/λ + 1/∊) = O((1/∊) log dmax); thus the size of ℰI,∊Y is O((1/∊)(loglog dmax) log dmax). For the update time, it suffices to note that it is dominated by the update times of DIi−1,∊Y and DIi,∊Y.

Figure 1.
Suppose that λ = 4. (i) shows the queue
QI,λa before the arrivals of items (a, 1), (a, 2), (a, 3), (a, 8); (ii) is the resulting queue after the updates for these items; (iii) shows that after the arrival of another item (a, 1), the first node in (ii) is updated and refined.


Figure 2.
Interesting intervals for I = [1, 8].


Figure 3.
Split of C[1,8],λ,κ.


Figure 4.
Trim(〈C[2,2],λ,κ, C[3,4],λ,κ, C[5,8],λ,κ〉, 3).


Table 1.
The space complexity for answering the ∊-approximate frequent item set query over a sliding time window. Results from this paper are marked with [†]. Note that we assume B ≥ (1/∊) logW; otherwise, we can always store all items in the window for an exact answer, using O((1/∊) logW) words. Similarly, for the result with tardiness, we assume B ≥ (1/∊) log dmax.
