2
Introduction
Counting distinct objects: Given a dataset D, return the number of distinct objects in D.
Counting distinct objects against sliding windows: Given a data stream, return the number of distinct objects that arrive at or after timestamp t.
Applications: traffic management, call centers, wireless communication, stock markets, etc.

4
FM Algorithm
FM sketch:
Let h(x) be a uniform hash function.
Let the “pivot” p(x) be the position of the leftmost 1-bit of h(x) (positions are counted from 0, left to right).
Let FM be an array of size k, initialized to zero.
For each record x in the dataset, set FM[p(x)] = 1.
Let B = FM min be the position of the leftmost 0-bit of FM.
Number of distinct elements ≈ $\alpha \cdot 2^{B}$, where α is a constant correction factor.
Each bit i of h(x) is one with probability 1/2.
Example (k = 4): stream r1 r2 r1 r3 r1, with h(r1) = 0010, h(r2) = 1101, h(r3) = 1010; FM min = 1.
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS 1985.
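The sketch construction above can be written out in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `default_hash` is a hypothetical stand-in for a uniform hash (built from SHA-256), and since the slide leaves the value of α blank, the code assumes α = 1/0.77351, the correction constant usually quoted from the cited Flajolet-Martin paper. The usage lines replay the slide's example with its hash values supplied directly.

```python
import hashlib

PHI = 0.77351  # ASSUMPTION: constant usually quoted from Flajolet-Martin (1985); alpha = 1 / PHI

def default_hash(x, k):
    """Hypothetical k-bit hash (k <= 256): the top k bits of SHA-256."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - k)

def pivot(h, k):
    """Position (0 = leftmost) of the leftmost 1-bit of the k-bit value h, or None if h == 0."""
    for i in range(k):
        if (h >> (k - 1 - i)) & 1:
            return i
    return None

def fm_sketch(stream, k, hash_fn=default_hash):
    """Build the FM bit array: set FM[p(x)] = 1 for every record x."""
    fm = [0] * k
    for x in stream:
        p = pivot(hash_fn(x, k), k)
        if p is not None:
            fm[p] = 1
    return fm

def fm_min(fm):
    """Position of the leftmost 0-bit of FM (len(fm) if every bit is set)."""
    for i, bit in enumerate(fm):
        if bit == 0:
            return i
    return len(fm)

def fm_estimate(fm):
    """Estimate alpha * 2^B with alpha = 1 / PHI (assumed, see above)."""
    return (1 / PHI) * 2 ** fm_min(fm)

# Replay of the slide's example, with the slide's hash values supplied directly.
slide_hash = {"r1": 0b0010, "r2": 0b1101, "r3": 0b1010}
fm = fm_sketch(["r1", "r2", "r1", "r3", "r1"], 4, lambda x, k: slide_hash[x])
print(fm, fm_min(fm))  # [1, 0, 1, 0] 1
```

With only four bits and three records the estimate itself is not meaningful; the example only confirms that the bit array and FM min match the slide.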

5
FM Algorithm
Example (k = 4): stream r1 r2 r1 r3 r1, with h(r1) = 0010, h(r2) = 1101, h(r3) = 1010, gives FM = 1010 and FM min = 1.
Each bit i of h(x) is one with probability 1/2.
An h(x) whose first i bits are zero and whose (i+1)-th bit is one occurs with probability $1/2^{i+1}$.
Let n be the number of distinct elements.
FM[0] is accessed approximately n/2 times.
FM[1] is accessed approximately n/4 times.
…
FM[i] is accessed approximately $n/2^{i+1}$ times.
If $i \gg \log_2 n$, FM[i] will almost certainly be zero.
If $i \ll \log_2 n$, FM[i] will almost certainly be one.
If $i \approx \log_2 n$, FM[i] may be zero or one.
Hence the first i for which FM[i] is zero can be used to approximate the number of distinct elements n.
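The counting argument above can be checked empirically. The following is a rough simulation, assuming that random k-bit integers behave like the hash values of n distinct elements; `pivot_counts` and `first_zero` are hypothetical helper names introduced only for this illustration.

```python
import math
import random

def pivot_counts(n, k=32, seed=0):
    """Count how many of n random k-bit hashes have their leftmost 1-bit at each position."""
    rng = random.Random(seed)
    counts = [0] * k
    for _ in range(n):
        h = rng.getrandbits(k)
        if h:  # an all-zero hash has no 1-bit and sets nothing
            counts[k - h.bit_length()] += 1  # position 0 is the leftmost bit
    return counts

def first_zero(counts):
    """First position that was never hit, i.e. the first zero bit of the FM array."""
    for i, c in enumerate(counts):
        if c == 0:
            return i
    return len(counts)

n = 1 << 16
counts = pivot_counts(n)
for i in range(4):
    # observed number of accesses to FM[i] next to the predicted n / 2^(i+1)
    print(i, counts[i], n / 2 ** (i + 1))
print("first zero position:", first_zero(counts), "log2(n):", math.log2(n))
```

The observed counts should sit close to n/2, n/4, n/8, and so on, and the first zero position should land near log2(n), which is what the estimate relies on.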

12
K-Skyband Technique
Main idea:
Let h() be a hash function that hashes D to $[1, m^3]$, where m = |D|.
For each record (x, t') we generate h(x) and store the record (x, h(x), t').
Answering a query q(t):
Retrieve all records (x, h(x), t') with timestamp t' ≥ t.
Get the k-th smallest distinct hashed value and apply the BJKST algorithm.
Limitation: requires storing all records.
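The store-everything baseline described above might look as follows. This is a sketch under assumptions: the slide does not spell out the BJKST estimator, so the code assumes the common k-minimum-values form, estimate = k · M / v, where M = m^3 is the size of the hash range and v is the k-th smallest distinct hash value in the window. `make_hash` and `NaiveWindowCounter` are hypothetical names, and SHA-256 stands in for h().

```python
import hashlib

def make_hash(m):
    """Hypothetical hash into [1, m^3], built from SHA-256."""
    M = m ** 3
    def h(x):
        digest = hashlib.sha256(str(x).encode()).digest()
        return int.from_bytes(digest, "big") % M + 1
    return h, M

class NaiveWindowCounter:
    """Keeps every record (x, h(x), t'); this is the limitation noted on the slide."""

    def __init__(self, m, k):
        self.h, self.M = make_hash(m)
        self.k = k
        self.records = []

    def insert(self, x, t):
        self.records.append((x, self.h(x), t))

    def query(self, t):
        # Distinct hash values of all records with timestamp t' >= t, in increasing order.
        values = sorted({hx for (_, hx, tp) in self.records if tp >= t})
        if len(values) < self.k:
            return float(len(values))  # fewer than k distinct values: count is exact
        # ASSUMPTION: k-minimum-values form of the BJKST estimate.
        return self.k * self.M / values[self.k - 1]

counter = NaiveWindowCounter(m=1000, k=8)
for t, x in enumerate(["a", "b", "a", "c"], start=1):
    counter.insert(x, t)
print(counter.query(1))  # window holds a, b, c
print(counter.query(3))  # window holds a, c
```

Each query scans and sorts the whole history, which is what motivates keeping only the k-skyband instead.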

13
K-Skyband Technique For any time t, we need to find k-th smallest hash value arriving no later than t A record x dominates another record y if x arrives after y and has smaller hash value K-Skybands keeps only the objects that are dominated by at most (k-1) records Maintaining K-Skyband: Keep a counter for each record When a new element (x,t) arrives, increment the counter of all records dominated by it Remove the records with counter at least equal to k We increment the counters of groups to improve efficiency (Domination aggregation search tree) a e d c b h(x) t k = 2