
What's going on here? Take element a, say. We can decompose X = Y + Z, where Y is the contribution to the sum from the occurrences of a, and Z is the contribution from the non-a elements. By the linearity of expectation, we have

E[h(a) X] = E[h(a) Y] + E[h(a) Z] .

E[h(a) Y] is a sum with one term h(a)^2 = 1 for each occurrence of a, so E[h(a) Y] is the number of occurrences of a. The other term, E[h(a) Z], is zero: even given h(a), each other hash value is equally likely to be plus or minus one and so contributes zero in expectation.
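A quick simulation can make this concrete (my own illustration, not part of the answer; the stream contents are made up). Averaging the estimator h(a)·X over many independent draws of the sign hash should recover the true count of a:

```python
import random

# Hypothetical stream: 'a' occurs 5 times among other elements.
stream = ['a'] * 5 + ['b'] * 3 + ['c'] * 2

def estimate_count(stream, target, trials=20000):
    total = 0.0
    for _ in range(trials):
        # Fresh uniform random sign hash per trial: h(x) in {-1, +1}.
        h = {x: random.choice([-1, 1]) for x in set(stream)}
        X = sum(h[x] for x in stream)  # the single sketch counter
        total += h[target] * X         # the estimator h(a) * X
    return total / trials

print(estimate_count(stream, 'a'))  # expect a value close to 5
```

The non-a terms cancel only in expectation; any single trial can be off, which is why real sketches use several rows and combine them.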

In fact, the hash function doesn't need to be uniform random, which is a good thing: a uniform random function over a large universe would be impossible to store. It suffices for the hash function to be pairwise independent (any two particular hash values are independent). For our simple example, a random choice of the following four functions suffices.

Wow! Just a few hours after posting the question, someone came up with a clearer explanation of the algorithm! Thanks so much!!! :D
– neilmarion Jul 25 '11 at 5:26

Hello @insomniac. Does this mean that we need to know beforehand the set, say O, where a, b and c are elements of O?
– neilmarion Jul 25 '11 at 7:55

@neilmarion It suffices to know a superset – there may be too many different items to keep a uniform random hash function. For example, if the data items are n-bit vectors, then at the outset we can choose a random n-bit vector r and let h(x) = 1 if r.x = 0 mod 2 and h(x) = -1 if r.x = 1 mod 2, where . denotes dot product.
– insomniac Jul 25 '11 at 13:38

(I'm not sure if pairwise randomness suffices to make the arguments about variance work, but that's the flavor of the hash functions that one could use.)
– insomniac Jul 25 '11 at 13:40
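As a concrete illustration of the construction from the comment above (my own sketch; the 8-bit width and function names are assumptions, not from the answer), here is the dot-product sign hash in Python:

```python
import random

N_BITS = 8  # assume the items come from a known superset of 8-bit integers

def make_sign_hash(n_bits=N_BITS):
    """Pick a random n-bit vector r; h(x) = +1 if r.x = 0 mod 2, else -1."""
    r = random.getrandbits(n_bits)
    def h(x):
        # Dot product mod 2 over bit vectors = parity of the bitwise AND.
        return 1 if bin(r & x).count('1') % 2 == 0 else -1
    return h

h = make_sign_hash()
print([h(x) for x in range(8)])
```

One caveat, echoing the comment's own hedge: h(0) is always +1 under this scheme (the all-zeros vector), so the item encoding should avoid mapping any element to 0, or an extra random offset bit can be mixed in.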

Reading a stream of elements a1, a2, a3, ..., an, where there can be many repeated elements, at any time it will give you the answer to the following question: how many times have you seen a given element ai so far?

You can clearly get an exact value at any time just by maintaining a hash table whose keys are the elements ai and whose values are the counts seen so far. It is fast, O(1) to add and O(1) to check, and it gives you an exact count. The only problem is that it takes O(n) space, where n is the number of distinct elements (and keep in mind that the size of each element makes a big difference: storing a long string as a key takes far more space than storing a short one).

So how is Count sketch going to help you? As in all probabilistic data structures, you sacrifice certainty for space. Count sketch allows you to select two parameters: the accuracy of the results, ε, and the probability of a bad estimate, δ.

To do this you select a family of d pairwise independent hash functions. These complicated words mean that the functions do not collide too often (in fact, if a hash function maps values uniformly into a range of size w, the probability that two distinct values collide is about 1/w). Each of these hash functions maps values into the range [0, w). So you create a d × w matrix of counters.

Now, when you read an element, you calculate each of the d hashes of this element and update the corresponding counters in the sketch. This part is essentially the same for the Count sketch and the Count-min sketch (the Count sketch additionally multiplies each update by a ±1 sign hash).

Insomniac nicely explained the idea (calculating the expected value) for the Count sketch, so I will just say that with Count-min everything is even simpler. You calculate the d hashes of the value you want to query, read the d corresponding counters, and return the smallest of them. Surprisingly, this provides strong accuracy and probability guarantees, which you can find here.
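A minimal Count-min sketch along these lines (my own illustration, not production code; the universal hash family ((a·x + b) mod p) mod w and the restriction to non-negative integer items are assumptions, not from the answer):

```python
import random

class CountMinSketch:
    """Minimal Count-min sketch: d rows of w counters."""
    P = 2**31 - 1  # a Mersenne prime larger than the assumed universe

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # One (a, b) pair per row for h(x) = ((a*x + b) mod P) mod w.
        self.params = [(random.randrange(1, self.P), random.randrange(self.P))
                       for _ in range(d)]

    def _hash(self, row, x):
        a, b = self.params[row]
        return ((a * x + b) % self.P) % self.w

    def add(self, x):
        for i in range(self.d):
            self.table[i][self._hash(i, x)] += 1

    def query(self, x):
        # Every row can only overestimate (collisions add counts),
        # so the minimum over rows is the best estimate.
        return min(self.table[i][self._hash(i, x)] for i in range(self.d))

cms = CountMinSketch(w=200, d=5)
for _ in range(10):
    cms.add(42)
cms.add(7)
print(cms.query(42))  # at least 10; almost certainly exactly 10 here
```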

Increasing the range of the hash functions increases the accuracy of the results, and increasing the number of hash functions decreases the probability of a bad estimate:
ε = e/w and δ = 1/e^d. Another interesting thing is that the value is always overestimated (the value you find is most probably bigger than the real value, but surely not smaller).
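To pick the sketch dimensions for target values of ε and δ, you invert these formulas and round up (the target values below are arbitrary examples):

```python
import math

eps, delta = 0.01, 0.001          # example accuracy and failure-probability targets
w = math.ceil(math.e / eps)       # from eps = e/w
d = math.ceil(math.log(1 / delta))  # from delta = 1/e^d
print(w, d)  # 272 7
```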