Probabilistic Counting

I recently got my hands on the common-crawl URL dataset from here and wanted to compute some stats like the number of unique domains contained in the corpus. This was the perfect opportunity to whip up a script that implemented Durande et. al’s log-log paper.

The algorithm operates on a stream and produces an estimate for the cardinality of the input stream. Here’s the algorithm itself: