Discussion: This is one of many techniques used to solve a problem called reservoir sampling. We often encounter data sets that we’d like to sample elements from at random. But with the advent of big data, the lists involved are often so large that either we can’t fit them into memory all at once, or we don’t even know how big they are because the data arrives as a stream (e.g., the number of atomic price movements in the stock market in a year). Reservoir sampling is the problem of sampling from such streams, and the technique above is one way to achieve it.

In words, the above algorithm holds one element from the stream at a time, and when it inspects the $k$-th element (indexing $k$ from 1), it flips a coin of bias $1/k$ to decide whether to keep its currently held element or to drop it in favor of the new one.

We can prove quite easily by induction that this works. Indeed, let $n$ be the (unknown) size of the list, and suppose $n = 1$. In this case there is only one element to choose from, and so the probability of picking it is 1. The case of $n = 2$ is similar, and more illustrative. Now suppose the algorithm works for $n$, and suppose we increase the size of the list by 1 by adding some new element $y$ to the end of the list. For any given $x$ among the first $n$ elements, the probability we’re holding $x$ when we inspect $y$ is $1/n$ by induction. Now we flip a coin which lands heads with probability $1/(n+1)$, and if it lands heads we take $y$ and otherwise we keep $x$. The probability we get $y$ is exactly $1/(n+1)$, as desired, and the probability we get $x$ is $\frac{1}{n} \cdot \frac{n}{n+1} = \frac{1}{n+1}$. Since $x$ was arbitrary, this means that after the last step of the algorithm each entry is held with probability $1/(n+1)$.
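The induction can also be sanity-checked empirically. The following is only a sketch: the names `reservoir_sample`, `empirical_frequencies`, and the `rng` parameter are illustrative choices of mine, and the sampler is written from the description in words rather than copied from the listing above. It runs the sampler many times over a short stream and records how often each element ends up being held; each frequency should come out close to one over the stream length.

```python
import random
from collections import Counter

def reservoir_sample(stream, rng=random):
    """Hold one element; swap in the k-th element with probability 1/k.

    Returns None if the stream is empty (an assumption of this sketch).
    """
    chosen = None
    for k, x in enumerate(stream, start=1):
        # rng.random() lies in [0, 1), so the first element is always taken.
        if rng.random() < 1.0 / k:
            chosen = x
    return chosen

def empirical_frequencies(n, trials, seed=0):
    """Fraction of trials in which each element of range(n) is the one held."""
    rng = random.Random(seed)  # seeded so repeated runs agree
    counts = Counter(reservoir_sample(range(n), rng) for _ in range(trials))
    return {x: counts[x] / trials for x in range(n)}

freqs = empirical_frequencies(n=5, trials=50_000)
for x, p in sorted(freqs.items()):
    print(x, round(p, 3))
```

A simulation like this does not replace the proof, but it is a quick way to catch an off-by-one in the coin's bias.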

It’s easy to see how one could increase the number of coins being flipped to provide a sampling algorithm to pick any finite number of elements (with replacement, although a variant without replacement is not so hard to construct using this method). Other variants exist, such as distributed and weighted sampling.
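One way to read "increase the number of coins" is to run several independent one-element reservoirs side by side over a single pass, which samples with replacement. This is a minimal sketch under that reading; the function name, the `rng` parameter, and the choice to leave slots as `None` on an empty stream are my own assumptions.

```python
import random

def reservoir_sample_with_replacement(stream, num_samples, rng=random):
    """Run num_samples independent one-element reservoirs in one pass.

    Each slot is uniform over the stream on its own, so the same element
    may appear in more than one slot. Slots stay None for an empty stream.
    """
    held = [None] * num_samples
    for k, x in enumerate(stream, start=1):
        for slot in range(num_samples):
            # Every slot flips its own coin of bias 1/k.
            if rng.random() < 1.0 / k:
                held[slot] = x
    return held

print(reservoir_sample_with_replacement(range(1000), 4))
```

The cost is one coin flip per slot per element, so this version does more work per element than the single-sample loop.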

Python’s generators make this algorithm for reservoir sampling particularly nice. One can define a generator which abstractly represents a data stream (perhaps querying the entries from files distributed across many different disks), and this logic is hidden from the reservoir sampling algorithm. Indeed, this algorithm works for any iterable, although if we knew the size of the list we could sample much faster (by uniformly generating a random number and indexing the list appropriately). The start parameter given to the enumerate function makes the variable $k$ start at 1.
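To illustrate the separation between the stream and the sampler, here is a small sketch. The generator `records_from_sources` is hypothetical: it stands in for something that would read from files on different disks, and uses in-memory lists only so that the example runs as written. The sampler is again written from the description in words.

```python
import random

def reservoir_sample(stream):
    """One-pass uniform sample; never asks the stream for its length."""
    chosen = None
    for k, x in enumerate(stream, start=1):
        if random.random() < 1.0 / k:
            chosen = x
    return chosen

def records_from_sources(sources):
    """Hypothetical stream: yields records one at a time from several sources."""
    for source in sources:
        for record in source:
            yield record

sources = [["a", "b"], ["c"], ["d", "e", "f"]]
sample = reservoir_sample(records_from_sources(sources))
print(sample)

# When the data is an in-memory list of known size, indexing is simpler.
items = ["a", "b", "c", "d", "e", "f"]
print(items[random.randrange(len(items))])
```

The sampler only ever calls for the next element, so swapping the lists for real file readers would not change `reservoir_sample` at all.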

Let me just get this clear: you have to iterate through the entire stream, right?

Well, if this is the case, this algorithm is only a good choice if this stream is highly mutable in size, otherwise maybe it is better to keep track of the size in the first pass and then use the “known size” solution…

I’m just asking because I don’t really know the concept of Big Data (although something can be guessed by its name =D)

No. This algorithm has half the expected runtime of what you suggest precisely because it doesn’t iterate through the entire stream (except when it picks the last element). This algorithm is used in cases where you don’t necessarily want to (or can’t) go back to an earlier point in the stream. The whole point of streaming algorithms is that the sequences involved do not have random access.

As an analogy, if you have an assembly line of parts going by on a conveyor belt and you want to sample them uniformly at random for testing, you don’t wait until they’ve all passed by before choosing which to sample. You pick them off the assembly line as they go by using an algorithm like this (this algorithm can be extended to pick a set of k elements uniformly at random). Similarly, you can use this algorithm to sample from infinite lists. This could occur, for example, during database insertions over the lifetime of a web store.

Then I completely missed the behaviour of this algorithm…
I don’t really know Python, but, to me, this function will only return the chosen element after going through all the elements (that is what I understood by “for k,x in enumerate(stream, start=1)”), because there isn’t any “break” condition inside the if statement… But you told me that you shouldn’t look at them all.

My apologies, I’ve made a terrible mistake. I was replying to this comment on my mobile device and clearly forgot what the algorithm does. You do have to get through the entire sequence, which makes the algorithm less useful than I said. But the “read-only” restriction is still relevant, and it’s a paradigm that shows up often in the study of streaming algorithms.

For an element arriving at time ‘t+1’, add it with probability ‘m/(t+1)’, where m is the number of samples.
If the element arriving at ‘t+1’ was indeed added, then select one of the m elements currently in the set uniformly at random and delete it, to keep the cardinality of the set at m.
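A minimal sketch of the scheme described in this comment, with some assumptions of mine: the first m elements fill the set directly, the loop index `t` is the arrival time of the current element (so the admission probability reads `m / t` rather than m/(t+1)), and the names `reservoir_sample_m` and `rng` are illustrative.

```python
import random

def reservoir_sample_m(stream, m, rng=random):
    """Keep m items so that each item seen so far is kept with equal probability."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= m:
            # The first m arrivals fill the reservoir directly.
            reservoir.append(x)
        elif rng.random() < m / t:
            # Admit the t-th arrival with probability m/t, evicting a
            # uniformly random current member so the size stays at m.
            reservoir[rng.randrange(m)] = x
    return reservoir

print(reservoir_sample_m(range(100), 5))
```

If the stream has fewer than m elements, this sketch simply returns all of them.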

proof:

For any element x(i) with i <= t that is currently in the set, the probability that it remains in the set is the sum of two cases:

a) x(t+1) was selected and x(i) was not the element removed from the set
b) x(t+1) was not selected, so nothing was removed and x(i) stays.

We are given that the probability that x(t+1) is selected = m/(t+1).
The probability that x(i) was in the set at time t and was not removed when a removal happens = (1-1/m)*(m/t).
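The calculation above stops before reaching a conclusion. One way to finish it, assuming inductively that $x_i$ is in the set after $t$ arrivals with probability $m/t$, and reading case (b) as "nothing is evicted when $x_{t+1}$ is rejected", is to add the two cases:

```latex
\begin{aligned}
\Pr[x_i \text{ in the set after } t+1 \text{ arrivals}]
  &= \frac{m}{t}\left[\frac{m}{t+1}\left(1 - \frac{1}{m}\right)
     + \left(1 - \frac{m}{t+1}\right)\right] \\
  &= \frac{m}{t} \cdot \frac{(m - 1) + (t + 1 - m)}{t+1} \\
  &= \frac{m}{t} \cdot \frac{t}{t+1} \;=\; \frac{m}{t+1}.
\end{aligned}
```

This matches the probability $m/(t+1)$ with which $x_{t+1}$ itself is admitted, so under these assumptions every element seen so far is in the set with the same probability.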
