Streaming Median

May 29, 2012

The median of a set of numbers is the middle item when the items are arranged in sorted order, if the number of items is odd, or the average of the two middle items, if the number of items is even; for instance, the median of {3 7 4 1 2 6 5} is 4 and the median of {4 2 1 3} is 2.5. The normal algorithm for computing the median considers the entire set of numbers at once; the streaming median algorithm recalculates the median of each successive prefix of the sequence of numbers, so it can be applied to any prefix of an infinite sequence. For instance, the streaming medians of the sequence 3, 7, 4, 1, 2, 6, 5 are 3, 5, 4, 3.5, 3, 3.5, and 4.
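To make the definition concrete, here is a brute-force reference in Python that simply recomputes the median of every prefix. It is only a sketch for checking answers, not the streaming algorithm described in this exercise; the function name prefix_medians is made up for illustration.

```python
from statistics import median

def prefix_medians(numbers):
    """Brute-force reference: median of every prefix, recomputed from scratch."""
    return [median(numbers[:i]) for i in range(1, len(numbers) + 1)]

# The example sequence from the text.
print(prefix_medians([3, 7, 4, 1, 2, 6, 5]))
```

Each prefix is sorted again inside median, so this does far more work than necessary on long inputs, but it is handy as an oracle when testing a faster implementation.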

The streaming median is computed using two heaps. All the numbers less than or equal to the current median are in the left heap, which is arranged so that the maximum number is at the root of the heap. All the numbers greater than or equal to the current median are in the right heap, which is arranged so that the minimum number is at the root of the heap. Note that numbers equal to the current median can be in either heap. The count of numbers in the two heaps never differs by more than 1.

When the process begins, the two heaps are empty. The first number in the input sequence is added to one of the heaps (it doesn’t matter which) and returned as the first streaming median. The second number in the input sequence is then added to the other heap. If the root of the right heap is now less than the root of the left heap, the two numbers are swapped between the heaps. The average of the two numbers is returned as the second streaming median.

Then the main algorithm begins. Each subsequent number in the input sequence is compared to the current median, and added to the left heap if it is less than the current median or to the right heap if it is greater than the current median; if the input number is equal to the current median, it is added to whichever heap has the smaller count, or to either heap arbitrarily if they have the same count. If that causes the counts of the two heaps to differ by more than 1, the root of the larger heap is removed and inserted in the smaller heap. Then the current median is computed as the root of the larger heap, if they differ in count, or the average of the roots of the two heaps, if they are the same size.
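The steps above can be sketched in Python roughly as follows. This is one possible reading of the description, not the suggested solution: it assumes Python's heapq module (a min-heap), stores the left heap as negated values so that it behaves as a max-heap, and folds the special handling of the first two numbers into the general loop, since the rebalancing step produces the same result. The name streaming_medians is made up for illustration.

```python
import heapq

def streaming_medians(numbers):
    """Yield the median of each successive prefix of `numbers`."""
    left = []    # smaller half; values are negated so heapq acts as a max-heap
    right = []   # larger half; an ordinary min-heap
    median = None
    for x in numbers:
        if median is None:
            heapq.heappush(left, -x)          # first number: either heap will do
        elif x < median or (x == median and len(left) <= len(right)):
            heapq.heappush(left, -x)          # below the median, or a tie sent to the smaller heap
        else:
            heapq.heappush(right, x)
        # Rebalance if the counts now differ by more than 1.
        if len(left) > len(right) + 1:
            heapq.heappush(right, -heapq.heappop(left))
        elif len(right) > len(left) + 1:
            heapq.heappush(left, -heapq.heappop(right))
        # Root of the larger heap, or the average of both roots if sizes match.
        if len(left) > len(right):
            median = -left[0]
        elif len(right) > len(left):
            median = right[0]
        else:
            median = (-left[0] + right[0]) / 2
        yield median

print(list(streaming_medians([3, 7, 4, 1, 2, 6, 5])))
```

Because it is a generator, it can be fed an unbounded iterator and consumed lazily, although both heaps together still hold every number seen so far.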

Your task is to write a function that computes the streaming medians of a sequence. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

8 Responses to “Streaming Median”

Python’s heapq library implements min-heaps. I push and pop the smaller numbers in the negative, so that min and max switch their roles. Must be recent Python (possibly 3, which I use, or 2 with the proper incantation) for division to work as intended.
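For readers who have not seen the negation trick, here is a tiny illustration. It is not the commenter's code, just a sketch of the idea using heapq calls from the standard library. For Python 2, the incantation meant is presumably `from __future__ import division`, which makes / perform true division.

```python
import heapq

max_heap = []
for x in [3, 7, 4, 1]:
    heapq.heappush(max_heap, -x)      # store negatives: smallest negative = largest value

largest = -max_heap[0]                # peek at the maximum without removing it
popped = -heapq.heappop(max_heap)     # remove the maximum
next_largest = -max_heap[0]           # the new maximum after the pop

print(largest, popped, next_largest)
```

The trick only works for values that can be negated, so for other kinds of keys a wrapper with a reversed comparison would be needed instead.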

(I have been confused about the associativity of median of three in the past. It wasn’t suitable for this problem after all, so I followed the instructions quite humbly this time. No major blunders, hopefully :)

I keep the heaps so that they are the same size or the right heap is 1 element bigger than the left heap. That way, the median is the top of the right heap if the right heap is bigger; otherwise, it’s the average of the top of both heaps.

Mike, very nice. Here’s the same in Scheme, assuming mutating min-heap operations make-heap, size, least, push!, pop!. It applies a procedure to each consecutive median of a list. (Not tested. I have neither the heap operations nor the time.)

It might also be useful to have a function that is “fed” one element at a time from the input stream and that returns the median of all the elements that have been fed to the function so far. That can be done like so:
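The snippet from that comment is not reproduced here. As a stand-in, here is a minimal sketch of such a feed-one-at-a-time object in Python. It borrows the invariant from the earlier comment (the right heap is the same size as the left heap or one element bigger); the class name RunningMedian and the method name feed are made up for illustration.

```python
import heapq

class RunningMedian:
    """Returns the median of everything fed so far, one element at a time."""

    def __init__(self):
        self._left = []    # smaller half, negated so heapq acts as a max-heap
        self._right = []   # larger half, ordinary min-heap

    def feed(self, x):
        # Pass x through the left heap: the largest of (left + x) moves to the right.
        heapq.heappush(self._right, -heapq.heappushpop(self._left, -x))
        # The right heap may now be two bigger than the left; give one back.
        if len(self._right) > len(self._left) + 1:
            heapq.heappush(self._left, -heapq.heappop(self._right))
        # Right heap bigger: its root is the median. Otherwise average the roots.
        if len(self._right) > len(self._left):
            return self._right[0]
        return (-self._left[0] + self._right[0]) / 2

rm = RunningMedian()
print([rm.feed(x) for x in [3, 7, 4, 1, 2, 6, 5]])
```

Routing every new element through the left heap avoids comparing it with the current median, at the cost of a little extra heap work per element.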

This is kind of an ill-posed (albeit clever) solution. If we were really dealing with a stream of numbers, it would be infeasible to store the entire stream in memory, and this solution requires us to store the entire history as we go along. Of course, if the process generating the numbers doesn’t change over time, then we can abort this computation after a while and be confident that our streaming median has converged to an appropriate approximation of the true median. But the point of a streaming median is to allow the median to change over time when the process generating the numbers changes.

Here is a more appropriate solution, which is used in practice and only requires O(1) space (the earliest reference I’ve seen to this method is in McFarlane, “Segmentation and tracking of piglets in images”). Keep a number m representing the current median estimate. Perhaps initialize this to the median of the first few elements of the sequence, or alternatively the first element in the sequence, or some reasonable guess. Then when a new element x arrives, if x is greater than m you increment m by some fixed amount, and if x is less than m you decrement m by that fixed amount; if x equals m, m stays put. As more elements arrive, m drifts toward the true median and then hovers around it, with an accuracy limited by the choice of the increment size.
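A minimal sketch of that update rule in Python, assuming the estimate is initialized to the first element and using a hypothetical parameter step for the fixed increment (the function name is also made up):

```python
def approximate_streaming_medians(numbers, step=1):
    """Yield a running O(1)-space estimate of the median after each element."""
    m = None
    for x in numbers:
        if m is None:
            m = x            # assumption: start from the first element
        elif x > m:
            m += step        # nudge the estimate up
        elif x < m:
            m -= step        # nudge the estimate down
        yield m              # unchanged when x equals m

print(list(approximate_streaming_medians([3, 7, 4, 1, 2, 6, 5])))
```

On the short example from the exercise the estimates only roughly track the exact streaming medians, which is expected: the method trades accuracy on each prefix for constant memory.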

I imagine one can add additional spot-checking techniques to update the increment dynamically, but in my applications I only work with integer sequences, so an increment size of 1 suffices. Also, as expected, this technique does not converge nearly as fast as the heap method, especially with a poor initial choice for m. On the other hand, each update takes constant time and needs no extra memory, which is about as cheap as a streaming median algorithm can possibly be.