I presume everyone is familiar with the problem of finding a median of a set of numbers. (A median of a set is an element of the set such that at most half of the elements are smaller than it, and at most half are larger. For instance, in the set 1, 2, 3, 4, 5, 6, the number 3 is a median, and so is 4.) A simple algorithm sorts the numbers and then takes the middle element of the sorted list. But this takes Ω(N log N) time in the worst case, as it needs to sort.

Warm-up problem: write a subroutine that takes a set of numbers (in no particular order, possibly with duplicates) and returns a median. The sub should run in O(N) time.

Generalizing this to 2 dimensions is easy: find a line that separates a set of points in 2-d into two subsets, each at most half the size of the original set. You just ignore the y-coordinates, find the median of the x-coordinates, and pick the vertical line with that x-coordinate.
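A minimal sketch of that reduction in Python (the function name is made up for illustration). It leans on statistics.median_low, which sorts internally, so it shows the idea rather than the O(N) bound:

```python
import statistics

def vertical_split(points):
    """Return x0 such that the vertical line x = x0 leaves at most half
    of the points strictly on each side."""
    # Ignore the y-coordinates entirely and take a median of the x's.
    return statistics.median_low([x for x, _ in points])

points = [(1, 5), (2, 9), (3, 1), (4, 4)]
x0 = vertical_split(points)
left = sum(1 for x, _ in points if x < x0)
right = sum(1 for x, _ in points if x > x0)
```

With these four points, one point ends up strictly left of the line and two strictly right, which is within the "at most half" limit on both sides.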

But it's more interesting if you have two sets of points: a set of red coloured points, and a set of green coloured points (both in 2-d). Now, there is a line (or more than one) that simultaneously divides the set of red points and the set of green points such that on the left of the line you have at most half of the red points and half of the green points, and on the right of the line, you have at most half of the red points, and at most half of the green points. (This means that if there are an odd number of red points and an odd number of green points to start with, the dividing line will contain at least one red, and at least one green point).

Challenge 1a: Write a subroutine that accepts two sets of points (in 2-d), and returns a line simultaneously dividing the sets as described above.
Challenge 1b: Can you do it in O(N log N) time, where N is the total number of points in both sets?
Challenge 1c: Can you do it in linear time, or prove it to be impossible?
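To make the requirement in 1a concrete, here is a brute-force sketch in Python. It is only a baseline for checking cleverer answers, and says nothing about 1b or 1c: it tries every line through two distinct input points, which is roughly cubic in N. The function names are invented for this example, points are assumed to be tuples of integers so the arithmetic is exact, and the sketch assumes there are at least two distinct points.

```python
from itertools import combinations

def side(p, q, r):
    """Cross product sign: > 0 if r is left of the directed line p->q,
    < 0 if right, 0 if r lies on the line."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def halves(p, q, points):
    """True if at most half of points lie strictly on each side of line pq."""
    left = sum(1 for r in points if side(p, q, r) > 0)
    right = sum(1 for r in points if side(p, q, r) < 0)
    return 2 * left <= len(points) and 2 * right <= len(points)

def bisecting_line(red, green):
    """Return two points defining a line that halves both sets, or None
    if no line through two distinct input points qualifies."""
    for p, q in combinations(set(red) | set(green), 2):
        if halves(p, q, red) and halves(p, q, green):
            return p, q
    return None

line = bisecting_line([(0, 0), (2, 0), (1, 3)], [(5, 5), (6, 7), (7, 4)])
```

For the sample sets, the line through (0,0) and (5,5) is one valid answer; which pair the search reports first depends on iteration order.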

Now, what holds for 2-d holds for 3-d (and higher dimensions as well). Given sets of red, green and yellow points, there exists a plane that divides the three sets such that at most half of the red, green and yellow points are above the plane, and at most half of each set is below the plane. (This even holds for sets with infinitely many points, leading to the 'ham-cheese sandwich theorem' that states you can cut a ham-cheese sandwich into two parts with a single cut such that both halves have equal amounts of ham, cheese and bread - even if you leave the cheese in the fridge).

Challenge 2a: Write a subroutine that takes three sets of points in 3-d, and returns a plane dividing all the sets as described above.
Challenge 2b: Prove that your solution is optimal.

The warm-up problem isn't so trivial. Two possible solutions are described in Cormen – Leiserson – Rivest – Stein: Introduction to Algorithms. One of these is randomized and runs in expected O(n) time. (Update: the other one runs in guaranteed O(n) time, as Perl Mouse has noted in his reply. I'm sorry this wasn't clear from my original post.)

I've recently implemented this randomized algorithm in Perl, although my implementation is not a very efficient one, as it would be possible to do all its operations in place (with only O(n) extra memory and more importantly less time).
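For readers who haven't seen it, the randomized idea can be sketched like this in Python. This is not the Perl implementation mentioned above, just an illustration with invented names, and like that implementation it copies lists rather than working in place:

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based). Expected O(n) time,
    but quadratic if the random pivots are consistently unlucky."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        smaller = [v for v in items if v < pivot]
        equal = [v for v in items if v == pivot]
        if k < len(smaller):
            items = smaller                      # answer is left of the pivot
        elif k < len(smaller) + len(equal):
            return pivot                         # answer is the pivot itself
        else:
            k -= len(smaller) + len(equal)       # answer is right of the pivot
            items = [v for v in items if v > pivot]

def median(values):
    """Return the lower middle element, which satisfies the thread's definition."""
    values = list(values)
    if not values:
        raise ValueError("median of an empty set is undefined")
    return quickselect(values, (len(values) - 1) // 2)
```

Grouping the elements equal to the pivot separately keeps duplicates from causing trouble.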

Nice, but the worst case running time is Ω(n^2). It suffers from the same problem as Quicksort: picking a random pivot works well often enough to get a good expected running time, but if you're unlucky, it's really slow.

There is an algorithm to do it in guaranteed linear time (although when done in Perl, the constants are so high that for most practical situations, one is better off sorting in C and picking the middle element).

Perl Mouse,
I haven't started working on any of the challenges yet, because I wanted to raise a question first. When I learned about means, modes, and medians in statistics, I thought I remembered learning that the median of an even-length list is the average of the two middle numbers.

It depends on how you look at the problem. If you look at it as the 1-d variant of the "divide sets using a simplex" problem, any number between the two middle numbers will do. However, if you want to write a Quick Sort whose running time is guaranteed to be O(N log N), you need to find a median in linear time, and you want to find an element of the set - not something in between.

Whether you find one of the middle elements, or pick a number in between, I'll accept either solution. ;-).

Are you sure about this? Once you have found a number so that exactly half of the numbers are to the left and half are to the right, couldn't you separate these two classes of numbers, sort them separately, and still get an O(N log N) time sort this way?

If I had to define the median, I'd say that if you have an even number of data points, any number between the two middle ones is a median. This definition is equivalent to saying that the median is a number whose total distance from the given numbers is minimal. This latter definition has parallels: the mean is the number for which the sum of squared distances from the given numbers is minimal. More precisely, given the sequence (x_1, ..., x_N), the mean is the number A that minimizes the expression |x_1 - A|^2 + ... + |x_N - A|^2; the median is M if it minimizes |x_1 - M| + ... + |x_N - M|. Furthermore, informally speaking, the mode C minimizes |x_1 - C|^epsilon + ... + |x_N - C|^epsilon, where epsilon is a very small positive number.
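These three characterizations are easy to check numerically. The Python sketch below (helper names are invented, and the candidate grid of tenths is an arbitrary choice) evaluates the sum for exponents 1, 2 and a small epsilon:

```python
def total_distance(xs, center, power=1):
    """Sum of |x - center|**power over all x in xs."""
    return sum(abs(x - center) ** power for x in xs)

def best_center(xs, power, candidates):
    """Return the candidate that minimizes the sum (first one on ties)."""
    return min(candidates, key=lambda c: total_distance(xs, c, power))

grid = [i / 10 for i in range(0, 71)]          # candidates 0.0, 0.1, ..., 7.0
xs = [1, 2, 3, 4, 5, 6]

# power 1: every point between the two middle numbers gives the same minimum
d3, d35, d4 = (total_distance(xs, c) for c in (3, 3.5, 4))
# power 2: the mean wins
mean_like = best_center(xs, 2, grid)
# tiny power: the most frequent value wins
mode_like = best_center([1, 2, 2, 5], 0.01, grid)
```

For 1..6 the sum of absolute distances is 9 at 3, at 3.5 and at 4, which matches the "any number between the two middle ones" definition.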

Observation 1 (or maybe it's just obvious):
For even numbers of points, there may be more than one correct answer (even aside from trivially jittering the dividing line back and forth a little). Example: two red points (0,0),(1,1) and two green points (1,0),(0,1). Plotting them:

They can be divided vertically or horizontally. Declaring an odd number of points of each color and requiring that no two points be at the same spot may force a unique solution, but I'm not sure. Wow! This is a tough problem!
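The four-point example can be checked mechanically. In this Python sketch (the helper name and the a*x + b*y = c line representation are choices made for the illustration), a line passes if at most half of a set lies strictly on each side:

```python
def splits(points, a, b, c):
    """True if at most half of points lie strictly on each side of a*x + b*y = c."""
    below = sum(1 for x, y in points if a * x + b * y < c)
    above = sum(1 for x, y in points if a * x + b * y > c)
    return 2 * below <= len(points) and 2 * above <= len(points)

red = [(0, 0), (1, 1)]
green = [(1, 0), (0, 1)]

vertical = splits(red, 1, 0, 0.5) and splits(green, 1, 0, 0.5)      # x = 0.5
horizontal = splits(red, 0, 1, 0.5) and splits(green, 0, 1, 0.5)    # y = 0.5
diagonal = splits(red, 1, -1, 0) and splits(green, 1, -1, 0)        # x = y
```

All three lines pass for these points, so the answer is indeed not unique here. The diagonal works because both red points sit on the line itself.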
Update: Turns out there are at least some graphs with an odd number of each color of node and where there are multiple correct answers.
Example:

I really doubt it because any point can have a median drawn through it. Remember, the median can go in any direction. For any point, you can find some angle to draw a line through that point that will separate all other points of that color into two equally-sized groups. So if it's a tree structure, it's not a standard one where divisions are made parallel to the axis of the graph.

I've not yet convinced myself that this is soluble in the general case.

In the 2D case, if all the points in both groups have one coordinate in common, and there are an odd number of points in each group, or in the more general case of all the points lying on a straight line at any arbitrary angle:

(1b) I have an n*lg(n) solution for 2-D. Sorry, BrowserUK: I was SO wrong in saying no tree. My solution (which I have not coded) uses a PR quadtree. My solution is certainly not the only approach that will work.

(1c) I have no idea (yet) how this might be brought up to O(n), nor do I have a clue (yet) how to prove it impossible.

(2a) Also, while a PR quadtree can be used in 3D, my approach does not extend easily to 3D. More thought required.

Question: For a puzzle, ought I post pseudocode/code on completion, or is it polite to instead leave the solution unposted, giving others the fun of solving it?

No, the proof doesn't need an expected running time. The running time T(N) is expressed as:

T(N) = T(N/5) + T(7N/10 + 10) + Ο(N);

which has T(N) = Ο(N) as a solution.
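For reference, the recurrence matches a selection routine of roughly the following shape: take medians of groups of five, recurse on those medians to get a pivot (the T(N/5) term), then recurse on one side of the pivot (the T(7N/10 + constant) term). This Python sketch uses invented names and copies lists freely, so it illustrates the structure rather than a tuned implementation:

```python
def select(items, k):
    """Return the k-th smallest element (0-based) in worst-case linear time."""
    items = list(items)
    if len(items) <= 5:
        return sorted(items)[k]
    # O(N): sort each group of at most five and collect the group medians.
    groups = [sorted(items[i:i + 5]) for i in range(0, len(items), 5)]
    medians = [g[(len(g) - 1) // 2] for g in groups]
    # T(N/5): the median of the medians becomes the pivot.
    pivot = select(medians, (len(medians) - 1) // 2)
    smaller = [v for v in items if v < pivot]
    equal_count = sum(1 for v in items if v == pivot)
    # T(7N/10 + constant): only one side can contain the answer.
    if k < len(smaller):
        return select(smaller, k)
    if k < len(smaller) + equal_count:
        return pivot
    larger = [v for v in items if v > pivot]
    return select(larger, k - len(smaller) - equal_count)
```

A median is then select(items, (len(items) - 1) // 2).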

However I believe it will be much harder to prove that the push is O(1) - indeed I suspect it is not - and without that the algorithm as a whole cannot be O(n).

It doesn't have to be. What's needed is that the push has an amortized running time of Ο(1) - that is, if we perform N pushes, the total running time is still bounded by Ο(N). And from what I understand of how allocation of array sizes works (an additional 20% of memory is claimed each time), a push has an amortized Ο(1) performance. A single push may take Θ(N) running time, but N pushes average it out.
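A toy model makes the amortized argument tangible. The Python sketch below is not Perl's actual allocator: it simply assumes that a full array is moved to a block about 20% larger, and counts how many element copies N pushes cause in total.

```python
def copies_for_pushes(n):
    """Count element copies caused by reallocation during n pushes
    onto a toy growable array (assumed growth: about 20% per resize)."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            copies += size                       # move existing elements
            capacity += max(1, capacity // 5)    # claim roughly 20% extra
        size += 1
    return copies

total = copies_for_pushes(100000)
per_push = total / 100000
```

Under this assumption the total stays within a small constant multiple of N, because the block sizes grow geometrically, even though an individual push that triggers a resize copies every element.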

1, was the solution to this ever posted? It seems like in the exchange with ambrus there was a pretty strong hint to the solution for challenge 1, but as to everything else - huh?

2, for future reference, a pretty unintimidating guide to sorting (just some class notes) is at a sorting

I'm kind of reading up on that during breaks. I take it that sorting is at the core of the problem space here, but maybe I'm wrong on that. Anyway, thanks for posting an interesting problem. A solution would be nice though :)

If I calculate the median point of both datasets (using the minimised Euclidean distance method, 2D for now), I get two points, one for each set of colors. These rarely match up with any of the given points.

If I project the line through those two points, it appears to divide the dataset as required. Is this the correct approach?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

I think you're suggesting taking four medians: the median of the X coordinates of the reds, the median of the Y coordinates of the reds, and the same for the greens. I don't understand what you might be doing with minimised Euclidean distance, though. Mind explaining? Then maybe I can tell you if it's on track with my approach (which is NOT yet O(n)).

Whilst the top point is the median along the X axis (looking up), the bottom right point is the median if you are looking in from the top left. Equally it's the bottom left point, if you look in from top right. Which would be the "correct median" depends upon the relative positioning of the other set of three points; or more correctly, their median. And the above three points can be rotated through 0->120°, giving an infinite number of directions from which to view the dataset (or transformations you could apply) in order to assess the median.

Which I think means that the warm-up problem is an almost complete red herring!

You cannot work out which direction to look in (or which transformation of the coordinate system to apply) to determine the median for this dataset until you know the median of the other, and vice versa. So for an R2 dataset, or for higher dimensions, you cannot use the 'sort and take the middle' or K'th ordered element approach that you would use for an R1 dataset.

That leads you (led me?) to think about how to determine the median of a set of points in R2, without reference to the other dataset. And that's when I found the Euclidean distance method.

The premise is that the median of an R2 dataset is the point at which the sum of the Euclidean distances between that point and the points in the dataset is minimised.
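One common way to approximate such a point is Weiszfeld's iteration, which repeatedly re-averages the data with weights 1/distance. The post does not say which method was actually used, so the Python sketch below is only one possibility, with invented function names. If the iterate lands exactly on a data point it simply stops there, which is not always the true minimiser.

```python
import math

def sum_of_distances(points, c):
    """Total Euclidean distance from c to every point."""
    return sum(math.hypot(x - c[0], y - c[1]) for x, y in points)

def geometric_median(points, iterations=200):
    """Approximate the point minimising the sum of Euclidean distances."""
    n = len(points)
    cx = sum(x for x, _ in points) / n           # start at the centroid
    cy = sum(y for _, y in points) / n
    for _ in range(iterations):
        wx = wy = wsum = 0.0
        for x, y in points:
            d = math.hypot(x - cx, y - cy)
            if d == 0.0:
                return (cx, cy)                  # simplification: stop on a data point
            wx += x / d
            wy += y / d
            wsum += 1.0 / d
        cx, cy = wx / wsum, wy / wsum
    return (cx, cy)

square = [(0, 0), (2, 0), (0, 2), (2, 2)]
centre = geometric_median(square)
```

For the corners of a square the result is the centre, as symmetry suggests. For lopsided data it generally differs from the centroid and need not coincide with any of the given points.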

There are other methods, including the point that minimises the sum of the areas of the sets of triangles formed between that point and pairs of points of the dataset, but that seems much harder to calculate.
