I have a list of items from which I want to randomly sample a subset, but each item is paired with a histogram over D bins and I want to sample the items in such a way that the summed histogram is approximately uniform.

The absolute values of the summed histogram are not important, nor does it need to be exactly uniform; it just needs to be approximately uniform. I also don't care if the returned sample size is not exactly the specified sample size. The sampling should be without replacement.

As an aside: the items I want to sample are image patches, and the histograms are label histograms from a manual segmentation of the image.
– CvW Feb 9 '13 at 21:44

What you could do is first choose weights for your items so as to make the weighted sum (approximately) uniform, and then take a weighted sample of the items. The first part is a multivariate optimization problem, the second is relatively straightforward e.g. using cumsum() to compute the CDF and searchsorted() to sample it.
– Ilmari Karonen Feb 11 '13 at 14:30

2 Answers

To expand on @Ilmari Karonen's solution, what you want to do is compute weights for each histogram and then sample according to those weights. It appears to me that the most efficient way to do this, given your goal, would be with a linear program.

Let D_ij be the weight of the jth bin in the histogram of the ith item. Then if each item is weighted with weight w_i, bin j of the "summed histogram" has weight sum(i in items) w_i * D_ij. One way to get your "approximately uniform" distribution is to minimize the maximum difference across bins, so we would solve the following LP:

minimize    z
subject to  z >= sum_i w_i * D_ij - sum_i w_i * D_ik    for all bin pairs (j, k)
            z >= sum_i w_i * D_ik - sum_i w_i * D_ij    for all bin pairs (j, k)
The above is basically saying that z must be at least the absolute value of the difference between every weighted pair of bins. To solve this LP you will need a separate package, since numpy does not include an LP solver. See this gist for a solution using cplex, or this gist for a solution using cvxpy. Note that you will need to set some constraints on the weights (e.g. each weight greater than or equal to 0), as these solutions do. Other Python bindings for GLPK (GNU Linear Programming Kit) can be found here: http://en.wikibooks.org/wiki/GLPK/Python.
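As a rough sketch of how this LP could be set up, here is a version using scipy.optimize.linprog (my choice of solver, not the one from the linked gists). The data H and all sizes are synthetic, and I add a sum-to-one constraint on the weights as an extra assumption, since otherwise the all-zero weight vector trivially achieves z = 0:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, D = 50, 4                       # hypothetical: 50 items, 4 histogram bins
H = rng.random((n, D))             # H[i, j] = weight of bin j in item i's histogram

# Decision variables: [w_1 .. w_n, z].  Objective: minimize z.
c = np.zeros(n + 1)
c[-1] = 1.0

# For every pair of bins (j, k): |sum_i w_i (H_ij - H_ik)| <= z,
# written as two <= 0 constraints per pair.
rows = []
for j in range(D):
    for k in range(j + 1, D):
        d = H[:, j] - H[:, k]
        rows.append(np.append(d, -1.0))     #  sum_i w_i d_i - z <= 0
        rows.append(np.append(-d, -1.0))    # -sum_i w_i d_i - z <= 0
A_ub = np.array(rows)
b_ub = np.zeros(len(rows))

# Normalize the weights to sum to 1 (assumption: rules out the trivial w = 0).
A_eq = np.append(np.ones(n), 0.0)[None, :]
b_eq = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * n + [(None, None)])
w = res.x[:n]
summed = w @ H                      # weighted summed histogram, near-uniform
```

With many more items than bin pairs, the solver can usually drive z essentially to zero, i.e. the weighted summed histogram comes out flat.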

Finally, you sample item i with probability proportional to its weight w_i. This can be done with an adaptation of roulette-wheel selection using cumsum and searchsorted, as suggested by @Ilmari Karonen; see this gist.
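A minimal sketch of that cumsum/searchsorted roulette step, adapted for sampling without replacement by deleting each chosen item before the next draw (the function name and interface are mine, not from the linked gist):

```python
import numpy as np

def weighted_sample_without_replacement(weights, k, rng=None):
    """Draw up to k distinct indices, with probability proportional to weights."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float).copy()
    idx = np.arange(len(w))
    chosen = []
    # Each draw picks a nonzero-weight item, so at most count_nonzero draws exist.
    for _ in range(min(k, np.count_nonzero(w))):
        cdf = np.cumsum(w)
        u = rng.random() * cdf[-1]              # uniform point on [0, total weight)
        pick = np.searchsorted(cdf, u, side="right")  # roulette-wheel selection
        chosen.append(idx[pick])
        # Remove the picked item so it cannot be drawn again.
        w = np.delete(w, pick)
        idx = np.delete(idx, pick)
    return chosen
```

Using side="right" in searchsorted ensures that zero-weight items can never be selected, even when u lands exactly on a CDF step.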

If you wanted the resulting weighted distribution to be "as uniform as possible", you could instead solve a similar problem that maximizes the entropy of the weighted sum of bins. This problem is nonlinear, but you could use any number of nonlinear solvers, such as BFGS or other gradient-based methods. It would probably be slower than the LP method, but that depends on what your application needs. With a large number of histograms, the LP method approximates the nonlinear one very closely, because it becomes easy to reach an exactly uniform distribution.
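A sketch of the entropy-maximizing variant, here using scipy.optimize.minimize with SLSQP rather than plain BFGS, since the nonnegativity and sum-to-one constraints on the weights call for a constrained solver (the data and all sizes are again synthetic):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, D = 30, 4                       # hypothetical: 30 items, 4 bins
H = rng.random((n, D))

def neg_entropy(w):
    """Negative entropy of the normalized weighted bin sums (to be minimized)."""
    s = w @ H
    p = s / s.sum()
    return np.sum(p * np.log(p + 1e-12))   # small epsilon guards log(0)

res = minimize(neg_entropy,
               x0=np.full(n, 1.0 / n),     # start from equal weights
               method="SLSQP",
               bounds=[(0, 1)] * n,        # each weight >= 0
               constraints=[{"type": "eq",
                             "fun": lambda w: w.sum() - 1.0}])
p = (res.x @ H) / (res.x @ H).sum()        # optimized bin distribution
```

Maximum entropy over D bins is attained exactly at the uniform distribution, so the optimized p should come out close to flat.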

When using the LP solution, a number of the histogram weights may bind to 0, because a basic optimal solution has at most as many nonzero variables as constraints. This will not be a problem with a non-trivial number of bins, since the number of pairwise constraints grows quadratically in the number of bins.

Could you draw a number of complete random samples (of 500 items each), and then pick the one that is most uniform (i.e. has the lowest sample.sum(axis=0).std())? That avoids the weird biases that come with drawing incremental samples.
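This best-of-many approach can be sketched as follows (all sizes and the synthetic data H are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, D, k, n_tries = 2000, 4, 500, 200   # hypothetical sizes
H = rng.random((n_items, D))                 # per-item label histograms

best_idx, best_std = None, np.inf
for _ in range(n_tries):
    # One complete random sample of k items, without replacement.
    idx = rng.choice(n_items, size=k, replace=False)
    # Spread of the summed histogram across bins; lower = more uniform.
    std = H[idx].sum(axis=0).std()
    if std < best_std:
        best_idx, best_std = idx, std
```

best_idx then holds the most uniform of the n_tries candidate samples.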

The problem with this is that the probability of any one of those samples having a distribution that is very different from the dataset distribution is extremely small. The number of samples I would have to draw in order to have a good chance at drawing an approximately uniform sample would be way too large.
– CvW Feb 12 '13 at 9:40