I ran into an interesting mini-problem recently. I had a collection of items, each with a value, and I wanted to select one item so that the probability of an item being selected was proportional to its value.

Suppose there are 4 items (with IDs 0, 1, 2, 3) and the values of the four items are 20, 40, 10, 30. The sum of the four values is 20 + 40 + 10 + 30 = 100. I want to select item 0 with probability = 20/100 = 0.2, and item 1 with probability 40/100 = 0.4, etc. In other words, items with higher values have a higher likelihood of being selected.
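One common way to implement this kind of proportional selection is the cumulative-sum (roulette wheel) technique: draw a uniform random value between 0 and the sum of the values, then walk through the items accumulating values until the running total passes the draw. Here is a minimal sketch in Python using the example values above; the function name select_proportional is just illustrative.

```python
import random

def select_proportional(values, rng):
    # return an index with probability proportional to values[index]
    total = sum(values)          # 100 for the example values
    r = rng.random() * total     # uniform draw in [0, total)
    cum = 0.0
    for i, v in enumerate(values):
        cum += v
        if r < cum:
            return i
    return len(values) - 1       # guard against floating point round-off

rng = random.Random(0)
values = [20, 40, 10, 30]
counts = [0, 0, 0, 0]
for _ in range(10000):
    counts[select_proportional(values, rng)] += 1
# counts will be roughly [2000, 4000, 1000, 3000]
```

Over many draws, the selection frequencies converge to the 0.2, 0.4, 0.1, 0.3 probabilities described above.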

This mini-problem occurs in genetic algorithms where the items are genetic individuals and the values are their fitness. Individuals with higher fitness values are selected for breeding. The mini-problem, sometimes called fitness proportionate selection, occurs inside other algorithms too.

There are several ways to perform fitness proportionate selection. When I implement genetic algorithms, I almost always use a technique called tournament selection because it is very easy to code.
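For comparison, tournament selection is not strictly proportional, but it favors higher-value items and takes only a few lines: pick a small random subset of items and keep the one with the largest value. A sketch, with an illustrative function name and a tournament size of 2:

```python
import random

def tournament_select(values, tourn_size, rng):
    # choose tourn_size distinct candidate indices at random,
    # then return the candidate with the largest value
    candidates = rng.sample(range(len(values)), tourn_size)
    return max(candidates, key=lambda i: values[i])

rng = random.Random(1)
values = [20, 40, 10, 30]
winner = tournament_select(values, 2, rng)
```

Larger tournament sizes increase the selection pressure toward high-value items; with a tournament size equal to the number of items, the highest-value item always wins.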

But I was looking at a different problem — k-means++ initial means selection. Here the items are data tuples and their values are the squared distance to the closest existing mean. I wanted to select a data item with a large squared-distance value as the next initial mean, because you want the means to be different from each other.
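Putting the pieces together, the k-means++ step can be sketched as: compute each tuple's squared distance to its closest existing mean, then apply proportional selection over those distances. This is a minimal sketch under assumed conventions (data as tuples of floats, Euclidean distance); the function name next_initial_mean is hypothetical.

```python
import random

def next_initial_mean(data, means, rng):
    # squared Euclidean distance between two tuples
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # each tuple's squared distance to its closest existing mean
    d2 = [min(sq_dist(x, m) for m in means) for x in data]

    # proportional (roulette wheel) selection over the d2 values
    total = sum(d2)
    r = rng.random() * total
    cum = 0.0
    for i, v in enumerate(d2):
        cum += v
        if r < cum:
            return data[i]
    return data[-1]  # guard against floating point round-off

rng = random.Random(2)
data = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (7.5, 8.2)]
means = [(1.0, 1.0)]
nxt = next_initial_mean(data, means, rng)
# nxt is very likely one of the two far-away tuples
```

Tuples close to an existing mean get a squared distance near zero and are almost never chosen, while distant tuples dominate the selection — which is exactly the behavior k-means++ wants.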