On Sun, Dec 27, 2009 at 5:47 PM, David Cournapeau <cournape@gmail.com> wrote:
> On Mon, Dec 28, 2009 at 10:37 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
>> The kmeans function has two modes. In one of the modes the initial
>> guesses for the centroids are randomly selected from the input data.
>> The selection is currently done with replacement:
>>>> guess = take(obs, randint(0, No, k), 0)
>>>> That means some of the centroids in the intial guess might be the
>> same. Wouldn't it be better to select without replacement?
>> I think you are right, but random sampling without replacement for
> floating point values is a bit hard to use here: if two values are
> different but very close, you would see the same effect, right ?
>> Generally, for clustering algorithms, I think you'd you want to start
> with centroids as far from each other as possible, so maybe the code
> could be improved taking this into account.
That's a good point. And it sounds like an interesting problem to find
the set of k points that are farthest apart. But the time it takes to
find far apart points could be used to do more iterations of randomly
selected points. I see that kmeans2 has a few different methods for
selecting the initial points. A method like the one you suggest could
be added there. (Maybe use pdist to select the two farthest apart
points and then select the point that is furthest away from the first
two points and so on. It would get slow if a lot of points were
needed.) For kmeans I'll file a ticket to draw without replacement.