[SciPy-User] kmeans

On Thu, Jul 22, 2010 at 12:07 PM, Charles R Harris <
charlesr.harris@gmail.com> wrote:
>>> On Thu, Jul 22, 2010 at 10:24 AM, Benjamin Root <ben.root@ou.edu> wrote:
>>> On Thu, Jul 22, 2010 at 10:42 AM, alex <argriffi@ncsu.edu> wrote:
>>>>> On Thu, Jul 22, 2010 at 10:48 AM, Benjamin Root <ben.root@ou.edu> wrote:
>>>>>>> On Wed, Jul 21, 2010 at 3:25 PM, alex <argriffi@ncsu.edu> wrote:
>>>>>>>>> Hi,
>>>>>>>>>> I want to nitpick about the scipy kmeans clustering implementation.
>>>>> Throughout the documentation
>>>>>http://docs.scipy.org/doc/scipy/reference/cluster.vq.html and code,
>>>>> the 'distortion' of a clustering is defined as "the sum of the distances
>>>>> between each observation vector and its dominating centroid." I think that
>>>>> the sum of squares of distances should be used instead of the sum of
>>>>> distances, and all of the miscellaneous kmeans descriptions I found with
>>>>> google would seem to support this.
>>>>>>>>>> For example if one cluster contains the 1D points (1, 2, 3, 4, 10) and
>>>>> the old center is 3, then the centroid updating step will move the centroid
>>>>> to 4. This step reduces the sum of squares of distances from 55 to 50, but
>>>>> it increases the distortion from 11 to 12.
>>>>>>>>>> Alex
>>>>>>>>>>>>> Every implementation of kmeans (except for SciPy's) that I have seen
>>>> allowed for the user to specify which distance measure they want to use.
>>>> There is no right answer for a distance measure except for "it depends".
>>>> Maybe SciPy's implementation should be updated to allow for user-specified
>>>> distance measures (e.g. - absolute, euclidian, city-block, etc.)?
>>>>>>>> Ben Root
>>>>>>>>>> While the best distance might depend on the application, I think that of
>>> all these distance measures only the sum of squares of Euclidean distances
>>> is guaranteed to monotonically decrease at each step of the algorithm. If
>>> the scipy kmeans implementation depends on this monotonicity, which I think
>>> it currently does, then this assumption could be the source of subtle bugs
>>> when error measures like distortion are used.
>>>>>> Maybe I can nitpick more effectively if I frame it as "scipy is doing
>>> something strange and possibly buggy" instead of as "scipy is not using the
>>> distance function I want".
>>>>>>>> It has been a while since I played with kmeans, but I believe that the
>> distance measure merely has to satisfy Minkowski's inequality which is that
>> the norm of a + b <= norm of a + norm of b (or was it the Cauchy-Schwarz
>> inequality?)
>>>> That's the triangle inequality and is a required property of anything
> called a norm.
>> <snip>
>> Chuck
>>>Right... why is it that I first think of the most complex concepts first...?
Ben Root
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-user/attachments/20100722/377f618d/attachment-0001.html