I have a bunch of data points with latitude and longitude. I want to use R to cluster them based on their distance.

I have already taken a look at this page and tried the clustTool package. But I am not sure whether the clust function in clustTool treats the data points (lat,lon) as spatial data and uses the appropriate formula to calculate the distance between them.

I mean I cannot see how they differentiate between spatial data and ordinal data. I believe the distance calculation between two points on a map (spatial) is different from the one between two ordinary numbers. (Is it not?)

Also what happens if I want to consider a third parameter in my clustering? Like say if I have (lat,lon) and one other parameter. How is the distance calculated?

The other problem I have with clustTool is that it is designed with a GUI in mind. I don't know how I can skip the GUI overhead in the library because I don't need it.

So basically I want to know what options I have in R for cluster analysis of spatial data?

tnx whuber. I have a question. Is there a specific package for spatial clustering in R? I mean, as far as I understand the distance should be calculated differently for spatial data. Is this correct?
– kaptan, Dec 7 '11 at 0:06

Almost every general-purpose clustering package I have encountered, including R's cluster package, will accept dissimilarity or distance matrices as input. This makes them perfectly general and applicable to clustering on the sphere, provided you can compute the distances yourself, which is straightforward.
– whuber, Dec 7 '11 at 16:23

I have been facing a very similar problem for a long time but can't find a nice solution; you can take a look at my post on Stack Exchange. I have a set of monthly sea surface temperature data (lon,lat,sst). Have you found a way to find clusters in such spatial data? I can't find the proper R package/function. Thanks in advance, Paco
– pacomet, Jul 31 '12 at 11:44

5 Answers
I'd take a look at the Spatstat package. The entire package is dedicated to analysing spatial point patterns (sic). There's an excellent ebook written by Prof. Adrian Baddeley at the CSIRO which contains detailed documentation, how-to's and examples for the entire package. Take a look at chapter 19 for "Distance methods for point patterns".

That said, I'm not sure that even spatstat differentiates between spatial and ordinal data, so you might want to reproject your points into something with consistent x and y values - possibly try using rgdal (an R library for GDAL and OGR).

tnx. That's a great ebook. But I am not sure how clustering can be done using this Spatstat because I don't see any specific function for clustering. Can you explain a bit?
– kaptan, Dec 7 '11 at 0:53

Actually, to be fair, having looked at it again I'd look at the DCluster package - a package also by Bivand on analysing disease clusters. Also, apologies for the wait on the reply!
– om_henners, Dec 20 '11 at 13:05

There are functions for computing true distances on a spherical earth in R, so maybe you can use those and call the clustering functions with a distance matrix instead of coordinates. I can never remember the names or relevant packages though. See the R-spatial Task View for clues.
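As a rough sketch of the distance-matrix route described above, the great-circle distances can be computed in base R with the haversine formula and handed to hclust. The sample points, the helper name haversine_matrix, and the 6371 km mean earth radius are all my own assumptions for illustration; they are not from the thread or from any package.

```r
# Hypothetical sample points (decimal degrees); not data from the thread.
pts <- data.frame(
  name = c("London", "Oxford", "Paris", "Versailles"),
  lon  = c(-0.1276, -1.2577, 2.3522, 2.1301),
  lat  = c(51.5072, 51.7520, 48.8566, 48.8049)
)

# Illustrative helper: haversine great-circle distance matrix in km.
# Assumes a spherical earth with a mean radius of 6371 km.
haversine_matrix <- function(lon, lat, radius_km = 6371) {
  phi    <- lat * pi / 180
  lambda <- lon * pi / 180
  dphi    <- outer(phi, phi, "-")        # pairwise latitude differences
  dlambda <- outer(lambda, lambda, "-")  # pairwise longitude differences
  a <- sin(dphi / 2)^2 + outer(cos(phi), cos(phi)) * sin(dlambda / 2)^2
  # Clamp at 1 to guard against rounding; matrix first so dims are kept.
  2 * radius_km * asin(pmin(sqrt(a), 1))
}

d_km <- haversine_matrix(pts$lon, pts$lat)
dimnames(d_km) <- list(pts$name, pts$name)

# hclust() works on a "dist" object, so any precomputed metric can be used.
tree     <- hclust(as.dist(d_km), method = "average")
clusters <- cutree(tree, k = 2)
print(clusters)
```

Because only the distance matrix is passed on, the clustering function never sees the raw coordinates, so it does not matter that they are angles rather than planar x and y values.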

The other option is to transform your points to a reference system so that the distances are Euclidean. In the UK I can use the OSGrid reference system:

data <- spTransform(data, CRS("+init=epsg:27700"))

using spTransform, whose generic lives in package 'sp' with the methods supplied by 'rgdal'. Find a grid system for your data (the relevant UTM zone will probably do) and you'll be computing distances in metres with no problem.
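A minimal sketch of the projection route, under some assumptions of mine: it uses the sf package (st_as_sf, st_transform, st_coordinates) in place of the sp/rgdal call above, the sample cities are made up for illustration, and the input coordinates are assumed to be WGS84 (EPSG:4326). Once the points are on the British National Grid, ordinary Euclidean distances are in metres.

```r
library(sf)

# Hypothetical sample points in the UK (decimal degrees, assumed WGS84).
pts <- data.frame(
  name = c("London", "Oxford", "Edinburgh", "Glasgow"),
  lon  = c(-0.1276, -1.2577, -3.1883, -4.2518),
  lat  = c(51.5072, 51.7520, 55.9533, 55.8642)
)

# Build an sf object, then reproject to the British National Grid.
pts_sf  <- st_as_sf(pts, coords = c("lon", "lat"), crs = 4326)
pts_bng <- st_transform(pts_sf, 27700)

# Planar coordinates in metres; dist() is now a sensible Euclidean distance.
xy  <- st_coordinates(pts_bng)
d_m <- dist(xy)

tree     <- hclust(d_m, method = "average")
clusters <- cutree(tree, k = 2)
names(clusters) <- pts$name
print(clusters)
```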

This is only good if your data covers a small-ish area - if you have global data then you really do need to compute the spherical distance, and that's somewhere in one (or more) of the packages discussed in the R Spatial Task View.

While not an R package, GeoDa might be an interesting program to examine as it is written by Luc Anselin, who has contributed to spatial clustering theory, and I believe it enables some clustering (though it has been some time since I have explored it).

spdep looks like a promising R package for performing some initial calculations. It is described as follows:

A collection of functions to create spatial weights matrix objects
from polygon contiguities, from point patterns by distance and
tesselations, for summarising these objects, and for permitting their
use in spatial data analysis, including regional aggregation by
minimum spanning tree; a collection of tests for spatial
autocorrelation, including global Moran's I, APLE, Geary's C,
Hubert/Mantel general cross product statistic, Empirical Bayes
estimates and Assunção/Reis Index, Getis/Ord G and multicoloured join
count statistics, local Moran's I and Getis/Ord G, saddlepoint
approximations and exact tests for global and local Moran's I; and
functions for estimating spatial simultaneous autoregressive (SAR) lag
and error models, impact measures for lag models, weighted and
unweighted SAR and CAR spatial regression models, semi-parametric and
Moran eigenvector spatial filtering, GM SAR error models, and
generalized spatial two stage least squares models.

You can at least test if your points are randomly distributed spatially (presumably a useful test pre-clustering when considering spatial distances), but it can also generate other useful measures that you could input to your clustering algorithm. Finally, perhaps you might find useful questions on http://stats.stackexchange.com/ dealing with spatial clustering issues (though, more from a theoretical perspective).

To my knowledge, spatial clustering requires a defined neighborhood to which the clustering is constrained, at least at the beginning. The kulldorff function in the SpatialEpi package allows for spatial clustering based on aggregated neighborhoods.

Further, the DBSCAN algorithm, available as dbscan in the fpc package, could be useful.
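A hedged sketch of how that might look for lat/lon data: fpc's dbscan is given a precomputed great-circle distance matrix via method = "dist". The sample points, the helper haversine_matrix, the 6371 km earth radius, and the eps and MinPts values are all illustrative assumptions of mine, not recommendations.

```r
library(fpc)

# Hypothetical points: two tight groups plus one isolated location.
pts <- data.frame(
  lon = c(-0.1276, -0.10, -0.15, -0.12,    # group near London
           2.3522,  2.33,  2.37,  2.35,    # group near Paris
         -21.9426),                        # isolated point (Reykjavik)
  lat = c(51.5072, 51.52, 51.50, 51.49,
          48.8566, 48.87, 48.85, 48.84,
          64.1466)
)

# Illustrative helper: haversine distance matrix in km (spherical earth,
# mean radius 6371 km assumed).
haversine_matrix <- function(lon, lat, radius_km = 6371) {
  phi    <- lat * pi / 180
  lambda <- lon * pi / 180
  a <- sin(outer(phi, phi, "-") / 2)^2 +
    outer(cos(phi), cos(phi)) * sin(outer(lambda, lambda, "-") / 2)^2
  2 * radius_km * asin(pmin(sqrt(a), 1))
}

d_km <- haversine_matrix(pts$lon, pts$lat)

# method = "dist" tells dbscan to treat the input as dissimilarities.
# eps is in the same units as the matrix (km here).
fit <- dbscan(d_km, eps = 10, MinPts = 3, method = "dist")
print(fit$cluster)  # 0 marks points classified as noise
```

Unlike k-means or cutting a hierarchical tree, this does not need the number of clusters up front, and points that belong to no dense group are labelled as noise rather than forced into a cluster.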