2012/2/17 Raúl Raja Martínez <raulraja@gmail.com>
> Hello everyone,
>
> I'm working on a application that uses Cassandra and has a geolocation
> component.
> I was wondering beside the slides and video at
> http://www.readwriteweb.com/cloud/2011/02/video-simplegeo-cassandra.php that
> simplegeo published regarding their strategy if anyone has implemented
> geohash storage and search in cassandra.
> The basic usage is to allow a user to find things close to a geo location
> based on distance radius.
>
> I though about a couple of approaches.
>
> 1. Have the geohashes be the keys using the Ordered partitioner and get a
> group of rows between keys then store the items as columns in what it would
> end up looking like wide rows since each column would point to another row
> in a different column family representing the item nearby.
>
That's what we did early on at SimpleGeo.
> 2. Simply store the geohash prefixes as columns and use secondary indexes
> to do queries such as >= and <=.
>
This seems like a reasonable approach now that secondary indexes are
available. It might even address some of the hotspot problems we had with
the order preserving partitioner since the indices are distributed across
all hosts. Of course there are tradeoffs there too. Seems like a viable
option for sure.
> The problem I'm facing in both cases is ordering by distance and searching
> neighbors.
>
This will always be a problem with dimensionality reduction techniques like
geohashes. A brief bit of pedantry: it is mathematically impossible to do
dimensionality reduction without losing information. You can't embed a 2
dimensional space in a 1 dimensional space and preserve the 2D
topology. This manifests itself all sorts of ways, but when it comes to
doing kNN queries it's particularly obvious. Things that are near in 2D
space can be far apart in 1D space and vice versa. Doing a 1D embedding
like this will always result in suboptimal performance for at least some
queries. You'll have to over-fetch and post-process to get the correct
results.
That said, a 1D embedding is certainly easier to code since
multidimensional indexes are not available in Cassandra. And there are
plenty of data sets that don't hit any degenerate cases. Moreover, if
you're mostly doing bounding-radius queries the geohash approach isn't
nearly as bad (the only trouble comes when you want to limit the results,
in which case you often want things ordered by distance from centroid and
the query is no longer a bounding radius query - rather, it's a kNN with a
radius constraint). In any case, geohash is a reasonable starting point, at
least.
The neighbors problem is clearly explained here:
> https://github.com/davetroy/geohash-js
>
> Once the neighbors are calculated an item can be fetched with SQL similar
> to this.
>
> SELECT * FROM table WHERE LEFT(geohash,6) IN ('dqcjqc',
> 'dqcjqf','dqcjqb','dqcjr1','dqcjq9','dqcjqd','dqcjr4','dqcjr0','dqcjq8')
>
> Since Cassandra does not currently support OR or a IN statement with
> elements that are not keys I'm not sure what the best way to implement
> geohashes may be.
>
Can't you use the thrift interface and use multiget_slice? If I recall
correctly, we implemented a special version of multiget_slice that stopped
when we got a certain number of columns across all rows. I don't have that
code handy but we did that work early in our Cassandra careers and,
starting from the thrift interface and following control flow for the
multiget_slice command, it wasn't terribly difficult to add.
Mike