Tuesday, January 13, 2009

I just saw this blog post in rebuttal to Martin Fowler's DatabaseThaw article. It's a typically well-thought out response from Typical Programmer. The short summary is that he rejects Fowler's criticism of using RDB's as the mechanism for application integration. (And don't dismiss him as simply a RDB bigot - he's not shy about nailing relational gurus, as in Relational Database Experts Jump The MapReduce Shark).

TP's arguments ring true to me, although in my experience it takes a very concientious DB shop to fully maintain the "encapsulation of data and operations" across many applications, many custodians, and many developers.

It would be great if Fowler's vision of the integration point being HTTP rather than SQL solved all our problems. Loose coupling is a wonderful thing to have. But so far I'm not seeing how this provides equivalent performance for all cases, and how it avoids the problem of data model mismatch.

Friday, January 9, 2009

Martin Fowler has a great blikiTM post on DatabaseThaw. The soundbite is something like "What happens after relational DBs?". He has some interesting links to things like Drizzle.

His opinon of relational DBs is that

their dominance is due less to their role in data management than their role in integration

This makes sense, but I think they also have offered the benefit of providing a fairly powerful, standard way of modelling and querying data. Over the lifetime of this technology this has fought off some powerful challengers, notably object-oriented data modelling. (Although perhaps this battle was only won by becoming more like the "loser"). The latest challenger is semantic data modelling (RDF, SPARQL, etc). Perhaps the world is ready to make this transition - although I'm not rushing to sell my Oracle stock.

He also makes a good point that the focus of integration is shifting from the DB to the Web. Hard to argue with this... the clouds are moving in!

Thursday, January 8, 2009

The GEOS team is working on porting the JTS CascadedPolygonUnion algorithm. Reasonably enough they asked whether there were unit tests that they could use to check the correctness of their code. I was a bit dismayed to realize that in fact there weren't any unit tests for this class (although having used the code extensively I'm pretty confident that it's correct.)

So fine, some unit tests are needed. This should be easy, right? Just create or obtain some polygonal datasets, union them using both CascadedPolygonUnion (fast) and Geometry.union (slow), and compare the results.

But there's a catch (in geometric processing there usually is). Due to rounding effects, there's no guarantee that the results of the two algorithms will be vertex-by-vertex identical. Of course, they should be very, very similar... but how to check this?

In geometry processing it's usually much harder to test "similar" than it is to test "exact", and this case is no exception. In fact, there's many different tests that could be used, depending on how you want to define "similar". In the context of unit tests, similarity testing is essentially checking for discrepancies which might be introduced into a theoretically correct result. So the tests to use may depend on the operation(s) that are being performed, since different kinds of algorithms can produce different kinds of discrepancies. A further issue which needs to be considered is to detect gross differences which might result from catastrophic algorithm failure.

So now the task is to write a SimilarityValidator for geometries. For polygons there's various characteristics which can be compared:

A simple thing to do is to compare basic quantities like the area and perimeter length of the polygons. Paradoxically, these are good for detecting small discrepancies, but not for detecting gross ones (e.g. two identical rectangles which are offset by a large distance)

One obvious thing to do is to compute the symmetricDifference of the polygons and inspect the area of the result. This has a couple of issues. One is what area tolerance to use? A more serious issue is that the symmetric difference operation is itself an approximation to an exact answer. This is especially true in situations where the inputs are very close - which is exactly the situation we are trying to check! So there may be a certain lack of confidence that this is detecting all discrepancies

A more robust test is to compute the Hausdorff distance between the rings of the polygons. This allows expressing the tolerance in terms of distance, which is more intuitive. It's also fairly robust and easy to confirm the answer. There's a challenge in developing an efficient Hausdorff algorithm - but that's a topic for another blog post.

Other tests might include comparing structural things like comparing the number of rings in the polygons.

Hausdorff Distance between two polygons

With this validator written and some test datasets run I can finally confirm that CascadedPolygonUnion actually works!