Wednesday, July 15, 2015

Recently I fixed the JTS KD-Tree implementation so that it works as advertised with a distance tolerance to provide point snapping. This gives a fast way to produce random point fields with even distribution (i.e. no points too close together).

First, generate a batch of random points using RandomPointsBuilder. As is well known, this produces a very "lumpy" distribution of points:

Then, put them in a KD-Tree using a snapping distance tolerance. Querying all points in the final tree produces a nice even distribution of points:
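As a rough illustration of why snapping evens out the distribution, here is a minimal JavaScript sketch of tolerance-based thinning. It is not the JTS KD-Tree code: a uniform grid stands in for the tree (an assumption made to keep the example short), and the function name thinPoints is hypothetical. A point is kept only if no previously kept point lies within the tolerance.

```javascript
// Sketch: thin a lumpy point set so that no two kept points are within `tolerance`.
// A uniform grid with cell size `tolerance` stands in for the KD-tree index (assumption).
function thinPoints(points, tolerance) {
  const cells = new Map();               // "ix,iy" -> kept points falling in that cell
  const kept = [];
  const key = (ix, iy) => ix + "," + iy;
  for (const p of points) {
    const ix = Math.floor(p.x / tolerance);
    const iy = Math.floor(p.y / tolerance);
    let snapped = false;
    // Any kept point within `tolerance` must lie in this cell or one of its 8 neighbours.
    for (let dx = -1; dx <= 1 && !snapped; dx++) {
      for (let dy = -1; dy <= 1 && !snapped; dy++) {
        const bucket = cells.get(key(ix + dx, iy + dy));
        if (!bucket) continue;
        for (const q of bucket) {
          if (Math.hypot(p.x - q.x, p.y - q.y) <= tolerance) {
            snapped = true;              // p snaps to an existing point, so it is dropped
            break;
          }
        }
      }
    }
    if (snapped) continue;
    kept.push(p);
    const k = key(ix, iy);
    if (!cells.has(k)) cells.set(k, []);
    cells.get(k).push(p);
  }
  return kept;
}
```

The result depends on insertion order, since the first point to arrive in a neighbourhood is the one that survives.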

Using the Concave Hull algorithm available here with the same distance tolerance produces a random polygon with a very pleasing appearance:

I suspect that these kinds of polygons might be useful for generating stress tests for geometric algorithms.

UPDATE: Adding a bit of Bezier Smoothing produces an even cooler-looking polygon:

Wednesday, June 12, 2013

OpenLayers provides the useful WMSGetFeatureInfo control. It's designed to work with the standard WMS GetFeatureInfo request. As per the standard, the control supports querying multiple layers via setting the layers property.

It's often necessary to define client-side filters for WMS layers, to display only a subset of the layer data in the backing feature type. Usually the filters need to be defined dynamically, based on the application context. When using GeoServer as the web mapping engine a convenient (but non-standard) way of doing this is to use the CQL_FILTER WMS parameter. (One might reasonably ask why there isn't an equally simple way to do this in the WMS standard itself, but that's another story). In OpenLayers this parameter can be added dynamically to a layer via the mergeNewParams method:

lyr.mergeNewParams({'CQL_FILTER': "filter expression" });

Naturally it is necessary to have the GetFeatureInfo control respect the layer filters as well. This is straightforward in the case of a single layer. The GeoServer CQL_FILTER parameter can be supplied using the vendorParams property on the WMSGetFeatureInfo control:

infoControl.vendorParams = { 'CQL_FILTER': 'filter expression'};

Since the CQL_FILTER parameter supports a list of filters, it's also straightforward to filter multiple layers as long as the list of layers queried is static:

infoControl.vendorParams = { 'CQL_FILTER': 'filt-1; filt-2; filt-3'};

But WMSGetFeatureInfo also provides the useful ability to query only visible layers (via the queryVisible property). This makes things much trickier, since the list of filter expressions must match the list of layers provided in the QUERY_LAYERS parameter. There's no built-in way of doing this in OpenLayers itself (not surprisingly, since the CQL_FILTER parameter syntax is specific to GeoServer only).

One way to do this is to build the CQL_FILTER parameter value dynamically using the CQL_FILTERs defined for the visible layers. This can be done when the control is invoked, by hooking the beforegetfeatureinfo event.
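A minimal sketch of this approach follows. The helper names buildCqlFilter and installFilterHook are hypothetical. The sketch assumes each OpenLayers layer requests a single WMS layer name, that a layer's filter is stored in its params.CQL_FILTER (as set by mergeNewParams), and that the control's findLayers() method returns the layers that will be queried, in QUERY_LAYERS order. Layers with no filter are given INCLUDE, the CQL filter which matches everything, so that the filter list stays aligned with the layer list.

```javascript
// Hypothetical helper: build one CQL_FILTER value for a list of WMS layers.
// GeoServer expects one filter per queried layer, separated by ';'.
function buildCqlFilter(layers) {
  return layers
    .map(function (lyr) {
      const params = lyr.params || {};
      // INCLUDE keeps the list aligned when a layer has no filter of its own.
      return params.CQL_FILTER || "INCLUDE";
    })
    .join(";");
}

// Hypothetical hook: refresh vendorParams just before each GetFeatureInfo request.
// Assumes an OpenLayers 2 WMSGetFeatureInfo control (events.register, findLayers).
function installFilterHook(infoControl) {
  infoControl.events.register("beforegetfeatureinfo", infoControl, function () {
    // `this` is the control, since it was passed as the handler scope.
    this.vendorParams = { CQL_FILTER: buildCqlFilter(this.findLayers()) };
  });
}
```

With this in place, calling installFilterHook(infoControl) once after creating the control should be enough; the filter list is then rebuilt on every click from whichever layers are visible at that moment.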

Although climbing up the OpenLayers learning curve often feels like a big struggle, it's important to recognize the very wide set of requirements that the library is trying to address. Due to the nature of spatial data, user interfaces and protocols dealing with it are inherently complex. The more I work with OpenLayers, the more appreciation I have for the fine balance between simplicity and flexibility the designers have achieved. (And if that sounds like I do not subscribe to the "Spatial is not special" canard, you're hearing me right!).

Markieta's blog post outlines his workflow using ArcGIS. It's a bit cumbersome - apart from having to jump through hoops to read the data from the original DAT files, apparently the dataset has to be split into six parts to be able to process it. (For a measly 58K rows?!)

No details are provided about styling, which is the key part of the exercise. The images apparently use alpha blending to show flight density. Also, the coordinate system seems to be more curvaceous than the squaresville Plate Carree I used (so much more haute couture than saying Lat/Long). Both of these are easy to do in JEQL. Here are some samples of the improved output, using the alpha channel and a Mollweide projection.

I find Open JUMP to be the most vector-friendly open source GIS software. The preparation of the datasets (rivers, lakes, sea, roads, borders) was really [a] piece of cake...

It's great to see the small but dedicated OpenJUMP community steadily adding new features and improving the software quality. 10 years after it was launched, OpenJUMP continues to be the "Little Open-Source GIS that Can".

Tuesday, February 12, 2013

It's no longer novel to observe that Open Source is, if not the dominant software paradigm of the era, at least one of the most significant innovations in the history of software practice. Recently it struck me how downright bizarre the Open Source paradigm really is. I can't think of another field of human endeavour where the fundamental paradigm mandates giving away the product of one's labour. Consider a few sweepingly-generalized examples:

Business - Fugedaboudit! It's all about the money. Apart from the Diggers of 60's Haight-Ashbury notoriety there aren't too many examples of businesses whose model consists of giving away their stock.

Arts - Hah! Obviously the big media companies are doing everything they can to squeeze money out of artistic endeavour. But even among the less mercantile stakeholders the main discussion is about how artists can be compensated for their creations. No-one seriously advocates that artists give away all their work for free.

Sport - Don't get me started on the gross discrepancy between compensation and value in professional sport. And at the amateur level, sponsorship and funding organizations are recognized to be essential to promoting the continued generation of sporting "product". (Wouldn't it be great if there was a similar system of sponsorship for software developers?)

Science - You might think this would be the exception that proves the rule. After all, sharing research results is a revered principle of scientific progress. The domain relies on publishing information openly to an even greater extent than in software development. But in my (admittedly limited) experience many scientists are actually quite protective of their intellectual property, since their livelihood depends in a direct way on amassing it and monetizing how it is dispensed. And it's well known that academic institutions pay very close attention to licensing the IP generated by them (or their employees).

Just to be clear, I am not suggesting that the open source paradigm is flawed or wrong. In fact, I spend the major part of my professional life living and breathing Open Source geospatial software (JTS/GEOS, JEQL, Proj4J, GeoServer, PostGIS, etc). As a means of increasing the velocity and quality of software development it's by far the best model. And it's much more democratic and self-actualizing than the semi-feudal alternatives. But it really is a subversive concept. Marxist, even. It's no wonder that it's taking so long for the suits to wrap their heads around how to deal with it.

Wednesday, February 6, 2013

Ragnvald Larsen has an interesting post on ways to mitigate the poor performance and stability of Dissolve computations in ArcGIS. Dissolve is the Arc term for the geometric union of a collection of polygons (possibly grouped by attribute, although that capability was not used in this case).

He found that using the ArcGIS Dissolve method took about 150 sec to process the dataset. In an effort to reduce this time, he experimented with partitioning the dataset and doing the union in batches. After a (presumably lengthy) series of experiments to find the optimal batch size, he was able to get the time down to 25 sec using a batch size of 110 features.

Improving union performance by partitioning the input is the basic idea behind the Cascaded Union function in JTS (which I blogged about back in 2007). Cascaded Union uses a spatial index to automatically optimize the partitioning. Ragnvald doesn't mention whether he used a spatial index, but I suspect this might be quite time-consuming to code in ArcPy.
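The partitioning idea can be sketched generically in JavaScript. This is not the JTS implementation: where JTS groups the inputs using a spatial index, this sketch simply sorts them by a caller-supplied position key so that each batch holds items which are near each other. The names cascadedUnion, unionTwo and positionOf are hypothetical, and the union operation itself is supplied by the caller.

```javascript
// Sketch of a batched ("cascaded") union. Inputs are unioned in small batches of
// neighbours, and the batch results are unioned again until one result remains.
function cascadedUnion(items, unionTwo, positionOf, batchSize = 4) {
  if (batchSize < 2) throw new RangeError("batchSize must be at least 2");
  if (items.length === 0) return null;
  // Sort a copy so that consecutive items are (roughly) spatially adjacent.
  let level = items.slice().sort((a, b) => positionOf(a) - positionOf(b));
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += batchSize) {
      // Union one batch of neighbours into a single intermediate result.
      next.push(level.slice(i, i + batchSize).reduce(unionTwo));
    }
    level = next;                        // batch results remain in sorted order
  }
  return level[0];
}
```

The benefit comes from most unions operating on small, nearby inputs, rather than repeatedly merging each new polygon into one ever-growing result.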

I thought it would be interesting to compare the performance of the JTS algorithm to the ArcGIS one. To do this I used JEQL, which provides an easy high-level way to read the data and invoke the JTS Cascaded Union. The entire process can be expressed as a very simple JEQL script:

geomUnionMem is a JEQL spatial aggregate function which is implemented using the JTS Cascaded Union algorithm. (Although not needed in this case, note that the more general Dissolve use case of unioning groups of features by their attributes can easily be achieved by using the standard SQL GROUP BY clause.)

Running this on a (late-model) PC workstation produced a timing of about 1.5 sec!

Thursday, January 3, 2013

Meh. "Basics" is the key word in that title - the article is pretty light and fluffy. Fine if you don't know squat about FP, but it's also accompanied by a whole lot of starry-eyed razzle-dazzle which isn't really justified by the content (and note that I'm not saying it's wrong, just not substantiated).

To be fair, TB does have a few gripes. Here's a few more:

The example used to show how FP wonderfully avoids variables and side-effects is that hoary old one of computing squares of integers. (I mean really hoary - this was the first program I ever wrote, in WATFIV. And I at least had cool line printer output!) How about using something that's a bit more representative of an actual computational problem? Like say, red-black trees - with deletion!

As TB points out, the people who really need to make algorithms run fast across 64 cores are a small percentage of current coders. For everyone else, scale-out is a more mundane but pressing problem. And it's not clear to me whether FP will make that easier.

As someone who spends his leisure hours trying to make spatial algorithms more performant, I'm suspicious of anything that promises to automagically make code go faster across multiple cores. In spatial, most interesting problems are not "pleasantly parallel", and many of them are memory-bound as well as being compute-bound. So advances in performance would seem to depend on better algorithms, not a different choice of language.

Back in the day I was pretty keen on FP languages - but I realized, after being exposed to Smalltalk and later Java, that a lot of their appeal was due to their (necessary) provision of automatic memory management (which was painfully lacking in the "mainstream" languages such as FORTRAN, Pascal, C - oh, and even C++).

But I'm not trying to prove a negative here. Certainly the FP features of no side-effects and lazy evaluation would seem to offer a lot of benefit for the right class of problems. And FP or FP-ish languages are more mainstream than ever before. So perhaps they really will become the mainstream language paradigm. I just hope I don't have to be coding using layers of inconveniently situated parentheses.

I've heard before about the postulated link between atmospheric lead levels (courtesy of the leaded gasoline used through the middle decades of the 20th century) and crime levels. This Mother Jones article America's Real Criminal Element: Lead is the best explanation I've seen so far (and has links to the original papers). It really sounds like this hypothesis is fully confirmed - and the best thing about this story is that it has a happy ending. (Unless you're trying to get elected as mayor - or Prime Minister - on a tough-on-crime platform).

There is a nice geospatial connection here. As with many epidemiological issues, spatial locality is an important aspect of the analyses that lead (ahem) to the conclusion. The article is chock-full of references to the spatial nature of the problem, such as:

We now have studies at the international level, the national level, the state level, the city level, and even the individual level

and my favourite:

a good rule of thumb for categorizing epidemics: If it spreads along lines of communication, he says, the cause is information. Think Bieber Fever. If it travels along major transportation routes, the cause is microbial. Think influenza. If it spreads out like a fan, the cause is an insect. Think malaria. But if it's everywhere, all at once—as both the rise of crime in the '60s and '70s and the fall of crime in the '90s seemed to be—the cause is a molecule.

Wednesday, January 2, 2013

A look back at 2012 from a software technology perspective by some of my favourite blogs:

Inspired By Actual Events - a wide-reaching roundup. I found the Java and friends links especially interesting, since the Java/JVM world is so big now it's hard to keep up with and distill the really significant events.

Tim Anderson - A strong focus on Microsoft, but also a fairly even-handed assessment of the rest of the "A"-team (Apple, Android/Google, and Amazon). (And a not-very-optimistic mention of the "B" team - BB/RIM). I always appreciate Tim Anderson's reading of the internal and external tea-leaves of MS technology. It's always fascinating to see the elephant trying to jump, in a schadenfreudal sort of way.

Tim Bray - Not really a roundup, and not all that tech-focussed, but always a good read.


Thursday, December 20, 2012

The always-readable Stephen O'Grady has an insightful post titled Do Not Underestimate the Power of Convenience. He proposes that an increasingly important factor driving the uptake of software is the developer's drive for convenience. "Convenience" is an, er, convenient term for things like ease-of-use, power, portability and low barrier to entry. In the case of software the lowest possible barrier is provided by free open source software, but the principle applies to infrastructure as well (reflected by the rapid uptake of things like cloud computing and BYOD). As evidence he lists a set of technologies whose prevalence has been driven bottom-up by developers, rather than top-down by corporate fiat. It reads like the bill-of-materials for IT infrastructure in a start-up: AWS, Linux, dynamic languages, Git, Eclipse, etc.

One reason he gives for this situation is the pleasing (to my ears) observation that "Developers are the new Kingmakers" (which he discusses in detail in another post). I suspect that these are mutually-reinforcing phenomena. Developers' drive for convenience has led to the rapid evolution of open source software (and not coincidentally the improvement of the tools which enable its development). This has led to the current situation where in many cases OSS surpasses commercial offerings, or at least is "good enough" to be used in demanding production environments. Since the developers are the ones making the software, or at least are the ones who understand it, they need to be involved in the decisions to adopt it (if they aren't spearheading the adoption themselves). (And especially since there are usually no marketing people in sight...)

In fact, I contend that the drive for convenience is the motivation for many of the advances in software and computation. A classic example is the ongoing quest to increase the concision of computer languages, by making them more expressive and providing a richer computation model. Another is the evolution of systems which can easily automate mundane tasks (with examples including the rich shell and command-line capabilities of *nix and the increasingly powerful array of build tools - the best of which of course are free and open source). Developers' preference for software which is portable and standard is also driven by convenience - everyone wants their current favourite set of software available wherever they happen to be working (and as expected, open source software typically exhibits the highest degree of portability).

As Larry Wall said, laziness and impatience are virtues for a programmer.

Taken to the extreme, the drive for convenience is simply another way of stating the ultimate goal of all computer science - to free computation from any limitations of space, time, money and power. Put this way, O'Grady's thesis is almost a tautology. But it's a valuable reminder of the constant need to push against the strong opposing forces of commercial interest and bureaucratic inertia.

Wednesday, November 14, 2012

The usual US election map is a starkly simplistic collection of red and blue blobs. This does reflect the esoteric (to a Canadian) Electoral College first-past-the-post arrangement. But after seeing how close the actual vote counts were in most states, it seemed to me like this doesn't really reflect the actual political reality of the US. Really it's pretty much a purely purple country. John Nelson has a nice map that elegantly visualizes this reality at a county level (and Brian Timoney explains why it's the only map that isn't a lie).

As another attempt at map truthiness, I used JEQL to produce the following map. It shows actual vote numbers at a state level, color themed along two dimensions:

The hue shows the relative proportion of Democrat VS Republican votes (using the now-canonical blue and red). For reference, Florida is almost exactly 50-50. This nicely shows that really the US is just varying shades of purple.

The saturation is proportional to the relative population of the state. California is fully saturated, since it's the most populous state. The inland Western and the far Northeastern states are pretty pale, since they have fairly low populations. This is roughly proportional to the weight of the state's Electoral College votes, although there are amusing anomalies.
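To make the two-dimensional theming concrete, here is a small JavaScript sketch of one way such a colour could be computed. It is not the JEQL script used for the map; the function names, the choice of the HSV colour model, and the hue range (red at 360 degrees, through purple at 300, to blue at 240) are all assumptions made for illustration.

```javascript
// Convert HSV (h in degrees, s and v in 0..1) to an [r, g, b] triple in 0..255.
function hsvToRgb(h, s, v) {
  const hh = (((h % 360) + 360) % 360) / 60;     // sector position in [0, 6)
  const c = v * s;                               // chroma
  const x = c * (1 - Math.abs((hh % 2) - 1));
  const m = v - c;
  const sectors = [[c, x, 0], [x, c, 0], [0, c, x], [0, x, c], [x, 0, c], [c, 0, x]];
  return sectors[Math.floor(hh)].map(ch => Math.round((ch + m) * 255));
}

// Hypothetical theming function: hue from vote share, saturation from population.
function stateColor(demVotes, repVotes, population, maxPopulation) {
  const demShare = demVotes / (demVotes + repVotes);
  const hue = 360 - 120 * demShare;              // 360 = red, 300 = purple, 240 = blue
  const saturation = population / maxPopulation; // most populous state is fully saturated
  return hsvToRgb(hue, saturation, 1);           // full brightness, so low saturation is pale
}
```

Under these assumptions an evenly split state with the maximum population comes out as pure purple, and a state with a very small population fades towards white whatever its vote split.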

I make no claim that this map represents any valid statistics - it's just a fun exercise in using JEQL to do spatial visualization. For reference, the script is:

Tuesday, November 13, 2012

Don Meltz initiated a fascinating flurry of performance evaluation with his post on Is QGIS a Viable Alternative to ArcGIS? and its followup ArcGIS vs QGIS Clipping Contest Rematch. He looked at a spatial processing task involving clipping a large dataset of contour lines against a fairly simple polygon. His conclusion was that QGIS was a lot faster than ArcGIS at performing this task. His final testing produced a time of 6 min 27 sec for QGIS, versus a time of 1 h 35 min for ArcGIS - and the ArcGIS run failed with a topology error! (Note: subsequently ESRI reported that they have improved their algorithm to provide much better results for this case - and presumably to enable it to actually complete!).

This inspired a lot of other people to dive in and run the same test (since Don helpfully provided the test data here). Systems tested include many commercial (ArcGIS, GlobalMapper, Manifold) and FOSS systems (QGIS, PostGIS, SpatiaLite, GRASS, OGR, uDig, OpenJUMP, etc). There's a summary of some of the timing results here (and the many comments to these posts provide lots of different timings on various software and hardware configurations).

On the Java side, Andrea Antonello of jGrass provides an in-depth description of his optimized implementation using uDig, GeoTools and JTS here. His best result was about 80 sec, using 4 cores and including data I/O. (He later tested on Amazon AWS using 32 cores, giving a 25 sec time).

The SpatiaLite wiki has a nice page showing how this problem is tackled in SQL here. It also has an excellent analysis of what this contest actually demonstrates. The key conclusion is that almost all the open source systems are using JTS or GEOS, so what is really being measured is the effectiveness of the JTS overlay algorithm.

The one exception is GRASS, which uses a completely different topological algorithm. From the results the performance of this seems similar to or only a bit slower than JTS/GEOS. This is actually quite impressive since it sounds like the algorithm is computing full topology of the data. But it's hard to know without a more detailed understanding of how it works. And in any case, comparing the contest timings is difficult since many different hardware configurations were used.

One aspect of this task is that it is "pleasantly parallel", so implementations which can multi-thread over many cores should see linear performance improvement. Only some of the systems tested were able to take advantage of this. In particular, PostGIS and SpatiaLite execute single queries in a sequential fashion, which is a shame. Perhaps this will be rectified in future releases?

It's great to see JTS and GEOS used effectively in so many different applications, and that they hold their own against the well-funded competition (and in fact often do much better).

And here's the kicker - it could run even faster! The JTS overlay algorithm was designed for the single geometry/geometry case. It has not been optimized for iterated operations against a fixed query geometry. By using a caching approach similar to the existing JTS PreparedGeometry API, it will be possible to avoid recomputing polygon topology and thus provide a significant speedup. Also, the overlay algorithm was designed to handle all geometry types, and in particular the polygon/polygon case, which is the most demanding situation. It could be optimized to handle simpler cases (such as the polygon/line overlay of this task) in a faster way. All it will take is some time, money, and some hard thinking...