Darren Kopp

Navigating through software, line by line

Working around the performance problem of STUnion

When life gives you an O(nk) algorithm, it's time to get creative
and respond with your own logarithmic algorithm.

At DevResults we do a lot of work with maps. One such activity involves kml
shape files of things like countries, states / provinces, cities, etc. While learning how to set up a new site,
we grabbed the shape files for the United States off of GADM and started the import process.

An hour later, the process still hadn’t finished the first (and largest) shape file.
Something was definitely wrong and so I dug in and found out where the code was stuck at and found that
the code was iterating through all the points and combining them together into one big SqlGeography instance
via the STUnion method.
Below is an example of the code that was doing this

When I started searching around for some alternatives to STUnion, I found this graph:

Classic O(nk) problem from the look of it.

Taming the beast

From the graph, we know that we can combine small shapes very quickly, and large shapes take a very long time.
This means for our algorithm we want to minimize the number of operations we do on large shapes.
With this optimization goal in mind, we can make a divide and conquer algorithm that will combine shapes
into increasingly larger shapes where finally at the end we only have two shapes to merge.

First, lets get the implementation out of the way.

publicstaticSqlGeographyCombinePolygons(List<SqlGeography>polygons,intstart,intend){if(start>end)returnnull;if(start==end)returnpolygons[start];intmidpoint=(start+end)/2;SqlGeographyleft=CombinePolygons(polygons,start,midpoint);SqlGeographyright=CombinePolygons(polygons,midpoint+1,end);// if both values have shapes, combineif(left!=null&&right!=null)returnleft.STUnion(right);// if only one has a shape, pick whichever has a valuereturnleft??right;}

So, the way this works is it recursively splits the list of polygons until it gets down to two single polygons,
unions them together, then returns that shape, which then gets unioned with another, until in the end we have one shape.
This means that most of our operations are done on small objects, thus minimizing the large objects we have to merge.

For example, consider 512 polygons.

Operation Count

Left Size

Right Size

256

1

1

128

2

2

64

4

4

32

8

8

16

16

16

8

32

32

4

64

64

2

128

128

1

256

256

We can easily see that we are now minimizing the number of operations that we are doing on large shapes.

Reviewing the solution

So how well does this work out? Well the first time I ran this, it never finished the first shape
file in over and hour and a half while this algorithm lets me process this shape file
in just under 4 minutes. A great result from a simple change in how we process the
polygon shapes.