Purpose of BostonGIS

BostonGIS is a testbed for GIS and Web Mapping solutions utilizing open source, freely available and/or open gis technologies. We will be using mostly Boston, Massachusetts data to provide mapping and spatial database examples.

If you have some thoughts or comments on what you would like to see covered on this site, drop us a line on our Feed Back page.

A generic solution to PostGIS nearest neighbor

After some heavy brainstorming, I have come up with a faster and more generic solution to calculating nearest neighbors than my previous solutions. For the gory details on how I arrived at this solution - please check out the following blog entries Boston GIS Blog entries on Nearest Neighbor Search.

In the sections that follow I will briefly describe the key parts of the solution, the strengths of the solution, the caveats of the solution, and some general use cases.

The parts of the solution

The solution is composed of a new PostgreSQL type, and 2 PostgreSQL functions. It involves a sequence of expand boxes that fan out from the bounding box of a geometry to as far out as you want. Below are the descriptions of each part.

pgis_nn - a new type that defines a neighbor solution - for a given near neighbor solution it gives you, the primary key of the neighbor and the distance it is away from the reference object

expandoverlap_metric - this is a helper function that returns the first fanned out box that an object is found in relative to the reference geometry. It returns an integer where the integer varies from 0 (in bounding box) to maxslices where maxslices is how many slices you want to dice up your larger expand box.

_pgis_fn_nn - this is the workhorse function that should never be directly called - it is a pgsql function that returns for each geometry k near neighbors

pgis_fn_nn(geom1 geometry, distguess double precision, numnn integer, maxslices integer, lookupset varchar(150), swhere varchar(5000), sgid2field varchar(100), sgeom2field varchar(100))) - this is our interface function written in language sql to fool the planner into thinking we are using an sql function. This is to get around certain anomalies of PostgreSQL that prevent pgsql set returning functions from being used if they reference fields in the from part of an SQL statement. Description of the inputs is as follows.

geom1 - this is the reference geometry.

distguess - this is the furthest distance you want to branch out before you want the query to give up. It is measured in units of the spatial ref system of your geometries.

numnn - number of nearest neighbors to return.

maxslices - this is the max number of slices you want to slice up your whole expand box. The more slices the thinner the slices, but the more iterations. I'm sure there is some statistically function one can run against ones dataset to figure out the optimal number of slices and distance guess, but I haven't quite figured out what that would be.

lookupset - this is usually the name of the table you want to search for nearest neighbors. In theory you could throw in a fairly complicated subselect statement as long as you wrap it in parethesis. Something like '(SELECT g.*, b.name FROM sometable g INNER JOIN someothertable b ON g.id = b.id)' . Althought I haven't had a need to do that yet so haven't tested that feature.

swhere - this is the where condition to apply to your dataset. If you want the whole dataset to be evaluated, then put 'true'. In theory you can have this correlated with your reference table by gluing parameters together, so can get pretty sophisticated.

sgid2field - this is the name of the unique id field in the dataset you want to check for nearest neighbors.

sgeom2field - this is the name of the geometry field in the dataset you want to check for nearest neighbors

Key features

My objective in creating this solution is to give PostGIS the same expressiveness that Oracle's SDO_NN and SDO_NN_DISTANCE functions provide Oracle. Unlike the Oracle solution, this solution is not implemented as an Operator, but instead as a set returning function.

Features are outlined below

Can handle more than one reference point at a time and return a single dataset. For example you can say give me the 5 nearest neighbors for each of my records in this query and get the result back as a single table.

You can limit your search to a particular distance range, slice it up as granularly as you want in order to achieve the best resulting speed. Note I implemented this function as linearly expanding boxes to utilize GIST indexes. In hindsight it might have been more efficient processor wise to have the boxes expand like a normal distribution or something like that.

You can apply where conditions to the sets you are searching instead of searching the whole table - e.g. only give me the 5 nearest hospitals that have more than 100 beds. You can also have the where conditions tied to fields of each record of interest. E.g. include all near neighbors, but don't include the reference point.

This solution works for all geometries - e.g. you can find point near neighbors to a polygon, polygon near neighbors to a polygon etc. Nearest distance to a line etc.

Caveats

This function will only work if you give it a primary key that is an integer. I figured an integer is what most people use for their primary keys and is the most efficient join wise.

This function currently doesn't handle ties. E.g. if you specify 5 neighbors and you have 6 where the last 2 are the same distance away, one will arbitrarily get left out. I think it will be fairly trivial to make it handle this case though.

If there are no neighbors for a particular reference that meet the stated criteria, that record would get left out. There are workarounds for this which I will try to provide in more advanced use cases.

Find nearest fire station for first 1000 buildings in Boston and sort in decreasing order of distance from the fire station such that
the building furthest away from its nearest firestation is at the top.

Data set used: Fire stations from above and Massachusetts census tiger roads -http://www.mass.gov/mgis/cen2000_tiger.htm
For each street segment in zip 02109 list the 2 nearest fire stations and the distance in meters from each fire station. Order results by street name, street id and distance from nearest fire station

More Advanced Queries

Example of two simultaneous neighbor queries in one query

Now for this, I think we can safely assume that no street we care about should be further than 30,000 meters from its nearest school and hospital
so we will limit our search to
30,000 meters and dice up our box into 10 expand boxes - so boxes expand at a rate of 3000 meters out per expansion.

Give me the nearest elemenary school and the nearest hospital emergency center for street segments in select zipcodes
and how far away they are from these street segments.

Caching data by storing closest neighbor in a table

For this exercise, we will create 2 new fields in our schools table called nearest_hospital and nearest_hospital_gid
and update the fields with the closest hospital to that school.
We will set our maximum distance search to 100,000 meters carved out into 10 10,000 meter expansions.

Later on we realize we need an alternate hospital in case the first hospital is filled or because of traffic situation we can't get there,
so we create a new
set of fields to hold the second closest hospital. This particular exercise demonstrates where clause dependent on the
additional attribute data of the reference record.