Using A Rough Box Model To Gather Near-By Zip Codes

Yesterday, I explored the mathematics behind finding the approximate distance between two latitude/longitude points on a map. This worked well, but don't be mislead - this calculation is still a rough estimation. Because the static values used in the equation are less accurate as you move away from the equator (the longitudinal lines get closer together), the distance becomes less accurate the closer you get to the poles. However, since we usually only care about these calculations in populated areas (ie. away from the Poles) and in a small radii, we can somewhat discount the outlier inaccuracies.

Originally, I was going to use these mathematical equations to write up a demo on finding all zip codes within an arbitrary radius to a given zip code. But based on some articles that I read and on some comments that were made to my previous post, I am feeling that this is completely unnecessary. Once you realize that a zip code can cover a large area, you realize that getting latitude and longitude from a zip code builds in a high degree of inaccuracy right from the very start. And, since this is going to be somewhat inaccurate to being with, we might as well use a fuzzier calculation that is much easier to use.

This fuzzier calculation is the bounding box model. Imagine that we have a starting zip code (10016):

Now, given this origin, rather than worry about a circular radius, we are going to create a bounding box that is plus and minus the radius in each plane of movement:

From the graphic, you can see that the areas covered by the circular radius (dotted black line) and the bounding box are roughly the same. Sure, the bounding box covers a greater area, but since the whole latitude/longitude-from-zip-code reading is somewhat inaccurate to begin with, I don't think this larger area is much cause for concern.

Now that we have settled on giving the bounding box model a try, how do we figure out what are our radius is? Well, we have the rough estimation that each degree of latitude on the map represents 69.09 miles. By using some simple algebra, we can find the degree-radius by taking a percentage based on this previously stated static value:

RadiusInDegrees = (RadiusInMiles / MilesPerDegree)

So, let's say we want to get the +/- radius in degrees for a 10 mile radius, the equation would be:

RiD = (10 / 69.09) = 0.14474 degrees

Ok, cool. Now, let's put this to a test - I am going to gather all zip codes in a three mile radius to my origin (10016) and then I'm going to map those using the Google Maps API:

As you can see, we use our radius to calculate the minimum and maximum latitude and longitude values of our rough box model. Then, rather than messing around with any complicated mathematical formulas, we simply gather all zip codes whose latitude and longitude fall within this fuzzy box model. When we run this code, we get the following map:

Now, Manhattan is not the best example as it is surrounded by water (which cannot have zip codes on it obviously); but, I hope you can see that the large majority of the zip codes covered by the box model fall within the more accurate circular radius. This seems pretty darn good to me.

Based on the graphic, we can obviously see that some of the locations fall outside what would have been the mathematically calculated coverage. But, is that true? In the code above, you'll notice that after I gathered the zip codes, I then went and stored the mathematical distance calculation based on latitude and longitude back into the query. What happens when we check our zip code query to get returned zip codes that are outside the more accurate circular radius:

When we run this code, checking for bad zip codes, we get the following output:

All zip codes within calculated radius!

Very interesting! Even though our Google map indicates that some of the locations would fall outside our circular area, all zip codes returned by the fuzzy box model (in my particular example) are just as accurate as the mathematical calculations would have been. I think that this demonstrates two things:

The mathematical approach is not fully accurate to begin with (at least at our level of mathematical complexity), and should not be thought of as such.

The fuzzy box model approach is accurate enough, when compared to the mathematical calculations, to justify using it for the sake of much greater simplicity.

So, in conclusion, when you need to find all the zip codes in proximity to a given zip code, the bounding box approach is going to be accurate enough for most of your everyday use cases, and faster to run than any mathematical calculations.

Then, run a SQL query to get the zip codes within a 50-mile radius of start_zip and populate the table. Yes, the table would be larger than a zip code table, but with that primary key in there it would still probably be faster than doing the math each and every time.

Build a Zip_50mi table, a Zip_25mi table, etc, and you are good to go. The Zip_25mi table would be a subset of the Zip_50mi table, etc.

You could try just a Zip_100mi table and no others, then use the distance_mi field as part of the WHERE clause, but I'd be willing to bet that it wouldn't be nearly as fast. Maybe the optimizer is smart enough, but it would take some testing.

Having done extensive work in this area of mapping, keep in mind that Zip codes are postal delivery routes. Specifically, they are street/road segment demarkations to optimize mail delivery. The lat-lon values associated with Zip Codes are the values for the centroid for that mail delivery area, so you do have some zipcodes ("C" shaped & "L" shaped) where it's centroid isn't even within the zipcodes "boundaries" (I use that term loosely with zipcodes).

That being said, if gross approximations of nearby zip-codes is the task at hand, then by all means, the proposed methods are fine. If something more accurate is expected, zipcode centroids shouldn't be relied upon.

Lastly, what is the application? In other words, why use zipcodes? In my opinion zipcode postal delivery routes for locating geographies on a map are now somewhat irrelevant. Unless you really love your postal mail(wo)man, it usually makes a lot more sense to search by neighborhood, subdivision, intersection, etc.

Good work though. One of the signs of best-of-breed software development is to simplify things. Well done.

Seriously, I'm amazed! The ColdFusion community needs more people like you. Thank you for contributing your knowledge and time into the CF community. If you put this much effort into an "Ask Ben" question, I could only imagine the work you must do for your company. It really shows how much you love what you do.

Hmm, interesting. Yeah, that would be a huge table :) But, I suppose you could leverage it in a bi-directional way; meaning, that if a start zip is 50 miles to a near zip, then we know that the near zip is 50 miles to the start zip. This would not require double storage of this knowledge. Of course, based on indexing, you might want to double up on the storage. Would be interesting to see.

@Paul,

When I think about this type of application, I think about "Store Locators" on vendor web sites; as in, find a Home Depot within 10 miles of me. With that said, I think the gross estimation is totally fine. Especially when you think that people aren't ignorant about their own surroundings. After all, people aren't *really* using these types of applications to calculate distance - rather they are simply using zip code location as a way to find locations. Once they see a location, they can determine on their own if its within a reasonable distance and how to best go about getting there.

As far as why we use zip codes - I think it's the easiest thing that everyone knows that doesn't have to be location specific. For example, in Manhattan, sure it would be nice to say, look up locations by "Midtown" or "Downtown", but that makes sense to me only cause I know the area. If I was in the country, maybe I could look up "County"? But even then, what if I don't know the area? What if I am visiting? It always easy to grab a piece of mail and see the zip code on the address label - the information is so readily available.

@Brian,

Always glad to help out! Plus, I learn something new myself every time :)

You could also argue that, in many cases, doing any of this on the db server at all is now a moot point. If you use the Google Maps API correctly, you can have it send an AJAX request with the Lat/Lon bounding box. No math needed!

In that app, the database is just a set of Lat/Lon coords and annotations. As the user drills down into the map, the browser requests the correct bounding box, so the SQL query is just a BETWEEN statement -- no icky trig.

@Rick O : Your lazy man's method (which I heartily applaud) reminds me of some advice given to me by my father-in-law many years ago. He told me, "Hal, the person who can do the work of ten men is amazing, but it's far better to be the person who can get ten men to do your work."

Great idea - I was trying to figure out how to make this function faster for my site. This approximation brought my query time from 3 seconds down to about 0.2 seconds, with very little loss of "accuracy".

One thing I changed - distance per degree of latitude will always be about 69.09 miles, but the distance per degree of longitude is different than the distance per degree of latitude. And also, distance per degree of longitude will vary as you move away from the equator. What I found was that in the continental US, it's about 45 miles per degree of longitude in the north and about 63 miles in the south. So I went to the next level of detail and saved the "Miles per degree longitude" on my zip codes table (a one-time UPDATE statement using the "true" distance calculation), and then used that in my queries.

I think it's great. I'm going to use a variation of this in an app I'm building to determine location of kids for sports teams. In other words, if they live close together, put them on the same team to keep travel time to a minimum.

Thanks again because you did it in CF!!! Rarely do I come across something this unique using CF.

Long time reader, first time poster. Yeah, not to make your head any bigger, but every time I Google a somewhat complicated CF issue...there you are. I just used REMatch() after reading your post on it. (I gave you credit in my blog:)

I can usually adjust your code to make it work for me, but was wondering if you had a "right" solution before I tried.

Is it possible to "reorder" the qZipCode query after you calculate the distances using GetLatitudeLongitudeProximity? I want to show a list of stores ordered by distance. The results would produce several pages. I've done this before with ASP, except instead of using SQL I just shoved the results into an array and sorted that. That worked fine on 1 page, but displaying multiple pages might get messy.

I apologize for not trying it myself first, but I thought your readers might benefit from this too.

Thanks my man - that is really flattering to hear :) As far as the sorting, if you have a query that you want to re-order, a lot of times, I'll go ahead and actually add a new column to the query using queryAddColumn(). Then, I'll loop over the query to populate that new column with the "test" value that I want to sort on. Then, it's simply a matter of query of queries to get the query in the right order.

I hope that helps point you in the right direction if you haven't already solved this (sorry - I'm behind on my blog commenting).

Awesome! I was searching for a solution to this and when I saw that you had posted something on your blog, I got a little giddy (admittedly) because your posts are always so clear and concise. Thanks for taking the time to document your discoveries.

With my code, I ran a query of queries to resort the results by distance and included an additional WHERE clause to limit the distance to the original "intMileRadius" value. It ended up being a quick and easy way to refine the accuracy of the rough box model.

One more thing to add.... everyone has a different style of coding. I took this code along with the functions from the previous blog post and converted them into a single CFC. Here's an example call to the CFC:

Can you outline how to add a different color icon to the mix for another set of data, for example I have a second query I am going to run but wish to change the marker to blue instead of red. Could you outline how to go about it?