I'm designing a system which needs to report basic distance between zip codes. It will use the Google Distance Matrix API but I want to cache the results in the database so that multiple requests for the same data points don't result in duplicate API calls.

A basic Distance class might look like this:
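
    import java.util.Date;

    // A minimal sketch; the field names and types here are assumptions.
    public class Distance {
        private final String zip1;   // origin zip code
        private final String zip2;   // destination zip code
        private final int meters;    // distance returned by the API
        private final Date cached;   // when the value was fetched

        public Distance(String zip1, String zip2, int meters, Date cached) {
            this.zip1 = zip1;
            this.zip2 = zip2;
            this.meters = meters;
            this.cached = cached;
        }
        // getters omitted
    }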

A basic data structure for the cache might look like this:
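
    -- A minimal sketch; the table and column names are assumptions.
    CREATE TABLE distance_cache (
        zip1     CHAR(5) NOT NULL,
        zip2     CHAR(5) NOT NULL,
        distance INT     NOT NULL,  -- e.g. meters, as returned by the API
        cached   DATE    NOT NULL,  -- when the value was fetched
        PRIMARY KEY (zip1, zip2)
    );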

Since I'm not concerned with fine-grained detail such as changes due to one-way streets, divided highways, interchanges, etc., it doesn't matter to me whether I'm getting the data from point A to point B or from point B to point A. But I'm not sure how to express this with full normalization in the database. With this primary key, it would be perfectly legal for the same distance to exist in two separate rows, e.g. (with made-up values):
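
    zip1  | zip2  | distance | cached
    ------+-------+----------+-----------
    10001 | 90210 | 4500000  | 2013-07-29
    90210 | 10001 | 4500000  | 2013-07-29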

"The Google Distance Matrix API is a service that provides travel distance and time for a matrix of origins and destinations." therefore distance(zip1, zip2) not necessary equals distance(zip2, zip1), because they would be always equal if you always can get back from zip2 to zip1 in same way as from zip1 to zip2, but that's not true (for example if address is in on way street you will need to travel in different way)
– Dainius, Jul 29 '13 at 8:33

The OP is only calculating the road distance from post office 1 to post office 2, or from some central point of one zip code area to another. Differences due to road rules are likely small compared to the fact that a zip code may encompass many square miles.
– mhoran_psprep, Jul 29 '13 at 15:17

4 Answers

If the location is being represented by a zipcode (fine for cases where the variation in distance estimates across points within the same zipcode is inside your tolerance), then I'd impose a CHECK constraint on the fields Zip1 and Zip2 requiring that Zip1 <= Zip2.
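
In SQL that might look like this (a sketch against the table from the question; the constraint name is an assumption):

    ALTER TABLE distance_cache
        ADD CONSTRAINT zip_order CHECK (zip1 <= zip2);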

Of course this imposes a corresponding burden on the middleware logic: the convention that assures only one record per pair is that the smaller zipcode goes in Zip1. This is mainly a concern when INSERTing rows, as queries can be agnostic about which field is first, i.e. WHERE (Zip1 = myZip1 AND Zip2 = myZip2) OR (Zip1 = myZip2 AND Zip2 = myZip1).
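
On databases that provide LEAST and GREATEST (PostgreSQL and MySQL do; not all engines), the ordering convention can even be pushed into the INSERT itself. A sketch, with :zipA, :zipB, and :distance as assumed placeholders:

    -- Canonicalize the pair so the smaller zipcode always lands in zip1.
    INSERT INTO distance_cache (zip1, zip2, distance, cached)
    VALUES (LEAST(:zipA, :zipB), GREATEST(:zipA, :zipB), :distance, CURRENT_DATE);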

However, normalization is not violated by storing both orders in the compound key. Doing so uses somewhat less than twice the number of records, but these look to be fairly small rows, so I'm not sure you can justify the extra logic complexity on efficiency grounds (least of all in terms of speed).

+1 At first I thought a stored procedure to check whether the opposite pair exists in the table would be required, but a check constraint that enforces z1 <= z2 solves the problem. Smart.
– user61852, Jul 29 '13 at 13:24

According to this Zip code FAQ, there are about 43,000 zip codes in the US, and the figure fluctuates by a couple thousand per year. Note that the value returned by distance(zip1, zip2) may change over time, due to new road construction, zip code boundary changes, etc.

If you store the return values indefinitely, you might wind up with 43,000^2 (approximately 1.8 billion) values. In that case, avoiding the inverses (zip2, zip1) makes some sense, since it roughly cuts the table in half.

If you let your cache records expire (keep them for a day? a week? a month?), then it may make more sense not to worry about the inverses.
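
A sketch of such an expiry sweep against the table from the question (the 30-day window is only an example, and interval syntax varies by engine; this form works in PostgreSQL):

    -- Drop stale entries; expired pairs are simply re-fetched on demand.
    DELETE FROM distance_cache
    WHERE cached < CURRENT_DATE - INTERVAL '30 days';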

As long as you're not coding for mobile or embedded devices, disk space is really cheap.

Consider departing from normalization. For every departure from one of the normal forms, there is a corresponding anomaly to be overcome. For 2NF through 5NF the anomalies are generally in the area of harmful redundancy.

What makes this kind of redundancy harmful is the possibility that the database will contain mutually contradictory facts, and therefore yield erratic results. You have to program in such a way that these contradictions don't occur.

For routine database work, it's usually better to normalize, and thereby obviate contradictions, than to try to be careful with updates.

But if you were to carry this line of reasoning to its logical conclusion, you would never do any caching whatsoever. A cache always contains data that is redundant to the underlying data it caches. Whenever you build a cache, you have to guard against the cache delivering results that are no longer valid. Programming to cope with harmful redundancy involves the same kind of disciplined thinking.

At the end of the day, you may decide to normalize after all. But don't treat normalization as some sort of holy grail. It isn't.

There is an alternative. Whenever you add a data point to the cache, arrange the data so that zip1 < zip2. Whenever you do a lookup, arrange the two zips the same way. It's a little extra logic, but it may beat adding every point twice.
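
For instance, a lookup that applies the same ordering on the fly (a sketch using LEAST/GREATEST, which not every database provides):

    -- Canonicalize the pair at query time so only one stored order is needed.
    SELECT distance
    FROM distance_cache
    WHERE zip1 = LEAST(:zipA, :zipB)
      AND zip2 = GREATEST(:zipA, :zipB);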

I got an answer from Tony on Google+ that helped a lot. The key isn't to throw out the idea of normalization altogether, but simply to get past the concept of two columns (zip1 + zip2) as the primary key, which was, for some reason, the concept I started with. Instead, we use a relational table that associates the two locations without caring what order they come in.

The relational table maps two zips to one RouteId, and then the Routes table will contain distance data and the cached date.
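
One way to realize that design (a sketch: the RouteZips name and column layout are assumptions; the junction table holds two rows per route, one per zip, so pair order never matters):

    -- Route facts: one row per zip pair, regardless of direction.
    CREATE TABLE Routes (
        RouteId  INT  PRIMARY KEY,
        Distance INT  NOT NULL,  -- e.g. meters, as returned by the API
        Cached   DATE NOT NULL   -- when the value was fetched
    );

    -- Association table: two rows per route, one for each zip.
    CREATE TABLE RouteZips (
        Zip     CHAR(5) NOT NULL,
        RouteId INT     NOT NULL REFERENCES Routes (RouteId),
        PRIMARY KEY (Zip, RouteId)
    );

    -- Look up the cached distance for a pair, in either order.
    SELECT r.Distance
    FROM Routes r
    JOIN RouteZips a ON a.RouteId = r.RouteId AND a.Zip = '10001'
    JOIN RouteZips b ON b.RouteId = r.RouteId AND b.Zip = '90210';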