I am working on an application that requires the storage of location information such as city, state, zip code, latitude, and longitude. I would like to ensure:

- Location data is accurate
  - Detroit, CA (Detroit IS NOT in California)
  - Detroit, MI (Detroit IS in Michigan)
- Cities and states are spelled correctly
  - California, not Calefornia
  - Detroit, not Detriot
- Cities and states are named consistently
  - Valid: CA, Detroit
  - Invalid: Cali, california, DET, d-town, The D

Also, since city/zip data is not guaranteed to be static, keeping it up to date in a normalized design could be difficult, whereas a denormalized design could simply treat whatever was entered as the de facto location.

A couple of thoughts that come to mind:

1. A collection of reference tables storing a list of all states and the most common cities and zip codes, which can grow over time. The application would search the database for an exact or similar match and recommend corrections (a rough sketch follows this list).
2. Using some sort of service to validate the location data before it is stored in the database.
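To make the first idea concrete, here is a rough sketch of what I have in mind. It uses SQLite and Python's standard-library difflib purely for illustration; the schema, names, and similarity cutoff are all placeholders, not a finished design.

```python
import sqlite3
import difflib

# Reference tables that can grow over time (illustrative schema).
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.execute("CREATE TABLE states (code TEXT PRIMARY KEY, name TEXT NOT NULL UNIQUE)")
db.execute("""CREATE TABLE cities (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    state TEXT NOT NULL REFERENCES states(code),
    UNIQUE (name, state)
)""")
db.executemany("INSERT INTO states VALUES (?, ?)",
               [("CA", "California"), ("MI", "Michigan")])
db.executemany("INSERT INTO cities (name, state) VALUES (?, ?)",
               [("Detroit", "MI"), ("Sacramento", "CA")])

def check_city(name, state):
    """Exact match -> ('ok', name); otherwise -> ('suggest', closest known cities)."""
    row = db.execute("SELECT name FROM cities WHERE name = ? AND state = ?",
                     (name, state)).fetchone()
    if row:
        return "ok", row[0]
    known = [r[0] for r in db.execute("SELECT name FROM cities WHERE state = ?", (state,))]
    return "suggest", difflib.get_close_matches(name, known, n=3, cutoff=0.6)

print(check_city("Detriot", "MI"))  # ('suggest', ['Detroit'])
print(check_city("Detroit", "CA"))  # ('suggest', []) -- no Detroit on file for CA
```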

Is it possible to fulfill these requirements without normalization, and if so, should I denormalize this data?

4 Answers

Yes. You can always make sure the data conforms to the rules inside the application program(s) before adding the data to the database. There are two things that make this hard to do.

The first is the imperative nature of almost all programming languages. There is no simple way to declare that a given field must never be left null, must always refer to an existing referent, or must always lie within such and such a domain. Such declarative rules would generally be counterproductive when dealing with program variables anyway.

So you have to check every execution path to make sure that it doesn't break the rules. Encapsulation helps, but not enough.

The second is that multiple application programs can interact with the same database. I've seen databases where dozens of programs, parts of several applications, share data by reading and writing to the same database. Making sure none of those programs goes rogue is next to impossible.

By contrast, DBMS constraints are declarative, relatively simple, and get applied whenever necessary in the context of transactions that manipulate the data. Triggers can make the enforcement more flexible, at the cost of going down the imperative route rather than the declarative one.
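To illustrate the contrast, here is a minimal sketch of those three kinds of declarative rule (not-null, referential, domain) expressed as DBMS constraints. The schema is invented for the example, using SQLite from Python:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is on
db.execute("CREATE TABLE states (code TEXT PRIMARY KEY CHECK (code GLOB '[A-Z][A-Z]'))")
db.execute("""CREATE TABLE locations (
    city  TEXT NOT NULL,                          -- never left null
    state TEXT NOT NULL REFERENCES states(code),  -- always refers to an existing referent
    zip   TEXT NOT NULL
          CHECK (zip GLOB '[0-9][0-9][0-9][0-9][0-9]')  -- always in the domain
)""")
db.execute("INSERT INTO states VALUES ('MI')")
db.execute("INSERT INTO locations VALUES ('Detroit', 'MI', '48201')")

# Every program that writes to this database hits the same rules,
# no matter which execution path produced the data.
try:
    db.execute("INSERT INTO locations VALUES ('Detroit', 'XX', '48201')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # FOREIGN KEY constraint failed
```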

Normalization is only relevant to certain types of data integrity violation. The focus of normal forms 2NF through 5NF is to obviate the possibility of leaving the database in a self-contradictory state. The normal forms do this by ensuring that a single fact is stored in only one place. It's either right or it's wrong, but it isn't contradicted by another fact stored elsewhere in the database.

This can be enormously helpful in reducing the incidence of violations of integrity. But there are plenty of cases of integrity that have nothing to do with normalization. Referential integrity is one such case.
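To make the "single fact in one place" point concrete, here is a small hypothetical example. It treats zip → city as a functional dependency, which is a simplification of real postal data, but it shows how a denormalized design can hold two contradictory copies of a fact while a normalized one cannot:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Denormalized: the zip -> city fact is repeated on every row,
# so two rows can silently contradict each other.
db.execute("CREATE TABLE orders_denorm (order_id INTEGER, zip TEXT, city TEXT)")
db.execute("INSERT INTO orders_denorm VALUES (1, '48201', 'Detroit')")
db.execute("INSERT INTO orders_denorm VALUES (2, '48201', 'Detriot')")  # accepted!

# Normalized: the fact lives in exactly one row, so the contradiction
# cannot even be stored.
db.execute("CREATE TABLE zips (zip TEXT PRIMARY KEY, city TEXT NOT NULL)")
db.execute("INSERT INTO zips VALUES ('48201', 'Detroit')")
try:
    db.execute("INSERT INTO zips VALUES ('48201', 'Detriot')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # UNIQUE constraint failed: zips.zip
```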

In nearly every case where a programmer opts out of DBMS enforced integrity because "my programs never have logic errors" or "the DBMS runs too slow", that programmer or his/her successor has lived to regret it. Remember, the value of data often outlasts the system that captured it.

In the case that you gave us, where Detroit, CA is known to be wrong, how can you be sure? How do you know there isn't some tiny little suburb, somewhere in California, that has been given the name "Detroit"? You can assume that, if it isn't in the database, then it can't be true. But that assumption can easily come back to bite you.

You can always implement a value transformer that normalizes (or attempts to normalize) the input into a canonical form, similar to how a spell-checker works. Check out Damerau-Levenshtein edit distance if you need ideas on how to find the nearest match for a free-form input string in a list of known strings. As for ensuring data accuracy, once the input is in canonical form you can simply store your criteria in some sort of associative data structure (a dictionary would suffice here), as you've described.
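As a rough sketch of that idea (the function names and the distance threshold of 2 are placeholders, not a recommendation), the optimal-string-alignment variant of Damerau-Levenshtein fits in a few lines:

```python
def osa_distance(a, b):
    """Optimal-string-alignment variant of Damerau-Levenshtein distance:
    counts insertions, deletions, substitutions, and adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def nearest(word, known, max_distance=2):
    """Return the closest known string, or None if nothing is close enough."""
    best = min(known, key=lambda k: osa_distance(word.lower(), k.lower()))
    return best if osa_distance(word.lower(), best.lower()) <= max_distance else None

print(nearest("Detriot", ["Detroit", "Dearborn", "Flint"]))  # Detroit (one transposition)
print(nearest("Calefornia", ["California", "Colorado"]))     # California (one substitution)
```

Lowercasing both sides also handles the consistency cases like "california" vs "California" at distance zero.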

This way you can still allow free-form input while keeping your model consistent and accurate. Consider what would happen if Google search required normalized input: you would miss tons of matches whenever your spelling or phrasing wasn't 100% accurate.