From An Expert's Guide to Oracle Technology I am working on a work history/job application kind of application. Nothing fancy. It's just a side line I am playing with. Anyway, I figured that since I was working on this, I would share my thoughts as I model the database portion. I may even follow up with the coding of the application but I am undecided on that as of yet. I've been trying to figure out a good way to approach this. I don't want to write yet another dry treatise on data modeling. You know, the one where they explain what a crow's foot is and what the difference between a relation and an entity is. I don't really want to go there. Do a google on data modeling tutorials and you can find plenty. What I plan to do instead is just kind of document my thought processes and then answer any questions you might have as I write these. I am going to be working mostly in a logical model and showing a potential physical representation of that model at times. I'm not going to be strict about the terms I use and I won't be digging into the meaning behind the concepts, per se. My brain doesn't work like that when I'm working and this is really to show how my brain works, and what I'm thinking about, when I do this kind of work. I will put some conceptual stuff in the first logical model and I will put some physical stuff in the logical model; it's just the way I work. I never consider a model final until I have a complete model AND I have spoken about it and gone over it with someone who knows the subject matter (hopefully another modeler but that's not always possible). The reason for this is that if you are looking at something from a skewed viewpoint, you probably will through your entire model. Ideally, you will have someone looking at it and discussing it with you from the very beginning. That's not always possible so at least get someone at the end. You will be my reviewer and bouncer-offer. Thank you for that. Questions and comments are heartily encouraged. Today, I am going to start with a tiny piece of the model as a warm up. It's not actually the first piece I started with but it's stand alone (mostly) and small enough to cover in a single post. It will be a long post though. It's also a nice segue to my next post which is actually a more important aspect of the model. We are going to start with the address portion of our model. When discussing work history, we have plenty of addresses: work, home, school, references, etc. Because of that, I can model an address as part of all of those entities. I won't spoil my final model by spilling the beans just yet but I will start off with address as a stand alone entity. It may get merged with other entities later but right now I just want to see what is appropriate for an address. What is an address? Generally, when we say address, we mean a street number, street name, city, state, zip and country. In some countries, state might not be called state; it might be called province or something else. For now, we will just call it state and let the front end worry about what to display to the user. The image above is a basic entity definition of an address. Sure, we could have 5 address lines instead of 3 or we could worry about the state/province thing. For now, let's agree that for our purposes, we have an address defined. We have a problem though. When normalizing data structures, we want to get rid of repeating data. This makes it easier to keep the data current when changes are required, it keeps the tables smaller and it follows normalization best practices. Repeating values and internal dependencies make updating the data, or at least keeping the data consistent, very hard. Do you see the repeating values in this structure? Country will be repeated many, many times. State will be repeated many times with-in each country (that's also a dependency on a non-key field). City repeats in a state and postal_code repeats for a city and possibly for several cities. I added a new entity for each of my suspected multi-value sets. I haven't removed the column from address yet. I will before I'm done but one thing I have learned over the years is that modeling is an iterative process and I may start over from scratch. I generally keep a lot of backups too. ;-) Digging deeper into my new entities, I want to assign some keys. I want to identify my primary key and any unique keys. If I start with country, then technically, I have two candidate keys, the country name and the country code. As a space saver, I will assign the code as my primary key and the name as a unique key. This could change later but for now it works. I'll do the same exercise with city and state. Postal_code has a single attribute (currently) so I will assign that as the primary key. This is looking good. Now that I have some keys and I understand how the entities are related, I am getting a bit more confidence that I am not chasing a rabbit down a hole. I'll remove the redundant columns from the address table and assign what I think may be the candidate key. I'll also take a first shot at some relationships. So what does the above image tell me? Looking at the address entity, I have assigned the three remaining attributes as my primary key. Remember that a primary key MUST be not null and it is highly likely that line 2 or line 3 may be null. We will revisit this key but I will accept it, as is, for now. More interesting, are the relationships. A country may have many states, a state may have many cities and a city may have many postal_codes. A country can't have the same state twice but two countries could have states with the same name (but would in fact be different states, it would not be the same state in two countries). Looks OK for now. Note: In this ERD, I am using Barker notation. Here is a quick run down of the meaning of the relationship lines. Crows foot means MANY. Solid line means at least 1. Dashed line means 0 or more. Short bar across the line means an identifying relationship (the key from the source table becomes part of the key in the target table). How do I tie that to my addresses table? Looking at the paragraph above, how would we write the relationship? An street address (lines 1+2+3) might have many postal_codes? No, that combination should have a single postal code. Is it a 1 to 1 relationship? An street address has one and only one postal_code. That's true but to complete the relationship (which I skipped above), we also need to read the other way. A postal_code has one and only one street address? Nope. Not it. A street address has one and only one postal code but a postal code may have many addresses? That looks about right. From country down, it runs A country may have many states (and it also means that IF I have a state, I will have a country). A state may have many cities, a city may have many postal codes and a postal code may have many street addresses. Going the other way, a street address will have a postal code, a postal code will have a city, a city will have a state and a state will have a country. You can actually read a lot more meaning from the lines but that is the gist. If you want to understand a model, that the least you need to know. I also believe that saying the relationships out loud will help you find bad relationships. I frequently typo and put relationships on backwards. If I read the model out loud, that forces me to process the image it's showing me and I will get a big duh and fix my error. The above image is physical representation of the first model we did (which was a logical model). It's got the basic idea down but I see one thing that is just plain ugly. Remember above when I mentioned the primary key of addresses being the entire address? Well now that address includes the 3 address lines and country, state, city and postal_code. While those columns would make a valuable unique key, it is extremely cumbersome as a primary key. You will find no end of debate on the web about composite keys, natural keys (aka smart keys) and surrogate keys. I don't plan to return to that debate. I use surrogate keys where I find them valuable. I will also model them in a logical model (some say to hold off on that until the physical model) when it makes sense (which honestly, is most of the time). I am going to replace the cumbersome composite key on addresses with a surrogate key (address_id). Depending on the target database, that key may be populated by a sequence, it might be defined as an identity column or it might use some other method. The key point here is that it should be system generated and not derived from any application data. You could also use a COUNTRY_ID, STATE_ID, CITY_ID, and/or POSTAL_CODE_ID (or any combination of those). I'd probably just use the codes myself. Since I'm not yet ready to move to a final model, I will postpone that decision until I have decided exactly what entities will exist at the end of this process. The model looks a bit cleaner now doesn't it? And if I regenerate the physical model, I get a nice clean model there also. So having undertaken this exercise, I want to make a few points. You DO NOT need to implement an address in this way. In the real world, you will see many instances of address using many different models. To a purist, only one can be right. I call BS on that. The right one is the one that works best in your situation. There are some databases that are not as performant on joins as other databases. If you have infrequently updated data on one of these databases, it is very likely that you will implement a model that denormalizes some or all of these attributes. You might want to compromise. For example, maybe you combine the countries and states into a single table. Or maybe even country, state and city. And technically, the postal code could be added to the address lines table. In the above model, I created a locality consisting of country, state and city and I rolled postal code into addresses. I also created a locality_id surrogate key and used that as my primary key. I made the name columns a unique key. I could do that with the code columns also. I have yet to decide exactly which I will go with. I will say that I like to fully normalize (to a realistic point) before I think about denormalizing. If you don't normalize before denormalizing, then you aren't denormalizing. You're just not doing your job. One final note, for something as simple as an address table, you really could just go snag one off the internet or out of a book. However, if you are interested in modeling, you should do your own models and then see how they compare to what you can find. Eventually, you will have modeled some things so many times that you just pop them into new models without really thinking about it (taking into consideration any unique business rules of course). Addresses is pretty much one of those. The logical model won't really ever change much. The thing that will change is how you want to physically implement that model. That's it for the basic address modeling. Next up, I will talk about integrating address into the larger model by discovering the relationships between a person and an address. Until then, LewisC

Some name

Lewis is an Oracle Ace Director, Oracle Certified Professional, published author, frequent conference speaker and Database Architect. Lewis's specialties revolve around databases, data warehousing, business intelligence and most anything having to do with databases. He has two decades of multi-vendor and open source database experience in a variety of industries and capacities and has worked with Oracle since 1993.

"At the risk of being too US-centric, I'd start with the
USPS standard, then broaden outward, looking for any international standards. "

Sorry! Something went wrong on our end. Please try again later.

Lewis Cunningham

July 20, 2010 06:23 AM

"Martin, good catch. As a standalone entity, City name can be unique (like a lookup) but as soon as I create the dependency on state, that is no longer true. The locality version should be ok as I made a unique key out of country name, state name, and city name. In my latest iteration, I've decided to go twith locality and I've dropped city code (so there is just a name).

Now I need to figure out how to get county in there. I guess in locality it will just be another column. It should be unique inside of a state and it can contain multiple cities. Can a city be in multiple counties? Argh.

On your comment about structured vs unstructured, I tend to agree with the caveat that it all comes down to the requirements. One size doesn't fit all.

Thanks for the comments.

LewisC
"

Sorry! Something went wrong on our end. Please try again later.

Lewis Cunningham

July 20, 2010 06:16 AM

"DJDJDJ, I didn't think about capturing counties though. I will need to think about adding that in. Thanks.

BTW, for everyone's elucidation, I cannot actually see the model from where I am right now. I can only read the text. Makes me reconsider using flickr for my screen shots. "

Sorry! Something went wrong on our end. Please try again later.

Lewis Cunningham

July 20, 2010 06:14 AM

"SpatialDBAAdvisor,

2 things.

As I mentioned in the entry, if someone just wants to get a model, they can google for it and probably find one. Thanks for the link. Readers can see if that's a fit for their buisness requirements. The point of this entry is more to share my way of thinking rather than provide a generic model.

The second thing is that I think we need to go back to what I stated were my requirements. This is a fairly simple application to track a person's history (work history, home addresses, etc). Academically, any model can be taken to the nth degree. However, for this business case, geocoding goes way beyond reasonable.

As always, I try to keep the business in mind and not try to provide solutions that won't be needed. Like you mention on the downsides of structured vs unstructured, in this case, geocoding is more academic.

Thanks for the great comments.

LewisC
"

Sorry! Something went wrong on our end. Please try again later.

Lewis Cunningham

July 20, 2010 06:05 AM

"DJDJDJ, the way it is designed a zip CAN span multiple cities. The primary key would be country, state, city and zip.

"City_name cannot be a unique key in City, unless state_name and possibly country_name are included. For example, Perth is a city in Scotland, a city in the state of Western Australia and a town in the state of Tasmania in Australia.
The same problem may occur with State, if two countries use the same State_name.

In my experience, any structured address model will encounter a real address that doesn't fit, while unstructured addresses will cope with all addresses, fit on an address label, but not be of much use in data mining unless you are prepared to wear processing costs and a 95% success rate.

Martin "

Sorry! Something went wrong on our end. Please try again later.

Simon Greener

July 19, 2010 08:42 PM

"The spatial community have put a lot of effort into structured addressing. The relevant US work can be accessed at: http://www.urisa.org/about/initiatives/addressstandard

The reason for the amount of effort of the spatial community on this over the years is because address is the basis of the vast majority of geocoding that nowadays drives Google, Yahoo and Bing etc Maps.

Personally, I believe unstructured storage provides a more optimal approach, but the issue of structured vs unstructured data processing is a discussion item that would take too long to outline or resolve. Sadly the sort of linguistic processing and analysis techniques required for quality unstructured data processing always seems to remain in the realm of academic papers and not software.

Simon "

Sorry! Something went wrong on our end. Please try again later.

Lewis Cunningham

July 18, 2010 06:17 PM

"Chet,

Thanks for the comments. I think you would always want to store the entire address. How would you be able to print out a street address if you didn't?

There are quite a few physical denormalizations you might make for a read only vs hot table but generally your logical model would be the same.

Whether you store both a cleansed address or one or the other is more of an application question. If you have a requirement for that, I'd think the model would be close to the same, just maybe with an iteration sequence or possibly active date? Hmmm. Something to think of.

Of course in dimensional modeling, you would create a type 2 but that address entity would look completely different that it would for oltp.

Thanks again,

LewisC
"

Sorry! Something went wrong on our end. Please try again later.

Lewis Cunningham

July 18, 2010 06:12 PM

"John,

I get your point about the repeating values. You mean where if you have a street address, you might have repeating values for, say an apartment or office building? Then again, in some countries the extra lines are for additional directions, like past the 3rd tree and after the feed store.

I think trying to normalize out the lines is going way too far in normalization. Comes back to my line ""I will say that I like to fully normalize (to a realistic point)"". I think going past the address lines is no longer realistic. There is a point where the database is unusable and I think that is getting there.

Just my opinion, though.

I think the function to join the lines for printing is very appropriate. I wouldn't recommend combining all the lines in the database because you may eventually want to mine the data. The function solves that nicely.

Where I sometimes give myself the hmmmm's is splitting out the street number into a different column. That is very useful information for grouping households in some financial data mining. xxx street dr is much less affluent than xxxxxx street dr. I almost never do though unless I know I have that requirement (and I would still hesitate to move the lines out into different entities).

Thanks for the comments.

LewisC
"

Sorry! Something went wrong on our end. Please try again later.

JJFlash

July 18, 2010 04:40 PM

"I've often struggled with the ADDRESS_LINE1, ADDRESS_LINE2 ... construct. Not that I haven't used it myself. For one thing, it violates the normalization rules that suggest that repeating fields are verboten. For another, how many address lines are enough?

Separating the city, state and postal code is okay, because I might want to use these as query conditions, but rarely ADDRESS_LINE2. And I'm constantly writing code like:
Print ADDRESS_LINE1 (required field). If ADDRESS_LINE2 is not null then print a line feed and ADDRESS_LINE2. Same for ADDRESS_LINE3. In fact, I wrote a function that takes multiple address lines and returns a single VARCHAR2 with the appropriate lines of address, with CRLF, LF, or <br/> between.

I wonder if a single VARCHAR2(2000) field for the STREET_ADDRESS, and allowing for the appropriate new line characters might be better.
"