Sticky Areas, Sweet Solutions in Data Vault Modeling

First off, happy new year to everyone. This post dives in to two of the most sticky problem areas for Data Vault Modelers, and suggests some basic workable solutions for these. I encourage you to post your solutions as comments to the bottom of this entry. The two most troublesome areas to model are: addresses and people.

The problems…

Both of these topics share the same basic problem: lack of “real-world” business keys. Before you jump on me and tell me you have business keys for these, let me explain…

Issues with addresses…

Addresses consist mostly of non-unique data. The issue isn’t that the real-world address itself is non-unique, the issue is how the address is represented in computer systems (and entry applications) that make it non-unique. In general, a physical place usually has a single address, including mail-stops in large corporate buildings. However, getting a clean copyof all those addresses from all those systems in to a master data set is the tough part.

Yes, if you are going to assign unique business keysto addresses you mustbegin working with master data for them.

So what is the problem? Unfortunately the problem is in the manner in which the addresses are entered in to the systems. In other words:

RD and ROAD may mean the same thing

WY and W and WAY may mean “way”

S, ST, STR, STREET all mean the same thing

Sometimes, WY and RD are used when STREET is meant (or correct)

You get the idea, the first problem with addresses is: entry of non-standardized data sets. Basically the address entry screens are free-form, and allow each individual in the company to enter anythingthey please. Thus, resulting in duplicate address entries that differ in standard form one way or another – in other words from a data perspective they are unique, but from a human perspective, they are the sameaddress.

So, address standardization is a huge issue that must be corrected before any sort of business key can be applied.

The next issue with addresses is: spelling errors, and address corrections, and don’t forget corrections to latitude and longitude. So even if you standardize the addresses, they will still have differences across multiple source applications. Furthermore, I’ve seen incorrect latitude and longitude ratings which complicate matters further.

So you say, well – we will just use an address cleansing / house holding solution. To that, I say great, wonderful – and if your business runs on addresses, even better. Hopefully the solution you choose will provide you with two things:

The “as-was” (before cleansing) address, the “as-is” address (post-cleansing). Without these two address sets you do not have traceability / auditability back to the source

The “assigned address key” – which should stay consistent every time the address cleansing software is run.

The services that I’m aware of that do this kind of thing for international addresses include MelissaData. I believe the US Postal service even outsources to them (If I’m not mistaken).

Yes, after cleansing, after standardization – then and only then could you think to assign a proper business key.

Issues with People…

Unique Identification of people in a general sense goes against the very grain of my beliefs. However, in business – as an employee, or a part of a company, it becomes important. Otherwise, analytics go haywire. Imagine if I owned a telephone company and couldn’t identify my customers uniquely? Wow, that would be a mess.

As it turns out, it IS a mess. It’s a huge mess. Employees (and sometimes contractors) on the other hand are generally assigned numbers that are associated with all the things they do within the corporation, so they are easier to track – and of course, the business key becomes the identifier that is assigned.

Now with customers, or people visiting a free medical clinic, or people in a hospital, prospects on a web site, the problem is exponentially more difficult. With data mining (and some basic location information), you can get a pretty decent result. However keep in mind that running the mining algorithm (if its a neural net instead of a heuristic / statistically driven space) may produce a different “key” for the same inputs based on learning new things.

So, neural networks are good for “possible unique identification” but bad for “absolute business key assignment”.

So putting together a “people HUB” becomes a bad idea.

Ok, these are the issues, where are the solutions?

Really, this is the crux of the problem with both of these “sticky spots”. Nearly every product, and every service that business engage in can or should or does have an identifiable business key.

Solution #1 is the easiest of all: Put addresses and “people” records as Satellites across your data model. Yes, you will have replication of data sets, yes you will increase the number of Satellites, BUT this information will be tied to a real business key that is auditable and has a trail. AND you can then mine the data, and use same-as links, and standardize/correct the data to produce results for your Business Vault.

This will give you an “as-was (raw auditable)” and “as-is (business)” image of both elements. It will keep you from getting hung up on trying to assign business keys for “people” identifiers when you don’t have one coming in from the source applications.

Solution #2: is far more difficult. For Addresses: use latitude, longitude, and altitude (high rise buildings, then office number or mail stop). MAKE SURE your latitude and longitude and altitude coordinants are correct! Otherwise, you will have a night mare on your hands IF these change, but the address stays the same.

For people: there is NO good solution. But if you MUST, then – use full name, address, gender, and if you have it: Social Security number (or something equivalent). Just remember, if any of these things change, you will be creating a new business key – and you will be forced to mine the data post-loading to try to “paste it together again” in a same-as link structure.

“Solution #1 is the easiest of all: Put addresses and “people” records as Satellites across your data model.”

and

“you can then […] use same-as links,”

Sticking to your example of “customers, or people visiting a free medical clinic, or people in a hospital, prospects on a web site”, are you suggesting each of these would get a separate hub, and each a satellite to hold name, gender and birthdate data? And then, to add “same-as” links between these hubs to assert identity?

If yes, then what key should be used for those hubs? It seems to me the fundamental problem – choosing the key – is irrespective of the number of hubs. In addition, I think it’s safe to say the matching process to discover the same-as links is likely to result in some mismatches. Would it be ok to delete those link records? Or should same-as links also never be deleted?

Also, with regard to addresses – it seems to me that even if the data was collected correctly in the source system, one can never guarantee immutability of addresses. For instance, what was once Berlin was divided in East- and West Berlin, and then, in due time, re-joined again. Still, it seems likely that for, say, a package delivery service, the address really is whatever the business uses to identify a package destination. Wouldn’t this suggest that sometimes, it is possible to have a mutable business key? Or would it in your opinion make more sense to say that sometimes, a business key is not the appropriate means for identification, and that a surrogate key serves the EDW better?

a) Hubs for People is a difficult proposition, unless you have a master data source, or a source application that already attaches a unique business key, or you have a people “cleansing” process that puts the information together in a source system / system of record.
b) otherwise, hubs for people should be avoided. instead, opt for putting your people information in as Satellite data

I am not suggesting that each of these get their own separate hub and satellite. I am suggesting that you attach the people to a different, already identified business key. Let the “people” be attribute descriptors rather than based on their own business key.

I am also suggesting that putting together a business key for people is very difficult if all the sources don’t share a consistent key. Not to mention the simple act of identifying people is just plain difficult at best.

Yes, it is safe to say that the process that matches same-as may produce invalid results. In fact, what you need to record in those same-as links is the confidence and strength coefficients of the resulting match. That way, you can delete the ones that are truly not good. This same-as link is a part of the logical Business Vault, therefore, it is to be managed as such (post business rule results).

Yes, it is always possible to have mutable business keys – this is the nature of business keys. To say otherwise would be nuts. Mutable business keys are a fact of life, hence the definition of hubs: business keys have a low propensity to change… (part of the standards doc). Anyhow, the idea is that addresses generally are prone to far more errors and far more duplication than most other “business keys”, so constructing a business key for address is just plain hard to begin with.

Now, if your business (or your customers’ business) makes money off of addresses (like postal services), then they better have a source system application that cleans them up BEFORE they reach the data warehouse. In other words, a master-data system for addresses *should* be priority #1 for that particular business, and assigning business keys *at that point, in that master source* should be part of their business processes.

No, The discussion does not make the point that surrogates are better for addresses, in fact the opposite is what I am suggesting. I NEVER try to construct “surrogate made up keys” for addresses in Data Vaults. It just is not a good idea, and will come back to bite you later in the implementation / life cycle of the EDW. What I am suggesting is: as just stated either find a way to get a business key FROM the source, or leave it alone, and post-process it on the way to the business Data vault.

Thank you Dan and Roland for an interesting discussion. Your second piece, Dan, clarifies couple of notes I had.
Anyways, my initial approach is to make the best effort to create a “master” list of people or addresses that includes distinct business keys that I gather from various sources. This is done on pre-DV staging area. Then, I’d use this list to create a “master hub” with multiple satellites in the vault, ideally, will use of atomic level info such as SSN or 3d coordinates. But since it almost never available, I’d use the best to identify the unique keys. Of course, there is always a risk of adding another source with “unconventional” bus keys structures later; the decision has to be made accordingly. Having such “master” hub, I found it easier to use it (or views) along with its satellites to build a presentation layer down the road. IMO, ease of build/rebuild of a presentation layer worth some work spent at the foundation layer (DV).
Thanks,
–Raphael

Remember: ANY time you change data UPSTREAM of the EDW / DV, it had better become a system of record, otherwise auditability is compromised. Master Data or Master Systems should be a source to the EDW anyhow.

Just remember, to always keep the PRE/before modified and POST/after modified information, so that audits can see and trace how this data got transformed.

Be warned, applying business rules on the way in to the raw data vault is neither advocated, nor condoned.

In the Netherlands we now have the BAG (basic registration addresses and buildings) which is a government sponsosered basic registration service that contains ALL addresses and buildings and assigns them a public ‘immutable’ technical id. It is publicly accessible data service, so there is currently no excuses for NOT having a dutch address HUB 😉

For persons, this will always be AT BEST a hub with a context of say country residents (you can use the country’s SSN for that).

a) when I build Data Vaults, I try to set standards for international data storage. IF your Data Vault is country specific, then the approach you speak of may work. If however your Data Vault houses international data, then address keys from a specific country may then constitute their own Hubs or, as I have mentioned, the only international approach is to put all address information in to Satellites, and key them on the way in to Business Vault tables.
b) With persons, again, the same thing applies – some countries don’t have unique identifiers or even SSN.

My point is: when building Data Vaults or Enterprise Data Warehouses it is extremely important to think globally (unless you know for a fact that the data set will never house global data). Too many people forget this point, and when that happens, it leads to re-engineering when the company DOES go global.

This is indeed something to be aware of, although a lot of DV’s over here are mainly local and cannot/will not go global.

The problem is that even in global Data Vaults you might want to use/require country specific knowledge to gain DQ in your DV. This means that both the global and local perspective need to be available/maintained in your DV. This is important for most Dutch DV’s I know of. It is typically European to always honor all local data rulings (and that is something basic generalization destroys), which IMO is different than US thinking about data (models).

Some of these issues can be resolved in a Data vault using specific entity generalization and specialization rules, but it is something that requires attention, knowledge and modeling skills.

Forgive me if I’m restating what may be obvious, but regarding solution #1, does that mean that a sales database (for example) would have no Customers hub, but rather the Orders hub would have a Customers satellite, with fields such as CustomerID, FirstName, LastName, StreetAddress, etc.? I’d be concerned about the size of this table, as each order would have an entry in this table; if typical customers place many orders, as in an industrial setting, your SatCustomers table could be 1000x larger than it needs to be.

If my understanding is correct, how would you record customers who do not yet have any orders? Would you allow NULL in the satellite’s HubKey field, or use a dummy Hub record representing “no orders yet,” perhaps?

Not quite the right understanding, and not quite the right nature of the post.

The real problem? is a business problem.
Data Quality, synchronization of addresses should be addressed on the SOURCE systems. Especially if addresses are that important to your business. They should “never” be addressed in the warehouse. Address cleansing is a BAND-AID to a business problem that needs to be fixed up-stream.

That said, if you have multiple systems delivering multiple addresses, then they need to be represented in the warehouse. How do you do this if the addresses themselves HAVE NO REAL BUSINESS KEY??? The answer lies in the blog post.

That said, the same question goes for identifying people, and corporations: HOW do you identify (unqiquely and consistently over time) a SINGLE person, or corporation? Especially when their name changes? That, also is part of the intent of the blog post.

The blog post is not about size, or scalability, and if you are struggling with that aspect of the data or the model, then you are focusing on the wrong problem. The focus really needs to be turned back to the organization to solve the business problem.

The good news is that my friend: Nols Ebersohn will be presenting this information at http://wwdvc.com 2017 especially when he talks about Data As An Asset combined with Data Quality.

Furthermore, 1000x is quite a multiplier. If your business really has this problem, they are hemorrhaging money on the source system, and I would have to ask the CFO why they haven’t addressed it yet in the source systems?

Technically speaking: make your Raw DV insert only, and turn on compression. Should solve this problem with 70% reduction in the “duplication” of data as it were.

It is good to see that this thread is still active – since I just now am experimenting with DV modeling and if/how it would fit our needs – and I was puzzling over what to do with our Student information (aka Persons). We have developed the ability to link together the various versions of student attributes over time and are currently using this in our systems. We establish these links downstream from the source systems.

Since we actually have these links established – I am wondering if it would be a good approach to still model the student attributes as satellites to a student hub in a DV model (with what as the business key??) – and then use the links we have established to build same-as links for that student hub. The other alternative would be to use our internal student GUID as the business key for a student hub – but since that is not a true business key – I’m leaning against that idea.

Persons – well, if you do have a business key for persons generated by the source system, maintained by the source system, then you need to use that as your identifier / BK in your Hub in the DV. If you don’t – then that raises the question / issues I put in to this blog entry.

Regarding the approach, there are three or four different modeling solutions that all “technically” will work, one of them you’ve already got “downstream of your source system” as you put it. The next, is what I suggested: separate Satellites, The next is: develop some sort of “master person key” in the source system, get the source application to apply it – usually requires significant rework of the source application to close this GAP (but the benefit is TREMENDOUS value). The next is a variant of DV modeling: where the Raw DV uses satellites for different attributes, then – there is a mining or correlation or statistical process that populates a “business master Hub” with a “business master Satellite”, and coalesces data together by business rules…. I would tend to go with the last option here.

Yes, it would be a good approach to still model the student attributes as satellites to a student hub in a DV model IF the source system has a BK assigned to the concept of Student in the source system. IF the source system does not, then go with the “business DV approach” I just finished outlining.

I don’t like the idea of using a GUID, because if you move platforms, or upgrade, or migrate – the GUIDs will all change for the “same students” – then, how do you trace the “old student record to the new one?” – the only way to trace, is to capture the KEY changes from the old GUID to the new one, and load that to a Hierarchical Link.