The definition of “big data” is, as with many buzzwords, not crystal clear, as examined in a post called “It’s time for a new definition of big data” on Mike2.0 by Robert Hillard. The post suggests that big data may be about volume, but is actually more about big complexity.

As I have worked intensively with large amounts of rich reference data, I have coined a homemade term: “big reference data”.

Big Reference Data Sets

Reference Data is a term often used either instead of Master Data or as related to Master Data. Reference data are data defined and (initially) maintained outside a single organization. Examples from the party master data realm are a country list, a list of states in a given country, or postal code tables for countries around the world.

The trend is that organizations seek to benefit from having reference data in more depth than the often modestly populated lists mentioned above.

An example of a big reference data set is the Dun & Bradstreet WorldBase. This reference data set holds around 300 different attributes describing over 200 million business entities from all over the world.

This data set is at first glance well structured, with a single (flat) data model for all countries. However, when you work with it you learn that the actual data is very different depending on the original sources for each country. For example, addresses from some countries are standardized, while this isn’t the case for other countries. Completeness and other data quality dimensions vary a lot too.
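One way to make that variation visible is to profile completeness per country before relying on any given attribute. The sketch below is a minimal illustration, assuming a flat extract with hypothetical field names such as country_code, postal_code and street — not the actual WorldBase layout.

```python
# Minimal sketch: per-country completeness profiling of a flat
# reference data extract. Field names are hypothetical examples.
from collections import defaultdict

def completeness_by_country(rows, fields):
    """Return, per country, the share of non-empty values for each field."""
    filled = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for row in rows:
        country = row["country_code"]
        totals[country] += 1
        for field in fields:
            if row.get(field, "").strip():
                filled[country][field] += 1
    return {
        country: {f: filled[country][f] / totals[country] for f in fields}
        for country in totals
    }

rows = [
    {"country_code": "US", "postal_code": "94105", "street": "Market St"},
    {"country_code": "US", "postal_code": "", "street": "Main St"},
    {"country_code": "DK", "postal_code": "2100", "street": ""},
]
print(completeness_by_country(rows, ["postal_code", "street"]))
```

A profile like this quickly shows which countries’ records you can trust for, say, postal code matching, and which need extra standardization work first.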

Another example of a large reference data set is the United Kingdom electoral roll, which is mentioned in the post Inaccurately Accurate. As told in the post, there are fit-for-purpose data quality issues. The data set is pretty big, not least if you span several years, as there is a distinct roll for every year.

Big Reference Data Mashup

Complexity, and opportunity, also arises when you relate several big reference data sets.
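A simple form of such a mashup is enriching one reference set by looking up a shared key in another. The sketch below joins hypothetical business entities to a hypothetical postal reference table on country and postal code — the data and field names are invented for illustration, and a real mashup would also have to cope with the per-country standardization differences noted above.

```python
# Minimal sketch: relating two reference data sets on a shared key
# (country + postal code). All data and field names are hypothetical.

businesses = [
    {"duns": "123456789", "name": "Acme Ltd", "country": "GB", "postcode": "SW1A 1AA"},
    {"duns": "987654321", "name": "Nordic ApS", "country": "DK", "postcode": "2100"},
    {"duns": "555555555", "name": "Mystery Co", "country": "GB", "postcode": "ZZ99 9ZZ"},
]
postal_reference = {
    ("GB", "SW1A 1AA"): "London",
    ("DK", "2100"): "København Ø",
}

def enrich_with_city(entities, reference):
    """Add the city from a postal reference table to each entity,
    flagging records whose postal code the reference set does not know."""
    enriched = []
    for e in entities:
        key = (e["country"], e["postcode"])
        enriched.append({**e, "city": reference.get(key), "matched": key in reference})
    return enriched

for row in enrich_with_city(businesses, postal_reference):
    print(row["name"], "->", row["city"], "(matched)" if row["matched"] else "(no match)")
```

The unmatched records are where the complexity, and often the opportunity, hides: they may be data quality issues in either set, or genuinely new entities.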


4 thoughts on “Big Reference Data Musings”

There is a lot of discussion around big data these days, and it’s often scary, as you said. It seems like a tsunami most of the time, but I think we MDM and data management folks just need to relax and see only the “things that matter”.

As you correctly mentioned, in the MDM world our urgent need is to look at the reference data lists, which are getting bigger. But if we really look at it, we have been dealing with growing lists all the time. It isn’t as big as the hype suggests.

Prash, you are right, we have grown up with reference data lists that have now become bigger, just like MDM has grown from CDI (customer data integration) and PIM (product information management) and has now become perhaps more complex.

One of the problems with these commercial reference data sets is that in the end, they’re just one opinion and in many cases they are designed to be somewhat out of date. Take the D&B list, for example. Their claim is that they are 100% accurate according to the latest verified data sources. The problem is that their verified data sources are things like a state’s corporation division that publishes incorporation lists. So while people in your own company may have contact with a customer and know that they’ve changed addresses, the D&B list will insist for years that it’s still the old address according to their “verified” source.

It’s a conundrum as to why we tend to believe sources like D&B rather than what our own people are telling us, but it’s true. Perhaps it’s the search for enterprise social proof — we’d like to know that someone outside of our organization has the same understanding that we do. It takes someone pretty bold to stand up in an enterprise and say our data is correct, even though it disagrees with the outside reference data we just paid big $ for.