#datavault 2.0 hashes versus natural keys

There’s been some discussion, nay, arguments of late – around whether or not to replace surrogates in DV2 models with Hashes, or to simply use natural keys… ok – perhaps Natural keys is too strong a word, maybe business keys is a softer side (as frequently found, source system models these days usually contain surrogates acting as business keys, and natural keys are no where to be found). This blog entry (albeit short) will explain the pros and cons of each. I welcome you to add your thoughts to this post (on LinkedIn “Data Vault Discussions” group).

Let’s take a look at how we define each.

Natural keys: data elements or attributes (may be composite, may not be composite) which when viewed by the human intellect appear to contain some form of descriptive metadata, enabling deciphering of the true meaning or representation without “adding” additional context (ie: without looking up additional information to figure out what it means).

Business Keys: Any “unique identifier” that is presented to the business user, for the purposes of uniquely finding, locating, and understanding data sets. It *may* be a sequence number, or it *may* be a natural key. Frequently these days, Business keys (as bad as it sounds) are generally sequences provided by, generated by, and maintained by a single source system application.

One part of the Data Vault 2.0 Standard requires changing from Surrogates in the Data Vault Model, over to Hash Keys as the primary key for Hubs, and Links. The question is: why not simply use the Business Keys OR the natural keys? Why go through the trouble of hashing the Natural or Business Keys to begin with?

Let’s examine this a little deeper:

#1) What is driving the need to switch OFF sequences to begin with?

Sequences cause bottlenecks in Big Data Solutions.

Sequences require dependencies in loading cycles, slowing down real-time feeds, regardless of whether or not referential integrity is turned on in the database physical layers.

The bottlenecks are to a Hybrid system, one that uses Hadoop (or NoSQL) for staging and deep analytics. These bottlenecks are because of the dependency chain mentioned in item #2 above. The load to “hadoop” or NoSQL requires a lookup on the Hub or Link sequence in the relational system before it can “insert” or attach it’s data. Sure, you can copy the file in to Hadoop, but the “sequence” has to be attached before it can be joined back to the relational system. This defeats the purpose of having a Hadoop or NoSQL platform to begin with. The simple fact is, it has to “LOOKUP” one or more sequences from the relational database, what a pain!

In a fully relational system, larger loads require “longer” cycles for loading. Why? because the Hubs must be inserted BEFORE the links can use the sequences, and BOTH must be inserted BEFORE the Satellites can be inserted. Why? because the Satellites depend on the Sequences generated by the parent tables. This happens EVEN WITH referential integrity shut off at the database level.

Neither of these situations are sustainable, nor even advisable in big data or high velocity systems, forcing the data architects to ultimately “re-engineer” the system for growth or arrival speed. Oh, there’s one more issue: MPP data distribution. If data is distributed by “ranges” of sequences, then there is a chance of forcing a Hot Spot to occur in an MPP environment. Nothing that anyone wants, and again defeats the purpose of having MPP in the first place.

Enter: Hashing.

Hashing is currently in use (behind the scenes) by many database engines in the MPP world (both relational and non-relational MPP systems) to distribute data along “buckets”, and preferably, reach an average equal distribution in order to avoid the hot-spot problem. That, my friends, still leaves us with the bottleneck and dependency issues to deal with…

Enter Data Vault 2.0

Data Vault 2.0 Modeling (the modeling component ISN’T the only thing that has changed for Data Vault 2.0), provides a standard that states: replace all sequence numbers with the results of a Hash Function. Then, goes on to suggest several hash functions to choose from: MD5, MD6, SHA1, SHA2, etc… And the standards document / 5 page white paper I’ve made available (to my students only on LearnDataVault.com) describes the best practice of how to implement it properly, and cross-platform.

The result of the suggest hash: MD5 is a 128 bit quad-word. In reality, it’s two separate Big Int’s (if you will). Unfortunately in order to handle this kind of data-type natively, the Operating system and / or the Database engine would need to declare and provide a numeric capable of handling 128 bits in length.

Due to the fact that this is simply not feasible today, we change the result of the hash FROM the binary representation in to a CHAR(32) (128 bits converted to ASCII HEX STRING). Again, all of this is covered for the students on LearnDataVault.com – those who buy a class, or buy the Super Charge book PDF from the site.

Back to the point and the comparison….

Hashing instead of sequencing means that we can load in complete 100% parallel operations to ALL HUBS, ALL LINKS, ALL SATELLITES, and enrich all Hadoop based or NoSQL documents IN PARALLEL at the same time (without dependencies). Providing of course that the proper hash values have been “attached” to the staging data. It also then allows us to join across multiple heterogeneous platforms (From Teradata to Hadoop to IBM DB2 BCU to SQLServer, to CouchBase to MySQL and more). (there is more to this discussion, about collisions, and so on – that are beyond the scope of this article at this time).

Back to the original question…

So why not simply use “business keys” or “natural keys”? Why Hash at all?

Not all database engines (relational or non-relational) have the capability or capacity to use natural or business keys for data distribution

Not all database engines (relational or non-relational) have the capability to execute EFFICIENT joins on natural or business keys

And the BIGGEST reason of all: in a Data Vault Model we leverage a many to many relationship table called a Link. It is made up of multiple keys (from different Hubs). To join ALL these keys together, would mean replicating the business keys to the Link – resulting in (most cases) a variable length multi-part, multi-datatype key set, which would ultimately perform slower than a concisely measured, precise length field. For Satellites, it means replicating the business keys to each of the Satellites as well.

Another statistic: 80% of the “business keys” are variable length character strings (most in Unicode these days, making them twice as long).

The reality of it is?

The business keys that are “longer” than 32 characters in length on average, will perform slower over the life of large data sets than those which are 32 byte hex represenations. Of course, the 32 byte ASCII Hash hex strings will perform “slightly” slower than the sequences (which are smaller yet), but more than make up the query performance difference by resolving the other issues mentioned above.

Hashes, like it or not, serve a purpose. Natural keys and business keys, as good as they are cause additional join issues (like case sensitivity, code-set designation, and so on). Believe me, solving a heterogeneous join (from SQLServer for instance to Hadoop) and dealing with two different code-pages, can cause other problems that JDBC won’t solve.

When to use Natural or Business Key joins instead of Hashes?

When you are embedded on a single system (like Teradata, or GreenPlum for instance) for your entire data warehouse solution, and don’t have “cross-technology” or heterogeneous technology joins to accomplish.

When to use Hashes instead of Natural or Business Keys?

When you need to join heterogeneous environments, or resolve Unicode and case sensitivity issues, or the length of the keys exceeds 32 characters (non-unicode) in length.

In the end, it’s your call if you want to “replicate” business keys/natural keys to Link Structures and do away with Hashes altogether. It would certainly solve the problem, but if you go this route, please do NOT return to using sequences, oh – and be aware of the issues mentioned above.

Yes, the reason is “availability” across multiple platforms. MD5 is embedded and provided by most ELT and ETL engines, as well as available in most RDBMS cryptographic libraries these days. It is most ubiquitous.

Regarding “better” or “faster” to use Murmur3 (to be accurate – you must select the 128 bit output option), you would have to run those tests yourself, as the site you pointed to does not contain comparative tests with MD5.

IF you prefer to use Murmur3 128 bit, be my guest – BUT – make absolutely certain that it is available cross-platform where your data warehouse will store it’s information. This would be the ONLY way to ensure heterogeneous platform join capabilities.

You might also want to check Perfect Hash:

I also did not choose a “perfect hash” function, for the same reason I did not choose Murmur3 – availability.

However – remember that I stated in “Data Vault Discussions” on linkedIn, the Data Vault 2.0 standard only requires a COMMON HASH value and common hash function to replace the sequence number. It does not dictate WHICH hash function you “must” use.

”
The result of the suggest hash: MD5 is a 128 bit quad-word. In reality, it’s two separate Big Int’s (if you will). Unfortunately in order to handle this kind of data-type natively, the Operating system and / or the Database engine would need to declare and provide a numeric capable of handling 128 bits in length.

Due to the fact that this is simply not feasible today, we change the result of the hash FROM the binary representation in to a CHAR(32) (128 bits converted to ASCII HEX STRING).
”

Depending on the implementation, hash functions like MD5, SHA1 etc return either the “raw” (binary) value (16 bytes for MD5, 20 bytes for SHA1 and so on), or the hex representation of that value as a characterstring (which would be twice the length of the binary value when expressed in ASCII, since each binary byte gets coded to two characters in hex representation)

Since the hash codes are used as keys, and are primarily used to join data sets together, why would you want to store it as an ASCII string? Instead of converting the binary value to a hex representation as a character string, I would simply store the binary value. And if the hash code implementation you have at disposal returns a hex characterstring, first convert that to its corresponding binary value, which will be half the length of the original)

Because the binary value is shorter, it’s bound to be somewhat faster efficient in joins and lookups, and will require less storage.

I guess hashing makes sense sometimes. But I think there are to many scenarios where it won’t. Most ‘enterprise’ datawarehouses are still on SMP systems and many companies don’t need a hadoop cluster.

Due to the randomness of the hashes a lot of fragmentation will occur at the leaf level of the B-tree. Which will impact the (bulk) load of new records (more pages need to be in memory) More maintenance has to take place (index rebuilds) And query response will be slower.

Suppose you have a small table with 200 records divided over 10 pages in the leaf level. When you insert 20 new rows all rows will be inserted in the last (10th page) or a new (11th page). That is when you use a sequence. When you use hashes rows will have to be inserted in pages 1 to 10 causing page splits and you’ll probably end up with 20 pages that are only partially filled.

In these cases you’ll probably be better off when you drop the index and rebuild it after the load. (depending of course on the number of rows inserted and the total number of rows.)

So I would be careful to propagate hashing in these scenarios. Hashes as a key is a nice addition in some scenario’s but I wouldn’t present it as the next standard.

Hashing isn’t _just_ about Hadoop clustering, it’s about Big Data, and it’s about high velocity data – even in relational systems. The problem is a conundrum faced by all systems that use sequences for loading cycles. I will explain this in detail in yet another blog entry, and then I will record a full hour long class walking everyone through the pros and cons of hashing, the details on hashing and hashing vs sequences.

By the way, there will not be fragmentation – unless you are using clustered indexing, and in which case, clearly the system built with clustered indexes does not use or need “big data or high volumes”. Clustering on indexes defeats the purpose of MPP and data co-location, or data distribution across multiple nodes. So, to put it quite frank: I’ve avoided and shut off clustering for many many years – because of the problems when it comes to “force-fitting” data in to blocks on disk where they don’t necessarily belong. But this is a long discussion that we would need to settle in the class room.

Now, regarding drop & rebuild of index – no, I don’t recommend you do that either, except on SINGLE partitions, where the volume is managed. Otherwise, it can overwhelm the I/O (particularly on large data sets across the entire EDW).

No, hashing IS required if you are going to do Data Vault 2.0 – BUT the choice of which hashing algorithm you will use, is up to you. Otherwise, you are still building DV1, and the system can and will suffer the consequences of bottlenecked loads.

The MD5Key exists only in the Link tables. Its primary purpose is to replace all the subsequent joins to the “parent Hubs” when loading from the staging tables. This column replaces the primary surrogate in the Link Structure.

I’m implementing for the first time, and I’ve also cremated hash keys in the hub table as the PK. If the business key is composite, then the key is concatenated and hashed.

Is this the proper practice per DV2.0? Seems to contradict the statement that hash keys only belong in link tables.

The article is incorrect and needs to be fixed. MD5 Key exists in Hubs and Links both as the PK of the respective tables. I am in the process of re-writing the article completely / cleaning it up as they say. DV2 standard: Hash Keys in Both Hubs and Links.

Hash Differences are optional and appear in ONLY the satellites (non-pk).

Great article on hash keys, you mention “when not to use them in joins”, here you refer to not using a hash key is you are on a single platform and specifically you mention Greenplum. I was interested to know what the reasoning behind this is? You stated previously that sequences (BK) can cause issues in Big Data solutions, the Greenplum distribution system appears to prefer hash keys for getting the most out of it’s buckets. Your thoughts would be appreciated.

Thanks for the feedback! I am glad to hear from you (my readers). Helps me see which articles are good for the community, and what I should be writing about.

Sure, Systems like Greenplum and Teradata actually hash Business Keys / (known as primary indexes) already to perform data layout across MPP environments. These systems also offer optimizations with Business Keys that other systems don’t – so hashing one or more BK’s can become a redundant exercise – BUT only on these systems, and ONLY if the Business Keys are “shorter” than CHAR(32).

The latest round of standards (not yet published) is to attempt to use Binary(16) instead of Char(32) – but as I indicated earlier, this only works on platforms that support a FIXED BINARY as a primary key, indexible, etc… We still have to convert it back to Hex ASCII Char(32) in order to “join” it to heterogeneous platforms (like most NoSQL platforms are).

The other issue is: Hashing a Hash can cause additional “collisions”, however, the original PK hash I am suggesting will be used for uniqueness, the “hash of a hash” that Teradata and Greenplum execute are bucketizing hashes – they don’t care about collisions. So it’s a moot point. Even Hadoop with Hive and HBase use Hashing to bucketize data and spread it over MPP nodes.

So in the end, the hashing is all about eliminating “load state dependencies”, as well as providing “heterogenous platform / environment” joins with ease. So sequences are OUT, especially if you are building to DV2 specifications.

With regard to Hashing VS Business Keys: again, if 90% or more of the BK’s are “smaller in length” than a fixed CHAR(32), then most likely, you will get better performance simply using the BK’s – but that means, your BK’s are filled in properly, and are generally single field values (not composite, and not multi-field composites). Now that said, it also means that your BK’s are NOT UNICODE!!! A BK of length 32, when changed to UNICODE is actually 64 bytes in length, even though the “text” is 32 visible characters. The hash is a fixed ASCII length of 32 bytes, and now – can be compressed on RDBMS (in some) to fixed binary(16).

First of all I find the usage of hashes as PK to be an exciting concept especially due to the parallel loading aspect, have not been working too much with cross platform joining of data.

My only concern is the collision part, in my view that aspect is a complete deal breaker. It would be nice if you provided a hint on what the official stance is on that topic as it seems to be pretty central.

Another small thing I would suggest is that if CHAR(32) is to be used as they PK column, at least for SQL Server, that a *_BIN2 collation is used for that column at least. It uses binary comparison algorithms which also makes it faster in joins, group by, sort etc.

A question I have is, if I am correct in assuming that the source system is included in the HASH (to avoid identical keys from two different source systems)?

Ok, This post was a while ago. Collisions are an issue, and yes, you have to deal with it. The standards state (once again) that you MUST use a hash as a PK to be DV2 compliant. They do not state WHICH hash function you have to use. Hashes as PK’s are fairly new in the relational DB world, they’ve been around as FILE identifiers for as long as I can remember (the NoSQL World…) I am working on a cross-platform collision handling solution that I hope to publish in a year or two. it takes a LOT of research and design to make these things work in the environments we are faced with. I cannot afford to “simply state” something easy that works in one platform, but not another. That would not make it a standard.

Now, for SQLServer – you CAN use Binary(16) instead of CHAR(32) – but it seems as though not all RDBMS platforms support that transformation either. Remember: CHAR(32) is UNIVERSAL, and works in Hive, MongoDB, Cassandra, Oracle, DB2, Teradata, SQLServer, Sybase, Teradata, and a host of other platforms… Hence the heterogenous platform join capacities. Binary(16) might be available, but *not* supported as a Primary Key.

And no, you are not correct in your assumption. The Source system should NEVER be a part of the hash. Identical keys SHOULD collide, that’s the nature of the definition of a business key housed in the SAME hub. I define these principles in my book: Super Charge Your Data Warehouse.

About the SQL Server – it was just a platform specific tip on how to reduce to performance hit taken when moving from a integer type to a char type column.

I read the book Super Charge Your Data Warehouse which I acquired from the DV (1.0) online course put out on learndatavault.com.
Admittedly I have not been involved in working with a professional installation of Data Vault but I’ve been working within DWH/BI for some 10 years now or so. So what I know about DV is theoretical and hobby projects which enjoy very much …

But about the source system as part of the key I might need to provide some context (this may be a bit off topic):
Say a company ‘A’ is using DV as a DWH and they have a hub which represents the chart of a accounts sourced from an accounting system. Now they acquire a new company ‘B’ which has their own accounting system. The decision is made to incorporate company ‘B’ accounting data to the DWH.

Data Profiling might show that both companies have an account with business key 1000, however for company ‘A’ it is “IT expenses” and company ‘B’ it is “Fee income”. Obviously you can’t just load the data as they text for that business key will change with every load as there will be two source data sets with different text for the same business key.

But If source system is not to be make part of the key – which I understand now because it is part of the “system” columns – what then is the solution? If I understand correctly the book provide 3 ways to handle this problem.

#1 Add a sub sequence to the hub.
#2 Add a new hub for “Company” and “Chart of Accounts” is now a Link/Link satellite.
#3 Add a new key column and make it a part of (enhance) the business key.

From my point of view solution #1 will not suffice as the distinction has to be meaningful and solution #2 is not sustainable as it would have to be replicated to a lot of other entities and result in a very messy data model. So solution #3 seems to me to be the solution – however it should not be the source system column specified in the Data Vault standard that is what you are telling me.

This is not something I just invented hypothetically. I was involved in a DWH project where this scenario was common (common data center for some 130+ similar companies) and at least physically the source system was part of all business keys by design.

If there is no “official” answer then I do not consider – and I do not mean this pejoratively – the standard to be complete, unique primary keys seems slightly important to me. That is fine, I’ll just have to go look for/figure out a solution for it should I choose to roll with it 🙂

At this point, these are topics which I cover in depth in my Data Vault 2.0 Boot Camp and Private Certification course which I am recording to put on-line ASAP. There are two ways to dive in to your questions: 1) take my course, 2) contract with me for training using my web-consultation training form.

There are official answers on how to handle this situation, and there are better solutions than using record source as part of the key. Only one of the solutions is to split the Hub data apart. But what I teach are a set of rules and standards, by which you can judge the right solution based on the case.

Thank-you for the wonderful interaction, keep up the great work in Data Vault!
Dan Linstedt

Please post these kinds of questions on the LinkedIn Data Vault Discussions forum, there is a large group there who are happy to discuss these types of things. Also, I discuss these types of things in my new book: Building a Scalable Data Warehouse with Data Vault 2.0 – you can find it on Amazon.com

The short answer is: yes you can create a hub with multi-part business keys, that is the typical practice.

Thank-you kindly for supporting Data Vault – I am happy to see you active in the community

you seem to have completely ignored the cost of hash in a warehouse!
Space is doubled.
using a hash will produce on average twice the disc space consumption vs a non have sequential integer solution. 4 bytes per key (or 8 if you like) vs 32 bytes for hash.
in multi key tables, the increase in size is substantial. ie tables like hubs and measure tables or fact tables with many keys and small actual data.
I am using hash and this is what we discovered.
Performance overhead of about 20-25%
it is absolutely slower. Due to longer keys and longer record sizes, more cpu, more IO and more spool.
our testing showed 18% more CPU and 23% more io,leading to longer run times.

Master Data
mostly we use these surrogates for Master data, and we simply do not generate enough keys in single load process to warrant such latter downstream overheads.

Domain Collisions
collisions are real and MD5 not a good choice. but generally the smaller the input key domain and the larger the Hash output size, the less chance of collision, BUT it is always there.. so I would like to see some comments on the verification steps needed and cost to load of Collision ManagementI. f Integrity of data is important, then this is important.

The question is whether the benefit for loading faster is outweighed by the cost of usage..
if I load once and use many, then hashing has significant costs.

I appreciate your thoughts on this, however:
a) hashing or hash bucketing is widely used in Big Data / Hadoop and MPP systems for data distribution. If hashing caused that many issues, big data systems would have reverted to sequence numbers.
b) Data Vault 2.0 uses hashing for parallel independent stream loading, so that an architect can construct a solution with heterogeneous systems – all loading data at the same time, particularly across relational and hadoop instances, and across global divides where systems “can’t communicate” except at specific times / windows.
c) Sequences do not scale. Quite simply put in big data solutions (300TB per hour or more), sequences cause such tremendous bottlenecks that the systems break, and have to be re-designed. It is at this point that re-engineering of your entire model plus all the historically computed data must take place. At that level of data size, it is a nightmare to re-engineer your solution to leverage hashing going forward. You want to architect / design the solution ahead of time so that little to no re-engineering occurs as a result of huge volume growth.
d) because we split business keys across multiple hubs, the chances of collision (even with MD5) are next to none. Yes, they do exist – but you would have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place.
e) regarding size: in Oracle and SQLServer you can switch the storage to a binary formatted PK, reducing it to 16 bytes maximum or a quad word.

There is more, a lot more to this than meets the eye – I teach many of the concepts and constructs in depth in my CDVP2 class. I’ve just begun to touch on the tip of the ice-berg as they say. However, I hope my reply has been helpful. Performance is a whole different ball game, and in order for me fully to comment on the performance – I would need to understand the sizing of your system a lot better.

well Dan,
you did not really address the concern did you?
I am not against using hashing, but before I do, I would like to understand the full implication!
I specifically called out master data in a warehouse.

My concern is that one should enter into a decision to use a Sequential Integer key vs a hash key with eyes open..

Addressing your points however…
Part(e) IF, one reduces the size of the output of a Hash, the possibility of a collision Increases.
IF, one reduces the natural domain of the key, then the possibility of a collision Decreases.
Thus goes the evidence based research, I went and looked it up and read it. Ergo, a collision is possible. Decrease the hash output size as you suggest in Part (e) of your response would suggest an increase in the probability of the collision occurring. If integrity of the object does not matter then fine.
IF INTEGRITY of the object does matter, then what is the cost of dealing with the aftermath of a collision? That is part of the support cost that has to be discussed and evaluated.

In Part (d) you make a statement that suggests 50% probability of collision is not a problem. I am fairly sure many would beg to differ on that position.. consider Account Numbers for bank accounts or trusts or anything really where integrity of the object was important. I mentioned I was interested in Master data.
IF Collision is not an issue then there is no problem hashing may be appropriate.

in Part (c) sequence do not scale. NO they do not scale in generation. “Does it matter?” is the question surely?
If the key domain is not volatile, say account numbers, the change per load is small. a sequential creation of no real importance.
One may then use parallel methods to assign the keys to the source after generation. A two step process but NOT a real problem. Master data, has these characteristics and I specifically called that out.
Once in use, the relevance of generation is no longer important, but cost of use is important.
Both hash and sequential keys scale in query use.
If I load ONCE and use 1000 times, then Hashing has distinct cost disadvantages. Take a hit on load or take a hit on use many times that is the decision to be made. Once generated a surrogate key whether of Hash or Integer is no longer an architectural issue as in Use they both scale, one just costs more at ingestion the other more in usage. One should understand these traits.

Part (b) you address multi stream parallel loads. A valid scenario. I wonder how many have a different scenario, where we have a central warehouse and multi streams at different times but what we want is a single view of our most important most often used data? That the volatility of master objects is low. The hash solution does not apply universally.

Part (a) .. you mention big data as if this proves your point! It does not. It may not be in your view but Big Data is well within mine… we use more cpu and more disc in Big Data when we use hashing for keys. We choose Hashing if data integrity is not an issue (most often this is about analytics and averages and a small issue of wrong assignment is not a problem).
When the data is intended to have integrity, then that is a problem. With Hashing we do see increased cpu as the keys are much bigger, we do use more disc space.

The reason we choose hashing is we tend to be loading ONCE and using few times, so the costs are not as significant vs the ease of programming the loads.
When we intend to load ONCE and USE VERY MANY times, then an important decision has to be made and an evaluation of the costs and other tradeoffs made.
The matter of the cost of loading vs the cost of using never goes away. The relevant costs need to be evaluated.

the Arithmetic is simple and it does not lie…
increase the size of a key and it will cost more to use.
Reducing the hash to ONLY 16 bytes vs a 4 byte integer for my master data produces 12 more bytes per key representation. So it takes longer to get to the 50% disc overhead, but you do get there!

Reducing the key size to just 16 bytes over a warehouse is still using 18-42% more space.
More space, more io to read the data, more cpu to process the data, more spool to hold the data.
I was and will continue to point this out as it is TRUE.

The use case may justify the extra cost, but I will never shirk from pointing it out to those who have to pay..
With just 1 key per table, the space is only from 6-18% overhead (50 to 200 byte record sizes)
with two key per table it raises to 10-29% per table.
with four keys per table it raises further to 18-42% per table.
and so on.

The point is one makes a decision on what surrogate key type to use, NOT because everyone is doing it, but because is makes financial and architectural sense. To make an informed decision is the correct approach.. Denying the realities of the limitations of Hashing is inappropriate.

Thank you for taking the time out of your busy schedule to have a thoughtful conversation about this. Let me see if I can address your latest comments.

Yes, i did say this:
Part(e) IF, one reduces the size of the output of a Hash, the possibility of a collision Increases.
And yes, this is mathematically correct. That said: the “size of the output hash” is specifically referring to the bit length of the hash result. In other words, using CRC32 instead of something like MD5 will reduce the “size” of the output hash. Collisions are frequent and common with a “smaller bit length output” from a hashing algorithm. If you are worried about collisions, increase the size of the output. Elect to use a larger hash like SHA-1, or SHA-256.

You then said: IF INTEGRITY of the object does matter, then what is the cost of dealing with the aftermath of a collision? That is part of the support cost that has to be discussed and evaluated
My reply: the cost of dealing with the collision will vary depending on how many times the collision occurs over a large number of keys. I deal with medium sized systems every day (250TB and above). That said: the number of collision mitigation strategies are fixed based on mathematics. So, again if integrity does matter either continue to use sequences (which I will discuss the problem with scalability and restartability again), or use hashing and design a proper collision strategy (which is completely possible / feasible) and has been done by most of my customers. The cost of a collision and dealing with it will vary slightly depending on the mathematical mitigation strategy chosen to handle it.

Your reply: “Part (b) you address multi stream parallel loads. A valid scenario. I wonder how many have a different scenario, where we have a central warehouse and multi streams at different times but what we want is a single view of our most important most often used data? That the volatility of master objects is low. The hash solution does not apply universally.”
My concern: you say the Hash Solution does not apply universally. I completely disagree. my research, and in fact my customers implementations all prove otherwise. I would however like to better understand why you think hash solutions do not apply universally? I don’t think I fully grasp why you made this statement.

Also: central warehouse, multi-streams at different times – well, hashing addresses all of these concerns including single view of ALL data. Hashing is a finite answer – put the same bits in, get the same hash out – every time (except when there is a collision). It has nothing to do with the volatility of keys, or the key space (ie: business keys). That, is a completely different discussion all together (which I am happy to have on a different thread).

Unfortunately I’ve run out of time this morning, I will return later to offer further comments to your thoughts.

Thank-you kindly,
Dan Linstedt

Yes, the cost needs to be evaluated – however you also said: “Denying the realities of the limitations of Hashing is inappropriate.” to which I say: I have *never* denied the realities and limitations of hashing. Hashing is a mathematically proven set of algorithmic functions with a specific and guaranteed output, to which the mathematicians have designed collision mitigation strategies. That said: there are such things as “perfect hashes” – which you completely ignored in your response, and I did put in my original blog post. Although, they have not yet been proven to be “perfect” because it is an infinite problem space to “test”.

You said: In Part (d) you make a statement that suggests 50% probability of collision is not a problem
No – you misread my statements, and if I did say that 50% probability of a collision is not a problem, then I made a mistake, and I appologize. 50% collision is a problem – however that said: there are two things at work here: 1) to get to a 50% collision CHANCE in a Data Vault model you have to produce 6 Billion new business keys per second per hub for 50 years. So I ask you this: is this what your system is doing? 6 billion new business keys per second per hub?? that would be quite some feat…
Collision IS an issue but the appropriate collision strategy needs to be chosen. For example: a well known bank in the US has chosen a reverse hashing strategy. They store the results of the first MD5 as the PK (as suggested in my documents and my book), they then reverse the BK string and hash again. IF a collision occurs, they then swap the two hash values – but I do believe they keep both. To my knowledge they have yet to have a collision, and yes – they do watch for them as well. Again, I think you need to take a look at collision strategies. Some work in data warehousing, and others do not. Another fortune 50 company chose to apply SHA-1 as their starting point to mathematically reduce the possibilities of having a collision in the first place. However, they too have an additional collision strategy in place to handle the event should it occur.

Your response: in Part (c) sequence do not scale. NO they do not scale in generation. “Does it matter?” is the question surely?
My reply: You go on to say “If the key domain is not volatile, say account numbers, the change per load is small. a sequential creation of no real importance.” Hmmm. Well, from this perspective; volatility of a business key has absolutely zero to do with the choice to use a hash OR to generate a sequence. In either case if the business key itself changes then you should be creating a new sequence or creating a new hash to attach to it.

That said: Sequential creation has nothing to do with a “sequence number” – I think you misunderstood my posting. Sequential Creation means “the order in which you create Hash Keys on the way IN to the staging area”. I was not referring to “sequence numbers” – to which you go on to discuss sequences themselves. Sorry about the language mix-up. By sequential creation I meant: bear the cost of hashing once, creating hash keys in sequence as the data is loaded to the staging tables. You can – should you so desire, split the stage load in to parallel processes and assign hashes in parallel – it won’t matter. Hashing is designed specifically to work in parallel computation environments. Sequence Numbers on the other hand – require a two step process, plus lookups (to maintain referential integrity) and thus – the bottlenecks to the processing are introduced.

Please remember that I deal with table loads of 1.1 Billion Records per stage load – we cannot afford to “stop and lookup” sequence numbers before inserting Satellite data to Hub Sats or Link Sats. The cost is far too heavy / too great, even with partitioning in our loading cycles.

So I repeat myself: Sequences in a data modeling architecture do not scale – regardless of referential integrity in the target solution.

Now, you mentioned “creating sequences in the staging table that carry forward to the warehouse” (if I understood you correctly). This also is not an option in the size of systems that I deal with, this process requires (again) lookups for sequences that do exist, and generation of new sequences where the keys don’t yet exist, and then – it requires writing them either in to a NEW staging table (to avoid updates against the existing staging table), or updates against the existing staging table. Again, due to the lookups over the size of the data sets, sequences have been proven not to scale – and I should add the key word: efficiently enough to meet customer demands for performance. Now that said, there is a problem with this two step approach as well, and it’s restartability.

If the sequence generator breaks in the middle, and then – the data set in the staging area changes before a restart of sequence generation takes place, then the sequencer is done – finished, there is no mitigation strategy here – other than to clear all new sequences created and restart the complete sequence generation again. But that said: my customers never get to this point because they can’t afford to have the lookup bottlenecks in the sizes of data that they deal with. Also, remember: sequences are in fact, sequential in nature (it’s in the name). So – there are no methods for “parallel computation across multiple platforms for the same business key” – it requires a lookup from Hadoop to the relational store, just to “pull” the generated sequence down in to Hadoop to get the load completed. That, defeats the purpose of ingesting data in both systems, in parallel with large volumes at high speeds.

There is something that I am clearly missing. Business or Natural keys can be reused. E.g. given a bank sort code and a account number which is a Business Key, this may be reused over time to represent a different account.

It seems that you cannot use this method if your business key could be reused over time like this? This seems like a huge limitation to me.

Hi, Dan. Thanks for this post and your most recent posts on hash keys the last few days in August 2017. They are very helpful and have given me confidence to move forward with using hashes versus sequential numbers as surrogate keys. One question about a section of your post above related to reverse hashes as a mitigation strategy.

You wrote: “Collision IS an issue but the appropriate collision strategy needs to be chosen. For example: a well known bank in the US has chosen a reverse hashing strategy. They store the results of the first MD5 as the PK (as suggested in my documents and my book), they then reverse the BK string and hash again. IF a collision occurs, they then swap the two hash values – but I do believe they keep both.”

I’m still not following how the bank would actually know whether to use the original hash or the reversed hash (even if they stored both) when loading a link or satellite table with a BK that had collided in the BK hub table. Do they have a collision table per hub that tracks collisions and then they consult that collision table as part of the link or satellite table load process to see if a BK needs to use the alternative reverse hash? That look up process would be extremely fast and probably avoid any serious bottlenecks, however, it is still a lookup and I keep thinking a lookup is bad all the time, but in this case maybe it isn’t. How would that lookup table solution work in some of the problematic lookup scenarios such as cross-platform situations? A little more explanation of the technical implementation strategy would be helpful.

Hi John, The way it works is:
1) the hashes are added to the staging data (on the way in to the stage) OR on the transaction as the real-time data arrives.
2) Then, a quick SQL Statement that says: select stage.hash,count(hub.bk) from stage, hub where hub.hash = stage.hash and hub.bk != stage.bk group by stage.hash having count(hub.bk)>0

This gets you a direct hash match index bound SQL select operation running in parallel against the primary key of the hub, and tells you which business keys don’t match those in the staging table for the same hash. Ergo: a collision. Once detected, the hash (forward and reverse values are swapped) then both loaded to the Hub.