Roland Bouman's blog

Sunday, March 18, 2018

Abstract: Matching JavaScript regular expressions against large input strings with V8 can result in memory leaks. In this post, I explain how to troubleshoot the issue using Google Chrome heap snapshots. Finally, a fix proposed by my son David (age 14) is presented.

Background

At Just-BI we developed a browser-based application for one of our customers. One way the application gets its data is by loading and parsing Microsoft Excel files. The app is successful and our customers are happy, a fact they express by attempting to load ever larger files.

On mobile Safari (iPad), our app starts crashing when the files reach a certain size. This happens around 5 to 6 MB. Arguably, that's not that large, but things are a bit more complicated: Microsoft Excel files, at least those of the .xlsx variety, are actually zip-compressed folders, containing mostly OOXML spreadsheet documents.

It is a well-known fact that XML is really verbose, and I suppose we should be grateful that our 5 - 6 MB Excel files uncompress to only 40 MB of XML.

We could debug the issue somewhat, and we noticed that by the time Safari crashes, it does so reporting it is out of heap space. We can't be sure if that's the actual cause, but we think we might be able to overcome or at least postpone this issue by somehow cutting down on memory usage.

This brings us to the main topic of our tale.

Parsing xlsx files in the Browser

To parse xlsx files, we use a particular JavaScript library called js-xlsx. This is actually a pretty nice piece of work, and I do not hesitate to recommend it. We have used it for quite a while without major issues; it's just that the particular strategy this library uses to parse the xlsx file will temporarily spike memory usage, and we believe this triggers some bug in Safari, which eventually leads to a crash.

So, we're currently investigating a less general, less standards-compliant way to parse xlsx files. In return, this allows us to parse xlsx faster, and with a much reduced peak-memory usage.

I don't want to talk in too much detail about the xlsx parser we're developing. All I can say is that it is not meant to be a general, fully featured xlsx parser; the only requirements it is designed to fulfill are to avoid, or at least postpone, the crash we observe in Safari, and to parse portions of xlsx workbooks having a well-known structure specific to our application requirements.

I am happy to report that, with just one day of work, we managed to get an initial version of our parser to work. It has a peak-memory usage that is half that of the previous solution. And it sounds even more spectacular when I write it the other way around: the old solution used 100% more memory!

As an added bonus, the new solution is just a little less than 10x as fast as the old solution. For some reason, it does not sound much cooler when I say the new solution is about an order of magnitude faster, so I won't.

JavaScript Regular Expressions

One might be aware that for at least 100 different programming languages there are at least 10,000 Stack Exchange answers that wittily denounce any attempt to parse XML using regular expressions. Some bloggers' entire careers seem to be built around their particular brand of scorn and disdain for this topic.

We have little to add to discussions like these, other than that in modern JavaScript runtimes, regular expressions are a very productive and powerful tool for quickly building tokenizers. These tokenizers can have amazing performance, and can serve admirably as a foundation to build parsers of many kinds, including but certainly not limited to XML-parsers.

A Memory Leak

Despite the initial success, not all is well with our new xlsx parser. We found that, notwithstanding lower peak-memory consumption, it did suffer from a memory leak. We noticed this by creating a simple sample application, loading only our parser, and then making heap snapshots using Chrome developer tools during various phases of the process. See the screenshot below:

This is what happens:

The first heap snapshot was made directly after loading the application, and measures 6.5 MB. Whether you think this is a lot or not, this is our baseline, and there is not much we can do about it now.

Next, the user picks an xlsx workbook, and the application opens it. The snapshot is now 12.6 MB, which is an increase of 6.1 MB as compared to our baseline. The workbook file is a little less than 6 MB and accounts for most of the increase. At this point, our sample application has also extracted and parsed the list of worksheets contained within the workbook, as well as the shared string table. I haven't looked at that in detail, but for now I am satisfied to believe that this accounts for the remaining extra memory.

At this point, we extracted the worksheet of interest from the workbook and uncompressed it into a JavaScript string. This made our heap snapshot increase by almost 34 MB. That is certainly a lot! However, the file size of the worksheet document itself is 34,445 kB, so it seems everything is accounted for.

The next heap snapshot was taken after parsing the worksheet and building a parse tree. The snapshot weighs 77.3 MB - an increase of another 31 MB. Now, the sheet has 32,294 rows with 24 cells of data each, and most of the cells are unique decimal numbers, so it is a decent chunk of data. But even then, it still feels as if this is way too large.

That said, things probably look worse than they really are. Our new parser is event based: the parse method accepts a configuration object that contains only a callback, which is called every time a new row is extracted from the sheet. For our sample application, the callback is only a very naive proof of concept. I suspect there are plenty of opportunities to make the parse tree builder smarter and the parse tree smaller.

The last heap snapshot was taken after the parse. At this point, the parse tree, the workbook object, and the XML string went out of scope. But we are still looking at a heap snapshot of more than 40 MB! This is bad news: we really should be back at something close to heap snapshot 1. So, there's about 34 MB unaccounted for.

In the screenshot, you can also see what's hogging the memory: in the top right pane, we find our XML document string, which indeed accounts for the retained 34 MB of memory. In the bottom right pane, we can see who's still referencing it: it's some property called parent of sliced string @15298471. And these are referenced twice in some array, which is referenced by something called regexp_last_match_info in the native context.

Memory Leak, Explained

Now, what I think we're looking at is the lastMatch property of the global built-in RegExp-object.

If you're not familiar with JavaScript regular expressions, it might be helpful to consider exactly how our parser uses them. We're using code like this:

(Note that this is just an example of the concept - not literally the actual code)
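Something along these lines (the function, regexp, and XML shape here are hypothetical stand-ins, not the actual parser code):

```javascript
// Hypothetical sketch of the tokenizer pattern: a literal regexp,
// repeatedly exec'ed against the (huge) XML document string.
function parse(xml) {
  var regexp = /<row[^>]*>/g;  // literal regexp: instantiates a RegExp
  var match, rows = 0;
  while ((match = regexp.exec(xml)) !== null) {
    // match[0] is the full match; match[1], match[2], ... would hold
    // the substrings captured by any parenthesized groups
    rows++;
  }
  return rows;
}
```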

The parse() function first assigns a literal regular expression to the regexp variable. Under the covers, this literal regular expression results in calling the built-in global RegExp constructor, instantiating a new RegExp instance. Then, the exec-method of the RegExp instance is called, passing the -huge- XML document string. The exec-method returns an object representing the result: if there was no match, null is returned; if there was a match, an object is returned that contains information about the match(es).

If there was a match, the match object will look a lot like an overloaded array object, having the matched parts of the string argument as elements. The element at index 0 of the match object (match[0]) is the substring matching the entire regular expression, the element at index 1 is the substring that matches the first parenthesized capturing group, and so on.

Now, since the match variable is a local variable in our parse function, everything should be garbage collectible after the function ends, right?

Yes. But No.

About RegExp.lastMatch

As it turns out, when an instance of the RegExp object finds a match, then the corresponding match info object containing all the matching substrings is stored in the lastMatch property of the global built-in RegExp-object. So, even if our parse-method is out of scope, the last match made by some regular expression inside it is still dangling around, attached to the global RegExp object in its lastMatch property.
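A small demonstration of this behavior (RegExp.lastMatch is a legacy, non-standard property, but V8 supports it):

```javascript
// After a successful match, the engine keeps the match info on the
// global RegExp object, even once the local variables are out of scope.
function firstTag(s) {
  var match = /<(\w+)>/.exec(s);
  return match ? match[1] : null;
}
var tag = firstTag("<root><child></child></root>");
// tag === "root", but RegExp.lastMatch still holds "<root>" and
// RegExp.$1 still holds "root", keeping slices of `s` reachable
```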

Substrings in V8

Now, if the lastMatch-object is still around, then the substrings representing the matches are also still around. As it turns out, V8 implements these substrings as "slice" objects. From within the JavaScript environment, they act and behave like String objects, but internally, the V8 JavaScript engine implements them as objects that have a parent property to keep a reference to the original String object from which they are a substring, along with some indexes to indicate what part of the original string makes up the substring.

Now, if you think about it, this way of implementing substrings is actually pretty clever, since it allows V8 to do many string manipulations very efficiently, minimizing the overhead and memory consumption of copying parts of strings to and fro. In my case, it just becomes an atrocious memory hog because of the RegExp object, which has decided to maintain a reference to the last match object (for whatever reason).

Other people have run into issues due to V8's substring design as well, and a bug was filed here:

Solutions

My oldest son, David, 14 years old, came up with a pretty creative solution: what if we'd write our own substring implementation, overriding the native one? If this makes you cringe, just think of 30 MB memory leaks and crashing browsers: it puts things in perspective. If this still sounds crazy to you, you should realize that when he came up with this idea, we had already been looking at this issue together for two hours. Even though we felt the substring issue was related, we still had no way to prove that this was actually the case. His idea was feasible and might confirm our suspicions, so we went ahead and did it: we attached our own implementations of substring and substr to the __proto__ object of our XML string, overriding only these methods of that one string instance.

As was to be expected, our own substring implementations were way slower than the native ones, and the parse took about 25 times longer than before. However, it *did* solve the memory leak. This was a strong indication that we were on the right track.

Then, David suggested another solution: why don't we simply clear out the lastMatch property of the global built-in RegExp object? We tried to do this directly, simply by assignment:

RegExp.lastMatch = null;

Unfortunately, this does not work. Although it does not throw a runtime exception, the RegExp object is protected against this kind of assignment, and the property never gets overwritten. However, it is still possible to achieve what we want, simply by instantiating a new RegExp object, and then forcing a match against a known, short string. We can then wrap that in a utility function, so we can always call it after doing some serious regular expression matching on large strings:

function freeRegExp(){
  /\s*/g.exec("");
}
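A self-contained demo of the workaround (the string here is just a small stand-in for the 34 MB XML document):

```javascript
// Repeats the workaround so this snippet runs on its own: after matching
// against a huge string, match against an empty one, so the engine's
// last-match info no longer references the huge string.
function freeRegExp() {
  /\s*/g.exec("");
}
var big = new Array(10000).join("<row/>"); // stand-in for a huge XML string
/<row\/>/.exec(big);
// at this point RegExp.lastMatch is a slice keeping `big` reachable
freeRegExp();
// now RegExp.lastMatch is the empty-string match, and `big` can be collected
```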

Here's the heap snapshot after applying this fix:

Summary

Globals are bad.

Side-effects are bad, in particular if the modifications are global.

V8's substring implementation may lead to unexpected memory leaks.

Chrome heap snapshots are a powerful tool to troubleshoot them.

After applying regular expressions to huge strings, always force a match against a small string to prevent memory leaks.

David Rocks! He truly impressed me with his troubleshooting skills and his knack for pragmatic, feasible solutions.

(In case you're wondering: per-view internationalization is the ability to maintain i18n.properties files, which contain translations for human readable texts, together with the view code wherein these texts appear, rather than maintaining just one giant i18n.properties file that contains the translations for any and all internationalized texts that appear throughout the application. I think the benefits of this idea are clear from a developer's point of view. If you're interested in more information about this concept, then please check out the two blog posts I wrote prior)

Per-view i18n and inheritance

The reason for me to revisit the topic is that I found that the method I use to figure out which i18n.properties files to load, doesn't work very well when you're using inheritance. In UI5, inheritance is achieved by calling the extend-method on the constructor you want to inherit from.

To understand why it doesn't work well, we should first define the scope and the desired behavior.

In my case, I'm using inheritance to build controllers for views that are a lot like other existing views, but with some additional features. Most if not all of the behavior (and texts) of the existing view should be copied, and I'm managing that copy by extending the controller of the existing view. It's just that we need some extra things in the view, and these extra things might bring their own internationalization texts along.

So, what we really need is to include not only the i18n files for the extended controller, we also need to load any i18n files that might be created for the views managed by the superclasses of the extended controller. We need to mind the order too: the extended view might choose to override some of the texts defined by a superclass, so the i18n texts that are closer in the chain of inheritance should be given precedence.

A Per-view i18n solution that works with inheritance

The following code will solve this problem:

_initI18n: function(){
  //first, check if we already constructed the i18n model for this class
  if (this.constructor._i18Model) {
    //we did! Don't do all that work again, just use the existing one.
    this.setModel(this.constructor._i18Model, "i18n");
    return;
  }
  //check if the view and controller are in the same directory.
  //if they are not, then we need to take the possibility into account
  //that the view and the controller might both have their own i18n files.
  var stack = [];
  var controllerClass = this.getMetadata();
  var viewName = this.getView().getViewName();
  if (controllerClass.getName() !== viewName) {
    //if the view name is different from controller name,
    //then we assume the view may have its own i18n
    //that overrides those of the controller
    stack.push(viewName);
  }
  //walk the chain of inheritance up to sap.ui.core.mvc.Controller
  //store each superclass at the front of the stack
  var className, rootControllerClassName = "sap.ui.core.mvc.Controller";
  while (true) {
    className = controllerClass.getName();
    if (className === rootControllerClassName) {
      break;
    }
    stack.unshift(className);
    controllerClass = controllerClass.getParent();
  }
  //walk the stack and create a resource bundle for each class;
  //use it to enhance this class' i18n model.
  stack.forEach(function(className){
    var bundleData;
    if (window[className] && window[className]._i18Model) {
      bundleData = window[className]._i18Model.getResourceBundle();
    }
    else {
      className = className.split(".");
      //snip off the local class name to get the directory name
      className.pop();
      //add "i18n" to make the i18n directory name
      className.push("i18n");
      //add "i18n" again to point to the i18n.properties file(s)
      className.push("i18n");
      bundleData = {bundleName: className.join(".")};
    }
    var i18nModel = this.getModel("i18n");
    if (i18nModel) {
      i18nModel.enhance(bundleData);
    }
    else {
      i18nModel = new ResourceModel(bundleData);
      this.setModel(i18nModel, "i18n");
    }
  }.bind(this));
  //cache the i18n model for new instances of this class.
  this.constructor._i18Model = this.getModel("i18n");
}

Note that this code replaces the _initI18n-method that appeared in my prior blog posts on this topic. It is also assumed that this method sits in some abstract base controller, which you'll extend to create actual concrete controllers for your views.

Here are a couple of highlights that explain the new and improved _initI18N-method:

Examining the entire chain of inheritance, up until sap.ui.core.mvc.Controller. This is achieved with this snippet:
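The relevant portion is the while-loop from the _initI18n-method above. As a standalone sketch (with plain objects standing in for UI5's metadata API):

```javascript
// Walks a metadata chain up to the root controller class, collecting
// class names with superclasses placed at the front of the stack.
function collectClassNames(controllerClass, rootClassName) {
  var stack = [], className;
  while (true) {
    className = controllerClass.getName();
    if (className === rootClassName) {
      break;
    }
    stack.unshift(className);
    controllerClass = controllerClass.getParent();
  }
  return stack;
}

// Mock metadata chain: Orders extends Base extends sap.ui.core.mvc.Controller
var root = { getName: function(){ return "sap.ui.core.mvc.Controller"; } };
var base = { getName: function(){ return "my.app.Base"; },
             getParent: function(){ return root; } };
var orders = { getName: function(){ return "my.app.Orders"; },
               getParent: function(){ return base; } };
var stack = collectClassNames(orders, "sap.ui.core.mvc.Controller");
// stack holds ["my.app.Base", "my.app.Orders"]: superclasses first
```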

The subsequent forEach-iteration of the stack then constructs the resource bundle to enhance the i18n model in the usual way.

Note that names of superclasses that are "higher up" in the hierarchy (or put another way: more basal) are stacked in front of subclasses. This way, the forEach-array method will encounter the class names in the desired order, allowing the subclasses to override texts added by superclasses.

Distinguish between texts defined by the view and the controller.

Admittedly this scenario is quite rare, but if the controller and view each define their own i18n files, then we'd like to enhance our i18n model with files from both resource bundles. I somewhat arbitrarily decided that in this case, the view's texts should probably override those of the controller.

I achieve this with this piece of code:

var stack = [];
var controllerClass = this.getMetadata();
var viewName = this.getView().getViewName();
if (controllerClass.getName() !== viewName) {
  //if the view name is different from controller name,
  //then we assume the view may have its own i18n
  //that overrides those of the controller
  stack.push(viewName);
}

In other words, the view name is placed as the very last item of the stack; if it differs from the controller name, then the controller name appears directly before it, making the view resource bundle the last to enhance our i18n model.

Caching the i18n model at the class level, so that every instance may reuse it.

While I fixed the inheritance issue, it occurred to me that all instances of the controller would go through their own cycle of building the i18n model. Since the i18n model deals almost only with static texts, it seemed wasteful to repeat all that work for each instance. We can simply store the i18n model as a property of the constructor, and retrieve it any time we're creating a new instance.

This is achieved with the very first and last bits of the _initI18n-method:

//first, check if we already constructed the i18n model for this class
if (this.constructor._i18Model) {
  //we did! Don't do all that work again, just use the existing one.
  this.setModel(this.constructor._i18Model, "i18n");
  return;
}
...
...
...
//cache the i18n model for new instances of this class.
this.constructor._i18Model = this.getModel("i18n");

Saturday, June 17, 2017

One week ago, on Friday, June 9, 2017, I was at the Dutch House of Representatives ("de Tweede Kamer") to participate in the 2017 Accountability Hackathon.

The Accountability Hack Event

The event was organized and sponsored by a number of Dutch ministries, the Court of Audit ("Algemene Rekenkamer"), the Central Agency for Statistics ("Centraal Bureau voor de Statistiek") and the Open State Foundation. The goal of the event was to invite programmers, developers, data analysts, journalists and so on to come together and create applications that use one or more of the numerous open data sources published by the Dutch government, providing insight into the performance or the spending of Dutch governmental or publicly subsidized organisations.

This assignment alone might need some clarification. In Dutch democracy there has always been a push towards more transparency. But in the last decade in particular, there has been an increasing demand to provide this transparency by publishing openly accessible data sets. The idea is that publishing records and metadata contributes to an environment where citizens can answer any question about how their government is functioning themselves, by querying and combining this data. Now, obviously, not everybody is capable of working with raw data sets, so there is also a demand for tools, applications and people with the know-how to bridge the technical gap and truly make all this data available on a functional level.

This is where events like the Accountability Hackathon come in: it is a direct attempt to stimulate individuals, but also commercial companies to apply their expertise to create applications and tools that provide meaningful information and insights, based on open data.

Team Just-BI

I participated on behalf of my company Just-Business Intelligence. Just-BI provides end-to-end Business Intelligence consultancy. I'm in the custom development branch, which creates web and mobile applications in the realm of self-service and operational Business Intelligence.

Just-BI has a policy of assigning consultants to billable projects for at most 80% of their working time; the remaining 20% is meant to be invested in knowledge development. We try to align agendas and meet each other every Friday at our office in Rijswijk.

This arrangement made it possible for me to attend an event like the Accountability Hackathon - in fact, Just-BI stimulates its consultants to reach out and participate in events like these.

Submission: Jubilant

The Just-BI submission is a generic OData query and exploration tool called Jubilant (short for Just Business Intelligence Analysis Tool).

Jubilant is an Open UI5 web application that provides a plugin architecture that makes it easy for developers to write their own data visualisations based on OData services. Jubilant provides rich metadata about OData services, as well as a number of reusable components that make it easy to quickly build a query editor/designer.

The Jubilant concept allows a plugin developer to focus on making a cool visualisation, without having to invest time and effort in providing the user with a query builder. During the hackathon I managed to create two plugins - one simple table visualisation, which simply renders raw data in a data grid, and an OData Metadata Graph visualiser, which plots the structure, entity types and relationships exposed by the OData service as a graph.

OData and Open Data

The connection to open data and the assignment for the Accountability Hack is that a number of key open data API's use the OData protocol. A good example is the Dutch Parliament API.

Interestingly, there are relatively few OData query tools available, and none of them are particularly affordable. In fact, during the accountability hackathon a few teams tried to work with these OData APIs and discovered they didn't quite know how to access and process them. I don't know if this finding influenced the jury in any way, but it certainly highlighted the need for a tool like Jubilant.

For Just-BI OData is a key protocol as well, since it happens to be the standard way of exposing data by many SAP products, like SAP/HANA. While Just-BI is a general end-to-end Business Intelligence shop, many of our customer engagements have a strong focus on SAP products. This is also reflected in the Open UI5 framework, which has rather good support for OData.

Result: 2nd Prize!

I was surprised, but obviously very happy to have been awarded the 2nd prize, which is good for 1,500 EUR. It is an honour and a privilege to be in a position to work on stuff I like and maybe contribute something to the transparency of the Dutch democracy. And, frankly, I just had a great time hacking!

Since this was basically just a working day for me, I decided to donate 500 EUR of the prize money to Just-Care, which is a charity supported by Just-BI.

Where to get Jubilant?

At Just-BI, we're currently working out the exact details around a release of Jubilant. I will write an update as soon as I can disclose more, but I can already say that all the work I did for the accountability hack will become available as Open Source Software in the very near future.

In the meantime, if you're interested in Jubilant and OData-based self-service BI, don't hesitate to contact me. Or contact Just-BI.

The Scalefree article discussed in this post focuses on the use of hash-functions in Data Vault data warehousing. To be precise, it explains how Data Vault 2.0 differs from Data Vault 1.0 by using hash-functions rather than sequences to generate surrogate key values for business keys. In addition, hash-functions are suggested as a tool to detect change of non-business key attributes, to track how their values change over time.

Abstract

First I will analyze and comment on the Scalefree article and DV 2.0, and explain a number of tenets of DV thinking along the way. Critical comments will be made about using hash values as primary keys in DV 2.0, and the apparent lack of progress made in DV thinking regarding this matter. A discussion follows about the birthday problem and how it relates to DV 2.0 usage of hash functions. Using the Square approximation method it will be demonstrated how we can make informed and accurate decisions about the risk of a collision with regard to the data volume and choice of hash function. Different scenarios regarding the impact of a collision on data integrity will then be explored. Finally, a practical proposal is made to detect hash collisions and to prevent them from introducing data integrity violations into our data warehouse.

A Summary of the Scalefree article

The business key is what the business users use to identify a business object.

To identify business objects in the data warehouse, we should use a surrogate key that fits in one column instead of the (possibly composite) business key.

The reason to use a single-column surrogate key is that business keys can be large (many bytes per key field, multiple key fields) which makes them slow - in particular for join operations.

In DV 1.0, sequences are used to generate values for surrogate keys.

Using sequences to generate surrogate key values implies a loading process consisting of at least 2 phases that have to be executed in order.

The previous point requires some context regarding the architecture of the Data Vault model. Without pretending to completely represent all tenets of DV, I believe the explanation below is fair and complete enough to grasp the point:

DV stores the data that makes up a business object in 3 types of tables:

The business key and its corresponding surrogate key are stored in hub-tables.

The change history of descriptive attributes is stored in satellite-tables

Relationships between business objects are stored in link-tables.

So, each distinct type of business object corresponds to one hub-table, and at least one, but possibly multiple satellite-tables. The satellite-tables refer to their respective hub-table via the surrogate key. Likewise, link-tables maintain relationships between multiple business objects by storing a combination of surrogate keys referring to the hub-tables that participate in the relationship.

This means that for a single new business object, any of its satellite-tables can be loaded no sooner than when it is known which surrogate key value belongs to its business key. Since the new sequence value is drawn when loading the new business key into its designated hub-table, this means that the new business object must be loaded into its hub-table prior to loading its satellite-tables.
Likewise, any relationships that the business object may have with other business objects can be loaded into link-tables only after *all* business objects that are related through the link have been loaded into their respective hub-tables.

In practice this means that in DV 1.0, you'd first have to load all hub-tables, drawing new surrogate key values from the sequences as you encounter new business keys. Only in a subsequent phase would you be able to load the satellite- and link-tables (possibly in parallel), using the business key to look up the previously generated surrogate key stored in the hub-tables.

In DV 2.0, hash-functions are used to generate values for surrogate keys.

When using hash-functions to generate surrogate key values, hub-, satellite- and link-tables can all be loaded in parallel.

The previous point needs some explanation.

There is no strict, formal definition of a hash-function.
Rather, there are a number of aspects that many hash-functions share and on which DV 2.0 relies:

Deterministic: a given set of input values will always yield the same output value. In a DV context, deterministic just means one business key maps to exactly one surrogate key.

To some extent, a solution based on a sequence as a generator of surrogate keys also appears to be deterministic, at least within the confines of one physical system.

Stateless: While a solution based on a sequence appears to be deterministic, its values are generated by incrementing the previous value, and by "remembering" the mapping from business key to its corresponding surrogate key by storing them together in the hub. Remembering the mapping by storing it in the hub is what makes it deterministic, because only then do we have the ability to look up the surrogate key based on the business key (and vice versa). But if we'd have two separate but otherwise identical systems, and load the same set of business objects into both, but each in a different order, then the mapping from business key to surrogate key will be different, depending on which object was loaded first in that particular system, because the one loaded earlier will draw a lower sequence number.

Fixed output size: DV expects a chosen hash-function to always return a fixed-length output. (And typically, the length of the hash key should be smaller, ideally much smaller, than its corresponding business key)

Uniformity: This means that inputs of the hash-function yield results that are very evenly divided across the output domain. If that is the case, the chance that different inputs (i.e., two different business keys) yield the same output value is very small. The phenomenon where two calls to a hash-function, using different arguments, yield the same result value is called a collision. We will have much to say about collisions later on in this article.

Basically the idea is that a hash-function can be used to generate a surrogate key value that is typically much smaller (and thus, "faster", in particular for join operations) than its corresponding business key, and it can do so without an actual lookup against the set of existing surrogate key values. Ideally a hash-function calculates its output based solely on its input, and nothing else.

It is this latter aspect that makes it different from a sequence, which generates values that have no relationship at all with the values that make up the business key. In the case of a sequence, the relationship between surrogate key and business key is maintained by storing it in the hub, and looking it up from there when it is needed.

When using hash-functions there's a risk of collision.
A collision occurs when two different inputs to the hash-function generate identical output.

The risk of collision is very small. In a database with more than one trillion hash-values, the probability that you will get a collision is like the odds of a meteor landing on your data center.
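The birthday-problem "square approximation", p ≈ n² / 2^(b+1) for n keys hashed into a b-bit space, gives a quick sanity check of that claim (my back-of-the-envelope numbers, not the article's):

```javascript
// Square approximation of the birthday problem: for n keys hashed into
// a b-bit space, the collision probability is roughly n^2 / 2^(b+1).
function collisionProbability(n, bits) {
  return (n * n) / Math.pow(2, bits + 1);
}

var p = collisionProbability(1e12, 128); // one trillion MD5 hash keys
// p is on the order of 1.5e-15: tiny, but note that the per-pair figure
// 1/2^128 understates it, since the risk grows with the square of n
```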

The MD5 hash-function is recommended for DV 2.0. As compared to other popular hash-functions (like MD6, SHA1 etc), MD5 storage requirements are relatively modest (128 bits), while the chance of a hash-collision is 'decently low'. It's also almost ubiquitously available across platforms and databases.

Comments

Many tenets of Data Vault thinking make good sense:

The focus on business keys, rather than blindly using the primary key of the source systems feeding into the EDW, ensures that the result will serve the needs of the business.

While the use of surrogate keys in transactional systems is still often a source of debate, this practice is not controversial in data warehousing.

That said, I think the technical reasons for introducing a surrogate key mentioned by the article (making keys smaller and more wieldy to improve join performance) are not as important as the functional requirement of any data warehouse to integrate data from multiple data sources, and to ensure that identity there is resilient to changes in the source systems.

There's a lot to be said in favor of maintaining attribute history in satellite-tables. In particular, the ability to have multiple satellite-tables is appealing, since it allows you to create and maintain groups of attributes that somehow belong together, and maintain them as a unit. For example, grouping attributes based on their rate of change, or according to whether they tend to appear together in queries sounds like a very useful thing.

Link-tables also sound like a good idea. Decoupling direct links from the business objects and maintaining them in completely separate tables makes the EDW very resilient to changes in the data sources. In addition it allows you to add extra relationships that may not be directly present in the source systems but which make sense from the point of view of data integration and/or business intelligence.

Once we accept the benefits of satellite- and link-tables and surrogate keys, then we must also embrace hub-tables. The model wouldn't make any sense without them, since we need an integration point anyway to maintain the mapping between business key and surrogate key (which is exactly what the hub-table is).

Once we accept the modeling entities of DV, we also need to accept any constraints that surrogate key generation may have on the loading process. Of course, a dependency itself is something we would like to avoid, but the fact that a known limitation is recognized and anticipated is a good thing.

According to Dan's paper "DV2.0 and Hash Keys", the chances of having a hash key collision are nearly nonexistent (1/2^128). But I still worry.

What if a collision actually does occur that results in the business making a catastrophic decision? Can we really say we have 100% confidence in the "system of fact" when we choose to accept the risk of a collision? What is our course of action from a technology point of view if a collision did actually occur?

The paper "DV2.0 and Hash Keys" that Charles refers to is publicly available though not freely. (You can purchase it via Amazon.)
Here's the relevant statement from the article:

collisions are real and MD5 not a good choice. but generally the smaller the input key domain and the larger the Hash output size, the less chance of collision, BUT it is always there.. so I would like to see some comments on the verification steps needed and cost to load of Collision ManagementI. If Integrity of data is important, then this is important.

In Linstedt's answer, you can find a similar expression of this probability, which goes:

because we split business keys across multiple hubs, the chances of collision (even with MD5) are next to none.

Yes, they do exist – but you would have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place.

Objections

Regardless of the merits of DV (which I believe I stated fairly in the previous section) I have a few doubts and objections on the writing and thinking by DV 2.0 advocates I have observed so far; all around the topic of hash-key collisions. My objections are:

The wording around the probability of hash-collisions is not helpful to understand the risk. As such, it does not help to decide whether to use DV 2.0, let alone choose a suitable hash-function for a concrete use-case.

The actual numbers regarding probability of hash-collisions are stated flat-out wrong on more than just one occasion.

The probability doesn't matter if you can't afford to lose any data.

DV 2.0 does not discuss the consequences of a hash-collision, and no concrete advice is given on how to detect hash-collisions, let alone handle them.

I believe the questions by Charles Choi and Ray OBrien show that I am not alone. At the time they voiced their doubts, DV 2.0 was relatively new and I can understand that maybe at the time these tenets of DV 2.0 would still need to mature.

A couple of years have passed since, and after reading the article on the scalefree company blog, I am sad to observe that, apparently, no progress has been made in DV 2.0 thinking. At least, if such progress has been made, the scalefree company blog article doesn't seem to offer any new views on the matter. Instead, it comes up with an - equally unhelpful - restatement of the probability of hash-collision by comparing it to the chance of being hit by a meteor.

In the remainder of the article I will explain my objections and attempt to offer some thoughts that may help advance these matters.

The DV 2.0 wording around the probability of hash-collisions is not helpful

First, let's try and analyze the wording around probabilities, and what message it conveys.

Distracting Rhetorics

What does it really mean when someone says that "the probability [...] is like the odds of a meteor landing on your data center"? What does it mean, really, when someone says that the probability is "1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456"?

Well obviously, they are saying the chance is very small. Maybe it's just me, but I also sense a level of rhetoric in the wording that seems intended to dwarf the reader with Big Serious Numbers.

It's almost as if they're saying: this won't happen, so you shouldn't worry. You're not worrying about meteors hitting your data center all the time, so why worry about a hash-collision? Right?

You can observe that the rhetoric is working too, just look at how Charles Choi voiced his question: "According to Dan (Linstedt) [...] the chances [...] are nearly nonexistent (1/2^128). But I still worry.".

It's as if Charles is apologizing in advance for worrying.

Probabilities are not absolute

There is a more fundamental problem with the probability wording of the previous two examples: they present probability as an absolute.

To be fair, if you read them in context, then you'll notice that both statements are about MD5 collisions. Obviously, this matters, since not all hash-functions have equal probability of collisions. For hash-functions that have a fixed-length output, the chance surely has to have a relationship with the length of the output, since that puts a hard limitation on the number of unique values it could possibly encode.

However, apart from the output length of the hash-function and the algorithm it uses, there is at least one other factor which determines the probability, and that is your data volume.

Intuitively, this is easy to understand: if you have an empty set, the probability of the first hash-value causing a collision is exactly zero, since there is nothing to collide with. At the other extreme end of possibilities, if the set contains as many items as the total number of unique values the hash-function is capable of generating, then the probability of a collision is exactly one, since the entire keyspace has been "used up" already. In between these extremes, we have a growing number of existing entries that could collide with a new entry, so the probability increases from zero to one as the actual number of items (i.e, the number of rows in the hub - the data volume) increases.

While this observation seems trivial, it is important to mention it in this discussion because the aspect of volume is, for some reason, often not touched upon at all by DV 2.0 advocates. It's a mystery why that should be so, because if we knew the relationship between the probability of a collision, the maximum possible number of unique values, and the maximum volume of data, then we could reason about these variables sensibly. Like:

Given the maximum risk that I am willing to take to lose data due to a collision, what is the maximum volume I can store if I use a 128-bit hash-function?

Given the maximum risk of collision that I am willing to take, and given the maximum number of rows I need to store, what would be the minimum output length of the hash-function I should look for?

Given my current data volume, and my current choice of hash-function, what is the risk I am running now of losing data due to collision?

The Birthday Problem

Interestingly, Linstedt does provide one statement that at least takes the data volume into account: "you'd have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place". Let's see what that means exactly.

Apart from the probability, this statement includes the other two variables: the 128-bit key length of MD5, and a data volume of 6 billion rows per second for 100 years. But if you look at the probability (50%) you'll notice how completely useless this wording is, apart from its rhetorical power. Who in their right mind is interested in a system that can only accept half of the data you're trying to store in it? How can you possibly apply this piece of knowledge in any practical sense to a system you have to actually build?

I spent a while thinking about how this statement could have come about, and I have come to believe that it is actually a restatement of the classical birthday paradox:

The birthday paradox, also known as the birthday problem, states that in a random gathering of 23 people, there is a 50% chance that two people will have the same birthday.

(As compared to Linstedt's statement, the 50% probability stays the same; the 6 billion rows per second for a 100 years is equivalent to the number of people gathered, and the number of possible unique values in the key corresponds to the number of days in a year.)

Whether or not the original birthday problem is actually what made Linstedt word his statement the way he did, I think it's clear that a 50% probability of a collision has no practical bearing on building any kind of database. To me, it just sounds like more rhetoric to convince us that hash-collisions are really rare.
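The classical birthday figure is easy to verify directly. The following Python sketch does the exact calculation (nothing DV-specific here, just the textbook product of "no collision so far" probabilities):

```python
def birthday_collision_probability(n, m=365):
    """Exact probability that among n random draws from m equally
    likely values, at least two draws are equal."""
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (m - i) / m
    return 1.0 - p_no_collision

# 23 people, 365 possible birthdays: just over a 50% chance of a match
print(round(birthday_collision_probability(23), 4))
```

The same function works for hash keys by substituting the keyspace size for m, although for realistic keyspaces (m = 2^128) you'd use the approximations discussed below instead of this exact loop.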

Probabilities are stated flat out wrong

The discussion of the birthday problem brings us to an actual method to calculate the probability as a function of data volume and key length. The wikipedia article I linked to provides an exact method, as well as many useful approximations which are much easier to calculate. The birthday problem wikipedia article explains it far better than I ever could, and also provides this really useful rule of thumb called the square approximation:

A good rule of thumb which can be used for mental calculation is the relation

p(n) ≈ n² / (2m)

which can also be written as

n ≈ √(2m * p(n))

which works well for probabilities less than or equal to 0.5

with:

n: the actual size of the keyset, i.e. the number of rows you need to fit into the hub

p(n): the probability of a collision, given n

m: the theoretical maximum size of the keyset, i.e. the maximum number of unique values that your hash-function can encode

Given a fixed-length hash-function, such as MD5, m can be calculated by raising 2 to the power of the key length (expressed as a number of bits):

m = 2^bitlength

(In case this needs explaining: if your key were just one bit long, it could hold only 2¹ = 2 values: 0 or 1. If the key were 2 bits long, it could hold 2 × 2, or 2² = 4 unique values. With 3 bits, 2 × 2 × 2 or 2³ = 8, and so on.)

From the discussion above as well as the previous section, we can now conclude that at least one statement of the probability:

The mathematical chances of a collision as a result of using MD5 are (1 / (2^128))

is simply flat-out wrong, since it does not take the data volume into account. Rather, since 2^128 is the number of possible unique values that MD5 can cover, 1 / 2 ^ 128 is the chance that the second row you put into your hub will collide with the first one.

So how big or small are the odds really of running into an MD5 collision in case we're handling a volume of 6 billion rows per second for 100 years? Using the square approximation, with n = 6×10⁹ rows/second × 31,536,000 seconds/year × 100 years ≈ 1.89×10¹⁹ and m = 2¹²⁸ ≈ 3.40×10³⁸, we get:

p(n) ≈ (1.89×10¹⁹)² / (2 × 3.40×10³⁸) ≈ 0.53

which is in fact closer to 53%. To figure out how many years we would need to insert 6 billion rows per second to achieve the 50% chance of running into a collision, we can use the second form of the formula:

n ≈ √(2 × 2¹²⁸ × 0.5) = 2⁶⁴ ≈ 1.84×10¹⁹ rows, which at 6 billion rows per second takes roughly 97.5 years.

The point of doing these calculations here is obviously not to prove Linstedt wrong by showing you'd already arrive at 50% after only 97-and-some years instead of after 100. Nor is it to determine that after 100 years, the chance is actually closer to 53%. Besides the fact that I'm using an approximation, neither makes any sense anyway, because a probability of 50% is already way, way beyond any definition of a working system.

The point I am trying to make is that it is perfectly possible to reason about large numbers and to clearly and transparently demonstrate how they are calculated. Using square approximation, you have a tool to calculate the value of the third variable once you have the value of the other two, allowing you to reason about it from three different angles.

I think we can all agree that's a much better position than getting stumped by Really Seriously Big Numbers.
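To illustrate that reasoning, here is the square approximation as a pair of small Python helpers, reproducing the numbers discussed above. The function names and constants are mine, chosen purely for readability:

```python
from math import sqrt

def collision_probability(n, bits):
    """Square approximation p(n) ~ n^2 / (2m), with m = 2^bits.
    Reasonable for probabilities up to about 0.5."""
    m = 2 ** bits
    return n * n / (2 * m)

def keys_for_probability(p, bits):
    """Inverse form: n ~ sqrt(2 * m * p)."""
    m = 2 ** bits
    return sqrt(2 * m * p)

SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

# 6 billion keys per second, for 100 years, against 128-bit MD5:
n = 6_000_000_000 * SECONDS_PER_YEAR * 100
print(collision_probability(n, 128))                # roughly 0.53

# Years of 6 billion keys/second needed to reach a 50% chance:
n50 = keys_for_probability(0.5, 128)
print(n50 / (6_000_000_000 * SECONDS_PER_YEAR))     # roughly 97.5
```

With these two helpers you can indeed attack the problem from any of the three angles: fix the risk and the key length to find the volume, or fix the risk and the volume to find the key length.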

Probabilities add up for each keyset

So far, we've only looked at the probability of encountering a collision while loading a single hub. But the probability increases as you have more hubs.

Intuitively this is clear, because each hub could encounter a collision independently. So the chance of suffering a collision in any one of them grows as you have more hubs to maintain, and is quite a bit larger than the chance of suffering a collision in just one particular hub.

If we'd like to compute the chance of getting a collision in at least one hub we can apply the following calculation:

1 - ((1 - Ph1) * (1 - Ph2) * ... * (1 - PhN))

with:

Ph1: the probability of a collision in the first hub

Ph2: the probability of a collision in the second hub

...

PhN: the probability of a collision in the last hub

The rationale behind this is that if the probability of a collision is P(n), then the chance of not having a collision is 1 - P(n). To calculate the chance of not having a collision in any of the hubs, we multiply the individual chances of not having a collision in each particular hub with each other. If that number is the chance of not having a collision in any of the hubs, then all remaining probability must mean there is a collision in one or more hubs. So the chance of having at least one collision is obtained by subtracting the probability of having no collision at all from 1.

So the chance that someone, someday, will encounter a collision could be quite a bit larger than you'd expect if you're focusing on just one hub.
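The formula above can be sketched in a few lines of Python; the per-hub probabilities in the example are made up purely for illustration:

```python
def combined_collision_probability(per_hub_probabilities):
    """Chance of at least one collision across independent hubs:
    one minus the product of each hub's chance of *no* collision."""
    p_none = 1.0
    for p in per_hub_probabilities:
        p_none *= (1.0 - p)
    return 1.0 - p_none

# 100 hubs, each with a (made-up) one-in-a-million collision probability:
print(combined_collision_probability([1e-6] * 100))   # close to 1e-4
```

For small per-hub probabilities the combined chance is approximately their sum, so a hundred hubs means roughly a hundredfold risk.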

No matter how slight a Probability, I can't afford to lose data

Now we arrive at a more fundamental objection regarding the matter of using hash-values as keys in your database.

If you re-read Charles' question, you'll notice that he is politely explaining that, although he understands and appreciates that an MD5 hash-collision may be really rare, he simply doesn't ever want to lose any data because of it. Ray OBrien raises the exact same point, even mentioning data integrity as the reason why he cares.

When this issue is put forward in DV 2.0 discussions, it usually means the end of any meaningful discussion. The answers that DV 2.0 advocates give in response typically match one of the following:

"Look, do you realize how rare a hash-collision is?"

Yes thank you. You just stumped me with some Really Seriously Big Numbers, and I get it. Super rare. I just don't want to lose data though.

"You don't have to use 128-bits MD5, you can use a hash-function that returns larger values, like 160-bits SHA1. Collisions will then be, you know, even more rare."

Perfect thanks. Did I mention I can't afford to lose data?

"We use 2 hash-functions as key and it works for us."

Ah, I get it now. You made collisions Super-duper-rare, how clever. So will you never lose data now?

"I have built hundreds of Terabyte-sized data warehouses, and I never encountered a hash-collision."

Well let me guess. Might that be because they are very rare?

"Teradata is using hashing to solve MPP data distribution, and Hadoop uses hashing in HDFS. If it works for them, then why wouldn't it work in DV 2.0"

So you're saying Teradata and Hadoop use hashing for some purpose, and DV2.0 is using hashing for a completely different purpose, and now you want me to explain why it works in one use-case but not for a completely different use-case? That's...interesting.

How about: Teradata and Hadoop are not using hash values as primary keys, and DV 2.0 is?

"Look, why are you so worried about hash-collisions? You're not worrying all the time about a meteor hitting your data center, are you?"

"Ha! Gotcha now. What about two meteors? Wouldn't that be comparable to using two hashes?"

I suppose it would. Difference is, I can't help meteors falling on my data centers. But I can choose to stick to sequences instead of hash-functions.

"But I just explained, sequences are a bottleneck and prevent you from parallel loading your EDW!"

Yes I heard. And I asked my customer: they are pretty sure about how they feel regarding the possibility of losing data, and they clearly told me they'd rather wait around for the data to be loaded as compared to being able to report super-quickly on wrong data.

"We use hash-keys to store tweets for sentiment analysis, and our results are pretty accurate, even if we lose data sometimes."

I'm sure they are - good for you! But we manage a monetary transaction log, and we feel that the risk of losing one $1,000,000,000 transaction just doesn't justify loading 1,000,000,000 transactions worth $1 super-fast in a parallel fashion. Silly us, eh?

And that's really all there is to it: Probabilities don't mean a thing if you're really sure you don't want to lose any data.
When it happens, it is no comfort that you were the one to have had such extraordinarily bad luck experiencing it.

Another thing people may overlook is that the probability also doesn't tell you when it will happen. The only real guarantee you have is that inserting the first key in an empty hub will always succeed. But already the 2nd row might collide. It probably won't, and the odds are really slim. But it might. If you're sure you don't want that, then don't use hash values as keys. It's really that simple.

Sure, there are other risks that could make us lose data. For example, the probability of disk corruption might be larger than that of a hash-collision. But it doesn't follow that we should set a trap for ourselves when we can avoid it, especially since we know how to avoid it.

We cannot control disaster like disk corruption or meteor impact. If we could though, we would! Whether to use hash keys or to stick to sequences is a conscious choice, so let's be sure we make it based on information and requirements, and rather not based on some analogy that is chosen with the express purpose of making you feel a little bit ridiculous for being so averse to taking a risk.

If you're building a database for someone else, and you're considering using hashes as keys for your data, then be prepared to ask your customer: "How much data can you afford to lose?" or "In the event that we cannot load some data due to hash-collisions, how much time and effort can we spend to take the system offline so we can fix it?"

Another way to think about it is this: suppose you would, in fact, lose data because of a collision. How comfortable are you admitting to your customer that you constructed a solution that, by design, could end up giving wrong results, while there was an alternative that guarantees correctness, at least to the extent of things you can control? And suppose you did get wrong results: did you anticipate just how wrong those results could be?

I truly feel that considerations like this are not on the database/data warehousing professional - they are on the customer. It's their data. Please, respect that.

What if we do have a collision?

I get that there are use cases where you might want to accept the small risk of a collision. But you cannot really, truly make that assessment if you haven't considered and anticipated it as if it is a real event actually hitting you. I think DV 2.0 falls short in nourishing healthy discussion regarding the anticipation of such events.

So, what will happen if you have your hash-keys in place, and you encounter a collision? We can try and anticipate a few concrete scenarios.

First of all, will you even detect a collision when it happens? The Scalefree article has a few flow diagrams showing how raw data is staged, and then loaded into a hub. In that flow, the row is dropped when the hash already exists. So the question now is, why did the hash already exist?

Obviously, it's possible that we already loaded this business object, and we're merely seeing it again. In that case, we're fine, and we'll simply be loading newer data into the satellites and links for that business object. But it's also possible that this is an entirely different business object that happened to yield the exact same hash-value as a business object loaded earlier. In other words, you now have a collision. For the hub, it will pass by unnoticed, but the satellites and links that point to that business object will now store data pertaining to more than one distinct business object.

So in this case, we're not losing data, but compromising the integrity of the business object that arrived earlier. Our database integrity is now violated and our queries will return wrong results. You won't know though, because you didn't attempt to detect a collision. To your data vault, the distinction between multiple different business objects has ceased to exist.

I don't know about you, but this does not feel like a happy place to me - especially since this is an entirely avoidable consequence of the design (plus a whole bunch of bad luck, of course).

Alternatively, we build our solution in such a way that we can at least detect collisions. Once we detect it, we can maybe prevent loading associated data for the satellite-tables and link-tables for the colliding business object. This means we will be completely ignoring the later arriving business object, as if it isn't there.

We have now lost data. That is not a good thing, but at least this allows the earlier arriving business object to maintain its integrity. Our query results won't be wrong, they will just be incomplete. To me this is a slightly happier place, but the fact that it's a matter of fate which one of the objects made it into the database, and which one was rejected still makes me feel that the solution has failed.

But suppose we do want to go for that treatment (after of course getting confirmation from the business that this is really what they want) - how can we implement it? Well, at the very least, the load process for either the hub or the staging area would need to compare the hash value as well as the business key. Only if both are equal can we consider the objects equal.

Making the comparison is not hard but it will of course be slower than only calculating the hash, because in this case you need a lookup on the hub, just like you did with the DV 1.0 solution based on sequences.
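To illustrate the idea, here is a toy in-memory model of such a hub load in Python. A real implementation would of course be set-based SQL against the actual hub table; the dict-based model, function name and return values are assumptions purely for illustration:

```python
def load_hub_row(hub, collisions, hash_key, business_key):
    """Load one row into a hub, modeled as a dict: hash -> business key.

    Comparing the business key as well as the hash is what makes
    collision detection possible. Returns what happened to the row.
    """
    if hash_key not in hub:
        hub[hash_key] = business_key
        return "inserted"
    if hub[hash_key] == business_key:
        return "existing"        # same business object, seen again
    collisions.append((hash_key, business_key))
    return "collision"           # different business key, same hash!

hub, collisions = {}, []
print(load_hub_row(hub, collisions, "abc123", "CUST|42"))  # inserted
print(load_hub_row(hub, collisions, "abc123", "CUST|42"))  # existing
print(load_hub_row(hub, collisions, "abc123", "CUST|99"))  # collision
```

The third call is the interesting one: the hash matches an existing row, but the business key does not, so the row lands in the collision store rather than silently merging two distinct business objects.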

But what if we detect the collision in this way? How can we then use this information to prevent loading the associated satellite- and link-table data?

Introducing a collision table

We could store collisions in a special collision table. We'd get one such table for each hub. The collision table would have the same layout as the hub table to which it corresponds, and you'd use it to store the colliding hash, as well as the business key for which the collision was met. The key of the collision table would have to be made up of both the hash key as well as the business key, so that we can handle multiple collisions for the same hash key.

Once the collision table is in place, and the process for loading the hub-tables is modified to detect and store collisions, we can think about the process for loading the satellite- and link-tables. I can see two options

Have the load process for satellite- and link-tables do a lookup to the collision table to see if you need to discard data for business objects with collisions

Check collision tables after the load and run a clean-up process to restore integrity after detecting new collisions.

Collision table lookup

This solution relies on the process that loads the satellite- and link-tables to make a lookup to the collision table, using both the calculated hash and the business key. If collisions really are as rare as they should be, then that lookup should be really fast, because the collision tables will be pretty much empty.

If all goes well, then the lookup will fail. This means we have no collision and we can proceed loading the satellite- and link-tables.
In the rare event that the lookup succeeds there is apparently a hash-collision, and we must not load the satellite- and link-tables to prevent violating the integrity of our data. You are now in a position to either discard the data, or to store it someplace else in case you have a clever idea of reconciling the data later on.

However, we now have reintroduced the constraint on the loading process, because we now rely on the process that loads the hubs to also detect and store collisions in the collision table.

So, with this solution, we lose the ability to load hub-tables in parallel with satellite- and link-tables. What may be of some comfort in comparison with the DV 1.0 sequence based solution is that the collision table lookup will be much faster than a hub-lookup to find the value generated by a sequence, because the collision table will be pretty much empty. So, the burden of the constraint and loading dependency should be much lighter than in the case of a DV 1.0 sequence based solution.

Another important drawback is that if your solution spans multiple systems, you need to maintain one set of collision tables somewhere, and all loading processes will be dependent upon them. In other words, the solution is not stateless anymore.

Clean-up after load in case of new collisions

Alternatively, we keep loading hub-, satellite- and link-tables in parallel, and we check the collision tables after each load to see if the last load introduced any new collisions. If we find that it did, we need to perform clean-up after the load.

The way clean-up would work is as follows: our load process should have been logged, and our satellite- and link-tables should have metadata identifying the load process that put their contents in the data warehouse. The load identifier would also be stored in the collision table. Using that information, we can identify which new collisions our last load introduced. We now have the load identifier as well as the hash key of the new collision, and we can then use that to delete all satellite- and link-table rows that have the load identifier of our latest load, as well as the hash key of the colliding business object.

After clean-up, we have restored data integrity for those business objects that encountered a collision, up to the point prior to the last load.

Now, we know that our last load brought us a business object whose hash collided with that of some business object already in our data warehouse, and we made the conscious decision to reject that data for now, or maybe store it someplace else until we know how to reconcile it. But the last load might also have brought us data that actually belonged to the business object that already existed in our data warehouse. We would really like to reload that part of the data for the existing business object. Our clean-up process had no way of distinguishing between satellite- and link-data for the one or the other business object, because it only knows about their colliding hash keys. In other words, our clean-up process might have removed data that actually did belong to the already existing business object, and now we need to put that back.

The solution would be to have an alternative load process especially for this -hopefully exceptional- case.

The alternative load process would be similar to the collision table lookup solution described above. It would only load satellite- and link-tables, and it would include a lookup to the collision table. If the lookup fails, we're dealing with data belonging to the existing business object and we can load it. If the lookup succeeds, then this is data that belongs to a new business object that caused a collision, and we should discard it or store it someplace else for later reconciliation.

If things go the way they should, then the clean-up-and-reload process should occur seldom. And if we need to run it, it would probably be quite fast since it deals with only a few business objects - typically only one.

The only drawback now is that our data warehouse lived through a short period during which integrity was compromised during the load process. But at least we can repair integrity for all existing business objects, and selectively discard only the data for business objects that suffer from a hash collision with an already existing business object.

While this approach still relies on keeping collision tables around, we regain our ability to load hub-, satellite- and link-tables in parallel. We can even do loads spanning multiple systems; we just need to take care to clean those up as well in the case we do encounter a collision.

Other solutions?

I don't want to pretend this is an exhaustive list - I'm hoping there are more options and I just can't think of them right now.

How to load the colliding business objects?

What these scenarios do not solve is loading the later-arriving business object. We only managed to prevent these objects from entering our data warehouse; we have not found a way to load that data as well.

And we can't - not unless we change the key.

On the other hand, changing the key might just be doable: you could decide to try another hash-function for at least that hub. You would need to update all satellites and links pointing to that hub (and of course, the hub itself) and rehash every row that points to it.

Of course, since you're still relying on hash-keys, just - hopefully - larger ones, you haven't solved the problem; you've merely reduced the odds. And you might even run into a new collision while you're doing the rehashing operation. But that's just the life you've chosen. At least we now have something that resembles dealing with the problem rather than praying it won't happen.

I guess my main point here is: make sure every stakeholder actually accepts the risk, and make sure the procedures for dealing with a collision are specified and tested. The fact that the chance you'll need them may be next to negligible is not a license to pretend you do not need to be prepared to perform these tasks. If it is decided that you'll be taking the risk, then actually take the risk, and take it seriously.

Can't we have our cake and eat it too?

Isn't there some way we can benefit from hashes and still, magically immunize ourselves against hash-collisions? It turns out we can.

To recap:

DV, like other data warehousing methods - suggests using surrogate keys, because business keys are largish and unwieldy, and slow down join operations.

DV 2.0 suggests using hash-keys to avoid sequences, which make it impossible to load hubs, satellites and links all in parallel, and which slow down the load due to a lookup in a large hub.

This may be a long shot, but it has the advantage that it is guaranteed to work. As should be amply clear from the discussion in this article, hash-keys always carry a chance of collision, and it doesn't matter how small the risk is if you already know you cannot accept losing data or giving up integrity. So, what is clear is that if that is the requirement, you cannot use hashes as keys. Period.

But that does not mean we cannot benefit from hashes.

Especially if the hash-key is small in comparison with the business key, we could build up our keys as a composite of the hash-code, followed by the field or fields that make up the business key. If our database uses B-tree indexing, joins can then, in the vast majority of cases, be resolved on the hash-key column alone, which should be the first column in the key definition. You would still keep the fields of the business key in your join conditions, to ensure that, in case of a hash-collision, the query still returns the correct result. Since collisions are so rare, the database will be left with very few rows after it has resolved the join over the first field in the key, so the overhead of such a large key should be minimal.

Of course, this solution does not help in cutting down the amount of data you need to store - that will be much more in this scenario, since you keep dragging the business keys literally everywhere: every satellite and every link table that points to this hub will inherit the column for the hash, as well as all columns that make up the business key. But if it helps, the cost should be mainly in storage requirements, and not so much in join processing.

The benefits of the -relatively- fast join will break down, though, when the database uses hash-joins. In that case, it will not be able to use the first column of the key as a prefix.
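To make the idea concrete, here is a sketch in generic SQL of what such a composite-key hub and satellite might look like. All table and column names are illustrative, and the choice of MD5 as hash function is just an example:

```sql
-- Hypothetical hub for a "customer" business key.
-- The key is a composite: the hash-code first, then the business key itself.
create table hub_customer (
  hash_key      binary(16)   not null -- e.g. an MD5 of the business key
, customer_nr   varchar(32)  not null -- the actual business key
, load_dts      timestamp    not null
, record_source varchar(128) not null
, primary key (hash_key, customer_nr) -- hash first: the B-tree resolves on it almost always
);

create table sat_customer (
  hash_key      binary(16)   not null
, customer_nr   varchar(32)  not null
, load_dts      timestamp    not null
, customer_name varchar(128)
, primary key (hash_key, customer_nr, load_dts)
);

-- Joins keep the business key in the condition, so a hash-collision can
-- never produce a wrong match; the hash column merely speeds up the lookup.
select h.customer_nr, s.customer_name
from hub_customer h
inner join sat_customer s
   on  s.hash_key    = h.hash_key
   and s.customer_nr = h.customer_nr;
```

Note how the join condition always includes the business key: the hash column only lets the B-tree do most of the work on a short, fixed-width prefix.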

Finally

I hope you enjoyed this article. I am super curious to hear from DV practitioners if they have scenarios we can learn from. Drop a line in the comments! I'm looking forward to it.

Thursday, March 30, 2017

Dear SAP HANA DBAs, developers, architects and others: we at Just BI have released Shivers.

What is Shivers?

Shivers stands for SAP/HANA Information Viewers. It is a tool to analyze and visualize dependencies between SAP/HANA information views (that is, Analytic Views, Attribute Views, and Calculation Views) and catalog objects (such as tables and views).
Below is a screenshot of Shivers so you can get an impression of what it looks like:

Why Shivers?

In our work as SAP and Business Intelligence consultants, SAP/HANA information views are a key tool in delivering Business Intelligence and Custom App Development solutions to our customers.

In some cases, information views can get quite complex. Visualization and documentation features offered by development tools, like HANA Studio, do not always provide the kind of insight we need to create or maintain our solutions.

One very particular and concrete case is creating technical documentation, or preparing presentation materials for knowledge transfer sessions for handover to the customer support organization. In the past, some of our consultants would manually piece together overview slides of the data objects (information views, tables) that make up our solutions. I decided to try to whip up a solution that would make tasks like this a little simpler.

How do I install and run Shivers?

The simplest option is to download the shivers.zip archive from GitHub. You can then unzip the archive and open the index.html file inside in a modern web browser (tested with Chrome and IE11, but it should work in all modern browsers).

Note that you do not need to install Shivers on a web server or HANA XS server. It runs directly from your local disk in your browser. Of course, you can install Shivers on a web server or HANA XS server if you like, and then open Shivers by navigating to the URL corresponding to the place where you installed it. But this won't affect how Shivers works.

How do I load information views into Shivers so I can visualize them?

On the Shivers toolbar at the top of the application, you'll find a file chooser button. When you click it, a file browser opens. Use it to browse to the location on your local disk where you keep your information view source files.

For example, I have my HANA Studio workspace stored in C:\Users\rbouman\hana_work, so I browse to that location and then find the appropriate path in the subdirectories to the HANA system and package I'm interested in. When I find a directory containing information view source files, I select them and confirm the file chooser.

Shivers will then prompt for a package name. It needs to do this to know how other views could refer to this set of views. Since the package name is itself not stored in the information view source files, the user has to supply that information. Later, when analyzing dependencies, Shivers will match the entered package name against package names it encounters inside the information view source files, whenever an information view refers to other information views.

After these steps, Shivers will populate the treeview on the left-hand side of the screen with the package structure and entries for the information views you loaded.

Note that Shivers has a log tab. The log tab reports on the loading process, and it will also report whenever it loads an information view that depends on other information views that are currently not loaded into Shivers. If you want to make complete and exhaustive graphs, you should then also load whichever file is reported in the log tab as an omitted dependency.

When the loaded files are visible in the treeview, you can click on any of the treenodes representing an information view. When you select one, a tab will open inside the Shivers window, showing a dependency graph of the selected information view.

Shivers currently does not do any clever layout of your dependency graphs. But you can manually drag and place items inside the graph to make it look good.

When you are done editing your graph, you can right click it, and choose "Export Image" to export the visualization to a png file, which you can then use in your technical documentation or knowledge transfer presentation slide deck.

How does Shivers relate to HADES?

HADES is a set of server-side tools (so far, only stored routines) that help with the analysis of information views. HADES is data-oriented: it requires access to a HANA SQL client, and it allows you to extract just about any information you'll ever want to know about your information views. But HADES is, in a way, also very low-level, and using it effectively requires SQL skills.

Shivers is a graphically oriented client-side tool. It is implemented as a web page that you start in your browser, directly from your file system. Shivers does not require any connection to a HANA server. Rather, you use checked-out information view source files (.analyticview, .attributeview and .calculationview files), load them into Shivers, and then the tool does static code and dependency analysis. So far, possibilities to extend or influence the reporting of Shivers are very limited. Shivers draws dependency graphs of your information views, and for now, that is it.

Why is Shivers an offline client-side tool?

Implementing Shivers as a non-server, client-side tool was a very deliberate decision. On our customers' systems, we cannot assume that we are allowed to install our tools on the HANA server for just any purpose. So we really must have a solution that works regardless. What we can assume is that we have access to the information view source files, since we mostly work with HANA Studio and check out information view source files from the HANA repository all the time.

What are the terms and conditions for using Shivers?

Shivers is open source software under the Apache 2.0 License.
This should give you all the freedom to copy, use, modify and distribute Shivers, free of charge, provided you respect the Apache License.

How can I get support for Shivers?

We at Just BI are always happy to help if you run into any issue, but Shivers is currently not a for-profit solution. If you require professional support, we can always negotiate that.

We Want your Feedback!

Please check out Shivers and let us know what you think. You can either drop me a line on this blog, or post issues in the GitHub issue tracker.

Monday, October 24, 2016

For one of Just-BI's customers, we're currently working to productize a custom-built proof-of-concept SAP/HANA application.

This particular application is quite typical with regard to how we use SAP/HANA features and techniques: it is a web-application for desktop and tablet devices, that we serve through SAP/HANA's XS engine. For database communication, we mostly use OData (using XS OData Services), as well as the odd xsjs request (in this case, to offer MS Excel export using ODXL)

The OData services that our application uses are mostly backed by SAP/HANA Calculation views. These, in turn, are built on top of a mixed bag of objects:

Some of these are custom base tables that belong to just our application;

Some are base tables that collect output from an advanced analytics recommendations algorithm that runs in an external R server

Some are information views (analytic views, attribute views and calculations views) that form a virtual datamart (of sorts) on top of base tables replicated from various SAP ERP source systems to our SAP/HANA database.

One of the prerequisites to productize the current solution is a re-design of the backend. Redesign is required because the new target system will be fed from even more ERP source systems than our proof-of-concept environment, and the new backend will need to align the data from all these different ERP implementations. In addition, the R algorithm will be optimized as well: in the proof-of-concept environment, the advanced analytics algorithm passes through a number of fields for convenience that will need to be acquired from elsewhere in the production environment.

To facilitate the redesign we need to have accurate insight into which base columns are ultimately used to implement our application's data services.
As it turns out, this is not so easily obtainable using standard tools. So, we developed something ourselves. We think this may be useful for others as well, which is why we'd like to share it with you through this blog.

Information View Dependencies

The standard toolset offers some support to obtain dependencies for information views (analytic views, attribute views and calculation views):

As it turns out, these standard tools do not give us the detailed information that we need.
The HANA studio features are mainly useful when designing and modifying information views, but do not let us obtain an overview of all dependencies, and not in a way that we can easily use outside of HANA Studio.
The usefulness of querying the OBJECT_DEPENDENCIES system view is limited by the fact that it only reports objects - that is, base tables or information views - but not the columns contained therein.

Getting the information view's definition as XML from _SYS_REPO.ACTIVE_OBJECT

To get the kind of information we need, we're just going to have to crack open the definition of the information view and look what's inside.
As it turns out, HANA stores this as XML in the CDATA column of the _SYS_REPO.ACTIVE_OBJECT system table, and we can query it by package name, object name and object suffix (which is basically the extension of the file containing the definition that is stored in the repository):
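For example, assuming we are after a calculation view, the query might look something like this. The package and object names are placeholders:

```sql
-- Retrieve the XML definition of an information view from the repository.
-- 'my.example.package' and 'MY_CALCULATION_VIEW' are placeholder names.
select cdata
from   _sys_repo.active_object
where  package_id    = 'my.example.package'
and    object_name   = 'MY_CALCULATION_VIEW'
and    object_suffix = 'calculationview' -- or 'analyticview', 'attributeview'
```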

NOTE: It turns out that querying OBJECT_DEPENDENCIES fails at reporting dependencies between analytic views and the attribute views they use.
To capture those dependencies, you need to query _SYS_REPO.ACTIVE_OBJECTCROSSREF.

Once we obtained the XML that defines the information view, we still need to pull it apart so we can figure out how it is tied to our base table columns.
To do that, we first apply a general XML parser that turns the XML text into a (temporary) table or table variable, such that each row represents a distinct, atomic element inside the XML document.
For this purpose I developed a HANA stored procedure called p_parse_xml. Here is its signature:

The result is a tabular representation of the XML parse tree. Each row essentially represents a DOM Node, and the column values represent the node's properties:

The NODE_TYPE column tells us what kind of node we're dealing with. Values in this column conform to the W3C Document Object Model (DOM) standard's enumeration of node type values.
The most important ones are 1 for element nodes ("tags"), 2 for attributes, and 3 for text. The entire parse tree is contained in a document node, which has node type 9.

The NODE_ID is the unique identifier of the node while PARENT_NODE_ID points to whatever node is considered the parent node of the current node.
The parent node is basically the container of the node.
As you can see, the element with NODE_ID=3 has the element node with NODE_ID=1 as parent.
These correspond to the first <child-element> and <parent-element> elements in the document.
Attribute nodes are also marked as children of the element to which they belong. The DOM standard does not consider attributes children of their respective element node, but p_parse_xml does, mainly to keep the result table as simple as possible.

The NODE_NAME column is a further characterization of what kind of node we're dealing with. For most node types, the node name is a constant value, which is essentially a friendly name for the node type. For example, document nodes (NODE_TYPE=9) always have #document as NODE_NAME, and text nodes (NODE_TYPE=3) always have #text as NODE_NAME.
For element nodes and attribute nodes (NODE_TYPE is 1 and 2 respectively), the NODE_NAME is not constant. Rather, their node name conveys information about the meaning of the node and its contents. In other words, element and attribute names are metadata.

The NODE_VALUE column contains actual data. For element and document nodes, it is always NULL. For attributes, the NODE_VALUE column contains the attribute value, and for text nodes, it is the text content.

The POS column lists the position where the current element was found; the LEN column keeps track of the length of the current item as it appears in the document. Typically you won't need these columns, except maybe for debugging purposes. The TOKEN_TEXT column is also there mostly for debugging purposes.
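For reference, here is a hypothetical sketch of what the procedure and its result table might look like, based only on the columns just described. This is not the actual signature; for that, refer to the just-bi/hades repository:

```sql
-- Hypothetical sketch of p_parse_xml; the result columns match those described above.
create procedure p_parse_xml (
  in  p_xml nclob                 -- the XML document text to parse
, out p_dom table (
    node_id        int            -- unique identifier of the node
  , parent_node_id int            -- node_id of the containing node
  , node_type      tinyint        -- W3C DOM node type: 1=element, 2=attribute, 3=text, 9=document
  , node_name      nvarchar(128)  -- element/attribute name, or '#document', '#text'
  , node_value     nclob          -- attribute value or text content; null for elements
  , pos            int            -- position of the item in the document
  , len            int            -- length of the item in the document
  , token_text     nclob          -- raw token text, mostly for debugging
  )
)
```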

Extracting Base Columns from Analytic and Attribute views

If you examine the XML definition of Analytic and/or Attribute views, you'll notice that table base columns are referenced by <keyMapping> and <measureMapping>-elements like this:
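A reconstructed, simplified example of such elements; the schema, table and column names are made up:

```xml
<!-- reconstructed example; names are illustrative -->
<keyMapping schemaName="MY_SCHEMA" columnObjectName="MY_CUSTOMER_TABLE" columnName="CUSTOMER_ID"/>
<measureMapping schemaName="MY_SCHEMA" columnObjectName="MY_SALES_TABLE" columnName="REVENUE"/>
```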

So, assuming we already parsed the model of an analytic or attribute view using p_parse_xml and captured its result in a table variable called tab_dom, we can run a query like this to obtain all <keyMapping> and <measureMapping>-elements:

select mapping.*
from :tab_dom mapping
where mapping.node_type = 1 -- get us all elements
and mapping.node_name in ('keyMapping', 'measureMapping') -- with tag names 'keyMapping' or 'measureMapping'

While this gives us the actual elements, the actual data we're interested in is buried in the attributes of the <keyMapping> and <measureMapping>-elements. You might recall that in the p_parse_xml result, attribute nodes (NODE_TYPE=2) appear as child nodes of their respective element. So, we can extract all attributes of all <keyMapping> and <measureMapping>-elements with a self-join like this:

select mapping_attributes.*
from :tab_dom mapping
inner join :tab_dom mapping_attributes
on mapping.node_id = mapping_attributes.parent_node_id -- find all nodes that have the keymapping element node as parent
and 2 = mapping_attributes.node_type -- but only if their node type indicates they are attribute nodes
where mapping.node_type = 1
and mapping.node_name in ('keyMapping', 'measureMapping')

Since we are interested in not just any attribute node, but attribute nodes having specific names like schemaName, columnObjectName and columnName, we should put a further restriction on the NODE_NAME of these attribute nodes. Also note that this query will potentially give us multiple rows per <keyMapping> or <measureMapping>-element (in fact, just as many as there are attributes). Since we'd like to have just one row for each <keyMapping> or <measureMapping>-element having the values of its schemaName, columnObjectName and columnName attributes in separate columns, we should rewrite this query so that each attribute gets its own self-join.
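Such a rewrite might look like the following sketch; the column aliases are illustrative:

```sql
-- One self-join per attribute we want as a separate column in the result.
select schema_name.node_value        as schema_name
,      column_object_name.node_value as table_name
,      column_name.node_value        as column_name
from :tab_dom mapping
inner join :tab_dom schema_name
   on  mapping.node_id      = schema_name.parent_node_id
   and schema_name.node_type = 2
   and schema_name.node_name = 'schemaName'
inner join :tab_dom column_object_name
   on  mapping.node_id             = column_object_name.parent_node_id
   and column_object_name.node_type = 2
   and column_object_name.node_name = 'columnObjectName'
inner join :tab_dom column_name
   on  mapping.node_id      = column_name.parent_node_id
   and column_name.node_type = 2
   and column_name.node_name = 'columnName'
where mapping.node_type = 1
and   mapping.node_name in ('keyMapping', 'measureMapping')
```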

Extracting base columns from Calculation views

Getting the base columns used in calculation views is a bit more work.
However, the good news is that, in terms of the queries we need to write, it does not get much more complicated than what we saw for analytic and attribute views in the previous section.
Querying the xml parse tree almost always boils down to finding elements and finding their attributes, and then doing something with their values.

The reason why it is more work to write queries against the model underlying calculation views is that the XML documents that define calculation views use an extra level of mapping between the objects that represent the source of the columns and the way these columns are used inside the view.
The following snippet might illustrate this:
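The following is a reconstructed, simplified fragment; the names, and any attributes other than the ones discussed below, are illustrative:

```xml
<!-- reconstructed, simplified fragment; names are illustrative -->
<DataSource id="MY_BASE_TABLE" type="DATA_BASE_TABLE">
  <columnObject schemaName="MY_SCHEMA" columnObjectName="MY_BASE_TABLE"/>
</DataSource>
<calculationView id="Projection_1">
  <input node="#MY_BASE_TABLE">
    <mapping source="CUSTOMER_ID" target="CUSTOMER_ID"/>
  </input>
</calculationView>
```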

Get all <DataSource>-elements having a type-attribute with the value "DATA_BASE_TABLE".
These elements represent all base tables used by this view. Other types of objects used by this view will have another value for the type-attribute.

To obtain the schema and table name of the base table, find the <columnObject>-childelement of the <DataSource>-element.
Its schemaName and columnObjectName-attributes respectively contain the database schema and table name of the base table.

The <DataSource>-elements have an id attribute, and its value is used as unique identifier to refer to this data source.

Find all instances where the base table datasources are used.

A calculation view is essentially a graph of data transformation steps, each of which takes one or more streams of data as input, turning it into a stream of output data.
In the XML document that defines the calculation view, these transformation steps are represented by <calculationView>-elements.
These <calculationView>-elements contain one or more <input>-child elements, each of which represents a data stream that is used as input for the transformation step.

The <input>-elements have a node-attribute.
The value of the node-attribute is the value of the id-attribute of whatever element it refers to, prefixed by a hash-sign (#).

Note that this is a general technique for referencing elements within the same XML document.
So, in order to find where a <DataSource>-element is used,
it is enough to find all elements in the same XML document whose node-attribute references the value of the <DataSource>-element's id-attribute.

Once we have the elements that refer to our <DataSource>-element, we can find out which columns from the data source are used by looking for <mapping>-child elements.

The <mapping>-elements have a source-attribute, which holds the column-name.

With these steps in mind, the SQL query we need to do on the calculation view parse tree becomes:
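A sketch of that query is shown below. It follows the steps just outlined; the actual query in the just-bi/hades repository may differ in its details:

```sql
select columnobject_schema.node_value as schema_name
,      columnobject_name.node_value   as table_name
,      mapping_source.node_value      as column_name
from :tab_dom datasource
-- the type-attribute of the <DataSource>-element must be 'DATA_BASE_TABLE'
inner join :tab_dom datasource_type
   on  datasource.node_id        = datasource_type.parent_node_id
   and datasource_type.node_type = 2
   and datasource_type.node_name = 'type'
   and datasource_type.node_value = 'DATA_BASE_TABLE'
-- the id-attribute uniquely identifies the data source
inner join :tab_dom datasource_id
   on  datasource.node_id      = datasource_id.parent_node_id
   and datasource_id.node_type = 2
   and datasource_id.node_name = 'id'
-- the <columnObject>-child element holds the schema and table name
inner join :tab_dom columnobject
   on  datasource.node_id     = columnobject.parent_node_id
   and columnobject.node_type = 1
   and columnobject.node_name = 'columnObject'
inner join :tab_dom columnobject_schema
   on  columnobject.node_id          = columnobject_schema.parent_node_id
   and columnobject_schema.node_type = 2
   and columnobject_schema.node_name = 'schemaName'
inner join :tab_dom columnobject_name
   on  columnobject.node_id        = columnobject_name.parent_node_id
   and columnobject_name.node_type = 2
   and columnobject_name.node_name = 'columnObjectName'
-- find node-attributes that reference the data source id, prefixed with '#'
inner join :tab_dom input_node
   on  input_node.node_type  = 2
   and input_node.node_name  = 'node'
   and input_node.node_value = '#' || datasource_id.node_value
-- <mapping>-elements are siblings of the node-attribute: both are children of the <input>-element
inner join :tab_dom mapping
   on  mapping.parent_node_id = input_node.parent_node_id
   and mapping.node_type      = 1
   and mapping.node_name      = 'mapping'
-- the source-attribute of the <mapping>-element holds the column name
inner join :tab_dom mapping_source
   on  mapping.node_id         = mapping_source.parent_node_id
   and mapping_source.node_type = 2
   and mapping_source.node_name = 'source'
where datasource.node_type = 1
and   datasource.node_name = 'DataSource'
```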

Putting it all together

So far, we have seen:

How to do general queries for dependencies using OBJECT_DEPENDENCIES, and that you need to query _SYS_REPO.ACTIVE_OBJECTCROSSREF to find out which attribute views are used by analytic views.

How to find the model XML code underlying our information views from the _SYS_REPO.ACTIVE_OBJECT table.

How to parse XML, and how to query the parse tree for elements and attributes

How the XML documents for information views are structured, and how to find base columns used in their models

With all these bits and pieces of information, we can finally create a procedure that fulfills the original requirement to obtain the base columns used by our information views.
This is available as the p_get_view_basecols stored procedure.
Here is its signature:

create PROCEDURE p_get_view_basecols (
-- package name pattern. Used to match packages containing analytic, attribute or calculation views. Can contain LIKE wildcards.
p_package_id nvarchar(255)
-- object name pattern. Used to match the name of analytic, attribute or calculation views. Can contain LIKE wildcards.
, p_object_name nvarchar(255) default '%'
-- object suffix pattern. Can be used to specify the type of view. Can contain LIKE wildcards.
, p_object_suffix nvarchar(255) default '%'
-- flag to indicate whether to recursively analyze analytic, attribute or calculation views on which the view to be analyzed depends.
-- 0 means only look at the given view, 1 means also look at underlying views.
, p_recursive tinyint default 1
-- result table: base columns on which the specified view(s) depend.
, out p_cols table (
-- schema name of the referenced base column
schema_name nvarchar(128)
-- table name of the referenced base column
, table_name nvarchar(128)
-- column name of the referenced base column
, column_name nvarchar(128)
-- list of view names that depend on the base column
, views nclob
)
)

Obtaining the list of base columns on which our application depends is now as simple as calling the procedure, like so:
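For example, a call along these lines; the package name is a placeholder:

```sql
-- 'my.example.package%' is a placeholder package name pattern;
-- the ? receives the result table with the base columns.
call p_get_view_basecols(
  'my.example.package%' -- package name pattern
, '%'                   -- any object name
, '%'                   -- any object suffix
, 1                     -- recursively analyze underlying views
, ?                     -- out parameter: the result table
);
```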

Finally

I hope you enjoyed this post!
Feel free to leave a comment to share your insights or to give feedback.

Please note that all source code for this topic is freely available as open source software in our just-bi/hades github repository.
You are free to use, modify and distribute it, as long as you respect the copyright notice.

We welcome contributions! You can contribute in many ways:

Simply use the procedures. Give us feedback. You can do so by leaving a comment on this blog.

About Me

I'm Roland Bouman (@rolandbouman on twitter). I'm a software (web) application developer and I work on both the front end as well as the back end. I do data modeling, database design, ETL, Analytics, and Business Intelligence.