Musings by Ronald Damhof

Sunday, November 22, 2015

There is a fundamental choice to be made when data is to be 'processed':

a choice between consistency and availability, or

a choice between work upstream and work downstream, or

a choice between a sustainable (long-term) view and an opportunistic (short-term) view on data

Cryptic, I know. Let me explain myself a bit.

Let me take you on a short journey. Suppose we receive a dataset from <XXX>. In the agreement with <XXX> we state the structure, semantics and rules regarding the dataset. This might be shaped as a logical data model or - for communication's sake - it might just be natural language; I don't care, as long as the level of ambiguity is low. On the spectrum of consistency vs. availability, I choose a position skewed towards consistency.

If I choose consistency, I need to validate the data before it is processed, right? We have an agreement, and I am honoring this agreement by validating whether the 'goods' are delivered as agreed. In data, I like to validate against the logical data model. So, when the data violates the logical model, it does not adhere to the agreement, agree?
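To make this tangible, here is a minimal sketch of what 'validating on the logical model' can look like. The field names and rules are invented for illustration; in practice they would be derived from the agreement with <XXX>.

```python
# A minimal sketch of validating a delivery against the agreed logical model.
# The fields and rules below are invented; a real set comes straight from the
# agreement with <XXX>.

RULES = {
    "customer_id": lambda v: isinstance(v, str) and len(v) == 8,
    "birth_date":  lambda v: v is not None,                       # mandatory per agreement
    "wheel_count": lambda v: isinstance(v, int) and 0 <= v <= 4,  # cars have at most 4 wheels
}

def validate(record: dict) -> list[str]:
    """Return all violations of the agreed model for one record."""
    return [
        f"{field}: value {record.get(field)!r} violates the agreed model"
        for field, rule in RULES.items()
        if not rule(record.get(field))
    ]

violations = validate({"customer_id": "AB12", "birth_date": None, "wheel_count": 5})
if violations:
    print("Reject the delivery and report back to <XXX>:")
    for v in violations:
        print(" -", v)
```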

What are the options? Simple: you give feedback to <XXX> that they need to solve this issue fast; it's a violation of the agreement.

"But, but, but...we can't ask them that, they never change it, we gotta deal with the data as we receive it".

Ah, so you want to process the data, despite the fact that it does not adhere to the logical model? So, what you are saying is that you want to slide to the right on the spectrum of consistency vs. availability? Fine, no problem, we will 'weaken' the logical model a bit, so the data survives validation, will be processed and will be made available.

But what happened here is crucial for any data engineer to fully comprehend. We chose to weaken the logical model, a model that correctly reflects the business. We chose to accept something that is broken. The burden of this acceptance shifts downstream, towards the users of the data. They need to cope with this problem now. They can no longer say 'oh crap, data dudes, you gotta fix this'; it is their problem now!

There is another aspect at play here, which might well be the most important one: data integration. The logical model ideally stems from a conceptual model (either formal or informal) that states the major concepts of the business - in other words, the future integration points. Logical models re-use these concepts to assure proper integration of any data coming in. Suppose I have a dataset coming in where the data is so bad that the only unique key we can identify is the row number. We basically get a 'data lake' strategy: the logical model is one entity (maybe two) and the data is loaded as it was received. We are way down the spectrum of consistency vs. availability. You probably guessed it; data integration (if at all possible) is pushed downstream.
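For contrast, a minimal sketch of that availability extreme, assuming a CSV delivery (the file path and format are hypothetical): a single entity, no validation, the row number as the only key.

```python
import csv

def load_as_received(path: str) -> dict[int, dict]:
    """Load a delivery as-is: one entity, keyed only by row number.
    No validation, no integration points - that burden moves downstream."""
    with open(path, newline="") as f:
        return {
            row_nr: record
            for row_nr, record in enumerate(csv.DictReader(f), start=1)
        }
```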

Whether or not you can push the slider towards consistency is of great value to the business, since we can relieve (e.g.) the data scientists of the burden of endlessly prepping, mangling and cleaning data before they can actually do the work they were hired to do. But we also have to acknowledge the fact that sometimes (often) we do not have a choice and are forced to slide towards availability. It's not a bad thing! Be conscious about it though: you are introducing 'data-debt' and somebody is paying the price. My advice: communicate that 'price' as clearly as you can...

And if you are forced into the realm of availability, set up your data governance accordingly. Are you able to set up policies in such a way that the slider - in time - gradually moves towards consistency? A great option is to keep the agreed-upon logical model (highly skewed towards consistency), but agree on a weakened version as well (so you can process the data and make it available). You then report back to <XXX> on the validation errors against the original logical model.
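A sketch of that governance option, reusing the rule-dictionary shape from the earlier sketch (the weakened and original models are passed in as two rule sets; all names are illustrative):

```python
def process_delivery(records, weak_rules, strict_rules):
    """Accept records under the weakened model, but report every
    violation of the original, agreed-upon model back to <XXX>."""
    accepted, report = [], []
    for row_nr, record in enumerate(records, start=1):
        # The weakened model decides what gets processed and made available:
        if all(rule(record.get(field)) for field, rule in weak_rules.items()):
            accepted.append(record)
        # The original model decides what gets reported back upstream:
        for field, rule in strict_rules.items():
            if not rule(record.get(field)):
                report.append((row_nr, field))
    return accepted, report
```

Everything in `report` is the 'price' of sliding towards availability, made explicit.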

Final thought; let's be honest, we often tend to go for the easy way out: let's process all we receive and deal with it later. Our product owners and management are happy: yeah, we got data. But then reality kicks in: other sources are to be combined and the data is very hard to use because of its inconsistencies ("we seem to own cars with five wheels, is that correct?"). Let's buy some data-mangling tools and let each data scientist code his or her way out of the problem, increasing the data-debt even more. My suggestion: make a conscious choice between consistency and availability - all the work upstream done in the name of consistency will - in the end - pay high returns. Resist the opportunistic urges... ;-)

This post is trying to describe a very subtle orchestration that is going on between data architecture, data governance, data processing and data quality.

Wednesday, October 07, 2015

It is a nasty cocktail; making a business case upfront and executing it 'change driven'.

Why? Because the majority of business cases are grounded in a kind of 'plan-driven, we-can-mold-the-world-as-we-see-it' thinking. A business case is unfortunately often used - let's be frank, guys - as a management fetish: a stick that can be used when things go differently and somebody needs to be blamed.

I work in environments where the IT-component is often pretty big and if we have learned one thing, it's that execution of anything should be change-driven. Why? Because, we know one thing for sure; we know very little when we start and we know nothing of what the future will bring.

Development of software is often situated in the complex domain, and when something completely new needs to be done, it can even be situated in the chaotic domain (see figure). In these domains we do not have BEST PRACTICES! Yes, we might have good practices, maybe emergent practices (stuff we need to discover) or even novel practices (we need to experiment...). Context is leading; there are no cookbooks, methods, frameworks, blueprints or protocols.

And still we write our business plan: "If we do <x> it will cost <y> and the benefits will be <z>"

And shit happens:

1) The business case is followed to the letter and something is created that is totally useless. We have seen this happen over and over again with large ICT developments in government: 'but we executed the business plan'. And now the weird part: senior management is satisfied.

2) We go 'change-driven': we discover, experiment, learn, improve and eventually end up with something successful. But unfortunately, something different from what was stated in the original business plan. And now the weird part: senior management is dissatisfied.

What is happening here? Two things.

1. TRUST/SAFETY is missing:
a) TRUST: senior management needs to trust its executioners and be truly interested, and
b) SAFETY: executioners need to feel safe giving feedback to senior management.

2. Management accounting sucks
There is another dynamic going on. And that dynamic is rooted in the school of management accounting and in the DNA of many executives. Suppose I am the manager of business unit X and I build a highly innovative solution for a specific purpose. This innovative solution turns out to be efficient and effective for business units Y and Z as well, so they decide to use it. Hell, I even adapt my solution here and there to match their requirements.

What happens in old-school management accounting? I can hear their creepy bureaucratic voices: "your solution was way over budget and you did not realize your benefits". You are dumbfounded...

NASA went to the moon, and in doing so it revolutionised healthcare and saved countless lives. I'll bet ya, old-school management accounting would have regarded NASA's trip to the moon as a failure...

Old-school management accounting is much too prominent and its decisive power is far too high. I was at the Drucker conference in Vienna last year, and I heard Clayton Christensen say that 'the finance role in the average Board of Directors should be downgraded in order for innovation to thrive'.

Friday, October 02, 2015

My two sons have played badminton for four years now. They are fourteen and eleven. I remember vividly the first year: boldly, they entered the championships. They got beaten, with big numbers, very big numbers. No chance whatsoever.

There were two possible reactions: 1) 'Hmm... I want to try soccer' or 2) 'I want to train more.'

They chose the second one and began training a lot more. They entered the badminton school, where they came into contact with like-minded kids and passionate, skilled trainers.

So, the next year, another championship. The nice thing about youth championships is that you keep on battling the same guys every year - they are in the same age class. So we had a baseline from the first year. The second year: same guys, and they still got beaten. But the numbers were not that high anymore.

And you know what? My sons noticed that... "I did not win, but I am closing in, I am improving myself! Let's keep on training." My sons (14 and 11) are both playing in the highest youth league (Under-19) now, and the coming championships will be thrilling. The rate of improvement is staggering. And you know what my eldest son said to me last week, when we were driving with the team to our next game in the competition?

"I so hope I get to play a good guy and we can make it three sets (the whole nine yards..), I want to be in the field as long as I can and play my best. I don't care what the result is."

And you know what? I believed him; he was not just saying that to cover his ass. He lost his single (21-19, 19-21, 19-21) and he was smiling...

It is a life lesson he gave me at that very moment. It is not about winning; it is about improving oneself. I am still privileged that both of my sons allow me to coach them, and I believe wholeheartedly that a coach should not aim for the kill/win; he should focus on progress and on improving self-awareness. That is something my son has taught me.

My son (and I) learned something which I think is a bit lost in society: it is not about winning, profit or maximizing shareholder value (!). Those are consequences (very nice consequences, btw). Just play the game, do your job, give it your best, try to attain an ever-deeper understanding of the field you work in. The result? That is just a consequence.

Wednesday, September 23, 2015

I loathe the misuse of the term 'Data Architecture' by most software/consultancy/technology firms.
I loathe the misuse of the term 'Data Architecture' by the average Enterprise Architect.

This is a blogpost that attempts to clarify the term 'Data Architecture'. I do not want to complicate things, so let's just quote DAMA and its Body of Knowledge, which extends the term to Data Architecture Management:

Data Architecture Management is the process of defining and maintaining specifications that:

Provide a standard common business vocabulary,

Express strategic data requirements,

Outline high level integrated designs to meet these requirements, and

Align with enterprise strategy and related business architecture.

Data architecture is an integrated set of specification artifacts used to define data requirements, guide integration and control of data assets, and align data investments with business strategy. It is also an integrated collection of master blueprints at different levels of abstraction.

The DAMA Body of Knowledge also defines Enterprise Data Architecture. Aye, dear Enterprise Architect, do you have an Enterprise Data Architecture?

Enterprise data architecture is an integrated set of specifications and documents. It includes three major categories of specifications:

The enterprise data model: The heart and soul of enterprise data architecture,

The information value chain analysis: Aligns data with business processes and other enterprise architecture components, and

Related data delivery architecture: Defines the blueprint for how data is delivered from sources to points of use.

Enterprise data architecture is really a misnomer. It is about more than just data; it is also about terminology. Enterprise data architecture defines standard terms for the things that are important to the organization–things important enough to the business that data about these things is necessary to run the business. These things are business entities. Perhaps the most important and beneficial aspect of enterprise data architecture is establishing a common business vocabulary of business entities and the data attributes (characteristics) that matter about these entities. Enterprise data architecture defines the semantics of an enterprise.

I would like to point out two additions to these descriptions: 1) business rules management, although that comes with the territory if you model information (but loads of people do not get that); 2) (Enterprise) Data Architecture needs to be clearly aligned with data management and data governance. Getting them aligned results in relevant (and acceptable) data quality.

Please note that my definition of quality is: quality is value to some person. This definition deviates from the general (ISO) definition. If you want to know more, please read this blogpost I wrote a year ago.

Please note that my little 101-data-triangle deviates from the DAMA management functions. I think that the quality of data should play the lead part; all the others (management, architecture, governance) are supportive.

And to those who think that, in a world of increasing datafication, Enterprise Data Architecture is not a part of Enterprise Architecture:

Your organisation will miss out on opportunities that will never be discovered;

Your organisation will have a hard time leveraging exciting innovative technology in the data-space;

Saturday, August 15, 2015

“As long as we persevere and endure, we can get anything we want” – Mike Tyson

Human beings are terrible at seeing the bigger picture, strategizing, taking responsibility for it and overcoming huge amounts of opposition. Add to that a management with an innate predisposition towards short-termism, risk aversion and lack of trust, and you have yourself a rather explosive cocktail that results in chaotic organisations.

Chaotic organisations will have chaotic data…

The quality of the data in organisations is terrible. Let's not beat around the bush; it is. Michael Stonebraker, winner of the A.M. Turing Award in 2015, confirmed it. According to Stonebraker, the average organisation owns 5,000 silos of data; bigger organisations are estimated to have 10,000. It boggles the mind why organisations keep persisting in cleaning this data downstream. Stonebraker – founder of Tamr – wonders why you don't clean your data before it enters your downstream systems. If you don't, he continues: “systems like Tamr will consume all your profits”.

I love this guy. Blatantly honest. But the fact is that the visualisation and “data muddling” technology is thriving. And I get that…

Cleaning up the garbage is ‘easier’ when the actual garbage can be seen. You can hire cleaners, buy machines/software to do the cleaning, etc. It falls perfectly in line with management's innate predisposition towards short-termism, risk aversion and lack of trust.

Prevention of (data-) garbage requires a long-term view, vision, strategy, trust (!), perseverance. It requires an informational perspective on your organisation. A perspective where data (and process) – the raw ingredients – are at the center of the information strategy. Applications, functions and technology are derived from it, not vice versa. Organisations keep on buying or building applications/technology where the data is a by-product and subsequently keep investing in downstream tools and people to clean the garbage. Crazy.

It is what I call ‘high heels architecture’ – It looks sexy, but walking hurts like hell.

We need to look at data from a holistic perspective - across divisions and often even across organisations. We need to conceptualize the information, logically model the data and choose the appropriate technology.

Very old school? Hell yes!! We need to educate our young people again in the craft of information: the ability to think in (logical) abstractions and separations of concerns. It is a craft! Nowadays, our education systems are hugely biased towards programming languages, applications, technology, etc.

Finally, the biggest problem, in my view, is how we run organisations. How we educate our managers. Appreciate bad news from your employees, trust the craftsmanship, take responsibility for the path chosen, trust teams to organize themselves, radical transparency (!), etc.

Back on the subject. Many organisations seem to have given up on data. In a future where datafication will only increase, huge amounts of their profits will indeed be spent on getting the data right. Risks in terms of operational failures, but also privacy breaches, will increase ever more. Ultimately, the chances of delighting your customers, patients, tax-payers or students will diminish drastically.

Monday, June 15, 2015

Last month, an interview with me was published in the SAS Future Bright Magazine. Their quarterly theme was 'A Data Driven Reality' [Link to Dutch version] - right up my alley. The magazine was published in French, English and Dutch.

At the event that launched this quarterly magazine, I had the honour of speaking about the core of my interview: the Data Quadrant Model.

Thanks SAS Institute for having me, for publishing a great magazine and for giving data management the attention it so desperately needs.

For those interested in the interview that attempts to explain the Data Quadrant Model in layman's terms - here are the downloads:

Wednesday, May 20, 2015

The Data Quadrant Model (DQM) is a model that an organisation can use to make sense of its data domain. One of the sense-making aspects is 'how to organise', and although every organisation should translate the DQM into an organisation model that fits its context (culture, maturity, timing, people, system landscape, etc.), there might be some heuristics that help you on your way.

In general, the DQM - if travelled from I, via II and IV, to III (see figure) - will be characterised by increasing entropy, something I have tried to explain in an earlier blogpost. For the sake of simplicity, in this blogpost the question of 'how to organise' is translated into the degree of centralisation versus decentralisation.

Heuristic: systems with lower entropy are more prone to centralisation, as opposed to systems with higher entropy, which are prone to decentralisation.

So, quadrant I and quadrant II have the highest propensity to be organised centrally and quadrant IV and quadrant III have the highest propensity to be organised decentrally.

Easy peasy..? So we need one central quadrant I organisational entity. Let's go overboard and give it a name: a Data Service Center (DSC). The DSC is comprised of data and information modellers, engineers and architects who model, validate and process the data, and who give access to the data to serve application development (II), BI professionals (II), data scientists (IV), etc.

But how is the DSC organised under different operating models? Let's refresh our memory a bit with the four-quadrant model developed by Ross and Weill in their landmark book 'Enterprise Architecture as Strategy'. They identified four types of operating models: diversification, replication, coordination and unification. Yes, my dear data architect, this is stuff you need to know. This is stuff you need to be able to translate to the data domain, so you can advise your management on the consequences.

Saturday, May 16, 2015

Where quadrant I and quadrant II are (simplistically stated) degrees of order, and quadrants III and IV are degrees of un-order, there is a fifth quadrant: the quadrant of disorder.

This quadrant comes into play when one does not have a clue what to do, where one is, or what decision to make. If you are not aware that you are in a disorderly state, you will fall back to your comfort level. Do the stuff you always did...

Data scientists (quadrant IV) need to write down their 'requirements' because somebody needs to make a data-mart....

Entropy in the Data Quadrant Model is lowest in quadrant I, higher in II, even higher in IV and highest in III. If we do not actively decrease it, we tend to lose the value of the data(-platform), and we will find ourselves investing huge amounts of euros to do the same thing over and over again (like Groundhog Day), or spending huge amounts of euros to control an 'out-of-control beast'. Unfortunately, I have seen and still see this a lot.

In an isolated system, entropy cannot decrease. In the universe, entropy will only increase (and eventually we all die). The Data Quadrant Model, however, is a closed system that can exchange energy with its surroundings, and in such a system we can decrease entropy. How?

There are roughly three important directions in which entropy is to be decreased actively:

Decreasing entropy from III to I

Decreasing entropy from IV to II

Decreasing entropy from II to I

(I describe these in the details of this post. Warning: it is not for the faint-hearted.)

Important message: as in physics, decreasing entropy costs energy. The higher the difference in entropy between two systems/quadrants, the higher the energy needed. And yes, you can replace 'energy' with 'costs'.

We now enter the field of Data Management. A prime directive of Data Management is to reduce entropy in data and to keep the data platform in a sustainable mode, where it serves the data-driven and data-centric organisation. It is hard, not cheap and still mostly unknown territory, but if you can make it work, the rewards - in the era of datafication - are huge.