Changing the world, one data model at a time. How can I help you?


Hey fellow data warriors! Here is a new joint blog post I just did with fellow data warrior Dale Anderson from Talend! Check it out. I hope you find the concept compelling!

So you want to build a Data Lake? OK, sure, let’s talk about that. Perhaps you think a Data Lake will eliminate the need for a Data Warehouse and all your business users will merely lure business analytics from it easily. Maybe you think putting everything into Big Data technologies like Hadoop will resolve all your data challenges and deliver fast data processing, with Spark delivering cool Machine Learning insights that magically give you a competitive edge. And really, with NoSQL, nobody needs a data model anymore, right?

Avoid the data swamp! Use a modern cloud-based DWaaS (Snowflake) and a leading-edge Data Integration tool (Talend) to build a Governed Data Lake.

More specifically, do we still need to worry about data modeling in the NoSQL, Hadoop, Big Data, Data Lake world?

This keeps coming up. Today it was via email after a presentation I gave last week. This time the query was about the place of data modeling tools in this new world order.

Bottom line: YES, YES, YES! We still need to do data modeling and therefore need good data modeling tools and skills.

A picture can say so much!

In order to get any business value out of the data, regardless of where or how it is stored, you have to understand the data, right?

That means you have to understand the model of the data. Even if a model (or schema) is not needed up front to store the data, as it would be with schema-on-write, you must still discern the model in order to use the data (schema-on-read).
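To make schema-on-read concrete, here is a minimal sketch in Python. The records, field names (`cust_id`, `amt`, `order_dt`), and the `Order` class are all hypothetical, made up for illustration. The point is that the raw lines were stored with no schema at all, yet nothing useful can be computed until a model is applied at read time.

```python
import json
from dataclasses import dataclass
from datetime import date

# Hypothetical raw records, landed in the lake as-is.
# Nothing was validated or typed when they were written.
RAW_EVENTS = [
    '{"cust_id": "42", "amt": "19.99", "order_dt": "2016-03-01"}',
    '{"cust_id": 7, "amt": 5, "order_dt": "2016-03-02", "promo": "SPRING"}',
]


@dataclass
class Order:
    """The model we have to know (or discover) before the data is usable."""
    customer_id: int
    amount: float
    order_date: date


def read_order(line: str) -> Order:
    """Apply the schema at read time: pick fields, rename them, coerce types."""
    rec = json.loads(line)
    return Order(
        customer_id=int(rec["cust_id"]),
        amount=float(rec["amt"]),
        order_date=date.fromisoformat(rec["order_dt"]),
    )


orders = [read_order(line) for line in RAW_EVENTS]
total = sum(o.amount for o in orders)
```

Notice that `read_order` is the data model, just written down in code instead of in a modeling tool. Someone still had to decide that `amt` is a decimal amount and that `cust_id` identifies a customer.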

It is (mostly) impossible to get repeatable, auditable metrics, KPIs, dashboards, or reports that bring value to the business without understanding the semantics of the data – which means you at least need a conceptual or logical model.

And if you want or need to join data from multiple sources, then you really have to understand each source, or there is no way to properly join it all together to get meaningful results.
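Here is a small illustration of that point, assuming two made-up sources: a CRM extract that stores the customer number as a zero-padded string and a billing extract that stores it as an integer. The names (`customer_no`, `cust_id`, `naive_join`, `modeled_join`) are hypothetical. Without knowing that both columns describe the same business key, the join silently returns nothing.

```python
# Hypothetical extracts from two source systems.
crm = [
    {"customer_no": "0042", "name": "Acme"},
    {"customer_no": "0007", "name": "Globex"},
]
billing = [
    {"cust_id": 42, "invoice_total": 100.0},
    {"cust_id": 7, "invoice_total": 250.0},
    {"cust_id": 42, "invoice_total": 50.0},
]


def naive_join(crm_rows, billing_rows):
    """Join on the raw values, with no understanding of either source."""
    return [
        (c["name"], b["invoice_total"])
        for c in crm_rows
        for b in billing_rows
        if c["customer_no"] == b["cust_id"]  # "0042" never equals 42
    ]


def modeled_join(crm_rows, billing_rows):
    """Join after mapping both columns to one agreed key: an integer customer id."""
    name_by_id = {int(c["customer_no"]): c["name"] for c in crm_rows}
    return [
        (name_by_id[b["cust_id"]], b["invoice_total"])
        for b in billing_rows
        if b["cust_id"] in name_by_id
    ]


naive_result = naive_join(crm, billing)
modeled_result = modeled_join(crm, billing)
```

The naive version raises no error. It just produces an empty result, which is exactly the kind of quiet failure that shows up later as a wrong number on a dashboard.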

There are a few data cleansing, discovery, and “virtualization” tools out there that will help you figure out those relationships, but they are expensive and mostly rely on standard data profiling techniques to find similar data objects across the sets and propose “relationships”. Some allow for the definition of fairly sophisticated matching rules, including customizations. But a human still needs to figure those out, test them, and validate the results.

In the end you still have to know your data.

One of the best ways to do that, in my opinion, is to model that data. Otherwise your data lake will likely become a data swamp!

So keep your data modeling tool and keep building your data dictionary with your business folks.