Eckhard Zemp
https://www.zemp.ch
Business Intelligence, Lifehacks and Travelblog

How to organize a Raw Data Lake
Sat, 10 Nov 2018 10:00:45 +0000

How many times have you heard the words „Data Lake“ in the last few years? And how many times have you asked yourself how it should be organized? And how often did you ask Google for more information, only to get funky buzzwords instead of valuable results? I did, many times, and found too few answers.

Over time, I worked out an optimal design for organizing the very first layer of a data lake, or in old-school language a ‚file repository‘. And recently I completed it by asking the experts at a DDVUG meetup, where Michael Olschimke showed us their solution. Thanks, Michael.

So, what layers have I found:

/source

This layer represents the data provider. If the data comes from an external partner, I name the layer after the partner. If it comes from an internal system, I name it after that system.

It could also be the technology that is the master of all the data that follows, e.g. mysql, kafka, etc.

There are also tons of files produced by my own company that don't fall into any category, e.g. Excel files, manually maintained data, etc. I apply the same structure to them, so we as a company are a source too.

/connection

If there is a chance of having multiple connections for the same source, add a connection layer. If there is the slightest chance of having another set of credentials, separate it.

/operation

Usually I start by collecting data and processing it. But at some point in the future I will start to pump data in the opposite direction. So where do those files go? It could be a completely different file repository, or it can be combined in this one.

This is the place to specify the kind of operation we do. I have:

Import

Export

Update (log files of operations I perform on the source system)

…

/object type

This layer could be used to group the source objects into a structure. E.g.

tables

meta

log

…

It could also be a directory structure like database_name/schema_name if data from SQL systems is to be saved.

/object

This is the data object. In data warehousing this is the table name. Another way to describe it is to think of all the directories and files below as being summed up at this level. If a file doesn't match the object definition or object meaning, create another object.

I had difficulties finding a proper word for this layer. Maybe ‚content‘ fits better. Choose your favorite.

Examples are ‚customer‘, ’stock‘ or something like ‚marketing_cost‘. You get the point.

/data set type

What happens if we have different data sets for the very same object?

I have seen the following data sets so far:

Full

Delta

CDC

Full means that we get the files every day in full. E.g. a customer table export from an application. Or an Excel-File which is modified by the users.

Delta means that we only get changed records. E.g. log files, transaction files.

CDC (Change Data Capture) is a very detailed transaction log file of a database having all INSERTs, UPDATEs and DELETEs of the source system. Usually it is ingested directly from a database to another database.

If the rows in a file are just data, why do we need to separate Full and Delta at all? The only reason for separating them is if we need to detect deleted records in the source system. Then I'm able to create status tracking satellites or, in non-Data-Vault systems, to delete the record or mark it as deleted.

The difference between Delta and CDC is that CDC contains information about deleted records.

/version

Version? What is a version for? I derived this principle from APIs; e.g. the Google API evolved from v3 to v4. What happens if the source suddenly decides to modify the file structure? E.g. translating all column names from German to English? Or switching from csv to json? Adding columns is not a new version; the loading procedure should cope with that.

The main reason for a new version is that the existing loading procedure can't process the file anymore. Then add a new version and write a new loading procedure.

My default version is ‚1‘. I have only a few objects where a ‚2‘ had to be created. But I want to be prepared! Modifying an existing layer after the need occurs is not future-proof. Tweaking loading procedures to understand where something ends and something new starts is bad.

/load_date

If you are going to model your data lake or file repository to build Data Vault objects, this layer is a must-have. You'll want to preserve when the data first appeared in your data lake. The definition of the load_date can be found here.

All others can probably omit that.

/group

Sometimes I need to ingest data by different characteristics. E.g. per company, per shop, per account, per machine, per anything. Per day is not a valid group (see below in event date).

Usually the differentiator is not part of the file name but part of the directory structure of the source system. Here is the place to preserve and save it.

The semantics of a group and connection could be kind of equal. Sometimes it makes sense to use it as a connection above or as a group here.

Note to self 1: Why not place group before data_set_type? Because we need a loading procedure for each data_set_type. Groups and the rest can be looped through.

Note to self 2: Why not place group before load_date? We could, but a load_date should be a single entity. It serves incremental loading better than having it inside each group. A group is data, and data comes at the end.

Note to self 3: Why not place group after event_date? Sure, feel free.

/event_date

First question: Does an event date exist as content inside the file? If yes, all good! No need for an extra layer. Nevertheless, it could still be created in a Hive environment to partition the files for speed.

Second question: Should the file have an event date? If yes and we don’t have it, we need to add this somehow to the path.

Maybe the file has a string in the filename like ‚YYYY-MM-DD‘ or an integer value representing a timestamp. Then add this information as an event_date.

If there is nothing available, use as a last resort the create datetime of the file.

Structure it as /YYYY-MM-DD or /year/month/day. As you like.

/missing

Are we still missing something? If a key piece of information is missing inside the file, add it as a layer. Manipulating files to add a missing column is a no-go!

The only possible way to add more information is to keep it outside the file content, maybe as another layer.

Actually, group and event_date above are just examples of missing data.

/file.ext

Finally, our source file.

If you ever need the create datetime of the file, preserve it by renaming the file. It doesn’t survive the copy process to any cloud platform and is lost.

And don't give too much meaning to a filename. Extracting data from the filename is difficult and needs precise instructions about where to look for what. No automation.

A file represented by a filename is just data!

Summary

All these components can be used as directories or, if your system supports it, as metadata on the files. If you use metadata, you will probably find and use even more tags to describe your files.

Choose which elements from above are important for you. In a directory structure, the full string will look like this:

/source/connection/operation/object_type/object/data_set_type/version/load_date/group/event_date/file.ext

A very, very long path, and people might get lost. But computers will like it, as the many pieces help to automate the loading procedures. Some systems even suggest using key=value as a directory name, e.g. year=2018 or shop_code=de. This makes it even easier to automate.
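For illustration only (all names below are made up), a single ingested file under this scheme with key=value directory names could end up at a path like:

/erp/prod/import/tables/customer/full/1/load_date=2018-11-10/shop_code=de/event_date=2018-11-09/customer_export.csv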

The absolute minimum structure would, in my opinion, be:

/source/object/version

If you believe that a table with 2 versions could be automated too, then the 2nd version is not of importance ;-).

For ingesting data, we need a procedure for each combination of:

/source/object/data_set_type/version

If you believe that there are way too many directories and you don't need all these nifty things, don't use load_date and modify the filename to include the omitted layers:

{group}__{event_date}__original_file_name.ext
{group}__{event_date}__{object}.ext

Please tell me your solution, if you believe I missed a case.

Praise be to the LoadDate
Sun, 21 Oct 2018 08:26:40 +0000

The best invention in my Business Intelligence career is the adoption of the concept of a LoadDate.

This post is meant to give credit to Dan Linstedt and thank him for inventing Data Vault and introducing the LoadDate.

When I encountered the hub-and-spoke concept of Data Vault I thought: why on earth do I need a LoadDate? Combining the HubHashKey with the LoadDate in a Satellite to form the primary key makes sense. But why would I need it in a Hub or a Link? Just to know when a Hub entry was first introduced to the data warehouse? As it turns out, the usage is universal and far-reaching!

What is the LoadDate? The definition is:

„The LoadDate is the date and time of insert for the record“

It is the moment when a record is first inserted into a Data Warehouse. It is not the extract date from the source or the current datetime in the source system. It is really the INSERT datetime in my Data Warehouse! The first time a record exists.

How to ingest it:

Save the value in a variable and insert it with the data set into the load table

Set a default constraint like SYSDATETIME() on the LoadDate column (see the sketch below)
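A minimal T-SQL sketch of both options; the table and column names (load_customer, dss_load_datetime) are placeholders, not the original implementation:

-- Option 1: one LoadDate per batch, captured in a variable and inserted with the data set
DECLARE @load_datetime DATETIME2(7) = SYSDATETIME();

INSERT INTO load_customer (customer_code, customer_name, dss_load_datetime)
SELECT customer_code
      ,customer_name
      ,@load_datetime
FROM   source_customer;

-- Option 2: a default constraint stamps the insert datetime
ALTER TABLE load_customer
    ADD CONSTRAINT df_load_customer_dss_load_datetime
    DEFAULT SYSDATETIME() FOR dss_load_datetime;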

What about when the insert takes a long time with a lot of data? Do I get a LoadDate for every second it loads? That would be overly precise. In my world, one LoadDate is one batch. One batch could be the load of one file. The next file gets a new LoadDate. When loading data in a loop, each iteration gets a new LoadDate.

When the record has been inserted into the load table, the LoadDate stays with the record. Wherever it goes! Like a tracking marker.

The LoadDate is used in:

Load

Data Store

Hub

Link

Satellite

Dimension

Facts

I use it everywhere!

In a Data Store (aka Persistent Staging Area) I use it to order the data and batch-load the data to the target. Together with the SequenceNo it also forms the primary key.

In Hubs and Links it tells me when a business key first appeared.

In Raw Vault Satellites I use it to load data incrementally from the Data Store.

In Business Vault Satellites, where multiple source tables are joined together, which LoadDate wins? If history is important, I use it to create mini Point-in-Time tables. If history is not important, I use the most recent record and calculate the MAX value of all LoadDates.

How can I use it in Dimensions & Facts? The main question in those objects is: when do I need to UPDATE columns with new data? When I was researching the best way, I first calculated a ChangeHash. Very time-consuming! Then I thought: why not use the LoadDate for change detection? If the LoadDate changes for a given business key, chances are high that the record needs an UPDATE.
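A minimal sketch of that change detection, assuming placeholder tables dim_customer and stage_dim_customer:

UPDATE dim
SET    dim.customer_name     = stage.customer_name
      ,dim.dss_load_datetime = stage.dss_load_datetime
FROM   dim_customer AS dim
INNER JOIN stage_dim_customer AS stage
        ON stage.customer_code = dim.customer_code
-- a newer LoadDate for the same business key signals that the record probably changed
WHERE  stage.dss_load_datetime > dim.dss_load_datetime;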

With this approach I’m much faster now. And I’m also faster in preparing the data in a stage table as I can use the LoadDate for the delta loading layer too.

Are there any other similar concepts? In SQL Server one could use rowversion (formerly called timestamp). It is a database-wide incremental number stored as binary. But copying data into the next table creates a new rowversion, unless you extract the value and save it as a normal column. In my view it is not easy to handle and not readable. How would you know when record 1234 was inserted?

Dan Linstedt: Thanks for introducing LoadDate to the Business Intelligence Community and forcing users to use it correctly. It is of tremendous help! I use it in batch loading, delta loading and change detection (only in Dimensions & Facts). I couldn’t do without it anymore. Thanks a lot!

My Indexing Journey

From my background as an accountant and controller I have a natural feel for numbers. This helped me understand how to structure data.

A couple of years ago, when I started to work at a startup, I usually grabbed data directly from the source systems to tackle financial issues and find process deficits.

Over time the queries grew in size and run time got longer. When the run time was no longer acceptable and our Excel sheets with tons of formulas and recalculations broke, the need arose to build something more reliable.

First, I created queries and jobs which persisted data in advance. Just a bunch of reports. But we wanted to give this data to users as well. So, we started to build a data warehouse with dimensions and facts and gave access to them with a nice tool. My data journey from Kimball Star Schema to Data Vault is for another blog post. I want to focus here on indexes.

My first objects had no indexes. When performance degraded, I started to investigate. In SQL Server, there is a feature called „Database Engine Tuning Advisor“. I would consider its suggestions and create indexes and statistics.

Time passed and processing time grew again. At one point I analyzed the indexes I had created and thought: what would happen if I dropped them all? Wow, the processing time decreased from 3.5 hours to 2 hours.

Proper Indexing

My next adventure was about clean data. With my background as an accountant, I want to be sure that my numbers are always correct. My nightmare is when data is not correct and numbers, especially financials, don't match.

Sometimes my Dimensions, Facts or Stage tables would create cartesian products (duplicates). The solution was to make sure that I would never ever get duplicates. I started to use UNIQUE indexes.

The confidence in my data grew, because any error would alert me.

The downside was that when something broke, part of my pipeline stopped at that stage. If the processing time is 6 to 8 hours and users want data at 9, what happens if a process breaks during the night?

We set up a shift plan in our team, so that the first thing after getting up was to check the emails on the phone to see whether everything had run smoothly. This saved time until the users got their data. Not a satisfying solution, but with indexes breaking the load, what else could I do?

Lessons learned: Indexes are great for data quality, but they can break loading processes.

Exploring Cloud Databases

At a fast-growing startup, the available data explodes. We got into the habit of regularly replacing our SQL Server to accommodate our increasing data needs.

We quickly figured out that constantly increasing the size of our server is not sustainable. At that time Hadoop with MapReduce made its appearance. Just add commodity hardware. No expensive high-performance single server anymore.

I'm a SQL guy. The thought of somehow replacing all my code with MapReduce was traumatic.

Instead, I started to analyze some cloud offerings such as Redshift and others.

When testing those analytical platforms, I discovered that none of them enforced constraints. No primary keys, no UNIQUE indexes. Even nowadays Azure SQL Data Warehouse does not support them. Their primary goal is to analyze a lot of data fast without any obstacles.

Their answer to my issue was that the application must make sure that there are no duplicates. But BI usually has no application.

For my accountant's heart this was kind of a no-go. How can I make sure that my or my colleagues' programming will always be correct?

Lessons learned: Cloud databases are not always the answer.

BI automation

One way to ensure good results is to make sure that data processing is always done the same way. This is also something Dan Linstedt inspired in me when he showed me the maturity diagram he translated to a data warehouse. At the highest level you would have:

BI automation

Parallel Teams

Rapid Delivery

This was something I wanted to achieve. I can now say that I'm able to produce consistent results for any object I load data into. Give me more developers and all of them will produce great data!

Lessons learned: Using a framework to support data processing application is a big win!

No indexes

A year ago, when I migrated to SQL Server 2016, I discovered a new index type. Actually, it is rather a new table storage type. Instead of a row store, data is now stored in a column store. Think of saving Excel columns A, B, C, etc. instead of rows 1, 2, 3.

Over time, I added those CLUSTERED COLUMNSTORE INDEXES to all my objects (except stage tables). The immediate effect was that I saved a lot of disk space. Usually only 10 to 20% of the former disk space is used now.
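As a sketch (sat_customer is a placeholder name; an existing clustered rowstore index would have to be dropped or replaced first):

CREATE CLUSTERED COLUMNSTORE INDEX ccsi_sat_customer ON sat_customer;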

On top of that I still had my NONCLUSTERED indexes. Over time, by analyzing execution plans, I discovered that SQL Server hardly uses any of them. And when it does, the execution plan might include nasty „Nested Loops“ instead of „Hash Matches“ to join 2 data sets together.

With growing confidence in my BI automation templates, I'm now in the situation that my application makes sure that anybody on my team produces great results, and that I can remove the UNIQUE constraints.

The immediate effect was that our processing speed increased. Why? Data is no longer written to the database log file to check the UNIQUE constraints before inserting, but inserted immediately without any delay. Like any other cloud database.

I'm now roughly on par with the cloud offerings. There is no need to migrate to the cloud because of speed issues. Indexes, especially UNIQUE indexes, are speed blockers. They don't have them, I don't have them. Another benefit was disk space: the indexes used up to 2 times the space of a CLUSTERED COLUMNSTORE table. Saving space and time!

Do I have a safeguard routine? Of course! As the BI automation solution has all the metadata, I regularly query for duplicates.
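Such a safeguard can be as simple as the following sketch (placeholder names):

-- any business key appearing more than once is a duplicate that needs investigation
SELECT customer_code
      ,COUNT(*) AS row_count
FROM   dim_customer
GROUP BY customer_code
HAVING COUNT(*) > 1;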

When I look back, I started with no indexes and ended with no indexes. The difference is that with BI automation I increased data processing quality, and with CLUSTERED COLUMNSTORE a new technology emerged.

From now on I don't need to grab my phone first thing in the morning. The reasons for failures now are usually external dependencies.

Lessons learned: Sleep is a precious good!

Timeline calculation for Satellite views
Sat, 16 Jun 2018 12:24:59 +0000

My last blog post about the universal loading pattern has generated a huge number of visitors to my blog. In recent months I have been working on and refining the model.

Today I would like to talk about the last part of that model: views based on satellites. Satellites are usually built INSERT only. Any logic interpreting the data should be deferred to a view. These are the kinds of views I have in our system:

First: Not used yet but possible, retrieves the first record of a satellites surrogate key

Last: Retrieves the last valid record. Could also be called „current“

Timeline: Generates a timeline for LoadDatetime and EventDatetime. EndDate is 9999-12-31.

Infinity: Generates a timeline for LoadDatetime and EventDatetime. In contrast to the previous timeline the first record in the timeline has StartDate = 0001-01-01.

Especially the last option, „infinity“, has been of great help. For effectivity satellites it was important to get the timeline as wide as possible. But it also helps in other circumstances, like adding purchasing costs to a sales order line. Instead of searching for the first valid purchase cost, which might be dated after the first sale, I can be sure that I always get a price.

At first, I had a configurable timeline where I could „overwrite“ the first StartDate. Over time it became difficult to distinguish timelines with a wide date bandwidth from those without. To make it clearer, we decided to create 2 separate views where needed: a traditional timeline and a wide timeline. „Infinity“ sounds great for the latter.

Let’s see. In our database, satellites can have 2 date columns (of course even more, but I automated those).

LoadDatetime

EventDatetime

Instead of using LoadDatetime as the first value to calculate timelines on, we use own columns to represent the timelines. This has the advantage that I don’t „overwrite“ the LoadDatetime to represent ‚0001-01-01‘. This is also the strategy of WhereScape which is Data Vault compliant.

What are the names used:

LoadDatetime

StartDatetime

EndDatetime

EventDatetime

EffectivityDatetime

ExpiryDatetime

Let’s get to the code. Easy stuff first.

First & Last Value views

One could think that calculating a timeline first and then filtering for EndDatetime = ‚9999-12-31‘ would be a nice solution. Doable, but not fast.

The fastest option I found was to calculate it like that.

SELECT *
FROM satellite
INNER JOIN (
SELECT hub_hash_key AS hub_hash_key
,MAX(dss_load_datetime) AS dss_load_datetime
FROM satellite
GROUP BY hub_hash_key
) AS last_rows
ON last_rows.hub_hash_key = satellite.hub_hash_key
AND last_rows.dss_load_datetime = satellite.dss_load_datetime

Nice, easy and fast. For the first value replace the MAX function with MIN.

Timeline & Infinity views

Calculating only one timeline from a single date is easy. Calculating it with 2 date columns is just a little trickier because of odd but understandable behavior. A bi-temporal view will come sometime in the future, once I grasp it.

Timeline calculation uses a lot of window functions and execution plans are quite bad as it is sorting the data forth and back.

What are the requirements:

dss_load_datetime/dss_event_datetime: Stays as it is

dss_start_datetime/dss_effectivity_datetime: Takes the datetime. For infinity views it should represent the first value of the timeline as ‚0001-01-01‘

dss_end_datetime/dss_expiry_datetime: Takes the next datetime or, if there is none, '9999-12-31'

What will be the PARTITION BY clauses for each timeline:

dss_load_datetime: The hub_hash_key of the satellite

dss_event_datetime: The hub_hash_key or the driving_key of the satellite

What is the driving key? The driving key is useful for effectivity satellites to force a 1:m instead of a m:n relationship. A good blog post which describes it in-depth can be found on Roelant Vos’s Blog.

From my universal loading pattern blog post, you see that I organized all logic in sub-queries. To calculate the timeline, I need 3 sub-queries.

Sorting the whole dataset by dss_load_datetime

Calculate dss_start_datetime and dss_end_datetime

Calculate dss_effectivity_datetime and dss_expiry_datetime

Sorting dataset

I use this function not only in satellites but also for calculating timelines in stage tables. Sometimes a cartesian product happens and produces bad results. In a cartesian product, 2 or more lines look the same. It can happen that for the LoadDatetime timeline the first row is used and for the EventDatetime timeline the second, resulting in both records being returned and violating a unique constraint.

Therefore, I number the whole dataset by dss_load_datetime and use that row number in the following queries.

ROW_NUMBER() OVER (PARTITION BY hub_hash_key ORDER BY dss_load_datetime ASC) AS dss_row_no

Calculate LoadDatetime timeline

For normal timelines just take the dss_load_datetime as dss_start_datetime. For infinity timelines where we want to set the first dss_start_datetime as '0001-01-01', we need to figure out, if this record is the first or not.

CASE WHEN LAG(dss_load_datetime,1,NULL)
OVER (PARTITION BY hub_hash_key ORDER BY dss_row_no ASC) IS NULL
THEN '0001-01-01'
ELSE dss_load_datetime
END

For dss_end_datetime we need the following window function query:

LEAD(DATEADD(ns,-100,dss_load_datetime),1,CAST('9999-12-31' AS DATETIME2(7)))
OVER (PARTITION BY hub_hash_key ORDER BY dss_row_no ASC )

As you can see, we order both queries with the dss_row_no from the previous query. As we need both window functions LEAD and LAG, SQL Server will sort the data back and forth. I haven't found another solution for that yet.

Currently I use a DATEADD function to subtract 1 time tick so that I can use BETWEEN queries. If you don't need that, just remove the DATEADD function.

Calculate EventDatetime timelines

The calculation looks like the previous one, except that we can partition the data by hub_hash_key or driving_key.

In contrast to the LoadDatetime, where we can be sure that we don't get any duplicates per hub_hash_key and dss_load_datetime, we can get duplicates on dss_event_datetime. Therefore, instead of sorting the data by dss_event_datetime alone, we also use dss_load_datetime, represented by dss_row_no, for sorting.

The calculation for the dss_effectivity_datetime looks like the following:

CASE WHEN LAG(dss_event_datetime,1,NULL)
OVER (PARTITION BY driving_key ORDER BY dss_event_datetime ASC,dss_row_no ASC) IS NULL
THEN '0001-01-01'
ELSE dss_event_datetime
END

The calculation for the dss_expiry_datetime looks the same as for the dss_end_datetime.

LEAD(DATEADD(ns,-100,dss_event_datetime),1,'9999-12-31')
OVER (PARTITION BY driving_key ORDER BY dss_event_datetime ASC,dss_row_no ASC)

Again, the DATEADD function can be removed.

You see that we can end up with 4 sort operations, which can have a huge impact on performance. Create 2 separate views, or persist them (or not).

Happy sorting!

Love-hate relationship with WhereScape
Sat, 02 Jun 2018 12:55:30 +0000

Well, I don't know if this is the right way. But I want to share my experience with it with others.

First, I want to say that WhereScape enabled me to reach my goals faster. At least now. It took a year to get to full speed. This is due to the fact that I first had to discover how it works and what is important to me. The first sentence from the sales consultants was: „We can do whatever you want!“

You have to take that literally! Don't underestimate it. It is a blessing that you can shape this tool to your liking. It is a curse that it takes time and has no blueprint attached. Well, WhereScape has its own templates, which are filled with options accumulated over the years from user requests. Either I click through the whole process over and over again, or I develop my own templates to remove the manual work.

Object creation

It is a dream to design transfers. Create the first load object and then drag the objects or the columns from one step to the next. E.g. Load → Stage → Data Store → Stage → Satellite → Stage → Dimension → Stage → Export. So easy with drag & drop. From one or multiple source tables. And then just build the right JOIN statements and optional WHERE and GROUP BY clauses.

The objects are created with the right data types with a mouse click. Stored procedures are created without further input. Indexes are added automatically if needed. I just have to worry about the business logic implementation.

But getting there has taken us a long time. And it has some limitations.

32 bit vs 64 bit

The application still runs in 32 bit. For a metadata management platform this is OK. It gets nasty when it comes to execution.

My loading works with SSIS, ODBC or PowerShell. For ODBC I have to use a 32-bit ODBC driver. Calculate for yourself how much memory I can use for a data transfer from a database. Depending on row size, crashes happen at around 50 to 100 million records.

PowerShell is another beast. Windows has 3 versions of it: the 32-bit version in the 64-bit Windows directory and vice versa, and then there is a third version as well. Figuring out which one works in combination with WhereScape for a given job is tiresome. And if you need to extend PowerShell with more functionality, most modules are 64-bit. So I have to start a 32-bit PowerShell, switch to a 64-bit version and then start executing.

Object name length

This limitation is really bad! RED has a limit of 64 characters for any object like tables, columns, indexes, stored procedures, etc. Only „parameters“ (a kind of global variable) have a length of 256 characters, but the system can address only 64 of them?!?

The 64-character limit applies to all objects. So if I create a table name with 64 characters, I still get errors because other object names are derived by appending to that string.

Stored Procedure: update_{object name}

Index: {object name}_idx_BK

With this calculation I have only 57 characters left.

In 3D there is a limit at 1024 characters. Enough space to evolve.

I hate abbreviated object names. Nobody except me is going to understand them. If you think of something like stage_{satellite}_{source system}_{source table}_{calculation step}, you can see that I might run out of space easily. And when I need to compose a multi-source mapping and add src_{hub}_ to the front, the space runs out even faster.

3D to RED deployment

As described, there is a discrepancy of 1024 vs. 64 characters for object names. When bringing objects back to RED, object names are cut off at 64 characters. For custom indexes the name is cut at 40 characters already. I don't know why.

So I need to be careful about object naming inside 3D. 3D is a nice program for designing. But if I have to watch object name length for later usage in RED, it gets cumbersome.

My aim is currently to get a rough idea in 3D and then do the transfers in RED directly. With my customized templates this is as fast as working in 3D.

Object defaults

In 3D, I can set default column names with default data types. This is not possible in RED. E.g. I improved performance by switching from CHAR(32) to BINARY(16). Or, all of my date system columns should be datetime2(7). Not possible to set as a default. Unfortunately, until recently I had to modify them manually or use a QA query to regularly fix the metadata in the database. When I started writing this blog entry, I thought: why not use triggers for that?!? Well, after such a long time without it, I now have that in place.

In my opinion, users shouldn’t interfere with the application database directly. This is really really bad behavior, but unfortunately necessary.

Pebble

Pebble is the templating language WhereScape introduced. The intention, giving the user the flexibility to develop their own transfers, was right. It has taken me a year to develop fail-proof transfers that map industry standards. It is a trade-off, swapping manual work for development time. But it is satisfying when I gain more insight into how things should work.

With successful usage, the appetite grows to incorporate more manual work into my templates. Then the problems start to appear. Pebble doesn't scale with complexity. Macros suddenly stop running, variables get forgotten while compiling. If you don't check each and every creation, suddenly a bad stored procedure is there. Very unstable. Then time is wasted finding alternative ways to accomplish the same output. Another developer would question my bad design. But it has to work and I can't wait!

For small and easy templates without logic, Pebble runs well.

Object Types

I like the split for every type of table I need: Stage, Data Store, Hub, Link, Satellite, Dimension, Fact, Export, etc. I love the concept of having a load object for getting outside data into the data warehouse. This is far more logical than having a stage table directly and load that content to the target.

Stage is my workhorse. It is there to prepare the data from target to target. Load is also a target.

7 object types = 7 templates to rule them all. But bringing logic inside it was not easy. A long and steep learning curve.

Stage Table and Stage View

All target objects like Stage, Data Store, Hub, Satellite are tables. And there is an additional object type called View. This is the only view.

But when developing a transfer, I don't want to recreate the object as a view just to find out whether it runs faster. If I have a stage table, I want to be able to switch between a view and a table.

Objects are grouped for each target type. So switching between the stage menu node and view menu node is cumbersome.

There is an option to create a DDL statement. I developed a template for that. But once a view, it stays a view. I can’t switch back to a table without going manually to the database and dropping the view.

Data Vault

WhereScape is proud of their Data Vault skills. But it has taken them a very long time to get it right. Adding columns to a HashKey or ChangeHash is a dream!

The defaults run OK. But they forgot that hubs are loaded from multiple stage tables, and until recently I had to work around that by creating fake objects and using fancy templates. And in my opinion Links are not built correctly: the unique constraint on the Hub Hash Keys is not enforced.

Advanced forms don’t work:

Multi-Active Satellites

Links with peg-legged Hubs (e.g. sales order item id for which I don’t want to form a hub)

Point-in-time (no template)

Bridge (no template)

Meta Data

The application collects all the meta data necessary to implement an object. It is really an open application with nearly total freedom. You are absolutely free to implement bugs and stuff which doesn’t work.

At first, every team member evolved their own development style. Through discussions and documentation we agreed on some principles and layout of our objects. But things are getting forgotten or are not getting checked.

The need arose to write some QA queries to check for issues. This proved to be very unreliable as there are too many bits and pieces to configure for each object.

This is the downside of having total freedom. Other programs which are guiding you through the development steps might be easier and more consistent to work with.

Jobs

When developing, I create tables and stored procedures. Through lineage I see which objects depend on others. This is really a great feature.

When adding a job based on my work, I click the needed objects together. That is where it should end. But… I have to group them by execution layer and set the dependencies manually?!? By default it maps EVERY object from one layer to EVERY object in the next layer. Then I need to delete the obsolete mappings. I mean, the system knows the dependencies of each object. Why the hell do I need to teach the program again?!? I see the need for a manual override. But the first draft should be done automatically.

Another issue is that I can set the dependencies on another job. This is great and runs well. But sometimes, especially when developing, I want to force-start such a job. If the job it depends on didn't run, this job will not run either. Either I delete the dependency or I add another copy of the same job without a dependency. We currently do the latter.

Support

The support is based in New Zealand. If I have an issue which I want to show somebody, I have to schedule a slot here in Europe between the evening and midnight.

Usually this is not necessary. The help provided is fast. For bugs and feature requests, you receive a ticket number and then the issue is forgotten. Sometimes you can tell from the low number that it has been known for years. The development focus is on delivering more fancy features rather than ironing out old bugs.

In every release I get newly introduced bugs in stuff which worked before. Then I get the information that they will fix it in the next release. Release cycles are roughly 9 to 12 months.

Support for Pebble doesn't exist. Somewhat understandable, as this is an area that would attract a lot of user questions and bind resources. And I believe they have already experienced themselves that Pebble is unpredictable and don't want to waste time on it. Or maybe we are too small and they would rather make the big players happy. I don't know.

Implementation

I envy the companies which are able to implement WhereScape in one week. The WhereScape success stories are full of them. Maybe they sent in a full team of success engineers to get them up to full speed, or the customers did it themselves.

Our success engineer was great, helping us here and there with the application's default functionality. Maybe our „We can do whatever you want“ was too complicated. I can't imagine how others do it. I mean, I have only 7 object types. Loading e.g. a Fact or a Data Store should be the same everywhere. My aim was to get standards implemented accurately, automatically and without too much manual work. And this has taken us a lot of time!

Final Words

I love WhereScape for its flexibility. I hate it because I have to (re-)invent everything. Its defaults are too „free“ and only basic. I love the drag & drop of columns. I love it for creating tables, views and stored procedures instantly. I can't do without it anymore. I hate it for its limitations and bugs. I hate/love the job engine. I love it for the speed of deploying objects to production. I hate Pebble.

I love/hate WhereScape!

]]>https://www.zemp.ch/2018/06/02/love-hate-relationship-with-wherescape/feed/01140https://www.zemp.ch/2018/06/02/love-hate-relationship-with-wherescape/Universal Loading Patternhttp://feedproxy.google.com/~r/ecki/blog/~3/xo-BgRJhC84/
https://www.zemp.ch/2018/04/07/universal-loading-pattern/#respondSat, 07 Apr 2018 11:07:57 +0000https://www.zemp.ch/?p=1017While working on the templates I published recently, I came across similar looking pattern between object types. E.g. the persisted history of a satellite looks the same as in a persistent staging area (except of the primary key of course). Or loading a dimension is like loading a hub (just that hubs have no change hash). So, a dimension loading pattern would nearly work for hubs.

I concluded that instead of having 7 templates (for each object type) 2 would be enough. One for stage and one for any target. They have all similar patterns.

While compiling all the patterns, I came to a solution that even 1 template would be enough. A universal loading pattern. Usable for any target type!

As we are loading data with stored procedures generated with WhereScape, see the following captions as sub-queries. In that way we add value in each layer to the result set. Also for maintaining template code this is much easier. SQL Server will find the best way to execute them.

I separated them into stage and target loading for easier reading. But in reality, it is just one big data flow.

Stage Loading Process

Extract & Transform

The only user-generated logic is defined here: a simple SELECT statement and some additional metadata about what to save on top, e.g. hash keys, change hashes or dimension keys. Everything afterwards is automated data preparation.

Delta Loading

Target Object Name is defined

A stage table can optionally have a link to its target. With that in place, delta loading is possible and adds a corresponding WHERE clause.
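The generated clause itself is not preserved in this export; a typical delta filter of this kind might look roughly like the following (placeholder names):

WHERE  source.dss_load_datetime > (
    SELECT ISNULL(MAX(target.dss_load_datetime), '0001-01-01')
    FROM   sat_customer AS target
)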

CAST to Target Data Type

Has Hash Columns

Hashing is a beast of its own. It looks trivial, but can still be tricky. Just recently we figured out that we needed to add another step in between: casting to the target data type. E.g. the source column is of data type datetime and the target column is date. Of course, nobody does that ;-)! Hashing without casting would mean hashing e.g. ‚2018-03-08 12:13:14‘ instead of just the date ‚2018-03-08‘. Any rehashing later will fail. Of course, it should have been taken care of in the transformation, but other examples can be found. This step just turns the invisible implicit conversion into a visible explicit conversion before the hashing step.
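A small sketch of that explicit cast before hashing (the MD5 choice and the names are illustrative, not the template's exact code):

SELECT HASHBYTES('MD5',
           CAST(CAST(order_datetime AS DATE) AS NVARCHAR(MAX))  -- cast to the target data type first
       ) AS dss_change_hash
FROM   stage_sales_order;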

Another way to model it is to create a temporary table with the same data types as the stage table and persist the result before going into the hashing function. This is faster in delta loading anyway, because SQL Server usually comes up with odd execution plans otherwise.

Clean Business Key for Hash Keys (other Business Keys are not touched)

Has Hash Columns

There are many options out there. I decided to go with the following:

CHAR Data Types

UPPER(LTRIM(RTRIM(ISNULL(NULLIF(source_column,''),'-1'))))

INT Data Types

ISNULL(source_column,-1)

If a CHAR column is empty I set it to NULL and every NULL column is set to -1. Then remove any spaces and set it to upper case.

If an INT column is NULL, set it to -1. 0 (zero) is a valid value; no need to treat it as NULL.

Casting everything to NVARCHAR(MAX) is the easiest approach. NVARCHAR has the best implicit conversion from any data type to string. Don't ever cast to VARCHAR(MAX) only. As long as there is no true UTF-8 character set (with variable byte length) for VARCHAR, stick with UTF-16 in NVARCHAR. The results look fine. Only DATETIME and one other type (which I can't remember) look ugly, but as I only use DATETIME2 I don't run into that.

Dimension Lookup

Has Dimension Columns

Just looking up the surrogate key.

Row Condensing

Business Key is defined

Removing true duplicates. Be aware. REALLY! Wait until you actually hit a duplicate and talk to the owner of your source system first! Only implement it if he can't fix it.
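If condensing turns out to be needed, a sketch could look like this (placeholder names; the ordering column is whatever makes the „last“ record well defined):

SELECT *
FROM (
    SELECT src.*
          ,ROW_NUMBER() OVER (PARTITION BY customer_code
                              ORDER BY dss_event_datetime DESC) AS dss_condense_no
    FROM   extract_result AS src
) AS condensed
WHERE  dss_condense_no = 1;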

Load

Preparing everything from the previous steps and sort the columns in the right order for the INSERT statement.

Post-Process Update Dimension Key

Has Dimension Columns

If a business key in the stage table for a fact object doesn't exist in the dimension, we can add it, if that is the desired outcome.

After INSERTing the missing business key into the dimension, we need to update the stage table.

Target Loading Process

The target loading can be as simple as having a short INSERT statement.

Which differences are in the target object types:

Persist timeline calculation? (e.g. having EndDate in a table or calculate it through views)

Update existing or insert as new row ?(e.g. Dimension vs Satellite)

Is it allowed to insert „in-between“ data? (e.g. Satellite)

Row Condensing

Here the real magic happens, and every target type is different. Now is the right time to figure out whether we need to persist all the rows.

Stage: We need only the last record per Business Key (if one is specified).

Data Store: We need only changes in rows, sorted by LoadDate.

Hub: We need only the first record per Hub Hash Key.

Link: We need only the first record per Link Hash Key.

Satellite: We need only changes in rows, sorted by LoadDate and by EventDate.

Dimension: Dimensions without SCD should already be unique in stage. No condensing.

Fact: Should already be unique by Business Key in stage. No condensing.

Compare last saved row with first new row

Has StartDate Column

An additional step is to compare the last saved row with the first new row. This happens only in tables with a timeline column, usually Satellites or History.

The result after this step is that the data set is cleaned, useful and ready to be inserted into the target, at least from a data point of view.

History Columns

Has History Columns

The history columns can be deferred to a view. Then we have an INSERT-only system, which can be a huge performance gain. The question is whether we can wait for the calculation of the timeline in the view to finish. If not, we should persist the history columns.

Column names for the first timeline. The calculation is based on the LoadDate.

dss_start_datetime

dss_end_datetime

dss_version

dss_current_flag

Column names for the second timeline: The calculation is based on the EventDate.

dss_effectivity_datetime

dss_expiry_datetime

For templates, not all columns are needed. Just add those which you need.

I know, this process fails if we are talking about bi-temporal objects. I need to learn first, how to build and use them.

Load

This step brings all the data from the previous steps together and adds the columns, which were missing in the extract layer.

Where (Not) Exists

This step was the most difficult one. Not because of detecting what exists and what doesn't, but because of where to place it.

If I put it after the extract layer, a full load would be impossible: all existing records are removed from the data set and the history calculation is therefore rubbish. There are problems in delta loads, too. What about loading data in between LoadDates? The whole row condensing step would produce wrong results.

What about placing it after the row condensing layer? That seems the best place to me. I could save time calculating the history layer, as I'm not going to update an existing row anyway. But I would need to UPDATE the timeline after the INSERT. A single UPDATE costs me 10 times more than a proper INSERT.

And where do we have history calculations anyway? Only in satellites and the data store. Therefore, it can be placed as late as possible.

What is new and what already exists? These are the rules for detecting it:

Data Store: Business Key, Load Date and Sequence No

Hub: Hub Hash Key and Business Key (the unique constraint will tell if there is a hash collision)

Link: Link Hash Keys, Hub Hash Keys and peg-legged Business Keys (the unique constraint will tell if there is a hash collision)

Satellite: Hub Hash Key or Link Hash Key, plus LoadDate

Dimension: Business Key

Fact: Business Key

Which objects could have an UPDATE statement? Maybe not only in delta but also in full load?

Data Store: No, there can't be any changes in the data. Otherwise the LoadDate is used wrongly.

Hub: No. If it is of interest when a business key became available, we can update the LoadDate if there is an earlier one.

Link: No. If it is of interest when a business key became available, we can update the LoadDate if there is an earlier one.

Satellite: No. If by accident there are different Change Hashes for a record which already exists, the whole calculation in stage must have changed. Then the whole satellite should be reloaded instead.

Dimension: Yes!

Fact: Yes!

For Facts DELETEing the record could be a faster option than UPDATEing it.

Everything else is going to be INSERTed.

Post-Process Update Timeline

Has History Columns for a timeline (StartDate only is not a timeline)

In that step we are going to update the timelines of our target. By joining to the stage table, we only need to update the changed timelines per Business Key/Driving Key.
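A rough sketch of such a post-process update, assuming a satellite sat_customer and its stage table (not the generated code itself):

UPDATE sat
SET    sat.dss_end_datetime = timeline.dss_end_datetime
FROM   sat_customer AS sat
INNER JOIN (
    SELECT hub_hash_key
          ,dss_load_datetime
          ,LEAD(DATEADD(ns,-100,dss_load_datetime),1,CAST('9999-12-31' AS DATETIME2(7)))
               OVER (PARTITION BY hub_hash_key ORDER BY dss_load_datetime ASC) AS dss_end_datetime
    FROM   sat_customer
) AS timeline
        ON  timeline.hub_hash_key      = sat.hub_hash_key
        AND timeline.dss_load_datetime = sat.dss_load_datetime
WHERE  EXISTS (SELECT 1 FROM stage_sat_customer AS stage   -- only keys touched by this batch
               WHERE stage.hub_hash_key = sat.hub_hash_key)
  AND  sat.dss_end_datetime <> timeline.dss_end_datetime;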

Initially we also had a Pre-Process Update step in mind, where we would end-date old data. This works if we run the INSERT batch by batch. But as we don't know whether there will be any inserts at all after the „Row Condensing“ layer, we must omit that pre-processing step.

Conclusion

The two parts above, stage and target loading, are not actually separated in my data flow. It is just one huge SELECT statement incorporating every rule. Of course, empty layers are not shown.

Now I can just throw data for any target type at it. By designing it in layers, I'm flexible to add new rules when I discover them. I'm also able to materialize each layer in a separate temporary table to speed up processing, if I figure out that this would be faster.

Writing down my thinking gave clarity to the desired outcome. If you want to discuss this model or share your experience with me, please get in touch.

]]>https://www.zemp.ch/2018/04/07/universal-loading-pattern/feed/01017https://www.zemp.ch/2018/04/07/universal-loading-pattern/Pebble Template Release 1.0 for SQL Serverhttp://feedproxy.google.com/~r/ecki/blog/~3/_jtOMby7Zi0/
https://www.zemp.ch/2018/03/03/pebble-template-release-1-0-for-sql-server/#respondSat, 03 Mar 2018 12:45:52 +0000https://www.zemp.ch/?p=990In all my previous blog posts you would have had to go to my GitHub repository, loading each file to your environment and trying to get it up and running.

Maybe it is time now, to package them all up into an application, which is ready to be loaded into your environment.

You get templates for the following WhereScape objects

Stage

Stage Merge

History (Data Store)

Hub

Link

Satellite

Dimension

Fact

which you can create as

TABLE

VIEW

Included is also an extensive list of meta data checks. Modify it to your liking.

It is hopefully also a good resource if you want to learn writing templates for yourself.

Before using, please read the LICENSE and the README.

After that, your first stop is the template nu3_sqlserver_utility_configuration. I moved all „hard-coded“ values in there for modification. Just go through it and adapt it to your environment.

If you like it and the templates are helping you, visit the code sponsors website at http://www.nu3.com/ and buy some products.

]]>https://www.zemp.ch/2018/03/03/pebble-template-release-1-0-for-sql-server/feed/0990https://www.zemp.ch/2018/03/03/pebble-template-release-1-0-for-sql-server/Virtual Data Warehousehttp://feedproxy.google.com/~r/ecki/blog/~3/KO1-XDH0U6c/
https://www.zemp.ch/2018/01/27/virtual-data-warehouse/#respondSat, 27 Jan 2018 10:37:37 +0000https://www.zemp.ch/?p=986So far, we have been discussing the creation of stored procedures to pump data into our hubs, satellites, facts and all the other objects.

In an ideal world, I would specify my objects and be done. All data would fall into place automatically. Unfortunately, that doesn't happen too often.

Currently we create tables and, according to the template, produce stored procedures to load the data. Then we create jobs to process them. We need to check that they run in the right order, that we didn't forget a task and that we maintain the dependencies. A lot of auxiliary work.

To remove all this auxiliary work, we would need to specify our objects as views instead of tables. So far WhereScape doesn't offer this choice, not even for stage objects. Would that really be achievable?

In our templates, I have written a lot of code and tried to give it some „intelligence“. What can I reuse to create views based on them? The SELECT part of the INSERT statement. No UPDATEs and no delta loading. This pure SELECT statement I placed in a block template which can be reused both in a procedure and in a DDL statement.

Unfortunately, there is a little inconvenience. WhereScape currently warns when we want to recreate an object whose DDL creates a VIEW. You need to delete it manually from the database and recreate it with WhereScape.

Hopefully you have a machine which is powerful enough to work with views only, based on the history (Data Store) layer, a.k.a. Persistent Staging Area.

In my case we have too much data to work with views only. What I do is use stage views as much as possible, especially in multi-step stage preparation. No need to fiddle with creating nested jobs for them. For loading Business Vault satellites I use stage views only. For the Raw Vault we need a stage table to load hubs, links and satellites in parallel. But for all the cleaning and calculation where the only target is another satellite, there is no need to create a table. A view is faster.

I have now reached my main goal of having all objects created either as tables or as views. If you liked my contribution so far, please leave a message.

]]>https://www.zemp.ch/2018/01/27/virtual-data-warehouse/feed/0986https://www.zemp.ch/2018/01/27/virtual-data-warehouse/How to ensure development quality in WhereScapehttp://feedproxy.google.com/~r/ecki/blog/~3/2b-u4-9mcaQ/
https://www.zemp.ch/2018/01/10/how-to-ensure-development-quality-in-wherescape/#commentsWed, 10 Jan 2018 19:22:55 +0000https://www.zemp.ch/?p=965With WhereScape you have the freedom to do whatever you want. With time passing, I figured out, what is important to me and what not.

But how to ensure that things are done always the same way? And not only for me but also for my collegues?

We started to create meta views for objects, columns and indexes. On top of them we created „error views“ to list all issues which were important to us: e.g. all tables and indexes should be compressed, data type for hash columns should be binary, etc.

For some rules we implemented a metadata cleaning service which constantly scanned for issues and fixed them straight away.

But some issues still only showed up in views which we needed to check regularly.

Lately, when I gave some thought to this, I came to the conclusion that this is not an efficient way to do it. When we develop, I want to get feedback straight away, not by checking my error views for issues once a week or once a day.

It is unrealistic to believe that we can persuade WhereScape to change their software to our way of thinking. There are too many different customer opinions in the market.

How can we get feedback on our work? Did we forget something? Do the objects have the correct key types? We can't change the way the system works, but maybe we can pass some error messages into our output. And what is our main output? Tables with their corresponding stored procedures.

I created another section in our stored procedures where we list all metadata checks. If a check is OK, I add the message as a comment. If there is an error, the message is written without the comment dashes. That way the compilation of the stored procedure fails and we can check the messages!
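To illustrate the idea (the messages are invented, not the template's real output), the generated section might read:

-- metadata check OK: hub h_customer has exactly one hub hash key
-- metadata check OK: all hash columns are of data type binary
metadata check FAILED: satellite sat_customer has no change hash column

The last line has no comment dashes and is not valid T-SQL, so the CREATE PROCEDURE statement fails to compile and the message surfaces.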

We created 2 templates:

nu3_utility_metadata (for any platform-independent metadata checks)

nu3_sqlserver_utility_metadata (for any SQL Server checks)

What could be platform-independent checks:

A hub needs one hub hash key

A link needs one link hash key

A satellite needs either hub or link hash key (and only 1)

A satellite needs a change hash key

If the stage table has hash keys, make sure that all columns belong to a hash key definition

In a satellite, verify that all columns which are hashed in the change hash key on the stage table are also available in the satellite

…

SQL Server checks could be:

All hash columns should be of data type ‚binary‘

Objects of type DataStore, Satellite, Link, Hub should have a Clustered Columnstore Index

Any index of stage tables should have the feature ‚Drop Index before Table Update‘ enabled

…

These are just some ideas.

I believe that including such things in the template abuses the original idea of templating. But it helps to keep the metadata quality straight.

Template: Fact

First, do I set business keys or not? If I set business keys, the whole loading procedure is based on a cursor which updates or inserts 1 record at a time. Not good if I'm going to load 100 million records. That would take ages.

Nowadays more and more people like clustered columnstore indexes. Cursors don’t work with them.

Ok, no business keys. Then I'm able to tick the „Set Based Insert“ option. But with incremental loading I would need to delete or update changed records. So, I tick „Include Delete Before Insert“. Now I'm able to select a column which matches between the stage and fact table. But only 1 column! This creates a corresponding WHERE clause.
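The generated clause is not preserved in this export; with a single matching column the effect is roughly the following (placeholder names):

DELETE FROM fact_sales
WHERE  EXISTS (
    SELECT 1
    FROM   stage_fact_sales AS stage
    WHERE  stage.order_id = fact_sales.order_id   -- the one column that matches stage and fact
);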

What if I have multiple columns as a business key? And by the way, hadn't we already decided in the previous step that we don't have business keys?!?

This was not satisfactory to work with.

What are the requirements:

Stage

In the stage table, we prepare the data set to be loaded. If we have dimensions with surrogate keys, we want to add a dim_key column to the stage table. This looks up the business key in the dimension table and returns the surrogate key. For business keys which don't exist, we want to add the unknown key.
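A sketch of that lookup (placeholder names; 0 stands for the unknown key):

SELECT stage.order_id
      ,stage.customer_code
      ,ISNULL(dim.dim_customer_key, 0) AS dim_customer_key   -- fall back to the unknown key
FROM   stage_fact_sales AS stage
LEFT JOIN dim_customer AS dim
       ON dim.customer_code = stage.customer_code;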

Maybe we have late-arriving dimensions, or we are loading dimensions and facts in parallel. In that case, we don't want the unknown key, but rather to add the missing business keys to the dimension table. For some dimensions, however, I don't want missing keys to be created (e.g. the date dimension or dimensions with a fixed set of values).

Fact

Replace or insert data as fast as possible. Either use the dss_load_datetime or the dss_change_hash for delta detection.

Stored Procedure

Stage

There is one thing I want to point out from the standard stage template. How can we add missing business keys to a dimension?

WhereScape by default has another stored procedure to look up dimension keys and update them in the stage table. Again with a cursor, one record at a time. Unfortunately, we can’t create templates for that stored procedure.

At some point I figured out that I could add another section to the stage template which inserts missing business keys into the dimensions. A simple and really fast solution.

I just INSERT a DISTINCT list of business keys for a dimension followed by an UPDATE statement. Both are set based. No cursor. Incredibly fast!
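A sketch of that section with invented names (the real code is generated by the stage template; @v_current_datetime is the procedure’s batch timestamp variable):

-- 1) Add every business key from the stage table that is missing in the dimension
INSERT INTO [TABLEOWNER].[dim_customer] ( customer_code, dss_create_datetime, dss_update_datetime )
SELECT DISTINCT
  stage_fact_sales.customer_code
 ,@v_current_datetime
 ,@v_current_datetime
FROM [TABLEOWNER].[stage_fact_sales] stage_fact_sales
WHERE NOT EXISTS (
    SELECT 1
    FROM [TABLEOWNER].[dim_customer] dim_customer
    WHERE dim_customer.customer_code = stage_fact_sales.customer_code
)

-- 2) Look up the surrogate key, again set based
UPDATE stage_fact_sales
SET dim_customer_key = dim_customer.dim_customer_key
FROM [TABLEOWNER].[stage_fact_sales] stage_fact_sales
INNER JOIN [TABLEOWNER].[dim_customer] dim_customer
ON dim_customer.customer_code = stage_fact_sales.customer_code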

Fact

Delete existing records/Update changed records

If you have traditional tables, the template will create an UPDATE statement. If you are working with clustered columnstore indexed tables, UPDATE performs really badly; DELETE plus INSERT is much faster, because a DELETE only marks the record as deleted instead of replacing all the values. Please read the following blog entry for more background information.

Select Extract from Stage

Load the data from the stage table.

Where Not Exists

Load data only if the fact record doesn’t exist yet. Records that existed have already been updated or deleted.
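Put together, a rough sketch of the fact load (table and column names are invented; the real template uses the columns you configure):

-- Delete facts that arrived again with a different change hash
DELETE fact_sales
FROM [TABLEOWNER].[fact_sales] fact_sales
WHERE EXISTS (
    SELECT 1
    FROM [TABLEOWNER].[stage_fact_sales] stage_fact_sales
    WHERE stage_fact_sales.order_id = fact_sales.order_id
    AND stage_fact_sales.dss_change_hash <> fact_sales.dss_change_hash
)

-- Insert only facts that don't exist (anymore)
INSERT INTO [TABLEOWNER].[fact_sales] ( order_id, dim_customer_key, total_price, dss_change_hash, dss_load_datetime )
SELECT
  stage_fact_sales.order_id
 ,stage_fact_sales.dim_customer_key
 ,stage_fact_sales.total_price
 ,stage_fact_sales.dss_change_hash
 ,stage_fact_sales.dss_load_datetime
FROM [TABLEOWNER].[stage_fact_sales] stage_fact_sales
WHERE NOT EXISTS (
    SELECT 1
    FROM [TABLEOWNER].[fact_sales] fact_sales
    WHERE fact_sales.order_id = stage_fact_sales.order_id
)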

Implementation

Download from the template section the following templates for WhereScape written in Pebble:

nu3_sqlserver_procedure_fact

nu3_sqlserver_block_fact

nu3_sqlserver_procedure_stage (updated)

Select the template in the properties.

For any dimension for which we don’t want missing business keys automatically created, set the extended property insert_missing_bk = false on the dimension (see the dimension template blog post).

That’s it.

Template: Dimension
https://www.zemp.ch/2018/01/06/template-dimension/ (Sat, 06 Jan 2018 14:39:44 +0000)

Loading data into a dimension is straightforward: prepare the data in a stage table and load it into the dimension. The interesting part is how to define and load the dimension itself.

What are the requirements:

Stage

Nothing special. Just make sure that the business key is unique.

Dimensions

It has taken me some time to figure out all the rules for building a dimension fast and properly.

The primary key of a dimension is the surrogate key. It is usually an integer identity column counting upwards from 1.

But when working with it, I found a lot of exceptions. Especially around zero keys. What is the zero key? It is the record used when no value is defined (aka NULL) in the fact table.

I got confused because the feature is called zero key, which implies that this record always has the value 0. But in Data Vault we define missing values as -1. Should I also use -1 as the unknown value here? I decided to go with 0, because I might need to revert to WhereScape’s standard functionality (SCD handling and so on), and WhereScape defines it as 0.

But then I hit dimensions where 0 couldn’t be the zero key. Then came the argument that we shouldn’t give meaning to the surrogate key. But people do: the key in the date dimension, for example, is usually in the form yyyymmdd. And then I read that the zero key of the date dimension should be a date in the future, because otherwise the unknown records would sit in the same partition as the oldest data.

This leads to the following requirements:

I want to control the generated ID, i.e. map a column from the stage table to the dimension key

I want to be able to set the zero key to a user-defined value

I want to be able to suppress the generated zero key

I want to be able to suppress missing business keys being inserted by the fact load

For that to work I used the recently introduced feature of extended properties. I have the following extended properties defined on the dimension:

zero_key_id

Optional: Define another zero key value

insert_zero_key

Optional: Set it to false if no zero key should be added.

insert_missing_bk

Optional: Set it to false if no missing business keys from a fact table should be added

Stored Procedure

The following steps are defined:

Insert Zero Key
    SELECT Result
    SUB-SELECT Where Not Exists
    SUB-SELECT Zero Key Record
Update existing records
Insert new records
    INSERT INTO Dimension
    SELECT Result
    SUB-SELECT Where Not Exists
    SUB-SELECT Extract from Stage Table

Insert Zero Key

As already mentioned, I had some difficulties understanding how to set up the zero key properly.

By default, WhereScape populates the artificial key, the business key, dss_create_datetime and dss_update_datetime. Any other columns, for which you can define zero key values, are not considered; they are just NULL.

But I wanted to add dss_load_datetime as a system column. This is not supported.

For our cube, I don’t want NULL values. I came up with a view template which replaces NULL values in a dimension column with the zero key value specified. For normal SQL queries on the table itself, NULL will still be shown.
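A sketch of such a view (column names and zero key values are invented):

CREATE VIEW dim_customer_cube AS
SELECT
  dim_customer_key
 ,customer_code
 ,ISNULL(customer_name, 'Unknown') AS customer_name   -- zero key value from the extended properties
 ,ISNULL(country_code, '-') AS country_code
FROM [TABLEOWNER].[dim_customer]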

Update existing records

The existing solution was just too slow. Comparing all columns for a change takes ages once the dimension table goes into the millions of rows and has tons of columns.

We already learnt when loading data to a history table that comparing only one column is much faster. The option here is to define a dss_change_hash. This speeds up the update process compared to the default WhereScape behavior.

But there is something faster still. Calculating the dss_change_hash also takes some processing time. What can we replace it with?

Until now we have spent a lot of energy pumping dss_load_datetime throughout the system. The idea now is to update a record only if its dss_load_datetime has changed. If it has changed, we just update it; there is no need to check whether something REALLY changed, which would kill the performance again.

How can we make sure that there is always the most recent dss_load_datetime, especially when we are joining data from dozens of satellites? I modified the stage template slightly to calculate the most recent datetime by adding a transformation on dss_load_datetime.
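As a sketch (the satellite aliases are invented), the transformation on dss_load_datetime can take the most recent value across all joined satellites like this:

( SELECT MAX(load_dt)
  FROM ( VALUES (sat_customer_details.dss_load_datetime)
              , (sat_customer_address.dss_load_datetime) ) AS all_sats(load_dt)
)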

Implementation

Just download from the template section the following templates for WhereScape written in Pebble:

nu3_sqlserver_procedure_dim

nu3_sqlserver_block_dim

Select it from the properties. That’s it.

Template: Hub/Link
https://www.zemp.ch/2017/12/17/template-hub-link/ (Sun, 17 Dec 2017 13:36:49 +0000)

When writing about satellite loading, we also need to talk about hub and link loading. They happen at the same time.

Before Data Vault 2.0, a hub or link surrogate key was an integer. Nowadays modern data vaulters use hash keys instead, to allow parallel loading. Others are going back to the roots and use natural business keys. I dropped that idea because there was a reason to define something else in the first place; in our case it is the handling of strings. If I want to be sure that something is unique, I would have to specify a binary collation for the business key column. Too much work, and I wanted a properly working data warehouse first before going into experimental work with all the obstacles involved.

The following picture shows the data flow for all Data Vault objects:

From the stage table, we load data to satellites, hubs and links. As you can imagine, there might be multiple satellites for one hub. Unfortunately, WhereScape doesn’t allow many-to-one mapping. What exists is a kind of merge stage table.

This leaves us with the following data pipeline:

From the stage table (which also loads data to the satellite) we create a stage view with the columns as they exist in the hub. Then we add a stage merge to combine multiple views into one. The hub WOULD then load data from the stage merge (but doesn’t).

This pipeline is for lineage only. We can track columns and tables back and forth without breaking the lineage.

What we do is to „recycle“ the stored procedure which loads data from the merge stage table to the hub/link and replace the source table with the stage view.

In that way, we can load anything in parallel.

What are the requirements:

Stage

For the hub or link we need to create a hash key. The hash key for a hub can be calculated from multiple business keys, and the link hash key from multiple hub business keys. I also add business keys to the link which have no hub of their own, a so-called peg-legged hub.
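As a rough sketch (columns are invented; the cleaning rules are described in the stage template post), a hub hash key could be calculated like this:

CAST(
  HASHBYTES('md5',
      UPPER(LTRIM(RTRIM(ISNULL(NULLIF(CAST(customer_code AS NVARCHAR(MAX)),N''),N'-1'))))
    + N'||'
    + UPPER(LTRIM(RTRIM(ISNULL(NULLIF(CAST(shop_code AS NVARCHAR(MAX)),N''),N'-1'))))
  )
AS BINARY(16)) AS hub_customer_key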

That’s all we need from the stage table.

Hub

The primary key of the hub is the hash key. The business keys form a unique constraint.

Link

The primary key of the link is the hash key. The hub hash keys and any added peg-legged hub columns should form a unique constraint. Unfortunately, WhereScape doesn’t support that yet. This is caused by the key type: usually the key type ‚A‘ marks the business key, but hub columns have the key type ‚h‘, so we can’t assign ‚A‘ to them. Maybe they will come up with a solution at some point.

Stored Procedure

Before going into detail, I want to stress one major difference to other templates I have seen. Usually they insert new records where the business keys don’t exist. I changed it to insert new records where the hash key doesn’t exist. Why? First, the hash key is the primary key and usually has an index on it, which is much faster to search. If I scanned for the business keys, they can consist of multiple columns, which is slower again.

The main objection against using business keys is that SQL Server might not detect a difference in the value due to the collation used. You would have to set the business key to a binary collation to capture every change. Without a binary collation, you can end up with satellite records that have no hub record and therefore a broken foreign-key relationship. It happened to us.

What about hash key collisions? A lot is said about them. To catch them, I check for the existence of both the hash key and the business keys. The unique constraint on the hub or link (once it is supported) will give us a crash if I get duplicate records.
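A sketch of the resulting hub insert (names are invented), checking both the hash key and the business key so a collision would run into the primary key and crash the load:

INSERT INTO [TABLEOWNER].[hub_customer] ( hub_customer_key, customer_code, dss_load_datetime, dss_record_source )
SELECT
  stage_view.hub_customer_key
 ,stage_view.customer_code
 ,MIN(stage_view.dss_load_datetime)
 ,MIN(stage_view.dss_record_source)
FROM [TABLEOWNER].[stage_view_hub_customer] stage_view
WHERE NOT EXISTS (
    SELECT 1
    FROM [TABLEOWNER].[hub_customer] hub_customer
    WHERE hub_customer.hub_customer_key = stage_view.hub_customer_key
    AND hub_customer.customer_code = stage_view.customer_code
)
GROUP BY
  stage_view.hub_customer_key
 ,stage_view.customer_code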

For the hub just select the template. This will generate the stored procedure to load data from the merge stage table.

Now comes the tricky (or dirty) part. Open the properties of the stage view and select this generated stored procedure in the „Custom Procedure“ option field. With the replace function in place, we can now call „Execute Custom Procedure“ on the stage view and it will load from the view instead of the default merge stage object.

That’s it.

Template: Satellite
https://www.zemp.ch/2017/12/16/template-satellite/ (Sat, 16 Dec 2017 15:14:32 +0000)

Finally, we reach our goal of talking about Data Vault objects. This is the new area I’m working in and learning about, and you might be interested as well. Data Vault consists of mainly three objects:

Hub

Link

Satellite

I’m not going to discuss what Data Vault is. There are lots of resources on the Internet from people who are much better at teaching Data Vault than I am.

What does a loading procedure look like in WhereScape?

In this post, I’m going to talk about loading to a satellite. The next one will be about hubs and links.

First, a satellite is a system of record like a history table. The difference is that the data is collected around the hash key of a hub or link (which represents the business keys of a business concept) instead of the business key from the source table.

What are the requirements to load data into a satellite?

Stage

The creation of the stage table is discussed in depth in another blog post. The main goal is to prepare the data to be loaded to a satellite.

Data from a history table is loaded first into a Raw Vault satellite. There should be no transformations except hard business rules like data type changes.

If we need to clean the data, do some calculations or derive values from it, we load data from existing satellites into a Business Vault satellite. Do whatever you want in the stage table.

Once the data set is the way we want it, we need to define the hub/link keys and the dss_change_hash. The cleaning of the business keys for the hub or link hash key is taken care of by the template.

Up to the history level we had the following system columns:

dss_load_datetime

dss_sequence_no

dss_record_source

For satellite loading we drop the dss_sequence_no. For the first load from a history table we also need a little transformation using the dss_sequence_no, to reduce the chance of hitting a duplicate row. No need to remove any rows from the source. Take them all!

DATEADD( NS , 100 * dss_sequence_no , dss_load_datetime )

Satellite

A table of records for a given HashKey. The Primary Key of a satellite is:

hub/link_key

dss_start_datetime

For a multi-active satellite, add another business key to the table. Refrain from doing that as much as possible. As WhereScape doesn’t support multi-active satellites yet, I was forced to do it differently, and by accident this made my life easier.

The dss_start_datetime is loaded from dss_load_datetime.

As I have an insert-only system, we move the calculation of the dss_end_datetime further down into a current or timeline view. All our satellites are clustered columnstore indexed tables. Great performance for INSERT and SELECT.

View

To make the satellite more readable and avoid writing too many WHERE clauses in other objects, we create two views:

Current

Timeline

The current view retrieves the last valid record for a hub/link_key while the timeline gives back a whole timeline. Currently I’m not sure what the first datetime of the timeline should be: '0001-01-01' or the first dss_start_datetime? The templates support both options.

Stored Procedure

Which steps do I take to load data from a stage table to a satellite:

Insert new records
    INSERT INTO Satellite
    SELECT Result
    SUB-SELECT History Columns
    SUB-SELECT Row Condensing
    SUB-SELECT Where Not Exists
    SUB-SELECT Extract from Stage Table

Extract from Stage Table

All the cleaning and transforming is done in stage. Now it’s time to load new data.

Where Not Exists

We now have two options. If you have a dss_version column, we must make sure that the order of records inserted is correct; no old records are allowed.

If you delete this column, like I did, loading old and new data in any order is possible.

Row Condensing

We could save all records in a satellite. But that would be a waste of space. So, we partition the data by hub/link_keys and order it by dss_load_datetime. What can we check the data for?

Is there a difference in dss_change_hash from one row to the next?

Is the dss_load_datetime different from one row to the next? This is a hard rule. We circumvent a primary key constraint violation by eliminating duplicates, which is dangerous. BUT… we have already taken care of that: in the stage table loading process we added a little time variance, so the chance of hitting a duplicate is now nearly zero.
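A sketch of that condensing step (names are invented):

SELECT condensed.*
FROM (
    SELECT stage_sat_customer.*
          ,LAG(stage_sat_customer.dss_change_hash) OVER (
               PARTITION BY stage_sat_customer.hub_customer_key
               ORDER BY stage_sat_customer.dss_load_datetime ) AS prev_change_hash
    FROM [TABLEOWNER].[stage_sat_customer] stage_sat_customer
) AS condensed
WHERE condensed.prev_change_hash IS NULL                       -- first row for this key
   OR condensed.prev_change_hash <> condensed.dss_change_hash  -- or something really changed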

Views

Current View

To identify the most current record we must find out the most recent dss_start_datetime for a hub/link_key.

SELECT sat_demo.*
FROM [sat_demo] AS sat_demo
INNER JOIN (
    SELECT
        hub_product_key AS hub_product_key
        ,MAX(dss_start_datetime) AS dss_start_datetime
    FROM [sat_demo]
    GROUP BY
        hub_product_key
) AS current_rows
    ON current_rows.hub_product_key = sat_demo.hub_product_key
    AND current_rows.dss_start_datetime = sat_demo.dss_start_datetime

Timeline View

Currently there are 4 columns supported:

dss_start_datetime

dss_end_datetime

dss_effectivity_datetime

dss_expiry_datetime

Currently I’m not sure what the first dss_start_datetime should be. I added an extended property to define it. The template calculates the first dss_start_datetime with the following expression:

CASE WHEN LAG(dss_start_datetime,1,NULL) OVER (PARTITION BY hub/link_key ORDER BY dss_start_datetime) IS NULL THEN '0001-01-01' ELSE dss_start_datetime END

Add dss_end_datetime as a column without a source column definition.

The template will calculate it with the following query:

LEAD(DATEADD(ns,-100,dss_start_datetime),1,'9999-12-31') OVER ( PARTITION BY hub/link_key ORDER BY dss_start_datetime )

The other 2 columns exist, but are not perfect yet. I need to learn more about bi-temporal stuff to get it right.

Implementation

Download from the template section the following templates for WhereScape written in Pebble:

nu3_sqlserver_procedure_sat

nu3_sqlserver_block_select_sat

nu3_sqlserver_ddl_view_sat_current

nu3_sqlserver_ddl_view_sat_timeline

There is not much to do to set it up. Just select the template. That’s it.

Template: History (DataStore)
https://www.zemp.ch/2017/12/09/template-history/ (Sat, 09 Dec 2017 14:30:33 +0000)

The history layer (or Persistent Staging Area, as Roelant Vos calls it) is a safeguarding layer. All data arrives unmodified from the source.

People in the Data Vault space would probably suggest creating a Source Vault if I don’t want to build a proper Raw Vault with good business keys. I considered that option too, and would probably have built it that way. But I thought, if WhereScape already has a history layer built in, why not use it? And I have used that kind of history layer in the past.

So, how do I ingest data to the history layer?

What are the requirements:

Load

We have any kind of source type and source system. Loading the external data to the landing table is the first step. Unmodified, as original as possible.

To the loaded data, 3 system columns are added:

dss_load_datetime

dss_sequence_no

dss_record_source

What is the dss_load_datetime? It is the datetime of the moment the batch is inserted into the data warehouse. I give the whole batch the same datetime, although loading might take minutes. Very important: it is not the extraction date, it is the insert date. See it as a human-readable rowversion. In my terminology it also identifies a batch, and there can be many batches in the load table, e.g. when looping through files or through similar databases that share the same data structure.

Each row in a batch is sequenced. I have seen that in Dan Linstedt’s book and in other designs. He doesn’t persist it, as it is not needed in satellites. But as I have a history layer, I want to keep it for later use in satellites (described later). If I can’t set the dss_sequence_no right in the load table, I add it in the stage step. Having it auto-increment in the load table is not possible, because there can be multiple batches.

The dss_record_source belongs to the Data Vault Standard. Just add it the way you want. We have the following rules:

{source_name}.{source_database}.{source_schema}.{source_table}

{source_name}.api.{api}

{filename} (which contains the subdirectory structure including the source_name)

Stage

The creation of the stage table is discussed in-depth in another blog post.

The goal is to add any missing system columns which are not yet defined in the load table and to add a hash value for dss_change_hash.

If you have many batches, you can speed up the next step by adding a non-clustered index on dss_load_datetime.
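For example, such an index could look like this (the index name is made up):

CREATE NONCLUSTERED INDEX idx_stage_hist_demo__dss_load_datetime
ON [TABLEOWNER].[stage_hist_demo] ( dss_load_datetime )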

History

A system of records for a given business key at a specific time. Only changes from one batch to another are persisted. How do we identify a batch? It’s the dss_load_datetime, and it always counts towards the future. Only in the first load can you load historical data; after that, only new data arrives.

Persisting data over time needs at least the following system columns

dss_start_datetime

dss_end_datetime

WhereScape has also these other system columns on offer:

dss_current_flag

dss_version

Someone might think that the dss_start_datetime is the same as dss_load_datetime. I thought so at the beginning too, but I had trouble later joining different datasets together.

I concluded that the first batch starts at 0001-01-01 and has an end date of 9999-12-31. From the second historization onwards, dss_load_datetime matches dss_start_datetime. I still keep the dss_load_datetime in my history table; it is used throughout the system.

Views

Having a current view can speed up development later: no need to write WHERE clauses you might forget. And in WhereScape you can create an alias for it in another database, e.g. raw_mart.

The timeline is not needed as an extra view. It is already inside the history table. But to be consistent with later stuff, I added it here.

The error view is a nice add-on if you load data from users who tend to write anything into Excel and you want to be sure it matches the later process. It doesn’t prevent anything (i.e. it doesn’t break the load), but it informs you.

Stored Procedure

Which steps do I take:

Create Cursor from dss_load_datetime
Update expiring records
Insert new records
    INSERT INTO History Table
    SELECT Result
    SUB-SELECT History Columns
    SUB-SELECT Where Not Exists
    SUB-SELECT Extract from Stage Table

Cursor

First, extract the last saved dss_load_datetime from the history table.

The cursor consists of a distinct and sorted list of dss_load_datetime values which are equal to or greater than the last dss_load_datetime. Why equal and not just greater than? If something breaks, I want to start over where it ended. Even if it takes another second to check, it is a safety net, and no duplicate data is inserted.

Especially while migrating old data with thousands of batches, loading sometimes breaks. So it is good to be able to restart where it ended and not start all over again. This feature is not available by default in WhereScape, so before templates existed I used views to load only the data that had not yet been historized.

Update expiring records

For the expiring record, it sets the dss_end_datetime to 100 nanoseconds before the next dss_start_datetime. That way we can use BETWEEN in SQL statements, and no time tick exists twice. I recently learned another view on that topic from Dirk Lerner, but that is for another blog post. This is currently our rule set.
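Sketched with the demo table used further below (not the exact template output; @v_batch_start is the dss_load_datetime of the batch currently processed by the cursor):

UPDATE hist_demo
SET hist_demo.dss_end_datetime = DATEADD(ns, -100, @v_batch_start)
   ,hist_demo.dss_current_flag = 'N'
   ,hist_demo.dss_update_datetime = @v_current_datetime
FROM [TABLEOWNER].[hist_demo] hist_demo
WHERE hist_demo.dss_current_flag = 'Y'
AND EXISTS (
    SELECT 1
    FROM [TABLEOWNER].[stage_hist_demo] stage_hist_demo
    WHERE stage_hist_demo.dss_load_datetime = @v_batch_start
    AND stage_hist_demo.shop_code = hist_demo.shop_code
    AND stage_hist_demo.sid = hist_demo.sid
    AND stage_hist_demo.request_id = hist_demo.request_id
    AND stage_hist_demo.dss_change_hash <> hist_demo.dss_change_hash  -- expire only records that really changed
)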

Extract from Stage Table

All the cleaning and transforming is done in stage. Now it’s time to load new data.

WHERE dss_load_datetime = cursor

We don’t clean for duplicates. The indexes will make sure that it will crash if you hit one.

Where Not Exists

To find out if a record has changed we need to get the latest record from the history table and check if the dss_change_hash differs.

History Columns

WhereScape has 4 history columns on offer. If you don’t need dss_version or dss_current_flag, delete them.

dss_start_datetime

CASE WHEN current_rows.shop_code IS NULL
THEN CAST('0001-01-01' AS DATETIME2(7))
ELSE hist_demo.dss_load_datetime
END

dss_end_datetime

CAST('9999-12-31' AS DATETIME2(7))

dss_current_flag

'Y'

dss_version

CASE WHEN current_rows.shop_code IS NULL
THEN 1
ELSE current_rows.dss_version + 1
END

To get the last version we need to add a LEFT OUTER JOIN to extract the maximum version for a business key. See below.

Add-On

There is also a performance booster available if you insert history data in thousands of batches. I found that rebuilding the indexes every 100 batches accelerates the inserts, but only if the table is not too big. Delete that section or use it.

Future

Instead of looping through a cursor, I want to load everything in one go. I haven’t had time yet to explore and define a template for that, so you will have to wait. And it needs to be bullet-proof, because the history layer is the foundation for everything that comes afterwards.

INSERT INTO [TABLEOWNER].[hist_demo] WITH ( TABLOCKX )
(
shop_code
,sid
,request_id
,times
,order_id
,total_price
,dss_load_datetime
,dss_sequence_no
,dss_record_source
,dss_change_hash
,dss_start_datetime
,dss_end_datetime
,dss_current_flag
,dss_version
,dss_create_datetime
,dss_update_datetime
)
SELECT
hist_demo.shop_code AS shop_code
,hist_demo.sid AS sid
,hist_demo.request_id AS request_id
,hist_demo.times AS times
,hist_demo.order_id AS order_id
,hist_demo.total_price AS total_price
,hist_demo.dss_load_datetime AS dss_load_datetime
,hist_demo.dss_sequence_no AS dss_sequence_no
,hist_demo.dss_record_source AS dss_record_source
,hist_demo.dss_change_hash AS dss_change_hash
,hist_demo.dss_start_datetime AS dss_start_datetime
,hist_demo.dss_end_datetime AS dss_end_datetime
,hist_demo.dss_current_flag AS dss_current_flag
,hist_demo.dss_version AS dss_version
,@v_current_datetime AS dss_create_datetime
,@v_current_datetime AS dss_update_datetime
FROM (
SELECT hist_demo.*
,CASE WHEN current_rows.shop_code IS NULL
THEN CAST('0001-01-01' AS DATETIME2(7))
ELSE hist_demo.dss_load_datetime
END AS dss_start_datetime
,CAST('9999-12-31' AS DATETIME2(7)) AS dss_end_datetime
,'Y' AS dss_current_flag
,CASE WHEN current_rows.shop_code IS NULL
THEN 1
ELSE current_rows.dss_version + 1
END AS dss_version
FROM (
SELECT hist_demo.*
FROM (
SELECT
stage_hist_demo.shop_code AS shop_code
,stage_hist_demo.sid AS sid
,stage_hist_demo.request_id AS request_id
,stage_hist_demo.times AS times
,stage_hist_demo.order_id AS order_id
,stage_hist_demo.total_price AS total_price
,stage_hist_demo.dss_load_datetime AS dss_load_datetime
,stage_hist_demo.dss_sequence_no AS dss_sequence_no
,stage_hist_demo.dss_record_source AS dss_record_source
,stage_hist_demo.dss_change_hash AS dss_change_hash
FROM [TABLEOWNER].[stage_hist_demo] stage_hist_demo
WHERE stage_hist_demo.dss_load_datetime = @v_batch_start
) AS hist_demo
WHERE NOT EXISTS (
SELECT 1
FROM [TABLEOWNER].[hist_demo] hist_demo__not_exist
WHERE hist_demo__not_exist.shop_code = hist_demo.shop_code
AND hist_demo__not_exist.sid = hist_demo.sid
AND hist_demo__not_exist.request_id = hist_demo.request_id
AND hist_demo__not_exist.dss_change_hash = hist_demo.dss_change_hash
AND hist_demo__not_exist.dss_current_flag = 'Y'
)
) AS hist_demo
LEFT OUTER JOIN (
SELECT
hist_demo__current.shop_code AS shop_code
,hist_demo__current.sid AS sid
,hist_demo__current.request_id AS request_id
,MAX(hist_demo__current.dss_version) AS dss_version
FROM [TABLEOWNER].[hist_demo] hist_demo__current
GROUP BY
hist_demo__current.shop_code
,hist_demo__current.sid
,hist_demo__current.request_id
) AS current_rows
ON current_rows.shop_code = hist_demo.shop_code
AND current_rows.sid = hist_demo.sid
AND current_rows.request_id = hist_demo.request_id
) AS hist_demo

Of course, it would be possible to replace both steps with a MERGE statement. My experience with that is not the best: with huge tables or data sets it just takes time, and sometimes SQL Server „gets irritated“ about what to do.

Current View

Often we only use the currently valid record. Instead of writing WHERE clauses everywhere, I just create a view on the history table and select the appropriate template. That’s it. What does it do?

It adds a simple WHERE dss_current_flag = 'Y'. Not that difficult. But it will get more complex when we come to satellites and link satellites.

If this view is business critical, I also added a template for an indexed view.

Implementation

Download from the template section the following templates for WhereScape written in Pebble:

nu3_sqlserver_procedure_hist

nu3_sqlserver_view_hist_current

To set up:

Select the template „nu3_sqlserver_procedure_history“

Set the Business Key Columns

Set the Change Detection Fields to „dss_change_hash“

That’s it.

Template: Stage (View)
https://www.zemp.ch/2017/12/03/template-stage-view/ (Sun, 03 Dec 2017 13:10:03 +0000)

What is the fastest way to ingest data: to persist or to stream? A view comes close to streaming data, whereas persisting always feels like stop-and-go to me.

I wanted to wait until the end of my little template story, but I couldn’t resist talking about it now. I’m too excited about it!

When I started, I thought I could design any object as either a table or a view. But all object types except one are fixed to TABLE, and the view type is fixed to VIEW. I created a ticket to allow any object to be created as TABLE or VIEW. Maybe it is still on the development list.

What’s the story behind it? I had a couple of stage tables for an export and was building up a network of data snippets, each of them persisted. What happens when I now need to create a job? I must chain all those little steps together and check for dependencies. Others would maybe build a custom stored procedure, but that kills parallelism. Why not create little views?

Another example: why do I need to persist a staging table when it could be „in transit“ to load a satellite? No need to wait for persistence.

Recently, while exploring the DDL function to create special-purpose views (e.g. a „current“ view for satellites), it literally hit me in the face: would it be possible to use the DDL function on any object to create views instead of tables?

I couldn’t wait. I copied an existing view DDL and added some code from my stage template. Done.

So easy and so obvious!

Now I can create stage tables and stage views with the same functionality.

Unfortunately, WhereScape has some safety checks around it (e.g. a view can’t be dropped when the object type expects a table). Hopefully they are going to fix that at some point.

Template: Stage
https://www.zemp.ch/2017/12/02/template-stage/ (Sat, 02 Dec 2017 14:05:52 +0000)

Usually one would start with an easy template, but the stage template is the most difficult one. The first running version was done in a few days; perfecting it and adding more features has taken us months!

The stage table is used everywhere. So, I’m starting with it. Where do we use a stage table?

Pretty much everywhere. What can we do with a stage table?

Load data from any source table

Transform data the way we want it

Prepare data for the target table

Loading & transforming is done easily in RED: drag & drop a column from the source table into the stage table, add a transformation if we need one. Done!

The real energy goes into preparing the data for the target table. What do the different target types need?

Cleaning of the Business Key
Adding a Hub Hash Key
Adding a Link Hash Key

Satellite

Adding a Change Hash

Data Mart

Nothing special

Fact

Lookup of the Dimension Key

We have the option to create a stage template for each of them, or to combine them, as they don’t differ much and performance isn’t affected if a feature is not used.

Let’s discuss those requirements in depth.

Requirements

History

The goal in loading data to the history table is to persist only changes over time. How do we accomplish that? One option is to compare each column of the stage table to the history table. This results in a huge WHERE clause in the loading statement and is really bad performance-wise, because SQL Server has to compare every column. How can we make it faster? Calculate a hash value over all values concatenated; then SQL Server only has to compare one column. The disadvantage is that the hash calculation takes time, but much less than comparing each of the columns to the target.

Hub & Link

Hubs & links form an abstraction layer over the data: we can group data in satellites around a business entity. What do they need? A clean business key. If it is a number, easy. But what about text? There might be spaces around it, mixed case, and even characters which SQL Server won’t recognize as different from one another if the column is not set to a binary collation (e.g. „ß“ and „ss“). And NULL values are no good for a business key either. Can you give me more information about a customer who doesn’t exist?

We need to clean the business key so it is understood from all perspectives and doesn’t lead to duplicates (e.g. a country code with and without a trailing space).

After the cleaning, we add a hash value over the keys. In Data Vault 1.0 this was a surrogate key, but with a hash we can load data in parallel into any destination.

Satellite

Same as for the history table: the change hash gains performance when loading data into a satellite. And it belongs to the Data Vault standard.

Dimension

Nothing special for the stage table. Loading data into a dimension is covered later.

Fact

If the join from a fact table to a dimension table is based on the business key, nothing needs to be done. But if we use a surrogate key in the dimension table, as many people do, we need to look up this value and add it to the stage table.

Pipeline

How to order all those requirements?

I must admit that the initial design originates from Dan Linstedt. At first I had more steps in between and a different order. I thought I was cleverer… but (sigh) over time I came to the same result.

At the beginning of my journey, I had the cleaning of business keys in the transformations. But sometimes I forgot to clean them. Just recently I found a way to clean them through the templates.

Stored Procedure

Which steps did I design to get all of that into a fast-working data flow:

TRUNCATE TABLE stage_table
Temporary Table (only for tables which need hashing)
    INSERT INTO #temp_table
    SELECT + TRANSFORM from source_table
Target Table
    INSERT INTO stage_table
    SELECT Result
    SUB-SELECT Dimension Lookup
    SUB-SELECT Remove true duplicates
    SUB-SELECT Calculate Hash Keys and Change Hash
    SUB-SELECT Clean Business Key for Hash Keys
    SUB-SELECT + Transform from source_table or #temp_table

Do you wonder why I create a temporary table? As I load a lot of incremental data, I came across an odd behavior of SQL Server: sometimes it would calculate the hash for every row of the whole source table and only then filter out the incremental data, e.g. calculating hashes for 100 million rows and keeping the last 100,000. The optimizer believes that calculating hashes takes no time. Even if I re-order the sub-queries, it still comes up with the same execution plan. I need this #temp_table as an invisible stage table to persist the data, so only the incremental data gets hashed.
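A sketch of that step (the variable and names are invented; the real filter comes from the Source WHERE clause):

SELECT source_table.*
INTO #stage_demo
FROM [TABLEOWNER].[source_table] source_table
WHERE source_table.dss_load_datetime > @v_last_load_datetime   -- incremental filter applied BEFORE any hashing

-- the hash keys and the change hash are then calculated from #stage_demo only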

By using sub-selects, I can focus on each part and have it return a perfect result; I don’t need to mix things up. Initially we had those sub-selects as views. A lot of objects! With templates, it is now a no-brainer.

Extract & Transform

The template takes the source column or the transformation on it.

Clean Business Key for Hash Keys (other Business Keys are not touched)

There are many options out there. I decided to go with the following:

CHAR Data Types

UPPER(LTRIM(RTRIM(ISNULL(NULLIF(source_column,''),'-1'))))

INT Data Types

ISNULL(source_column,-1)

If a CHAR column is empty, I set it to NULL, and every NULL value is set to '-1'. Then I trim any surrounding spaces and convert to upper case.

If an INT column is NULL, set it to -1. 0 (zero) is a legitimate value, so there is no need to replace it.

On SQL Server, use a BINARY data type to store the hash. It uses half the space, it’s the native output of the hash function, and SQL Server doesn’t need to cast it to CHAR. Much better performance; hashing is slow enough already, no need to make it slower. If you don’t know how to query for a binary value: just write the value without quotes, like an integer value.

SELECT * FROM hub_product WHERE hub_product = 0x02D...

The template can handle both. If you have it in CHAR it will cast it, otherwise not.

Casting everything to NVARCHAR(MAX) is the easiest approach. NVARCHAR has the best implicit conversion from any data type to string and the results look clean. Only DATETIME and one other type (which I can’t remember) look ugly, but as I only use DATETIME2 I don’t run into that. My first version gave each data type its own string conversion, but this is much easier to understand and use. Don’t ever cast to VARCHAR(MAX) only; I lost data with that. As long as there is no true UTF-8 character set (with variable byte length) for VARCHAR, stick with UTF-16 in NVARCHAR.

For hash keys, everything is already cleaned in the previous step. When the cleaning was in the first step (select & transform), I occasionally forgot to add it and saved bad business keys because they were not cleaned. But now they always are! I love not having to think about such obstacles.

For the change hash, we just have to make sure that we don’t get NULLs; a single NULL empties the whole concatenated string. Some people also apply RTRIM, but then I would lose a change from the source system, which I don’t like. I analyzed the impact and found it negligible.

Ah, before I forget: I sort the column names alphabetically for hashing. That way I’m free to reorder columns (which we shouldn’t do, but sometimes do) and can rehash things at any later point.
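A sketch of the resulting dss_change_hash expression (columns are invented and already sorted alphabetically, the hash algorithm is just an example, NULLs are replaced so one NULL can’t wipe out the whole string):

CAST(
  HASHBYTES('sha1',
      ISNULL(CAST(order_id AS NVARCHAR(MAX)), N'') + N'||'
    + ISNULL(CAST(times AS NVARCHAR(MAX)), N'') + N'||'
    + ISNULL(CAST(total_price AS NVARCHAR(MAX)), N'')
  )
AS BINARY(20)) AS dss_change_hash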

Remove true duplicates (NEVER DO THAT!)

It is included, but beware. REALLY! Wait until you hit a duplicate and talk to the owner of your source system first! Only implement it if they can’t fix it.

I use it only for stage tables that load data into history. For satellites, I’ll give you another option later.

ROW_NUMBER() OVER (PARTITION BY <business key columns> ORDER BY dss_load_datetime DESC, dss_sequence_no DESC)

With that I get the last version of a business key. The dss_sequence_no is a column I use for history loading; if you don’t have it, it won’t show up in the expression.

Performance-wise, a ROW_NUMBER() calculation is a really bad thing to do: it slows down the whole execution plan and kills parallelism. But it is the easiest option. I tried another approach which doesn’t have to wait until the internal sort of the ROW_NUMBER() is finished, but it takes a little more CPU and can’t be modeled in this sub-query context.

Dimension Lookup

Just looking up the key.

COALESCE(<dimension table>.<artificial key>,0)

If you need to add missing dimension keys to the dimension table, I’ll blog about a solution later. In templates, we can’t address the get_dimension_key procedure; anyway, it is not needed anymore.

Implementation

Now we have seen what can be done with it. How to set it up? Add a stage table. In the preferences, there is no need to distinguish between table types (Stage, Data Vault Stage, Permanent Stage, Work Table). Stage is just fine for every data load.

For removing duplicates, I thought the best place would be the Business Key Columns setting. NEVER define them! Only if you want to get rid of duplicates. By default it will create an index to enforce them.

Define the Source JOIN and WHERE clause as you need.

That’s it! It’s a sleek, simple template without all the bells and whistles available from WhereScape. But it serves our purpose of pumping data easily into a stage table without having to think about any special treatment.

Please leave a comment if you like it, if you are going to use it, or if you find bugs.

WhereScape 3D is a newer application and is the way to go for the future. With 3D I can dig into my source system, profile it, document it and design my future model, and at the end it spills out all objects (including any needed intermediate tables such as stage tables) for the „machine room“.

WhereScape RED is the machine room application. RED creates tables, indexes and views the way the user wants them. On top of that, it creates stored procedures to pump data from the source to the target table. For me this was a design shift from SSIS, which is an application that retrieves data, transforms it and inserts it into the target, all in memory outside the database. Stored procedures run inside the database engine.

When we started, I had big hopes of getting to my target design very fast, hoping for a default path to my goals. But the consultants just told us: you can do everything the way you want! So, what did I want? And which way did we want it? We started to model things in 3D, export, test, remodel, export, test, and so on. As I didn’t know what RED could do and how to achieve things best, this took us some time.

Finally I focused on RED, the machine room, to learn how things work and how to design them in a repeatable pattern.

Our first draft was to rebuild a kind of data flow with views: extract & transform → merge → clean business key → calculate hash diffs → remove duplicates (optional) → stage table → load → target table. With a service creating all those SQL queries based on the view type, this was quite fast and simple. Along the way we also learned about strange SQL Server behaviors, which I’ll point out later. Creating all those tables from 3D took us some time, and some features didn’t exist the way we wanted them to work.

But to be clear: working this out was still faster than working in SSIS.

In June 2017 WhereScape released a template engine based on Pebble. I had never heard of it before.

In a matter of days I redesigned all our objects, removing all those views and simplifying the whole stack. Mainly it looks like source → stage → target. I’m very satisfied with that!

So, what kind of objects does WhereScape offer?

The table type „Stage“ is the work horse for anything.

What target table types do I use (there are more)?

Load

History (renamed from Data Store)

Satellite

Link

Hub

Dimension

Fact

That’s it. Do I really need only seven templates „to rule them all“ (including stage, excluding load)? Wow, that’s a short list. OK, I added three more templates for special stage tables, but that’s it. Focusing on those types and defining the best templates gives us the power to develop our pipelines fast. Is this automation? I don’t know. With SSIS I needed to create tables and indexes manually; in WhereScape I define and it creates. I believe once I start using 3D we can talk about automation.

Something else: maybe you wonder why „Load“ is a target table type? The books usually talk about a stage table and a target table. But as stage tables contain „logic“ (at least in WhereScape) and are used to organize data towards the target table type, a „Load“ table is the de facto landing zone for any data from outside.

Our data flow looks like this: as you can see, the stage table is the workhorse for any transfer from one model to another.

In the next blog posts I’m going to present my recipes for loading data into each of those objects.

Business Intelligence Automation
https://www.zemp.ch/2017/11/11/business-intelligence-automation/ (Sat, 11 Nov 2017 21:00:42 +0000)

How do you build a data warehouse? I used to work with SSIS on Microsoft SQL Server to build all my pipelines. This was fast and convenient, and a lot of people in the industry work like that: define a source, do some transformations and load into a target table. Very easy, very fast.

But what can we do if we don’t want to work over and over again on transformations that always look the same? Dan Linstedt pointed me to the Capability Maturity Model to find out how mature a data warehouse is. Translated to BI, this would mean:

After some work I reached level 3 easily. But my automation efforts to quickly deploy my data flows according to my patterns turned out to be very time-consuming and difficult. My ambitions were very high.

Many people have tried to accomplish that too, and some of them have turned what they found out about automation into software they sell. All of them build a metadata repository and derive from it what needs to be done to get to the desired result.

So I did extensive research on the topic, only to find that such a metadata tool didn’t exist for me, that my own definitions were never finished, and that I kept adding more and more „exceptions“. Why should I spend time researching it when others have already solved it?

So I turned to commercial tools and tested a few. Finally I decided to go with WhereScape. When we bought it, it supported only a couple of databases and offered wizards to help get the patterns right. But it is turning more and more into a „target-less“ metadata repository which can talk to any database and any data structure through templates. In the following posts I’m going to share some of my findings on how to process data fast.

Data Loading
https://www.zemp.ch/2017/11/11/data-loading/ (Sat, 11 Nov 2017 14:00:54 +0000)

In my previous post I roughly documented a data warehouse and some data modeling styles. In this post I want to discuss how I came to the conclusion of how I want to ingest my data and serve it to my users.

What we know for sure is that we have some sources, and the data products we produce are data marts. Which steps in between make sense?

In a Kimball-style data warehouse, you would have the following steps in between:

Staging Area (multi-layered to conform data)

Data Mart

Well, that was my first approach, and I was somewhat successful. I got to my target fast and was able to present data immediately. The problems started when I had to consolidate data from many different source systems: customer 1 in system A was not the same as customer 1 in system B. I needed to redesign my loading procedures over and over again.

The advantage of a Data Vault approach is that the whole system is auditable, as it focuses on preserving all data (unmodified) over time and translating it into whatever the business requires. For that you need to know which business objects (hubs) the business knows and wants to talk about. Some IT-focused BI developers translate business objects into their source system equivalents, which is not the same; that would be a Source Vault, which has no business benefit.

I don’t have any consultants at hand and I haven’t developed ten Data Vaults yet. I’m still learning how to build one, and I don’t want to rely on my modelling techniques (yet). I want the freedom to restart my Raw and Business Vault if I make a mistake.

This leads to Roelant Vos, who is a very experienced Data Vault modeler and crazy data developer. He invented a concept of a Virtual Data Warehouse which has the following steps:

Persistent Staging Area

Raw Vault

Business Vault

Data Mart

This means his main base is the Persistent Staging Area. On top of it, he uses metadata to model virtualized data vaults and data marts. He thus has the freedom to run many similar data vaults in parallel to discover which model performs best, with data marts on top that are also fully virtualized. For that you need hardware fast enough to handle it. Of course, he can persist some of the heavy views as tables.

For me this also has the advantage that I can start loading any data I might need later, before the modeling and transformation of that data is clear. And I can start from scratch if I find something that runs faster. Of course, I don’t have to virtualize as well; I don’t have powerful machines that can calculate hash keys for hundreds of millions of rows ad hoc, so I persist everything to tables. But my home base is the Persistent Staging Area, where everything starts.

So, this is my data pipeline:

Just a little hint: maybe you have heard of ETL vs. ELT. In my model I load data into the PSA unmodified; no transformations, nothing. Extract & Load. From there to the Raw Vault, the same. Only when loading data into the Business Vault do I apply the transformations the business requests.

In my previous company I worked on loading the data, preparing it and creating dimensions and facts directly, to be consumed by our analysts. This was tiresome and buggy, as this style of modeling resulted in a lot of redesigns and re-engineering. In the meantime I have learned different techniques to accomplish what I need.

In short, here is an overview of some techniques:

3rd Normal Form

Best method to model Business Processes. Every piece of software makes use of this modelling style.

Every table gets a business key. If the system needs a relationship to another table, just add a column pointing to it (a foreign key). Straightforward industrial thinking.

But it is useless for BI. If the source system is modeled that way, I’m very happy about it, because somebody thought about the business process and added constraints to make sure the data is accurate. But modeling our BI in that style is of no use; the users already have an application on top of their data.

If you take a closer look at the picture, you will discover that a table contains different things: business keys, relationships, attributes. I found a much better model that separates those pieces of information into their own objects.

Data Vault

Best method to model Data Knowledge and aggregate information for each subject (NSA-style).

This model consists of hubs, links and satellites. A HUB consists ONLY of a business key, relationships are built with LINKS, and all descriptive data is saved in SATELLITES. That’s it! As you can see, adding information to any hub doesn’t require re-engineering everything; adding a satellite is all that’s needed. Google for more information, there is plenty out there.

For me this was a game-changer. Now I don’t need to work with tables of hundreds of columns. I split data into satellites, and each satellite is my unit of work: get that data into perfect shape, and if something is wrong, change it there. No need to change it all over the data pipeline.

Dimensional modeling

Best method for end users to access and understand data. This is the standard approach for delivering self-service data.

Save measurable values in FACTS and link DIMENSIONS to them. In modern systems this is nowadays often flattened down to one table containing all facts and dimensions in hundreds of columns.

Since I manage the detail data in satellites, building these is just a matter of joining all that information from the satellites together.