Implementing “pseudo-realtime” in BW systems without HANA

Introduction

I’d like to share my experience implementing “pseudo-realtime” reporting in BW systems without HANA that hold large data volumes.

I came across this task on a retail project. The client wanted the data in analytic reports refreshed every 15–20 minutes – the main transactional data, including receipts from the POS system, movements from ERP, stocks, sales and so on.

Problem description

If we have a BW system on a traditional database (Oracle, MSSQL, Sybase and so on) containing a large volume of data (e.g. 50 million lines of transactional data per year), such a task brings several serious problems:

Rebuilding indexes takes a long time – it can run for 10 minutes or more, and while it runs, report performance degrades heavily.

Updating aggregates also takes a long time, and while it runs, data in reports can be temporarily unavailable.

Together, rebuilding indexes and updating aggregates can take far more than the required 20 minutes.

The task is even more complicated if we use the “slice scheme” for stock infocubes instead of non-cumulative key figures. In this scheme we store the stock for every day instead of calculating it “on the fly” via non-cumulatives. This way of storing stocks is quite popular in retail projects because it provides fast reads of stock data, but it leads to a huge number of lines in the infocube.
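To illustrate the trade-off, here is a minimal Python sketch (the data and function names are illustrative, not BW objects): a non-cumulative key figure derives the stock from movements at query time, while the slice scheme materializes one stock record per day.

```python
from datetime import date, timedelta

# daily goods movements for one material (illustrative data)
movements = {date(2024, 1, 1): 10, date(2024, 1, 2): -3, date(2024, 1, 3): 5}

def stock_on(day, opening=0):
    # non-cumulative style: derive the stock by summing movements up to `day`
    return opening + sum(qty for d, qty in movements.items() if d <= day)

# slice scheme: materialize the stock for every day - reads become a simple
# lookup, but the cube grows by (materials x days) lines
slices = {}
d = date(2024, 1, 1)
while d <= date(2024, 1, 3):
    slices[d] = stock_on(d)
    d += timedelta(days=1)

print(slices[date(2024, 1, 3)])  # 12
```

The lookup into `slices` is what makes slice-scheme reads fast; the price is the row count, which is exactly what makes frequent index and aggregate maintenance painful.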

Solution

The idea of my solution is to store the current day’s data in separate infocubes. For example, suppose we have a big infocube with POS data, ZRT_C01; it contains 200 million lines of data and has several aggregates.

We make a copy of this infocube, ZRT_C02, with the same transformation to it. It’s useful to create an InfoSource and put all the logic before it, so that any future change to the logic needs to be made only once.

The second infocube, ZRT_C02, contains only today’s data. During the day we load the delta into this cube. Since it holds quite a small volume of data, we don’t need any aggregates on it, and we can load the data in a single parallel process without rebuilding indexes – and even if we do rebuild them, it doesn’t take long on a cube this small.
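The intraday flow can be sketched like this (plain Python, with a dict standing in for the cube; none of these names are real BW APIs):

```python
# the secondary cube holds only today's data, so every delta load is cheap
day_cube = {"name": "ZRT_C02", "rows": []}

def load_intraday_delta(cube, delta_rows):
    # small cube: no aggregates to roll up, and any index rebuild is fast
    cube["rows"].extend(delta_rows)

# runs every 15-20 minutes during the day
load_intraday_delta(day_cube, [("receipt-1", 14.50)])
load_intraday_delta(day_cube, [("receipt-2", 4.20)])
print(len(day_cube["rows"]))  # 2
```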

Then, during the nightly data load, we do the following:

Load today’s data again from the PSA (or an intermediate DSO) into the main infocube ZRT_C01, with index rebuilding, aggregate roll-up and so on. This takes a long time, but at night that is not a problem.

Delete the data in the secondary cube ZRT_C02.

Initialize the delta for the secondary cube ZRT_C02. This step is very important: we want the next delta load to pick up only tomorrow’s delta, not all the data from the PSA.

The third step deserves an extra comment, because it can’t be implemented with standard process-chain elements. In this step we need to call the data transfer process in “Initialization” mode.
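Putting the three nightly steps together, here is a hedged sketch (again plain Python with dicts standing in for the cubes; the `delta_pointer` field is only an illustration of what the “Initialization” DTP run achieves – marking everything already in the PSA as consumed):

```python
def night_chain(main_cube, day_cube, psa_rows):
    # 1) reload today's data from the PSA into the main cube, then rebuild
    #    indexes and roll up aggregates (slow, but acceptable at night)
    main_cube["rows"].extend(psa_rows)
    main_cube["indexes_ok"] = main_cube["aggregates_ok"] = True
    # 2) delete today's data from the secondary cube
    day_cube["rows"].clear()
    # 3) "initialize delta": mark all current PSA records as consumed so the
    #    next intraday load picks up only tomorrow's new records
    day_cube["delta_pointer"] = len(psa_rows)

main = {"name": "ZRT_C01", "rows": [], "indexes_ok": False, "aggregates_ok": False}
day = {"name": "ZRT_C02", "rows": [("receipt-1", 14.50)], "delta_pointer": 0}
night_chain(main, day, psa_rows=[("receipt-1", 14.50)])
print(day["rows"], main["rows"])  # [] [('receipt-1', 14.5)]
```

The ordering matters: the secondary cube is only emptied after the same records are safely in the main cube, so reports never lose a day’s data.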

We end up with two process chains:

The night process chain loads master data and the transactional data that are not critical for “real-time”. It loads the delta into the main infocubes (like ZRT_C01 in my example), deletes the data in the secondary infocubes, initializes their deltas and so on.

The day “real-time” process chain runs every 15–20 minutes and contains no operations on indexes or aggregates.

Both infocubes are included in one multi-cube and work together: the main cube provides all data before today, and the secondary cube provides today’s data.
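The multi-cube’s behaviour amounts to a union with a date split, which can be sketched as (illustrative Python, not a real BW query):

```python
from datetime import date

def multicube_read(main_rows, day_rows, today):
    # the main cube answers for all days before today,
    # the secondary cube answers for today
    return ([r for r in main_rows if r[0] < today] +
            [r for r in day_rows if r[0] == today])

today = date(2024, 1, 3)
main_rows = [(date(2024, 1, 1), 100), (date(2024, 1, 2), 80)]
day_rows = [(date(2024, 1, 3), 25)]
print(multicube_read(main_rows, day_rows, today))
```

Because the split is by date, a query over any period sees a complete, non-overlapping picture even while intraday deltas keep arriving in the secondary cube.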

On my project, this method made pseudo-realtime possible: the critical transactional data is updated every 15 minutes, with no loss of report performance.

A note on an alternative design: one could instead load the main cube through the secondary cube, using it as an intermediate layer. That variant is possible too, but I chose the one described above because I wanted to keep the main data flow simple and clear, without any delta initialization, data deletion and so on inside it. This makes support and debugging easier. In my variant there is no risk that somebody (by mistake) starts deleting data in the secondary cube before it has been loaded into the main cube – and even if that happens, it affects data for one day only; after the next night process chain the data is correct again.

Furthermore, if we need to reload a large volume of data (for example, to recalculate stocks for the last year), it is easier to do so without an intermediate cube.