Blogs mainly around BW/4HANA and BW-on-HANA

From Sep 24 to 29, the 5th edition of the Heidelberg Laureate Forum (HLF) took place, mainly on the premises of the University of Heidelberg. It intends to bring together laureates in mathematics and computer science with young researchers. The program roughly looks like this: during the mornings, the laureates give presentations on their field, their experience, their opinions on where things are heading, interesting research questions etc. During breaks, workshops, poster sessions, social and other events, there is plenty of room to pick up one or the other aspect from those presentations and have a fruitful discussion with some of the leading brains in the field. One of my favourite moments was a coincidental lunchtime chat with Sir Michael Atiyah on a variety of topics – and that variety is what makes the forum so fascinating.

I could not attend all the presentations this year. But from those that I could attend, I recommend the following three; this is an arbitrary, subjective selection, as all presentations had impressive content:

John E. Hopcroft: “Deep Learning Research”. If you wonder about the relationship between maths and computer science, then watch the first five minutes of John’s presentation. Later, he showed some fascinating potential of neural networks in image processing.

Many companies currently complement their existing relational data warehouses with big data components, such as Spark, HDFS, Kafka, S3, … This leads to a new form of data warehouse (DW) that we call a big data warehouse (BDW). This blog elaborates on how BW/4HANA and the SAP Data Hub (DH) are a perfect match for building a BDW.

The idea of a BDW is prevalent in many companies and industries. This blog describes a BDW built at Netflix, this one a BDW at Sears. Many more can be found on the web. All those examples show how big data storage and processing environments complement traditional relational data warehouses by providing

an easy way to process semi- and unstructured data, such as photos, videos, sound and text,

inexpensive storage for fine-granular data, e.g. from sensors and logs.

Figure 1 shows a generic setup of a BDW. Usually, there are two to three storage layers involved; sometimes, the first two are collapsed into one:

an ingestion layer: inexpensive storage to collect data from many sources, e.g. thousands of sensors; Amazon’s S3 is frequently used here,

a processing and refinement layer for distributed processing of large and/or many files,

a relational DW: this layer serves to provide semantically rich and well-structured data to business users who use analytic client tools for interactive analyses.

Fig. 1: Many BDWs follow the pattern of these storage and processing layers.

Many SAP customers are on the same trajectory as described in the Netflix and Sears examples. All of them have run a relational DW for many years and are now evolving and complementing it with big data components. BW and BW-on-HANA are capable of playing the role of the relational DW in such an environment through various connection options. However, BW/4HANA’s ambition is to go beyond this and be tightly integrated with SAP’s Data Hub. The latter manages the ingestion and processing layers to the left of figure 1. This is outlined in figure 2, which represents the pattern of figure 1 implemented with SAP software components.

Fig. 2: BW/4HANA and SAP’s Data Hub combined.

Now, what does this tight integration between BW/4HANA and SAP’s Data Hub mean? What are the specifics? This is shown in figure 3 and comprises the following features:

Workflows between the two environments can be mutually triggered: the Data Hub’s data pipelines can be part of BW/4HANA’s process chains and vice versa.

Data movement between BW/4HANA and the Data Hub – or, technically, between HANA and VORA – is highly optimized and aligned for performance (e.g. aligned data types, thereby reducing the overhead of type casting).

The repositories of BW/4HANA and the Data Hub will be integrated and interoperate to enable common transports, lineage and impact analysis.

In the area of data tiering, VORA is leveraged for archiving (cold store) of BW/4HANA data with high data throughput and fast read access.

Fig. 3: Integration points between BW/4HANA and SAP’s Data Hub.

What already exists today and what is planned to be shipped at what time is described in the roadmap shown in figure 4.

Fig. 4: Roadmap of planned integration features for BW/4HANA and SAP’s Data Hub.

Conclusion

In times of digitalization and the Internet of Things, traditional relational data warehouses are complemented with tooling, engines and infrastructure from the big data area. This leads to “big data warehouses”, sometimes also labeled “modern data warehouses”. BW/4HANA and the SAP Data Hub are a perfect match in that respect.

HANA promises to cater for both OLTP and OLAP workloads. This makes it possible to provide operational analytics within an S/4HANA system. The SAP-focused reader might wonder why on earth you would still want a BW/4HANA system in your landscape. This blog looks at three anonymised customer examples that reveal why having a data warehouse – such as BW/4HANA – is even more pressing in times of digitalisation than ever before. A data warehouse is thereby considered the place that brings data and its underlying semantics from a variety of sources together in one place – physically, virtually or mixed; using an RDBMS, a big data environment or a combination thereof; deployed on premise or in the cloud.

Example 1: Consumer Goods Customer

The first example comes from a leading consumer goods company. Figures 1a and 1b show details from two of their slides and list the sources of data that feed into their data warehouse. As expected, there is data from a bunch of traditional SAP systems, such as ERP (S/4), CRM and APO, but – as has become common in days of digitalisation – also from sensors, logs, and digitalised sales and marketing. Now, bringing that data together semantically – for example, to understand the impact of digital marketing on financial results – becomes mandatory. You need a system that is equipped with tooling and mechanisms (like modeling, security, transformation, connectivity, lifecycle, monitoring, and governance in general) that allow this semantic consolidation. This is exactly what a data warehouse does. BW/4HANA provides this infrastructure while S/4HANA focuses on certain business processes.

Fig. 1a: Detail of an original slide by an SAP consumer goods customer.

Fig. 1b: Detail of an original slide by an SAP consumer goods customer.

Example 2: Fashion Customer

The second example is from a fashion customer that sells its products predominantly via physical stores but increasingly online. The latter triggers the need to look into more and more online behavioural data, such as clickstream or social media information, in order to answer questions about which products a customer has shown interest in, what the brand perception is, etc. Fig. 2 lists the data sources that this company is analysing. One aspiration is that demand can be better predicted by better understanding a customer’s interest indicators from clickstreams and social media. That in turn can feed into demand, supply and other planning in, for example, a BW/4HANA system.

Fig. 2: Detail of an original system landscape slide by an SAP fashion (on-premise + online) customer.

Example 3: Oil and Gas Customer

The third example is from an oil and gas customer. Fig. 3 shows the data sources that they connect to their data warehouse. There is obviously a mix of SAP and non-SAP sources. For instance, there is data on seismic measurements, oil rig sensor information, drill status (the latter two for predictive maintenance), oil well status etc. Again, there are a number of scenarios or analytic questions that require combining such data with data from an SAP system. To that end, a data warehouse approach is required. Simply copying such data into the HANA system underlying an S/4HANA instance would fall short in many ways: you would still end up creating a data warehouse on HANA that coincidentally sits on the same HANA as the S/4HANA instance.

Fig. 3: Detail of an original system landscape slide by an SAP oil and gas customer.

Conclusions

These three real-world examples show that modern analytics requires data from an even larger variety of data sources than ever before. Big data, IoT, digitalisation etc. are trends that have added to that variety. Integrating data from those sources means more than just copying it into, or logically exposing it in, one location. The need for a data warehouse remains: it is the place that brings the data together (physically or logically) and semantically integrates it through transformation, harmonisation, synchronisation etc. This is complemented by operational analytics inside a single operational system, such as S/4HANA, which analyses the data in that system in an isolated way.

Hasso’s SAPPHIRE NOW 2017 Keynote Comments

Hasso commented in his SAPPHIRE NOW 2017 keynote (see here, at 0:37 to 0:39) that he “fought against data warehouses in the 1990s”. However, he also stated that “there is still an application for data warehouses”. He then elaborated that not all analytics has to sit in a data warehouse.

This is exactly the distinction and the point argued in this blog: there is operational analytics (directly inside an operational system like S/4HANA and not necessarily in a data warehouse) and there is cross-system analytics (which needs something like a data warehouse). The latter is a problem that is not addressed by S/4HANA but that exists in the real world – see the customer examples above – and that is addressed by BW/4HANA.

In a recent blog, I introduced the Data Warehousing Quadrant, a problem description for a data platform that is used for analytic purposes. The latter is called a data warehouse (DW), but labels such as data mart, big data platform, data hub etc. are also used. In this blog, I will map some of the SAP products into that quadrant, which will hopefully yield a more consistent picture of the SAP strategy.

To recap: the DW quadrant has two dimensions. One indicates the challenges regarding data volume, performance, query and loading throughput and the like. The other shows the complexity of the modeling on top of the data layer(s). A good proxy for that complexity is the number of tables, views, data sources, load processes, transformations etc. Big numbers indicate many dependencies between all those objects and, thus, high effort when things get changed, removed or added. But it is not only the effort: there is also a higher risk of accidentally changing, for example, the semantics of a KPI. Figure 1 shows the space outlined by the two dimensions. That space is then divided into four subcategories: data marts, very large data warehouses (VLDWs), enterprise data warehouses (EDWs) and big data warehouses (BDWs).

Figure 1: The DW quadrant.

Now, there are several SAP products that are relevant to the problem space outlined by the DW quadrant. Some observers (customers, analysts, partners, colleagues) would like SAP to provide a single answer or a single product for that problem space. Fundamentally, that answer is HANA. However, HANA is a modern RDBMS; a DW requires tooling on top. So, something more than just HANA is required. Figure 2 assigns SAP products / bundles to the respective subquadrants. The idea is to provide a flexible rule of thumb rather than a hard assignment. For example, BW/4HANA can play a role in more than just the EDW subquadrant; we will discuss this below. Still, it becomes clear where the sweet spots or focus areas of the respective products are.

Figure 2: SAP products assigned to subquadrants.

From a technical and architectural perspective, there are many relationships between those SAP products. For example, operational analytics in S/4 heavily leverages the BW embedded inside S/4. Another example is BW/4HANA’s ability to combine with any SQL object, like SQL-accessible tables, views, procedures / scripts. This allows smooth transitions or extensions of an existing system into one or the other direction of the quadrant. Figure 3 indicates such transition and extension options:

Data Mart → VLDW: This is probably the most straightforward path, as HANA has all the capabilities for scale-up and scale-out to move along the performance dimension. All products listed in the data mart subquadrant can be extended using SQL-based modeling.

VLDW or EDW → BDW: Modern data warehouses incorporate unstructured and semi-structured data that gets preprocessed in distributed file or NoSQL systems connected to a traditional (structured), RDBMS-based data warehouse. The HANA platform and BW/4HANA will address such scenarios. Watch out for announcements around SAPPHIRE NOW 😀

Figure 3: Transition and extension options.

The possibility to evolve an existing system – located somewhere in the space of the DW quadrant – to address new and/or additional scenarios, i.e. to move along one or both dimensions, is an extremely important and valuable asset. Data warehouses do not remain stale; they are permanently evolving. This means that investments are secure – and so is the ROI.

A good understanding or a good description of a problem is a prerequisite to finding a solution. This blog presents such a problem description, namely for a data platform that is used for analytic purposes. Traditionally, this is called a data warehouse (DW), but labels such as data mart, big data platform, data hub etc. are also used in this context. I’ve named this problem description the Data Warehousing Quadrant. An initial version was shown in this blog. Since then, I’ve used it in many meetings with customers, partners, analysts, colleagues and students. It has the nice effect that it makes people think about their own data platform (problem) as they try to locate where they are and where they want to go. This is extremely helpful as it triggers the right dialog. Only if you work on the right questions will you find the right answers. Or, put the other way: if you start with the wrong questions – a situation that occurs far more often than you’d expect – then you are unlikely to find the right answers.

The Data Warehousing Quadrant (Fig. 1) has two problem dimensions that are independent from each other:

Data Volume: This is a technical dimension which comprises all sorts of challenges caused by data volume and/or significant performance requirements, such as query performance, ETL or ELT performance, throughput, high numbers of users, huge data volumes, load balancing etc. This dimension is reflected on the vertical axis in fig. 1.

Model Complexity: This reflects the challenges triggered by the semantics, the data models, and the transformation and load processes in the system. The more data sources are connected to the DW, the more data models, tables and processes exist. So, the number of tables, views and connected sources is probably a good proxy for the complexity of the modeling inside the DW. Why is this complexity relevant? The lower it is, the less governance is required in the system. The more tables, models and processes there are, the more dependencies exist between all those objects, and the more difficult it becomes to manage those dependencies whenever something (like a column of a table) needs to be added, changed or removed. This is the day-to-day management of the “life” of a DW system. This dimension is reflected on the horizontal axis in fig. 1.
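The dependency management described above can be sketched as a small graph problem. The following is an illustrative sketch only – the object names are invented and real DW tooling tracks far more metadata – but it shows why the number of objects is a proxy for change effort: impact analysis for a change is a transitive search over the dependency graph.

```python
# Hypothetical illustration: a DW's tables/views/load processes as a
# dependency graph. Impact analysis for a change = all objects that
# transitively depend on the changed object. All names are made up.
from collections import defaultdict

deps = {  # "X depends on Y" edges
    "view_sales_eu": ["tbl_sales"],
    "view_sales_kpi": ["view_sales_eu", "tbl_fx_rates"],
    "load_to_mart": ["view_sales_kpi"],
}

# Invert the edges: for each object, who depends on it?
dependents = defaultdict(set)
for obj, sources in deps.items():
    for src in sources:
        dependents[src].add(obj)

def impacted(changed):
    """Return every object transitively affected by changing `changed`."""
    seen, stack = set(), [changed]
    while stack:
        for d in dependents[stack.pop()]:
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return seen

print(sorted(impacted("tbl_sales")))
# changing tbl_sales ripples through both views and the load process
```

The set grows quickly with the number of objects, which is exactly the governance burden the model complexity axis captures.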

Figure 1: The DW quadrant.

Now, these two dimensions create a space that can be divided into four (sub-) quadrants which we discuss in the following:

Bottom-Left: Data Marts

Here, the typical scenarios are, for example,

a departmental data mart, e.g. a marketing department sets up a small, maybe even open-source-based RDBMS system and creates a few tables that help to track a marketing campaign. Those tables hold data on customers that were approached, their reactions or answers to questionnaires, addresses etc. SQL or other views allow some basic evaluations. After a few weeks, the marketing campaign ends, hardly any or no data gets added, and the data, the underlying tables and the views slowly “die” as they are not used anymore. Probably one or two colleagues are sufficient to handle the system, both setting it up and creating the tables and views. They know the data model intimately, data volume is manageable, and change management is hardly relevant as the data model is either simple (thus changes are simple) or has a limited lifespan (≈ the duration of the marketing campaign).

an operational data mart: this can also be the data that is managed by a certain operational application, as you find them e.g. in an ERP, CRM or SRM system. Here, tables and data are given, and data consistency is managed by the related application. There is no requirement to involve additional data from other sources, as the nature of the analyses is limited to the data sitting in that system. Typically, data volumes and the number of relevant tables are limited and do not constitute a real challenge.

Top-Left: Very Large Data Warehouses (VLDWs)

Here, a typical situation is that there is a small number of business processes – each one supported by an operational RDBMS – with at least one of them producing huge amounts of data. Imagine the sales orders submitted via Amazon’s website: this article cites 426 items ordered per second on Cyber Monday in 2013. Now, the model complexity is comparatively low, as only a few business processes, and thus tables (that describe those processes), are involved. However, the major challenges originate in the sheer volume of data produced by at least one of those processes. Consequently, topics such as DB partitioning, indexing, other tuning, scale-out and parallel processing are dominant, while managing the data models or their lifecycles is fairly straightforward.
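A quick back-of-envelope calculation shows why the volume dimension dominates here. Starting from the 426 orders/second figure cited above, and assuming (purely for illustration) an average row size of 200 bytes, a sustained Cyber Monday rate would look like this:

```python
# Back-of-envelope: what a sustained 426 orders/second means for a VLDW.
orders_per_sec = 426               # Cyber Monday 2013 figure cited above
per_day = orders_per_sec * 86_400  # 86,400 seconds per day
row_bytes = 200                    # assumed average size of one order row
daily_gb = per_day * row_bytes / 1e9

print(f"{per_day:,} orders/day, roughly {daily_gb:.1f} GB/day at {row_bytes} B/row")
```

Tens of millions of rows per day from a single, structurally simple process: this is why partitioning and scale-out, not model governance, set the agenda in this subquadrant.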

Bottom-Right: Enterprise Data Warehouses (EDWs)

When we talk about enterprises, we look at a whole bunch of underlying business processes: financials, HR, CRM, supply chain, orders, deliveries, billing etc. Each of these processes is typically supported by some operational system which has a related DB in which it stores the data describing the ongoing activities within the respective process. There are natural dependencies and relationships between those processes – e.g. there has to be an order before something is delivered or billed – so it makes sense for business analysts to explore and analyse those business processes not only in an isolated way but also to look at the dependencies and overlaps. Everyone understands that orders might be hampered if the supply chain is not running well. To underline this with facts, the data from the supply chain and the order systems needs to be related and combined to see the mutual impacts.

Data warehouses that cover a large set of business processes within an enterprise are therefore called enterprise data warehouses (EDWs). Their characteristic is the large set of data sources (reflecting the business processes), which, in turn, translates into a large number of (relational) tables. A lot of work is required to cleanse and harmonise the data in those tables. In addition, the dependencies between the business processes and their underlying data are reflected in the semantic modeling on top of those tables. Overall, a lot of knowledge and IP goes into building up an EDW. This sometimes makes it expensive but also extremely valuable.

An EDW does not remain static. It gets changed and adjusted; new sources get added; some models get refined. Changes in the day-to-day business – e.g. changes in a company’s org structure – translate into changes in the EDW. This, by the way, applies to the other DWs mentioned above, too. However, the lifecycle is more prominent with EDWs than in the other cases. In other words: here, the challenges from the model complexity dimension dominate the life of an EDW.

Top-Right: Big Data Warehouses (BDWs)

Finally, there is the top-right quadrant, which has started to become relevant with the advent of big data. Please be aware that “big data” not only refers to data volumes but also to incorporating types of data that have not been used that much so far. Examples are

videos + images,

free text from email or social networks,

complex log and sensor data.

This requires additional technologies, which currently surge in the wider environment of Hadoop, Spark and the like. Those infrastructures are used to complement traditional DWs to form BDWs, aka modern data warehouses, aka big data hubs (BDHs). Basically, those BDWs see challenges from both dimensions, data volume and modeling complexity. The latter is augmented by the fact that models might span various processing and data layers, e.g. Hadoop + RDBMS.

How To Use The DW Quadrant?

Now, how can the DW quadrant help? I have introduced it to various customers and analysts, and it made them think. They always start mapping their respective problems or perspectives to the space outlined by the quadrant. It is useful for explaining and expressing a situation and potential plans for how to evolve a system. Here are two examples:

SAP addresses those two dimensions – or the forces that push along them – via various products: SAP HANA and VORA for the data volume and performance challenges, while BW/4HANA and the tooling for BDH help along the complexity dimension. Obviously, the combination of those products is then well suited to address the case of big data warehouses.

An additional aspect is that no system is static; each evolves over time. In terms of the DW quadrant, this means that you might start bottom-left as a data mart and then grow into one or the other or both dimensions. These dynamics can force you to change tooling and technologies. For example, you might start as a data mart using an open source RDBMS (MySQL et al.) and Emacs (for editing SQL). Over time, data volumes grow – which might require switching to a more scalable and advanced commercial RDBMS product – and/or sources and models are added, which requires a development environment for models that has a repository, SQL-generating graphical editors etc. PowerDesigner or BW/4HANA are examples of the latter.

This blog looks at one of BW/4HANA’s biggest strengths, namely that it embraces both (1) a guided or managed approach – using the highly integrated BW or BW/4-based tools and editors – and (2) a freestyle or SQL-oriented one – as prevalent in many handcrafted data warehouses (DWs) based on a relational database (RDBMS). And it is not restricted to running those approaches side by side! They can also be combined in many ways, which allows you to tap into the best of both worlds. For instance, data can be loaded into an arbitrary table using basic SQL capabilities; that table can then be exposed to BW/4HANA as if it were an infoprovider and secured via BW/4HANA’s rich set of security features.
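The “freestyle” half of that combination can be sketched in a few lines. The snippet below uses SQLite purely as a stand-in for HANA, and the table and view names are invented; in a real landscape the table would live in HANA and its exposure to BW/4HANA would be done with BW/4HANA’s own tools, not plain SQL.

```python
# Minimal sketch of the freestyle, SQL-oriented side (SQLite standing in
# for HANA; all names invented for illustration).
import sqlite3

con = sqlite3.connect(":memory:")

# (2) Freestyle: create an arbitrary table and load it with basic SQL.
con.execute("CREATE TABLE ext_sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO ext_sales VALUES (?, ?)",
                [("EMEA", 120.0), ("APJ", 80.0), ("EMEA", 40.0)])

# A plain SQL view over the handcrafted table: the kind of shape a managed
# consumer (e.g. BW/4HANA treating the table like an infoprovider) could
# then pick up and secure.
con.execute("""CREATE VIEW v_sales AS
               SELECT region, SUM(amount) AS total
               FROM ext_sales GROUP BY region""")

print(con.execute("SELECT * FROM v_sales ORDER BY region").fetchall())
```

The point of the mixed scenario is exactly this hand-over: the table is filled outside the managed toolchain, yet consumed inside it.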

In fact, many SAP customers have one or more BW systems for (1) and one or more DW systems for (2). Those systems depend on each other, as data is copied from one to the other so that each system can provide a coherent view on the data. Keeping such a system landscape in sync is not only a technical challenge. Often, separate IT teams own the respective systems. There is a natural rivalry; they compete for resources, ownership, who has the better SLAs, whose requirements get precedence in situations that affect both teams or systems, and so on. Fig. 1 shows that situation.

Fig. 1: Typical customer landscape with a Business Warehouse (BW) and a SQL-based data warehouse side-by-side.

The reason for the organisational and technical separation shown in fig. 1 is typically that approaches (1) and (2) appear to be mutually exclusive and thus ought to be separated. This has become a common perception and practice. Now, as mentioned above, BW/4HANA offers not only the coexistence of (1) and (2) in one single system but also synergetic combinations of (1) and (2) – see figure 2.

Fig. 2: BW/4HANA combines the best of both worlds in one and the same system.

Examples of synergies between (1) and (2) – the frequently cited mixed scenarios – have been documented in various presentations, webinars, blogs and the like, sometimes still in the context of BW-on-HANA, but all of that is even more applicable now to BW/4HANA, as the latter has seen a number of enhancements. Here is a non-exhaustive list of material:

BW/4HANA → SQL: Most of the BW/4HANA-based data objects (i.e. infoproviders but also BW queries) can be exposed as SQL-consumable views, potentially with a loss of some semantics.

BW/4HANA ⇄ SQL: There are a number of “exit options” that allow you to add SQL, SQLScript, R or any other HANA-supported code to BW/4HANA processing. The most popular place is the HANA Analysis Process (HAP) in BW/4HANA.

There is an excellent series of short videos that introduce the native data store object (NDSO) for HANA. The NDSO can be considered a more intelligent table that, in particular, allows you to capture deltas. This is especially useful when data is regularly loaded and transformed or cleansed afterwards: rather than going through the complete data set in the table, one can focus on the changes since the last transformation or cleansing happened. This reduces the amount of data that needs to be processed and thus increases the throughput / performance of the process. Frequently, the effect is significant. The DSO idea originated in SAP’s Business Warehouse (BW) and led to the more versatile and powerful advanced DSO (ADSO) in BW/4HANA.
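The delta principle behind the NDSO can be illustrated with a toy example. This is a hedged sketch, not the NDSO’s actual implementation: SQLite stands in for HANA, and the schema (a request id per load plus a consumer-side watermark) is invented to show the idea that a downstream step reads only the rows added since its last run instead of the full table.

```python
# Toy sketch of delta capture: each load gets an increasing request id,
# and a consumer processes only rows beyond its last-seen watermark.
# (SQLite stands in for HANA; schema and names are invented.)
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dso (request_id INTEGER, key TEXT, value REAL)")

def load(request_id, rows):
    """Append one load (request) worth of rows."""
    con.executemany("INSERT INTO dso VALUES (?, ?, ?)",
                    [(request_id, k, v) for k, v in rows])

def fetch_delta(watermark):
    """Return only rows loaded after `watermark`, plus the new watermark."""
    rows = con.execute(
        "SELECT key, value FROM dso WHERE request_id > ?",
        (watermark,)).fetchall()
    new_wm = con.execute("SELECT MAX(request_id) FROM dso").fetchone()[0]
    return rows, new_wm

load(1, [("A", 10.0), ("B", 20.0)])
delta, wm = fetch_delta(0)      # first consumer run: sees both rows
load(2, [("C", 5.0)])
delta2, wm2 = fetch_delta(wm)   # second run: sees only the newly loaded row
print(delta2, wm2)
```

With large tables and small daily changes, scanning only the delta instead of the whole table is where the throughput gain described above comes from.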