Performance Tuning and Optimization by Dmitry Tolpeko

Menu

Architecture of a Modern Data Warehouse

The BI and Big Data worlds converge and it affects how we build enterprise data warehouses (EDW) today. In the traditional approach we use a single RDBMS product such Teradata, Oracle, DB2, SQL Server, Netezza, Greenplum or Vertica to serve all needs of BI users. But now it is not enough.

Why? Firstly, we have to deal with too large volume of raw data that is expensive to store in a RDBMS paying $10,000 per each terabyte. Secondly, we need to provide near real time analytics today.

So the modern approach involves three layers, and some analysts call them: cold, warm and hot (see, Forrester’s definition).

Cold Layer

You use Hadoop to store all enterprise data in the format as they arrive from the source systems (Data Lake concept). You do not perform any data transformations, do not create any models. It is a fault-tolerant archive, and you have access to all data for any historical period at any time you need. You do not have to ask system administrators to restore data from a backup anymore. Data is always available.

The archive is active, this means you can run MapReduce, Spark and other data processing frameworks to quite effectively query very large volumes of raw data. Also you can move significant ETL workload to clean data, to build data marts to Hadoop (70% and more).

We can also call the cold layer as the landing zone or staging area.

Warm Layer

This is where our traditional data warehouse RDBMS lives. Now it mostly uses Hadoop as the data source but typically it also gets data directly from the transactional databases (OLTP). Some data marts can be already pre-built in Hadoop, so ETL workload is not so large at this layer.

We still need to use a mature RDBMS for better query performance, data governance, transactional support, MDM, legacy business logic and so on.

Hot Layer

The hot layer help us provide extremely low latency query execution and near real time analytics. Depending on the requirements we can use an in-memory database, NoSQL database or messaging system (MQ).

In a Single System?

You see that the data warehouse architecture evolved to deal with the increased data volumes and requirements for interactive data exploration. Can this cold-warm-hot architecture be implemented in a single product? Not yet:

Hadoop

Spark data processing may be significantly fast compared with MapReduce, but it is still slower than RDBMS when you query relatively small data marts.

RDBMS

RDBMS products should offer a cheap storage option to store huge amounts of data, and should simplify loading of unstructured data. Additionally some sort of schema-on-read SQL support is also required to compete with Hadoop.

In-memory Databases

You simply cannot use them to store all amounts of raw data.

NoSQL Databases

NoSQL databases are mostly key-value stores and they do not offer a rich set of batch analytics features as Hadoop and RDBMS.