Managing Diverse Workloads in the Data Warehouse

The modern data warehouse is no longer a single system serving all enterprise users; it has grown into an ecosystem of diverse systems, geographically distributed on different technology platforms, with data shared across those platforms. Along with the growth of the platforms comes the challenge of managing workloads across them.

Traditional database systems have always faced performance and scalability issues on an integrated platform, whether executing user-generated SQL or SQL generated by ETL (extract, transform and load) or CDC (change data capture) processes. Performance degrades further under mixed workloads that load data, query data and support user-generated ad hoc analytical queries at the same time. Some categories of data generate workloads that are more complex to manage than others.

The situation today is even more complex with the advent of Big Data and its processing complexities. Most of the database vendors have already forayed into connecting into the Big Data world with their own connectors and associated technologies.

Figure 1: Data Footprint of an Enterprise Across Geographies

Figure 1 shows the most common data footprint for enterprises operating across many time zones and regions. In some cases the datacenters are spread across different areas of the world; in others, the servers and systems sit within one datacenter in the USA. Managing data across these datacenters presents many challenges:

Replicating data across the datacenters

Managing workloads across the datacenters, following a follow-the-sun model

Managing user access to the appropriate data across datacenters

Managing query balancing across systems and datacenters

Managing routing and access of data and systems

Managing data in this type of situation requires a combination of algorithms and optimization models: intelligent network routers, flexible data adapters, high-speed data movement, data replication and rebuild, and systems stabilization and optimization. While most of the workload-related algorithms have matured over the last few years, there are still issues with the management and processing of data and queries across the network and systems.
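As a small illustration of the follow-the-sun idea, the sketch below routes a query to a datacenter based on the current UTC hour. The datacenter names and business-hour windows are hypothetical; a real implementation would sit behind the intelligent network routers described above.

```python
# Hypothetical "follow the sun" routing table: each datacenter serves
# the UTC hours during which its region is in its business day.
DATACENTER_HOURS = {
    "us-east": range(13, 21),   # roughly 9am-5pm US Eastern
    "eu-west": range(8, 16),    # roughly 9am-5pm Central European
    "ap-south": range(3, 11),   # roughly 9am-5pm India
}

def route_query(utc_hour):
    """Pick the datacenter whose business window covers the given UTC hour.

    Falls back to the first configured datacenter when no window matches,
    so a query is never left unrouted.
    """
    for dc, hours in DATACENTER_HOURS.items():
        if utc_hour in hours:
            return dc
    return next(iter(DATACENTER_HOURS))

print(route_query(5))   # falls in ap-south's business window
print(route_query(14))  # us-east matches first, though eu-west also covers 14:00
```

Overlapping windows are resolved by dictionary order here; a production router would also weigh replication lag and current system load before choosing a target.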

Figure 2: A Clogged Workload Management Scenario

Figure 2 illustrates the situation experienced by system users in the diverse workload-based data warehouse. The mixed workload of business intelligence, analytics, ETL and Big Data queries creates such congestion across the entire infrastructure, from storage to network to server and database, that performance and quality of service take a severe beating. Workarounds and optimizations are available, but they fall short because these tools and techniques cannot integrate across diverse platforms. Even when the enterprise is on the same platform, running different versions of the database on different server footprints makes these systems harder to manage. The lack of an integrated systems and database management platform that can seamlessly orchestrate the infrastructure while maintaining quality of service and performance has become the most complex hurdle for enterprises today.

To understand these requirements, we first need to understand the workloads themselves. The most common types of workloads include the following:

Database Oriented

Business intelligence workloads – Standard and ad hoc reports executed by business users during different times of the day

Storage Workloads – Related to handling the storage and retrieval of data
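As a toy illustration of these categories, the sketch below tags incoming SQL statements with a workload type. The keyword rules are deliberately naive and only stand in for the far richer classification a real workload manager would perform.

```python
def classify_workload(sql):
    """Tag a SQL statement with an illustrative workload category."""
    s = sql.strip().lower()
    # Statements that write or bulk-move data map to storage workloads.
    if s.startswith(("insert", "update", "delete", "copy", "load")):
        return "storage"
    # Report-style aggregation queries map to business intelligence workloads.
    if "group by" in s or "sum(" in s or "avg(" in s:
        return "business_intelligence"
    # Everything else is treated as user-generated ad hoc SQL.
    return "ad_hoc"

print(classify_workload("LOAD DATA INFILE 'x.csv'"))
print(classify_workload("SELECT region, SUM(sales) FROM f GROUP BY region"))
```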

These workloads can become complex depending on the size and types of data that will be managed in the enterprise. Figure 3 shows the future state requirement for managing workloads in the next generation data warehouse, business intelligence and analytics architecture.

As shown in Figure 3, the future state data architecture requires a two-part management system. One part of this solution is related to managing the workloads and assigning them to the right systems based on the resources needed. This is similar to traffic patterns in China where vehicles are assigned lanes to avoid undue congestion, and the flow of traffic is uninterrupted. The other part of the system is designed to provide the user application with intelligent assistance to manage the query generation based on the resources available. This is similar to cars of the future with lane and park assist. This kind of streamlining in the infrastructure layers of the data warehouse and Big Data platforms is where the workload management fabric will need to evolve. All the major database vendors are making efforts to get to this state of nirvana. What will it take to provide this kind of workload management capability? There are a few conditions:

Standardization of tools and technologies in the enterprise needs to be adopted

With some or all of these requirements laid out, we can build a future state platform where the data is isolated based on the workload environment, and users can seamlessly move between different environments as each of them is self-contained to manage any type of workload related to the data.
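The lane-assignment idea described earlier can be sketched as a small resource-aware router that sends each workload to the eligible system with the most headroom. The system names and capacity units below are hypothetical.

```python
# Hypothetical systems with a simple capacity/load model.
systems = {
    "dw-primary": {"capacity": 100, "load": 0},
    "dw-analytics": {"capacity": 60, "load": 0},
    "bigdata-cluster": {"capacity": 200, "load": 0},
}

def assign(workload_cost):
    """Route a workload to the eligible system with the most free headroom."""
    eligible = {
        name: s for name, s in systems.items()
        if s["capacity"] - s["load"] >= workload_cost
    }
    if not eligible:
        # In a real manager the workload would be queued, not rejected.
        raise RuntimeError("no system has enough headroom; queue the workload")
    name = max(eligible, key=lambda n: eligible[n]["capacity"] - eligible[n]["load"])
    systems[name]["load"] += workload_cost
    return name

print(assign(50))   # the cluster with the most headroom wins
print(assign(120))  # only a large system can still fit this workload
```

Like lane assignment in traffic, the point is that workloads never contend for a lane that cannot carry them; the greedy headroom rule is only one of many possible placement policies.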

What are the benefits of such a system? Let us look at the different benefits that can be derived from this type of architecture:

Increased performance efficiencies

Streamlined systems with optimal performance

Effective management of user expectations

Utilization of all infrastructure investments across the geographies

Increased ROI from infrastructure investments

Lower TCO from an infrastructure spend perspective

Standardization of tools and technologies

There are a few steps that every enterprise needs to take to get to these types of benefits:

Decide on which set of heterogeneous technologies to standardize. There is a caveat here – the databases need to be on a similar platform, while the applications can run on heterogeneous platforms.

Determine how much time will be needed before ROI can be realized.

Set the overall spending on the new infrastructure. How much can be saved from retired platforms over what period of time?

Calculate the TCO and ROI in the new architecture.
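The ROI and TCO steps above reduce to simple payback arithmetic; the figures in the example below are made up for illustration only.

```python
def payback_months(new_platform_cost, monthly_savings_from_retired_platforms):
    """Months until cumulative savings recover the new infrastructure spend."""
    if monthly_savings_from_retired_platforms <= 0:
        raise ValueError("no savings: ROI is never realized")
    return round(new_platform_cost / monthly_savings_from_retired_platforms, 1)

# e.g. a $1.2M platform funded by retiring systems that cost $60K per month
print(payback_months(1_200_000, 60_000))  # 20.0 months
```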

Creating a program with a roadmap, answering some of these questions and moving to an implementation phase will ensure success in turning this future state vision into reality.

Krish Krishnan is Principal at Sixth Sense and a recognized expert in the strategy, architecture and implementation of high performance data warehousing solutions. He is a visionary data warehouse thought leader and an independent analyst, writing and speaking at industry leading conferences, user groups and trade publications. He co-authored a book with Bill Inmon entitled “Building The Unstructured Data Warehouse” and has authored three eBooks, more than 75 articles, viewpoints and case studies on business intelligence, data warehousing, data warehouse appliances and architectures.
