Introduction

In a large-scale data warehouse, it is critical to design online analytical processing (OLAP) cubes to provide maximum performance. This paper discusses the design and implementation of a series of test scenarios that include building and querying cubes using Microsoft® SQL Server™ 2000 and Analysis Services. You will see the tests and their results, along with analyses of what the results show about optimization and performance in this environment.

When designing SQL Server Analysis Services cubes, you must make important decisions about the storage mode and the level of aggregation. Storage modes in Analysis Services are listed in the following table.

Storage mode | Description
------------ | -----------
Relational OLAP (ROLAP) | Fact data and aggregations are stored in the relational database server.
Multidimensional OLAP (MOLAP) | Fact data and aggregations are stored on the OLAP server in an optimized multidimensional format.
Hybrid OLAP (HOLAP) | Fact data is stored in the relational database server, and aggregations are stored on the OLAP server in an optimized multidimensional format.

In Analysis Services, aggregations are precalculated sums of fact table data at some combination of levels from each dimension. These aggregations are used to answer queries and create additional aggregates. When choosing the amount (by percentage) of aggregation to include in a cube, consider the issues of storage space and query response time. Precalculating all possible aggregates would greatly increase the storage requirements for a database. On the other hand, calculating all the aggregates at query time would result in slow query response time. This paper provides empirical observations comparing the different storage modes and various levels of aggregation for a large data set. For additional information about storage and aggregation, consult SQL Server Books Online.
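The storage-versus-time tradeoff comes from the combinatorics of aggregation: each candidate aggregate is one combination of a level from every dimension. A minimal Python sketch (using the level counts of the five dimensions described later in this paper, plus the implicit "All" level) shows how the candidate count grows:

```python
from math import prod

# Each candidate aggregation chooses one level per dimension, counting
# the implicit "All" level, so the count is the product of (levels + 1).
# Level counts match the five cube dimensions used in this paper.
levels = {"Household": 2, "Product": 1, "Region": 2, "Time": 3, "CustSegment": 1}
candidates = prod(n + 1 for n in levels.values())
print(candidates)  # 3 * 2 * 3 * 4 * 2 = 144
```

Even with these small dimensions there are 144 candidate aggregations; the aggregation percentage controls how many of them are actually precalculated.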

To help you evaluate the different storage modes and aggregation levels, this paper includes:

Cube processing times

A comparison of disk space requirements

A comparison of Multidimensional Expressions (MDX) query execution times

A comparison of query execution times with warm and cold caches

A comparison of the average CPU usage on the relational database server and the analysis server

This paper also includes figures comparing disk space requirements for building OLAP cubes and star schema data marts, and discusses the benefits of building and querying OLAP cubes with Analysis Services compared to traditional querying with SQL.

The tests and results that inform this paper come from a set of business questions for a bank profitability application. We compared execution times of MDX queries (against the OLAP server) with SQL queries (against SQL Server) for the same business question. Some optimization tips that we used in the process of making OLAP cubes have also been included. Further, for some of the tests, you will see performance differences between the previous version of the SQL Server analysis tool (OLAP Services) and the new version (Analysis Services).

This paper explains the design of OLAP cubes used in modeling our bank profitability scenario. For the source data, we used Unisys Corporation’s Banking Profitability 2 Terabyte (TB) SQL Server version 7.0 VLDB. This VLDB is the largest commercial database deployed with SQL Server 7.0 in an enterprise-class environment using Microsoft Windows NT®. The 2+ TB database contains data representing two years' worth of transactions conducted by 4 million banking customers with 10 million accounts.

We started with the development version of the Banking Profitability database (approximately 130 gigabytes [GB]), and designed a star schema for the business processes that we wanted to represent in the OLAP cubes. This scenario allowed users to query a cube based on a Bank Account Profitability fact table containing approximately 13 million records. The fact table contained information about economic income earned in a two-year period (24 monthly periods from January 1996 through December 1997) from various products and customer segments.

Test System Configuration

The hardware used for this set of experiments is as follows:

Two servers with similar configurations (Unisys e-@ction Aquanta ES5045R servers):

4 Intel Xeon 550-megahertz (MHz) CPUs

512 megabyte (MB) cache

4 GB RAM

Unisys OSM7700 Fiber channel data storage:

RAID 5 disk arrays using five 9 GB disks

Network: 100-Mbps Ethernet

We used the following servers for testing:

bbnt13 is the name of the RDBMS server running SQL Server 2000 that houses the relational star schema for the VLDBMart database.

bbnt16 is the name of the analysis server running Analysis Services that stores the OLAP cubes and multidimensional data.

Figure 1. System Configuration

Designing the Test Environment

This section describes how we built our test configuration, including identifying common business questions, identifying existing information, creating a star schema, populating the data mart, and building OLAP dimensions and cubes.

Defining Questions That Need to Be Answered

The following questions are representative of the types of questions that a customer profitability system can answer.

Query number | Question
------------ | --------
1 | What is the average economic income per household for the last two years (1996 and 1997) from each of the products?
2 | What is the income earned from customers in the high balance category for each year and for various products?
3 | What is the moving average of economic income between various months?
4 | What is the economic income of various customers as a percentage of total economic income of the ZIP code in which those customers live?
5 | What is the average economic income per household for January 1996 from each of the products?
6 | What is the economic income for 1996 and 1997, and what is the comparison of income for these years by each customer segment?
7 | What is the economic income for the first quarters of 1996 and 1997, and what is the comparison of income for these quarters broken out by customer segment?
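Question 3's moving average is a windowed mean over the monthly snapshots. A minimal Python sketch (hypothetical income values, three-month window) illustrates the calculation the moving-average query performs:

```python
# Moving average over monthly economic income (hypothetical values).
# A three-month window averages each month with its two predecessors.
def moving_average(values, window=3):
    return [sum(values[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(values))]

monthly_income = [100.0, 120.0, 110.0, 130.0, 125.0]
print(moving_average(monthly_income))  # first value: (100 + 120 + 110) / 3 = 110.0
```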

Creating Bank Data Descriptions

Based on these questions, we identified appropriate source information in the existing VLDB ER diagram. Tables identified are as follows:

The Product table contains information about the type of accounts a bank has, such as checking account or savings account.

The Customer Segment table provides a way to categorize bank customers, such as customers who carry a high balance or customers who are borrowers.

The Period table contains the time period for the profitability information. This database has data for 2 years (1996 and 1997), or 24 one-month periods.

The Region table contains the geographic area in which each bank is located.

The Household table contains customer units that may consist of multiple accounts.

Creating a Star Schema (Dimensional Model) for the Cube to Be Generated

We designed and built a star schema (for the OLAP cube) to answer the identified set of questions about account profitability. The Account_Prof_Fact fact table was constructed from account profitability tables for all periods (January 1996 through December 1997). This fact table records account profitability information, such as economic income and expenses for various products at a monthly snapshot level. This snapshot contains totals for all measures for each month. Five dimensions were identified: Product, Time, Region, Household, and Customer Segment. The following illustration shows the star schema.

Figure 2. Star Schema with Facts and Dimensions
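The star schema can be modeled as record types: one fact row carries the snapshot measures plus a foreign key into each of the five dimensions. A minimal Python sketch (field names are illustrative, not the exact column names) is:

```python
from dataclasses import dataclass

# One row of the fact table: five dimension keys plus (a subset of) the
# monthly snapshot measures. Field names are illustrative.
@dataclass
class AccountProfFact:
    product_id: int      # -> ProductDim
    period_id: int       # -> TimeDim
    region_id: int       # -> RegionDim
    household_id: int    # -> HouseholdDim
    cust_seg_id: int     # -> CustSegmentDim
    economic_income: float
    expenses: float

row = AccountProfFact(1, 199601, 5, 42, 2, 1250.0, 310.5)
print(row.economic_income - row.expenses)  # 939.5
```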

Creating and Populating the SQL Server Data Mart

We used SQL Server 2000 Data Transformation Services (DTS) to populate the fact and dimension tables in the data mart. We combined all the period tables into one fact table, called Account_prof_fact, that has data for all 24 periods for both years. The fact table has approximately 13 million records in it.

Data mart table | Rows | Size
--------------- | ---- | ----
Account_Prof_Fact | 13,036,152 | 5188.00 MB
CustSegmentDim | 7 | 0.03 MB
HouseholdDim | 200,001 | 38.56 MB
ProductDim | 14 | 0.03 MB
RegionDim | 51 | 0.04 MB
TimeDim | 24 | 0.03 MB

Building OLAP Cubes

Next, we created a multidimensional OLAP database called AccountProfitabilityOLAPDatabase. In this database, we created 12 cubes with identical structure but different storage types and aggregation levels. The following illustration shows the cube structure for one of the cubes.

Figure 3. Cube Design

Here is a summary of the 12 cubes built. Although the cubes have identical structure, their storage types and aggregation levels are different.

Cube name | Storage type | Aggregation percentage
--------- | ------------ | ----------------------
AccountProfitabilityCubeM0 | MOLAP | 0
AccountProfitabilityCubeM30 | MOLAP | 30
AccountProfitabilityCubeM60 | MOLAP | 60
AccountProfitabilityCubeM90 | MOLAP | 90
AccountProfitabilityCubeH0 | HOLAP | 0
AccountProfitabilityCubeH30 | HOLAP | 30
AccountProfitabilityCubeH60 | HOLAP | 60
AccountProfitabilityCubeH90 | HOLAP | 90
AccountProfitabilityCubeR0 | ROLAP | 0
AccountProfitabilityCubeR30 | ROLAP | 30
AccountProfitabilityCubeR60 | ROLAP | 60
AccountProfitabilityCubeR90 | ROLAP | 90
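The naming scheme is simply the cross product of storage mode and aggregation level; a short Python sketch generates the same 12 names:

```python
from itertools import product

# 3 storage modes (M, H, R) x 4 aggregation levels = 12 cubes.
names = [f"AccountProfitabilityCube{mode}{level}"
         for mode, level in product("MHR", (0, 30, 60, 90))]
print(len(names), names[0], names[-1])  # 12 ...CubeM0 ...CubeR90
```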

We chose eight measures from those shown in Figure 3. The underlying fact table contains 13 million rows. The following table describes the measures included in the cubes.

Fact measure | Simple description
------------ | ------------------
Economic Income | Income that the bank receives from households over a period of time. Economic income signifies net operating profit or loss adjusted for taxes after subtracting the return to capital for a particular period of time.
Spread | Spread income that the bank receives from households over a period of time. Spread is the difference between lending and deposit interest rates for a particular period of time.
Fee Revenue | Revenue that the bank receives in fees from households over a period of time.
Loan Loss Provision | Loan loss provision that the bank incurs from households.
Product Cost | Cost associated with a particular product for a period of time.
Expenses | Expenses that the bank incurs from households over a period of time.
Net Income | Net income signifies net operating profit or loss adjusted for taxes after subtracting interest expenses and dividends for a particular period of time.
Transaction Cost | Transaction cost that the bank incurs from households over a period of time.

Based on the dimension tables in the star schema data mart, we included five dimensions in the cubes. Details are shown in the following table.

Dimension in the cube | Rows in star schema table | Levels in the cube dimension | Size in Analysis Services
--------------------- | ------------------------- | ---------------------------- | -------------------------
HouseholdDim | 200,001 | 2 | 19,128 KB
ProductDim | 14 | 1 | 3 KB
RegionDim | 51 | 2 | 8 KB
TimeDim | 24 | 3 | 5 KB
CustSegmentDim | 7 | 1 | 2 KB

The following illustration shows sample data for this cube after it was processed.

Figure 4. Processed Cube Structure

Processing Results for Each Storage Type

In all the graphs that follow, we chose aggregation levels of 0, 30, 60, and 90 percent. While most applications use an aggregation level between 30 and 60 percent, values of 0 and 90 were included to provide data for comparison. Keep in mind that the aggregation percentage represents the expected improvement in query performance compared to having no precalculated aggregations.

Processing Times for Each Storage Type

The following results were obtained when identically structured cubes using different data storage modes and aggregation levels were processed.


Figure 5. Processing Times for Cubes

This data shows that:

At 0 percent aggregation, ROLAP took minimal time to process the cube. Fact and dimension table data is not included in the cube, and no aggregations are calculated.

As the aggregation level increases, ROLAP takes longer than MOLAP or HOLAP to process the cube.

The difference in processing times for MOLAP and HOLAP cubes between 30 and 60 percent aggregation was not substantial.

Processing time increases between 60 and 90 percent aggregation for MOLAP and HOLAP, but not substantially.

Processing time increases exponentially between 60 and 90 percent aggregation for ROLAP.

Disk Space Requirements for Each Storage Type

The following illustration shows the space requirements (in MB) for each storage type.

Figure 6. Disk Space Required for Cubes

This data shows that:

MOLAP storage takes more space than HOLAP and ROLAP. (MOLAP cubes contain a copy of original facts and dimensions.)

The difference in disk space used between 0 and 60 percent aggregation is not substantial for MOLAP and HOLAP. It increases as the aggregation level approaches 90 percent.

HOLAP storage uses minimal space. This is because a copy of original facts and dimensions is not made in the OLAP database, and aggregations are stored in the OLAP database in an efficient multidimensional format.

ROLAP storage requires additional space as the aggregation level exceeds 30 percent and as it approaches 90 percent. (The ROLAP space shown here includes the space required by aggregations in the relational database.)

Disk Requirements for MOLAP Cubes Compared to Star Schema

This table compares the amount of space required by the MOLAP cubes to the size of the original star schema (fact table and the dimension tables) in the RDBMS.

Aggregation level (as percentage) | Disk space used by MOLAP cube (MB) | Size of the star schema (fact and dimension tables with indexes, MB) | Percentage of data compression in building MOLAP cubes
--------------------------------- | ---------------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------
60 | 335.75 | 5188 | 93.53
90 | 353.11 | 5188 | 93.19

The space taken by MOLAP cubes as compared to the size of original star schema tables was approximately 7 percent. Even at the 90 percent aggregation level, we achieved approximately the same amount of data compression. Analysis Services compresses rather than explodes data. The additional space required to build MOLAP cubes varies with the number of levels in the dimension, the number of measures, and the type of data.
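The compression percentages in the table follow directly from the sizes. A sketch of the arithmetic, using the figures above:

```python
# Compression = (1 - cube size / star schema size) * 100, per the table.
star_schema_mb = 5188.0
for cube_mb in (335.75, 353.11):
    pct = (1 - cube_mb / star_schema_mb) * 100
    print(f"{cube_mb} MB -> {pct:.2f}% compression")  # 93.53 and 93.19
```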

Developing MDX Queries

In this step we developed MDX queries for our set of business questions (listed previously). An example is shown here:

"What is the economic income for the first quarters of 1996 and 1997 and comparison of income for these quarters/years by each customer segment/product?"

MDX Query

WITH
// These clauses define calculations that are done on the fly in answering the query
MEMBER [Measures].[Economic Income for 1997]
AS '([Measures].[Economic Income], [TimeDim].[All TimeDim].[1997])'
MEMBER [Measures].[Economic Income for 1996]
AS '([Measures].[Economic Income], [TimeDim].[All TimeDim].[1996])'
MEMBER [Measures].[Economic Income change between 1996 & 1997]
AS '([Measures].[Economic Income], [TimeDim].[All TimeDim].[1997]) -
    ([Measures].[Economic Income], [TimeDim].[All TimeDim].[1996])'
SELECT
// This part defines the report: what goes on rows and columns
{
  [Measures].[Economic Income for 1997],
  [Measures].[Economic Income for 1996],
  [Measures].[Economic Income change between 1996 & 1997]
} ON columns,
{ [CustomerSegmentDim].[Cust Seg Id].members } ON rows
FROM AllAccountProfitCube

Results:

Customer segment | Income for Q1 1996 | Income for Q1 1997 | Change
---------------- | ------------------ | ------------------ | ------
High Balance | $4,770,072.07 | $4,754,855.55 | ($15,216.52)
Traditional | $1,010,574.24 | $1,023,682.89 | $13,108.65
Borrowers | $653,171.90 | $673,249.77 | $20,077.87
Self Reliant | ($9,266.94) | ($10,743.89) | ($1,476.94)
Heavy Transactors | ($818,666.25) | ($827,246.87) | ($8,580.63)
Credit Challenged | ($945,083.51) | ($932,928.64) | $12,154.87
Dormant | ($795,301.05) | ($793,679.22) | $1,621.83

MDX queries for the remaining business questions are as follows.

Query 1:

WITH member [Measures].[Distinct Household Count] as
'Count(
Crossjoin
} ON columns,
{ [TimeDim].[Year].members } ON rows
FROM AllAccountProfitCube

Comparison of MDX with SQL Queries

The following table shows the MDX and SQL queries and execution times for business question 1, "What is the average economic income per household for the last two years (1996 and 1997) from each of the products?"

MDX query (execution time: 4 seconds):

WITH MEMBER [Measures].[Average Economic Income] AS
'Sum({ [Measures].[Economic Income] }) / ([Measures].[Distinct Household Count])'
SELECT
{ [Measures].[Average Economic Income] } ON columns,
{ [ProductDim].[Product Id].members } ON rows
FROM AccountProfitabilityH0

SQL query (execution time: 88 seconds):

CREATE TABLE #qry1_temp1 (product_id INT, households INT, totalei MONEY)

INSERT INTO #qry1_temp1 (product_id, households, totalei)
SELECT product_id, COUNT(DISTINCT household_id), SUM(economic_income)
FROM VLDBMart.dbo.Account_prof_fact
GROUP BY product_id

SELECT a.prod_name, b.totalei / b.households
FROM VLDBMart.dbo.ProductDim a, #qry1_temp1 b
WHERE a.product_id = b.product_id

The MDX query is simpler and executes much faster.

We also ran additional sets of equivalent queries in different environments: SQL queries were run against SQL Server, and MDX queries were run against a MOLAP cube stored on the OLAP server. Each query was executed after restarting the server, ensuring that no query results were in cache.

Query number | Time taken by MDX query on OLAP server (MOLAP, 60 percent aggregation) | Time taken by SQL query on SQL Server | Approximate number of records accessed
------------ | ----------------------------------------------------------------------- | ------------------------------------- | --------------------------------------
1 | 4 seconds | 88 seconds | 13 million
6 | 10 seconds | 36 seconds | 13 million
7 | 4 seconds | 89 seconds | 13 million

Although comparing SQL query execution time with MDX query execution time could be considered comparing apples and oranges, the results clearly show that querying with MDX and Analysis Services can significantly improve query performance. Because OLAP cubes store precalculated aggregates, query performance in the OLAP environment is generally better than in the relational environment.
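The improvement in the preceding table can be stated as a speedup factor (SQL time divided by MDX time):

```python
# Speedup per query number from the table above: (MDX seconds, SQL seconds).
times = {1: (4, 88), 6: (10, 36), 7: (4, 89)}
for q, (mdx_s, sql_s) in times.items():
    print(f"query {q}: {sql_s / mdx_s:.1f}x faster with MDX")
```

The MDX queries ran 3.6 to roughly 22 times faster than their SQL counterparts.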

Learning to write MDX queries takes some effort; however, considering the performance gains achieved, the effort is worth it. The time difference between SQL queries and MDX queries increases significantly as the number of records in the fact table increases.

Measuring MDX Query Execution Times for Each Storage Type

We wrote MDX queries for all seven of our business questions. Execution times for four of those queries are listed here for various levels of aggregation using MOLAP, ROLAP, or HOLAP, and for cold and warm cache. For the cold cache measurement, no query results were in cache; we restarted the server prior to executing the query. For the warm cache measurement, query results were held in cache. Executing the query and then immediately executing it again resulted in a warm cache. Times were recorded in seconds (not milliseconds). Warm caches yield significantly faster average query processing times than cold caches.
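The cold/warm distinction can be sketched as a server-side result cache keyed by query text. This stand-in (not the Analysis Services implementation) shows why the second execution is nearly instantaneous:

```python
# Stand-in for a server-side query result cache, keyed by query text.
cache = {}

def run_query(mdx, execute):
    if mdx in cache:                 # warm cache: serve the stored result
        return cache[mdx], "warm"
    result = execute(mdx)            # cold cache: compute from storage
    cache[mdx] = result
    return result, "cold"

r1, state1 = run_query("SELECT ... FROM Cube", lambda q: 42)  # slow path
r2, state2 = run_query("SELECT ... FROM Cube", lambda q: 42)  # cache hit
print(state1, state2)  # cold warm
```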

Average Query Times (Cold Cache)

The following graph shows the average query times derived from the previous table for MOLAP, HOLAP, and ROLAP for various levels of aggregation, using a cold cache.

Figure 7. Comparison of Query Times

The following is a zoomed version of the same graph, showing a more detailed view of the aggregation level from 30-90 percent.

Figure 8. Comparison of Query Times For Most Often Used Queries

This data shows that:

MOLAP is fastest in terms of query performance. There are performance gains associated with an increase in aggregation level.

There are substantial gains in terms of query performance for ROLAP and HOLAP as the aggregation level increases from 0 to 60 percent.

There are no substantial gains in query performance as the aggregation level increases from 60 to 90 percent for these storage types.

The following graph shows behavior of individual sets of seven queries for a MOLAP cube as aggregation levels increase:

Figure 9. Comparison of Query Times For Individual Queries

This graph shows the following results:

On average, query time decreases as aggregation levels increase.

This inverse relationship does not always hold true. The Storage Design Wizard tries to design optimal aggregations for a general mix of queries, because it has no information about the specific queries that will be submitted. As the aggregation level increases, different combinations of levels are used, and there is no guarantee that a specific aggregation will remain in the set. This is seen in query 5 in the above data: at one point, an aggregation that helped query 5 was removed in favor of other aggregations that helped more overall. This is an argument for using the Usage-Based Optimization Wizard in Analysis Services, which designs aggregations specifically for the queries that real users have been submitting.

Average Query Times (Warm Cache)

The following graph shows the average query times derived from the previous table for MOLAP, HOLAP, and ROLAP for various levels of aggregation, and using a warm cache.

Figure 10. Query Processing Times

This data shows that:

When a query result is available in the server cache, query times are almost instantaneous (less than one second) for all queries regardless of storage modes or aggregation levels. We have shown it taking one second because we have chosen to use seconds as the constant unit of time for all the graphs.

Average Query Times for Cold Cache vs. Warm Cache:

Figure 11. Comparison of Query Times For Cold vs. Warm Cache

The preceding graph shows that once the cache was warm, results returned in less than a second. You can execute your most frequently used queries as a batch job immediately after cube processing finishes; business users then get results faster.

Average CPU Usage While Querying the Cube

The following chart shows the average processor (CPU) time while querying MOLAP, ROLAP, and HOLAP cubes. The red bar shows the average processor time for the computer named bbnt13, which hosts the SQL Server RDBMS data mart. The blue bar shows the average processor time for the computer named bbnt16, which runs Analysis Services.


Figure 12. CPU Times for Different Storage Types

Observations on Average Processor Time

MOLAP uses processor time only on the Analysis Services computer. It does not use any processor time on the computer that hosts the SQL Server RDBMS data mart.

ROLAP and HOLAP require more processor usage on the RDBMS server computer than on the OLAP server computer. Even in the case of HOLAP, the RDBMS server is used extensively. This may be because the queries chosen for this test scenario did not have many aggregations already designed in Analysis Services, so aggregations had to be calculated at run time from the RDBMS. There could also have been additional system activity on the RDBMS at the time of the test.

Performance Optimization Tips

The best ways to optimize your OLAP cubes fall into two categories: reducing the processing time for the OLAP cubes, and reducing the query time.

Tips for Reducing Processing Time

We reduced processing times from several hours to a couple of minutes by using these techniques:

Use a dimensional schema for your data mart. Star schemas fit well with OLAP cube design and performance. Besides having a primary key on each of the dimension and fact tables, we declared foreign key relationships between the fact and dimension tables. We created a composite index on all foreign keys in the fact table, and indexes on the individual foreign keys, to help speed up processing operations.

Use the Cube Editor to optimize the schema design and minimize the number of joins needed during cube processing. We increased the process buffer size (available through the Properties dialog box for the OLAP server) to 1 GB on a computer with 4 GB of available memory. We also learned a valuable lesson about aggregation levels: we started with an aggregation level of 90 percent, and processing the cubes took eight or nine hours. A better approach is to start with an aggregation level of about 25 percent and apply performance optimization tips if processing time is still high. Then test execution times for your most frequently used queries and slowly increase the aggregation level while monitoring the gains in query times.

We also recommend using the Usage-Based Optimization Wizard to build the additional aggregations needed to speed up queries.
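The indexing pattern in the first tip can be sketched on an in-memory SQLite copy of the schema (table and column names abridged; the production system used SQL Server, not SQLite):

```python
import sqlite3

# In-memory stand-in for the star schema: a composite index on all
# foreign keys in the fact table, plus an index on each individual key.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ProductDim (product_id INTEGER PRIMARY KEY, prod_name TEXT);
CREATE TABLE TimeDim (period_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE Account_Prof_Fact (
    product_id INTEGER REFERENCES ProductDim(product_id),
    period_id  INTEGER REFERENCES TimeDim(period_id),
    economic_income REAL
);
CREATE INDEX ix_fact_fks     ON Account_Prof_Fact (product_id, period_id);
CREATE INDEX ix_fact_product ON Account_Prof_Fact (product_id);
CREATE INDEX ix_fact_period  ON Account_Prof_Fact (period_id);
""")
print(len(con.execute("PRAGMA index_list('Account_Prof_Fact')").fetchall()))  # 3
```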

Conclusion

In this paper, we investigated the performance characteristics of SQL Server 2000 Analysis Services, and provided empirical observations comparing the different storage modes (ROLAP, MOLAP, and HOLAP), and various levels of aggregation for a large data set in the banking domain. We then provided guidelines and performance and optimization tips for designing OLAP cubes, and for implementing queries against the cubes using MDX.

Most importantly:

As the aggregation level increases, ROLAP cubes take longer to process than MOLAP and HOLAP cubes.

MOLAP takes more space than HOLAP and ROLAP; HOLAP uses the least space. The difference in disk space used between 0 and 60 percent aggregation is not substantial for MOLAP, HOLAP, or ROLAP, but it increases substantially as the aggregation level approaches 90 percent.

When OLAP cubes use the MOLAP storage type, the space taken by the cubes was typically only 7 percent of the size of the original star schema (fact table and dimension tables) in the RDBMS. This means that instead of creating a data explosion, Analysis Services actually compresses data while building OLAP cubes. (The space required to build MOLAP cubes varies with the number of levels in the dimensions, the number of measures, and the type of data.)

MOLAP provides the fastest query performance. We found substantial improvements in query time for ROLAP and HOLAP as the aggregation level increased from 0 to 60 percent. It takes significantly less time to run a query whose result is already in cache (warm) than to run it for the first time (cold).

MOLAP uses processor time only on the OLAP server computer; it does not use any processor time on the computer that houses the SQL Server data mart. ROLAP and HOLAP use the CPU on both the RDBMS server computer and the OLAP server computer.

One of the primary design goals of a data warehouse is to support analysis, and one of the primary goals of interactive analysis, Fast Analysis of Shared Multidimensional Information (FASMI), is attainable using Analysis Services. This paper brings empirical evidence to bear demonstrating the high performance of cubes built with Analysis Services.

Acknowledgements

We would like to acknowledge the efforts and talents of a few individuals who were instrumental in helping develop and enhance the content for this white paper.

We are thankful to Len Wyatt, Program Manager for Data Warehouse and Analysis Practices at Microsoft for his continued support and review of this paper. One of Len’s presentations was our inspiration for the performance tests conducted for this paper.

We are also thankful to Dr. Raj Tewari, former Database Solutions Practice director at Unisys and coauthor of a previous paper based on OLAP Services 7.0. We are grateful for his continued inspiration for this paper.

We are especially grateful to Diane Anderson, Manager for Marketing Communications (Global Industries) at Unisys. Without Diane’s help and positive support, this paper and other Analysis Services papers from us would not have become a reality.

Finally, we would like to thank the Analysis Services product team at Microsoft for their excellent feedback and reviews of this paper.