Data Warehousing

March 05, 2009

This is the next post in a series of discussions about data warehousing best practices based on this blog entry from February 11. Today, we look at capturing history in a data warehouse. "The data warehouse is able to present a view of the business at a particular point in time and track Key Performance Indicators (KPIs) over time." There's not too much to argue with here.

One of the primary functions of a data warehouse is to capture history and perform point-in-time reporting and analysis. There are a couple of ways to do this, depending on the architecture and requirements. The most common method in dimensional modeling is the Type 2 Slowly Changing Dimension (SCD). A Type 2 SCD ties a transactional or fact record to its dimension attributes as they were at the time the transaction occurred. Variations on the Type 2 SCD allow us to query the transaction either as it was at the time it occurred or as it appears now. If we don't care how it looked in the past and only care how it looks now, we'll use a Type 1 SCD.
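To make the mechanics concrete, here is a minimal Python sketch of how a Type 2 change might be applied; the table, keys, and dates are my own invention for illustration. The current dimension row is expired, and a new row with its own surrogate key is inserted.

```python
from datetime import date, timedelta

# A hypothetical Type 2 customer dimension: the natural key (cust_id) repeats
# across rows, and each row carries its own surrogate key and validity window.
dim = [
    {"cust_key": 1, "cust_id": "C-1234", "city": "Austin",
     "eff_from": date(2007, 1, 1), "eff_to": date(9999, 12, 31), "current": True},
]

def apply_type2_change(dim, cust_id, new_city, change_date):
    """Expire the current row and insert a new one (a Type 2 change)."""
    for row in dim:
        if row["cust_id"] == cust_id and row["current"]:
            row["eff_to"] = change_date - timedelta(days=1)
            row["current"] = False
    dim.append({"cust_key": max(r["cust_key"] for r in dim) + 1,
                "cust_id": cust_id, "city": new_city,
                "eff_from": change_date, "eff_to": date(9999, 12, 31),
                "current": True})

apply_type2_change(dim, "C-1234", "Dallas", date(2009, 3, 1))
# Facts recorded before 3/1/09 stay keyed to the Austin row (cust_key 1);
# facts recorded afterward pick up the Dallas row (cust_key 2).
```

A Type 1 change, by contrast, would simply overwrite the city in place, discarding the history.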

Let's look at an example using a hotel transaction. Suppose we are looking at a transaction that occurred on 12/20/08. The transaction includes the room rate plus taxes that the guest paid, the type of payment he used, the guest's name and rewards number, and the hotel brand name and location.

Let's say that on 1/1/09, the hotel company changed the brand name of that property from Village Suites to Fairview Inn. Hotel companies, such as Hilton and Marriott, own a number of brands, and there are occasions when they may convert one brand into another.

On 2/1/09, we run a report showing sales by brand for each month in 2008. Do we include that hotel's 2008 transactions under Fairview or Village? That probably depends on the use. If we are tracking revenue in order to calculate bonuses for the Village Suites VP, then he would certainly want us to include those transactions in his bucket (as-was). However, if we are using last year's sales to project Fairview sales for 2009, then we want to include the converted property's sales with Fairview (as-is).

In this case, we need to be able to query it both "as-is" and "as-was". A flexible architecture will allow us to do either. Much of the time, the "as-was" view serves no purpose. For example, suppose we have status codes and we change the description of a particular code. Most of the time, we'll want to see all records associated with the new description.
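Here is a rough Python sketch of how the hotel example might play out against a Type 2 dimension; the surrogate keys, dates, and dollar amount are illustrative, not from any real system. The as-was question is answered by the surrogate key stamped on the fact, while the as-is question re-resolves the hotel's natural key to its current dimension row.

```python
from datetime import date

# A hypothetical Type 2 hotel dimension reflecting the 1/1/09 rebranding.
hotel_dim = [
    {"key": 101, "hotel_id": "H-042", "brand": "Village Suites",
     "eff_from": date(2005, 6, 1), "eff_to": date(2008, 12, 31), "current": False},
    {"key": 102, "hotel_id": "H-042", "brand": "Fairview Inn",
     "eff_from": date(2009, 1, 1), "eff_to": date(9999, 12, 31), "current": True},
]

# The fact row was stamped with the surrogate key in effect when it occurred.
fact = {"hotel_key": 101, "stay_date": date(2008, 12, 20), "revenue": 129.00}

def brand_as_was(fact):
    """The brand as it was when the stay occurred."""
    return next(r["brand"] for r in hotel_dim if r["key"] == fact["hotel_key"])

def brand_as_is(fact):
    """The brand as it appears today: the current row for the same hotel."""
    was = next(r for r in hotel_dim if r["key"] == fact["hotel_key"])
    return next(r["brand"] for r in hotel_dim
                if r["hotel_id"] == was["hotel_id"] and r["current"])

print(brand_as_was(fact))  # Village Suites (credit the Village Suites VP)
print(brand_as_is(fact))   # Fairview Inn (project 2009 Fairview sales)
```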

Determining which method we use to track history is one of the most important decisions we make in the data warehouse design. It affects the flexibility of the warehouse as well as its performance, and we have to balance the two. However, it is also one of the ways to derive the greatest value from the solution, since this historical information is rarely available in any consumable form anywhere else.

February 27, 2009

We are continuing a series of discussions about data warehousing best practices based on my blog entry here. This time we take on choosing the right granularity in a data warehouse. The rule of thumb, and the point I made in the original blog entry, was that you should always choose the lowest level of granularity. That makes sense in that you can always aggregate to a higher level, but you can't go back down to the detail if the data is not stored.

I can give an example of one data warehouse that I am familiar with that has won awards and accolades for best practices. However, a closer look reveals several trade-offs made during design that have had far-reaching and long-lasting effects. During the design process, the decision was made to drop a lot of detail data in order to save space. This data, which many would consider essential to any data warehouse, contained transactional details and customer information. The data warehouse was built for financial use, and the finance organization was unconcerned about these details. They were looking at metrics by product and the chart of accounts.

Was this necessary? Arguably, it was. Since IT projects were funded by the business organization, the finance organization had no incentive to pay for storage or CPU required to process more data than was absolutely required. They simply made the decision to design for their own users' requirements.

However, this decision also had several long-term effects that may not have been taken into consideration. Since the warehouse did not contain details that were important to other departments, such as marketing, sales, and operations, it was rendered useless for a large segment of the company. These organizations, in turn, were forced to build their own data warehouses from the same source data in order to satisfy their users' needs. The end result was that, as a company, they spent more building and maintaining multiple data warehousing systems than they would have spent by investing more in the first one.

Another effect that carried considerable cost was the other organizations' difficulty reconciling their numbers with finance's. Since the detail source data was not maintained by finance, and finance's business rules were buried in thousands of lines of code, users in the other departments spent countless hours trying to resolve differences between the systems.

So, was the right choice to build one true enterprise data warehouse to serve the entire organization?

In theory, the enterprise data warehouse is the best solution. However, we have to recognize that the larger the organization, the more difficult this is to accomplish. Enterprise-level designs in $50 billion-plus companies become almost a fantasy. There have been a few documented successes, but projects of this kind take years and cost millions of dollars.

It is important that there be a chief architect overseeing the development of all information systems. That person must have a handle on the current environment and a vision for integration and reduced costs. I have met a few of these guys who had the title and responsibility but lacked the vision. I have also met a few who were convinced that silos were simply more cost effective and maintainable. Every organization is different and what works for one might not work for another.

So back to the original consideration for this post. Should a data warehouse reflect the lowest level of granularity? In general, yes, it should, but there are a number of other factors that must be considered when making the decision. Ultimately, the answer is that the granularity of the data should reflect the lowest level that has the greatest use to the greatest number of users.

February 19, 2009

In an earlier post, Data Warehouse Basics, I reprinted something that I wrote a while back about what makes a data warehouse. In a series of entries, I am exploring each point in further detail to see if the meaning still holds up. In this post, we take on "Data is conformed".

Conformity in a data warehouse means essentially two things: conformed dimensions and conformed facts. Conformity allows data warehouses to be distributed across the enterprise and unanticipated new sources to be added. It assures that definitions and uses are consistent throughout the warehouse.

Conformed dimensions assure that users will be able to drill across data sources and arrive at conclusions that were not possible without a data warehouse. This is simply assuring that customer id 1234 for John L Smith in one fact table is the same customer id 1234/John L Smith in another fact table. This example certainly seems obvious. However, determining conformed dimensions is not always easy.

Two departments may view the same information very differently. This is often the case between finance and sales or marketing. Finance typically views things as they are tracked in the accounting system, while marketing or sales may view them entirely differently. For example, when tracking sales by location hierarchy, finance may have a hierarchy that is based on legal entities. Marketing, however, may not make any distinction between these legal entities and may be more concerned with brand or geography. These cases can present challenges to modelers in arriving at conformed dimensions.

Conformed facts ensure that the terminology for measures in one fact table means the same in another. Revenue always means revenue, right? Well, again, finance's definition of revenue may be different from marketing's. Once again, we have to ensure that our terminology is consistent. Doing so will allow us to derive meaningful results across fact tables.
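As a toy illustration (the ids, systems, and figures are all made up), here is what drilling across two fact tables looks like when they share a conformed customer dimension and consistently defined revenue measures. If customer id 1234 meant different people in the two source systems, or "revenue" meant different things, the combined total would be meaningless.

```python
# A hypothetical conformed customer dimension shared by both fact tables.
customers = {1234: "John L Smith"}

# Two fact tables from different source systems, keyed by the same
# conformed customer id and using the same definition of revenue.
room_revenue = {1234: 4200.00}   # from the property management system
dining_revenue = {1234: 310.00}  # from the point-of-sale system

# Drill across: combine measures from both fact tables per customer.
for cust_id, name in customers.items():
    total = room_revenue.get(cust_id, 0.0) + dining_revenue.get(cust_id, 0.0)
    print(f"{name}: total revenue {total:.2f}")  # John L Smith: total revenue 4510.00
```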

The concepts of conformed dimensions and facts are the cornerstones of the dimensional methodology. Without them, the data warehouse becomes a collection of siloed data marts.

I extend the concept of conformity to include data naming standards. Consistent naming schemes make the database understandable and usable. So perhaps we should call this one the conformity and consistency of a data warehouse.

February 11, 2009

In my previous post, Data Warehouse Basics, I reprinted something that I wrote a while back about what makes a data warehouse. I want to explore each point in further detail and see if the meaning still holds up. In this post, we take on "Data is comprehensive".

The description says that data must be consolidated from multiple systems. True, but does that in itself make a data warehouse comprehensive?

Not all data warehouses should include all subjects. It is only realistic that some data warehouses are directed towards finance, while others are focused on marketing.

Perhaps it would be more accurate to say that the data is comprehensive around a particular subject. Our data warehouse should reflect all known details about a particular entity so that analysis produced from it is accurate.

For example, we might be performing customer valuation. However, if we have consolidated data from only 3 of our 4 customer-facing systems, how can we be sure that our results are correct?

This happens often in the hospitality industry, where the parent company operates several brands and many offer reward programs in which the customer earns points for stays at any of the chain's brands. In order to evaluate the true value of a customer, we must be able to consolidate information from each of the company's brands. Often, these brands operate on entirely different point-of-sale or property management systems. The data warehouse must capture customer transactions from every system in order to place a true valuation on customers.

So in that respect, data must be comprehensive. However, there is another aspect of comprehensiveness that we should consider. Yes, we should capture every transaction for a customer from every system, but should we store every attribute from the transaction if we are not required to do so?

Experienced data warehouse designers know this can be a fine line to walk. On the one hand, we want to maintain the usability of the warehouse by not overcomplicating it with clutter that has no business relevance or use. On the other hand, when we omit certain data elements, it usually isn't long before someone comes along and requests that the data be added to the warehouse.

The rule of thumb is to include all business information available during the initial design. The cost of additional storage usually pales in comparison to the cost to modify, load, and test additions at a later time.

So to wrap up, yes a data warehouse should be comprehensive around a subject. However, that does not mean that it must include every subject relevant to the company.

A year and a half ago, I posted on ITtoolbox about the basics of data warehousing. That post got a few interesting comments, even one as recent as last week. I thought I would re-post it here and then further explore the topic in subsequent posts. ...

A friend of mine recently asked me to help her develop some ideas for explaining data warehousing to a group of business users. It seems there was some confusion in her company regarding why their so-called data warehouse was really just a data dumping ground and not a true best practices data warehouse. The list below details what I think separates a data warehouse from an archive database, data mart, ODS, etc.

The data warehouse data model should be designed for the following:

1) Data is comprehensive - Data is captured and consolidated from multiple systems.

2) Data is conformed - This is the famous line "single version of the truth". Data elements are conformed so that the definitions of "customer" or "revenue" mean the same thing no matter which system they originated in. Tables are conformed when they can be queried across dimensions and facts without changing the meaning of the results. This is what is needed to truly integrate data in the warehouse.

3) Data is granular - Ideally, we capture and store data at its lowest level of granularity. You can always aggregate up, but you can't drill down if the data isn't stored that way.

4) Data is historical - The data warehouse is able to present a view of the business at a particular point in time and track Key Performance Indicators (KPIs) over time.

5) Data is shared - A data warehouse that cannot be queried or otherwise accessed has little value.

January 22, 2009

The latest Gartner Magic Quadrants for Data Warehouse Database Management Systems, Business Intelligence Platforms, and CPM Suites have been released. You can find them, respectively, here, here, and here. I don't have too much to say about these. Curt Monash has some interesting comments here and here. For the most part, I agree with Curt. These reports are interesting but not always reflective of what is happening in real-world implementations.

For example, Sybase is ranked only slightly above HP in ability to execute, yet Sybase has about 1500 Sybase IQ customers while HP might have as many as 30 Neoview customers. And how many of those are in production? Maybe none of them.

Yes, Sybase IQ and HP Neoview are two different products with different strategies and target markets. That is exactly the point. If you only looked at the MQ, you would think these two products are close, but nothing could be further from the truth.

Compression

Column orientation allows compression algorithms that are not possible, or have limited value, in a row store. For example, run-length encoding replaces a run of repeated values with the value and a count. The sequence (10, 10, 12, 12, 12, 12) could be replaced with (10*2, 12*4). This not only saves space but allows operations directly against the compressed values, saving CPU time that would normally be spent on decompression. In addition, query operations are performed on multiple values at once, saving even more time. Other compression schemes, such as null suppression, bit-vector encoding, and dictionary encoding, may also be used to some extent, depending on the package.
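A minimal Python sketch of the idea, using the sequence from above: the column is collapsed into (value, count) pairs, and an aggregate such as SUM can then be computed directly on the pairs without ever expanding the runs.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode: collapse each run of repeated values into (value, count)."""
    return [(v, len(list(run))) for v, run in groupby(values)]

def rle_sum(encoded):
    """Aggregate directly on the compressed form; no decompression needed."""
    return sum(v * count for v, count in encoded)

column = [10, 10, 12, 12, 12, 12]
encoded = rle_encode(column)
print(encoded)           # [(10, 2), (12, 4)]
print(rle_sum(encoded))  # 68, computed without expanding the runs
```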

Column store compression ratios may vary between 3 times and 20 times relative to the raw data, depending on the profile of the data and the type of compression used. In one case study, ParAccel customer Merkle claims 6 times compression from the raw data and 100x query performance over Oracle (12 minutes down to 7 seconds).

Query Performance

I touched on the query performance advantages in the last post. Because analytical queries tend to select a few columns over a large number of records, a column store can retrieve records more quickly because it doesn't have to scan as much of the disk as a row store. Most row stores process a row at a time, meaning the query executor reads the entire row and then extracts the needed attributes. The column store can treat the entire column as an array, iterating through it very quickly.
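The difference in access pattern can be sketched in a few lines of Python. This is a toy illustration of what each engine has to touch (the table and sizes are made up), not a benchmark: the row-store style loop reads whole records and extracts one attribute, while the column-store style loop scans a single contiguous array.

```python
# The same table in two layouts.
rows = [{"id": i, "state": "TX", "sales": 1.0, "name": "x" * 50}
        for i in range(100_000)]
columns = {"id": list(range(100_000)),
           "state": ["TX"] * 100_000,
           "sales": [1.0] * 100_000,
           "name": ["x" * 50] * 100_000}

# Row-store style: touch each whole record, then extract the needed attribute.
total_row_store = sum(row["sales"] for row in rows)

# Column-store style: scan the one column of interest as a contiguous array.
total_col_store = sum(columns["sales"])

assert total_row_store == total_col_store  # same answer, very different I/O
```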

What's the Catch?

You might be thinking, "That sounds great. Why haven't column stores taken over data warehousing?" Well, the trade-off for column stores is that single-row selects and updates are not very quick. For that matter, neither are the trickle feeds found in near-real-time data warehouses. Column stores operate best with batch loads using their own optimized load utilities. For many organizations that tend to standardize on one of the big three database vendors, the need for flexibility outweighs the need for performance.

Most data warehouses serve multiple purposes. Yes, they are used for analytic queries, but they are also used for research, deconstructing history, fraud detection, and many other applications that may not be optimal for a column store.

Recently, we have begun seeing a shift in the market toward creating separate analytic data marts for intense, ad-hoc queries. Data warehouse appliances have seemingly opened the door, as companies are looking for low cost alternatives to beefing up their existing data warehousing hardware. Several column store vendors have come to market in the past 2 or 3 years. It is still early in their evolution, but it will be quite interesting to see if these products begin to reach mainstream status.

January 20, 2009

Unless you are a DBA or system administrator, you may not have given a lot of thought to how data is stored on disk. Data architects and modelers work in a logical world where data lives in tables and is stored in rows and columns. We are used to thinking of data as 2-dimensional.

The reality is that on disk, data is 1-dimensional. It is the work of the DBMS to translate that 1-dimensional data into the 2-dimensional world that we see. Most of the common relational database management systems that we use today are row oriented. This means that on disk, data is stored by row. If you could see customer records on disk, you would see Ann's record, including all the associated data about her, followed by Betty's, followed by Cathy's, etc. The DBMS keeps track of the order and fetches the right records whenever queries are submitted.

This row-oriented approach works great for transactional systems. The call center agent who is entering a new customer record or updating an old one is processing one record at a time, and the application is usually filling almost every column on the row. To find a particular record, the optimizer scans an index that locates its position on the disk.

However, for analytics, this isn't always as efficient. A typical analytical query might ask for total sales by state for US customers. The analyst might then proceed to break that down by city, then by zip code. If this were written as one query, we would be accessing 3 columns and as many as millions of rows. The database optimizer can either grab every column of every record and then filter out the unnecessary columns and records, or it can scan each row, picking only the columns it needs and ignoring records where the country is not 'US'. Either approach can be rather resource intensive.
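To make the shape of such a query concrete, here is a toy Python version (the rows and figures are made up). Notice that of everything stored on each record, the query touches only three columns: country, state, and sales.

```python
from collections import defaultdict

# Toy rows: each record carries many attributes, but only 3 matter here.
orders = [
    {"country": "US", "state": "TX", "sales": 100.0, "customer": "Ann"},
    {"country": "US", "state": "TX", "sales": 250.0, "customer": "Betty"},
    {"country": "US", "state": "NY", "sales": 300.0, "customer": "Cathy"},
    {"country": "CA", "state": "ON", "sales": 400.0, "customer": "Dave"},
]

# Total sales by state for US customers.
totals = defaultdict(float)
for row in orders:
    if row["country"] == "US":
        totals[row["state"]] += row["sales"]

print(dict(totals))  # {'TX': 350.0, 'NY': 300.0}
```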

Column oriented databases, or column-stores, have been around awhile but have received renewed attention in the past couple of years. A number of new vendors have used the column store approach to bring specialized analytical databases to market. The sole function of these databases is to perform analytics.

At the logical level, column stores are no different from row stores. The same star schema model that works for data warehousing in a row store will work in a column store. The difference is that on disk, the data is stored by column. This takes just a minute to process if you haven't given it thought before: all the values of the first column are stored together, then all the values of the second, and so on. The DBMS keeps track of the row order within a column, just the same as it keeps track of the column order within a row. Does that make sense?
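A tiny Python sketch may help, using the Ann, Betty, and Cathy records from above (the ages are invented). The same logical table can be flattened into a 1-dimensional sequence either row by row or column by column.

```python
# One logical table of (name, age) records...
table = [("Ann", 34), ("Betty", 28), ("Cathy", 41)]

# ...and the two ways a DBMS might lay it out as a 1-dimensional sequence.
row_oriented = [value for record in table for value in record]
column_oriented = [record[i] for i in range(2) for record in table]

print(row_oriented)     # ['Ann', 34, 'Betty', 28, 'Cathy', 41]
print(column_oriented)  # ['Ann', 'Betty', 'Cathy', 34, 28, 41]
```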

So why is this a good thing? Stay tuned, I will discuss the benefits a columnar approach provides in my next post.

January 16, 2009

Perhaps it is just me, but Sybase has to be one of the most underrated companies in the data management industry. I know Sybase has a fairly large install base, and there is a niche of Sybase fans. However, few companies with such a large following seem so obscure.

Adaptive Server Enterprise

Sybase was founded in 1984 as Systemware. In 1988, Sybase partnered with Microsoft to port its SQL Server database to Windows and OS/2. Sybase sold its product as Sybase SQL Server, and Microsoft sold the OS/2 version under the name Microsoft SQL Server. In 1993, Sybase ended the agreement after a dispute with Microsoft over royalties. Microsoft struck a deal to keep the Windows code base, and the two companies went their separate ways. Depending on who you talk to, Microsoft either struck a great deal or Sybase got robbed. Sybase soon changed the name of its product to Sybase Adaptive Server Enterprise to distinguish it from the Microsoft product. While the two products have evolved, they still resemble each other in many ways today.

Over the years, Sybase has built a highly respectable suite of data management tools, but it has never had the marketing prowess of the other big vendors in the industry. Sybase ASE competed heavily with Oracle and Informix through the '90s to become the leading OLTP database vendor. The company gained a niche in certain industries, but it has never risen to the prominence that Oracle and DB2 share today. Still, Sybase ASE has an install base of more than 20,000 customers.

SQL Anywhere

SQL Anywhere began its life as Watcom SQL in 1992. Watcom was acquired by Powersoft in 1994, and Powersoft was acquired by Sybase a year later. Sybase renamed the product SQL Anywhere. Today, SQL Anywhere is marketed by a Sybase subsidiary, iAnywhere. SQL Anywhere is embedded into many applications, such as QuickBooks. It is also known as a leading database solution for mobile technology; an UltraLite version runs on mobile platforms such as Palm, Windows Mobile, and BlackBerry. SQL Anywhere, though not necessarily a large-scale workhorse database, is simple and intuitive and one of the easiest databases on the market to use.

Sybase IQ

Data warehouse professionals have probably heard of Sybase IQ at some point, even if they know little about it. Sybase IQ emerged as an optional engine for Sybase 11 in 1996. Sybase had acquired bit-wise indexing technology in 1994 and put it to work in IQ. However, there was something else that was very different about IQ: rather than storing data to disk by rows, it used column orientation. This means that the data is literally stored on the disk by column.

Column-oriented storage provides significant performance improvements for queries that involve grouping by columns. It also allows additional compression algorithms that significantly reduce storage space. (I will write more about column stores in a future blog post.)

For many years, Sybase IQ was the only significant column store DBMS on the market. More recently, vendors such as ParAccel and Vertica have adopted column orientation and are finding success with it. Sybase IQ has carved out a significant niche for itself in the data warehousing market with very little fanfare. For all its renown, Teradata has an install base of about 1000 customers. That's nothing to laugh at for the company that many consider to have created the data warehouse industry. Sybase IQ has about 1500 customers.

At various times, Sybase IQ has set benchmark records on the TPC-H. In 2007, Sybase announced that it had built the world's largest data warehouse in the lab, loading more than 1 petabyte of raw transactional data. It took 3 weeks to load but filled only 260 terabytes of actual disk space due to compression.

Is Sybase IQ the best data warehouse platform for every situation? No, columnar databases don't work at all for transactional workloads. Often, there is a performance penalty on the loading end, though this has improved somewhat in recent years. In addition, columnar databases work well for certain types of queries, but not so well for others. The important thing is to know your user requirements and use cases.

Sybase software is well respected and has a large install base. Though the company may not receive as much press or marketing attention as others in the space, Sybase deserves consideration if you are in the market for a DBMS.

January 05, 2009

If there is any doubt how serious HP is about data warehousing, check out this article that appeared in the New York Times last week. Of course, HP has been crowing about business intelligence and data warehousing for several years. After all, the company's CEO, Mark Hurd, came from Teradata. Yet since Neoview was introduced in 2006, it has had little impact on the overall data warehousing market. According to the article, they have sold only about 30 or so units.

Some point to HP's collaboration with Oracle on Exadata as a sign that it is less than confident in its ability to execute with Neoview. Again, HP counters this in the article, stating that they are really two products marketed for two different uses. HP sees Neoview as an enterprise-class data warehousing machine, designed to store hundreds of terabytes of data for thousands of users. The Oracle Database Machine is aimed at the appliance market, which tends toward smaller data warehouses or specialty data marts.

Still, HP has a mountain to climb before it can truly compete in this space. Perhaps the EDS acquisition really will provide some momentum and alignment that propels Neoview forward. In the meantime, I wonder whether 2009 will be another year of consolidation or one of innovation.