Data Warehousing: Back to Basics

So, you're about to undertake your first data-warehousing project. Where will you begin? Or maybe you're already implementing a warehouse, but the project is going awry and you're trying to get it back on track. What do you need to know to make it successful? Let's step back from the implementation details and examine some analysis and design roadblocks you need to overcome on your road to a successful data warehouse deployment. Along the way, we'll review the common terminology you need to understand and discuss some challenges you'll face on your quest. Following these guidelines can boost your chances for a successful data warehouse deployment.

Common Terms

First, let's define the crucial pieces of the project: a data warehouse, a data mart, and data warehousing. Although they're often used interchangeably, each has a distinct meaning and impact on the project. A data warehouse is the cohesive data model that defines the central data repository for an organization. An important point is that we don't define a warehouse in terms of the number of databases. Instead, we consider it a complete, integrated data model of the enterprise, regardless of how or where the data is stored.

A data mart is a repository containing data specific to a particular business group in an enterprise. All data in a data mart derives from the data warehouse, and all data relates directly to the enterprisewide data model. Often, data marts contain summarized or aggregated data that the user community can easily consume. Another way to differentiate a data warehouse from a data mart is to look at the data's consumers and format. IT analysts and canned reporting utilities consume warehouse data, whose storage is usually coded and cryptic. The user community consumes data mart data, whose storage is usually in a more readable format. For example, to reduce the need for complex queries and assist business users who might be uncomfortable with the SQL language, data tables could contain the denormalized code table values.

Finally, data warehousing is the process of managing the data warehouse and data marts. This process includes all the ongoing support needs of the refresh cycle, database maintenance, and continual refinements to the underlying data model.

One important aspect of developing a warehouse is having a data dictionary that both the project team and the user community use to derive definitions and understand data elements within the warehouse. This statement seems simple, but when you're pulling data together from many source systems, a common implementation problem (which people usually don't identify until after deployment) involves reconciling similarly named data elements that come from different systems and have subtle differences in meaning. An example of this problem in the health care community is the attribute attending physician. One system, which tracks daily patient activity, defines this term as the physician currently responsible for the patient's care. At the same facility, a system that focuses on patient billing defines it as the physician most affiliated with the visit. Both definitions are correct in their contexts, but the difference illustrates a challenge in trying to combine the two systems' data.

The health care example illustrates a symptom of what we consider the biggest challenge in a data-warehousing project: bringing together the right people from the user and development communities to create a project team. The right people have the business knowledge to explain how the data is used, the systems knowledge to extract it, and the analytical and design skills to bring it together into a warehouse. The difference between other projects and building a warehouse is that individual projects usually focus on one business area, whereas building a data warehouse focuses on combining the data and subsequent knowledge from many projects. The team must have the depth and breadth to cover all the systems involved.

Establishing a Vision

Now that we've identified the largest risk area, let's look at some steps you can take to minimize the risk. To put together the right project team, first define the project's vision and begin to establish its scope. After you do so, you'll see more clearly which users and IT staff members you need to involve.

The vision's purpose is to define the project's ultimate mission from a business perspective. In theory, all work on the project directly or indirectly supports the objectives outlined in the vision. Defining a clear, tangible mission for the project is crucial. When articulated properly, the vision defines relative priorities for the team—schedule, features, and cost. You use it to resolve requirement and implementation decisions throughout the development lifecycle: Tailor your decisions to support the mission and priorities of the vision; omit or defer others to later iterations. The vision creates a theme for the project that serves the entire project development cycle. At the highest level, you require all project activities to achieve the vision's objectives.

For example, let's look at a growing health care organization in which each facility maintains a separate information system. The vision for its warehousing project might be to provide the ability to review, analyze, and track historical data across all facilities in an appropriate and meaningful context. This vision describes an objective that implementing a data warehouse can accomplish.

Defining Scope

After you've established the project's vision, you can set its scope. Next to fielding the wrong team, the inability to define the right scope puts a project at most risk for failure. Scope refers to the potential size of the undertaking—what will be delivered successfully in a meaningful time frame. Often a warehousing project tries to deliver too much, which can result in the project falling dramatically behind schedule or even being canceled. The other extreme, building stovepipes, happens when an organization decides to use many small databases to focus on discrete business areas. Although these combined databases might look like a data warehouse, they're really data-access enhancements (or reporting enhancements) to the operational systems. This implementation isn't a true data warehouse because stovepipes are independent units with no cohesive data model tying them together. In the context of data warehousing, stovepipes achieve no enterprise-level business objectives.

Understanding the definitions we gave earlier is important for arriving at the right scope for the project. Although, by definition, the data warehouse takes into consideration the entire business, you don't need to implement it all at once. When you focus on individual business units within the overall model, design and development proceed iteratively, and you implement one or two areas at a time. Iterative development results in a faster return on investment when you prioritize the business area development, rather than waiting to roll out one massive warehouse at the end. From a scope perspective, you control the size, timing, and cost of each iteration without compromising the integrity of the overall data warehouse.

An often-overlooked aspect of the project is building the infrastructure to sustain the process of data warehousing. All too many warehousing projects break down after deployment because people fail to recognize the ongoing support costs (resources, time, and coordination) of refreshing the data. You might have designed the world's best data model and implemented a great database, but if users don't receive data in a reliable and timely manner, they'll consider the project a failure. Depending on the volatility of the source system data, warehouse data can quickly go stale. To determine the warehouse's refresh intervals, you must have project requirements that identify the rate of change in the source system data and how often the user community needs to see those changes reflected.

Our experience shows that building the appropriate infrastructure to support the data warehousing aspect of the project is as important as designing the data model. So factor the ongoing support needs and the corresponding infrastructure development costs (e.g., to sustain the timely refresh of the data) into the project's scope.

The Essence of Warehousing

So far, we've focused on some of the project-planning issues and high-level design considerations involved in building a warehouse. Now it's time to examine the essence of data warehousing: data acquisition, data transformation, and data presentation. These areas constitute the ongoing process of data warehousing and require a full understanding to avoid data refresh problems.

Data acquisition is the task of bringing data from everywhere to the data warehouse. Most businesses have several operational systems that handle the organization's day-to-day processing. These systems serve as the data source for the warehouse. The systems might reside on a mainframe, in a client/server database application, in a third-party application with a proprietary data store, within desktop applications such as spreadsheets and database applications, or any combination of these. The challenge is to identify the data sources and develop a solution for extracting and delivering the data to the warehouse in a timely, scheduled manner.

After collecting the data, you need to transform it. In an ideal organization, all systems would use the same set of codes and definitions for all data elements. In the real world, as we showed earlier, different codes and definitions exist for what appear to be the same data element. Data transformation is the cleansing and validation of data for accuracy, and ensuring that all values conform to a standard definition. After these data transformation tasks are complete, you can add the data to the warehouse.

Finally, you're ready for data presentation. At this point, the warehouse contains a large, normalized data store containing all (or part) of the organization's data. Great! Unfortunately, the users who need this data can't make sense of it because of its cryptic coding schemes and normalized storage. Data presentation involves taking the data from the data warehouse and getting it into the hands of users in a usable, easy-to-understand format. One way to present the data is to deploy a data mart containing summarized, aggregated data. Or you can put an OLAP engine between the warehouse and the user. Another option is to custom-build a reporting tool or deploy third-party solutions. Identify the most effective way to present the data, and implement it.

In completing these tasks, keep in mind that the data that users receive needs to be consistent, accurate, and timely. Failure to ensure quality data delivery could jeopardize the project's success because users won't work with inaccurate or old data. One way to minimize the risks of bad data is to involve users in the cleansing, validation, and transformation steps of the data transformation task. The more input and familiarity users have with the data validations and transformations, the more confident they'll be about the accuracy of the resulting warehouse data. Also, emphasize to the users the importance of their input into the data validation process. Explain to them that their experience and knowledge make them a necessary part of the project team and ensure the data's validity and integrity.

The Rest Is Up to You

So as you embark on your data warehousing adventure, remember these basic ideas. Carefully define the project's vision and the scope of the first iteration. Inform and involve your users. Know and understand the three major tasks of implementation—data acquisition, data transformation, and data presentation. Finally, during design always keep in mind the consistency, accuracy, and timeliness of the ongoing data delivery. Although we can't guarantee that your warehousing project won't fail, following the basics discussed here will give you a much better chance of success.

From the Blogs

Don’t let bad data sneak up on you when and where you least expect it. Ferret out bad data with Melissa Data’s newest Profiling Component for SSIS. Learn how to take control of your data using knowledge-base-driven metadata. The truth shall set you free!...More

Now that we’ve outlined the process to let servers in a SQL Server AlwaysOn Availability Group "talk to each other" by means of setting up linked servers, it’s possible to set up some additional or improved checks on Availability Group Health....More

In my previous post, I provided a high-level outline of the core logic (and rationale behind that logic) that would be needed to set up regular synchronization checks on SQL Server Agent Jobs for servers where AlwaysOn Availability Groups have been deployed. In this post, I’ll walk through the steps--and the code--needed to setup those checks....More