A Definition of Data Warehousing

Written By: M. Reed

Published: August 18, 2002

Biographical Information

Bill Inmon

Bill Inmon is universally recognized as the "father of the data warehouse."
He has over 26 years of database technology management experience and
data warehouse design expertise, and has published 36 books and more than
350 articles in major computer journals. His books have been translated
into nine languages. He is known globally for his seminars on developing
data warehouses and has been a keynote speaker for every major computing
association. Before founding Pine Cone Systems, Bill was a co-founder
of Prism Solutions, Inc.

Ralph Kimball

Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial
product to use mice, icons, and windows. He was vice president of applications
at Metaphor Computer Systems, and founder and CEO of Red Brick Systems.
He has a Ph.D. from Stanford in electrical engineering, specializing in
man-machine systems. Ralph is a leading proponent of the dimensional approach
to designing large data warehouses. He currently teaches data warehousing
design skills to IT groups, and helps selected clients with specific data
warehouse designs. Ralph is a columnist for Intelligent Enterprise magazine
and has a relationship with Sagent Technology, Inc., a data warehouse
tool vendor. His book "The Data Warehouse Toolkit" is widely recognized
as the seminal work on the subject.

In order to clear up some of the confusion that is rampant in the market,
here are some definitions:

Data Warehouse:

Bill Inmon coined the term Data Warehouse in 1990 and defined it as
follows: "A warehouse is a subject-oriented, integrated, time-variant
and non-volatile collection of data in support of management's decision
making process."

He defined the terms in the sentence as follows:

Subject-Oriented: Data that gives information about
a particular subject rather than about a company's ongoing operations.

Integrated: Data that is gathered into the data
warehouse from a variety of sources and merged into a coherent whole.

Time-variant: All data in the data warehouse is
identified with a particular time period.

Non-volatile: Data is stable in a data warehouse.
More data is added but data is never removed. This enables management to
gain a consistent picture of the business.

(Source: "What is a Data Warehouse?" W.H. Inmon, Prism, Volume 1, Number 1,
1995). This definition remains reasonably accurate almost ten years later.
However, a single-subject data warehouse is typically referred to as a
data mart, while data warehouses are generally enterprise in scope. Also,
data warehouses can be volatile. Because of the large amount of storage a
data warehouse requires (multi-terabyte data warehouses are not uncommon),
only a certain number of periods of history are kept. For instance, if the
warehouse is designed to hold three years of data, then each month the
oldest month is "rolled off" the database and the newest month is added.

Ralph Kimball provided a much simpler definition of a data warehouse. As
stated in his book, "The Data Warehouse Toolkit", on page 310, a data
warehouse is "a copy of transaction data specifically structured for query
and analysis." This definition provides less insight and depth than Mr.
Inmon's, but is no less accurate.

Data Warehousing:

Data warehousing is, in essence, what you must do to create a data
warehouse and what you then do with it. It is the process of creating,
populating, and querying a data warehouse, and it can involve a number of
discrete technologies, such as:

Source System Identification: In order to build
the data warehouse, the appropriate data must be located. Typically, this
will involve both the current OLTP (On-Line Transaction Processing) system
where the "day-to-day" information about the business resides, and historical
data for prior periods, which may be contained in some form of "legacy"
system. Often these legacy systems are not relational databases, so much
effort is required to extract the appropriate data.
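
To make that extraction effort concrete, here is a minimal sketch in
Python that parses one fixed-width "flat file" record; the layout and
field names are invented for the example:

    # Hypothetical fixed-width layout: customer id in columns 0-9, name in
    # columns 10-44, balance (implied two decimal places) in columns 45-54.
    def parse_record(line):
        return {
            "customer_id": int(line[0:10]),
            "name": line[10:45].rstrip(),
            "balance": int(line[45:55]) / 100,
        }

    sample = "0000001001International Business Machines    0000012500"
    print(parse_record(sample))
    # {'customer_id': 1001, 'name': 'International Business Machines',
    #  'balance': 125.0}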

Data Warehouse Design and Creation: This describes
the process of designing the warehouse, with care taken to ensure that the
design supports the types of queries the warehouse will be used for. It
is an involved effort that requires both an understanding of the database
schema to be created and a great deal of interaction with the user
community. Design is usually iterative: the model must be revised a number
of times before it stabilizes. Great care must be taken at this stage,
because once the model is populated with large amounts of data, some of
which may be very difficult to recreate, it cannot easily be changed.
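
As a hedged illustration of what such a schema might look like, here is a
minimal star-schema sketch in the dimensional style Ralph Kimball
advocates; the table and column names are ours, not a prescribed design:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension tables describe the "who" and "when" of each fact.
        CREATE TABLE date_dim     (date_key     INTEGER PRIMARY KEY,
                                   period       TEXT);
        CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY,
                                   name         TEXT);

        -- The fact table holds the measures, keyed to the dimensions.
        CREATE TABLE sales_fact (
            date_key     INTEGER REFERENCES date_dim(date_key),
            customer_key INTEGER REFERENCES customer_dim(customer_key),
            amount       REAL
        );
    """)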

Data Acquisition: This is the process of moving
company data from the source systems into the warehouse. It is often the
most time-consuming and costly effort in a data warehousing project, and
it is performed with software products known as ETL (Extract/Transform/Load)
tools. There are currently over 50 ETL tools on the market. The data
acquisition phase can cost millions of dollars and take months or even
years to complete. Data acquisition then becomes an ongoing, scheduled
process executed to keep the warehouse current to a pre-determined point
in time (e.g., the warehouse is refreshed monthly).
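
The skeleton below sketches what the three ETL steps amount to in code,
reusing the illustrative sales_fact table from the earlier sketch; the
source "orders" table and its columns are assumptions for the example:

    def extract(source_conn):
        # E: pull raw order rows from a (hypothetical) OLTP source table.
        return source_conn.execute(
            "SELECT cust_id, amt, order_date FROM orders")

    def transform(rows):
        # T: reshape each row to the warehouse grain; here, derive the
        # period 'YYYY-MM' from an ISO date string 'YYYY-MM-DD'.
        for cust_id, amt, order_date in rows:
            yield (order_date[:7], cust_id, amt)

    def load(warehouse_conn, rows):
        # L: append into the warehouse (never update in place).
        warehouse_conn.executemany(
            "INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
        warehouse_conn.commit()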

Changed Data Capture: The periodic update of the
warehouse from the transactional system(s) is complicated by the difficulty
of identifying which records in the source have changed since the last update.
This effort is referred to as "changed data capture". Changed data capture
is a field of endeavor in itself, and many products are on the market to
address it. Among the technologies used in this area are replication
servers, publish/subscribe messaging, triggers and stored procedures, and
database log analysis.
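
As one hedged example, the sketch below shows the simplest
timestamp-based scheme, which assumes the source table carries a
last_modified audit column; many legacy systems lack one, which is
precisely why the other technologies listed above exist:

    def capture_changes(source_conn, last_run_timestamp):
        # Timestamp-based CDC: select only the rows touched since the
        # last warehouse load, rather than re-reading the whole table.
        return source_conn.execute(
            "SELECT cust_id, amt, order_date FROM orders "
            "WHERE last_modified > ?",
            (last_run_timestamp,),
        )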

Data Cleansing: This is typically performed in conjunction
with data acquisition (it can be part of the "T" in "ETL"). A data warehouse
that contains incorrect data is not only useless, but also very dangerous.
The whole idea behind a data warehouse is to enable decision-making. If
a high level decision is made based on incorrect data in the warehouse,
the company could suffer severe consequences, or even complete failure.
Data cleansing is a complicated process that validates and, if necessary,
corrects the data before it is inserted into the warehouse. For example,
the company could have three "Customer Name" entries in its various source
systems, one entered as "IBM", one as "I.B.M.", and one as "International
Business Machines". Obviously, these are all the same customer. Someone
in the organization must make a decision as to which is correct, and then
the data cleansing tool will change the others to match the rule. This process
is also referred to as "data scrubbing" or "data quality assurance". It
can be an extremely complex process, especially if some of the warehouse
inputs are from older mainframe file systems (commonly referred to as "flat
files" or "sequential files").

Data Aggregation: This process is often performed
during the "T" phase of ETL, if it is performed at all. Data warehouses
can be designed to store data at the detail level (each individual
transaction), at some aggregate level (summary data), or a combination of
both. The advantage of summarized data is that typical queries against
the warehouse run faster. The disadvantage is that information that may
be needed to answer a query is lost during aggregation. The tradeoff must
be weighed carefully, because the decision cannot be undone without
rebuilding and repopulating the warehouse. The safest decision is to build
the warehouse with a high level of detail, but the cost in storage can be
extreme.
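
The sketch below illustrates the tradeoff with the illustrative rows used
earlier: summing detail to monthly totals discards the per-customer
information, and nothing short of reloading brings it back:

    from collections import defaultdict

    def aggregate_monthly(detail_rows):
        # Collapse (period, customer_id, amount) detail into one total per
        # period; the customer-level detail is irretrievably summed away.
        totals = defaultdict(float)
        for period, _customer_id, amount in detail_rows:
            totals[period] += amount
        return sorted(totals.items())

    print(aggregate_monthly([("2002-07", 1001, 250.0),
                             ("2002-08", 1001, 300.0),
                             ("2002-08", 1002, 100.0)]))
    # [('2002-07', 250.0), ('2002-08', 400.0)]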

Now that the warehouse has been built and populated, it becomes possible
to extract meaningful information from it that will provide a competitive
advantage and a return on investment. This is done with tools that fall
within the general rubric of "Business Intelligence".

Business Intelligence (BI):

A very broad field indeed, it encompasses technologies such as Decision
Support Systems (DSS), Executive Information Systems (EIS), On-Line
Analytical Processing (OLAP), Relational OLAP (ROLAP), Multi-Dimensional
OLAP (MOLAP), Hybrid OLAP (HOLAP, a combination of MOLAP and ROLAP), and
more. BI can be broken down into four broad fields:

Multi-dimensional Analysis Tools: Tools that allow
the user to look at the data from a number of different "angles". These
tools often use a multi-dimensional database referred to as a "cube".
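
As a toy illustration of the idea behind such a cube (the region and
period dimensions are invented for the example), the totals are indexed
by every combination of dimension values, so the data can be viewed from
either "angle":

    from collections import defaultdict

    def build_cube(rows):
        # Roll the rows up by (region, period): by region across time,
        # or by period across regions, from the same structure.
        cells = defaultdict(float)
        for region, period, amount in rows:
            cells[(region, period)] += amount
        return cells

    cube = build_cube([("East", "2002-07", 250.0),
                       ("East", "2002-08", 300.0),
                       ("West", "2002-08", 100.0)])
    print(cube[("East", "2002-08")])  # 300.0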

Query tools: Tools that allow the user to issue
SQL (Structured Query Language) queries against the warehouse and get a
result set back.

Data Mining Tools: Tools that automatically search
for patterns in data. These tools are usually driven by complex statistical
formulas. The easiest way to distinguish data mining from the various
forms of OLAP is that OLAP can only answer questions you know to ask,
whereas data mining answers questions you did not necessarily know to ask.

Data Visualization Tools: Tools that show graphical
representations of data, including complex three-dimensional data pictures.
The theory is that the user can "see" trends more effectively in this manner
than when looking at complex statistical graphs. Some vendors are making
progress in this area using the Virtual Reality Modeling Language (VRML).

Metadata Management:

Throughout the entire process of identifying, acquiring, and querying the
data, metadata management takes place. Metadata is defined as "data about
data". Consider a column in a table: the datatype of the column (for
instance, string or integer) is one piece of metadata, and the name of
the column is another. The actual value in the column for a particular
row is not metadata - it is data. Metadata is stored in a metadata
repository and provides extremely useful information to all of the tools
mentioned previously. Metadata management has developed into an exacting
science that can provide huge returns to an organization. It can assist
companies in analyzing the impact of changes to database tables, tracking
the owners of individual data elements ("data stewards"), and much more.
It is also required to build the warehouse in the first place, since the
ETL tool needs to know the metadata attributes of the sources and targets
in order to "map" the data properly. The BI tools need the metadata for
similar reasons.
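
As a hedged sketch of the distinction, here is how one slice of column
metadata might be modeled; the fields are illustrative, and real
repositories track far more:

    from dataclasses import dataclass

    @dataclass
    class ColumnMetadata:
        # "Data about data": the column's name and datatype are metadata;
        # the values stored in the column are the data itself.
        name: str
        datatype: str
        source_system: str  # lineage, used for impact analysis
        steward: str        # the "data steward" who owns the element

    amount = ColumnMetadata("amount", "REAL", "orders (OLTP)",
                            "finance team")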

Summary:

Data warehousing is a complex field, with many vendors vying for market
awareness. The complexity of the technology, the interactions among the
various tools, and the high price points of the products require companies
to perform a careful technology evaluation before embarking on a
warehousing project. However, the potential for enormous returns on
investment and for competitive advantage makes data warehousing difficult
to ignore.