Building a Data Warehouse with MySQL and Perl

Introduction

Data warehousing is the practice of building databases specifically to support complex ad-hoc querying by end-users. It isn't a new concept, but it doesn't seem to be a common technique in the Perl database community. In this article, I'll explain the basic parts of a data warehouse and describe how we created one with Perl and MySQL.

At my place of employment, Plus Three LP, we designed a data warehouse as part of our Arcos product to provide efficient access to large databases of supporters for our clients, who are mainly political campaigns and nonprofit groups engaged in fundraising. The warehouse is used for reporting and for list segmentation, which is the process of finding a subset of people to which to send a particular email. The chief architect of our data warehouse at Plus Three is Perrin Harkins, who also helped me with this article. Thanks, Perrin!

What Is a Data Warehouse?

If you have experience with database programming, then most likely you've spent your time working on operational databases. Operational databases serve as order entry systems for online stores, content storage for content management systems, post storage for blogs, and much more. The key attribute of an operational database is that it is designed to be modified frequently and with maximum ease. A business will often have more than one operational database; Arcos has two: Krang's content-management database and a database used by our front-end applications.

Operational databases are most commonly designed using normalized modeling, often using third-normal form or entity-relationship modeling. Normalized database schemas are tuned to support fast updates and inserts by minimizing the number of rows that must be changed when recording new data.

Consider, for example, Figure 1--a simple normalized representation for an order-management database. The normalized layout means changing a category name, or a SKU type means altering a single row.

Figure 1. A simple normalized schema

Data warehouses differ from operational databases in the way they are designed; they are optimized for efficient querying and not for updating. Data warehouses provide a read-only version of the data in the operational databases, which is optimized for querying. The kind of modeling most commonly used in warehouse design is called dimensional modeling, and the schemas produced are known as star schemas. In dimensional modeling, a database is organized around a small number of fact tables. Each row in a fact table is a single measurable event: a single sale, a single hit to a web page, etc.

Surrounding each fact table is a set of dimension tables. Dimension tables add descriptive detail to the facts in the fact table. An important aspect of dimensional modeling is that dimension tables are expected to be both verbose and simple in their construction. Denormalized data is the rule in dimension table design.

Figure 2 shows the previous example in a possible dimensional layout. The normalized data from Figure 1 has been flattened into a single fact table with three dimension tables. If this is your first exposure to dimensional modeling, you're probably recoiling in horror, particularly at the sight of the "Date_Dimension." However, consider how much easier it would be to write reporting SQL for this schema--finding all the orders for a particular category or the total of all orders made on holidays. Consider how much improvement is possible for the kind of complexity regularly found in real-world normalized schemas.

Figure 2. A simple dimensional schema

Dimensional modeling is a complex topic, and not one that I'm even qualified to teach. If you decide to build a data warehouse yourself, you'll need to learn more about it.

What Are Data Warehouses Good For?

Data warehouses provide for fast, simple access to the fundamental data of an organization. There are many valid uses for data warehousing, but the most common uses focus on consolidation and simplification of data.

Consolidation of data is necessary when you have multiple different database systems, and you need the ability to write queries that examine data from all of them. These operational databases may be physically separate and may run on different database software. Providing a central place where all the data for an organization is collected and made accessible can be of very high value.

Simplification of data is a more subtle benefit. Data in normalized form is often highly complex. Normalization tends to result in a "spider web" pattern, where many tables link to many other tables, creating a dense "web" of links. The central tenant of normalized data design, that each piece of information should be present in exactly one row, necessitates the creation of many tables to store each unique type of data in an application.

Dimensional modeling offers a chance to reduce the level of complexity in your database. By reducing complex chains of tables into dimension tables, the schema becomes smaller and performance tends to improve. For example, our main operational database has 79 tables. Our warehouse has just 18 tables! The two databases contain almost all the same information, but the warehouse schema stores the data in denormalized dimension and fact tables.