JUMP TO

BROUGHT TO YOU BY

Great question. While there are many kinds of databases, the structure of your Magento BI Data Warehouse is called a referential database.

In the simplest terms, a database is a collection of data. A really well organized collection of data. Databases are designed to support a business or other functional need by capturing information that:

describes a complex array of entities, concepts and actions

is large in volume; a database could easily contain anywhere from a few hundred to a few billion (or more) records

Databases are meant for storing large sets of data for easy access - for example, looking up a record, printing out an invoice or uncovering fascinating insights about your business! To accomplish this within a large, complex data set, the really (really) well organized part is incredibly important. In this article, we'll cover:

Why are we telling you all this? To fully leverage your Data Warehouse, it’s vital that you understand how databases work. It looks like a lot (we know), but we promise: it’ll make you into a data guru.

How is a referential database organized?

The first step to organizing a data set is to sort the data into unique entities, such as a user, order or product. These entities are then each described in a single table.

The rows in a table are the instances of that entity occurring. A row may also be referred to as a record. The columns in a table are describing attributes about the entity.

For example, in an orders table, the rows are describing individual orders and the columns are describing attributes such as when the order was placed and the total amount paid.

When a database is initially designed, choices are made about what the tables look like. When making these choices, it is considered best practice to follow a process called normalization.

What is normalization?

(It's not as scary as it sounds!)

Normalization dictates the process by which data is organized. Normalization allows for a flexible database design that optimizes storage, facilitates everyday operational processes, reduces redundancy, and minimizes the risk of errors when data is modified. In fact, it's very likely that your database is organized following normalization principles.

Magento BI will denormalize the tables replicated in your Data Warehouse because we want to optimize for analysis, not daily operations. For this reason, understanding the basics of normalization is a fundamental concept to understanding how to best leverage your data warehouse.

To understand normalization, let's look at a data set that isn’t normalized. In theory, you can design database tables anyway you like and force all the data into one table:

Why isn’t this table normalized? There are a few reasons:

there is not a unique identifier called a Primary Key

the entities users, orders and items are all described in a single table

there is redundant information (name and user_joined)

there are multi-value columns (user_address_state)

While the data in this table is correct, it’s not organized in a flexible manner to accommodate changes or updates to the data. To allow this, the data set must be normalized.

How does normalization work?

Step 1: Create separate tables with primary keys

To normalize this data set, the table must be broken out so that unique entities - in this case, users, orders, and items - are described within their own table.

In addition, a Primary Key (PK) must be assigned to each table. A Primary Key is an unchanging column (or set of columns) that produces unique values within a table.

For example, in the users table the Primary Key could be called id. The values in this id column must appear uniquely; the value "123" appears once and only once. The user assigned the value "123" is always user "123" and this value does not change.

Once the entities are broken out into their own tables and given Primary Keys, the data set will look something like this:

Step 2: Remove multi-value columns and additional redundancies

While this is leaps and bounds better than before, there’s still more we can do to improve these tables. Lets consider the following:

In the users table, the multi-value column user_address_state is prone to errors if there's a need to add, edit or delete a state value for a given user. If a user can have multiple addresses, it's better to treat them as a separate entity and create a user_address table.

In the items table, the product information sku and name is redundant. These are not attributes about the sales item but about the product. These attributes should be captured in a table describing products.

The next step in normalizing this data set is to further break out the tables into user_address and product tables, and then assigning them a Primary Key:

Step 3: Create references between tables

At this point, these tables are organized in a normalized manner. But how do they link together? For a given order, how do you know when the user who created the order joined the site? For a given sales item, how do you know the product name?

This is accomplished through references. To establish a reference, a Reference Key column is created inside a table that links to the Primary Key column of another table.

For example, inside of the orders table, the user_id column is a Reference Key created to identify who placed the order and it references the id column of the users table.

The final step to normalize this data set is creating references between tables by pointing Reference Keys to Primary Keys, allowing data look-ups between tables:

The initial data set user_orders is now normalized. This structure is better because it can easily accommodate changes. Imagine that you need to create a new address record for a user. Now you can simply add a new row to the addresses table which follows the user throughout the database by using the user_id.