Introduction

Materialized views are certainly possible in PostgreSQL. Because of PostgreSQL's powerful PL/pgSQL language, and the functional trigger system, materialized views are somewhat easy to implement. I will examine several methods of implementing materialized views in PostgreSQL.

Definitions

A materialized view is a table that actually contains rows, but behaves like a view. That is, the data in the table changes when the data in the underlying tables changes.

There are several different types of materialized views:

Snapshot materialized views are the simplest to implement. They are only updated when manually refreshed.

Eager materialized views are updated as soon as any change is made to the database that would affect them. Eagerly updated materialized views may have incorrect data if the view they are based on depends on mutable functions like now().

Lazy materialized views are updated when the transaction commits. They too may fall out of sync with the base view if the view depends on mutable functions like now().

Very Lazy materialized views are functionally equivalent to Snapshot materialized views. The only difference is that changes are recorded incrementally and applied when the table is manually refreshed. (This may be faster than a full snapshot upon refresh.)

Why Materialized Views?

Before we get too deep into how to implement materialized views, let's first examine why we may want to use materialized views.

You may notice that certain queries are very slow. You may have exhausted the standard bag of techniques for speeding up those queries. In the end, you realize that getting the queries to run as fast as you like simply isn't possible without completely restructuring the data.

What you end up doing is storing pre-queried bits of information so that you don't have to run the real query when you need the data. This is typically called "caching" outside of the database world.

What you are really doing is creating a materialized view. You are taking a view and turning it into a real table that holds real data, rather than a gateway to a SELECT query.

Caveats and Assumptions

I assume you are not new to PostgreSQL. I assume that you have a fairly solid understanding of how a database works, and in particular, the PostgreSQL database. I also assume you are comfortable with PL/pgSQL, PostgreSQL's SQL syntax, and tools to access and modify the database. All the information you need is in the PostgreSQL documentation.

Unless otherwise noted, all of the commands are executed in the database as a superuser. If you know what you are doing, you don't have to do this as a superuser. If you don't know what a superuser is, or how to create one, then read the previous paragraph again.

Snapshot Materialized Views

Snapshot materialized views are very easy to implement. They will serve as a foundation for all other materialized views I discuss in this article.

I use a system where the materialized view is based on a view. I assume that the view definition will never change. All bets are off if it does. Of course, I do expect that the data in the view will change as data in the database is modified.

matviews Table

I create a table called matviews to store the information about a materialized view.

This table will allow us to keep track of our materialized views, how they should behave, and their current status.
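A minimal sketch of such a table might look like the following. The column names (mv_name, v_name, last_refresh) are assumptions chosen for illustration; any layout that records the materialized view, its source view, and its refresh status would serve.

```sql
-- Catalog of hand-rolled materialized views.
-- Column names are illustrative assumptions, not a fixed convention.
CREATE TABLE matviews (
    mv_name      name PRIMARY KEY,         -- name of the materialized view table
    v_name       name NOT NULL,            -- name of the view it is built from
    last_refresh timestamp with time zone  -- when the data was last rebuilt
);
```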

create_matview Function

Here is a function written in PL/pgSQL to insert a row into the matviews table and to create the materialized view. Pass in the name of the materialized view, and the name of the view that it is based on. Note that you have to create the view first, of course.

This function will see if a materialized view with that name is already created. If so, it raises an exception. Otherwise, it creates a new table from the view, and inserts a row into the matviews table.
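The behavior described above could be sketched as follows. This is one possible version, not a definitive implementation: it assumes a matviews table with columns mv_name, v_name, and last_refresh, and it quotes identifiers with format() and %I, which requires PostgreSQL 9.1 or later and does not handle schema-qualified names.

```sql
-- Assumes: matviews(mv_name name, v_name name, last_refresh timestamptz).
CREATE OR REPLACE FUNCTION create_matview(matview name, view_name name)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
    -- Refuse to register the same materialized view twice.
    IF EXISTS (SELECT 1 FROM matviews WHERE mv_name = matview) THEN
        RAISE EXCEPTION 'Materialized view % already exists.', matview;
    END IF;

    -- Copy the view's current rows into a real table.
    EXECUTE format('CREATE TABLE %I AS SELECT * FROM %I', matview, view_name);

    -- Record the new materialized view and when it was populated.
    INSERT INTO matviews (mv_name, v_name, last_refresh)
    VALUES (matview, view_name, CURRENT_TIMESTAMP);
END;
$$;
```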

refresh_matview Function

Finally, you need a way to refresh the materialized views so that the data does not become completely stale. This function only needs the name of the matview. It uses a brute-force algorithm that will delete all the rows and reinsert them from the view.

Note that you may want to drop the indexes on your materialized view before executing this, and recreate them after it finishes. This step could be automated, perhaps.
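A brute-force refresh along these lines might be written as below. As before, the matviews column names are assumptions, and the index handling mentioned above is left out to keep the sketch short.

```sql
-- Assumes: matviews(mv_name name, v_name name, last_refresh timestamptz).
CREATE OR REPLACE FUNCTION refresh_matview(matview name)
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
    entry matviews%ROWTYPE;
BEGIN
    SELECT * INTO entry FROM matviews WHERE mv_name = matview;
    IF NOT FOUND THEN
        RAISE EXCEPTION 'Materialized view % does not exist.', matview;
    END IF;

    -- Brute force: throw away every row, then re-run the view.
    EXECUTE format('DELETE FROM %I', matview);
    EXECUTE format('INSERT INTO %I SELECT * FROM %I', matview, entry.v_name);

    UPDATE matviews
    SET last_refresh = CURRENT_TIMESTAMP
    WHERE mv_name = matview;
END;
$$;
```

Because the delete and the reinsert happen inside one function call, they run in a single transaction, so other sessions should not observe the table half-empty.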

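As an example, suppose each player's scores are recorded per game in a game_score table, and a view, player_total_score_v, adds them up for each player. Since a lot of players play games each day, and running the view is fairly expensive, you decide to implement a materialized view on player_total_score_v. This is done with a single call to create_matview, naming the new materialized view player_total_score_mv.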
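Here is a sketch of what that setup might look like. The game_score layout and the pname and score column names are assumptions made up for illustration, and create_matview and refresh_matview are taken to be functions that behave as described earlier.

```sql
-- Hypothetical schema; table layout and column names are assumptions.
CREATE TABLE game_score (
    pname name    NOT NULL,  -- player name
    score integer NOT NULL   -- points earned in one game
);

-- One row per player, aggregated over all of that player's games.
CREATE VIEW player_total_score_v AS
SELECT pname, sum(score) AS total_score
FROM game_score
GROUP BY pname;

-- Materialize the view once...
SELECT create_matview('player_total_score_mv', 'player_total_score_v');

-- ...and rebuild it whenever staleness becomes a problem.
SELECT refresh_matview('player_total_score_mv');
```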

Even though the scores in player_total_score_mv aren't going to change to reflect the most current scores until the refresh is run, the players will learn to accept that.

Eager Materialized View

An eager materialized view is updated whenever the data behind the view changes. This is done with a system of triggers on all of the underlying tables. Dependencies on mutable functions (like now()) will cause the materialized view to fall out of sync, but that can be corrected with minor refreshes that only touch the affected rows.

Underlying Tables

First, let's step back and consider what actually makes up the data in a materialized view, or where the data in the view is coming from. Obviously, information in the materialized view will come from any tables mentioned in the view definition. If the view definition relies on other views, then we'll have to consider all the tables that compose that view as well.

Relation Between Materialized Views and Underlying Tables

Now we need to consider how the data in the underlying table relates to or affects the data in the materialized view.

One-to-one. The simplest case is a one-to-one relation. The view is merely selecting all or some of the columns from a table. Multiple rows in the view do not depend on the same row in the underlying table. In the example below, if the username of a single user changes, then only one row in the user_v will change.

Many-to-one. A more complicated case is many-to-one. Many rows in the view depend on a single row in an underlying relation. This is most common when you are joining two tables. In the example below, if the groupname changes, many rows in the user_group_v may change.

One-to-many. Another complicated case is one-to-many. One row in the view is derived from multiple rows in the underlying table. This is most common with aggregates. In this example, the total_score is derived from many rows in game_score.

Many-to-many. There are also some rare cases where many rows in the view will be affected by many rows in the underlying tables. These are usually a combination of the relations above.

Other. There are also cases when the data in the view isn't related to the data in the underlying tables at all. For instance, this would occur if a column depended on the value of now(), or some other configuration. Although this cannot be handled perfectly, it can be satisfactorily approached. For instance, you may only summarize data that is old, or you may decide to update the data in distinct intervals, as for our snapshot solution above.

We need to keep in mind all the cases as we design the functions and triggers below.
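The first three relations can be illustrated with views like these. The users and groups tables, and every column name, are hypothetical; they are only meant to make the shape of each relation visible.

```sql
-- Hypothetical tables; names and columns are assumptions for illustration.
CREATE TABLE groups (
    group_id  integer PRIMARY KEY,
    groupname text NOT NULL
);
CREATE TABLE users (
    user_id  integer PRIMARY KEY,
    username text NOT NULL,
    group_id integer REFERENCES groups
);
CREATE TABLE IF NOT EXISTS game_score (
    pname name    NOT NULL,
    score integer NOT NULL
);

-- One-to-one: each row of user_v comes from exactly one row of users.
CREATE VIEW user_v AS
SELECT user_id, username
FROM users;

-- Many-to-one: a single groups row feeds every user_group_v row in that group.
CREATE VIEW user_group_v AS
SELECT u.user_id, u.username, g.groupname
FROM users u JOIN groups g USING (group_id);

-- One-to-many: each view row aggregates many game_score rows.
CREATE OR REPLACE VIEW player_total_score_v AS
SELECT pname, sum(score) AS total_score
FROM game_score
GROUP BY pname;
```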

mv_refresh_row Function

We first need to design an mv_refresh_row function. I don't know how to make a generic function that will work for all materialized views, so we have to hand-craft one for each materialized view we create. It isn't hard to do. This simple algorithm will get you through this process.

Identify the primary key of the materialized view. If there isn't one, you must redefine the view so as to create one. The function will accept the primary key as an argument.

In the function, first delete the row with that primary key from the materialized view.

Next, select the row with that primary key from the view and insert it into the materialized view.

We'll see an example below.
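For a single-column key, the steps above can be sketched as follows. The names are assumptions: a materialized view player_total_score_mv built from a view player_total_score_v, with pname as the primary key.

```sql
-- Assumes: player_total_score_mv was created from player_total_score_v,
-- and pname uniquely identifies a row in both (hypothetical names).
CREATE OR REPLACE FUNCTION player_total_score_mv_refresh_row(p_pname name)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
    -- Drop the stale copy of the row, if there is one.
    DELETE FROM player_total_score_mv WHERE pname = p_pname;

    -- Re-derive the row from the view. Nothing is inserted if the view
    -- no longer produces a row for this key.
    INSERT INTO player_total_score_mv
    SELECT * FROM player_total_score_v WHERE pname = p_pname;
END;
$$;
```

Deleting first and then reinserting handles inserts, updates, and deletes in the underlying data with the same code path, at the price of rewriting the row even when it has not changed.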

mv_refresh Function

If the view relies on some mutable functions, then you will have to run a refresh function that will only refresh those rows that are affected. Most commonly, this occurs when there is some sort of time-dependence.

A lot of thought needs to go into the mv_refresh function. There may be a way to write a generic one for all views, but for now, you'll have to hand-craft your own. I don't even have a generic algorithm to pass along. I think this is largely dependent on the mutable function and how it behaves.

Table Triggers

You will have to create triggers for every action on every underlying table. I write three triggers for each table, one each for INSERT, UPDATE, and DELETE. I feel it is more efficient than writing one function that handles all three cases.

The basic algorithm for the trigger functions follows. I'll show you a concrete example below of these functions.

Identify the primary key value(s) for the materialized view of the affected rows. If there is a many-to-one or many-to-many, many rows in the materialized view will be affected.

For DELETE and INSERT, merely call the mv_refresh_row function for each primary key value.

For UPDATE, identify whether the update is going to change which row(s) in the materialized view this row will affect. If so, then you'll have to refresh rows for both the old and new values. Otherwise, refreshing for either the old or the new values will do.

Apply the triggers so that they are called after the operation is performed.
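As a sketch of this algorithm, here are three trigger functions for a hypothetical game_score table feeding player_total_score_mv. They assume a player_total_score_mv_refresh_row(name) function of the kind described above; all names are illustrative.

```sql
-- INSERT: only the new row's player is affected.
CREATE OR REPLACE FUNCTION player_total_score_mv_gs_it() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    PERFORM player_total_score_mv_refresh_row(NEW.pname);
    RETURN NULL;  -- the return value of an AFTER row trigger is ignored
END;
$$;

-- DELETE: only the old row's player is affected.
CREATE OR REPLACE FUNCTION player_total_score_mv_gs_dt() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    PERFORM player_total_score_mv_refresh_row(OLD.pname);
    RETURN NULL;
END;
$$;

-- UPDATE: if the score moved to another player, refresh both players.
CREATE OR REPLACE FUNCTION player_total_score_mv_gs_ut() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    IF OLD.pname IS DISTINCT FROM NEW.pname THEN
        PERFORM player_total_score_mv_refresh_row(OLD.pname);
    END IF;
    PERFORM player_total_score_mv_refresh_row(NEW.pname);
    RETURN NULL;
END;
$$;

-- Fire after the operation so the view already reflects the change.
CREATE TRIGGER player_total_score_mv_it AFTER INSERT ON game_score
    FOR EACH ROW EXECUTE PROCEDURE player_total_score_mv_gs_it();
CREATE TRIGGER player_total_score_mv_dt AFTER DELETE ON game_score
    FOR EACH ROW EXECUTE PROCEDURE player_total_score_mv_gs_dt();
CREATE TRIGGER player_total_score_mv_ut AFTER UPDATE ON game_score
    FOR EACH ROW EXECUTE PROCEDURE player_total_score_mv_gs_ut();
```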

Examples

These examples are long and contrived; I am using them merely to demonstrate how it all works. Don't bother trying to make sense of the tables. I couldn't come up with any real world examples that would show all three instances.

Notice that one row in a may contribute to multiple rows in b_v. Only one row in b contributes to each row in b_v. Also, many rows in c contribute to a single row in b_v. The primary key of b_v is b_id. Finally, the view has a dependence on the mutable function now().
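One schema that has the properties just listed is sketched below, together with a row-refresh function, two representative triggers, and a time-based refresh. Only b_id, b_v, and the expires column are named in the text; every other table layout, column name, and function name here is an assumption. The remaining triggers on a, b, and c would follow the same pattern.

```sql
-- Hypothetical schema matching the relations described above.
CREATE TABLE a (a_id integer PRIMARY KEY, v integer);
CREATE TABLE b (
    b_id    integer PRIMARY KEY,
    a_id    integer REFERENCES a,
    v       integer,
    expires timestamp with time zone
);
CREATE TABLE c (c_id integer PRIMARY KEY, b_id integer REFERENCES b, v integer);

-- One row per unexpired b; a.v fans out to many rows, c.v is aggregated.
CREATE VIEW b_v AS
SELECT b.b_id, a.v AS a_val, b.v AS b_val, sum(c.v) AS c_total
FROM a JOIN b USING (a_id) JOIN c USING (b_id)
WHERE b.expires IS NULL OR b.expires >= now()
GROUP BY b.b_id, a.v, b.v;

CREATE TABLE b_mv AS SELECT * FROM b_v;
ALTER TABLE b_mv ADD PRIMARY KEY (b_id);

CREATE FUNCTION b_mv_refresh_row(p_b_id integer) RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
    DELETE FROM b_mv WHERE b_id = p_b_id;
    INSERT INTO b_mv SELECT * FROM b_v WHERE b_id = p_b_id;
END;
$$;

-- Many-to-one: updating one row of a touches every b that references it.
CREATE FUNCTION b_mv_a_ut() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    PERFORM b_mv_refresh_row(b.b_id)
    FROM b
    WHERE b.a_id IN (OLD.a_id, NEW.a_id);
    RETURN NULL;
END;
$$;
CREATE TRIGGER b_mv_a_ut AFTER UPDATE ON a
    FOR EACH ROW EXECUTE PROCEDURE b_mv_a_ut();

-- One-to-many: a new row in c changes the aggregate for a single b_id.
CREATE FUNCTION b_mv_c_it() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    PERFORM b_mv_refresh_row(NEW.b_id);
    RETURN NULL;
END;
$$;
CREATE TRIGGER b_mv_c_it AFTER INSERT ON c
    FOR EACH ROW EXECUTE PROCEDURE b_mv_c_it();

-- Time dependence: remove materialized rows whose b has expired.
CREATE FUNCTION b_mv_refresh() RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
    DELETE FROM b_mv USING b
    WHERE b.b_id = b_mv.b_id AND b.expires < now();
END;
$$;
```

In this sketch the passage of time can only remove rows from the view, so the time-based refresh reduces to a delete. A view where time can also add or change rows would need a more careful refresh.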

Lazy Materialized Views

The lazy materialized view would record which rows in the materialized view need to be updated, and update them when the transaction is committed. This is useful if many changes will affect the same rows, and will also allow those changes to be made much more quickly.

Currently, I know of no way to put a hook in when the transaction is committed. Pending that development, this materialized view scheme cannot be implemented.

If there were such a hook, it would be implemented like the Very Lazy method below, but the refresh would be called before the end of the transaction.

Very Lazy Materialized Views

Very lazy materialized views would record which rows in the materialized view need to be updated, but won't update until directed to. This would be useful if you are committing multiple transactions that affect the materialized views, but don't want to actually update the materialized view until later. Does this sound familiar? It should, because it is functionally equivalent to the snapshot materialized view I described earlier.

Recording the Changes in matview_changes

Replacing the mv_refresh_row function with one that merely records the change to be made at a later time is pretty easy. First, we need a table to store our changes -- one that could store any change to any materialized view (assuming the primary key is a single column of integers).
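Such a table might be as simple as the following sketch. The column names mv and pkey are assumptions; the composite primary key keeps each pending row from being queued more than once.

```sql
-- Queue of pending row refreshes, shared by all materialized views.
-- Column names are illustrative assumptions.
CREATE TABLE matview_changes (
    mv   name    NOT NULL,  -- which materialized view is affected
    pkey integer NOT NULL,  -- primary key of the row to refresh later
    PRIMARY KEY (mv, pkey)  -- each pending row appears at most once
);
```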

matview_queue_refresh_row() Function

Next, we need to create a generic matview_queue_refresh_row() function that inserts the primary key into matview_changes if it is not already present.
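One way to write this is sketched below. It assumes a matview_changes table with columns mv and pkey and a unique constraint on the pair, and it relies on INSERT ... ON CONFLICT, which is available in PostgreSQL 9.5 and later; on older versions a NOT EXISTS check would be needed instead.

```sql
-- Assumes: matview_changes(mv name, pkey integer) with a unique
-- constraint or primary key on (mv, pkey).
CREATE OR REPLACE FUNCTION matview_queue_refresh_row(p_mv name, p_pkey integer)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
    -- Queuing a key that is already pending is a no-op.
    INSERT INTO matview_changes (mv, pkey)
    VALUES (p_mv, p_pkey)
    ON CONFLICT (mv, pkey) DO NOTHING;
END;
$$;
```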

Modify the Eager Materialized View System

The very lazy materialized view works mostly like the eager materialized view system, with a few modifications. Apply the following changes on top of the eager materialized view setup above.

Modify the triggers you defined for eager materialized views above so that they call matview_queue_refresh_row() rather than mv_refresh_row().

Modify (or create, if it doesn't exist yet) mv_refresh() so that it first handles the mutable-function dependencies, and then performs the actual changes in bulk with mv_refresh_row(), deleting all the changes from the matview_changes table.
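The two modifications could be sketched like this for a hypothetical materialized view b_mv over a table b with an expires column. The sketch assumes that b_mv_refresh_row(integer) and matview_queue_refresh_row(name, integer) exist as described above and that matview_changes has columns mv and pkey; all of these names are assumptions.

```sql
-- A trigger function now queues the key instead of refreshing at once.
CREATE OR REPLACE FUNCTION b_mv_c_it() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    PERFORM matview_queue_refresh_row('b_mv', NEW.b_id);
    RETURN NULL;
END;
$$;

CREATE OR REPLACE FUNCTION b_mv_refresh() RETURNS void
LANGUAGE plpgsql AS $$
DECLARE
    queued record;
BEGIN
    -- 1. Mutable-function dependence: queue rows that have expired
    --    but are still present in the materialized view.
    PERFORM matview_queue_refresh_row('b_mv', b.b_id)
    FROM b JOIN b_mv USING (b_id)
    WHERE b.expires < now();

    -- 2. Apply the queued changes in bulk, removing each one from the
    --    queue as it is claimed.
    FOR queued IN
        DELETE FROM matview_changes WHERE mv = 'b_mv' RETURNING pkey
    LOOP
        PERFORM b_mv_refresh_row(queued.pkey);
    END LOOP;
END;
$$;
```

Using DELETE ... RETURNING to drive the loop means only the entries that were actually claimed get refreshed and removed, so keys queued by other sessions while the refresh runs stay in the queue for the next pass.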

Refresh Strategies

Now that refreshing the materialized view won't be so expensive, we can look at some alternative refresh strategies that are out of the question for snapshot materialized views.

We can implement a rule on select such that the materialized view is refreshed whenever data is selected from it.

We can improve performance by implementing a rule that will only refresh the materialized view if 1 second, 10 seconds, 1 minute, or 1 hour has passed since the last refresh and the data is being selected.

We can forget about the rules altogether and have the materialized view get refreshed at a preset time interval - 10 seconds, 1 minute, 10 minutes, etc.

Example

Borrowing from our eager example, we are now going to implement the triggers and mv_refresh() function so that it uses the very lazy updating technique. It also implements a rule that will refresh the materialized view if 10 seconds have passed since the last refresh.
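The time check itself can be sketched as below. Note that this sketch swaps in a plain function that readers call before selecting, rather than a rule, because that keeps the example short and self-explanatory. It assumes b_mv is registered in a matviews table with mv_name and last_refresh columns and that a b_mv_refresh() function exists; these names are assumptions.

```sql
-- Refresh b_mv only if the last refresh is at least 10 seconds old.
CREATE OR REPLACE FUNCTION b_mv_refresh_if_stale() RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
    -- Claim the refresh by bumping the timestamp; the WHERE clause makes
    -- this a no-op when the data is still fresh enough.
    UPDATE matviews
    SET last_refresh = now()
    WHERE mv_name = 'b_mv'
      AND (last_refresh IS NULL
           OR last_refresh < now() - interval '10 seconds');

    IF FOUND THEN
        PERFORM b_mv_refresh();
    END IF;
END;
$$;

-- Readers refresh on demand, then query the materialized view.
SELECT b_mv_refresh_if_stale();
SELECT * FROM b_mv;
```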

Switching Between Techniques

With the right support functions, and a column to track the type of materialized view, it should be possible to switch between the various techniques. Right now, I don't have a way to do this because generating the triggers, the mv_refresh_row function, and the mv_refresh function is not generic. However, should I discover a way to make them generic, then it should be possible to track what type of technique the materialized view is using in the matviews table. Then, I would write a generic function to apply the necessary changes to switch between techniques.

Comparison of Techniques

Let's review the various techniques and their benefits and disadvantages.

Snapshot Materialized Views

The snapshots are really easy to implement. However, they take a lot of work to regenerate the data. This is good if it is going to take a lot of work anyway, but bad if there are only a few small changes to make at each update. The refresh function must be called regularly.

The data will be out-of-sync with the view as soon as the data starts to change. This may be good or bad, depending on what you want.

Eager Materialized Views

Eagerly updated materialized views are always up to date - even within a transaction. This comes at a cost - changes that affect the materialized view are going to be more expensive. This is fine if you don't change the data much, but can lead to problems when you do bulk imports or modifications.

The data may fall out of sync if there is a dependence on mutable functions. A regular specialized refresh function can be called to remedy the data. However, discovering the correct algorithm for this function can be difficult.

Lazy Materialized Views

Lazy materialized views offer a balance between the eager and snapshot types. Changes to the data are not instantly propagated, allowing multiple changes to the same records to be applied in one shot rather than multiple shots. However, the data is not consistent during a transaction, and so it must be carefully handled.

Like eager materialized views, the data may fall out of sync if there is a dependence on a mutable function. A similar solution must be implemented.

The only drawback to lazy materialized views is that I don't know how to implement them yet, as I can't put a hook in before the transaction is committed.

Very Lazy Materialized Views

Very lazy materialized views are like snapshots, except the updates can be lighter-weight. This is good if there are few updates to the data. Like snapshots, the data will begin to be out of sync as soon as the data changes, but the refresh function is much faster and uses fewer resources.

General Suggestions

Know When to Use a Materialized View

It doesn't take a genius to realize that materialized views are not simple. If you can find a way to improve your database performance with partitions or indexes, by all means, do so. However, certain situations are best served with materialized views. These situations are easy to recognize: there are relatively few data modifications compared to the number of queries being performed, and the queries are very complicated and heavy-weight. It also goes without saying that the different kinds of materialized views are useful in different situations.

Avoid Mutable Function Dependence

Avoid mutable functions if you can. Sometimes, like the example above with the expires column, the solution is obvious. Other times, it is not. If possible, summarize the mutable function in a column in the table. For instance, we could've added a boolean column called expired, and run a nightly query to update that column with respect to the expires column. Then we could create the view, and thus the materialized view, with a dependence on the expired column, and ignore the expires column. This would've made the materialized view simpler and always synchronized with the data.
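A sketch of that approach follows. It assumes hypothetical tables a, b (with an expires column), and c, and an existing view b_v with output columns b_id, a_val, b_val, and c_total that currently filters on now(); none of these layouts are fixed by the text.

```sql
-- Store the outcome of the time comparison in the table itself.
ALTER TABLE b ADD COLUMN expired boolean NOT NULL DEFAULT false;

-- Nightly job (run from cron or similar): flag rows whose time has passed.
UPDATE b
SET expired = true
WHERE NOT expired AND expires < now();

-- The view now depends only on stored data, not on now().
CREATE OR REPLACE VIEW b_v AS
SELECT b.b_id, a.v AS a_val, b.v AS b_val, sum(c.v) AS c_total
FROM a JOIN b USING (a_id) JOIN c USING (b_id)
WHERE NOT b.expired
GROUP BY b.b_id, a.v, b.v;
```

If eager triggers are installed on b, the nightly UPDATE fires them like any other change, so the materialized view should stay in step without a separate time-based refresh.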

Keep it Organized

Due to the sheer volume of SQL code required to write a materialized view that is either eagerly updated or very lazily updated (nearly 1,000 lines in one instance I have implemented), it is important to combine all the functions together into a script that is executed in one transaction. I also like to have a partner script to back out all the changes. This way, I can create the materialized view quickly, and I can quickly back out if things go bad. It also helps me to keep track of all the changes in one neat bundle.

Note on Time Dependence

(New as of 2009-07-28)

Having worked with several systems where we cache data that has time-dependence in it, I would like to point out the following techniques and their deficiencies.

Snapshot

A snapshot is just like a picture that a camera takes. What it sees is what is happening at the moment in time the picture is taken. However, as time marches on, that picture becomes less and less descriptive of the current state of the world.

Records in a snapshot may be modified, deleted, or even inserted as time marches on. Thus, data with a time-dependency may either be wrong, existing when they should not, or missing when they should exist. There is no way to tell from looking at the snapshot.

Of course, if the underlying data changes, then the snapshot can be updated using any of the methods I talk about above. At that point, the snapshot is valid, but only for the instant when the update was performed.

Snapshot with Windows

An alternative is a snapshot with windows. This is a snapshot with a disclaimer that describes how long the data is valid for. It even contains data that should exist in the future with a disclaimer that the data isn't valid yet. Many records will be duplicated, but without overlapping valid time periods.

This is, surprisingly, the fundamental concept of PostgreSQL data storage itself. Every row in the database system has a transaction ID representing when the record should first exist and a transaction ID representing when the record will no longer exist. Every query only sees records that are valid for its transaction ID.

When you look at such a snapshot, you have to filter out all the data that isn't valid because you are not in the right window, just like PostgreSQL filters out rows that are not valid for your transaction ID.

Of course, if the underlying data changes, then you can update the data in the snapshot according to any of the materialized view methods I describe above.
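A windowed snapshot could be laid out as in the sketch below. The table name b_snapshot and the columns valid_from and valid_until are assumptions for illustration.

```sql
-- Hypothetical windowed snapshot: each row carries the period it is valid for.
CREATE TABLE b_snapshot (
    b_id        integer NOT NULL,
    c_total     bigint,
    valid_from  timestamp with time zone NOT NULL,
    valid_until timestamp with time zone,  -- NULL means no known end
    PRIMARY KEY (b_id, valid_from)
);

-- Readers keep only the rows whose window contains the current time.
SELECT b_id, c_total
FROM b_snapshot
WHERE valid_from <= now()
  AND (valid_until IS NULL OR now() < valid_until);
```

The same b_id may appear several times, once per window, which is the duplication described above; the half-open interval keeps adjacent windows from overlapping.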

Snapshot with Scheduled Refreshes

What I said about snapshots above isn't really true. Snapshots are valid until they aren't valid. However, looking at the data, we can predict exactly when the data is no longer valid, and even schedule an action to update the data as it becomes invalid.

There is a small problem with this. There is always going to be a small period of time between the data becoming invalid and the scheduled refresh completing. Even if you are very clever, perhaps by pre-generating the future values, and having them ready to go, you are not going to be able to make these two events occur at exactly the same point in time, even with multi-core processors and parallel memory. In many cases, however, this small period of time is perfectly tolerable.
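Predicting the next invalidation can be a single query. This sketch assumes a hypothetical table b whose expires column is the only source of time dependence.

```sql
-- The snapshot stays valid until the earliest expiry that is still in
-- the future; schedule the next refresh for that moment.
SELECT min(expires) AS next_refresh_due
FROM b
WHERE expires > now();
```

An external scheduler would read this value, sleep until then, run the refresh, and ask again. A NULL result means nothing is currently scheduled to expire.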

Snapshot with Lazy Refreshes

Rather than have scheduled refreshes, you can simply mark the data as wrong and then have whoever is interested in seeing the correct data rebuild it and store the result. If you are very clever, you can combine this with the "Snapshot with Windows" method to include records that will be correct in the future.

Many caching methods I have seen rely on the processes that use the cache to keep it organized and fresh. This isn't unreasonable in the database world, either.

However, there is a significant development cost for this kind of behavior. A typical session would look like:

Examine records you are interested in looking at.

If there are old records, refresh them.

Actually run the query you were interested in again with the fresh data.

Why These Techniques Work (And Other Mutable Functions Won't)

The only reason these techniques work is due to the linear and progressive nature of time. Time is predictable, even though each moment in time is different from the rest.

Other mutable functions may not be predictable, for instance the random function, or a function relying on data from a different source. As long as these are not predictable, we cannot use techniques like this.

However, if you can make the data predictable, or bend the rules a bit to make them predictable, then you should be able to use methods similar to these to solve the mutable problem.

Future Directions

I would like to begin work on generic functions to implement the mv_refresh_row() functions, as well as the triggers and trigger functions. I think I will need to begin writing code in C for this to work properly.

If the generic functions become available, I would like to drop the snapshot method and instead replace the mv_refresh() function with one that will determine if it makes more sense to delete all the rows and reselect them, or apply the changes incrementally.

I would also like to investigate how to put a hook into the transaction process. Having that ability would add a powerful materialized view to the existing arsenal.

Current Direction

Today, I recommend the following.

Understand what Materialized Views really are, and their costs, and why they are not a panacea. Yes, you get great performance improvements, but they come with a huge cost and overhead as well.

Understand your particular data needs in detail. If you need materialized views, it may make sense to implement them outside of the database where you have greater control over the data.

I do not recommend that PostgreSQL add Materialized Views in its core. Why?

Materialized Views are insanely complicated, and you cannot get past their arbitrary requirements and limitations. In other words, they are not a very good abstraction because they leak details of the underlying implementations.

Implementing Materialized Views properly is not easy. Until it is, we should avoid a general solution, relying instead on particular, detailed solutions.

Comments Welcome!

I welcome your comments on this document, as well as any questions you may have. Please email me at jgardner@jonathangardner.net and CC the SQL, Performance, or Hackers list on the PostgreSQL site.