Why PostgreSQL?

Over the last few years, two open source database servers have become dominant on the Web: MySQL and PostgreSQL. In order to provide the best service available, Heroku chose PostgreSQL, and provides it as the default choice for all database hosting.

In the early days of Heroku PostgreSQL, this was somewhat surprising, as MySQL was easily the more widely used of the two at the time. Over time, however, opinion has shifted toward PostgreSQL.

But why PostgreSQL?

The initial choice was based on the belief that PostgreSQL was operationally more reliable than MySQL, which is very important when managing hundreds of thousands of databases on a daily basis. Over time, Heroku has found PostgreSQL to get better and better as the development community around it improves existing functionality and adds new features.

Additionally, and just as important, PostgreSQL is an open system and always will be (unlike MySQL). This means that as long as you are using PostgreSQL for your database server, now and in the future, you will not be subject to vendor lock-in. As a user, you are able to take your data wherever you please.

There are a number of other technical reasons why PostgreSQL is the favored option, too: for instance, transactional data definition language (DDL) statements, fast index creation with concurrent index builds, extensibility, and partial indexes and constraints.

Transactional DDL

If you’ve ever made a change to your database and had something fail midway, whether due to a constraint or some other reason, you understand the pain of untangling the resulting mess. Typically, changes to a schema are intended to be applied as a whole: if any part fails, you want all of it rolled back.

Some other databases, such as SQL Server, support this as well. PostgreSQL lets you wrap your DDL inside a transaction. This means that if an error does occur, you can simply roll back, and the previous DDL statements roll back with it, leaving your schema migrations as safe as your data and your application in a consistent state.
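As a minimal sketch (the accounts table here is hypothetical), every statement between BEGIN and COMMIT either applies as a whole or not at all:

```sql
BEGIN;

CREATE TABLE accounts (
    id    serial PRIMARY KEY,
    email text NOT NULL
);

ALTER TABLE accounts
    ADD CONSTRAINT accounts_email_unique UNIQUE (email);

-- Had either statement above failed, rolling back would leave the
-- schema exactly as it was before BEGIN; no half-applied migration.
COMMIT;
```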

Concurrent Indexes

When you create an index with most traditional databases, it holds a lock on the table while it creates the index. This means that the table is more or less unusable during that time. When you’re starting out, this isn’t a problem, but as your data grows and you add indexes to improve performance, it could mean downtime just to add an index (not ideal in a production environment).

Not surprisingly, PostgreSQL has a great means of adding an index without holding that lock. Simply invoking CREATE INDEX CONCURRENTLY instead of CREATE INDEX will create your index without holding the lock.
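Assuming a hypothetical address table, the two forms differ by a single keyword:

```sql
-- Takes a lock; writes to the table are blocked while the index builds:
CREATE INDEX idx_address_user ON address (user_id);

-- Builds the index without locking out writes (slower, and cannot be
-- run inside a transaction):
CREATE INDEX CONCURRENTLY idx_address_user ON address (user_id);
```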

Of course, with many features, there are caveats. In the case of creating your index concurrently, it does take somewhere on the order of two to three times longer, and cannot be done within a transaction.

Extensibility

Do you need to go beyond standard PostgreSQL? There’s a good chance that someone else has, and that there’s already an extension for it.

Extensions take PostgreSQL further with things such as geospatial support, JSON data types, key/value stores, and connecting to external data sources (e.g., Oracle, MySQL, and Redis).

Partial Indexing

Just as you may want to operate on only part of your data, you may want an index on only a portion of your data, or a constraint that applies only where a certain condition is true.

Take the example of the white pages. Within the white pages, you have only one active address, but you’ve had multiple addresses over recent years.

You likely wouldn’t care about the past addresses being indexed, but would want everyone’s current address to be indexed. With partial indexes, this becomes simple and straightforward:

CREATE INDEX idx_address_current ON address (user_id) WHERE current IS True;

What Heroku Gives You

There are a number of reasons why you should use Heroku for hosting your PostgreSQL databases, some of which may not be immediately apparent. For instance, let’s take a look at typical requirements of a database administrator (DBA):

Set up and administer databases (perform backup, recovery, monitoring, and tuning as necessary).

Devise a scalable solution with the highest performance and greatest redundancy.

Deploy upgrades and patches.

Implement and test disaster recovery.

In essence, a qualified DBA needs to be able to set up, manage, secure, back up, scale, architect, and optimize data structures for day-to-day use.

With PostgreSQL, as with the rest of the Heroku platform, most of this is done for you. In fact, Heroku will do everything aside from architect and optimize your database. There is no need to worry about patching for security or stability, and there is no need to worry about backups; everything is taken care of for you. All that is left for you to do is to worry about your data and how to store it efficiently.

Because Heroku takes on a lot of the responsibilities of a DBA, you are freed up to consider the parts that are important to your particular application and your particular use case. Meanwhile, Heroku is doing its part to keep your database healthy, functioning, and running well, 24 hours a day.

Development versus Production

At this point, it is worth mentioning that there are two classes of service provided by Heroku: development and production.

The Heroku PostgreSQL development plans are designed to offer the database features required for development and testing, without the production-grade operations, monitoring, and support found in paid production plans. Advanced features such as fork, follow, and automatic database backups are not available on the development plans (although manual backups are available).

Production plans are suitable for production-scale applications. Additionally, the production databases offer a number of advantages over the development plans, including direct access (via psql or any native PostgreSQL library), stored procedures, and PostgreSQL 9.1 support.

Choosing the Right Plan

Heroku PostgreSQL plans vary primarily by the size of their in-memory data cache. The quoted cache size constitutes the total amount of RAM given to the PostgreSQL service. While a small amount of RAM is used for managing each connection and other tasks, PostgreSQL will take advantage of almost all this RAM for its caching needs.

PostgreSQL constantly manages the cache of your data: rows you’ve written, indexes you’ve made, and metadata PostgreSQL keeps. When the data needed for a query is entirely in that cache, performance is very fast. Queries made from cached data are often 100 to 1,000 times faster than from the full dataset. Well-engineered, high performance web applications will have 99% or more of their queries served from cache.

Conversely, having to fall back to disk is at least an order of magnitude slower. Additionally, columns with large data types (e.g., large text columns) are stored out-of-line via The Oversized-Attribute Storage Technique (TOAST), and accessing large amounts of this data can be slow.

Unfortunately, there is no great way of knowing exactly how much cache you need for a given use case. This is not a very satisfactory answer, but it is the truth. The recommended way is to move between plans, watch your application performance, and see what works best for your application’s access patterns.

Access patterns vary greatly from application to application. Many applications only access a small, recently changed portion of their overall data. PostgreSQL will automatically keep that portion in cache as time goes on, and as a result these applications can perform well on the lower cost plans.

Applications that frequently access all of their data don’t have that luxury; these can experience dramatic increases in performance by ensuring that their datasets fit completely in memory. However, you will eventually reach the point where you have more data than the largest plan allows, and you will have to take more advanced approaches such as splitting your data across several databases.

Heroku Postgres’s many plans are segmented into two broad tiers: starter and production. Each tier contains several individual plans. Although these two tiers share many features, there are several differences that will determine which plan is most appropriate for your use case.

Shared Features

The starter and production tier database plans all share the following features:

Together, these features combine to provide a safe and resilient home base for your data.

Starter Features

Note

The starter tier database plans are not intended for production-caliber applications or applications with high-uptime requirements. Therefore, if you are hosting an application that you consider to be critical, it is advised that you consider using the production tier database plans.

The starter tier, which includes the dev and basic plans, has the following limitations:

Enforced row limits of 10,000 rows for dev and 10,000,000 for basic plans.

Max of 20 connections.

The lack of an in-memory cache limits performance, since data cannot be served from low-latency memory.

Fork and follow, used to create replica databases and master-slave setups, are not supported.

Expected uptime of 99.5% each month.

In addition to these feature and resource limitations, starter tier database plans do not automatically record daily data snapshots. You will need to use the PGBackups add-on to manually configure this level of data protection and retention.

Production Features

As the name implies, the production tier of Heroku PostgreSQL is intended for production applications and includes the following feature additions to the starter tier:

No row limitations

Increasing amounts of in-memory cache

Fork and follow support

Max of 500 connections

Expected uptime of 99.95% each month

Management of production tier database plans is also much more robust, including:

Automatic daily snapshots with one-month retention

Priority service restoration on disruptions

Getting Started

So how do we provision and use a Heroku PostgreSQL instance? Fortunately, this is a very simple task.

Tip

A prerequisite of following this walkthrough is to have a Heroku account, verified with a credit card. To sign up, if you haven’t already, visit http://postgres.heroku.com/signup.

To provision a database, follow these steps:

Go to your database dashboard, the page that loads immediately after logging in.

Click the + in the header.

Select the Dev Plan (which is free) when you are presented with a list of the options available.

Click Create Database. You should now see a message that the database is being created. A few seconds later, this page will change to show you the name of your new database (e.g., “protected-wave-27”).

Click the database name.

Your PostgreSQL database is now provisioned and ready to use. From the current screen, you can see information such as the host, database name, and credentials needed to connect.

By clicking the cog at the upper-right corner of the Connection Settings panel, you are able to select one of the commonly used methods of connecting to your database, and get a pregenerated string that you can use in your application. For instance, if you were to select ActiveRecord (the Ruby on Rails ORM), you would see something similar to this:
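A hypothetical sketch of such a database.yml entry (every value below is a placeholder; your real host, database name, and credentials come from the Connection Settings panel):

```yaml
production:
  adapter:  postgresql
  encoding: unicode
  pool:     5
  host:     ec2-00-00-00-00.compute-1.amazonaws.com
  port:     5432
  database: your_database_name
  username: your_username
  password: your_password
```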

You will also be able to see other information, such as the database type and some basic statistics regarding size, version, and availability. From here, you are also able to see your current data clips and snapshots. More on that later.

Importing and Exporting Data

Once you have your database, how do you get your preexisting data in? And how do you get your data out of Heroku PostgreSQL for further querying offline? To make this process as efficient as possible, Heroku has ensured that there is no lock-in and that you can access your data in the normal PostgreSQL ways.

Importing Data

In many cases, you will want to import existing data into your database for use by your application. This is done via a dump file created from another PostgreSQL database:

$ pg_dump -Fc --no-acl --no-owner my_old_postgresql > data.dump

Alternatively, a dump can be created with pgAdmin, the GUI tool for PostgreSQL administration. To do so, select a database from the Object Browser and click Tools → Backup. Set the filename to data.dump, use the COMPRESS format, and (under Dump Options #1) choose not to save Privilege or Tablespace.

The primary option for importing this data into your database is to use pg_restore.
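A sketch of that restore, assuming the host, username, and database name shown in your Connection Settings panel (the angle-bracketed values below are placeholders, not real values):

```shell
$ pg_restore --verbose --clean --no-acl --no-owner \
    -h <host> -U <username> -d <database> data.dump
```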

Once complete, you should be able to connect to your database, and query the database as if it were local.

Exporting Data

So, now that you’ve got some data in your database, how do you get it out again? You could query it and write it all down on some slips of paper, but this will take a fair while, so it’s probably best to use one of the many more efficient methods.

Snapshots

Let’s say you need to do a full database export. You can download a database snapshot, which is essentially identical to the dump file that you created when importing. To create a snapshot, log in to your Heroku PostgreSQL account and go to your database list. Select the database you’re interested in and click the + in the Snapshots section. A few seconds later, the snapshot will have been created and can be downloaded.

Once downloaded, restore the dump to your local database as normal.

CSV Exports

But what if you don’t want to download the whole database? Heroku will let you export the results of a single query as a comma-separated values (CSV) file via a SQL command (as built into PostgreSQL itself). In order to do this, connect to your PostgreSQL database (as described earlier in the chapter) and open up a new query window. Once done, issuing a query such as the following will create a new file locally containing the query data:

COPY (SELECT * FROM products) TO 'dump.csv' DELIMITER E'\t';

where:

SELECT * FROM products

is the query that you wish to download.

PGBackups

A very common requirement is backups. Although Heroku strives to ensure that there is no data loss in the event of some sort of digital catastrophe, it’s always a good idea to make sure you keep backups for your own peace of mind.

In order to use PGBackups, you must first provision the add-on. Note that you will have no access to data backups without this add-on:
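At the time of writing, the add-on is provisioned through the Heroku CLI; a sketch (the app name myapp is a placeholder):

```shell
$ heroku addons:add pgbackups --app myapp
```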

Data Clips

One common need of a database administrator is to share the data in her database with other people. Generally speaking, as these people are rarely technical, sharing dump files with them might not be the best approach. What’s more, this data often needs to be provided on a periodic basis. For instance, you may need to deliver an export of year-to-date sales by month to your manager each and every month.

Sharing information on the Internet is done by sharing URLs. URLs identify locations, books, videos, and even source code. Until now, there hasn’t been a convenient way to share data inside a database. That’s why Heroku introduced data clips. They are a fast and easy way to unlock the data in your database in the form of a secure URL.

Data clips allow the results of SQL queries on a Heroku PostgreSQL database to be easily shared. Simply create a query against a database within your databases list, and then share the resulting URL with coworkers, colleagues, or the world.

Data clips can be shared through email, Twitter, IRC, or any other medium, because they are just URLs. The recipients of a data clip are able to view the data in their browser or download it in JSON, CSV, XML, or Microsoft Excel formats.

As data changes rapidly in databases, so can data clips. They can either be locked to a point in time or set to refresh with live data. When locked to a point in time, data clips are guaranteed to show an unchanging snapshot of data, even if they are viewed hours, days, or years after the clip was created. Alternatively, when data clips are set to “now,” they provide a live view into the database in its current form.

Followers

So, now that we’ve talked about some of the data-in and data-out tasks that we commonly have to do day to day, let’s talk more about the features that relate more to running Heroku PostgreSQL databases in an operational environment.

Although your database is hosted by Heroku and administered by some of the best DBAs in the business, there is still risk. Your database is running on a single machine, for instance, and should that machine go pop, you’re looking at some downtime until the database is moved to another location. This is why Heroku developed followers.

Followers are essentially the same as replicated databases. This means that you have a single master database that receives read and write queries and a follower, which is another identical database that mirrors the master database from a data point of view.

There are a few reasons to do this. First, as upgrading (or downgrading) your database requires replacement of the database itself, creating a follower first in the new plan allows you to simply switch across and then decommission the old database with very little downtime.

Another reason is to create a backup database that you can fall back on should the first one fail. You could also use followers to split your load, directing write queries at the master and read queries at a follower, to alleviate the load on a single database.

Note, however, that followers cannot be instantly created, and that they take time to populate. Therefore, once you’ve created a follower, you will need to interrogate its completeness via the Heroku command-line interface:
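A sketch of that check (the app name, plan, and color-coded database name are placeholders, and the exact add-on syntax may differ by CLI version):

```shell
$ heroku addons:add heroku-postgresql:ronin \
    --follow HEROKU_POSTGRESQL_RED --app myapp
$ heroku pg:info --app myapp
```

The pg:info output reports how far the follower is behind the master; once it has fully caught up, it is ready for use.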

Fast Database Changeovers

By now, you should see that it is possible to swap out your database for a new, more powerful one with very little downtime.

This changeover uses followers to minimize the downtime in migrating between databases. At a high level, a follower is created in order to move your data from one database to another. Once it has received the majority of the data and is closely following your main database, you will prevent new data from being written (usually by enabling maintenance mode on your app). The follower will then fully catch up to the main database and be promoted to be the primary database for the application.

Note

One thing that makes this entire process much simpler is to temporarily set your application as read-only. This won’t suit some applications, but making sure that your data isn’t changing will ease the process significantly.

In order to carry out a database changeover, follow this simple process:

Create a follower for your database.

Wait for the follower to catch up using the technique we’ve explained.

As it’s important that no transactions are changing data, enable maintenance mode on your application:

$ heroku maintenance:on --app sushi

And scale every dyno down (in this instance, the workers and fast_workers dynos):
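A sketch of that scale-down (the process type names match the hypothetical sushi app from the previous step):

```shell
$ heroku ps:scale workers=0 fast_workers=0 --app sushi
```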

At this point, your original main database is now unused and can be exported, destroyed, or simply left as is. Note, though, that you will still be charged.

Forking

One last feature worth mentioning is forking. There is a common need to debug issues that occur in a live production environment: typically, issues whereby a user has somehow managed to get into a certain state. In these situations, the problems are usually data related, so the developer tasked with debugging the issue needs access to this data. You could download the data and re-create it locally, but if your database is large, this could be very inefficient.

Another scenario could be that you have a deployment coming up and you’d like to test some database changes against the production data before going live with the changes. Again, downloading a database that could be gigabytes in size would be a bad idea.

Therefore, Heroku has developed fork, which is essentially the practice of taking an existing Heroku PostgreSQL database and making a direct copy of it and the data associated with it at that particular time. This copy does not follow your main database or change in any other way, so it’s a good way of cloning a database for interrogation or testing against.

As these forked databases are just like any other database, you can connect a staging version of your application to it and run your code as normal, all without affecting your production application.

Preparing a database fork can take anywhere from several minutes to several hours, depending on the size of your dataset.

Other Features

Aside from those already described, the Heroku team is constantly adding new features to its PostgreSQL offering. This partly comes from the benefit of managing one of the largest fleets of PostgreSQL databases on the Web. The team, by virtue of the sheer number of databases under its control, is able to gather a vast amount of usage data and a comprehensive list of the pain points its users are suffering. By using this data and contributing back to the open source PostgreSQL project, the platform can constantly improve.

Let’s look at some of the key added features that PostgreSQL offers.

Extension Support

Databases are the well-known solution for storing data for an application. However, they sometimes lack functionality required by application developers, such as data encryption or cross-database reporting. As a result, developers are forced to write the needed functionality at their application layer. PostgreSQL 9.1, which already has an extensive collection of data types and functions, took the first step toward mitigating this by creating an extension system that allows the database’s functionality to be expanded.

Extensions allow related pieces of functionality, such as datatypes and functions, to be bundled together and installed in a database with a single command.

Heroku began supporting extensions in March 2012 with the release of hstore, the schemaless datatype for SQL. Users have taken advantage of hstore to increase their development agility by avoiding the need to predefine their schemas.
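A minimal sketch of hstore in use (the products table and its attributes are hypothetical):

```sql
CREATE EXTENSION hstore;

CREATE TABLE products (
    id    serial PRIMARY KEY,
    attrs hstore
);

-- Each row can carry a different set of keys; no ALTER TABLE required.
INSERT INTO products (attrs)
    VALUES ('color => "blue", weight => "1kg"');

SELECT id FROM products WHERE attrs -> 'color' = 'blue';
```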

Improved Visibility

Visibility into your data has long been a problem for many application developers. In the current version of PostgreSQL (9.2 at the time of writing), all queries are normalized and data about them is recorded. This allows you to gain insight such as:

How often is a query run?

How much time is spent running the query?

How much data is returned?

Each of these key pieces of data is critical when it comes to effectively optimizing your database’s performance.

The old way of poring through logs is no longer needed to gain this insight. Now your database contains what it needs to help you improve performance within a PostgreSQL database.

Ensuring such functionality is committed back to the PostgreSQL core is very important, as it prevents lock-in and creates a better ecosystem for the community as a whole.

Let’s take a look at how we can begin using some of this. First turn on statement tracking with CREATE EXTENSION pg_stat_statements;. Then run the next query and you’ll receive all of your top run queries:
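A sketch of both steps; the column names below are those of the pg_stat_statements view as of PostgreSQL 9.2:

```sql
CREATE EXTENSION pg_stat_statements;

-- The ten most expensive normalized queries by total execution time:
SELECT query, calls, total_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```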

JSON Support

Developers are always looking for more extensibility and power when working with and storing their data.

With PostgreSQL 9.2, there’s even more robust support for NoSQL within your SQL database in the form of JSON. By using the JSON datatype, your JSON is validated as proper JSON before it’s allowed to be committed.

Beyond the datatype itself, there are several new functions available, such as row_to_json and array_to_json. Using these functions, we can turn a row, or even an array of values, immediately into JSON to be used within an application or returned via an API:
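A sketch using a hypothetical products table:

```sql
-- One JSON object per row:
SELECT row_to_json(p) FROM products p;

-- The whole result set as a single JSON array:
SELECT array_to_json(array_agg(p)) FROM products p;
```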

Range Type Support

The range datatype is another example of powerful data flexibility. It is a single column consisting of a from-and-to value. A range can be a range of timestamps, can be alphanumeric or numeric, and can even have constraints placed on it to enforce common range conditions.

For example, this schema ensures that in creating a class schedule we can’t have two classes at the same time:
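One sketch of such a schema, assuming each class is booked into a room: the exclusion constraint rejects any two rows whose room matches and whose time ranges overlap (btree_gist is needed for the equality part of the constraint):

```sql
CREATE EXTENSION btree_gist;

CREATE TABLE classes (
    name   text,
    room   text,
    during tsrange,
    EXCLUDE USING gist (room WITH =, during WITH &&)
);

-- The second insert fails: same room, overlapping time ranges.
INSERT INTO classes VALUES
    ('Algebra', '101', '[2013-01-07 09:00, 2013-01-07 10:00)');
INSERT INTO classes VALUES
    ('Biology', '101', '[2013-01-07 09:30, 2013-01-07 10:30)');
```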