tl;dr: Don’t use product_info (a view, not a table) in this query; move the WHERE clauses for product_name and version_string into the infos CTE; and strictly limit the number of columns in the tables being joined.

This query is unfortunately doomed because it uses product_info, a view that already contains data from product_versions, plus four other tables that this query doesn’t need at all.

As a result, you get a self-join many times over. A hint at the horrors of what Postgres decides to do with this is here:

Whenever you see so many nested joins, subquery sorts, and sequential scans mushed together in a staircase, that’s a signal to investigate whether the query we’re running is really what we thought it was.

While @peterbe dug through code with me, he mentioned that product_info was a view! Now all the self-JOINs made sense and I started refactoring.

The product_info view was being deconstructed into its component parts, which already included product_versions (resulting in a self-join), plus a bunch of junk that this query doesn’t really care about. So, as the first step, I just made a copy of the SELECT query behind the view (you can get that by running \d+ product_info in psql, or you can dig it out of the socorro/external/postgresql/procs/views section of our code).

This is pretty sad. The Sort at the top of Mt. Sadness. There are a series of sorts further down that are just HUGE, because we’re tossing around 45k records that must be joined to each other, and the plan reports a row width of 294 bytes on top of our 45k rows.

The obvious (but sadly not always effective) thing to try is to filter our rows out earlier. Because we’re already using the infos CTE, that conveniently looks possible without too much trouble.
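A sketch of the shape of that change — the infos CTE, product_versions, and the product_name/version_string columns come from our query, but the literal filter values here are placeholders and the CTE body is abbreviated:

```sql
-- Before: filters sat in the outer SELECT, applied after the big join.
-- After: push them into the CTE so far fewer rows survive to be joined.
WITH infos AS (
    SELECT pv.product_version_id,
           pv.product_name,
           pv.version_string
           -- only the columns the outer query actually uses
      FROM product_versions pv
     WHERE pv.product_name = 'Firefox'        -- placeholder value
       AND pv.version_string = '26.0a2'       -- placeholder value
)
SELECT *
  FROM infos;
```

The point is that the planner now sorts and joins a handful of matching rows instead of 45k wide ones.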

That just leaves sorting out the rapid-beta self-join which, based on my tests, should be easy to keep doing in the body of the main SELECT, at line 125.

With the changes I proposed, the estimated duration of this query is ~200 ms in stage and the query plan looks like:

Setting things up initially was pretty simple, but explaining it to a coworker after I’d set everything up for myself proved slightly more difficult. Below are my notes on how Alembic differs from other migration tools.

Terminology

Alembic calls each migration a revision. Revisions know what order to run in because each revision is given a down_revision that identifies its parent. If down_revision is None, that revision is the very first revision as far as Alembic is concerned. You can put your whole schema in that revision, or you can just start adding changes to it; Alembic doesn’t complain either way.
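As a toy illustration (this is not Alembic’s actual code, and the revision ids are made up), ordering revisions is just a matter of walking the down_revision pointers from the root:

```python
# Toy model of Alembic's revision chain: each revision names its parent
# via down_revision; None marks the very first revision.
revisions = {
    "a1": {"down_revision": None},   # initial revision
    "b2": {"down_revision": "a1"},
    "c3": {"down_revision": "b2"},
}

def upgrade_order(revisions):
    """Return revision ids in the order they should be applied."""
    parent_of = {rid: r["down_revision"] for rid, r in revisions.items()}
    # invert the mapping so we can walk parent -> child from the root
    child_of = {parent: rid for rid, parent in parent_of.items()}
    order = []
    current = child_of.get(None)     # the revision with down_revision=None
    while current is not None:
        order.append(current)
        current = child_of.get(current)
    return order

print(upgrade_order(revisions))      # ['a1', 'b2', 'c3']
```

This is also why two revisions sharing the same down_revision confuses Alembic: the chain forks and there is no longer a single order.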

A best practice would likely be putting your entire model into the first revision. I may go back and “fix” this for us later. I opted to just have the default use case be to create a database fresh with a tool we call setupdb_app.py.

If you’re looking to migrate to Alembic, you’ll also need to use SQLAlchemy. I used sqlautocode for my initial schema reflection, and there’s a newer tool, sqlacodegen, you may want to check out for generating your SQLAlchemy models for the first time.

Preparation: edit config and activate a virtualenv

I also put an alembic.ini-dist file into our project’s config/ directory, and modified alembic/env.py to include our model.

To get started working with an existing install, you’ll need to copy alembic.ini-dist to config/alembic.ini and modify it to fit your environment; the connection string and the path to the alembic directory are the two most important settings. We have a script, setupdb_app.py, which creates databases from our models.py. It takes --database_name as a command-line argument; my default for our project is breakpad.
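A minimal sketch of the two settings that matter in config/alembic.ini — the connection string and path shown here are placeholders, not our real values:

```ini
[alembic]
# path to the directory containing env.py and versions/
script_location = alembic

# connection string for the database to migrate (placeholder)
sqlalchemy.url = postgresql://postgres@localhost/breakpad
```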

We use a virtualenv called socorro-virtualenv. The virtualenv is created automatically if you run make test. If you’re creating a standalone virtualenv, you can do that with virtualenv socorro-virtualenv. Activate this with . socorro-virtualenv/bin/activate.

Creating a revision

Create a fresh database to work from. For Socorro, the command is: PYTHONPATH=. socorro/external/postgresql/setupdb_app.py --database_name=breakpad

Edit models.py with the change to the schema.

Run: PYTHONPATH=. alembic -c config/alembic.ini revision -m 'your message about the migration'. The output will include the name of the new file.
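The new file in versions/ is a skeleton you fill in by hand. A hypothetical example — the filename, revision ids, table, and column here are all made up:

```python
# versions/1f2e3d4c5b6a_add_featured_flag.py  (hypothetical filename)
revision = "1f2e3d4c5b6a"       # this revision's id
down_revision = "0a1b2c3d4e5f"  # parent revision's id

def upgrade():
    # imported inside the function so the file can be inspected
    # even where Alembic isn't installed
    from alembic import op
    import sqlalchemy as sa
    op.add_column("products", sa.Column("featured", sa.Boolean(),
                                        server_default="false"))

def downgrade():
    from alembic import op
    op.drop_column("products", "featured")
```

Every revision pairs an upgrade() with the downgrade() that undoes it; Alembic runs whichever direction you ask for.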

If all goes well, your revision is ready! If something goes wrong, edit and try again. The revision will automatically roll back if there are any errors.

Downgrades are a little tricky to properly execute. In an ideal world, you’d be able to revert the underlying code, but preserve only the commit containing the migration. More on this in a future blog post!

Creating a revision using --autogenerate

This is very similar to the above, with the addition of --autogenerate to your revision command. This should do the right thing, but definitely check your generated file for accuracy.

Create a fresh database to work from. For Socorro, the command is: PYTHONPATH=. socorro/external/postgresql/setupdb_app.py --database_name=breakpad

Edit models.py with the change to the schema.

Run: PYTHONPATH=. alembic -c config/alembic.ini revision --autogenerate -m 'your message about the migration'. The output will include the name of the new file.

If all goes well, your revision is ready! If something goes wrong, edit and try again. The revision will automatically roll back if there are any errors.

Production deployment

You’ll need to deploy an alembic.ini on your production database system and probably a virtualenv to support your python modules.

We deploy our virtualenvs with our application, so this step was pretty simple for everything except Alembic itself. The virtualenv put in full, static paths for the Python binaries, and had some dependencies that I haven’t figured out yet for actually running Alembic. To get around this, I created a virtualenv locally on the system for the postgres user. Having the postgres user run the migrations locally is a must for me, because I need filesystem access to pull in new versions of user-defined functions stashed in the directory my model lives in.

I just deploy a new release of our application code on the database server locally, and then I run alembic against the versions directory that’s deployed.

FAQ

And here’s an FAQ for the common problems folks ran into:

OOPS I forgot to create a database before I created a revision!

To “fix” this, try:

Create the database from scratch using your current models.py.

Run: PYTHONPATH=. alembic -c config/alembic.ini downgrade -1

Run: PYTHONPATH=. alembic -c config/alembic.ini upgrade +1

Assuming your downgrade function works, this should allow you to reverse the latest revision and then test your migration.

Error message: “Only a single head supported so far.”

This means two revision files in your versions directory claim the same down_revision, so the history branches into more than one head. Edit the newer revision’s down_revision so the revisions form a single chain again.

I’m using schemas, and alembic doesn’t recognize them when I try to use --autogenerate.

See include_symbol. And be sure to add this to both the “offline” and “online” versions of the revision code in env.py.
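A sketch of such a filter in env.py — the schema name here is a placeholder, and the function must be passed as include_symbol= to context.configure() in both places:

```python
# alembic/env.py (fragment): tell --autogenerate which tables to consider
def include_symbol(tablename, schema):
    # reflect the default schema plus one extra schema;
    # "bixie" is a placeholder -- substitute your own schema names
    return schema in (None, "bixie")

# context.configure(..., include_symbol=include_symbol) must be added in
# BOTH run_migrations_offline() and run_migrations_online()
```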

Error message: Target database is not up to date.

This means you’ve got a file in your versions directory that contains one or more migrations that haven’t been applied to the current database. You can either apply them with alembic upgrade head or have a look in that directory and remove the migration(s) that you don’t want.

We’re seeing Postgres become the default technology choice for new tech companies. As those companies grow or are acquired, we’ll see Postgres becoming an important part of many more organizations.

It’s no coincidence that Bruce posted this morning about “The New Postgres Era”. The quality of the Postgres community’s software products is recognized, and the developers who use our software are succeeding.

A friend told me about their sharding scheme last night, and it made me very curious about how others are handling this problem. This question about database design turns into a devops issue, so it’s something the entire development group, devops, and the DBAs need to be aware of and concerned about. And it’s not a problem exclusive to Postgres.

Postgres Open‘s website is back up! Next week, we’ll be announcing sponsorship opportunities for 2012 and opening our Call for Speakers shortly after.

Some IT guys discovered Postgres and made a series of videos about what they’re learning. They came from Windows backgrounds, so their observations were a great introduction to what it’s like for non-Linux-y people when they first try Postgres out.

Today, the PostgreSQL Global Development Group released new versions of all active branches. This includes three security bugfixes, two of which are pretty obscure and one that fixes a possible security issue with restoring un-sanitized output from pg_dump. Details about the security issues are included in the release announcement.

I’ve been sick for a few days, so I settled in with a nice cup of tea and started in on the tremendous backlog I’ve got on pgsql-hackers. I put patch status at the end of each paragraph.

Slides (as of this moment) are here: Mistakes were made. I changed quite a bit of the beginning and end, given how big the audience is. Previous talks, we’ve usually ended with a fun “omg, here’s the craziest story I know” session. I imagine we’ll get a little bit of that today.