SQL Database Version Control – apgdiff

In “Database Development: Table Structure Changes,” Kenneth Downs argued that build scripts to modify table structures “are usually the least tested part of the system.” He suggested using a single metadata file that describes the database and can be placed under version control, and a difference engine as the tool of choice to perform database upgrades.

Another PostgreSQL Diff Tool (aka apgdiff) is a utility that implements such an engine. For its metadata, apgdiff takes two pg_dump -s output files, representing previous and new database schemas, and generates ALTER TABLE and other SQL statements needed to upgrade from previous to new.

In this installment of our series on using VCS tools while developing a SQL-based application, we’ll see how Carol and Dave can use apgdiff to control the changes described in Version Control, Part 2: SQL Databases.

Version 0.1

To start off, Carol creates the film table in her database, caroldev, using the CREATE TABLE script, version 0.1, which we saw in the previous article. She also creates a general development database, moviesdev, and issues the following commands:

$ pg_dump -s moviesdev > film0.sql
$ pg_dump -s caroldev > film.sql

The film0.sql file is essentially empty, consisting only of some SET, REVOKE and GRANT statements. The film.sql file includes statements to CREATE and ALTER the film table, as can be seen in the following diff:

In this initial phase, the diff and apgdiff outputs are not that … different. The value of apgdiff will be more obvious in subsequent versions.

As part of the 0.1 release process, Carol commits film.sql to the VCS (she could also commit film0.sql as the film.sql base version, if desired). She may also want to store the apgdiff output in the VCS, since the latter is what will actually be used to upgrade the production database(s).

Version 0.2

For the next version, Dave clones the VCS repository and uses the film.sql script or the saved apgdiff output to create the film table in his own database. He then issues an ALTER TABLE to add the length column, invokes pg_dump -s to create a new film.sql and commits that to his repository. Carol uses another ALTER TABLE to add the rating column in her database, and stores the new pg_dump output to her repository.

Manny, the integration manager, merges the new film.sql versions into the project-wide repository (he has to deal with table ownership statements but perhaps the pg_dump outputs can be stripped of such statements prior to committing them to the VCS). Manny then uses apgdiff to produce the statements needed to upgrade to version 0.2:

Version 0.3

For the next round, Carol and Dave issue the statements shown under Version 0.3 in the previous article and commit newer versions of film.sql which Manny then integrates into the project repository. The apgdiff output comparing the two versions is as follows:

You may notice a small discrepancy when running the above (tested with PostgreSQL 8.4.5 and apgdiff 2.3) to upgrade a second database and then compare the output from the pg_dump -s commands. In the original database, the category_category_id_seq sequence is owned by category.category_id, but this is missing from the upgraded database.

Review

Apgdiff appears to be a useful tool to manage changes to PostgreSQL databases in conjunction with a VCS. It does require that you run pg_dump -s and that you store its output in the VCS. However, apgdiff could be used against a set of manually-maintained SQL scripts.

Apgdiff requires Java 1.6 and Debian, Ubuntu and Gentoo have native apgdiff packages. It supports several (but not all) PostgreSQL features. For example, it omits the TYPEs, DOMAINs and AGGREGATEs used in the Pagila sample database (it was able to recreate 14 out of 21 Pagila tables). It depends on parsing SQL DDL syntax —albeit not fully, so it is susceptible to syntax changes between PostgreSQL releases. It cannot deal with RENAMEs, e.g., if the rating column is renamed to mpaa_rating, apgdiff will drop the first and add the second—not ideal in a populated database.

Apgdiff does not operate on the actual database being upgraded but merely on textual representations. On the plus side, it can be used to generate backout scripts, e.g., by diffing film.sql version 0.3 vs. 0.2 you can generate statements to undo the changes from 0.2 to 0.3. The next tool we’ll review is particularly concerned with this aspect of the upgrade process.

17 thoughts on “SQL Database Version Control – apgdiff”

thank you for this article. i’m the author of apgdiff and just noticed you drived some traffic to apgdiff site :-) i linked to this post from Links page from apgdiff website.

i think this is exatly the kind of article devs need, but i myself did not have time to investigate this topic and write on it … so you saved me some time :-)

i will fix the mentioned issues someday, just need to find some time among the flood of other work. also, i accept patches for apgdiff in case somebody makes some fix or enhancement and wants to share it.

What about the patch I submitted to you for TYPEs support, about 2 years ago ? ;)
I guess I’ll have to adapt it to the latest version of apgdiff, now.
If you have any comments about it, please tell me so I may improve it.

:x well, i work on the tool when i have some spare time, and 2008 i proly did not have much time for apgdiff, if any. in 2010 i rewrote apgdiff completely so it now uses its own sql parser instead of parsing the statements using regular expressions so the patch is now obsolete and needs rewrite to be usable with latest version of apgdiff. sorry for that, i should first apply it and then do the rewrite.

apgdiff is pretty cool, but as you say, it doesn’t support all PostgreSQL features. So if you use any of those features, it is worse than useless as a tool to help you manage your database upgrades, because it might miss important details. Seeing how fast PostgreSQL develops and adds new features, I’m somewhat discouraged because I don’t see how a tool like apgdiff can keep up and stay useful unless a lot more resources are put into it.

Like any open source project, apgdiff depends on the interest of its audience/community. My own project (Pyrseas) also depends on such interest. Without revealing too much, one of the tools I’m working on outputs the database structure in YAML format. Josh Berkus said that would be a valuable tool on its own, but warned me that dealing with all the db objects supported by Postgres –and their dependencies, would be quite a task. Ideally, this could become part of pg_dump, e.g., pg_dump -s --format=yaml (or json), but that is even more daunting. I’m working, presumably like Miroslav, on the features that interest me. Whether the tool eventually supports all features is contingent on others finding the tool interesting and useful enough that they’re willing to spend some time and effort extending it.

As I told you, I’m using Pyrseas in Windows. A workmate has shown me apgdiff and then I’ve found that you already commented it in your “Schema VC” page. I’ve tested it and I’m trying to compare it with Pyrseas. Trying to find advantages and inconvenients, because at the end, I should choose one of them. At the moment, I’ve synchronized diferent a database with 500 tables with several changes and both succeed.

I wonder why you’ve developed Pyrseas, that has supposed a big development efford for you, when there was already apgdiff?

Finally, have you had in mind a way to synchronize master tables data and not only the schema?

As explained in the original post above, when developing Pyrseas I was first influenced by Andromeda. Using a standardized description of the database schema in a YAML format (YAML schema description or YSD) is IMHO much easier to work with. Trying to compare SQL to SQL requires the ability to parse SQL (something that the Postgres DBMS already does and that can sometimes change from release to release). Furthermore, if a tool such as apgdiff ever wanted to go beyond Postgres, e.g., support SQL Server or Oracle (maybe even the EnterpriseDb Oracle-like extensions to Postgres), then it would have to be able to parse different dialects of SQL. OTOH, the YSD can be parsed into Python structures almost effortlessly thanks to PyYAML. In addition, the YSD could be extended to support other DBMSs much more easily.

Although a YSD doesn’t require SQL parsing, building the YSD from the database requires reading its catalogs (something that a tool such as apgdiff gets for free thanks to pg_dump) and thus requires a database connection while running the tools. A SQL-to-SQL schema comparator can be connectionless which can be regarded as an advantage but could also be a shortcoming since it’s not operating on live data.

Another inducement to developing Pyrseas was outside of the realm of version control but –like Andromeda– involves creating flexible database UI applications, the subject of another series of posts.

I have been thinking about other features that could be added to Pyrseas that would make it more useful in a general development environment, where it’s not just the schema that changes, or where you may want to apply or undo changes in a pre-planned mode (a la Liquibase). I’d love to hear what your needs are (although I can’t promise anything). May I suggest that you join our mailing list?

Just a question to the writers of Agpdiff. I read something about including PRIMARY KEY and CONSTRAINT in the table definition on GitHub but the example was incomplete. Is it possible with agpdiff 2.4 to have the PRIMARY KEY and CONTSTRAINT defined in the table definition?

Niclas, I’m afraid you’re barking up the wrong tree. If you have an issue with Apgdiff, I think that your best bet may be on GitHub. They had a mailing list but it doesn’t seem to be active or accessible.