2011 December 30

Announcing postgresql-simple

I’ve done a bit of database programming off and on over the years, and when I started into a larger project and decided to use Haskell and PostgreSQL, I didn’t understand exactly how bad of a choice Haskell was for PostgreSQL development at the time. So postgresql-simple and postgresql-libpq is hopefully a small step towards remedying the situation. This work is a fork of Bryan O’Sullivan’s mysql-simple library and Grant Monroe’s libpq binding, with some code informed by Chris Done’s pgsql-simple library, and I’d to thank all for their terrific work.

So, there are five things that, taken together, distinguish postgresql-simple from the other PostgreSQL options on hackage:

Chris Done’s pgsql-simple is a pure Haskell implementation of a small subset of the PostgreSQL frontend/backend protocol. I do think there is technical merit to a native Haskell implementation, and in particular I’ve identified a potentially significant advantage that cannot be achieved via libpq. However writing a native implementation is a significant undertaking, and such efforts invariably lag behind in terms of features. And though pgsql-simple is experimental software that hasn’t been officially released and I don’t want to be unfair to it, pgsql-simple performs badly. I sped up a program I had been using to test pgsql-simple by over a factor of 20 simply by switching to postgresql-simple.

Among other things using libpq means that postgresql-simple gets the full range of connection and authentication options out of box, including support for Unix Domain Sockets, SSL, and GSSAPI.

2. postgresql-simple uses libpq-based string escaping.

HSQL doesn’t provide parameterized SQL statements at all, requiring programmers to dynamically generate sql and handle escaping issues themselves. And experience has shown punting on this issue is unacceptable, as dynamically generating SQL is risky and forcing programmers to handle escaping issues is an order of magnitude worse.

HDBC and pgsql-simple support parameterized SQL, but they attempt to do the escaping themselves. The problem is that this is usually a bit more subtle than it first appears, leading to bugs and potential vulnerabilities. For example, in PostgreSQL, escape syntax depends on the value of the standard_conforming_strings server setting, which libpq will detect and accommodate accordingly.

In PostgreSQL, a prepared statement allows one to send off a parameterized sql statement to a backend, and then re-use that statement as many times as you like by just filling in the parameters. This means the text of the prepared statement does not repeatedly traverse the wire, the backend does not parse the sql query multiple times, the backend re-uses the query plan generated when the statement was prepared, and you get to use protocol-level parameters which can eliminate the need for escaping strings and converting numbers to base-10 and back.

Note that prepared statements are very often, but not universally, advantageous. Starting from nothing, they require two round trips to get any information out, so preparation can be disadvantageous for one-shot queries. Also, occasionally re-planning a query can be advantageous, so infrequently executed queries can be harmed by preparation as well.

HDBC nominally supports prepared statements, but in fact all HDBC-postgresql does is cache some of the preprocessing of the query. No preparation in the database sense ever occurs. So while the lack of support for prepared statements is a definite disadvantage, neither does postgresql-simple pretend to prepare statements when it doesn’t.

4. postgresql-simple provides a simple API that most programmers would find familiar

Takusen appears to be the one PostgreSQL database access library for Haskell that gets basic implementation details more-or-less correct. Unfortunately it exports an esoteric API that is not applicable in all situations. In particular, web applications often use various forms of connection caching or pooling, which is fundamentally incompatible with the deterministic connection resource guarantees provided by Takusen.

Also, the *-simple libraries have a relatively nice interface compared to HDBC. Ultimately that was the breaking point that caused me to spend the time to create postgresql-simple; no one database library did everything I needed, and while working on some new code where my mangled fork of HDBC made the most sense, I realized I really wished I was using pgsql-simple instead. Also, I paid attention to ensure that application code could add support for user-defined PostgreSQL types with a minimum amount of fuss and without modifying the library, something that neither HDBC nor pgsql-simple could really do.

5. Support for Listen/Notify

Listen/Notify is the perfect solution when you want to write a program that responds to changes made to a PostgreSQL database. Informing these programs when changes are available consumes less resources and provides lower latencies than periodic polling. And since you can use a rule or trigger to send the notification, these notifications can be robust; you don’t have to assume that the program making the changes even knows that anybody wants to be informed of the changes. Listen/notify is aware of and consistent with transactions; notifications don’t get sent until the transaction commits. Even if you are willing to put that kind of logic in the database clients, using listen/notify solves the otherwise sticky problem of finding and coordinating with the other clients.

Looking Forward

Creating a mid-level database access library based on libpq is a significant undertaking, and there is an awful lot that isn’t supported, including prepared statements, binary data formats, copy in/copy out support, more and better support for PostgreSQL data types, and PostGIS support. Many other things could be improved, such as using libpq asynchronously, better end-to-end typechecking that discovers more errors at compile time, and other interface improvements. For example, SqlQQ is a simplistic quasiquoter to improve the literal syntax for multi-line SQL queries; one could certainly imagine extending the syntax to support including a $haskellvariable as a SQL parameter instead of going through the syntactic indirection of ? in the SQL string with haskellvariable in a separate Haskell parameter.

Funnily enough, I ran across this thread in which PostgreSQL luminaries were complaining about the quality of Python’s PostgreSQL libraries. And Python is considerably ahead of Haskell in this regard.

Like this:

Related

As a total beginner ramping up to create a tiny website based on Snap and PostgreSQL, your post and library comes at a perfect time – thanks!

In the Perl world there is an common database interface module, DBI, that is used to connect to all kinds of databases, using drivers specifically for each, i.e. DBD::Pg, DBD::mysql, DBD::DB2, DBD::Oracle, DBD::SQLite etc. Does a similar Haskell-project exist? It seems that a lot of generic stuff gets duplicated if you need a full database interface module per database (and switching backends become more of a hassle).

Yes, HDBC attempts to be a DBI-like interface, with different backends. And while I see value in that approach, I also think that there is frequently good reason to commit to a particular database and use all the implementation-specific features and advantages conferred. Unfortunately standard SQL seems a little too restrictive for many purposes, and even then implementations violate the standard altogether too often.

Hopefully mysql-simple and postgresql-simple will be sufficiently close to each other that moving backends won’t be too difficult, if you’ve paid attention to what database types you use, and steer clear of a lot of useful features, etc.

It looks like the library doesn’t use postgres query parameters underneath; it does the substitution itself, is that right?. Did you choose to do that so that you could identify errors in Haskell before querying the database?

In HDBC, there’s a parsec parser that replaces ‘?’ with $1, $2, and so on, and uses query parameters in the postgres query. We found that the parser was horribly inefficient, and that it was using a majority of the CPU time and memory allocation in our program. Hopefully yours is a lot faster, since it is written in attoparsec. Have you profiled your parser?

We have no need for the library to replace ‘?’ with $1 for us. We want to be able to write our queries using $-style parameters, but it looks like postgresql-simple does not support that. Would you consider adding it?

Yeah, that parsing step is what HDBC’s preparation step caches. IIRC, I think I removed that step and used the $n parameters in some of my apps. I haven’t personally benchmarked postgresql-simple’s parser, which was written by Bryan O’Sullivan and imported from mysql-simple with no changes.

Sure, I’d add support for that, though I don’t know when I’d get to it. Offering a patch would likely speed the process…

Ahh, so I wasn’t correctly recalling the internals of HDBC-postgresql (or even the details of libpq, for that matter.) HDBC does not do escaping itself, but rather uses execParams. The reason postgresql-simple doesn’t use protocol level parameters is because mysql-simple didn’t, and this was the most straightforward port of mysql-simple to PostgreSQL I could manage.

However, my understanding of the client protocol is that only prepared statements support protocol-level parameters; what is unclear to me at the moment is whether execParams is just a convenient composition of prepare and execPrepared that involves two round trips to the server to excute a query, or if execParams uses a certain trick so that there is only one round trip involved. I figure there is a decent probability of either possibility, though I would definitely favor the latter. I suppose I’ll have to fire up WireShark and/or read the source.

There is one definite disadvantage of execParams, and that is it doesn’t allow multiple SQL statements to be bundled in one request, which is a pretty silly and arbitrary restriction in my opinion. This means that more round-trips are needed in some cases, and that parameter values will have to be sent multiple times in others.

There is a protocol-level trick that you can issue addition requests before the backend finishes responding to the current request, and this trick could be used to reduce the number of round trips. However, libpq does not support this trick, and must finish receiving the response for a given request before it can issue the next one, though perhaps execParams is a special-case of this trick.

And this is the advantage of a native Haskell solution that I was referring to; a native Haskell library can use this trick and, with the careful use of non-strict evaluation, not muck up the interface exposed to the programmer. From the programmers perspective it would still be a bunch of sequential, fully synchronous requests, but the library would be able to issue a few more queries before it has gotten the complete response to the current query. (As long as it has all the results needed to issue the next query, of course.)