One of the questions I keep getting in response to the talks and posts about filesystem performance is "What about I/O schedulers?" Sadly, I didn't have much reliable data on this topic before, and what little I had I already published some time ago. But I've been asking myself this question too, so naturally I decided to run additional benchmarks, hopefully improving the reliability somewhat. So let's look at what I've measured.

The main reason why I haven't been very happy with the previous measurements is that the setup was not particularly representative of actual deployments. Most serious deployments don't run on a single SSD device, but at least separate the data directory and WAL somehow. As I finally received the additional SSD drives I ordered some time ago, I've modified the setup this way - one SSD for WAL, one SSD for data. This also allows setting different I/O schedulers for WAL and data separately - maybe one scheduler is good for WAL and a different one for data?

Those results suggest that in terms of average throughput, the I/O scheduler on the XLOG device does not actually matter, while the I/O scheduler on the data directory can indeed make a significant difference.

While discussing the results of various filesystem benchmarks - both here and in talks given at conferences - one of the things I've been questioning is the negligible impact of the TRIM/DISCARD mount option.

I was speculating that maybe TRIM only really matters when the drive gets almost full, because until then the controller has enough room to write the incoming data without putting the internal garbage collection under pressure. But I recently did a few pgbench runs aiming to test exactly this, using a rather large scale (filling more than 90GB of the 100GB drive), yet still saw no difference.

Another speculation was that maybe this particular SSD is really good and there's so much additional overprovisioning that TRIM still makes no difference. After all, it's a good SSD from Intel (S3700) meant for write-intensive data center applications, so I wasn't entirely surprised by that.

As you may be aware, PostgreSQL splits data files into 8kB pages by default. It's possible to use different page sizes (up to 32kB), but that requires compiling custom packages, so almost no one does that. It's quite rare to come across an installation using a different page size at a customer, for example.

Why 8kB data pages and not a different size? Clearly, smaller or larger pages might be more efficient in some cases, and indeed some other databases use different page sizes. For example InnoDB uses 16KB pages by default, and some databases even allow a mix of page sizes. So there seems to be a gain in using different page sizes, otherwise engineers working on those products would not waste time implementing it.

So what factors determine the optimal page size? And how do SSD disks change the results?

The reason why I do benchmarks is that, in my experience, many recommended settings are effectively myths. Many are perfectly fine, but a surprisingly large part either was never quite true or is based on data that has become obsolete for various reasons (e.g. changes in hardware or software). So verifying the facts is rather important for making good decisions, which is why I've been recently benchmarking file systems available on Linux, briefly discussed in the posts on EXT4 vs. XFS, BTRFS and Reiser.

In this post, I'd like to briefly look at a potential myth when it comes to databases and I/O schedulers. The purpose of the scheduler is to decide in what order to submit I/O requests (received from the user) to the hardware - it may decide to reorder them, coalesce requests into larger continuous chunks, prioritize requests from some processes etc.

In the discussion, I've been asked to benchmark ReiserFS. I hadn't really planned to benchmark this file system because ... let's say ReiserFS does not have the best reputation when it comes to reliability, due to historical reasons. According to some sources, the motto of the development was essentially "performance over correctness", which is an approach database folks don't really like. I don't know how much that's true for the current state of ReiserFS, or how it applies to the new Reiser4 (which is not merged, and I don't think that will ever happen).

But database people are rather conservative when it comes to storage reliability, so even if all the issues got fixed and the current version is reliable, it's unlikely to become very popular in this group soon. Factor in that the development of ReiserFS pretty much stopped (in favor of Reiser4, which did not get merged for years), so the code is pretty much in maintenance mode only. That is not necessarily bad, but it also means no significant reworks, which would be necessary for new features and performance improvements (and which were the purpose of Reiser4).

So let's see if ReiserFS 3 actually performs so much better than the common file systems (EXT4 and XFS), as the users often believe ...

I usually don't write rant-style posts, but today I've decided to make an exception. Over the past few months I've been working on a benchmark comparing how PostgreSQL performs on a variety of Linux filesystems, both traditional ones (EXT3, EXT4, XFS) and new ones (BTRFS, ZFS, F2FS). Sometimes the results came out a bit worse than I hoped for, but most of the time the filesystems behaved quite reasonably and predictably. The one exception is BTRFS ...

Now, don't get me wrong - I'm well aware that filesystem engineering is a complex task and takes a non-trivial amount of time, especially when the filesystem aims to integrate as much functionality as BTRFS does (some would say way too much). Dave Chinner stated that it takes 8-10 years for a filesystem to mature, and I have no reason not to trust his words.

I'm not an XFS/EXT4 zealot - I'm actually a huge fan of filesystem improvements. There's no reason to think that we can't do better on new types of storage (SSD, NVRAM) that were not available when current mature filesystems like EXT4 or XFS were designed, for example. Or that we can't provide a better filesystem by adding features that were previously available only with additional tools (e.g. snapshotting, which could previously be done only with LVM).

But perhaps a certain level of maturity should be reached before claiming a filesystem is "production ready" - it was quite funny hearing such statements a few years back, when BTRFS did not even have a fsck tool; it's much less funny hearing them now, when some distributions are either considering using BTRFS as the default filesystem or have already made the switch.

I don't care that Facebook is using BTRFS for parts of its infrastructure, because using BTRFS for a carefully selected part of the infrastructure says very little about general maturity. No other filesystem caused me as much frustration and despair during the testing as BTRFS.

In case you haven't noticed, the schedule for pgconf.eu 2015 was published a few days ago. One of my talks is called PostgreSQL Performance on EXT4, XFS, F2FS, BTRFS and ZFS and aims to compare PostgreSQL performance on modern Linux file systems (and also the impact of various tuning options like write barriers, discard etc.). It's not an entirely new talk - it's a reworked (and significantly improved) version of a talk I gave at the 5432 conference in May.

One of the rather surprising results was the EXT4 vs. XFS comparison - even though XFS is usually presented and perceived as the faster option (and EXT4 as the more "conservative" choice), the results were quite clearly in favor of EXT4. The difference was ~10%, which is not negligible. Let me briefly discuss this and also show some of the updated results that I'll present in Vienna. And come to pgconf.eu to see the whole updated talk!

Some time ago I explained that there really are two kinds of statistics in PostgreSQL, what the common issues with statistics tracking database activity are, and how (not) to fix them.

That however does not mean there are no issues with data distribution (planner) statistics, and as that's one of my areas of interest, in this post I'll discuss the usual issues with data distribution statistics, usually observed by the user as slow (or even failing) queries. I'll also mention what possible remedies are available (if any).

The chart pretty much shows an exponential speedup for the vast majority of queries - the longer the duration on 9.3, the higher the speedup on 9.4. That's pretty awesome, IMNSHO. What exactly caused that will be discussed later (spoiler: it's thanks to GIN fastscan). Also notice that almost no queries are slower on 9.4, and the few that are aren't significantly slower.

I spend more and more time debugging PostgreSQL internals - analyzing bugs in my patches, working on new patches etc. That requires looking at the structures used by the internals - optimizer, planner, executor etc. And those structures are often quite complex and nested in various ways, so exploring them with simple print statements gets very tedious very quickly:

Now it's time to discuss results of the second benchmark - TPC-DS, which models an analytical workload, i.e. queries processing large amounts of data, with aggregations, large sorts, joins of large tables, TOP-N queries etc. That is very different from pgbench, which executes queries manipulating individual rows, mostly through primary keys.

The one chart you should remember from this post is this one, illustrating how long it takes to execute 41 queries (the subset of TPC-DS queries compatible with PostgreSQL since 7.4) on a 16GB dataset (raw CSV size - after loading into the database it occupies about 5x that because of overhead, indexes etc.).

The numbers are runtime in seconds (on the i5-2500k machine), and apparently while on PostgreSQL 8.0 it took ~5100 seconds (85 minutes), on 9.4 it takes only ~1200 seconds (20 minutes). That's a huge improvement.

Notice the 8.0 results are marked with a star - that's because on 8.0 one of the queries did not complete within an arbitrary limit of 30 minutes, so it was cancelled and was counted as taking 1h. Based on several experiments, I believe the actual runtime would be even longer than that - in any case it was much longer than on PostgreSQL 8.1, where this particular query got significantly improved by bitmap index scans.

So, in the introduction post I briefly described the motivation for this whole effort, what hardware and benchmarks were used, etc. So let's see the results for the first benchmark - the well-known pgbench - and how they have evolved since PostgreSQL 7.4.

If there's one chart in this blog post that you should remember, it's probably the following one - it shows throughput (transactions per second, y-axis) for various numbers of clients (x-axis), for PostgreSQL releases since 7.4.

Note: If a version is not shown, it means "same (or almost the same) as the previous version" - for example 9.3 and 9.4 give almost the same performance as 9.2 in this particular test, so only 9.2 is on the chart.

When I gave my Performance Archaeology talk at pgconf.eu 2014 in Madrid, I mentioned I planned to do a few blog posts about that topic soon. It's been four months since the conference, so it's about time to fulfill the promise. In my defense, a lot has changed since that talk (e.g. I moved to 2ndQuadrant).

If you look at the pgconf.eu talk, you'll see it has about four sections:

I've decided to keep this structure for the blog posts too, so this post is a short introduction to the benchmarking - why I did it and what to be careful about when interpreting the data. Then there will be three blog posts, each discussing the results for one of the benchmarks.

The series of blog posts may be seen as a different way to present the talk, but in case you'd like to see the actual talk (say, on a local meetup), let me know.

So ... Prague PostgreSQL Developer Day 2015, the local PostgreSQL conference, happened about two weeks ago. Now that we have collected all the available feedback, it's probably the right time for a short report and a sketch of plans for next year.

The first year of Prague PostgreSQL Developer Day (P2D2) happened in 2008, and from the very beginning it was organized as a community event for developers - from students of software engineering to people who use PostgreSQL at work.

We've changed the venue a few times, but in most cases we've just moved from one university / faculty to another one, and the same happened this year for capacity reasons. The previous venue at Charles University served us well, but we couldn't stuff more than 120 people in, and we usually reached that limit within a week after opening the registration. The new venue, located at the Czech Technical University, can handle up to ~180 people, which should be enough for the near future - this year we registered 150 people, and a few more ended up on a waiting list.

About two weeks ago I posted a performance comparison of PostgreSQL compiled using various compilers and compiler versions. The conclusion was that for pgbench-like workloads (lots of small transactions), the compilers make very little difference. Maybe one or two percent, if you're using a reasonably fresh compiler version. For analytical workloads (queries processing large amounts of data), the compilers make a bit more difference - gcc 4.9.2 giving the best results (about 15% faster than gcc 4.1.2, with the other compilers/versions somewhere between those two).

Those results were however measured with the default optimization level (which is -O2 for PostgreSQL), and one of the questions in the discussion below that article was what difference the other optimization levels (like -O3, -march=native and -flto) would make. So here we go!

I mentioned there are two main reasons why memory contexts were introduced. First, to simplify memory management - tracking the life cycle of all the allocated pieces, making it less burdensome for the developers and also preventing some common memory-leak scenarios.

The second goal is (of course) improving performance, because everyone knows (or rather believes) that the generic allocators (malloc/free implementations provided by your OS - kernel/libc/...) are slow. But is that still true?

After all, there are papers like Reconsidering Custom Memory Allocation essentially demonstrating that most custom allocators don't really perform any better than a generic one (although they claim that allocators based on regions - aka blocks - are a notable exception).

Also, we're getting a new kernel version every few months, presumably with improvements in memory management too. So maybe there have been significant improvements rendering the additional complexity of a custom allocator pointless?

We can't just zap the memory contexts completely, as we'd lose the first benefit (tracking the allocated pieces). But maybe we could change the implementation so that it really is just a thin wrapper around malloc/free with simple tracking ... so let's try that.

On Linux, a "compiler" is usually a synonym for gcc, but clang is gaining more and more adoption. Over the years, Phoronix published several articles comparing the performance of various clang and gcc versions, suggesting that while clang improves over time, gcc still wins in most benchmarks - except maybe "compilation time", where clang is a clear winner. But none of those benchmarks is really a database-style application, so the question is how much difference you can get by switching a compiler (or a compiler version). So I did a bunch of tests, with gcc versions 4.1-4.9, clang 3.1-3.5, and just for fun with icc 2013 and 2015. And here are the results.

This is the post I promised last week, explaining a few common issues with memory contexts. The issues mostly lead to excessive overhead, i.e. excessive usage of resources - usually memory. And that's exactly what this post is about, so whenever I say "overhead" you can read "excessive memory usage." I will also try to give advice on how to avoid those issues or minimize the impact.

unused space (because of 2^N chunks) - expected ~25% for randomly sized chunks, but can get much worse

reuse not working efficiently - we'll see some examples how this can happen

If you haven't read that post, it's probably the right time to do that. Also, if you want to learn more about memory management and allocator implementations, there's a great post at IBM developerWorks explaining it quite well and also listing many interesting additional resources (e.g. various malloc implementations).

So let's see some usual (but somewhat unexpected) examples of palloc overhead.

Last week I explained (or attempted to) the basics of memory contexts in PostgreSQL. It was mostly about the motivation behind the concept of memory contexts, and some basic overview of how it all works together in PostgreSQL.

I planned to write a follow-up post about various "gotchas" related to memory contexts, but as I was writing that post I ended up explaining more and more details about internals of AllocSet (the only memory context implementation in PostgreSQL). So I've decided to split that post into two parts - first one (that you're reading) explains the internals of allocation sets. The next post will finally deal with the gotchas.

If I had to name one thing that surprised me the most back when I started messing with C and PostgreSQL, I'd probably name memory contexts. I had never met this concept before, so it seemed rather strange, and there's not much documentation introducing it. I recently read an interesting paper summarizing the architecture of a database system (by Hellerstein, Stonebraker and Hamilton), and it actually devotes a whole section (7.2 Memory Allocator) to memory contexts (aka allocators). The section explicitly mentions PostgreSQL as having a fairly sophisticated allocator, but sadly it's very short (only ~2 pages) and describes only the general ideas, without discussing the code and challenges - which is understandable, because there are many possible implementations. BTW the paper is very nice, I definitely recommend reading it.

But this blog is a good place to present details of the PostgreSQL memory contexts, including the issues you'll face when using them. If you're a seasoned PostgreSQL hacker, chances are you know all of this (feel free to point out any inaccuracies), but if you're just starting hacking PostgreSQL in C, this blog post might be useful for you.

Last week, I briefly explained the recent improvements in count_distinct, a custom alternative to COUNT(DISTINCT) aggregate. I presented some basic performance comparison illustrating that count_distinct is in most cases significantly faster than the native COUNT(DISTINCT) aggregate. However, performance was not the initial goal of the changes, and the improvement in this respect is not that big (a few percent, maybe, which is negligible when compared to ~3x speedup against COUNT(DISTINCT)). The main issue was excessive memory consumption, and I promised to present the improvements. So here we go!

If I had to show you a single chart illustrating the improvements, it'd be this one. It compares the amount of memory necessary to aggregate the large tables from the previous post (100M rows). Remember, the lower the values the better, and red is the new implementation:

While the old implementation used 1.5GB - 2GB in most cases, the new implementation needs just ~600MB. I don't want to brag, but that's a pretty significant improvement (you could also say I messed up the initial implementation and have now merely fixed it).

Almost a year ago, I wrote a custom experimental aggregate replacing COUNT(DISTINCT). The problem with the native COUNT(DISTINCT) is that it forces a sort on the input relation, and when the amount of data is significant (say, tens of millions of rows), that may be a significant performance drag. And sometimes we really need to do COUNT(DISTINCT).

Extensibility is one of the great strengths of PostgreSQL - users of most other databases can only dream about things like defining custom aggregates. And this extension point was exactly the foundation for my idea - implementing a custom aggregate, counting the distinct items in a more efficient way.

That's how count_distinct was conceived, and the principle was really simple. For each group, a small hash table is maintained, making it trivial to keep track of distinct values (and counting them). And it worked quite well - instead of the COUNT(DISTINCT) query

SELECT COUNT(DISTINCT column) FROM table;

you can call the custom aggregate

SELECT COUNT_DISTINCT(column) FROM table;

and in most cases it was much faster (usually ~3x), without doing the sort, etc. The main disadvantage was memory consumption - the overhead of the additional hash table structure and palloc overhead was bad (and in some cases quite terrible - consuming an order of magnitude more memory than the amount of data being processed). I experimented with various hash table variants, allocation schemes, but either the impact on performance was unacceptable, or the memory consumption was not much lower. Until about a month ago ...

Every now and then, someone gets confused about resetting stats in PostgreSQL, and the impact it may have on the planner and other parts of the database. Perhaps the documentation could be a tad clearer on this, because it certainly can be confusing for those who have never had to deal with stats in PostgreSQL before. But it's not just about newbies - I wrote a patch in this area for 9.3, and I occasionally get confused too.

The most surprising fact for most users is that 'stats' may actually mean one of two things - statistics describing the distribution of the data, and monitoring statistics, tracking interesting counters about the operation of the database system itself. Each kind has a different purpose, is stored differently, and the impact when the data is missing is very different.

So let's see what the purpose of the two kinds of stats is, what the common issues are, and what happens when the data goes missing for some reason.

Let's clarify some basic terms first. First, what is a distribution? Most of the time, it's a PostgreSQL extension in a fancy package, with a META.json specification containing additional information that is not available for plain extensions - description, links to git repositories, etc. It may also contain prerequisites - e.g. which PostgreSQL versions it's compatible with (more on this later).

It's a bit more complicated though - the distribution may pack multiple extensions (e.g. the "pgTAP" distribution used in META.json example packs "pgtap" and "schematap" extensions). Also, there are distributions that pack other kinds of software, not PostgreSQL extensions - for example omnipitr provides command-line utilities for PITR, and so on. Most distributions however pack a single PostgreSQL extension, and these two terms are frequently used as synonyms (after all, the X in PGXN stands for "eXtension").

The distributions are versioned (just like rpm, deb or any other packages), and each version has a "release status" with three possible values - unstable (alpha version), testing (beta version) and stable (ready for production). This is important when doing stats, because the unstable/testing versions are somewhat expected to have bugs, and what really matters are the stable versions (because that's what people are supposed to install in production). More specifically, what matters is the last stable version of the distribution.

So let's see some stats on why pgxnclient install (i.e. essentially make install) fails ...

Over the past few years, I wrote a number of extensions, and most of them are available on PGXN. I also use extensions published by other users (and I even had a talk about it at Nordic PostgreSQL Day 2014). My experience both as a user and a developer is that it's quite difficult to keep the extensions working, for a number of reasons.

For example, when an extension aims to support a range of PostgreSQL versions, the internal APIs change from time to time, and suddenly the extension is broken. The PGXN packaging means additional complexity, which sometimes leads to other issues (e.g. when you increase the version in the control file, but forget to do that in the META.json file).

So eventually I started doing "semi-automatic" testing of my extensions - building, installing and testing the extensions across multiple PostgreSQL versions, using dumb shell scripts. This improved the quality of my extensions, but had no impact on the extensions published by others. I could probably run the tests for all extensions and somehow report the issues, but that seemed really tedious and time-consuming.

But what if there was something like pgbuildfarm.org, but for testing extensions? Something that would run the tests on a range of PostgreSQL versions, and then report the results along with all the information useful when debugging a failure (like logs from compilation, loading the extension, or diffs from the regression tests) to some central server. So I went and did that, and I'd like to present pgxn-tester.org to you.