Of course trying to understand the explain above, as a starting point, is pretty futile. Let's start with something simpler. But even before that, I want you to understand one very important thing:

PostgreSQL knows

That means that PostgreSQL keeps some meta-information (information about information): row counts, number of distinct values, most common values, and so on. For large tables these are based on a random sample, but in general, it (Pg) is pretty good at knowing stuff about our data.
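If you want to peek at this meta-information yourself, it lives in the system catalogs. A minimal sketch (assuming the "test" table and "some_column" column used in the examples below):

-- the planner's rough idea of row count and page count for the table
SELECT reltuples, relpages FROM pg_class WHERE relname = 'test';

-- per-column statistics gathered by ANALYZE: distinct values, most common values, and so on
SELECT n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'test' AND attname = 'some_column';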

So, the Seq scan line informs us that we are scanning the table in sequential mode, and that the table is named "test" (though in here lies one of the biggest problems with explain: it doesn't show the schema, which has bitten me in the ass more than a couple of times).

What do you think would be the best way to run the query? Sequentially scan the table, or use an index?

If you say: use the index, of course, there is an index on this column, so it will make it faster – I'll ask: what about the case where the table has one row, and that row has some_column = 123?

To do a seq scan, I just need to read one page (8192 bytes) from the table, and I get the row. To use the index, I have to read a page from the index, check it to find whether there is a row matching the condition, and then read a page from the table.

In total – more than twice the work!

You could say – sure, but that's for very small tables, so the speed doesn't matter. OK. So let's imagine a table that has 10 billion rows, and all of them have some_column = 123. The index doesn't help at all, and in reality it makes the situation much, much worse.

Of course – if you have a million rows, and just one has some_column = 123 – an index scan will clearly be better.

So – it is impossible to tell whether a given query will use an index, or even whether it should use an index (I'm talking about the general case) – you need to know more. And this leads us to a simple fact: depending on the situation, one way of getting the data will be better or worse than another.

PostgreSQL (up to a point) examines all possible plans. It knows how many rows you have, it knows how many rows will (likely) match the criteria, so it can make pretty smart decisions.

But how are the decisions made? That's exactly what the first set of numbers in explain shows. It's the cost.

Some people think that the cost is an estimate expressed in seconds. It's not. Its unit is "fetching a single page in a sequential manner". It is about time and resource usage.

By default, Pg used the Index Scan. Why? It's simple – it was the cheapest in this case: a total cost of 8.29, while a bitmap heap scan (whatever that would be) would be 8.30, and a seq scan – 18.5.
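How can you check what the other plans would cost? One way (just a quick experiment, not something for production code) is to temporarily disable the access paths you don't want and re-run EXPLAIN; the table and column names below are the ones from the earlier example:

-- discourage index and bitmap scans for this session, so EXPLAIN
-- will show the seq scan variant and its cost
SET enable_indexscan = off;
SET enable_bitmapscan = off;
EXPLAIN SELECT * FROM test WHERE some_column = 123;
-- back to normal
RESET enable_indexscan;
RESET enable_bitmapscan;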

OK, but the cost shows two numbers: number..number. What is this about, and why am I talking about the second number only? If we took the first number into consideration, the seq scan would be the winner, as it has 0 (zero) there, while the index scan has 0.28, and the bitmap heap scan – 4.28.

So, the range (number .. number) is there because it shows the cost of starting the operation, and the cost of getting all rows (by "all", I mean all returned by this operation, not all in the table).

What is the starting cost? Well, for a seq scan there is none – you just read a page, and return rows. That's all. But, for example, to sort a dataset – you have to read all the data and actually sort it before you can consider returning even the first of the rows. Which can be nicely seen in this explain:
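(A plan of this shape can come from any query with an ORDER BY over a small table; in the illustrative plan below, only the Sort costs are the ones discussed next, the remaining numbers are made up.)

EXPLAIN SELECT * FROM pg_class ORDER BY relname;
                              QUERY PLAN
------------------------------------------------------------------------
 Sort  (cost=22.88..23.61 rows=292 width=203)
   Sort Key: relname
   ->  Seq Scan on pg_class  (cost=0.00..10.92 rows=292 width=203)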

Please note that the startup cost for Sort is 22.88, while the total cost is just 23.61. So returning rows from Sort is trivial (in terms of cost), but sorting them – not so much.

The next piece of information in explain is "rows". This is an estimate of how many rows PostgreSQL thinks this operation is capable of returning (it might return fewer, for example in case of a LIMIT). This is also important for some operations – joins, for example. Joining two tables that have 20 rows each can be done in many ways, and it doesn't really matter which, but when you join a 1 million row table with a 1 billion row table, the way you do the join is very important (I'm not talking about "inner join/left join/…" but rather about "hash join", "nested loop", "merge join" – if you don't understand, don't worry, I will write about it later).

This number can of course be misestimated – for many reasons. Sometimes it doesn't matter, and sometimes it does. But we'll talk about misestimates later.

The final bit of information is width. This is PostgreSQL's idea of how many bytes, on average, there are in a single row returned from a given operation. For example:
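(Again an illustrative plan, not real output – a join of pg_proc and pg_namespace, sorted; the exact costs, row counts and widths are made up, but the shape is what matters here.)

EXPLAIN SELECT * FROM pg_proc p
    JOIN pg_namespace n ON p.pronamespace = n.oid
    ORDER BY p.proname;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Sort  (cost=278.67..285.35 rows=2674 width=649)
   Sort Key: p.proname
   ->  Hash Join  (cost=1.09..126.66 rows=2674 width=649)
         Hash Cond: (p.pronamespace = n.oid)
         ->  Seq Scan on pg_proc p  (cost=0.00..89.74 rows=2674 width=581)
         ->  Hash  (cost=1.04..1.04 rows=4 width=68)
               ->  Seq Scan on pg_namespace n  (cost=0.00..1.04 rows=4 width=68)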

There are 5 operations here: sort, hash join, seq scan, hash, and seq scan. PostgreSQL executes the top one – sort – which in turn executes the one directly below it (hash join) and gets data from it. The hash join, to return data to the sort, has to run the seq scan (on pg_proc) and the hash (#4). And the final hash, to be able to return data, has to run the seq scan on pg_namespace.

It is critical to understand that some operations can return data immediately and, what's even more important, gradually – for example, Seq Scan. And some cannot – for example, here we see that Hash (#4) has the same startup cost as its "suboperation" seq scan's cost for "all rows". This means that for the hashing operation to start (well, to be able to return even a single row), it has to read in all the rows from its suboperation(s).

The part about returning rows gradually becomes very important when you start writing functions. Let's consider such a function:

CREATE OR REPLACE FUNCTION public.test()
    RETURNS SETOF INTEGER
    LANGUAGE plpgsql
AS $function$
DECLARE
    i int4;
BEGIN
    FOR i IN 1..3 loop
        RETURN NEXT i;
        perform pg_sleep(1);
    END loop;
    RETURN;
END;
$function$;

If you don't understand it, don't worry. The function returns 3 rows, each containing a single integer – 1, 2 and 3. The important bit, though, is that it sleeps for 1 second after returning each row.
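With \timing turned on in psql, running it looks roughly like this (the row output is omitted, and the exact milliseconds are of course illustrative):

\timing
SELECT * FROM test();
Time: 3003.470 ms
SELECT * FROM test() LIMIT 1;
Time: 3003.342 ms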

The same 3 seconds. Why? Because PL/pgSQL (and most, if not all, other PL/* languages) cannot return partial results. It looks like it can – with "return next" – but all of these rows are stored in a buffer and returned together when the function execution ends.

On the other hand – "normal" operations usually can return partial data. This can be seen with something trivial like a seq scan on a non-trivial table:

CREATE TABLE t AS
    SELECT i AS id,
           repeat('depesz', 100)::text AS payload
    FROM generate_series(1,1000000) i;
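(Illustrative output again – the costs of 185834.82 and 0.02 are the ones discussed below; the actual times, row counts and widths are made up, but the pattern is what you would see.)

EXPLAIN ANALYZE SELECT * FROM t;
 Seq Scan on t  (cost=0.00..185834.82 rows=999994 width=608) (actual time=0.023..1494.981 rows=1000000 loops=1)
 Total runtime: 2233.902 ms

EXPLAIN ANALYZE SELECT * FROM t LIMIT 1;
 Limit  (cost=0.00..0.02 rows=1 width=608) (actual time=0.024..0.025 rows=1 loops=1)
   ->  Seq Scan on t  (cost=0.00..185834.82 rows=999994 width=608) (actual time=0.021..0.021 rows=1 loops=1)
 Total runtime: 0.057 ms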

As can be seen – the seq scan ended very fast – as soon as it satisfied Limit's appetite for exactly 1 row.

Please also note that here, even the costs (which are not the best thing for comparing queries) show that the top node (the seq scan in the first query, and the limit in the second) has very different values for returning all rows – 185834.82 vs. 0.02.

So, the first 4 numbers for any given operation (two for cost, plus rows and width) are all estimates. They might be correct, but they might just as well not be.

The other 4 numbers, which you get when you run "EXPLAIN ANALYZE query" or "EXPLAIN ( ANALYZE on ) query", show the reality.

Time is again a range, but this time it is real. It is how much time PostgreSQL actually spent working on a given operation (on average, because it could have run the same operation multiple times). And just as with cost – time is a range: the startup time, and the time to return all data. Let's check this plan:
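(The actual times in the illustrative plan below are the ones discussed next; the rest of the numbers are made up.)

EXPLAIN ANALYZE SELECT * FROM t LIMIT 100;
 Limit  (cost=0.00..18.58 rows=100 width=608) (actual time=0.008..0.152 rows=100 loops=1)
   ->  Seq Scan on t  (cost=0.00..185834.82 rows=999994 width=608) (actual time=0.007..0.133 rows=100 loops=1)
 Total runtime: 0.181 ms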

As you can see – Limit has a startup time of 0.008 (milliseconds, that's the unit here). This is because the Seq Scan (which Limit called to get data) took 0.007ms to return the first row, and then there was 0.001ms of processing within the limit itself.

Afterwards (after returning the 1st row), the limit kept getting data from the Seq Scan until it got 100 rows. Then it terminated the Seq Scan (which happened 0.133ms after the start of the query), and it finished after another 0.019 ms.

The actual rows value, just as the name suggests, shows how many rows (on average) this operation returned. And loops shows how many times this operation was run in total.

In what case would an operation be called more than once? For example, in some cases of joins, or subqueries. Like in this plan:
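(An illustrative plan over two hypothetical tiny tables, a and b; the numbers for the 3rd operation match the ones discussed next, the rest is made up.)

EXPLAIN ANALYZE SELECT id, (SELECT count(*) FROM b WHERE b.a_id = a.id) FROM a;
 Seq Scan on a  (cost=0.00..3.06 rows=2 width=4) (actual time=0.174..0.344 rows=2 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=1.03..1.04 rows=1 width=0) (actual time=0.165..0.166 rows=1 loops=2)
           ->  Seq Scan on b  (cost=0.00..1.02 rows=1 width=0) (actual time=0.155..0.160 rows=1 loops=2)
                 Filter: (a_id = a.id)
 Total runtime: 0.394 ms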

Please note that loops in the 3rd operation is 2. This means that this Seq Scan was run twice, returning, on average, 1 row, and taking, on average, 0.160ms to finish. So the total time spent in this particular operation is 2 * 0.160ms = 0.32ms (that's what the exclusive/inclusive columns on explain.depesz.com show).

Very often poor performance of a query comes from the fact that it had to loop many times over something. As in here:
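(Illustrative again – two hypothetical tables, orders and customers; the loops count and per-loop time of the 3rd operation match the numbers discussed below, everything else is made up.)

EXPLAIN ANALYZE SELECT * FROM orders o JOIN customers c ON c.id = o.customer_id;
 Nested Loop  (cost=0.00..223851.96 rows=26259 width=120) (actual time=0.051..131.218 rows=26259 loops=1)
   ->  Seq Scan on orders o  (cost=0.00..426.59 rows=26259 width=64) (actual time=0.011..14.324 rows=26259 loops=1)
   ->  Index Scan using customers_pkey on customers c  (cost=0.00..8.50 rows=1 width=56) (actual time=0.002..0.003 rows=1 loops=26259)
         Index Cond: (c.id = o.customer_id)
 Total runtime: 141.485 ms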

(Of course it doesn't mean that it's Pg's fault that it chose such a plan – maybe there simply weren't other options, or the other options were estimated to be even more expensive.)

In the above example, please note that while the actual time for operation 3 is just 0.003ms, this operation was run over 26000 times, resulting in a total time spent here of almost 79ms.

I think that wraps up the theoretical information required to read explains. You probably still won't understand what the operations or other information mean, but at the very least you will know what the numbers mean, and what the difference is between explain (shows costs in abstract units, which are based on random-sample estimates) and explain analyze (shows real-life times, row counts and execution counts, in units that can be compared across different queries).

As always, I'm afraid that I skipped a lot of things that might be important but just escaped me, or (even worse) that I assumed were "obvious". If you find anything missing, please let me know, and I'll fill it in ASAP.

But before that, let me just say that I plan to extend this blogpost with 2-3 more posts that will cover:

what the various operations are, how they work, and what to expect when you see them in explain output

what statistics are, how Pg gets them, how to view them, and how to get the best out of them
