The schemaname, tablename and attname columns seem obvious. inherited simply says whether the statistics for this table also include values from tables that inherit this column.

So, if I had:

CREATE TABLE z () inherits (test);

and then put some data into table z, then the statistics for table test would have “inherited = true”.
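A minimal way to see this yourself (just a sketch, reusing the test table from this series and the z child table created above):

INSERT INTO z SELECT * FROM test LIMIT 1000;
ANALYZE test;
SELECT attname, inherited, n_distinct FROM pg_stats WHERE tablename = 'test' ORDER BY attname, inherited;

After the ANALYZE, each column of test should show up twice in pg_stats: once with inherited = false (the parent's own rows) and once with inherited = true (parent plus children).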

The rest of the columns mean:

null_frac – how many rows contain a null value in the given column. This is a fraction, so its value goes from 0 to 1.

avg_width – average width of data in this column. For constant-width types (like int4 here) it is not really interesting, but for any datatype with variable width (like text/varchar/numeric) it is potentially interesting.

n_distinct – a very interesting value. If it is positive (1+), it is simply the estimated number (not a fraction!) of distinct values – we can see this for the all_the_same column, where n_distinct is correctly 1. If it is negative, though, its meaning is different: it then says what fraction of rows has a unique value (the query sketch after this list shows how to turn that back into an estimated count). So, in the case of almost_unique, the stats suggest that 92.146% of rows have a unique value (which is a bit short of the 95.142% I showed earlier). The values can be incorrect due to the “random sample” thing I mentioned earlier, and will explain more in a bit.

most_common_vals – array of the most common values in this column

most_common_freqs – how frequent the values from most_common_vals are – again, these are fractions, so each can be at most 1 (though then we'd have only one value in most_common_vals). Here, for almost_unique, we see that Pg “thinks” the values 21606, 27889, 120502, 289914, 417495, 951355 are the ones that happen most often – which they are not, but this is, again, caused by the “random sample” effect

histogram_bounds – array of values which divide (or should divide – again, the “random sample” thing) the whole recordset into groups with the same number of rows. That is – the number of rows with almost_unique between 2 and 10560 is (more or less) the same as the number of rows with almost_unique between 931785 and 940716

correlation – this is an interesting statistic – it shows whether there is a correlation between the physical ordering of rows on disk and the values. It can go from -1 to 1, and generally the closer it is to -1/1, the more correlation there is. For example – after doing “CLUSTER test USING i2”, that is, reordering the table in almost_unique order, I got a correlation of -0.919358 – much better than the -0.000468686 shown above.

most_common_elems, most_common_elem_freqs and elem_count_histogram are like most_common_vals, most_common_freqs and histogram_bounds, but for non-scalar datatypes (think: arrays, tsvectors and the like).
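If you want to eyeball these statistics for a single column, a query along these lines works (just a sketch against the series' test table; the CASE expression turns a negative n_distinct back into an estimated count using reltuples from pg_class):

SELECT s.null_frac, s.avg_width, s.n_distinct,
       CASE WHEN s.n_distinct >= 0 THEN s.n_distinct
            ELSE -s.n_distinct * c.reltuples END AS estimated_distinct,
       s.correlation
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename
WHERE s.tablename = 'test' AND s.attname = 'almost_unique' AND NOT s.inherited;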

Based on this data, PostgreSQL can estimate how many rows will be returned by any given part of a query, and based on this information it can decide whether it's better to use a seq scan, an index scan or a bitmap index scan. And when joining – which should be faster: a Hash Join, a Merge Join or perhaps a Nested Loop.
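These estimates are visible directly in the rows= part of EXPLAIN output; for example (the query shape here is just an illustration):

EXPLAIN SELECT * FROM test WHERE almost_unique BETWEEN 1000 AND 2000;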

If you looked at the data above, you could have asked yourself: the output is pretty wide – there are many values in the most_common_vals/most_common_freqs/histogram_bounds arrays. Why are there so many?

The reason is simple – it's configuration. In postgresql.conf you can find the default_statistics_target variable. This variable tells Pg how many values to keep in these arrays. In my case (the default) it's 100. But you can easily change it – either by changing postgresql.conf, or even on a per-column basis:
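The exact command isn't shown here, but the per-column change is done with ALTER TABLE ... SET STATISTICS. A minimal sketch, assuming a target of 10 on the almost_unique column (which would match the 3000-row sample mentioned next):

ALTER TABLE test ALTER COLUMN almost_unique SET STATISTICS 10;
ANALYZE VERBOSE test;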

Please note that the second analyze tested only 3000 rows – not 30000 as the first one did.

This is the “random sample".

Analyzing all rows in a table would be prohibitively expensive for any medium or large table.

So, Pg does something rather smart.

First – it reads a random subset of the pages in the table (reminder: each page is 8kB of data). How many? 300 * statistics_target.

Which means, in my case, with default_statistics_target = 100, it would read 30000 pages (my table doesn't have that many, so it read all of them instead).

From these pages, ANALYZE gets just the information about live and dead rows. Afterwards, it takes a random sample of rows – again 300 * statistics_target – and calculates the column statistics based on this data.

In my case – the table had 100,000 rows, but with default_statistics_target = 100, only a third (30,000 rows) was analyzed. With a statistics target of 10, the count of rows that is analyzed is even lower – just 3000.
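Just to make the arithmetic concrete, you can compute the row-sample size for your own setting (purely illustrative – this is not a formula Pg exposes anywhere):

SELECT 300 * current_setting('default_statistics_target')::int AS rows_sampled;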

You could say: OK, but then these statistics are not accurate. It could be that some super-common value just doesn't happen to be in the scanned rows. Sure. You're right. It's possible. Though not really likely. I mean – you are getting a random part of the data. The chance that you'll get exactly the x% of the table that just doesn't happen to have any row with some value that exists in all the other rows is small.

This also means that sometimes running analyze can “break” your queries. For example – statistics will get gathered from different pages, and it can happen that some values get skipped (or, on the contrary – you will get things in most_common_vals that aren't really all that common, Pg just happened to pick the right pages/rows to see them). And based on such stats, Pg will generate suboptimal plans.

If such a case hits you, the solution is rather simple – bump the statistics target. This will make analyze work harder and scan more rows, so the chances of this happening again get even smaller.
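A sketch of what bumping the target could look like (the column and the value 500 are just examples, not a recommendation):

ALTER TABLE test ALTER COLUMN almost_unique SET STATISTICS 500;
ANALYZE test;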

There is a drawback to setting large targets, though. ANALYZE has to do more work, of course, but that's a maintenance thing, so we don't really care about it (usually). The problem is that having more data in pg_statistic means that more data has to be taken into consideration by the Pg planner. So, while it might look tempting to set default_statistics_target to its max of 10,000, in reality I haven't seen a database that had it set that high.

The current default of 100 has been there since 8.4. Before that it was set at 10, and it was pretty common to see suggestions on IRC to increase it. Now, with the default of 100, you're more or less set.

One final thing I have to talk about, though I really don't want to, is the settings that make the Pg planner use different operations.

First – why I don't want to talk about it: I know for a fact that this can be easily abused. So please remember – these settings are for debugging problems, not for solving problems. An application that used them in normal operation would be at least suspect, if not outright broken. And yes, I know that sometimes you have to. But that sometimes is very rare.

For example – setting enable_seqscan to false (which can be done with the SET command in an SQL session; you don't have to modify postgresql.conf) will cause the planner to use anything else it can, just to avoid a seq scan.

Since sometimes it's not possible to avoid a seq scan (there are no indexes on the table, for example) – these settings don't actually disable the operations, they just associate a huge cost with using them.

For example, with our test table, we know that searching with “all_the_same = 123” will use a seq scan, as it's cheap:
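The original plans aren't reproduced here, but the whole experiment can be recreated with something along these lines:

EXPLAIN SELECT * FROM test WHERE all_the_same = 123;
SET enable_seqscan = false;
EXPLAIN SELECT * FROM test WHERE all_the_same = 123;
RESET enable_seqscan;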

And now we see that when there is no other option – just a seq scan (it's interesting that it didn't choose to do an index scan on i2; after all, it has pointers to all rows in the table) – the cost skyrocketed to 10,000,000,000, which is exactly what enable_* = false does.

I think that's about it. If you read the whole series you should have enough knowledge to understand what's going on, and, more importantly, why.

In the 1st blogpost you wrote about settings:
seq_page_cost = 1.0 # measured on an arbitrary scale
random_page_cost = 4.0 # same scale as above
cpu_tuple_cost = 0.01 # same scale as above
cpu_index_tuple_cost = 0.005 # same scale as above
cpu_operator_cost = 0.0025 # same scale as above

Here you introduced a new portion:
enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_indexonlyscan = on
enable_material = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on

How do these settings relate to each other? It seems that they have rather similar meanings, or not?

They have very different meanings. Not sure how you came to the conclusion that they are similar. The *_cost values are costs attached to specific low-level operations. The enable_* settings are basically tools to skyrocket the cost of some high-level operations, so that Pg will not use them.
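A quick session sketch of the difference (the values here are arbitrary examples):

SET random_page_cost = 1.1;   -- adjusts the cost model (e.g. when data sits on fast storage)
SET enable_seqscan = off;     -- doesn't forbid seq scans, just makes them look absurdly expensive
EXPLAIN SELECT * FROM test WHERE almost_unique < 100;
RESET ALL;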