Flattening

Flattening is one of a few stumbling blocks in PostgreSQL for beginners (and sometimes for advanced users). The inspiration for this article was a conversation with a colleague who could not get PostgreSQL to use one specific index. From his point of view, PostgreSQL was not being cooperative. If we want to understand why PostgreSQL sometimes fails to find the optimal execution plan, we first have to understand the process by which the execution plan is selected.
If you experiment with variations of a problematic SQL query (for example, by wrapping part of it in a derived table) at the moment PostgreSQL gets into trouble, you are very likely to run into flattening. The workaround is easy, but you first have to recognize that flattening is what is causing your problem.

This problem is not frequent, and I have only heard about it in mailing-list discussions. Anyone who follows pg_performance can identify and fix it immediately. But who has time to follow every discussion? Flattening does not cause problems under normal circumstances; it is a very useful technique that speeds up many queries. Without it we could not use views in PostgreSQL effectively.

When estimation fails

The basic prerequisite for correct optimizer behavior is an accurate estimate of each predicate's selectivity.
The estimate is based on column statistics: for every column of a table, PostgreSQL keeps a histogram of how frequently values fall into each bucket. If the optimizer can map a predicate onto a concrete bucket, it can estimate the predicate's effect as a fraction of the rows in the table. Column statistics are refreshed by the ANALYZE command. Until this command has been run, PostgreSQL falls back on default assumptions: the row count is estimated from the table's size on disk (for a three-column table, 1770 rows), and values are assumed to be uniformly distributed across each column. As in Pratchett, the probability that both assumptions actually hold is about one in a million.
The ANALYZE command updates the statistics, and afterwards the optimizer can make credible estimates:
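As a minimal sketch (the table and column names here are hypothetical, chosen to match the example later in this article), you can refresh the statistics and inspect what the optimizer now knows through the pg_stats view:

```sql
-- refresh the column statistics for the table "data"
ANALYZE data;

-- inspect the collected statistics: distinct-value estimate,
-- most common values, and histogram bucket boundaries
SELECT attname, n_distinct, most_common_vals, histogram_bounds
  FROM pg_stats
 WHERE tablename = 'data';

-- the row estimate in the plan now comes from these statistics
EXPLAIN SELECT * FROM data WHERE a = 10;
```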

Some data simply cannot be approximated accurately by a histogram with n buckets, where n is the default statistics target, namely 10. Imagine you have 1000 collection points. 80% of them are in towns, and each of those generates at least one record a day. The other 20% are located in an unpopulated part of the country, and all of them together generate one record per month. It is no surprise that a ten-bucket histogram fails to capture that 20% of the points: queries against the 80% majority are planned optimally, queries against the 20% minority are not. That is not a bug in PostgreSQL; at this resolution we simply cannot represent reality accurately. Increasing the number of buckets can help. Technically the number of buckets is limited to one thousand, but a realistic maximum is somewhere around three hundred. If you need more buckets than that, it is a signal that something in your input data is wrong, and you need to change the data model or restructure the data.
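The number of buckets can be raised per column. A minimal sketch, again assuming the hypothetical table "data" with a column "a" from the example below:

```sql
-- raise the statistics target for one column
-- (the hard cap is 1000; values this high are rarely justified)
ALTER TABLE data ALTER COLUMN a SET STATISTICS 300;

-- rebuild the histogram with the new target
ANALYZE data;
```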

Let me build an example. I will randomly generate pairs from the space [1..800, 1..100]: eight hundred collection points, each measuring a value between 1 and 100. To simulate the unevenness of real data, I discard 90% of the records whose value falls in the interval 50-100.
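A sketch of that test setup in SQL (table and index names are my own invention for this illustration):

```sql
CREATE TABLE data (a integer, b integer);

-- 800 collection points, measured values 1..100;
-- keep only ~10% of the rows with b in 50..100 to skew the distribution
INSERT INTO data
  SELECT a, b
    FROM (SELECT (random() * 799)::int + 1 AS a,
                 (random() * 99)::int + 1  AS b
            FROM generate_series(1, 100000)) t
   WHERE b < 50 OR random() < 0.1;

CREATE INDEX data_a_idx ON data (a);
CREATE INDEX data_b_idx ON data (b);
ANALYZE data;
```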

One of the reasons why PostgreSQL fails here is that the data is randomly generated and the two columns are independent, and the test set is not very large. Real data, fortunately, has a different character.

Raising the number of buckets does not always help, and then we have to rewrite the query: either maintain two variants of the query, or split the query up and write a stored procedure (in our example, one variant simply should not use the predicate on "a"). Should we routinely raise the statistics target to hundreds of buckets? Absolutely not, for the same reason we do not create indexes in advance: it would be premature optimization. In SQL we should solve problems when they appear, not before and not after. We can log slow queries; checking that log regularly is a normal habit of a good DBA.
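Slow-query logging is a one-line configuration change (the threshold of 200 ms below is just an illustrative choice):

```
# postgresql.conf: log every statement that runs longer than 200 ms
log_min_duration_statement = 200
```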

Now we know the areas where query optimization can fail, and flattening can cause problems only in those areas. So if everything works well, we do not need to care about the details; these really are quite atypical incidents. We know that the optimizer sometimes gets it wrong, but nobody knows how to estimate a predicate's selectivity correctly in every case.

Flattening

What is flattening? Flattening is a transformation that rewrites an SQL query containing a derived table into an equivalent query without the derived table. For example:

select * from (select * from data where a = 10) s where s.b = 20;
--> select * from data where a = 10 and b = 20;

Where is the problem? In how we understand the parentheses. In ordinary programming languages, parentheses change the order of evaluation. So we assume the inner predicate a = 10 is evaluated first (using the index on a), and only then is the predicate b = 20 applied to the result. That is a mistake. Here the parentheses are required only for correct syntax; they do not determine the order of evaluation. Thanks to flattening, the nested query disappears and PostgreSQL chooses freely between the indexes on a and b, and if it prefers the index on b, it will keep preferring the index on b no matter how many times you rewrite the query this way. So we cannot force a particular index in PostgreSQL. One step back: why a nested query at all? We write a nested query precisely when we think we can tell PostgreSQL exactly what we want, when the data is irregular and we do not want to lose control. And that is exactly what cannot be done.

Back to my example. Who would write such a contrived query by hand? Nobody. But everyone who works with views does, indirectly. The query from the example is equivalent to:

create view view as select * from data where a = 10;
select * from view where b = 20;

This is why flattening is so important in PostgreSQL: thanks to it, querying through a view costs nothing extra compared to querying the table directly.

Sometimes, though, we need to compel PostgreSQL to respect the parentheses. Flattening cannot be applied if the nested query contains LIMIT or OFFSET. LIMIT is not usable for this, because it changes the result of the query; but OFFSET 0 is a pure no-op, which is why it is the idiom people use. At one time the introduction of a proper hint was discussed; it was not accepted, because of the problems such a hint could cause, and because it would have opened the door to index hints as well. Besides, OFFSET 0 is an open secret: you can find it if you look. So why invent new, suboptimal syntax? If you are interested in this topic, search the pg_performance archives for the keywords flattening and "OFFSET 0".
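Applied to the earlier example, the idiom looks like this: the OFFSET 0 prevents flattening, so the subquery is planned separately and its predicate is evaluated first.

```sql
-- OFFSET 0 blocks flattening: the planner keeps the subquery as a
-- separate step, evaluates a = 10 first (it can use the index on a),
-- and only then filters the result by b = 20
select * from (select * from data where a = 10 offset 0) s where s.b = 20;
```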

I do not think there are any other problems in this area. PostgreSQL just works, and tricks are not the right solution. I remember that to make PostgreSQL 7.4 use an index, we had to cast numeric literals to the column's type explicitly. That said, 7.4 is a reliable database, even though it is retired now (it was the first really usable version). The 8.x versions are much better and let programmers concentrate on their work, without tricks.