Tuesday, October 28. 2008

The age old question of why or why is my table index not being used
is probably the most common question that ever gets asked even by expert database users.

In this brief article we will cover the most common reasons and try to order by statistical significance.

For the uninitiated, how do you know when an index is not being used? You use EXPLAIN, EXPLAIN ANALYZE or PgAdmin's nifty graphical Explain plan described in
Reading PgAdmin Graphical Explain Plans.

Okay your query is not using an index, why and what can you do about it?

Problem: Outdated stats. Now this is less common of an issue if you have auto-vacuuming turned on, but if you have recently bulk loaded a table or added new indexes and then try to
do a query, then the statistics may be out of date.

Solution: vacuum analyze verbose sometable I add the verbose in because it is fun to see what vacuuming is doing and if you are as impatient as I am and you tables are big, its nice to get some confirmation that something indeed is happening. You can also just vacuum analyze verbose if you want to run against all tables.

Problem: The planner has decided its faster to do a table scan than an index scan : This can happen if a) your table is relatively small, or the field you are indexing has a lot of duplicates.
Solution: Case in point, boolean fields are not terribly useful to index
since 50% of your data is one thing and 50% is another. However they are good candidates to use for Partial indexes e.g. to only index data that is active.

Problem: You set up an index that is incompatible with how you are actually filtering a field. There are a couple of variants of this situation. The old

LIKE '%me' will never use an index, but LIKE 'me%' can possibly use an index.

The upper lower trap - you defined your index like: CREATE INDEX idx_faults_name
ON faults
USING btree(fault_name);,
But you are running a query like this: SELECT * FROM faults where UPPER(fault_name) LIKE 'CAR%'
Possible fix: CREATE INDEX idx_faults_name
ON faults
USING btree(upper(fault_name));

This is one that I wasn't even aware of. It is always nice to read about other peoples problems via newgroups because you discover all these problems you never even knew you had.
If your server was init with a non-C locale, doing the above still
doesn't work. This came up in Pgsql-novice newsgroup recently and Tom Lane provided an answer which works. This I suspect bites more people
than is known. The solution
Possible fix: CREATE INDEX idx_faults_uname_varchar_pattern
ON faults
USING btree(upper(fault_name) varchar_pattern_ops);

However even with the above solution it appears you may still need the variant below for exact matches and IN clauses:

We haven't quite proven if this an issue with database encoding, data itself or difference in versions of 8.2 vs 8.3. It seems for 8.3 for our particular dataset in UTF-8 the below is still needed for exact matches,
however for similar use-case in a 8.2 database in SQL-ASCII, the varchar_pattern_ops is enough for both exact and LIKE matches.

Dum newbie user - Just incompatible data type. E.g. doing something like creating an index on a date field and then doing a text compare with dates by casting your date to text.

Not all indexes can be used. Although PostgreSQL 8.1+ versions support what is known as Bitmap Index Scan, which allows multiple indexes on a table to be used in a query by creating in-memory bitmap indexes. If you have got many indexes, don't expect all
possible candidate indexes to be used. Sometimes a table scan is just more efficient.

Problem: The planner isn't perfect. Solution Cry and pray for a brighter day. Actually I've been pretty impressed with PostgreSQL's planning abilities compared to other databases. Some would say, If only I could provide hints, I could make this faster.
I tend to think hints are a bad idea and the best solution is just to make the planner better. The problem with hints is they take away one beauty of databases. That is the idea that the database knows the state of the data better than you do
and constantly updates that knowledge. Hints can quickly become stale where as a well-primed planner, will constantly be changing strategy as the database changes and that is really what makes database programming unique among various modes of programming.

You're right, a varchar_pattern_ops index won't work for IN (or more generally, plain "=") queries. This is because its equality operator is ~=~ ... which, in the original conception, didn't necessarily work like =. pattern_ops uses straight strcmp comparison while regular varchar comparison is based on strcoll, and in non-C locales strcoll can say that two strings are equal even if they aren't bitwise equal.

For the last release or so there's been a hard-wired requirement that varchar = be plain bitwise equality, which is what ~=~ does. So for 8.4 we have gotten rid of ~=~ and made regular = be the equality member of the varchar_pattern_ops opclass.

In short: as of 8.4 a pattern_ops index will be able to service plain "=" and hence IN queries, but in earlier releases you really do need two indexes if you're not using C locale.

One of the common causes for using a sequential scan instead of an indexed one you didn't mention is that it will happen if the query is accessing a large portion of the table. Sometimes people wonder why the index isn't being used for a query that is accessing, say 30% of the rows. The reality is that if you're accessing that many, given how typical rows are clustered together you might as well avoid the overhead of using the index and just look at the whole thing.

With default parameters I believe that crossover point (where it's considered faster to just access the whole thing) is normally at 20% of the table--if the planner believes you're going to see more than that, it might as well just fetch the whole thing. Which is normally correct, but can be wrong particularly for very large tables where the data is packed tightly (a typical example is a large historical archive you're pulling a section out of). That sort of thing is where temporary changes to random_page_cost and enable_seqscan can be handy as pseudo-hints to the optimizer.

You say:
"...I tend to think hints are a bad idea and the best solution is just to make the planner better..."

You're kidding, right?

Denying that, in certain, simple cases a hint like "use index_X" is somehow destroying the "beauty" of the database is living in a fantasy world. Mr. Fetter makes an excellent case by talking about how he makes "temporary changes" to system operation variables in order to bend the optimizer to his way of seeing things.

If, in fact, the person writing the query knows about the structure of the table, and the relation returned rows have to a particular index, it seems far more reasonable to let the person give the optimizer a hint on where to start. That's why they're called hints!

Assuming that the optimizer can be made all-knowing and all-seeing seems pretty irrational to me, but that's just my view.

I still say that hints are a bad idea. In reality I find they are more abused than useful and in some cases they lead to really stupid things if some other dba comes along not knowing there is a hint living in code somewhere and restructures the table such that the hint is more than useless.

The statistics of the database know more about the distribution of your data than you do. Hints require you constantly tweak your queries as new indexes are added or mroe data is added.

Hints should not live in queries. If they must exist, they should live as part of the table definition or someother meta data elsewhere. (which seems even more fantasy thinking I suppose)

first of all, I wanted to say that I have a profund respect for postgre conceptors as it is a hard and complex job to build a solid database software as postgre.

However, I don't understand your strong opposition to SQL Hints which seems quite old (cf. http://archives.postgresql.org/pgsql-performance/2003-12/msg00181.php :
Tom Lane is "Doctor Optimization," and if you look at past discussion threads of this sort, you'll see that he tends to rather strongly oppose the introduction of "hints.")

Of course, making the optimizer as best as possible is the right way.

Of course you should use the right data-type, etc.

Of course hints can "lead to really stupid things".

Of course "The statistics of the database know more about the distribution of your data than you do" ... except in some rare cases where the optimizer cannot go the right way!

A simple case is with a (indexed) column with few values (ex: a "status" column with "new" or "old" possible values only).

If you have well written your application, you should use bind variables. So your query from the planner point of view should look like something as :
SELECT * from table where status = $status;

Let say that data repartition is 1% of "new" lines and 99% of "old" lines.

What's the best for our query? Even the most clever optimizer could not answer!

The best for "old" status would be a FULL TABLE SCAN.
The best for "new" status would be an INDEX RANGE SCAN.

And that's a really simple example and there are many tricky situations where an execution plan should go the wrong way...

Moreover, in production situations, changing the query or the design is a mess! And a hint can sometimes simply fix the problem!

Why this should be so bad to fix a heavy SQL request in 5 minutes instead of struggling during months to convince the development team or whoever to change their design and still have users complaining of the performances?

And no matter for me to change the DB settings (as far as it is well configured) for a single query!

Ok, so hints are bad because they go with the query. I can agree with that, but only because they are not more visible within the db.
Perhaps a happy medium is to register queries (a la stored procedures) and then also associate and register hints with parameters, or at least comments, from the designer that can give some context to the reason for the hint.
It seems like the problem is between what the query designer knows about what will be returned vs what the optimizer can guess based on table statistics - with limited statistics, and zero knowledge about the reason for the query. In this scenario, the optimizer should work great *most* of the time, but will not be optimal sometimes. Ah well, enough speculation for now.
Thanks,
-T

E-Mail addresses will not be displayed and will only be used for E-Mail notifications.

To prevent automated Bots from commentspamming, please enter the string you see in the image below in the appropriate input box. Your comment will only be submitted if the strings match. Please ensure that your browser supports and accepts cookies, or your comment cannot be verified correctly.Enter the string from the spam-prevention image above: