Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community. It's 100% free, no registration required.

I have a table with a multicolumn index, and I have doubts about the proper sorting of the indexes to get the maximum performance on the queries.

The scenario:

PostgreSQL 8.4, table with about 1MM rows

Values in column c1 can have about 100 different values. We can assume the values are evenly distributed, so we have about 10000 rows for every possible value.

Column c2 can have 1000 different values. We have 1000 rows for every possible value.

When searching data, the condition always includes values for these two columns, so the table has a multicolumn index combining c1 and c2. I have read about the importance of properly ordering the columns in a multicolumn index if you have queries using just one column for filtering. This is not the case in our scenario.

My question is this one:

Given the fact that one of the filters selects a much smaller set of data, could I improve performance if the first index is the most selective one (the one which allows a smaller set)? I had never considered this question until I saw the graphics from the referenced article:

Do all your queries have WHERE c1=@ParameterA AND c2=@ParameterB?
–
ypercubeJan 19 '13 at 13:55

No queries with just WHERE c1=@ParameterA or WHERE c2=@ParameterB or WHERE c1=@ParameterA GROUP BY c2 ? Or WHERE c1=@ParameterA AND c2 BETWEEN @ParamB1 AND @ParamB2?
–
ypercubeJan 19 '13 at 13:56

No. The queries use values from the two columns for filtering. I have no queries using just one column for filtering. Al of them are: WHERE c1=@ParameterA AND c2=@ParameterB
–
jap1968Jan 19 '13 at 13:57

And the filtering is always equality checks (=), never range conditions (inequality or BETWEEN)?
–
ypercubeJan 19 '13 at 13:59

There are also conditions like this: WHERE c1 = "abc" AND c2 LIKE "ab%"
–
jap1968Jan 19 '13 at 14:00

Answer

It has an example that matches your situation perfectly (two-column index, one is tested for equality, the other for range), explains (with more of those nice index graphics) why @ypercube's advice is accurate and sums it up:

Less selective column first?

Apart from that, what if you have only equality conditions for both columns?

One of the answers (now deleted) claims the less selective column should go first. I claim it doesn't matter. Put the column first that is more likely to receive conditions of its own, which actually matters.

Consider this demo, or reproduce it yourself. I create a simple table of two columns with 100k rows. One with very few, the other one with lots of distinct values:

If, as you say, the queries involving these 2 columns, are all equality checks of both columns, e.g.:

WHERE c1=@ParameterA AND c2=@ParameterB

do not bother with this. I doubt there will be any difference and if there is one, it will be negligible. You can always test of course, with your data and your server settings. Different versions of a DBMS can behave slightly differently regarding optimization.

The order inside the index would matter for other types of queries, having checks of one column only, or inequality conditions, or conditions on one column and grouping in the other, etc.

If I were to choose one of the two orders, I'd choose to put the less selective column first. Consider a table with columns year and month. It's more probable that you need a WHERE year = 2000 condition or a WHERE year BETWEEN 2000 AND 2013 or a WHERE (year, month) BETWEEN (1999, 6) AND (2000, 5).

A query of the type WHERE month = 7 GROUP BY year may be wanted sure (Find people born on July), but would be less often. That depends of course on the actual data stored in your table. Choose one order for now, say the (c1, c2) and you can always add another index later (c2, c1).

Update, after the OP's comment:

There are also conditions like this: WHERE c1 = 'abc' AND c2 LIKE 'ab%'

This type of query if exactly a range condition on c2 column and would need a (c1, c2) index. If you also have queries of the reverse type: