
Most of us will probably agree that using database indexes is good, but with too many indexes performance can actually degrade.

As a general rule, which fields should be indexed?
Which fields should not be indexed?
What are the rules for using indexes, striking a balance between too many and not enough, in order to achieve performance improvements rather than degradation?

7 Answers

Short

The "too many indexes" rule is a bit misleading I think.

Long

Given that the average database workload is around 98% reads (or higher), reads need to be optimised. An INSERT is also a read if there is a unique index to check, for example, as is the WHERE clause of an UPDATE. I once read that even a write-intensive database is still 85% reads.

What you do have is poor-quality indexing. Examples:

wide clustered indexes (SQL Server especially)

non-monotonic clustered indexes

overlapping indexes (e.g. (colD, colE) and (colD, colE, colF); see the sketch after this list)

many single-column indexes (also overlapping with more useful indexes) that are useless for your queries

no INCLUDEs, so the indexes aren't covering (e.g. all single-column indexes)

...
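To make the overlap case concrete, here is a minimal sketch (the table and column names are invented for illustration):

    CREATE TABLE dbo.T (colD int, colE int, colF int);

    -- Overlapping: any query the first index can serve, the second can
    -- serve as well (same leading columns), so the first is dead weight
    -- that still has to be maintained on every write.
    CREATE INDEX IX_T_DE  ON dbo.T (colD, colE);
    CREATE INDEX IX_T_DEF ON dbo.T (colD, colE, colF);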

Note it's quite typical to have indexes several times bigger than your actual data, even in OLTP systems.

Generally, I'd start with the following (a rough SQL sketch follows the list):

clustered index (usually PK)

unique indexes (not unique constraints; those can't be covering)

foreign key columns
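As a rough SQL Server sketch of that starting point (a hypothetical schema; table and index names are invented):

    -- 1. Clustered index, here via the primary key.
    CREATE TABLE dbo.Customers
    (
        CustomerID int IDENTITY(1,1) NOT NULL,
        Email      varchar(320) NOT NULL,
        Name       nvarchar(200) NOT NULL,
        CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED (CustomerID)
    );

    -- 2. A unique *index* rather than a unique constraint, so it can
    --    INCLUDE columns and cover queries.
    CREATE UNIQUE INDEX UQ_Customers_Email
        ON dbo.Customers (Email) INCLUDE (Name);

    -- 3. Foreign key columns: SQL Server does not index them automatically.
    CREATE TABLE dbo.Orders
    (
        OrderID    int IDENTITY(1,1) NOT NULL,
        CustomerID int NOT NULL,
        OrderDate  date NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID),
        CONSTRAINT FK_Orders_Customers FOREIGN KEY (CustomerID)
            REFERENCES dbo.Customers (CustomerID)
    );

    CREATE INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID);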

Then I'd look at:

common queries, to see what I need. A query running every second needs tuning; the report that runs at 4am on Sunday can wait.

with SQL Server, the missing-index DMVs, weighted by impact (see the query below)
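For the missing-index DMVs, a common weighting is seeks-plus-scans times average impact; something like this (the DMV names are standard, but treat the weighting itself as one reasonable heuristic, not gospel):

    SELECT TOP (20)
           migs.avg_user_impact * (migs.user_seeks + migs.user_scans) AS weight,
           mid.statement AS table_name,
           mid.equality_columns,
           mid.inequality_columns,
           mid.included_columns
    FROM sys.dm_db_missing_index_group_stats AS migs
    JOIN sys.dm_db_missing_index_groups AS mig
         ON migs.group_handle = mig.index_group_handle
    JOIN sys.dm_db_missing_index_details AS mid
         ON mig.index_handle = mid.index_handle
    ORDER BY weight DESC;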

That said, I have broken these rules for some systems, after seeing how things panned out (10 billion rows later), to tune them. But I'd never consider not indexing unless I could demonstrate why I'm doing so.

Quite simply, one of the best series of articles written on which indexes to choose, and why, is by Gail Shaw. You can find the articles on her blog.

The question you ask can be answered 50 different ways; it really all boils down to the data you have and how it will be queried. A general rule is that you should always have a clustered index on each table, to avoid heaps, and the clustered key should typically be as small as possible. If the table has a clustered index, then every record on the leaf pages of each non-clustered index stores the corresponding clustered key value for bookmark lookups. If a table is a heap, then SQL Server instead creates an 8-byte row identifier (RID) for bookmark lookups, which can be a much larger value than, say, a 4-byte INT key. Imagine having 8 non-clustered indexes on a heap table.
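A quick sketch of why the clustered key size matters (a hypothetical table; the point is that every non-clustered index repeats the clustering key, or RID, once per row):

    -- Narrow 4-byte clustered key: each non-clustered index below
    -- carries just that INT as its bookmark. On a heap, each would
    -- carry an 8-byte RID instead; on a wide clustered key, the
    -- whole key.
    CREATE TABLE dbo.Sales
    (
        SaleID    int IDENTITY(1,1) NOT NULL,
        ProductID int NOT NULL,
        SoldAt    datetime2 NOT NULL,
        Amount    decimal(10,2) NOT NULL,
        CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED (SaleID)
    );

    CREATE INDEX IX_Sales_ProductID ON dbo.Sales (ProductID);
    CREATE INDEX IX_Sales_SoldAt    ON dbo.Sales (SoldAt);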

You should profile your database usage and load, and identify bottlenecks due to missing indexes, or due to too many indexes. Then you have to choose the proper index, and that requires a good knowledge of the specific database's indexing techniques.

Even with all of the above links, you need to look at what Kimberly Tripp has written regarding the care, feeding, and use of indexes.

For starters, see Kimberly's collection of index-related blog posts. You can explore specific topics using the "On this page" and "Categories" widgets on the left side of the page.

I want to add here that different databases require different strategies. Let's compare MySQL with InnoDB and PostgreSQL, for example.

InnoDB

InnoDB tables are basically a B-tree index on the primary key, extended to store the full row data in each index entry. Physical order scans are not supported, and all scans happen in logical order. This means two things:

A sequential scan in InnoDB generates a lot of random disk I/O, and

The primary key index must be traversed regardless of whether one is using a secondary index.

Primary key lookups are faster in this model than in any other approach.

In this case it is very important to index enough fields in multi-page tables. The typical rule is: index everything you want to filter by.
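A rough InnoDB sketch of that rule (an invented schema; the comments note where the primary-key B-tree comes in):

    CREATE TABLE orders (
        order_id    INT NOT NULL AUTO_INCREMENT,
        customer_id INT NOT NULL,
        status      VARCHAR(20) NOT NULL,
        created_at  DATETIME NOT NULL,
        PRIMARY KEY (order_id),                     -- the table *is* this B-tree
        KEY idx_customer (customer_id),             -- secondary; leaf stores order_id
        KEY idx_status_created (status, created_at) -- covers the common filter
    ) ENGINE=InnoDB;

    -- Without idx_status_created, filtering on status would walk the
    -- primary-key B-tree in logical order: lots of random disk I/O.
    SELECT order_id, created_at
    FROM orders
    WHERE status = 'open' AND created_at >= '2024-01-01';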

PostgreSQL

PostgreSQL uses heap files, one file per table (large tables may span multiple files), where tuples are allocated from that heap's free space. Physical order scans are supported. For a logical order scan to work, an index must be added.

Primary keys in PostgreSQL are basically a subset of unique indexes where no values may be NULL. UNIQUE constraints are implemented using implicit indexes, and several other index types are supported, with different operations possible in each.
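To make the primary-key point concrete (a sketch; the two tables below behave nearly identically for lookups):

    -- A primary key...
    CREATE TABLE t_pk (id integer PRIMARY KEY);

    -- ...is roughly a NOT NULL column backed by a unique index:
    CREATE TABLE t_ix (id integer NOT NULL);
    CREATE UNIQUE INDEX t_ix_id_idx ON t_ix (id);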

For query performance, this means:

Primary key lookups, assuming a reasonably large table, require hitting both an index file and a table file. This is significantly slower than MySQL's approach, where only the index must be traversed and the row is contained in it.

Physical order scans perform much better, reducing random disk I/O where significant numbers of rows are to be processed.

Secondary index scans perform better than MySQL's because only one index must be traversed to get to the physical portion of the table.

In this model, indexes are often necessary, but the planner has more freedom in deciding when to use one, and the implications of not using one are often less severe. Tables are optimized more generally (rather than specializing in primary-key lookups), and so fewer indexes are required.
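As an illustration of that planner freedom (a hypothetical table; EXPLAIN shows which plan the planner chose without executing the query):

    CREATE TABLE events (
        id         bigserial PRIMARY KEY,
        kind       text NOT NULL,
        created_at timestamptz NOT NULL
    );
    CREATE INDEX events_kind_idx ON events (kind);

    -- For a rare kind the planner will usually pick the index; for a
    -- kind matching most of the table it may prefer a sequential scan,
    -- since the physical order scan is cheap here.
    EXPLAIN SELECT * FROM events WHERE kind = 'signup';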