Introducing Dexter, the Automatic Indexer for Postgres

Your database knows which queries are running. It also has a pretty good idea of which indexes are best for a given query. And since indexes don’t change the results of a query, they’re really just a performance optimization. So why do we always need a human to choose them?

Introducing Dexter. Dexter indexes your database for you. You can still do it yourself, but Dexter will do a pretty good job.

Dexter works in two phases:

Collect queries

Generate indexes

We’ll walk through each of them.

Phase 1: Collect

Dexter collects queries from the Postgres log and parses out the query and duration. It uses fingerprinting to group queries: queries with the same parse tree but different values are grouped together. For instance, both of the following queries have the same fingerprint:

SELECT * FROM ratings WHERE user_id = 1;

SELECT * FROM ratings WHERE user_id = 2;

The data is aggregated to get the total execution time by fingerprint. You can get similar information from the pg_stat_statements view, except queries in that view are normalized. This means you get:

SELECT * FROM ratings WHERE user_id = ?;

instead of

SELECT * FROM ratings WHERE user_id = 3;

However, we need the actual values to determine costs in the next step. To prevent over-indexing, you can set a threshold for the total execution time before a query is considered for indexing.
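The collect phase can be sketched in a few lines. This is a toy model, not Dexter's actual code: the real tool uses pg_query to fingerprint on the parse tree, whereas the regex below is only a stand-in, and the threshold value is illustrative.

```python
import re
from collections import defaultdict

def fingerprint(query):
    # Toy stand-in for pg_query fingerprinting: blank out string and
    # numeric literals so queries that differ only in values match.
    q = re.sub(r"'[^']*'", "?", query)
    q = re.sub(r"\b\d+\b", "?", q)
    return q

def slow_fingerprints(log_entries, min_total_ms=5000):
    # log_entries: (query, duration_ms) pairs parsed from the Postgres log.
    totals = defaultdict(list)
    for query, duration_ms in log_entries:
        totals[fingerprint(query)].append((query, duration_ms))
    # Keep only fingerprints whose total execution time passes the
    # threshold, along with one concrete query per group -- the actual
    # values are needed to determine costs in the next phase.
    return {
        fp: (entries[0][0], sum(d for _, d in entries))
        for fp, entries in totals.items()
        if sum(d for _, d in entries) >= min_total_ms
    }

entries = [
    ("SELECT * FROM ratings WHERE user_id = 3", 4000),
    ("SELECT * FROM ratings WHERE user_id = 7", 2500),
    ("SELECT 1", 10),
]
slow = slow_fingerprints(entries)
```

The two ratings queries share a fingerprint, so their durations are summed and the group passes the threshold, while the trivial query is filtered out.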

Phase 2: Generate

To generate indexes, Dexter creates hypothetical indexes to try to speed up the slow queries we’ve just collected. Hypothetical indexes show how a query’s execution plan would change if an actual index existed. They take virtually no time to create, don’t require any disk space, and are only visible to the current session. You can read more about hypothetical indexes here.

The main steps Dexter takes are:

Filter out queries on system tables and other databases

Analyze tables for up-to-date planner statistics if they haven’t been analyzed recently

Get the initial cost of queries

Create hypothetical indexes on columns that aren't already indexed

Get costs again and see if any hypothetical indexes were used

While fairly straightforward, this approach is extremely powerful: it uses the Postgres query planner itself to figure out the best index(es) for a query. Hypothetical indexes that were used and significantly reduced cost are selected as real indexes.
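The generate loop can be sketched as follows. This is a simplified model rather than Dexter's actual code: the plan_cost function stands in for running EXPLAIN (in the real tool, hypothetical indexes are created through HypoPG and show up in EXPLAIN output), and the 50% cost-savings threshold is illustrative.

```python
def select_indexes(queries, plan_cost, candidate_columns, min_savings=0.5):
    # plan_cost(query, indexes): planner cost for a query given a set of
    # hypothetical indexes -- a stand-in for EXPLAIN with HypoPG enabled.
    chosen = set()
    for query in queries:
        initial = plan_cost(query, frozenset())
        best = None
        for col in candidate_columns(query):
            # Try a hypothetical single-column index and re-cost the query.
            cost = plan_cost(query, frozenset({col}))
            # Keep it only if the cost drop clears the savings threshold.
            if cost <= initial * (1 - min_savings):
                if best is None or cost < best[1]:
                    best = (col, cost)
        if best:
            chosen.add(best[0])
    return chosen

# Toy planner: an index on the filtered column turns a sequential scan
# (cost 1000) into an index scan (cost 10); other indexes don't help.
def toy_cost(query, indexes):
    col = query.split("WHERE ")[1].split(" =")[0]
    return 10.0 if col in indexes else 1000.0

queries = ["SELECT * FROM ratings WHERE user_id = 3"]
chosen = select_indexes(queries, toy_cost, lambda q: ["user_id", "rating"])
```

Only the index that the (toy) planner actually uses to cut the cost is selected; the candidate on rating changes nothing and is discarded.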

To be safe, indexes are only logged by default. This allows you to use Dexter for index suggestions if you want to manually verify them first. When you let Dexter create indexes, they’re created concurrently to limit the impact on database performance.

Trade-offs and Limitations

The big advantage of indexes is faster data retrieval. On the flip side, indexes add overhead to write operations, like INSERT, UPDATE, and DELETE, as indexes must be updated as well. Indexes also take up disk space.

Because of this, you may not want to index write-heavy tables. Dexter does not currently try to identify these tables automatically, but you can pass them in by hand.

As for other limitations, Dexter does not try to create multicolumn indexes (edit: this is no longer the case). Dexter also assumes the search_path for queries is the same as the user running Dexter. You’ll still need to create unique constraints on your own. Dexter also requires the HypoPG extension, which isn’t available on some hosted providers like Heroku and Amazon RDS.

Thanks

This software wouldn’t be possible without HypoPG, which allows you to create hypothetical indexes, and pg_query, which allows you to parse and fingerprint queries. A big thanks to Dalibo and Lukas Fittl respectively.