Heroku has a feature called expensive queries that can help you out. It shows historical performance data about the queries running on your database: most time consuming, most frequently invoked, slowest execution time, and slowest I/O.

Recently, I used this feature to identify and address some slow queries for a site I run on Heroku named CodeTriage (the best way to get started contributing to open source). Looking at the expensive queries data for CodeTriage, I saw this:

On the right is the query, on the left are two graphs; one graph showing the number of times the query was called, and another beneath that showing the average time it took to execute the query. You can see from the bottom graph that the average execution time can be up to 8 seconds, yikes! Ideally, I want my response time averages to be around 50 ms and perc 95 to be sub-second time, so waiting 8 seconds for a single query to finish isn't good.

To find this on your own apps you can follow directions on the expensive queries documentation. The documentation will direct you to your database list page where you can select the database you’d like to optimize. From there, scroll down and find the expensive queries near the bottom.

Once you've chosen a slow query, you’ll need to determine why it's slow. To accomplish this use EXPLAIN ANALYZE:

In this case, I'm using Kubernetes because they currently have the highest issue count, so querying on that page will likely give me the worst performance.

We see the total time spent was 82 ms, which isn't bad for one of the "slowest" queries, but we've seen that some can be way worse. Most single queries should be aiming for around a 1 ms query time.

We see that before the query can be made it has to sort the data, this is because we are using an order on an offset clause. Sorting is a very expensive operation, you can see that it says the "actual time" can take between 27.725 ms and 81.220 ms just to sort the data, which is pretty tough. If we can get rid of this sort then we can drastically improve our query.

One way to do this is... you guessed it, add an index. Unlike last week though, the issues table is HUGE. While the table we indexed last week only had around 2K entries, each of those entries can have a virtually unbounded number of issues. In the case of Kubernetes there are 5K+ issues, and that's only the state=open ones. The closed issue count is much larger than that, and it will only grow over time. We want to be mindful of taking up too much database size, so instead of indexing ALL the data, we can instead apply a partial index. I'm almost never querying for state=closed when it comes to issues, so we can ignore those while building our index. Here's the migration I used to add a partial index:

Wow, from 80+ ms to less than half a millisecond. That's some improvement. The index keeps our data already sorted, so we don't have to re-sort it on every query. All elements in the index are guaranteed to be state=open so the database doesn't have to do more work there. The database can simply scan the index removing elements where repo_id is not matching our target.

For this case it is EXTREMELY fast, but can you imagine a case where it isn't so fast?

Perhaps you noticed that we still have to iterate over issues until we're able to find ones matching a given Repo ID. I'm guessing that since this repo has the most issues, it's able to easily find 20 issues with state=open. What if we pick a different repo?

I looked up the oldest open issue and found it in Journey. Journey has an ID of 10 in the database. If we do the same query and look at Journey:

Yikes. Previously we're only using 0.27 ms, now we're back up to 21 ms. This might not have been the "8 second" query we were seeing before, but it's definitely slower than the first query we profiled.

Even though we've got an index on created_at Postgres has decided not to use it. It's reverting back to a sorting algorithm and using an index on repo_id to pull the data. Once it has issues then it iterates over each to remove where the state is not open.

In this case, there are only 20 total issues for Journey, so grabbing all the issues and iterating and sorting manually was deemed to be faster. Does this mean our index is worthless? Well considering this repo only has 1 subscriber, it's not the case we need to be optimizing for. Also if lots of people visit that page (maybe because of this article), then Postgres will speed up the query by using the cache. The second time I ran the exact same explain query, it was much faster:

Total runtime: 0.092 ms

Postgres already had everything it needed in the cache. Does this mean we're totally out of the woods then? Going back to my expensive queries page after a few days, I saw that my 8 second worst case is gone, but I still have a 2 second query every now and then.

This is still a 75% performance increase (in worst case performance) so the index is still useful. One really useful feature of Postgres is the ability to combine multiple indexes. In this case, even though we have an index on created_at and an index on repo_id, Postgres does not seem to think it's faster to combine the two and use that result. To fix this issue we can add an index that has both created_at and repo_id, which maybe I'll explore in the future.

Before we go, I want to circle back to how we found our slow query test case. I had to know a bit about the data and make some assumptions about the worst case scenarios. I had to guess that Kubernetes was our worst offender, which ended up not being true. Is there a better way than guess and check?

It turns out that Heroku will output slow queries into your app's logs. Unlike the expensive queries, these logs also contain the parameters used in the query, and not just the query. If you have a logging addon such as Papertrail, you can search your logs for duration and get a result like this:

In this case, we can see that our 2.4 second query (the last query in the logs above) is using a repo id of 1348 and an offset of 760, which brings up another important point. As the offset goes up, the cost of scanning our index will also go up, so it turns out that we had a worse case than my initial guess (Kubernetes) and my second guess (Journey). It is likely that this repo has lots of issues that are old, and this query isn't made often, so that the data is not in cache. By using the logs we can find the exact worst case scenario without all the guessing.

Before you start writing that comment message, yes, I know that offset pagination is broken and there are other ways to paginate. I may start to look at alternative pagination options, or even getting rid of some of the pagination on the site altogether.

Indexes help a ton, but you have to understand the different ways your application will use them. It's not enough to profile with 1 query before and after, you need to profile a few different queries with different performance characteristics. In my case not only did I add an index, I went back to the expensive index page which let me know that my queries were still taking a long time (~2 seconds).

Performance tuning isn't about magic fixes, it's about finding a toolchain you understand, and iterating on a process until you get the results you want.