Say No to Randos (in Your Database)

When I used my first ORM, I wondered “why didn’t they include a random() method?” It seemed like such an easy thing to add. While there are many reasons you may want to pull a record out of your database at random, you shouldn’t be using SQL’s ORDER BY RANDOM() unless you’ll only ever be randomizing a limited number of records. In this post, we’ll examine how such a simple-looking SQL operator can cause a lot of performance pain, and a few different techniques we can use to fix it.

Recently, I was running the heroku pg:outliers command to see where my database time was going after some earlier optimizations, and I was surprised to find I was spending 32% of database time in two queries with a RANDOM() in them.

SELECT
"repos".*
FROM "repos"
WHERE
(repos.id not in (?,?))
ORDER BY
random()
LIMIT $1

This query is used once a week to encourage users to sign up to “triage” issues on an open source repo if they have an account but aren’t subscribed to help out. We send an email with 3 repo suggestions, including a random repo. That seems like a good use of RANDOM(); after all, we literally want a random result. So why is this bad?

While we’re only asking Postgres for 1 record, ORDER BY random() LIMIT 1 doesn’t just pluck one out at random. It orders ALL the records before returning one.
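You can see this for yourself with EXPLAIN ANALYZE; the plan will show a sort over every matching row feeding the limit (plan details vary by Postgres version and table size, and the literal ids below are placeholders):

```sql
EXPLAIN ANALYZE
SELECT "repos".*
FROM "repos"
WHERE (repos.id NOT IN (1, 2))
ORDER BY random()
LIMIT 1;
```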

While you might think it’s doing something like Array#sample, it’s really doing Array#shuffle.first. When I wrote this code, it was pretty dang fast because I only had a few repos in the database. But now there are 2,761 repos and growing. And EVERY time this query executes, the database must load rows for each of those repos and spend CPU power to shuffle them.

You can see another query that was doing the same thing with the user table:
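That query isn’t reproduced here, but it had the same shape as the repo one; a sketch (the exact WHERE conditions are an assumption):

```sql
SELECT "users".*
FROM "users"
WHERE (users.id NOT IN (?, ?))
ORDER BY random()
LIMIT $1
```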

The fix I settled on: instead of sorting, pick a random number up to the table’s maximum id, then grab the first record with an id greater than or equal to that number. There are some pretty severe caveats here to watch out for. My implementation caches the max id, which is fine for my use cases, but it might not be for yours. It’s possible to do this entirely in SQL using something like:
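A sketch of that pure-SQL version (table name taken from the query above; the exact form here is an assumption):

```sql
-- Pick a random id between 1 and max(id), then take the first row
-- at or above it. This can use the primary key index instead of
-- loading and sorting the whole table.
SELECT *
FROM repos
WHERE id >= floor(random() * (SELECT max(id) FROM repos) + 1)
ORDER BY id
LIMIT 1;
```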

As always, benchmark your SQL queries before and after an optimization. This implementation doesn’t handle sparsely populated id values very well, and it doesn’t account for the case where the random id it picks is higher than any id that actually matches your WHERE conditions. Essentially, if you were to do it “right”, you would need to apply the same WHERE conditions to the subquery for MAX(id) as to your main query.
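Applied to the repo query from earlier, that “right” version might look like this sketch (the repeated conditions are the point; the placeholder values are assumptions):

```sql
-- The same NOT IN condition appears in both the outer query and
-- the max(id) subquery, so the random id stays inside the range
-- of rows that can actually match.
SELECT *
FROM repos
WHERE repos.id NOT IN (?, ?)
  AND id >= floor(
    random() * (SELECT max(id) FROM repos WHERE repos.id NOT IN (?, ?)) + 1
  )
ORDER BY id
LIMIT 1;
```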

For my cases, it’s fine if I get some failures, and I know that I’m only applying the most basic of WHERE conditions. Your needs might not be so flexible.

If you’re thinking “is there no built-in way to do this?”, it turns out there is TABLESAMPLE, which was introduced in Postgres 9.5. Thanks to @HotFusionMan for introducing it to me.

Here’s the best blog post I’ve found on using TABLESAMPLE. The downside is that it’s not “truly random” (if that matters to your application), and you cannot use it to retrieve exactly 1 result, since the sample size is a percentage of the table rather than a row count. I was able to hack around that with a query that sampled only 1% of the table, used that 1% to get ids, and then limited to the first record. Something like:
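A sketch of that hack (the users table and the 1% sample size come from my case; tune both for yours):

```sql
-- SYSTEM sampling reads ~1% of the table's pages, which is fast
-- but clustered by page, so the result is not "truly random".
SELECT *
FROM users
WHERE id IN (SELECT id FROM users TABLESAMPLE SYSTEM (1))
LIMIT 1;
```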

While this works and is much faster than ORDER BY RANDOM() for queries returning LOTS of data (thousands or tens of thousands of rows), it’s very slow for queries that have very little data.

When I was optimizing https://www.codetriage.com I found another query that uses RANDOM(). It is being used to find open source issues for a specific repo. Due to the way issues are stored, the ids are not very contiguous, so my previous trick (grabbing the first record >= a random id) wouldn’t work as well. I needed a more robust way to randomize the data, and I thought perhaps TABLESAMPLE might perform better.

While some repos have thousands of issues, 50% have 27 or fewer issues. When I used the TABLESAMPLE technique for this query, it made my small queries really slow, and my previously slow queries fast. Since my numbers skew towards the small side for that query, it wasn’t a net gain, so I stuck to the original RANDOM() method.

Update: To clarify, I know that random() is not slow, and the article does not say that it is. It says ORDER BY random() is slow. It’s the ordering of a large number of records that causes the pain.

Update 2: No, using a random OFFSET is not faster than the >= method with ids. In fact, using an offset over a large number of records can also be a huge perf problem.
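For reference, the offset approach usually looks like this sketch; Postgres still has to walk and discard every row before the offset, so the cost grows linearly with the offset value:

```sql
-- Assumes the application picked a random number between
-- 0 and count - 1 ahead of time (4242 is a placeholder).
SELECT *
FROM repos
ORDER BY id
OFFSET 4242
LIMIT 1;
```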

Have you replaced RANDOM() with another, more efficient technique? Let me know about it on Twitter @schneems.
