Selecting random rows from a table

The question of how to select some random items from a table is one that comes up fairly often in the IRC channel (and as the subject of blog posts, such as this one from depesz).
While there is a simple solution of this form (let’s assume for now that we want to select 5 uniformly random rows from table “items”):

select * from items order by random() limit 5;

this is unfortunately slow if the table has more than a small number of rows. It is possible to do far better, though there aren’t really any perfect solutions that don’t resort to procedural logic in some way. Let’s start by looking at how to do better in plain SQL.
Obviously, to get a faster solution, we need to be able to write the query in a way that can make use of some index. Either the table already has a serial column, or one can be added simply for this purpose. Then we can just look up some random ids and return those rows….

Well, it’s not quite that simple. The obvious obstacle is that a serial column is not likely to contain only contiguous values; there will be holes in the sequence both for deleted rows, and also for rows that never existed (rolled back during insert, sequence values skipped during recovery, etc.). However, as long as the column values are not too sparse, we can just generate more random values than we need, and hope we get enough that actually match:
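Concretely, a query along these lines does the job (a sketch: the serial column is assumed to be called item_id, and the min/max come from a one-row subselect):

```sql
select * from items
 where item_id in (select floor(random() * (max_id - min_id + 1))::integer + min_id
                     from (select min(item_id) as min_id,
                                  max(item_id) as max_id
                             from items) s,
                          generate_series(1,15)
                    limit 15)
 order by random()
 limit 5;
```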

There are a number of subtleties here that aren’t immediately obvious.

We use floor() rather than just a cast to integer, because casting rounds to the nearest integer rather than truncating; rounding would give the lowest and highest ids only half the selection probability of the others.

We generate more values than we actually need, not only to account for missing values of item_id, but also to allow for the fact that the subquery may return duplicate values (which IN will remove).

We use a LIMIT clause in the subselect as a way to tell the optimizer how many rows are involved; without that, it’s much more likely to choose a suboptimal plan due to expecting the generate_series() call to return 1000 rows.

We still need the order by random() at the outer level; if we get more rows than we need, which is the normal case, we have to avoid bias in which ones we ultimately return. (Some of the possible plans for the IN join will return the result rows in a non-random order.)

This query only works if the sparseness of the id column is within a reasonable range. In the example above we’re fetching up to 15 rows hoping to get 5; if half the item_id values between min_id and max_id are missing, then the query will come up short approximately 6% of the time, whereas if 2/3rds of the values are missing, the query will come up short 40% of the time.
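Ignoring duplicates among the 15 generated values, the shortfall chance is a binomial tail: if each generated id hits an existing row with probability $p$, then

\[
\Pr[\text{fewer than 5 hits}] \;=\; \sum_{k=0}^{4} \binom{15}{k}\, p^{k} (1-p)^{15-k},
\]

which comes to about 0.059 for $p = 1/2$ and about 0.40 for $p = 1/3$.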

It’s not really feasible to try to make this query always work regardless of how sparse the data is; increasing the number of attempts in the inner subquery will eventually result in a plan that scans the whole items table anyway, at which point we might as well have used order by random() in the first place. But for the (apparently common) case where the ids are not too sparse, this method is extremely fast.

11 Responses to Selecting random rows from a table

I actually googled for this just three days ago, but none of the solutions that came up was this neat.

I initially had a go at doing a similar calculation on row numbers in an OFFSET (xxx) LIMIT 1, only to find that OFFSET doesn’t accept subqueries. My problem was of the run-once kind, so a simple ORDER BY random() was sufficient. But I’ll add your solution to my mental toolbox; I can think of other complex queries where similar solutions might save the day.

I think that, in large tables, if we have a serial column and the holes are not huge, the best solution can be to iterate a select by random id (only one row per query, and occasionally zero, but quite fast), just like this:
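A sketch of such a single-row lookup (assuming item_id is a bigint serial fed by a sequence named items_item_id_seq):

```sql
-- the scalar subselect is evaluated once, not per row
select * from items
 where item_id = (select floor(random() * last_value)::bigint + 1
                    from items_item_id_seq)
 limit 1;
```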

Note that I use a subselect to avoid per-row evaluation of the random() function, and cast the result to bigint (because, in my case, that is the type of the serial column); otherwise the type mismatch will force a sequential scan.

Also note that I use the sequence to get an approximate maximum value of the id. This works quite well if the sequence’s increment is positive (the default is OK), min_value is 0 (the default) or at least positive, and (also the default) sequence cycling is not enabled or max_value is never reached.

This method is not rock-solid: we can even obtain fewer rows than expected, but the probability of that is much smaller, depending on the density of the holes and the distance between the first and second ‘limit’ values.

We can also increase the first ‘limit’ to drastically reduce the probability of not finding enough rows, at quite a small increase in processing cost.

For example, using values up to 100000 for the first limit value works quite fast for me.

But bear in mind: the probability of obtaining too few results can be drastically reduced, but NOT totally eliminated.

In conclusion: the best way, if we can do it, is to iterate in the client application until we obtain enough results. We can also use this multiple-row approach to achieve that faster and, at least most of the time, with only one or two database queries.

PS: If we want to obtain large lists of random rows, we can also use this query and, if we do not obtain enough results, adjust the limit values in subsequent iterations depending on the number of misses.

Iterating a single-row fetch in the client is potentially quite a lot slower than the approach given in the original post.

Using the sequence last_value is essentially always a bad idea; better to use min() and max() on the actual id column. Due to the non-transactional nature of sequences, it’s possible (for example in the case where a large bulk insert is running in another session) for the sequence last_value to be a long way ahead of the visible maximum ID.

Using a real table rather than a generate_series() call to produce multiple rows in the IN subquery is just going to slow things down.

And finally, omitting the outer level ORDER BY random() clause means that the results will not be in a random order (even though it may look random).

We are just now working on a migration to PostgreSQL 8.3, but I can say that, at least in Postgres 7.4, the max() and min() functions cause a sequential scan, which in tables with a few million rows can be quite expensive.

I know that using the sequence last_value is not a good idea, but it is the best (tested) way that I found.

Also, the generate_series() function doesn’t exist in Postgres 7.4. I thought it was a user-defined function that the author hadn’t reproduced, but I have now found it in the Postgres 8.3 documentation and, yes, it is really a better way to generate some rows (though with Postgres 7 we don’t have this option).

You are right that we need the outer-level order by random() to guarantee a random order. I apologize for that.

So I think that, in Postgres 7, the best solution will be something like this:
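A sketch of the Postgres 7 variant (assuming a pre-created helper table nums(n) holding the integers 1..1000 in place of generate_series(), and a sequence named items_item_id_seq):

```sql
-- random() is volatile, so it is re-evaluated for each of the joined rows;
-- the inner "limit 100" is the first limit, the outer "limit 5" the second
select * from items
 where item_id in (select floor(random() * last_value)::bigint + 1
                     from nums, items_item_id_seq
                    limit 100)
 order by random()
 limit 5;
```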

Of course, if we have Postgres 8, we can use generate_series() to improve on it.

For the limits, I suggest testing the max() and min() functions in Postgres 8 (I will do it as soon as I can). I also think that an index should speed up finding maximum and minimum values but, at least when we try to examine big subsets of wide tables, the Postgres planner uses a sequential scan because, in those cases, it is more efficient than an index scan.

I repeat: I think that searching for the maximum or minimum could be more efficient using an index but, at least in Postgres 7, it does not work that way.

For this reason, I suggest trying “explain select max(item_id)” first. I understand your arguments, but many times I have discovered that things are not as they seem to be.

PS: Just now I remembered a way to implement better ‘max()’ and ‘min()’ that really takes advantage of the index, which I implemented in the past. I also remembered the reason (I think) why max() doesn’t take advantage of the index: we may need to select max() from a subset of a table or from a join result. But, of course, it is possible (and desirable) that Postgres 8 could implement a way to always take advantage of the indexes.

I don’t really need to obtain random rows from a table. I simply found this post yesterday while searching for something else, and I thought it interesting.

But, if you need to, you can use this trick to avoid the use of the sequence last_value in my query (PG7), or the max() and min() in the post author’s query in Postgres 8, if max() and min() still don’t take advantage of indexes.
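The setup for this trick is to give the table a fixed, indexed random key (table and column names x and r assumed, matching the query below):

```sql
-- one-time setup: assign each row a fixed random key and index it
alter table x add column r double precision;
update x set r = random();
create index x_r_idx on x (r);
```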

Then take a sample row quickly by running this:
SELECT * FROM x WHERE r >= (SELECT random()) ORDER BY r LIMIT 1;

I’m not sure if asking for more than one row in the LIMIT clause would be statistically sound or not. The “random” order is fixed, so whenever you land in an overlapping spot the sequence will be the same.

If you just need a few rows, you can UNION a few of those together, and that should be as random as you could care for.
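For example (each parenthesized select draws one row near its own independent random point; note that plain UNION also removes any duplicate rows):

```sql
(select * from x where r >= (select random()) order by r limit 1)
union
(select * from x where r >= (select random()) order by r limit 1)
union
(select * from x where r >= (select random()) order by r limit 1);
```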

This can be done more simply using >= instead of =. You will always find some row even if the initial random guess is a non-existing id, so there is no need to generate more values than needed in case some don’t hit an existing one.
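Applied to the original id lookup, that would look something like this (column names assumed; note that >= makes the rows immediately after a gap proportionally more likely to be chosen):

```sql
select * from items
 where item_id >= (select floor(random() * (max_id - min_id + 1))::integer + min_id
                     from (select min(item_id) as min_id,
                                  max(item_id) as max_id
                             from items) s)
 order by item_id
 limit 1;
```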