Returning Rows in Random Order - Part II - Revenge of the Randomizer

If you liked the first article, then read on; I received a number of very good follow up questions from Nick that you might find interesting. This article focuses on the performance of the solution I proposed in the first article, as well as the problems associated with returning a single randomly chosen row from a table.

Nick writes "I just read the response to "Returning Rows in Random Order," which was posted recently. I'm curious as to how well it performs asymptotically.

Performance wise, the whole operation should be close to Order(N). You're basically making two scans of the recordset - once to pull keys from the main table, once to assign the random number to each row. Depending on how you use the resulting recordset in the temp table (returning just the temp table vs. joining to the orignal table), that will alter the performance somewhat, depending on the algorithmn used to join the two tables.

Assuming you have appropriate indexes on the original table, the slowest part of the randomization process will be the cursor that assigns the random number to each row. However, it should scale in a linear fashion to a reasonable point (if you have a gazillion rows, SQL Server may be memory / tempdb bound, thus slowing your cursor further).

He continues: I also wonder what transaction isolation level it needs to work properly.

The randomization process (updating the temp table) works fine at every isolation level, because each user gets their own copy of the temp table. Now, depending on how you want to use the data, you may want to specify something other than SQL Server's default(READ COMMITTED). Bumping up the isolation level will make your results more consistent with the data in the original table, but if you're randomizing a lot of rows, then concurrency is going to suffer.

Continuing, he adds: But my stumper is: Are there any optimizations that you would suggest if: (1) we only want to return a single random row (and not all rows in random order) (2) we are willing to alter our tables a little, and (3) SELECT performance matters much more than INSERT (or UPDATE?) performance.

My thoughts involve adding a column of packed integer values. Of course, you need some kind of trigger and a clever way of detecting and fixing gaps in the sequence (most likely by repeatedly demoting the max element to fill the lowest hole) as they open up. Selecting a random row is then a matter of picking a random integer from a contiguous range. But I haven't been able to get anything to work well enough for primetime (i.e. guaranteeing hits when you've got concurrent users).

If you want to select a single row, then the best way to do it is to use some sort of integer column that contiguously numbers the rows in the table. You could then use something like:SELECT ID, yadaFROM FooWHERE ID = (SELECT CAST((rand() * @RowCount) as int) + 1)(@RowCount is just the number of rows in the table.)

This can be easily accomplished using an identity column in the table that increments by 1, and some way to track the number of rows in the table (either a SELECT Count(*) or get the number of rows from sysindexes).

Of course, if you delete rows from the table, you're going to end up with gaps in your sequence... D'oh! One solution to this problem is to check the @@rowcount variable after running the query above-if it's zero, then you know you hit a gap, and you can re-run the query to select another row.

Now, if your table incurs a lot of deletes, then you'll probably want to "compress" the table frequently to make sure this identity column gets re-seeded with contiguous values. This would minimize the number of "misses" by the select-random algorithm. The "compression" can be done by creating a copy of the table, copying the rows over, deleting the original table, then renaming the copy to the original table name.

However, if you are using the identity column as your PK, and you have a lot of child tables, then you'll need to adjust the foreign key values in those child tables as well... not impossible, but complicated, certainly.

If you'd rather not use an identity column, then you can create your own "row number" column and use triggers to maintain these contiguous row numbers, as Nick suggests. You would need:- an INSERT trigger that assigns the next row number based on 1+the max row number- a DELETE trigger that fills in any gaps created by a delete (OR some way to periodically re-number the row numbers to eliminate gaps)The trigger solution is likely to slow down inserts and deletes, so you'll have to take that into consideration.

Admittedly, this is a tricky problem. My suggestions are:-On a table with relatively few deletes, use an identity column as the row number, and use an occasional "compression" to remove gaps. That way, you'll have occasional "misses" on your random row selection, but good insert/delete performance.

-On a table with frequent single-row deletes, try using an INSERT trigger-generated row number, and demoting the max row number to fill in the gaps. You'll have pretty good insert performance and slightly worse delete performance.

-On a table with frequent multiple-row deletes, gap elimination is going to really affect delete performance. You'll have to strike a balance between the performance hit from "misses" in your random row selection and the performance hit incurred from a complicated DELETE trigger or "compression" process.

Stay tuned, I think there may be more on this topic in the future. I'd also like to hear from anyone that is trying to do this sort of random row selection and has other ideas on how to work around the problems described here.

-SQLGuru

P.S. To Nick - I've lost your e-mail address, and I had a follow-up to your follow-up. Please e-mail me if you're interested in getting the rest of your follow-up questions answered.