Simple SQL: Random Thoughts

How does one get a truly random sample of data of a certain size from a SQL Server database table? Well, there are simple non-portable tricks one can use, such as the NewID() function, but refining them can be tricky. Take the Rand() function, for a start: can it really provide you with a truly random number? Why doesn't the TABLESAMPLE clause give you a set number of rows? Joe Celko scratches his head a bit, explains some of the issues, and invites suggestions and tricks from readers.

“Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.” – John von Neumann

John von Neumann was not against the use of pseudo-random numbers, but he thought that you should be very, very careful about using them (“Various Techniques Used in Connection with Random Digits” by John von Neumann in Monte Carlo Method (1951) edited by A. S. Householder, G. E. Forsythe, and H. H. Germond).

Randomness is a good topic in itself, but I want to talk instead about getting a random sample out of a table in a database. Think of this as another one of my “aggregation” articles. Most SQL Server programmers only know about the RAND() function and never think further about it; after all, Microsoft wrote it, right?

There’s a fundamental mistake in just accepting any pseudo-random number generator as being truly random. First of all, there are two kinds of random samples: sampling with replacement and sampling without replacement. These terms come from the rather classic model of drawing numbered balls from an urn. This analogy pops up in every book on probability or introduction to statistics. When I draw a ball out, do I keep it or put it back in the urn? Since I seldom have a bunch of balls in an urn, I like to refer to these as “shooting dice” or “cutting cards” instead. If I have replacement, then I can get the same value in different drawings; if there is no replacement, once I have picked a value it does not occur again in the sample.
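The urn model is easy to sketch in C (the article uses C for its generator example as well). This is only an illustration; the standard library's rand() stands in for a proper generator, and the function names are mine:

```c
#include <assert.h>
#include <stdlib.h>

/* Sampling WITH replacement: every draw puts the ball back,
   so the same value can show up more than once. */
void sample_with_replacement(const int *urn, int n, int *out, int k)
{
    for (int i = 0; i < k; i++)
        out[i] = urn[rand() % n];
}

/* Sampling WITHOUT replacement: a partial Fisher-Yates shuffle;
   once a value is drawn it cannot be drawn again. */
void sample_without_replacement(int *urn, int n, int *out, int k)
{
    for (int i = 0; i < k; i++) {
        int j = i + rand() % (n - i);  /* pick from the undrawn tail */
        int tmp = urn[i]; urn[i] = urn[j]; urn[j] = tmp;
        out[i] = urn[i];
    }
}
```

With replacement, a sample of k from n values has n^k possible outcomes; without replacement, drawing all n values is just a random permutation.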

If a pseudo-random number generator (PRNG, to use the usual abbreviation) generates a prior value, then you’re hung in a cycle and will repeat the same series of pseudo-random numbers again. That is the nature of deterministic arithmetic expressions. PRNG algorithms are usually written at the machine level for speed. Here’s a simple one for 31-bit numbers, implemented in C.

int Generator_31 (void)
{
   static int n = 1;
   n = n >> 1 | ((n * n >> 3) & 1) << 30;
   return n;
}

There are various algorithms for picking the values of the parameters in the formula to give a complete cycle over their range. Unfortunately, this family of algorithms has some problems. One of the horror stories from the late 1960s was the discovery that RANDU(), one of the standard random number generators from IBM, was flawed.
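As an example of a parameter choice that does achieve a complete cycle, here is Park and Miller’s “minimal standard” Lehmer generator in C (a sketch, not the generator shown above; the function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Lehmer generator with Park & Miller's constants:
   x(n+1) = 16807 * x(n) mod (2^31 - 1).
   Starting from any seed in 1 .. 2^31 - 2, the sequence visits
   every value in that range exactly once before repeating. */
static uint32_t lehmer_state = 1;

uint32_t lehmer_next(void)
{
    /* 64-bit intermediate avoids overflow in the multiply */
    lehmer_state = (uint32_t)(((uint64_t)lehmer_state * 16807u) % 2147483647u);
    return lehmer_state;
}
```

The multiplier 16807 and modulus 2^31 - 1 (a Mersenne prime) are one of the classic, heavily studied parameter pairs; a careless choice of constants gives short cycles or obvious patterns.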

The definition of randomness was that every number in the range of the function would occur at least once in the cycle. This is very obvious and easy to test for, but it’s not the only test. The next important factor is that the numbers should be evenly distributed over the range. That means no lumps or clusters. Unfortunately, RANDU() has a little problem creating triplets in its number space.
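The flaw is easy to demonstrate. RANDU is x(n+1) = 65539 * x(n) mod 2^31, and because 65539 = 2^16 + 3, a little algebra shows that every triple of consecutive outputs obeys x(k+2) = 6x(k+1) - 9x(k) (mod 2^31); plotted in three dimensions, the “random” points fall onto just fifteen planes instead of filling the cube. A sketch:

```c
#include <assert.h>
#include <stdint.h>

/* IBM's infamous RANDU: x(n+1) = 65539 * x(n) mod 2^31, odd seed. */
static uint32_t randu_state = 1;

uint32_t randu_next(void)
{
    /* mod 2^31 is just masking off the top bit of the 64-bit product */
    randu_state = (uint32_t)(((uint64_t)randu_state * 65539u) & 0x7FFFFFFFu);
    return randu_state;
}
```

Each output passes simple range and frequency checks, which is exactly why the defect went unnoticed for so long: the correlation only shows up when you look at consecutive triples together.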

Do not feel too smug about that, because Microsoft has the same sort of problems (see “Statistical Tests of the IBM PC Pseudorandom Number Generator”). The Microsoft generator has exhibited a number of flaws including, for some seeds, a lack of uniformity in the generated sequences of numbers, and serial correlation within such sequences. If you pick the right seeds for the generator, then your generated sequences can at least pass the basic tests for randomness.

Physical Random Number Generators

Before you get too depressed, it’s probably worth pointing out that when Alan Turing was doing his research with the Enigma machines, he had a physical random number generator. It was basically a vacuum tube (that’s what we had before transistors, for you young people). The tube counted electrical fluctuations. Later, the RAND Corporation created a similar machine, which they used to create a book entitled “A Million Random Digits with 100,000 Normal Deviates” (ISBN 978-0833030474), which was a research classic for decades. We actually used the book by opening it at random and reading off numbers. This wasn’t as good as we thought it was, because physical books tend to get breaks in their spine and flop open to certain sections over time. And before you ask, yes, there was a sequel – “A Million Random Digits THE SEQUEL: with Perfectly Uniform Distribution” (ISBN 978-146100250).

A similar machine, ERNIE, designed in the 1950s by Post Office engineers who had built the Colossus machines for the Bletchley Park codebreakers, was used to generate random numbers for the UK Premium Bond lottery. The British Post Office made a great documentary called “The Importance of Being E.R.N.I.E.”, which you can find on YouTube. Please don’t let the clothing, hairstyles and IT equipment make you laugh too hard.

In the late 1960s, DEC (Digital Equipment Corporation) offered a circuit card that would plug into the bus of their PDP series of computers. It had a little speck of some kind of radioactive material and a circuit to detect its decay – basically a Geiger counter on a card! In theory, if all our theories of quantum physics are right, radioactive decay is truly unpredictable and would meet the qualifications of a true random number generator. The bad news is that the circuit card also had to carry that little three-bladed black-and-yellow propeller thingy to tell people there was radioactive material inside. That symbol, along with the three-bladed biohazard symbol, is really scary for a lot of people. And since there really wasn’t much use for it outside of laboratories, such devices disappeared when DEC did.

Nowadays, we use a variety of random natural events, such as atmospheric noise (see https://www.random.org/), to get a true random number generator (TRNG).

Quick Fix Trick

If you are simply trying to retrieve an exact number or percentage of random rows, use:

SELECT TOP (10) -- or 10 PERCENT
       <column list>
  FROM <base table>
 ORDER BY NEWID();

The NEWID() function is probably random enough for most applications. This, of course, is highly proprietary, but quick and easy. If you Google around a bit, you’ll find some other code for doing “good enough for quick work” sampling. Just don’t expect to get your PhD with this kind of code.
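The same idea can be expressed outside SQL: pair every row with a random key, sort on that key, and keep the first k rows. A minimal C sketch of the principle (names are mine, and rand() stands in for NEWID()):

```c
#include <assert.h>
#include <stdlib.h>

struct keyed_row { int row_id; int sort_key; };

static int by_key(const void *a, const void *b)
{
    const struct keyed_row *x = a, *y = b;
    return (x->sort_key > y->sort_key) - (x->sort_key < y->sort_key);
}

/* "ORDER BY NEWID()" in miniature: tag each row with a random key,
   sort on the key, and the first k rows are the sample. */
void top_k_random(const int *row_ids, int n, int *out, int k)
{
    struct keyed_row *tmp = malloc(n * sizeof *tmp);
    for (int i = 0; i < n; i++) {
        tmp[i].row_id = row_ids[i];
        tmp[i].sort_key = rand();
    }
    qsort(tmp, n, sizeof *tmp, by_key);
    for (int i = 0; i < k; i++)
        out[i] = tmp[i].row_id;
    free(tmp);
}
```

Note the cost: a full sort of the table just to keep k rows, which is exactly why ORDER BY NEWID() gets expensive on large tables.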

TABLESAMPLE Clause

TABLESAMPLE() is a feature that appears in SQL Server and other products. It is based on the SQL:2003 Standard.

An example is:

SELECT AVG(salary_amt)
  FROM Personnel TABLESAMPLE SYSTEM (50);

It will return a sample from the Personnel table, based on data pages, whose size is based on the number specified in the parameter. This sample table is then passed to the WHERE and SELECT clauses in the usual way. SQL Server, PostgreSQL and DB2 have implemented this clause. The goal was to allow the discovery of general trends and patterns in the data.

You’re supposed to request a number of rows or a percentage from a table as a sample. The big problem is that Microsoft has implemented this feature based on data pages, not rows (that’s what the SYSTEM keyword option means; we will talk about the BERNOULLI option shortly). This means that it cannot be guaranteed to return the number of rows you specified; it returns all of the rows from the data pages it picked. Unless your table has only fixed-width columns, the pages pulled out for a given percentage or row count could contain very different numbers of rows.

But a bigger problem, from a theoretical viewpoint, is that by returning pages you get clustering. Think about how you insert data into a table: you do it in batches, usually sorted from an outside source. Thus, one page might have data mostly from County A, the next page mostly from County B, and so on. The physical order of data entry affects what is assigned to pages; the rows are not randomly distributed over the entire table. This gets even worse when you have a clustered index, which causes the table to be sorted. It is essentially an electronic version of wearing a space in the spine of our old book of random digits!

REPEATABLE Option

You can make this more consistent using the REPEATABLE option, but that still won’t make it return the desired number of rows. This option says that the pages used for sampling are chosen using the same random seed, specified as a parameter in the option, each time you go to the base table. The hope is that the data will stay the same. Compare this to the REPEATABLE READ option in transactions.

Remember that TABLESAMPLE cannot be applied to derived tables, tables from linked servers, or tables derived from table-valued functions, rowset functions, or OPENXML. Nor can you use it in the definition of a view or an inline table-valued function. Essentially, it needs to go to a base table where the data physically exists in pages on the disk.

BERNOULLI Option

This sampling method will scan the whole table and randomly pick individual rows. It basically does a “coin flip” for each row, so that each row has an equal likelihood of being in the sample. This algorithm gives a much better random distribution but will be slower, even for small percentages; it is a table scan.
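The coin-flip can be sketched in a few lines of C (an illustration only; the function name is mine, and rand() stands in for the real coin). Note that the sample size comes out as roughly p * n on average, never an exact count:

```c
#include <assert.h>
#include <stdlib.h>

/* BERNOULLI-style sampling: one independent "coin flip" per row.
   Every row is examined (a full scan), and each is kept with
   probability p. Returns the number of rows kept. */
int bernoulli_sample(const int *row_ids, int n, double p, int *out)
{
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (rand() < p * ((double)RAND_MAX + 1.0))  /* flip the coin */
            out[kept++] = row_ids[i];
    return kept;
}
```

Because every row gets its own flip, the sample has no page-clustering bias, but the scan touches all n rows even when p is tiny.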

Microsoft has not implemented this option yet. However, generalizing this to various sample distributions is going to be harder. SQL is being moved from a database language toward some kind of statistical package, and I don’t feel this is a good idea.

If anyone would like to post some of their tricks in the responses, please do so.


Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on newsgroups and forums, and he is famous for his dry wit. He is also interested in Science Fiction.

It depends on what your PhD is in (sorry, this was really an excuse to start the conversation with “it depends”).

The combined SQL rough-and-ready sampling technique is one I’ve used quite a few times. You simply combine TABLESAMPLE with “SELECT TOP (@samplesize) .. ORDER BY NEWID()”, but you need to set the ROWS parameter to @samplesize * (8010 / average row length) * @factor, where @factor is roughly somewhere between 0.1 and 10, depending on how paranoid you are and how much “wastage” you can afford to add to the performance (I usually use 0.5). And, if you’re joining out to a second table, you must take the join cardinality into account in the factor. This has been a very efficient technique for reasonably large samples (~10-100 million rows) from very big tables (50 billion rows+). Of course, the REPEATABLE parameter then becomes pretty ineffectual, as the NEWID() will negate it.
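For what it’s worth, the sizing arithmetic above can be wrapped up as a tiny helper. This is purely my own illustration of the formula (the function name and the 8,010 usable-bytes-per-8-KB-page figure are assumptions, not anything built into SQL Server):

```c
/* Size the TABLESAMPLE ROWS parameter so the sampled pages should
   hold enough rows for TOP (@samplesize) ... ORDER BY NEWID() to
   trim down: @samplesize * (8010 / average row length) * @factor. */
long tablesample_rows(long sample_size, double avg_row_len, double factor)
{
    double r = (double)sample_size * (8010.0 / avg_row_len) * factor;
    long n = (long)r;
    return (r > (double)n) ? n + 1 : n;  /* round up, no math.h needed */
}
```

For example, a 1,000-row sample from a table whose rows average 801 bytes, with a cautious factor of 0.5, asks TABLESAMPLE for 5,000 rows.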

Again, for very big tables, I frequently use TABLESAMPLE(1 PERCENT) combined with a “.. NTILE(2) OVER(ORDER BY col) AS TILE, col ..” type query to pull out the median for a particular column. Doing a second TABLESAMPLE(1 PERCENT) to get the average can be a good starting point for getting a feel for the data in a particular column. Or, if you’re lucky enough to have one, pulling out the histogram from the column stats blob can be very useful if you have a column with a reasonable number of discrete values. Usually these will end up as the features in your model. Again, a rough-and-ready view of a distribution, but there is going to be no quicker way to get the same information.

I also have (somewhere) the recursive SQL code for a PRNG that uses a parameterised LCG (linear congruential generator) to generate non-repeating sequences (i.e. a number will not reappear within the sequence). Also, plugging in any number from the produced sequence, assuming you have the same parameters, will repeat that sequence. It has a fair number of limitations on the parameter values that can be used. And, to be honest, I’ve never really found a use for it in practice, hence I don’t have it to hand.

Of course this isn’t nearly as sophisticated as stratified sampling or other advanced statistical techniques, but when you have tens of millions or billions of rows to deal with .. . Also, when you’re asked for “a sample of X, of size Y rows / percent”, but on probing the requestor can provide you with nothing more in the way of detail – “you get what you pay for”.

Tarjei T. Jensen

The C example generator will give 0 every time as long as the starting value is 1.

1 >> 1 = 0

(n*n >> 3) = 0

And I checked by compiling and running it. And calling the function twice.

nonsensql

I think I’d point out that Joe said it was “simple” rather than “a good one to use”. A quite common type of PRNG (a Linear Congruential Generator) is also a fairly simple piece of code. Although, the maths required to pick suitable parameters to use as defaults for the “equation” is pretty complex. And, there are also a couple of practical issues with using an LCG PRNG. There are lots of other types out there, such as the Mersenne Twister as used in Python’s random module (similarly used in R, variants of C++, and MATLAB).

Anyway, as this is a SQL column, I’ll butt out at this point and get back to the day job.

Sean Redmond

What’s wrong with skipping through the table and taking, say, every nth entry?

I have a table with 20’000 unique e-mail addresses in it. Marketing wants 1’000 unique e-mail addresses for a mailshot, so I copied all of the e-mail addresses into a table with a simple surrogate primary key enforced with IDENTITY and then used the modulo operator to get a little over 1’000 e-mail addresses.

The chosen e-mail addresses were roughly geographically representative of the population as a whole. I could have put in much more work and spent much more time to get an exact representation according to government statistics, but would the extra time spent have been worth the increase in exactness?

Truly random is rarely necessary. Often, it is enough when the sample is reasonably representative.
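(Sean’s approach is classic systematic sampling. A sketch of the arithmetic in C, with names of my own choosing: number the rows with a surrogate key, then keep every row whose key is a multiple of population / desired sample size.)

```c
#include <assert.h>

/* Systematic sampling: from rows keyed 1..population, keep every
   row whose surrogate key is a multiple of `step`, where
   step = population / desired_sample (e.g. 20000 / 1000 = 20). */
int systematic_sample(int population, int step, int *out)
{
    int kept = 0;
    for (int id = 1; id <= population; id++)
        if (id % step == 0)
            out[kept++] = id;
    return kept;
}
```

Note the caveat: the sample is only as random as the key assignment. If the surrogate key correlates with anything (entry date, region), every-nth sampling inherits that pattern.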

nonsensql

Skipping through is pretty much what TABLESAMPLE does for you. It is just that it returns whole pages of results. So you get random pages as a sample, even if you asked for rows or a percentage. For small tables, just a few tens of pages long or smaller, that could be a very artificially skewed sample. Although, TABLESAMPLE is very fast compared to most other techniques, as it only picks and scans the number of pages it needs to satisfy the parameters (ROWS or PERCENT), logically evaluating at the FROM clause phase. Many other techniques will end up scanning the whole table, potentially just to get a tiny sample of it.

As for “truly random” being necessary, well, as Joe implies with his question about RAND(), at no point are we going to be doing that. Hence we’re talking about PRNGs and samples. It is common to discuss whether a sample is “good enough” for a particular application, i.e. just another way of interpreting “reasonably representative”. My answer to the question of when “truly random” is really required is ‘never’ (for all reasonable intents and purposes).

With the techniques outlined here you can get close to a reasonably representative random sample and, more to the point, get it to perform quickly (not a trivial problem for really big data collections). With the increasing demand for data science applications, and bearing in mind that SQL seems to be a big part of that (looking at the job boards, the skills for those postings tend to boil down to “.. R and SQL ..”, “.. Python and SQL ..”, “.. SAS and SQL ..” or “.. MATLAB and SQL ..” – hmm, is my little pattern recogniser picking something up?), these techniques in SQL probably need to be more widely understood and used.
