March 12, 2010

Sample Clause

In some of my previous posts (particularly the ones about analysing the data by running aggregate queries) I’ve mentioned the “sample” clause from time to time, so I thought I’d better write a short note about it for those not familiar with it.

To demonstrate the feature, my first step was to run a test script I wrote a few years ago – and discovered that the feature is broken (at least, in 10.2.0.3 and 11.1.0.7) – so at the end of this post I’ve supplied a little test that you can try at home on your own versions.

The only things I use the sample clause for are tablescans and index fast full scans on a single table or index. In fact this was an explicit limitation in the earlier versions of Oracle, but this restriction has been relaxed in more recent versions. Typically, I’d have a query with the shape:

select ...
from tableX {sample clause}
where ...
group by ...
order by ...

The intention was to select a (fairly small) representative sample of data from a very large table.
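As a concrete sketch of that shape (the table and columns here are made up purely for illustration), the idea is to estimate an aggregate from roughly a 1% sample rather than scanning all the data:

```sql
-- Hypothetical example of the typical query shape: estimate the
-- average order value per region from roughly a 1% row sample.
-- (Results are approximate and vary from run to run.)
select
        region,
        avg(order_value)
from
        orders sample (1)
where
        order_date >= date '2010-01-01'
group by
        region
order by
        region
;
```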

In 10g the syntax for the sample clause is:

sample [block] ( N [,M] ) [seed(S)]

You can sample rows or blocks (the optional block keyword).

You must specify the percentage sample size (N), recognizing that the result will only approximate that size.

You can specify that the sample should consist of groups of M consecutive rows or blocks (the optional ,M).

You can supply a seed value (the optional seed (S)) for the random number generator that drives the sample, so that the sample is repeatable.

The ‘seed’ option appeared in 10g, and is documented there.

The ‘consecutive rows/blocks’ option doesn’t seem to be documented but appears in 9i.
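Putting the variants side by side (assuming a table t1, as in the script later in this post):

```sql
-- Row sampling: roughly 1% of the rows, chosen (nearly) independently.
select count(*) from t1 sample (1);

-- Block sampling: roughly 1% of the blocks - cheaper, but "clumpier".
select count(*) from t1 sample block (1);

-- Repeatable sample: the seed makes the "random" selection deterministic.
select count(*) from t1 sample (1) seed (42);

-- Undocumented "consecutive" option: 1% of the rows, taken in groups
-- of 10 consecutive rows (the 9i behaviour described below).
select count(*) from t1 sample (1,10);
```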

So here are the problems in 10g:

The option “sample (N, M)” doesn’t seem to work properly: instead of sampling N% of the data in groups of M consecutive rows, it seems to sample N*M percent of the rows completely randomly across the table. In other words, 9i and 10g behave very differently. Fair enough, it’s not documented, so I shouldn’t expect the “,M” feature to work anyway.

The option “sample block N” doesn’t seem to work properly. I have a table with 1,640 blocks, so a sample of 5% ought to show me something like 82 blocks. In 9i I keep seeing sample sizes between 69 and 97 blocks, which is not unreasonable. In 10g I keep getting sample sizes of zero, 66, 132 and 198 blocks – with a couple of other numbers occasionally. This is not reasonable. Luckily there is a workaround: if I use the undocumented syntax “sample (N,1)” the 10g results are then consistent with the 9i results.
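As a quick sketch of the symptom and the workaround, you can count the blocks actually sampled by stripping the row detail from the rowid (the same trick used in the script below). I’m assuming here that the workaround keeps the block keyword, i.e. “sample block (N,1)”:

```sql
-- The broken form: in 10g the block counts cluster at odd values
-- (zero, 66, 132, 198 ...) instead of spreading around the 82 expected.
select count(distinct substr(rowid,1,15)) block_count
from   t1 sample block (5);

-- The workaround: the undocumented (N,1) form gives block counts
-- consistent with the 9i behaviour.
select count(distinct substr(rowid,1,15)) block_count
from   t1 sample block (5,1);
```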

Here’s the stripped-down version of my test script. It generates 200,000 rows; the critical item in the data is the id_col, a zero-padded counter that lets us track the selected data fairly easily.

create table t1
nologging
as
with generator as (
select --+ materialize
rownum id
from all_objects
where rownum <= 5000
)
select
lpad(rownum,10,'0') id_col,
rpad('x',42) padding
from
generator v1,
generator v2
where
rownum <= 200000
;
-- collect and check the table stats here
rem
rem We get about 122 rows per block and 1,640 blocks at 8K blocks
rem for the 200,000 row test (first sample size)
rem
rem A sample of 1 percent should get us 2,000 rows, but if we sample
rem in sets of 10 then drop off the last digit of the id_col, we
rem should get about 200 pairs of values - where each pair sums to
rem 10 rows. Expect about 400 rows of output in 9i.
rem
rem The first pattern I got contained lines like:
rem 000000108 4
rem 000000109 6 10 consecutive rows
rem 000000206 8
rem 000000207 2 10 consecutive rows
rem 000000336 7
rem 000000337 3 10 consecutive rows
rem
rem But this pattern doesn't appear from 10g onwards. (You'll
rem get about 2,000 rows of output).
rem
break on report
compute sum of count(*) on report
select
substr(id_col,1,9),
count(*)
from
t1 sample (1,10) -- seed (0)
group by
substr(id_col,1,9)
order by
substr(id_col,1,9)
;
rem
rem Similarly a sample of 10 percent is about 160 blocks.
rem But 8 blocks in a row is about 976 rows. So if we cut off
rem the last three digits of the id_col we should see another
rem pattern. Expect about 42 rows of output.
rem
rem My second run gave
rem 0000003 217
rem 0000004 759 976 consecutive rows
rem 0000007 557
rem 0000008 419 976 consecutive rows
rem 0000017 919
rem 0000018 57 976 consecutive rows
rem
select
substr(id_col,1,7),
count(*)
from
t1 sample block (10,8) -- seed (0)
group by
substr(id_col,1,7)
order by
substr(id_col,1,7)
;
rem
rem A simple sample of 5% of the blocks, reported by
rem stripping out the block detail from the rowid
rem should give me about 82 blocks - so 82 rows of
rem output.
rem
select
substr(rowid,1,15),
count(*)
from
t1 sample block (5)
group by
substr(rowid,1,15)
order by
substr(rowid,1,15)
;

Nice clear analysis (as always) thanks.
I’ve wanted to use this type of functionality to test out query plans on subsets of the real tables, to limit the running result set sizes of an execution plan. Of course I can use CTAS to create smaller versions of the tables, but that’s resource-intensive. It’s tempting to think there might be a way to accomplish the same thing with some limiting clause like the sample clause. The sample clause is problematic in its own ways, but I’m still wondering if there is any way to do it. I started to play with creating views that had the sample clause in them. As I recall the queries worked, but I didn’t get far enough to determine whether the timing and resource results were worth anything. If I go back and try some more I will at least know some of the other gotchas.

Thanks for the comment. From memory there was a time when the sample clause could only work with a single table query, and then Oracle did something to extend it. I never got around to testing it, though. I would be quite surprised if it worked well given the conflicting requirements of relational integrity and random selection.