Direct Path Reads


After one of my Hotsos presentations, I had a discussion with one of the delegates about how his site used parallel execution for some jobs purely to take advantage of Direct Path Reads. He suggested that this was why the parallel 2 tests I ran were always significantly faster than serial. I spoke to a couple of people about this and wondered how I could test it, until Jonathan Lewis pointed out the _serial_direct_read hidden parameter. The reason those last two words are in italics is that you shouldn't set hidden parameters without discussing it with support. You have been warned.

As I don't run my business on my server at home, it's safe for me to test (now that I've bitten the bullet and run a cable down to the server room - it's the first decent playtime I've had for ages). Based on a simple test of a full table scan on a 3-disk software RAID array, there was a small improvement, but I'm not sure it justifies switching to Direct Path Reads except where Oracle would normally use them anyway.
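For anyone who fancies repeating something similar, the test boils down to this sort of thing (big_table is a stand-in name for my 9GB table, and the warning above about hidden parameters still applies):

    -- Session-level switch; _serial_direct_read is a hidden parameter,
    -- so don't touch it anywhere that matters without talking to support.
    ALTER SESSION SET "_serial_direct_read" = true;

    -- The test itself: a simple full scan.
    SELECT /*+ FULL(t) */ COUNT(*) FROM big_table t;

    -- And back to the default behaviour.
    ALTER SESSION SET "_serial_direct_read" = false;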

I ran the tests multiple times, immediately after startup, to pick up any timing variations, but the timings were pretty stable. I also looked through the raw trace files to check that everything looked as expected.

So direct path was about 10-15% quicker for this very basic full scan of a 9GB / 65 million row table on my particular configuration.

Updated 23rd May

Based on comments from Alex and Jonathan, I thought it might be worth adding snippets of the first few trace lines for direct and non-direct I/O, so you can blame them for the longer post. I also had to re-run the tests because some other tests I'd been running overnight had deleted the existing trace files.

Doug,
Did you check with 10046 that it was using direct reads instead of db file scattered read?
Also, I believe that the most important advantage is avoiding contention on the buffer cache, so if you try concurrent processes the benefit should be higher. Well, perhaps your IO subsystem will die first before you get stuck on latch contention.
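For reference, checking which read mechanism was used is just a matter of tracing the session at a level that includes the waits - something like:

    -- Level 8 includes wait events, so 'direct path read' vs
    -- 'db file scattered read' shows up directly in the trace file.
    ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

    -- ... run the test query here ...

    ALTER SESSION SET EVENTS '10046 trace name context off';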

Possibly the improved performance is due to asynchronous (or at least non-blocked) I/O calls. When doing direct reads, Oracle may be using several overlapped I/O calls to populate read buffers in the PGA, thus allowing later reads to take place whilst the results from an earlier read are being processed. (You should see gaps in the 10046 trace of direct path reads if this is true).

You might also check on differences in CPU - possibly the absence of latching may be part of the difference.
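A rough way to look at that would be to snapshot the session statistics around each run; the views and statistic names below are standard, though the particular set chosen is just a suggestion:

    -- Compare these figures before and after each scan; the delta in
    -- 'CPU used by this session' hints at any latching differences.
    SELECT sn.name, ms.value
    FROM   v$mystat ms, v$statname sn
    WHERE  ms.statistic# = sn.statistic#
    AND    sn.name IN ('CPU used by this session',
                       'physical reads direct',
                       'consistent gets');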

"Did you check with 10046 that it was using direct reads instead of db file scattered read?"

Yes, I did. The funny thing was that I was trying to keep this blog shorter because one or two people have said they didn't really like longer blogs about PX.

Jonathan,

"When doing direct reads, Oracle may be using several overlapped I/O calls to populate read buffers in the PGA, thus allowing later reads to take place whilst the results from an earlier read are being processed. (You should see gaps in the 10046 trace of direct path reads if this is true)."

I was reading about that and trying to think how I could illustrate that.

"You must check that you don't have the cursor already parsed ( do hard parse)."

I make sure that happens by restarting the instance before each test. It's an artificial test, but it gives me a consistent starting position and, as I'm reading so much data, I wouldn't expect it to be in the buffer cache anyway.
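(For what it's worth, on 10g there's a less drastic way to get a similar clean start if a restart isn't convenient, though bouncing the instance remains the more thorough reset:)

    -- Clears the caches without a restart - test systems only.
    ALTER SYSTEM FLUSH BUFFER_CACHE;
    ALTER SYSTEM FLUSH SHARED_POOL;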

... required to allow the SGA to be bypassed of course (thanks JL!). I think this is new anyway -- I don't recall seeing it in 10.1. It's a caveat against the frequent use of direct read for busy OLTP tables anyway.

Interestingly, there are different patterns in the elapsed times too. The scattered reads follow a repeating pattern of one slow read followed by five fast reads, whereas the direct path alternates fast and slow reads. That suggests to me that something different is going on at a low level, aside from the issue of the block buffer being bypassed.

btw, I'm supposing that the elapsed time units are different there, of course. That's also curious, but not surprising, as the naming of the number of blocks read also follows different conventions -- those Hotsos profiler guys earn their money dealing with all those inconsistencies.

The elapsed times are microseconds in both cases. But the (still outstanding) problem with the trace is that you really need TWO tim= stamps, one for the start of the wait and one for the end. Then the time spent between the end of one and the start of the next is time spent on CPU.

The trick with the "direct path reads" is that Oracle manages to dispatch N buffers for read requests (probably 4), which is a CPU and memory intensive strategy, then comes back to the first buffer - which in this case requires a very short wait to fill. Then Oracle uses the data, dispatches a new read and moves on to the second buffer. In the interim the second read has nearly completed, but requires a few more microseconds... if Doug's hardware were just a little faster, we wouldn't see any direct path read waits.

Although the 'scattered read' waits are much more variable, check the difference between adjacent tim= figures and the recorded elapsed time. The "error" is in the ballpark of the gap between tim= values in the "direct path" trace -- i.e. the CPU time spent examining the blocks is about the same -- and hides the asynchronous time taken to fill the buffers during the direct path read.

Re: KO - I think that appeared in 10.2. There is a new entry in the buffer header structure which allows for a linked list to be built between buffer headers of the same object. This, of course, means yet another little overhead when reading a block into the buffer in the first place. But it is useful for truncates, drops, and shrinks, as it avoids a massive scanning process if you drop a large object which has not been subject to much update.

Even though I'm in the middle of a very messy server consolidation (working from home), I was going to reply to some of your comments when I got the notification email (has anyone else noticed how slow Blogger has got in this respect?). So I arrive here and Jonathan has got there first (as usual). I'd better get back to what I should be doing.

So what would it mean if the timings for the direct path read were comparable to those for the scattered read, as they seem to be under 10.1 on Windows, for example? I'm not seeing this effect of comparatively much lower wait times for direct path than for scattered, and of course there are no tim= elements in there either.

It could mean that your session isn't allocating much memory for read buffers - so there's only one buffer, and the direct read time starts counting from the moment the read request is dispatched.

Try doing a big sort with events 10032, 10033 and 10046 (with the level set to catch the direct path reads and writes). The 10032 trace will tell you about the number of buffers used (for that sort) and some information about waits for synchronous and asynchronous I/Os. The other traces will show you the I/O sizes and locations. It may give you some clues.
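In sketch form, with the commonly used event levels (big_table is again a stand-in name):

    -- 10032 reports sort statistics, 10033 traces the sort runs,
    -- 10046 level 8 shows the direct path read/write waits.
    ALTER SESSION SET EVENTS '10032 trace name context forever, level 10';
    ALTER SESSION SET EVENTS '10033 trace name context forever, level 10';
    ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

    -- Any sort big enough to spill to disk will do.
    SELECT * FROM big_table ORDER BY 1;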

- When serial direct reads are enabled, a checkpoint of the data object is triggered. This means that all the dirty blocks need to be written to disk first. So depending on how many dirty blocks there are, your measured timings are completely wrong.
- When several concurrent jobs are doing checkpoints, it gets even more expensive.

I think this depends on what you think the test was supposed to demonstrate.

Clearly, there is a rather noticeable difference between the serial direct read time and the scattered read time - which is a worthwhile observation point, and means that some people might want to investigate the issue further.
(Doug does, after all, point out in the blog entry that (a) he restarts the system for each test, and (b) it is a very simple test. I don't think he is attempting to convince people that they can enable _serial_direct_read with no further thought.)

Your point about the object checkpoint is valid - and any further investigation would obviously consider the tradeoff between the time (apparently) gained from direct path reads, and the time lost waiting for dirty blocks from that object to be written.
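One way to put a number on that lost time would be to watch the wait event the object checkpoint appears under (the 10.2 event name) - a minimal sketch:

    -- Instance-wide time spent waiting for the fast object
    -- checkpoint that precedes a serial direct read.
    SELECT event, total_waits, time_waited
    FROM   v$system_event
    WHERE  event = 'enq: KO - fast object checkpoint';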

Your point about the concurrent reads, however, raises a further interesting question. If we start two "direct path queries" at the same time, won't the second query benefit from the writes performed by the first query? After all, if there are (say) 4,500 dirty blocks to write, won't the second query simply wait for the first query to write them, then find that the checkpoint queue has been cleared for it (plus or minus a few extra blocks)?

In fact, increased concurrency might ensure that there are only ever a few blocks to write, so making Doug's "clean start" a reasonable model in many cases.

Jonathan -
You mentioned that in 10.2 "there is a new entry in the buffer header structure which allows for a linked list to be built between buffer headers of the same object"

You mention that this list is used for truncates, drops and shrinks - is it also used for performing the fast object checkpoint that precedes direct reads?

For partitioned objects - is there a single list for blocks from all partitions, or is there a separate list for each partition?

This relates to your comment about the tradeoff between time (apparently) gained from direct path reads vs time lost waiting for dirty blocks to be written; what would be the behavior in the case of a partitioned table?

For example - a table partitioned by date ranges, where heavy read/write activity occurs only against the most recent partition.
What would happen (checkpoint-wise) if we performed direct reads against an "old" partition for which there are currently no blocks in the buffer cache?

If you do a dump of the buffer cache, the name for this particular linked list is the "objq".
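For anyone who hasn't done one, the dump itself is a one-liner (the trace file appears in user_dump_dest):

    -- Dump the buffer headers, including the objq linkage, to the
    -- session's trace file; higher levels add more detail.
    ALTER SESSION SET EVENTS 'immediate trace name buffers level 1';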

I haven't tried to do an exhaustive analysis of how it works - but my first guess would be that each linked list links only the blocks of a given data_object_id (as opposed to object_id), which means each partition of a simple partitioned object would be on a different queue, and each subpartition of a composite partitioned object would be on a separate queue. You could confirm this quite easily by setting up the example you've described and seeing what writes occur if you run a parallel query against the unchanged partition.
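A quick way to see the object_id / data_object_id distinction is a query like this (PART_TAB is a stand-in name for a partitioned table):

    -- Each partition has its own data_object_id, i.e. its own physical
    -- segment - and, if the guess above is right, its own object queue.
    SELECT object_name, subobject_name, object_id, data_object_id
    FROM   user_objects
    WHERE  object_name = 'PART_TAB'
    ORDER  BY subobject_name;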

It is possible that this type of linked list could be used for the "object checkpoint" - but it is also possible that the database writer simply walks the checkpoint queue looking for buffers matching the relevant object id. In different circumstances, either strategy could be the faster one.

If you wanted to check, it would be easy, but tedious, to set up an experiment that read a lot of blocks in an object at random, updated them randomly, then dumped the buffer cache to check the ordering on the different queues, then traced the database writer from the O/S level to see the order in which the blocks were written.
