Learn How To Obliterate Processor Caches: Configure Lots and Lots of DBWR Processes. Part I.

Oracle8 introduced true multiple DBWR processes, configured with the db_writer_processes parameter. As I remember it, the maximum value one could set back then was 10; Oracle8i raised that limit. These are processes that build their own write batches (scanning LRUs, modifying cache buffers chains structures, etc.) and perform their own I/O, synchronously or asynchronously. They also post (e.g., via semop(2) on most platforms) processes waiting on free buffer waits. Originally, multiple true DBWR processes were implemented to support NUMA systems, which needed processes local to memory and I/O cards to flush buffers, thus eliminating remote I/O operations.

On systems with flat memory (non-NUMA), the only reason to configure multiple DBWR processes is if a single DBWR is processor bound. That condition is simple to detect: free buffer waits will shoot through the roof, and a cursory glance at top(1) will show DBWR pegged at the top.
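For the shell-minded, here is one quick way to eyeball that condition. The sample ps output below is fabricated for illustration; on a live system you would feed the awk filter directly from ps, as shown in the comment.

```shell
# Spot a processor-bound DBWR from ps output. The sample here-string is
# fabricated; on a live system replace it with the output of:
#   ps -eo pcpu,comm | grep ora_dbw
sample='98.7 ora_dbw0_PROD
 1.2 ora_dbw1_PROD
 0.3 ora_lgwr_PROD'

# Flag any DBWR process consuming more than 90% of a CPU.
printf '%s\n' "$sample" |
  awk '$2 ~ /ora_dbw/ && $1 > 90 { print $2, "looks processor bound" }'
```

Pair this with a glance at free buffer waits in statspack; CPU saturation alone does not tell you whether foregrounds are actually stalling for clean buffers.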

I was just giving this NetApp paper a quick read and found the following absolute rule on page 9:

Async I/O is now recommended on all the storage protocols. The recommended value for db_writers_processes is to at least match the number of processor cores on the system.

Besides the fact that the author misspelled the initialization parameter (it is db_writer_processes), this is not good advice at all. I wouldn’t be as critical if they had recommended a DBWR process per socket bound via taskset(1), but one per core just floating around is going to impact L2 caches. The way this works is that the cache buffers LRUs are split amongst the DBWR processes. Each DBWR process maintains its own set of LRUs, so even with a light amount of flushing the work is spread across all the DBWR processes. Here is an absolute: never break up a workload that a single processor can satisfy and force it to be distributed across multiple CPUs. When you do that you simply increase the scheduling overhead and increase the amount of work that is done cache-cold (L2). That is, lightly tasked DBWR processes are more likely to get scheduled on CPUs they haven’t recently run on, and thus they run cache cold.
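To make the splitting mechanism concrete, here is a toy sketch. The set and writer counts are made up, and the round-robin assignment is a simplification for illustration, not Oracle's actual internal distribution:

```shell
# Toy model: divide the cache buffers LRU sets across the configured
# DBWR processes. Even a trickle of flushing now involves every writer,
# because each writer owns a slice of the LRUs.
show_lru_split() {
    lru_sets=$1      # number of LRU sets in the buffer cache
    writers=$2       # db_writer_processes
    i=0
    while [ "$i" -lt "$lru_sets" ]; do
        echo "LRU set $i -> DBW$((i % writers))"
        i=$((i + 1))
    done
}

show_lru_split 8 4
```

With 8 LRU sets and 4 writers, every writer wakes up holding work even when one process could have drained all 8 sets without breaking a sweat; that is precisely the cache-cold scatter described above.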

I don’t expect anyone to just take my word for it, although it would be nice if at least some did. I do get tired of proving facts that date back to the mid-1990s. Ugh, that sounded grumpy, sorry. I’ll see if I can’t push a run through and provide some performance data for the pathological case (1 per core, no affinity) versus configurations ranging from 1 up to 1 per core, each with affinity. I haven’t done that in a while. It’ll be fun... yippee.

Summary
If anyone gives you an absolute, ignore it. That includes me. As sure as I’m sitting here I will get at least two emails on this topic. One will tell me that they increased db_writer_processes and “got a performance boost.” The other will tell me to remember that sometimes man bites dog.

Fundamentally, however, a single DBWR process on a flat-memory SMP, given kernel async I/O, ample scheduling priority, non-preemption on ports that support it, and ample CPU cycles, will be able to keep the SGA clean. If any of these criteria are not met, then throwing another DBWR at the situation should help, but that is entirely different from starting with 1 per core.

14 Responses to “Learn How To Obliterate Processor Caches: Configure Lots and Lots of DBWR Processes. Part I.”

On large NUMA-based servers, such as the Sun E25K, Oracle will by *default* create one DBWR process per system board. This allows each DBWR process to search LRUs out of local memory, which is very efficient… Kevin could explain the benefits of this far better than I could 🙂

The best advice I could give on “db_writer_processes” is don’t set it AT ALL. Let Oracle figure out the best setting.

Thanks Glenn, but remember, you live with the boutique ports. I have a lot of readers that use the Linux ports. Those are basically “reference ports”; they have none of the OSD specialization that runs on an E25K. The very fact that the Solaris port is robust enough to detect that it is running on an E25K and branch to that small amount of NUMA-optimized code is proof of that specialization.

On Linux, set it to 1. If it bottlenecks, set it to 2. A little processor affinity in that case would be smart (e.g., taskset(1) each DBWR to a set of N cores from the same socket).
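A sketch of what that binding might look like. The core ranges 0-3 and 4-7 are assumptions for a two-socket quad-core box (verify yours with lscpu or /sys/devices/system/cpu/*/topology/physical_package_id), and the placeholder pids stand in for real ones from pgrep:

```shell
# Pin each DBWR to the cores of one socket, alternating sockets.
# Usage: pin_dbwr <run-prefix> <pid>...
#   run-prefix "echo" prints the commands (dry run); "" executes them.
pin_dbwr() {
    run=$1; shift
    i=0
    for pid in "$@"; do
        case $((i % 2)) in
            0) cores="0-3" ;;   # assumed socket 0 core list
            1) cores="4-7" ;;   # assumed socket 1 core list
        esac
        # taskset -pc restricts an already-running process to a CPU list
        $run taskset -pc "$cores" "$pid"
        i=$((i + 1))
    done
}

# Dry run against placeholder pids; live: pin_dbwr "" $(pgrep -f ora_dbw)
pin_dbwr echo 101 102 103
```

Keeping each writer inside one socket means its LRU scanning stays warm in that socket's caches instead of migrating across the box.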

Dang it Glenn, I wish you’d read and post here more often. If you don’t, I’m going to make you pay for the Martinis next week!

Please stand by. I’m wrapping up a proof on this topic (even though I thought I finished doing these similar cache affinity proofs back in the mid 1990s). When I produce the proofs (statspack, throughput, etc), I’ll elaborate. By all means remind me if I don’t answer your questions when that time comes, OK?

I had always thought that one *should* set the number of DBWR processes equal to the number of physical CPUs. Naturally, I can’t recall where I picked up that advice. Is it really outdated?

I’m about to migrate to a 4 dual-core CPU box. It is a mix of OLTP and bulk-loaded/read tables. Is 4 DBWRs a good starting point?

It depends on where the advice is presented. In prior lives I’ve written product documentation where I recommended 1 per core simply so I wouldn’t have to document what happens if you set more than 1 per core, or what to do if one writer cannot keep up. Depending on how much you write, having 1 per core, or even more than 1 per core, might not matter at all. Think about it: if DBWR doesn’t do much on a given workload, having a few extra hanging around doesn’t much matter. So it depends on whether you are reading a general, sort of “blanket” recommendation.

On the blog I aim to go beyond the “what time it is” and even the “how to tell time” and present the occasional “how the clock works.” From a processor cache point of view, splitting up work that a single process can comfortably do and spreading it out over multiple processes is not fundamentally sound. There will be exceptions, I’m sure. It is better to start with few (or 1 if you listen to me) and add based upon free buffer waits.

As for your specific question, you say there will be “OLTP” and bulk-loaded/read tables. If the bulk loading is with Loader (or External Tables), then DBWR won’t have anything to do with it. The OLTP demand upon DBWR’s services depends on the write load. I would start with 1 and add more if you need to.
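By way of illustration only, a hypothetical init.ora starting point along those lines. The values are a sketch of the "start small" advice above, not a tuned recommendation for your box:

```
db_writer_processes=1   # start small; add only if free buffer waits appear
disk_asynch_io=TRUE     # lean on kernel async I/O before adding writers
```

From there, watch free buffer waits under your real OLTP load before touching the writer count.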