This stuff was developed because we were running a benchmark that we could not get to perform. We had spent several weeks trying to figure out what was happening with no success. The symptoms were clear—the system was mostly idle—we just couldn’t figure out why.

We looked at the statistics and ratios and kept coming up with theories, the trouble was that none of them were right. So we wasted weeks tuning and fixing things that were not the problem. Finally we ran out of ideas and were forced to go back and instrument the code to figure out what the problem was.

Once the waits were instrumented the problem was diagnosed in minutes. We were having “free buffer” waits because the DBWR was not writing blocks fast enough. It’s amazing how hard that was to figure out with statistics, and how easy it was to figure out once the waits were instrumented.

...In retrospect a lot of the names could be greatly improved. The wait interface was added after the freeze date as a “stealth” project so it did not get as well thought through as it should have. Like I said, we were just trying to solve a problem in the course of a benchmark. The trouble is that so many people use this stuff now that if you change the names it will break all sorts of thing tools, so we have to leave them alone.

The Working-Waiting Model

In this formula, Oracle defines service time as the duration of the CPU used by your Oracle session (the duration Oracle spent working), and wait time as the sum of the durations of your Oracle wait events (the duration that Oracle spent waiting). Of course, response time in this formula means the duration spent inside the Oracle Database kernel.

Why I Don’t Say Wait, Part 1

There are two reasons I don’t use the word wait. The first is simply that it’s ambiguous.

The Oracle formula is okay for talking about database time, but the scope of my attention is almost never just Oracle’s response time—I’m interested in the business’s response time. And when you think about the whole stack (which, of course you do; see holistic), there are events we could call wait events all the way up and down:

The customer waits for an answer from a user.

The user waits for a screen from the browser.

The browser waits for an HTML page from the application server.

The application server waits for a database call from the Oracle kernel.

If I say waits, the users in the room will think I’m talking about application response time, the Oracle people will think I’m talking about Oracle system calls, and the hardware people will think I’m talking about device queueing delays. Even when I’m not.

Why I Don’t Say Wait, Part 2

There is a deeper problem with wait than just ambiguity, though. The word wait invites a mental model that actually obscures your thinking about performance.

Here’s the problem: waiting sounds like something you’d want to avoid, and working sounds like something you’d want more of. Your program is waiting?! Unacceptable. You want it to be working. The connotations of the words working and waiting are unavoidable. It sounds like, if a program is waiting a lot, then you need to fix it; but if it’s working a lot, then it is probably okay. Right?

The connotations “work is virtuous” and “waits are abhorrent” are false connotations in Oracle. One is not inherently better or worse than the other. Working and waiting are not accurate value judgments about Oracle software. On the contrary, they’re not even meaningful; they’re just arbitrary labels. We could just as well have been taught to say that an Oracle program is “working on disk I/O” and “waiting to finish its CPU instructions.”

The terms working and waiting really just refer to different subroutine call types:

“Oracle is working”

means

“your Oracle kernel process is executing a user call”

“Oracle is waiting”

means

“your Oracle kernel process is executing a system call”

The working-waiting model implies a distinction that does not exist, because these two call types have equal footing. One is no worse than the other, except by virtue of how much time it consumes. It doesn’t matter whether a program is working or waiting; it only matters how long it takes.

Working-Waiting Is a Flawed Analogy

The working-waiting paradigm is a flawed analogy. I’ll illustrate. Imagine two programs that consume 100 seconds apiece when you run them:

Oracle’s wait interface is vital because it helps us measure an Oracle program’s complete execution duration—not just Oracle’s user calls, but its system calls as well. But I avoid saying wait to help people steer clear of the incorrect bias introduced by the working-waiting analogy.

Monday, August 7, 2017

When I was a young boy, my dad would sometimes drive me to school. It was 17 miles of country roads and two-lane highways, so it gave us time to talk.

At least once a year, and always on the first day of school, he would tell me, “Son, there are two answers to every test question. There’s the correct answer, and there’s the answer that the teacher expects. ...They’re not always the same.”

A big problem with expert is corruption—when self-congratulators hijack the label to confer authority upon themselves. But of course, misusing the word erodes the word. After too much abuse within a community, expert makes sense only with finger quotes. It becomes a word that critical thinkers use only ironically, to describe people they want to avoid.

Saturday, July 29, 2017

The “best practice” serves a vital need in any industry. It is the answer to, “Please don’t make me learn about this; just tell me what to do.” The “best practice” is a fine idea in spirit, but here’s the thing: many practices labeled “best” don’t deserve the adjective. They’re often containers for bad advice.

The most common problem with “best practices” is that they’re not parameterized like they should be. A good practice usually depends on something: if this is true, then do that; otherwise, do this other thing. But most “best practices” don’t come with conditions of execution—they often contain no if statements at all. They come disguised as recipes that can save you time, but they often encourage you to skip past thinking about things that you really ought to be thinking about.

Most of my objections to “best practices” go away when the practices being prescribed are actually good. But the ones I see are often not, like the old SQL “avoid full-table scans” advice. Enforcing practices like this yields applications that don’t run as well as they should and developers that don’t learn the things they should. Practices like “Measure the efficiency of your SQL at every phase of the software life cycle,” are actually “best”-worthy, but alas, they’re less popular because they sound like real work.

Thursday, July 27, 2017

When people use the word “holistic” in my industry (Oracle), it means that they’re paying attention to not just an individual subcomponent of a system, but to a whole system, including (I hope) even the people it serves.

But trying to differentiate technology services by saying “we take a holistic view of your system” is about like differentiating myself by saying I’ll wear clothes to work. Saying “holistic” would make it look like I’ve only just recently become aware that optimizing a system’s individual subsystems is not a reliable way to optimize the system itself. This should not be a distinctive revelation.

Approximately 100% of the time in the [mostly non-scientific] Oracle world that I live in, when people say “methodology,” they’re using it in the form that American Heritage describes as a pretentious substitute for “method.” But methodology is not the same as method. Methods are processes. Sequences of steps. Methodology is the scientific study of methods.