Wednesday, June 16, 2010

This page has been permanently moved. Please
CLICK HERE to be redirected.

Thanks, Craig.

If you have encountered the log file sync wait event before, you know it can be very difficult to resolve. And even when a solution technically makes sense it can be very challenging to anticipate the performance improvement. In the three previous blog postings in this series I explored the wait event and its common cause (rapid commits), and setup an experiment to demonstrate that by increasing the inserts per commit (batching the commits) Oracle was able to insert more rows per second. We explored the core response time components (service time, queue time, arrival rate, number of "servers") and finally created a response time curve representing the current performance situation!

One important value of the response time curve is helping others to visually see that the performance is precarious (or at least non-optimal). I've never met a manager who didn't quickly grasp that being in the "elbow of the curve" was a bad thing. Most people inherently feel in their gut that it's not the place to be. So then the logical question is, "How do we get out of the elbow of the curve?" And that, is when you list off your fantastic performance solutions!

In this blog entry I want to demonstrate how you can take your performance solutions and inject them into the response time plot. This allows you to visually show others how you will "get out of the elbow of the curve." Plus, if/when you are challenged or you'd like to provide more detail, you have the numerics to back up your analysis. So let's get started!

Thinking Consistently

All performance solutions will in some way alter the core response time variables; service time, queue time, arrival rate, or M. Having a solid and consistent definition of the variables are key to understanding how a performance solution will alter one or more these variables. So let's quickly review them from an Oracle system perspective.

Service time is the CPU consumed per unit of work. For example, 0.0914 ms/insert or perhaps 5.01 ms/pio, where "pio" is Oracle physical block reads, which I commonly call, physical IO.

Queue time is the non-idle Oracle wait time per unit of work. For example, 0.1356 ms/insert or perhaps 115.35 ms/pio.

Arrival rate is the work processed (i.e., the workload) per unit of time. For example, 10.8445 inserts/ms or perhaps 0.0516 pio/ms. (This is not a 100% correct definition, but it's close enough for this analysis.)

M is the number of effective transaction servers, such as a CPU core or an active IO "device." For example, 1.4235 devices or perhaps 0.2695 devices. (Defining an IO "device" is out of scope for this blog entry, but you can scan my Firefighting book for the word, here.) M can be difficult to solve, so this free msolver web application can help.

How Solutions Change the Parameters

As I mentioned above we can boil down any Oracle performance solution to primarily changing one of the above parameters, and probably only either the service time, arrival rate, or the number of servers (M). Let me give you a few examples:

If, in your SQL tuning prowess, you reduce the number of physical IOs and PIO is the unit of work, then you are reducing the arrival rate, that is, the number of PIOs that occur over the sample (e.g., AWR report) interval of time.

If, in your Oracle internals mastery, you increased the number of cache buffer chain latches to reduce cache buffer chain latch contention, you are reducing the CPU consumed/required to process a single logical IO, which means the service time will be reduced.

If, in your financial wizardry you beg, buy, or steal (By stealing, I'm referring to virtual machine power of course.) CPU cores, then you are increasing the number of effective servers, M.

If, in your gut you know that, if you increase the Oracle buffer cache less physical IOs will be required, then you are reducing the arrival rate, that is, the number of PIOs that occur over an interval of time.

In general, here's how to work through understanding how your change will affect either the service time, arrival rate, or the number of servers (M). Please read this slowly...

If the service time is going to decrease, then your performance solution will, on average over an interval of time, reduce the amount of CPU Oracle consumes to process a single unit of work. This typically requires some kind of efficiency improvement, like tuning Oracle or using faster CPUs.

If the arrival rate is going to decrease, then your solution will, on average over an interval of time, reduce the amount of work to be processed per unit of time. You are essentially reducing the workload, that is, requiring the system to process less units of work. Workload balancing and SQL tuning are good examples of reducing the arrival rate.

If the number of servers is going to change (increase or decrease), then there will need to be change in the number of effective CPU cores or IO devices. This typically requires a physical configuration change and therefore is not very likely in most Oracle systems. However, a virtual machine can "easily" alter the number of CPU cores, thereby directly affecting the number of servers.

Focus on the inputs, not the outputs. I also find it's helpful to focus on the single most pronounced change. For example, if the service time will decrease, then the system will be able to process more work per unit of time. This means the arrival rate will most likely increase... but this arrival rate increase is the result, that is, the output of our solution, not the input. Focus on what you will change, enter that into the response time formula (as I'll show below), and let the graph appear before you eyes. Then you can see, for example, that the arrival could possibly increase before the response time started to dramatically increase. So focus on the inputs and let the math take care of the output.

How this works in practice is that for each one of your proposed performance solutions, you ask yourself how the solution will impact the service time, arrival rate, or the number of servers. You can create a simple matrix to help organize your thoughts.

I'm not implying this is a simple thought process. Personally I find this distillation process very difficult. It forces me to bring together all my Oracle internals, application SQL tuning, operating systems, and queuing theory knowledge and experience while at the same time keeping the above core definitions consistent. It's just plain hard. However, the process purifies my thinking and forces me to really think through just why a particular performance solution will make a difference. And if I can do that, then I'm able to explain the performance situation and solution much better others... not to mention the confidence I gain in anticipating the impact of my proposed solution. (Or arguing with a vendor about why their recommendation makes absolutely no sense.)

So if you want to get good at this, it will take some practice. (And if you want to, email me your thought process. I will email you back.) It takes an entire day of working with my students in my Advanced Oracle Performance Analysis course before I'm able to open up the discussion of how nearly a dozen performance solutions will effect the service time, arrival rate, or the number of servers. As a teacher, it's one of my life's highlights because I know everyone in the class is about to have at least one "ah-ha moment." Guaranteed!

Putting This Into Practice

Enough talk, let's apply this to reality. I'm going to continue using our commit batch size experiment because we have actual real numerical results when the batch size was increased. The full experimental results can be found in Part 2 of this blog series and the top few can be found in Part 3. I have duplicated the just the top few lines below:

In the previous blog (Part 3) I showed a graph plotting the above arrival rate and response time (service time + queue time). Then applying the CPU-based response time formula while varying only the arrival rate, the associated response time curve appeared and ran directly through our datapoint!

Now let's take a common log file sync solution and think through how that specific solution will affect either the service time, arrival rate, or the number of servers. It's paramount to remember the unit of work is a single insert and the unit of time is constant, in this case a single millisecond. Supposing our performance improving solution is to increase the batch size from one to four, let's think about how that will affect the service time, arrival rate, and number of servers.

Service time will decrease because Oracle has become more CPU efficient per insert. It should take less CPU time to insert each row because by batching our inserts into a single commit, Oracle distributes the commit CPU consumption overhead across more inserts. This decreases the CPU overhead, that is, consumption per insert. So we expect the CPU consumed per insert, that is the service time, to decrease.

As an input parameter, the arrival rate will remain the same because I am not pushing/forcing more work through the system. I am not starting more insert processes, asking employees to work faster, or bringing more employees on-line. However, I do expect the system to be able to process more work per time, but I am not forcing this change. Remember, I am looking for input changes, not output results.

The number of servers (M) will not change. We are modeling this situation as a CPU bound system and since we are not altering the number of CPU cores the number of servers will not change.

Now the question becomes, "Just how much will the service time decrease?" This is where your experience, gut feeling, the risk of being wrong and the resulting impact, and testing all come together. I highly recommend you not do this in isolation. If you are having a good day, you're likely to be optimistic. And of course, if you're having a bad day, you'll be pessimistic. So discuss this with at least one other person.

Fortunately I did run a test! In this situation, the test is the experimental results shown in the table above. The only input change is the service time while keeping constant the arrival rate and the number of servers. The experimental result table above shows that with a batch size of four, the actual service time decreased by 32.95%, down to 0.0613 ms/insert. Therefore:

The experimental result table above show that with a batch size of four, the actual service time decreased by 32.95% down to 0.0613 ms/insert and the actual response time was 0.1527 ms/insert. However, we anticipated (never use the word predict or forecast using this analysis method...more below about this) the response time would drop to 0.926363 ms/insert. So in this situation it appears we anticipated a significantly greater benefit then actually occurred? Nope, read on...

This is important to understand: The experimental results shown in the table above are based on a real Oracle system were we allowed the arrival rate to increase to a steady-state and is not based on mathematics! In contrast, our calculated response time is based on a frozen arrival rate, a frozen number of servers, and we only changed the service time. Because our response time figure is based on a lower arrival rate (10.8445 vs 16.2470), we would expect our response time calculation to be less.

Now I am going to substitute the actual experimentally observed arrival rate (16.2470 inserts/ms) into the CPU based response time formula. The inputs and the calculated response time are now:

Wow! What this means is if we entered the observed increased workload (that is, the arrival rate) into our formula, our calculations would have been within a celebratory 0.7%. This is amazing and should provide some confidence when using this analysis method. Don't ever let someone get away with saying Oracle systems don't behave in a query theory like way! Can I be so bold as to say, they have not freak'n idea what they are talking about!?

For those of us who understand better with pictures, let's look at the before and after graphs. The below graph was created using my free Response Time Comparison MS-Excel based spreadsheet.

Our baseline situation is the blue point on the blue response time curve. This represents the system with one insert per commit. Based on our experiment, when we increased the batch size from one to four, we observed the service time decreased (0.0613 ms/insert) and we recalculated the response time (0.9264 ms/insert), which is represented by the red point on the orange curve. Notice that along with the service time decrease the entire response time curve shifted down and to the right resulting in the orange response time curve. (I will blog about this response time curve shift another time.) We then entered the observed four inserts per commit arrival rate (16.2470 inserts/ms) into our response time calculation and the the orange point resulted showing that at an arrival rate of 16.2470 inserts/ms the response time is 0.154 ms/inserts, which is nearly perfect to what what actually occurred! So in this case our model did a great job at anticipating what was to occur. (This is also why it is important to test our proposed solutions whenever possible. The test provided us with a reliable service time.)

Very Cool Yes, But Caution Please

If you're like many DBAs, this type of analysis is very exciting because it opens up an entirely new method to effectively communicate a complex performance situation, brings about a deeper level of performance understanding, and allows us to anticipate change.

But just how good is this "anticipation?" While the example I used worked nearly perfect (and was not selected among many examples), it's not always this good and more importantly, the performance model developed is not robust.

Let me contrast the anticipatory work I just did to a true predictive analysis forecast. What we did was quickly develop a performance model based on a single data point. Yes, the data point is based on a real system, contains actual Oracle activity over an interval of time, and standard performance mathematics was used. That's the good news and why this works fantastic for quick work in an existing firefighting situation. However, there is plenty of bad news! To name just a few weaknesses: Our model is based on a single data point, performance can dramatically fluctuate during intense times, and what scares me the most is, our analysis contains absolutely no statistical work. This means we have absolutely no mathematics behind the "plus and minus" of our "forecast."

Does this mean we throw out all the work we have just done? Absolutely not! It's much, much better than using our gut, our experience, group think, marketing hype, vendor promises, or the latest performance tuning fad. At a minimum it will force you to think through why a performance solution should work, why for some strange reason you want to aggressively implement a particular solution, allows you to communicate more effectively, and provides a reference point for your work.

My point is, you must develop a proper and robust predictive model to answer questions like, "What will happen if we we insert this new SQL statement?" Or, "Will the system be able to handle the load in nine months?" Or, "Should be purchase these new CPUs?" These questions are based on you not being in a firefighting situation. This means you have time to gather lots of data and perform a solid predictive performance analysis.

Concluding Thoughts

Over the past few blog entries we have explored why increasing the commit batch size can make a dramatic performance impact when presented with the log file sync wait event. If you have followed this entire series, you know we've covered a lot of ground:

1. Understanding what the log file sync wait event means.
2. Presented one of the most common solutions, that is, increasing commit batch size.
3. Modeled the system in a traditional performance mathematics way, that is, response time.
4. Created a graph to visually demonstrate the situation.
5. Discussed how and why we expect our batching solution to impact the key performance variables.
6. Altered the service time to reflect increasing the batch size from one to four.
7. Anticipated the response time change both numerically and graphically.
8. Demonstrated our model anticipated response time nearly perfect!
9. Discussed why this analysis is only appropriate for performance firefighting work.

This brings me to the end of this blog series. Now it's time to go back and finish the parallelism blog series. Plus I've been working on ways to visualize Oracle internal structures... so there is some good stuff to be posted!

Thanks for reading!

Craig.

P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at craig@orapub.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email.