Wednesday, June 16, 2010

This page has been permanently moved. Please
CLICK HERE to be redirected.

Thanks, Craig.

If you have encountered the log file sync wait event before, you know it can be very difficult to resolve. And even when a solution technically makes sense, it can be very challenging to anticipate the performance improvement. In the three previous blog postings in this series I explored the wait event and its common cause (rapid commits), and set up an experiment to demonstrate that by increasing the inserts per commit (batching the commits) Oracle was able to insert more rows per second. We explored the core response time components (service time, queue time, arrival rate, number of "servers") and finally created a response time curve representing the current performance situation!

One important value of the response time curve is helping others to visually see that the performance is precarious (or at least non-optimal). I've never met a manager who didn't quickly grasp that being in the "elbow of the curve" was a bad thing. Most people inherently feel in their gut that it's not the place to be. So then the logical question is, "How do we get out of the elbow of the curve?" And that is when you list off your fantastic performance solutions!

In this blog entry I want to demonstrate how you can take your performance solutions and inject them into the response time plot. This allows you to visually show others how you will "get out of the elbow of the curve." Plus, if/when you are challenged or you'd like to provide more detail, you have the numerics to back up your analysis. So let's get started!

Thinking Consistently

All performance solutions will in some way alter the core response time variables: service time, queue time, arrival rate, or M. Having a solid and consistent definition of the variables is key to understanding how a performance solution will alter one or more of these variables. So let's quickly review them from an Oracle system perspective.

Service time is the CPU consumed per unit of work. For example, 0.0914 ms/insert or perhaps 5.01 ms/pio, where "pio" is Oracle physical block reads, which I commonly call, physical IO.

Queue time is the non-idle Oracle wait time per unit of work. For example, 0.1356 ms/insert or perhaps 115.35 ms/pio.

Arrival rate is the work processed (i.e., the workload) per unit of time. For example, 10.8445 inserts/ms or perhaps 0.0516 pio/ms. (This is not a 100% correct definition, but it's close enough for this analysis.)

M is the number of effective transaction servers, such as a CPU core or an active IO "device." For example, 1.4235 devices or perhaps 0.2695 devices. (Defining an IO "device" is out of scope for this blog entry, but you can scan my Firefighting book for the word, here.) M can be difficult to solve, so this free msolver web application can help.

How Solutions Change the Parameters

As I mentioned above we can boil down any Oracle performance solution to primarily changing one of the above parameters, and probably only either the service time, arrival rate, or the number of servers (M). Let me give you a few examples:

If, in your SQL tuning prowess, you reduce the number of physical IOs and PIO is the unit of work, then you are reducing the arrival rate, that is, the number of PIOs that occur over the sample (e.g., AWR report) interval of time.

If, in your Oracle internals mastery, you increased the number of cache buffer chain latches to reduce cache buffer chain latch contention, you are reducing the CPU consumed/required to process a single logical IO, which means the service time will be reduced.

If, in your financial wizardry you beg, buy, or steal (By stealing, I'm referring to virtual machine power of course.) CPU cores, then you are increasing the number of effective servers, M.

If, in your gut, you know that increasing the Oracle buffer cache will mean fewer physical IOs are required, then you are reducing the arrival rate, that is, the number of PIOs that occur over an interval of time.

In general, here's how to work through understanding how your change will affect either the service time, arrival rate, or the number of servers (M). Please read this slowly...

If the service time is going to decrease, then your performance solution will, on average over an interval of time, reduce the amount of CPU Oracle consumes to process a single unit of work. This typically requires some kind of efficiency improvement, like tuning Oracle or using faster CPUs.

If the arrival rate is going to decrease, then your solution will, on average over an interval of time, reduce the amount of work to be processed per unit of time. You are essentially reducing the workload, that is, requiring the system to process fewer units of work. Workload balancing and SQL tuning are good examples of reducing the arrival rate.

If the number of servers is going to change (increase or decrease), then there will need to be a change in the number of effective CPU cores or IO devices. This typically requires a physical configuration change and therefore is not very likely in most Oracle systems. However, a virtual machine can "easily" alter the number of CPU cores, thereby directly affecting the number of servers.

Focus on the inputs, not the outputs. I also find it's helpful to focus on the single most pronounced change. For example, if the service time will decrease, then the system will be able to process more work per unit of time. This means the arrival rate will most likely increase... but this arrival rate increase is the result, that is, the output of our solution, not the input. Focus on what you will change, enter that into the response time formula (as I'll show below), and let the graph appear before your eyes. Then you can see, for example, how much the arrival rate could increase before the response time starts to increase dramatically. So focus on the inputs and let the math take care of the output.

How this works in practice is that for each one of your proposed performance solutions, you ask yourself how the solution will impact the service time, arrival rate, or the number of servers. You can create a simple matrix to help organize your thoughts.
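One way to keep such a matrix is as a small table in code. The sketch below is only an illustration: the names `SOLUTION_MATRIX` and `solutions_changing` are hypothetical, and the rows simply restate the four example solutions listed earlier in this entry.

```python
# Illustrative solution matrix: each proposed solution is mapped to the
# single model input it changes most, and the direction of that change.
SOLUTION_MATRIX = [
    # (proposed solution, input changed, direction)
    ("Tune SQL to reduce physical IOs", "arrival rate", "decrease"),
    ("Add cache buffer chain latches", "service time", "decrease"),
    ("Add CPU cores", "servers (M)", "increase"),
    ("Increase the Oracle buffer cache", "arrival rate", "decrease"),
]


def solutions_changing(input_name):
    """Return the proposed solutions whose primary effect is on input_name."""
    return [name for name, changed, _ in SOLUTION_MATRIX if changed == input_name]


# Print the matrix as aligned columns.
for name, changed, direction in SOLUTION_MATRIX:
    print(f"{name:<34} {changed:<13} {direction}")
```

Grouping by input, for example `solutions_changing("arrival rate")`, makes it easy to see which proposals act on the same part of the response time formula.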

I'm not implying this is a simple thought process. Personally I find this distillation process very difficult. It forces me to bring together all my Oracle internals, application SQL tuning, operating systems, and queuing theory knowledge and experience while at the same time keeping the above core definitions consistent. It's just plain hard. However, the process purifies my thinking and forces me to really think through just why a particular performance solution will make a difference. And if I can do that, then I'm able to explain the performance situation and solution much better to others... not to mention the confidence I gain in anticipating the impact of my proposed solution. (Or arguing with a vendor about why their recommendation makes absolutely no sense.)

So if you want to get good at this, it will take some practice. (And if you want to, email me your thought process. I will email you back.) It takes an entire day of working with my students in my Advanced Oracle Performance Analysis course before I'm able to open up the discussion of how nearly a dozen performance solutions will affect the service time, arrival rate, or the number of servers. As a teacher, it's one of my life's highlights because I know everyone in the class is about to have at least one "ah-ha moment." Guaranteed!

Putting This Into Practice

Enough talk, let's apply this to reality. I'm going to continue using our commit batch size experiment because we have actual numerical results from when the batch size was increased. The full experimental results can be found in Part 2 of this blog series and the top few can be found in Part 3. I have duplicated just the top few lines below:

In the previous blog (Part 3) I showed a graph plotting the above arrival rate and response time (service time + queue time). Then applying the CPU-based response time formula while varying only the arrival rate, the associated response time curve appeared and ran directly through our datapoint!

Now let's take a common log file sync solution and think through how that specific solution will affect either the service time, arrival rate, or the number of servers. It's paramount to remember the unit of work is a single insert and the unit of time is constant, in this case a single millisecond. Supposing our performance improving solution is to increase the batch size from one to four, let's think about how that will affect the service time, arrival rate, and number of servers.

Service time will decrease because Oracle has become more CPU efficient per insert. It should take less CPU time to insert each row because by batching our inserts into a single commit, Oracle distributes the commit CPU consumption overhead across more inserts. This decreases the CPU overhead, that is, consumption per insert. So we expect the CPU consumed per insert, that is the service time, to decrease.

As an input parameter, the arrival rate will remain the same because I am not pushing/forcing more work through the system. I am not starting more insert processes, asking employees to work faster, or bringing more employees on-line. However, I do expect the system to be able to process more work per time, but I am not forcing this change. Remember, I am looking for input changes, not output results.

The number of servers (M) will not change. We are modeling this situation as a CPU bound system and since we are not altering the number of CPU cores the number of servers will not change.

Now the question becomes, "Just how much will the service time decrease?" This is where your experience, gut feeling, the risk of being wrong and the resulting impact, and testing all come together. I highly recommend you not do this in isolation. If you are having a good day, you're likely to be optimistic. And of course, if you're having a bad day, you'll be pessimistic. So discuss this with at least one other person.

Fortunately I did run a test! In this situation, the test is the experimental results shown in the table above. The only input change is the service time while keeping constant the arrival rate and the number of servers.

The experimental result table above shows that with a batch size of four, the actual service time decreased by 32.95%, down to 0.0613 ms/insert, and the actual response time was 0.1527 ms/insert. However, holding the arrival rate at 10.8445 inserts/ms and M at 1.4235, we anticipated (never use the word predict or forecast using this analysis method... more below about this) the response time would drop to 0.0926 ms/insert. So in this situation it appears we anticipated a significantly greater benefit than actually occurred? Nope, read on...

This is important to understand: The experimental results shown in the table above are based on a real Oracle system where we allowed the arrival rate to increase to a steady state; they are not based on mathematics! In contrast, our calculated response time is based on a frozen arrival rate and a frozen number of servers, and we only changed the service time. Because our response time figure is based on a lower arrival rate (10.8445 vs 16.2470 inserts/ms), we would expect our response time calculation to be lower.
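The frozen-arrival-rate calculation can be reproduced with a few lines of Python. This is a minimal sketch, assuming the CPU-based formula from Part 3, R = S / (1 - (LS/M)^M); the function name `cpu_response_time` is hypothetical, and the inputs are the values quoted in this series. Evaluated this way, the baseline comes out at roughly 0.227 ms/insert and the anticipated response time at roughly 0.093 ms/insert.

```python
def cpu_response_time(service_time, arrival_rate, servers):
    """CPU-based response time: R = S / (1 - (L*S/M)**M)."""
    utilization = arrival_rate * service_time / servers
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable system")
    return service_time / (1 - utilization ** servers)


ARRIVAL_RATE = 10.8445  # inserts/ms, frozen at the batch-size-one value
SERVERS = 1.4235        # M, unchanged by the batching solution

# Only the service time (ms/insert) differs between the two calls.
baseline = cpu_response_time(0.0914, ARRIVAL_RATE, SERVERS)     # one insert per commit
anticipated = cpu_response_time(0.0613, ARRIVAL_RATE, SERVERS)  # four inserts per commit

print(f"baseline:    {baseline:.4f} ms/insert")
print(f"anticipated: {anticipated:.4f} ms/insert")
```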

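Now I am going to substitute the actual experimentally observed arrival rate (16.2470 inserts/ms) into the CPU based response time formula. The inputs are now a service time of 0.0613 ms/insert, an arrival rate of 16.2470 inserts/ms, and an M of 1.4235, and the calculated response time is 0.154 ms/insert.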

Wow! What this means is if we entered the observed increased workload (that is, the arrival rate) into our formula, our calculations would have been within a celebratory 0.7%. This is amazing and should provide some confidence when using this analysis method. Don't ever let someone get away with saying Oracle systems don't behave in a queuing theory-like way! Can I be so bold as to say, they have no freak'n idea what they are talking about!?
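The comparison between the calculated and the observed response time can be checked the same way. The sketch below again assumes the CPU-based formula from Part 3 and uses a hypothetical helper name; the observed values (16.2470 inserts/ms and 0.1527 ms/insert) are the ones quoted in this entry.

```python
def cpu_response_time(service_time, arrival_rate, servers):
    """CPU-based response time: R = S / (1 - (L*S/M)**M)."""
    utilization = arrival_rate * service_time / servers
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable system")
    return service_time / (1 - utilization ** servers)


SERVICE_TIME = 0.0613            # ms/insert with four inserts per commit
SERVERS = 1.4235                 # M, unchanged
OBSERVED_ARRIVAL_RATE = 16.2470  # inserts/ms, measured at steady state
OBSERVED_RESPONSE_TIME = 0.1527  # ms/insert, measured

calculated = cpu_response_time(SERVICE_TIME, OBSERVED_ARRIVAL_RATE, SERVERS)
error_pct = 100 * (calculated - OBSERVED_RESPONSE_TIME) / OBSERVED_RESPONSE_TIME

print(f"calculated: {calculated:.4f} ms/insert")
print(f"difference from observed: {error_pct:.1f}%")
```

The difference depends on how the inputs were rounded, so expect a figure close to, rather than exactly, the one quoted above.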

For those of us who understand better with pictures, let's look at the before and after graphs. The below graph was created using my free Response Time Comparison MS-Excel based spreadsheet.

Our baseline situation is the blue point on the blue response time curve. This represents the system with one insert per commit. Based on our experiment, when we increased the batch size from one to four, we observed the service time decreased (0.0613 ms/insert) and we recalculated the response time (0.0926 ms/insert), which is represented by the red point on the orange curve. Notice that along with the service time decrease the entire response time curve shifted down and to the right, resulting in the orange response time curve. (I will blog about this response time curve shift another time.) We then entered the observed four inserts per commit arrival rate (16.2470 inserts/ms) into our response time calculation and the orange point resulted, showing that at an arrival rate of 16.2470 inserts/ms the response time is 0.154 ms/insert, which is nearly identical to what actually occurred! So in this case our model did a great job at anticipating what was to occur. (This is also why it is important to test our proposed solutions whenever possible. The test provided us with a reliable service time.)
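The shift to the right can be made concrete by asking where each curve goes vertical. In the CPU-based formula the denominator reaches zero when LS/M = 1, that is, when the arrival rate equals M/S. The sketch below computes that limit for both service times; the function name is hypothetical, and the result is a property of the formula rather than a measurement.

```python
def saturation_arrival_rate(service_time, servers):
    """Arrival rate at which utilization L*S/M reaches 1 and the curve goes vertical."""
    return servers / service_time


SERVERS = 1.4235  # M, the same for both curves

before = saturation_arrival_rate(0.0914, SERVERS)  # one insert per commit
after = saturation_arrival_rate(0.0613, SERVERS)   # four inserts per commit

print(f"baseline curve saturates near {before:.2f} inserts/ms")
print(f"batched curve saturates near  {after:.2f} inserts/ms")
```

Under these inputs the observed rate of 16.2470 inserts/ms lies beyond the baseline curve's limit but comfortably inside the new one, which is why that workload only fits on the shifted curve.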

Very Cool Yes, But Caution Please

If you're like many DBAs, this type of analysis is very exciting because it opens up an entirely new method to effectively communicate a complex performance situation, brings about a deeper level of performance understanding, and allows us to anticipate change.

But just how good is this "anticipation?" While the example I used worked nearly perfectly (and was not selected among many examples), it's not always this good and, more importantly, the performance model developed is not robust.

Let me contrast the anticipatory work I just did to a true predictive analysis forecast. What we did was quickly develop a performance model based on a single data point. Yes, the data point is based on a real system, contains actual Oracle activity over an interval of time, and standard performance mathematics was used. That's the good news and why this works fantastic for quick work in an existing firefighting situation. However, there is plenty of bad news! To name just a few weaknesses: Our model is based on a single data point, performance can dramatically fluctuate during intense times, and what scares me the most is, our analysis contains absolutely no statistical work. This means we have absolutely no mathematics behind the "plus and minus" of our "forecast."

Does this mean we throw out all the work we have just done? Absolutely not! It's much, much better than using our gut, our experience, group think, marketing hype, vendor promises, or the latest performance tuning fad. At a minimum it will force you to think through why a performance solution should work, why for some strange reason you want to aggressively implement a particular solution, allows you to communicate more effectively, and provides a reference point for your work.

My point is, you must develop a proper and robust predictive model to answer questions like, "What will happen if we insert this new SQL statement?" Or, "Will the system be able to handle the load in nine months?" Or, "Should we purchase these new CPUs?" These questions are based on you not being in a firefighting situation. This means you have time to gather lots of data and perform a solid predictive performance analysis.

Concluding Thoughts

Over the past few blog entries we have explored why increasing the commit batch size can make a dramatic performance impact when presented with the log file sync wait event. If you have followed this entire series, you know we've covered a lot of ground:

1. Explored what the log file sync wait event means.
2. Presented one of the most common solutions, that is, increasing commit batch size.
3. Modeled the system in a traditional performance mathematics way, that is, response time.
4. Created a graph to visually demonstrate the situation.
5. Discussed how and why we expect our batching solution to impact the key performance variables.
6. Altered the service time to reflect increasing the batch size from one to four.
7. Anticipated the response time change both numerically and graphically.
8. Demonstrated that our model anticipated the response time nearly perfectly!
9. Discussed why this analysis is only appropriate for performance firefighting work.

This brings me to the end of this blog series. Now it's time to go back and finish the parallelism blog series. Plus I've been working on ways to visualize Oracle internal structures... so there is some good stuff to be posted!

Thanks for reading!

Craig.

P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at craig@orapub.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email.

Monday, June 14, 2010

This was totally unexpected! But after talking with a South African company about sending a few folks to the US for my upcoming October training in Philadelphia, we decided instead to bring me to South Africa this September. I was able to fit the training in just after a personal holiday and before OracleWorld in San Francisco, so it's going to happen! I've never been to South Africa, so this will be quite a treat for me and I'm really looking forward to it.

Wednesday, June 9, 2010

If you have encountered the Oracle Database wait event log file sync before, you know it can be very difficult to resolve. And even when a solution technically makes sense, it can be very challenging to anticipate the performance improvement. In the first two parts of this blog series I explored the actual wait event, its common cause (rapid commits), and a very special experiment. The experiment demonstrated that by increasing the inserts per commit (batching the commits) we saw quicker inserts that enabled Oracle to process more inserts per second. We also explored the core response time performance components (service time, queue time, and arrival rate) and how they related to log file sync and our experiment. Wow... that's a lot of stuff!

In this entry I want to show you how you can visually plot the situation. This is very important for the performance analyst. First, it allows us to better communicate the situation when performance is horrible, what I call a firefighting situation (hence, the title of my associated book and training course). Second, it sets us up to understand and anticipate the impact of our proposed performance improving ideas. In this blog entry I'm going to focus on plotting the situation and in the next entry I'll introduce how to anticipate the performance change. Let's get started!

First, we need to enumerate the performance situation into the standard performance metrics of arrival rate, service time, and queue time. We have already done this (details in the previous blog entry) and the table showing some of the experimental results is included below (the complete table was shown in the previous blog entry, Part 2).

Let's plot the situation when a commit occurred immediately after each insert, that is, the batch size is only one. In this situation, the arrival rate is 10.8445 inserts/ms, the service time is 0.0914 ms/insert, and the queue time is 0.1356 ms/insert. Therefore, the response time is 0.2270 ms/insert (response time = service time + queue time).

(Now if you are used to standard response time mathematics your first question will be, "Why are we using the unit of work, insert?" The answer is because in this specific situation, inserts are highly related to the limiting performance factor, which is the inserts per commit, which manifests as a log file sync. I cover this extensively in my Advanced Oracle Performance Analysis course and it's also introduced in the last chapter of my Oracle Performance Firefighting book.)

Creating a plot with arrival rate as the horizontal axis and the response time as the vertical axis, we get an extremely boring figure like this:

I know, it's not that impressive and as they say, "It's nut'n to write home about." But it does represent the situation.... and the really cool aspect of this is, this point does reside on a response time curve. In other words, there is a response time curve that passes directly through this point. This response time curve will allow us to get a visual grasp of the situation and, as I'll present in the next blog entry, how we come up with solutions to reduce the response time per insert.

The trick is, if we can come up with a function representing the response time curve, then we can simply alter the arrival rate and let the function crank out the respective response times! So let's do it!

To determine the response time curve function, I have to introduce the basic response time formulas. Here they are along with the variable definitions:

R(io) = S / (1-(LS/M))

R(cpu) = S / ( 1 - (LS/M)^M )

Q = R - S

where:

R is response time
S is service time
L is the arrival rate
M is the number of effective CPU cores or IO devices
Q is the queue time (normally Q is the queue length, but I didn't want to add a subscript)

Yes, there are two response time formulas; one for an IO focused system and the other for a CPU focused system. In this case I am using the CPU subsystem formula because during the experiment the OS was CPU bound, not IO bound. (There is obviously much more I could say here... but I'm not.)
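For readers who prefer code to notation, the three formulas above translate directly into Python. This is a minimal sketch with hypothetical function names; the inputs are the batch-size-one values from this entry, and M = 1.4235 is the value derived later in this post.

```python
def io_response_time(service_time, arrival_rate, servers):
    """IO-focused formula: R = S / (1 - L*S/M)."""
    utilization = arrival_rate * service_time / servers
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable system")
    return service_time / (1 - utilization)


def cpu_response_time(service_time, arrival_rate, servers):
    """CPU-focused formula: R = S / (1 - (L*S/M)**M)."""
    utilization = arrival_rate * service_time / servers
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable system")
    return service_time / (1 - utilization ** servers)


def queue_time(response_time, service_time):
    """Q = R - S."""
    return response_time - service_time


S, L, M = 0.0914, 10.8445, 1.4235  # ms/insert, inserts/ms, effective servers

r_cpu = cpu_response_time(S, L, M)
r_io = io_response_time(S, L, M)
q_cpu = queue_time(r_cpu, S)

print(f"R(cpu) = {r_cpu:.4f} ms/insert, Q = {q_cpu:.4f} ms/insert")
print(f"R(io)  = {r_io:.4f} ms/insert")
```

With the same inputs the IO formula gives a higher response time than the CPU formula, so choosing the formula that matches the bottleneck matters.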

Most math people think the problem is pretty much a done deal at this point. Wrong! For IO focused systems you can solve for M, but for CPU focused systems, it's not going to happen. Even WolframAlpha.com won't solve M into a simple equation (take a look). For us this means M needs to be somehow solved iteratively. There are a number of ways to do this. One way is to download my Response Time Comparison spreadsheet and repeatedly try various M values, converging to where the derived response time is close to the experimentally observed response time. But probably the simplest method is to use a cool little web application I created, called msolve. The input form is shown below.

After a few seconds, the application returns with a solution for M. This is shown below.

It turns out that with an M of 1.4235 the derived response time closely matches our observed response time of 0.2270 ms/insert. So we have found a good M value.
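The iterative search for M can also be sketched in a few lines of Python. To be clear, this is not how the msolve application works internally (the source does not say); it is a plain bisection search, chosen here because the CPU-based response time falls steadily as M grows beyond LS. The function names are hypothetical.

```python
def cpu_response_time(service_time, arrival_rate, servers):
    """CPU-based response time: R = S / (1 - (L*S/M)**M)."""
    utilization = arrival_rate * service_time / servers
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable system")
    return service_time / (1 - utilization ** servers)


def solve_servers(service_time, arrival_rate, target_response, tolerance=1e-9):
    """Find M by bisection so the derived response time matches the target."""
    if target_response <= service_time:
        raise ValueError("target response time must exceed the service time")
    offered_load = arrival_rate * service_time
    low = offered_load                   # response time is unbounded at M = L*S
    high = max(2.0 * offered_load, 1.0)
    # Widen the bracket until the derived response time drops below the target.
    while cpu_response_time(service_time, arrival_rate, high) > target_response:
        high *= 2.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if cpu_response_time(service_time, arrival_rate, mid) > target_response:
            low = mid    # too few servers: derived response time is still too high
        else:
            high = mid
    return (low + high) / 2.0


solved_m = solve_servers(service_time=0.0914, arrival_rate=10.8445,
                         target_response=0.2270)
print(f"M = {solved_m:.4f}")
```

Because the inputs are rounded to four decimal places, the solved value should agree with 1.4235 only to about that precision.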

If you look closely at the bottom of the above screen shot, you'll notice a link to see the graph. If you click the link, you'll be sent to WolframAlpha.com and the graph will appear before your eyes! For this case however, I wanted the graph to look a very specific way, so I created it in MS-Excel.

To create the graph in MS-Excel, keep M and S constant and vary the arrival rate (L). And there you have it, the following response time curve will appear! (I used my Response Time Comparison spreadsheet to help me create the graph.)
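The same curve data can be generated outside a spreadsheet. The sketch below holds M and S constant and varies L, as described above; the function names and the particular arrival rates sampled are illustrative choices, not taken from the Response Time Comparison spreadsheet.

```python
def cpu_response_time(service_time, arrival_rate, servers):
    """CPU-based response time: R = S / (1 - (L*S/M)**M)."""
    utilization = arrival_rate * service_time / servers
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable system")
    return service_time / (1 - utilization ** servers)


def response_curve(service_time, servers, arrival_rates):
    """Return (arrival rate, response time) pairs with S and M held constant."""
    return [(rate, cpu_response_time(service_time, rate, servers))
            for rate in arrival_rates]


SERVICE_TIME = 0.0914  # ms/insert
SERVERS = 1.4235       # M

# Sample points below the saturation rate M/S; 10.8445 is the observed datapoint.
rates = [2.0, 4.0, 6.0, 8.0, 10.0, 10.8445, 12.0, 13.0, 14.0, 15.0]
curve = response_curve(SERVICE_TIME, SERVERS, rates)

for rate, response in curve:
    print(f"{rate:8.4f} inserts/ms -> {response:.4f} ms/insert")
```

Plotting these pairs with the arrival rate on the horizontal axis gives a curve that passes through the observed point and turns sharply upward as L approaches M/S.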

Now you have to admit that's pretty cool! We can now use this curve to visually show others the performance situation. But even more important, we can use the graph as the baseline (the current performance situation) to visually demonstrate how our proposed performance enhancing solutions will alter the situation to reduce the response time! In the next blog entry, I'll show you how increasing the inserts per commit will alter the response time graph so we can anticipate the impact of our solutions!

What we have done in this blog entry is to take a real Oracle system and represent it both numerically and visually. The advantage is it will help both ourselves and others better relate to the performance situation. But more importantly, it sets the stage for showing just how our proposed performance solutions will alter the situation to get us out of the elbow of the curve. This will make selling your solutions more persuasive, compelling, and truly robust.

If you're, as they say, chomping at the bit to do this kind of thing, here are some resources to help you get started:

Wednesday, June 2, 2010

If you have encountered the Oracle Database wait event log file sync before, you know it can be very difficult to resolve. And even when a solution technically makes sense it can be very challenging to explain to others and to anticipate the impact of your proposed solution. In part one of this series I introduced the log file sync wait event and presented an experiment that demonstrated the insert rate increases as the number of inserts per commit increases. I call the number of inserts per commit, the batch size. In this blog entry I'll be exploring the components of time during each insert. So let's get started!

Breaking Down Time

As the experiment in Part 1 demonstrated, increasing the number of inserts per commit allows the system to absorb inserts at a faster rate. But to really grasp why a performance change improves the situation, we need to understand how the components of time changed. A fantastic way to begin this journey is to break down the time situation into three core elements: the arrival rate, the service time, and the queue time.

If you are new to this type of analysis, pay close attention because this is the aroma of how we can numerically anticipate the impact of our proposed change, which in this case is increasing the inserts per commit.

So here we go... The arrival rate is some unit of work over some meaningful unit of time, such as inserts/ms. You can think of the arrival rate as the workload, such as transactions per second, user calls per second, inserts per second, or logical IOs per second. The service time is some unit of time over some unit of work, and for Oracle performance analysis, it works wonderfully when the service time is CPU consumed over some unit of work, such as ms/insert. The queue time can be represented by the non-idle wait time per unit of work, such as ms/insert. When we sum the CPU time per insert and the non-idle wait time per insert, we have what is called the response time, in terms of time per insert (e.g., ms/insert).
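As a rough illustration of these definitions, the sketch below derives the components from interval totals. The totals used (inserts, elapsed time, CPU time, non-idle wait time) are made-up round numbers chosen for easy arithmetic, not the experimental data, and the function name is hypothetical.

```python
def response_time_components(work_units, elapsed_ms, cpu_ms, wait_ms):
    """Derive arrival rate, service time, queue time, and response time."""
    arrival_rate = work_units / elapsed_ms  # e.g. inserts/ms
    service_time = cpu_ms / work_units      # CPU ms per insert
    queue_time = wait_ms / work_units       # non-idle wait ms per insert
    return {
        "arrival_rate": arrival_rate,
        "service_time": service_time,
        "queue_time": queue_time,
        "response_time": service_time + queue_time,
    }


# Hypothetical interval totals, for illustration only.
components = response_time_components(
    work_units=1_000_000,  # inserts completed in the interval
    elapsed_ms=100_000,    # wall-clock length of the interval
    cpu_ms=90_000,         # total CPU consumed
    wait_ms=140_000,       # total non-idle wait time
)

for name, value in components.items():
    print(f"{name:>13}: {value:.4f}")
```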

Our goal as a performance analyst is to reduce the response time, not only the service time or only the non-idle wait time (that is, the queue time). The response time is directly related to how long a batch job runs, what the system is experiencing, and many times what the user is experiencing. Another added benefit of enumerating these characteristics is we can plot the situation graphically. I'll save the plotting for the next blog entry in this series.

So let's look at the actual experimental results in terms of the arrival rate, the service time, and the queue time. Here they are below.

Looking at the above numbers, you'll notice that by simply increasing the batch size from 1 to 4 the arrival rate (measured in inserts/ms) increases by 50%. This means the system was able to absorb 50% more work simply because the inserts per commit was increased from 1 to 4. Second, the associated CPU consumption (i.e., service time), non-idle wait time (i.e., queue time), and the response time per insert were each reduced by roughly 30%. By increasing the inserts per commit from 1 to 32, the work absorbed increased by 84% while at the same time the response time decreased by 65%. These are significant numbers and are why we witnessed such a dramatic drop in elapsed time when inserting 900000 rows.
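The batch size 1 to 4 percentages can be recomputed from the figures quoted elsewhere in this series (arrival rate 10.8445 to 16.2470 inserts/ms, service time 0.0914 to 0.0613 ms/insert, response time 0.2270 to 0.1527 ms/insert). The sketch below does only that arithmetic; the helper name `percent_change` is hypothetical.

```python
def percent_change(before, after):
    """Signed percentage change from before to after."""
    return 100.0 * (after - before) / before


# Batch size 1 -> batch size 4, values as quoted in this series.
arrival_change = percent_change(10.8445, 16.2470)  # inserts/ms
service_change = percent_change(0.0914, 0.0613)    # ms/insert
response_change = percent_change(0.2270, 0.1527)   # ms/insert

print(f"arrival rate:  {arrival_change:+.1f}%")
print(f"service time:  {service_change:+.1f}%")
print(f"response time: {response_change:+.1f}%")
```

The results land close to the rounded 50% and 30% figures used in the text.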

But more than just big time changes, the breakdown of time in a response time-like format allows us to further understand what occurred and then explain that to others. So why did the arrival rate, service time, and queue time numbers change? If you have a good understanding of Oracle internals you'll probably be able to figure this out.

The service time decreased because, in part, Oracle was able to better batch redo entries achieving increased efficiency and thereby reduce the commit overhead per insert. This reduces the CPU consumption per insert, so the service time decreased. The queue time decreased because, with fewer commits per second the total commit time per insert was less which means there is less log file sync time. And let's not forget that there is also less physical IO being written by the log writer as well. Having less log file sync time means there is a decreased likelihood that log file sync will be the top wait event. The really amazing part is with the reduction in both service time and queue time, the system was able to process more inserts per second.

Caution: By the way, if you're used to plotting response time graphs, you'll be tempted to plot the above table with the arrival rate on the horizontal axis and the response time on the vertical axis. But be careful, because the arrival rate is changing because of the batch size increase, not because we are pushing more work into the system. If you want to plot the response time graph based on experimental data, you'll need to set the batch size and increase the workload while sampling the response time components.

So in this entry we broke down insert time into CPU time (service time), non-idle wait time (queue time), and the associated workload (arrival rate). Plus we looked into why increasing the batch size alters each component.

In the next blog entry of this series, I'll introduce how to plot a graph representing the situation, which will set us up for anticipating the performance change before actually altering the batch size.