Thursday, February 24, 2011

This page has been permanently moved. Please
CLICK HERE to be redirected.

Thanks, Craig.

If you have been following my recent posts and especially the post about SQL statement elapsed times, you'll know that unfortunately SQL statement elapsed times do not conform to a normal, poisson, exponential, or even a log normal distribution. That's unfortunate because if a statement does indeed conform, even with limited data (which we can easily obtain), we can make some pretty amazing predictions. But still, I was able to come to some useful conclusions. In this posting I want to investigate the pattern of SQL statement arrivals.

Why Should We Care?

Note: If the below terms are not familiar to you, I will clearly define them starting in the next section.

Arrival rates and inter-arrival times are a massively important topic in computing system performance analysis. In the capacity planning and predictive analysis industry it is a given that the pattern of inter-arrival times (time/work: ms/trx) is exponentially distributed and that the number of arrivals per unit of time (work/time) is poisson distributed. [1, 2, 3] This is so ingrained into people's minds that if I went to a non-Oracle computing performance conference and suggested otherwise, I would immediately be questioned, demeaned, and have pencils thrown at me (or something like that). It wouldn't be pretty.

But to accept claims without testing them is unscientific, is it not? And what is wrong with challenging assumptions about Newton's three laws of motion, about whether we live in a closed system, and even, perhaps, about arrival patterns? I got flack from the "experts" when I first started speaking about Oracle's wait interface and when to use an index. And even now I still get some flack regarding how to apply classic response time analysis to Oracle time-based analysis. And being called dangerous is actually pretty cool, especially when you look at who's saying it!

But in all seriousness, my objective in this posting is simply these three things:

1. To test, and to enable others to test, whether SQL statement arrival patterns are exponential.
2. If they are not, to determine whether they conform to some known statistical distribution.
3. To learn something that will help me in my daily performance analysis work.

That's my objective, pure and simple.

The Plan

To ease you into this topic, I'm going to start with defining a few key terms with words and pictures. Then I'll summarize the experimental design, how I collected the data, an analysis of three data sets, and then I'll draw some final conclusions. I will also provide the links so you can perform the same tests I have. I hope you enjoy the journey!

Transaction Arrivals

Average Arrival Rate

Let's say you went to your local shopping mall, sat down in a nice cushy chair, and counted the number of people who walked into a specific store over a one minute period. Perhaps 21 people went into the store. I don't mean to insult your intelligence here, but this means 21 people arrived at the store over a one minute period. So the arrival rate is 21/1 people/minute, or 21/60 people/second, or 0.350 people/second. This is known as an average arrival rate.

An arrival rate needs a unit of occurrence or work and also a unit of time. Examples of occurrence or work are a transaction, a logical IO, a physical IO read, a physical IO read request, an execution of SQL statement ABC, or an execution of procedure customer_add. Examples of time are hours, seconds, etc. While this may seem trivial, when referring to the arrival rate it is important to always keep the occurrence as the numerator and the time as the denominator. One reason is that the inter-arrival time is naturally given in time per occurrence, and most of this blog refers specifically to the inter-arrival time. So always listen closely and use the units of occurrence and time carefully.
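The unit bookkeeping above can be captured in a couple of helper functions (the function names are mine, for illustration), using the shopping mall numbers:

```python
# Convert an observed count into an average arrival rate (occurrences over
# time) and the corresponding average inter-arrival time (the reciprocal:
# time over occurrences). Numbers come from the shopping mall example.

def arrival_rate_per_sec(occurrences, interval_sec):
    """Average arrival rate: unit of occurrence (work) in the numerator."""
    return occurrences / interval_sec

def avg_inter_arrival_sec(occurrences, interval_sec):
    """Average inter-arrival time: unit of time in the numerator."""
    return interval_sec / occurrences

rate = arrival_rate_per_sec(21, 60)   # 21 people in one minute
ia = avg_inter_arrival_sec(21, 60)

print(round(rate, 3))  # 0.35 people/second
print(round(ia, 3))    # 2.857 seconds/person
```

Notice the two quantities are simply reciprocals; keeping the numerator straight is the whole game.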

Now that we can calculate the average arrival rate, let's take this topic to the next level...inter-arrival times.

Inter-Arrival Times

This topic really starts with inter-arrival times. The inter-arrival time is simply the time between each arrival. For example, taking a slight yet significant twist from the above shopping mall example, suppose I wrote down the time of each arrival. Not the number of people that arrived over an interval of time, but the time of each arrival. It's then a simple task to determine the time between each arrival, that is, the inter-arrival time. Let's say the shopping mall inter-arrival time (sec/person) data looked like this:

If I take this data set and paste it into WolframAlpha (did it here), one of the results is this histogram.

The industry expects inter-arrival times for computing system transactions to be exponential, but how about the data above? If you have followed my blog entry on statistical distributions, you'll know that this could be exponential, poisson, or log normal. Even if we had more samples, by creatively altering the bin sizes I could make the histogram look just about any way I wanted. What we need is a statistical hypothesis test. I'm not going to do that here, but I will with the real experimental data below.

Suppose I sat like a couch potato at the shopping mall until I ended up with 100 samples, not just 18 as above. The resulting histogram looks like this:

The two histograms look physically different because the histogram just above was created using Mathematica... WolframAlpha doesn't like me pasting 100 samples into it. Three points: the average is 19.7, the median is 14.4, and while the distribution looks log normal, it is exponential because I created it that way (sorry...kind of sneaky, I know).
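That sneaky exponential mean/median relationship is easy to reproduce. Below is a quick Python sketch; the theoretical median of an exponential distribution is mean times ln 2, about 0.69 of the mean, which lines up with the 19.7 average and 14.4 median above:

```python
import math
import random
import statistics

random.seed(42)
mean_ia = 19.7   # match the example's average inter-arrival time (seconds)

# Draw 100 exponentially distributed inter-arrival times, as in the
# 100-sample shopping mall histogram above.
samples = [random.expovariate(1.0 / mean_ia) for _ in range(100)]

print(round(statistics.mean(samples), 1))    # near 19.7, give or take sampling noise
print(round(statistics.median(samples), 1))  # tends to sit well below the mean

# For an exponential distribution the theoretical median is mean * ln(2),
# which is why the example's median (14.4) sat below its mean (19.7).
print(round(mean_ia * math.log(2), 1))       # 13.7
```

So for exponential data, a median noticeably below the mean is expected, not an anomaly.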

By industry definition, transaction inter-arrivals should be exponentially distributed. [1, 2, 3] There is also another accepted assumption regarding inter-arrivals: they are expected to be independent of each other. Referring to my shopping mall example: if I'm at the mall with my wife and she walks into Lucy's, I will likely follow! My arrival was dependent on her arrival, so our arrivals are clearly not independent. Families walking into restaurants are another dependent example.

With the knowledge of inter-arrivals, dependent vs independent arrivals, and the accepted industry assumption, let's now look at the pattern and the average of inter-arrivals.

Inter-Arrival Average and Pattern

Let's explore how the arrivals visually appear on a time line. This is a fantastic way to grasp the fact that it is entirely possible for average inter-arrival times to be the same while the pattern of arrivals is different. The pattern of inter-arrivals above was exponential, but what do that pattern and other patterns look like on a time line?

If you need a quick statistical distribution refresher, I blogged about this topic here.

The image above shows four different inter-arrival patterns, each with an average inter-arrival time of 5 ms/trx. The top time-line is a constant inter-arrival rate (a perfect uniform distribution), that is, every 5 ms another transaction arrives...if only we were so lucky! The second from the top shows transactions arriving in a normally distributed pattern, with an average inter-arrival time of 5 ms along with some variance; in fact, I have defined the standard deviation to be 2. The third from the top shows transactions arriving in a log normally distributed pattern, with the average inter-arrival time pulled from a normal distribution with an average of 5 ms and a standard deviation of 2. (Confused? Read this blog post.) The fourth from the top, that is, the bottom time-line, shows transactions arriving in an exponentially distributed pattern with an average inter-arrival time of 5 ms. As you can see, there is more to arrivals than just the average...the pattern is also very important.
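The four time-lines can be regenerated as inter-arrival sequences. This is a hedged sketch: the post does not show its exact generation code, and the log normal parameters below are my own choice (picked so the median is about 5 ms):

```python
import random

random.seed(1)
N, MEAN, SD = 10, 5.0, 2.0  # ms; the 5 ms average inter-arrival from the figure

patterns = {
    "constant":    [MEAN] * N,
    "normal":      [max(0.0, random.gauss(MEAN, SD)) for _ in range(N)],
    # Assumption: lognormvariate with mu = ln(5) so the median is ~5 ms;
    # the post's exact log normal construction is not shown.
    "log normal":  [random.lognormvariate(1.609, 0.4) for _ in range(N)],
    "exponential": [random.expovariate(1.0 / MEAN) for _ in range(N)],
}

# Turn each inter-arrival sequence into arrival instants on a time line.
for name, gaps in patterns.items():
    t, timeline = 0.0, []
    for g in gaps:
        t += g
        timeline.append(round(t, 1))
    print(f"{name:12s} {timeline}")
```

Printing the four timelines side by side makes the "same average, different pattern" point visible even without a plot: the constant line ticks like a metronome while the exponential line bunches and gaps.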

So important, in fact, that we can feel the difference! If you worked the counter at a fast food restaurant, which arrival pattern would you prefer? If you chose exponential (the bottom line), you're nuts, because sometimes you would be sitting around doing nothing while at other times there would be people queued up and glaring at you! Most people (including our users) desire smoothness and predictability, and so a constant (that is, uniform) inter-arrival rate is what we like. When we analyze the experimental data, we'll get a picture of real Oracle SQL statement inter-arrival times!
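To put a number on that feeling, here is a tiny single-server simulation (a sketch; the 4 ms fixed service time is my assumption, not from the post) comparing constant and exponential arrivals at the same 5 ms average inter-arrival time:

```python
import random

def avg_wait(inter_arrivals, service_ms=4.0):
    """Single FIFO server: return the mean time arrivals spend waiting
    in line before their service begins."""
    t = 0.0        # arrival clock
    free_at = 0.0  # when the server next becomes idle
    waits = []
    for gap in inter_arrivals:
        t += gap
        waits.append(max(0.0, free_at - t))
        free_at = max(free_at, t) + service_ms
    return sum(waits) / len(waits)

random.seed(7)
N, MEAN = 50_000, 5.0  # ms; same 5 ms average inter-arrival as above
constant_gaps = [MEAN] * N
exponential_gaps = [random.expovariate(1.0 / MEAN) for _ in range(N)]

print(round(avg_wait(constant_gaps), 2))     # 0.0 -- no one ever queues
print(round(avg_wait(exponential_gaps), 2))  # several ms of queueing delay
```

Same arrival rate, same amount of work, yet the exponential pattern produces real queueing while the constant pattern produces none, which is exactly why the fast food clerk prefers the metronome.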

OK, I'm done with the background information. Now it's time to delve into what is really happening in production (not the lab) Oracle systems.

Experimental Design

To demonstrate the pattern of SQL statement inter-arrivals I needed to create an experiment. I also wanted it to be easily performed by others in their production environments. I created a data collection script that will gather the inter-arrival times for a specific SQL statement (and specific plan) and record the results in an Oracle table for easy manipulation and retrieval. I also created a Mathematica based notepad to perform the statistical analysis. The analysis is essentially a hypothesis test to determine if the collected inter-arrival times conform to a standard statistical distribution; normal, exponential, poisson, or log normal. Before we analyze the results, I need to describe how the data is collected.

Data Collection

How the data is collected is key to this experiment. If I botched anything in this experiment, it would be the data collection. It's tricky to get good inter-arrival times. To simplify the situation, the collection works better under certain circumstances. If the same SQL statement is not waiting to be run by multiple sessions, it appears I can adequately determine when a SQL statement arrives by looking at its execution start time, which will also be close to when v$sqlstats inserts the first row for the statement or updates (i.e., refreshes) its existing row.

As I demonstrated in my When Does V$SQLSTATS Get Refreshed posting, for SQL statements the execution column in v$sqlstats is incremented when the statement begins (worst case when parsing ends). Therefore, when detecting an execution count change, we know the SQL statement has begun and therefore arrived. The arrival time is logged in the op_results_raw table.

Click here and you can view a text file that introduces the experiment, shows the actual collection and extraction code, and step-by-step how to perform the experiment yourself. I also include some sample data that was taken from one of my test systems.

The collection procedure samples from v$sqlstats in a tight loop (you can insert a delay however) and when the execute count changes, the time is recorded in the op_results_raw table. The collection procedure does not query from the v$sqlstats underlying x$kkssqlstat fixed table because this may limit your ability (think: security issues) to collect the data. However, if you wanted, all you would need to do to use x$kkssqlstat is simply substitute the object name in the collection script and of course, connect as sys when you collect the data.
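The collection logic boils down to a change-detection polling loop. Here is a minimal Python sketch of that logic; the Oracle query is stubbed out with a fake counter, and all names here are mine for illustration (the real collector is the PL/SQL in the downloadable text file):

```python
import random
import time

def fetch_execution_count():
    """Stand-in for the real query of v$sqlstats for one
    sql_id/plan_hash_value pair; here it just pretends an execution
    lands on roughly 30% of the polls."""
    fetch_execution_count.count += random.random() < 0.3
    return fetch_execution_count.count
fetch_execution_count.count = 0

def collect_arrivals(duration_s=0.2, delay_s=0.0):
    """Tight polling loop: each time the execution count changes, the
    statement has begun (arrived), so log the wall-clock time. In the
    real collector this row goes into op_results_raw."""
    arrivals = []
    last = fetch_execution_count()
    deadline = time.time() + duration_s
    while time.time() < deadline:
        now = fetch_execution_count()
        if now != last:
            arrivals.append(time.time())
            last = now
        if delay_s:
            time.sleep(delay_s)  # a nonzero delay_secs_in spares a CPU core
    return arrivals

arrivals = collect_arrivals()
inter_arrivals = [b - a for a, b in zip(arrivals, arrivals[1:])]
```

The inter-arrival times are just the successive differences of the logged arrival times, which is exactly what the analysis below consumes.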

delay_secs_in is the number of seconds to sleep between samples. Setting this to zero will give you the best data. However, without a delay the sampling script will likely consume 100% of one of your CPU cores. So be very careful! Changing the delay parameter from 0 to 1 can make a big difference in the monitoring procedure's CPU consumption.

sql_id_in is used, in part, to uniquely identify the SQL you are interested in understanding its arrival pattern.

plan_hash_value_in is used, in part, to uniquely identify the SQL you are interested in.

Let's assume I want to sample without any delay for 60 seconds, looking only at the SQL statement with a sql_id of acz1t53gkwa12 and a plan_hash_value of 4269646525. Here is one way to set up the procedure and then run it:

To reduce the overhead of inserting the experimental data into an Oracle table, I utilize Oracle's commit write facility. I reference the commit_write setting on page 302 in the fourth printing of my book, Oracle Performance Firefighting, and I also discuss this in my performance firefighting course. This is a perfect use for the facility.

After the 60 seconds I should have some rows inserted into the op_results_raw table. I show a number of short SQL statements in the experimental text file (again, click here to download) which, to not bore you to death, I do not show here. But what I'm really interested in is the list of inter-arrival times. For example, from one of my sample runs on an experimental system, here are the first 19 inter-arrival sample times (in seconds).

With only a few values, I can call on my good friend WolframAlpha to quickly and easily create a histogram. All I need to do is remove the ending comma, enclose the list of values in curly braces, go to www.wolframalpha.com, copy and paste in the list, and submit the request. In a couple of seconds, Mr. WolframAlpha will present me with, among other things, the histogram shown below.

What type of distribution does this look like; constant, normal, random, exponential, or some other?
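That massaging of the list can be scripted; here is a tiny helper (the sample values are invented for illustration):

```python
# Turn a spooled column of inter-arrival samples into the {v1, v2, ...}
# form WolframAlpha accepts: drop the trailing comma, wrap in curly braces.
raw = "0.135, 0.26, 1.82, 0.59,"  # e.g. pasted from a query spool
values = [v.strip() for v in raw.split(",") if v.strip()]
wolfram_input = "{" + ", ".join(values) + "}"
print(wolfram_input)  # {0.135, 0.26, 1.82, 0.59}
```

Paste the printed string into www.wolframalpha.com and it will hand back the histogram along with the summary statistics.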

So that's how the experimental data was gathered and I'm hoping you'll be motivated to do the same. Let's move on to two actual production collections and then make some final conclusions.

Analysis of Sample Set One

Data Set: DaveB-OLTP-1

Note: You can download this full analysis in PDF format here. It is the Mathematica notepad (printed to PDF) used to analyze the data, including the statistical hypothesis testing and plenty of graphs. I also liberally commented the notepad so both you and I can follow along.

The SQL statement comes from a production OLTP-intense Oracle system. Over the 60 second sample period, the SQL statement was executed 321 times, hence we have 321 samples. Numerically, the average inter-arrival time was 186.8 ms, the median 59.1 ms, and the standard deviation 1384.8 ms, with a minimum value of 0.135 ms and a maximum value of 19182.7 ms. Talk about variance! Notice the median is less than half of the mean. If I was to randomly select one of the sample inter-arrival times, I'm likely to pick a value around the median. Said another way, the inter-arrival time is more likely to be around 59 ms than 187 ms.

While clearly not visually exciting, above is the histogram for the entire data set. (The horizontal axis unit of time is microseconds.) It clearly shows our data is massively dispersed. Why? Two simple reasons: first, the only reason the horizontal axis goes out so far to the right is that there are actual sample values out there...just not that many of them! Second, the maximum sample value would be placed near the far right of the horizontal axis, around 190,000 microseconds. These far-right samples are not anomalies but actual sample values, and you'll see that they appear in all three data sample sets...so I'm not going to ignore them. They force me to understand that while the relatively massive inter-arrival times are indeed rare, they are so massive they effectively pull the mean (187 ms) away from and to the right of the median (59 ms). However, I am also interested in the bulk of the data, so I created another histogram.

Because of the massive dispersion of the data (the difference between the min and max values), showing all the data limits our ability to see where and how most of the sample values group. The above histogram shows 95% of the data. It's the top 5% (and perhaps less) that contain the massive inter-arrival times. Because there are 321 samples and the above histogram contains the 95% smallest values, it contains 305 samples (perhaps 304...I didn't count).

Does this inter-arrival sample set conform to the normal, poisson, exponential, or log normal distribution? Visually it sure doesn't look like it! And our hunch is correct. The statistical fitness (hypothesis) test clearly showed the difference between the sample set and each of the listed distributions could not be explained by randomness...so statistically we must assume they are different. That's a lot of words to simply say: the data does not match any of the tested distributions.
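A goodness-of-fit check of this flavor can be sketched with a hand-rolled Kolmogorov-Smirnov statistic. This is a simplification of what the Mathematica notepad does: estimating the mean from the data technically calls for Lilliefors-adjusted critical values, and poisson, being discrete, needs a chi-square style test instead. Synthetic data stands in for the real samples:

```python
import math
import random

def ks_stat_exponential(samples):
    """Kolmogorov-Smirnov distance between the sample CDF and an
    exponential CDF whose mean is estimated from the data."""
    xs = sorted(samples)
    n = len(xs)
    mean = sum(xs) / n
    d = 0.0
    for i, x in enumerate(xs):
        f = 1.0 - math.exp(-x / mean)  # fitted exponential CDF at x
        d = max(d, f - i / n, (i + 1) / n - f)
    return d

random.seed(3)
expo = [random.expovariate(1.0) for _ in range(300)]          # truly exponential
logn = [random.lognormvariate(0.0, 1.5) for _ in range(300)]  # log normal

crit = 1.36 / math.sqrt(300)  # rough large-sample 5% critical value
print(round(ks_stat_exponential(expo), 3), "vs crit", round(crit, 3))
print(round(ks_stat_exponential(logn), 3), "vs crit", round(crit, 3))
```

When the statistic exceeds the critical value, randomness cannot plausibly account for the gap between the sample and the candidate distribution, which is exactly the "failed the test" language used above.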

Analysis of Sample Set Two

Data Set: DaveB-3

The SQL statement comes from a production OLTP-intense Oracle system. Over the 60 second sample period, the SQL statement was executed only 81 times, hence we have only 81 samples. Numerically, the average inter-arrival time was 705.0 ms, the median 149.0 ms, and the standard deviation 1172.1 ms, with a minimum value of 3.67 ms and a maximum value of 5649.2 ms. Again, massive variance! Notice that once again the median is less than half of the mean. The inter-arrival time is more likely to be around 149 ms than 705 ms.

Because the data is so massively dispersed, just as with the previous sample set, I only show the lower 95% of the sample data. This means, the histogram contains 77 samples, not the full 81. This allows us to focus in on the bulk of the data and the ever-interesting (perhaps important) far left histogram bars.

Does this inter-arrival sample set conform to the normal, poisson, exponential, or log normal distribution? Visually it looks like perhaps we found either an exponential or log normal match! Sorry...our hunch is incorrect. The statistical fitness (hypothesis) test clearly showed the differences between the sample set and each of the listed distributions could not be explained by just randomness...so statistically we must assume they are different. That's a lot of words to simply say: the data does not match any of the tested distributions. Bummer. Oh well...on to the third sample set.

Analysis of Sample Set Three

Data Set: DaveB-dw1

The SQL statement comes from a production data warehouse Oracle system. Over the 60 second sample period, the SQL statement was executed 33974 times...yes, that is correct. So we have lots of samples! Numerically, the average inter-arrival time was 1.76 ms, the median 0.35 ms, and the standard deviation 17.79 ms, with a minimum value of 0.093 ms and a maximum value of 2859.76 ms. Again, massive variance! Notice that once again the median is less than half of the mean. The inter-arrival time is more likely to be around 0.35 ms than 1.76 ms.

Below are five small histograms composed of various percentages of the data. Notice that as the largest values are excluded, we get an interesting glimpse of the majority of the samples. The first histogram below shows 100% of the data, whereas the bottom histogram shows 85% of the data, that is, the 85% smallest values.

If you're like me, you're wondering if the 85% histogram conforms to one of our common statistical distributions; perhaps normal? So once again, I performed a goodness of fit hypothesis test comparing the lowest 85% of the sample data values to the normal, poisson, exponential, and log normal distributions. Yet again, they all "failed" the test, which means they are statistically so different that randomness can not account for the difference. Bummer...

Predicting the Median Values

As I was performing the analysis, I noticed that, assuming the sample data is log normally distributed (which it is not), the predicted median value was kind of close to the actual median value. So I thought I would document this to force a more realistic look at the situation. Below are the actual results from our three data sets.

What conclusions can be drawn? None...we only have three sample sets. So while I remain hopeful we can reliably predict the median, I simply do not have enough data sets to responsibly act on that hope. So...please send me your data sets. If I receive enough, I will be able to do a solid statistical analysis and post the results.
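For reference, if one does assume a log normal shape, the mean and standard deviation alone determine the implied median through the standard log normal moment relationships. This is textbook math, not necessarily the exact calculation in my notepad:

```python
import math

def implied_lognormal_median(mean, sd):
    """For a log normal distribution with the given mean and standard
    deviation, the median works out to mean / sqrt(1 + (sd/mean)^2)."""
    return mean / math.sqrt(1.0 + (sd / mean) ** 2)

# Data set one: mean 186.8 ms, sd 1384.8 ms (actual median: 59.1 ms)
print(round(implied_lognormal_median(186.8, 1384.8), 1))
# Data set three: mean 1.76 ms, sd 17.79 ms (actual median: 0.35 ms)
print(round(implied_lognormal_median(1.76, 17.79), 2))
```

The derivation: with mean m and variance v, a log normal has exp(sigma^2) = 1 + v/m^2 and median m * exp(-sigma^2 / 2), which simplifies to the formula above.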

Conclusions

It is what it is...all three of our sample sets failed to statistically match the normal, exponential, poisson, or log normal distribution. Certainly these are not the results I would have liked to see. But even so, we can draw some useful conclusions that you can check for yourself. (In fact, if you send me your experimental data, I will run it through my analysis and email you back the results.)

Don't be fooled. The Oracle SQL statement inter-arrival rates did not statistically conform to the normal, exponential, poisson, or log normal distributions. If someone claims otherwise, ask for the experimental data.

Academically interesting. The median was always less than half the mean. If you find the average inter-arrival time is 1 ms, I would feel comfortable assuming the median is no more than half the mean, that is, 0.5 ms or less.

Strange data can be real data. A very small subset (less than 5%) of the samples are likely to be at least a factor of 10 larger than the mean. They may seem like outliers, but all of our sample sets show these exceptionally large inter-arrival times will occur, which means they are not an anomaly.

Always validate forecasts. All classic computing system predictive work assumes inter-arrival times to be exponentially distributed. But our data is clearly not...so while I can make predictions, we can see one reason why our predictions are not always spot on! This is just one reason why I stress in my courses (especially my Oracle Forecasting and Predictive Analysis course) the need to validate our forecasts.

Expect the unexpected, just not that often. In my mind, this is by far the most practical application of this research. Because inter-arrival times are clearly not constant and vary wildly, it should not surprise us when a non-heavily loaded system experiences an "unexpected" and perhaps brief slowdown. The slowdown may be short-lived and will occur only rarely...but with increasing likelihood as your system approaches the elbow of the response time curve. The way to reduce the likelihood of this slowdown occurring is to, in some way, influence your system to operate at a perhaps surprisingly low utilization of the most limited resource (e.g., CPU, IO, application object). If you're interested in this topic, I highly recommend Taleb's book, Fooled by Randomness.

P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at craig@orapub .com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email. Another option is to send an email to OraPub's general email address, which is currently orapub@comcast.net.

Tuesday, February 8, 2011


The Situation

In my performance work I am often presented with a Statspack or AWR report. After performing my 3-Circle Analysis, the key SQL statement(s) is easily identified. Obviously my client and I want to know key characteristics about that SQL statement. The standard reports present me with only total values, such as elapsed time, executions, physical IO blocks read, and CPU consumption, that occurred during the report interval. From an elapsed time perspective, all I can determine is the average elapsed time...not a whole lot of information, and as I'll detail below, it has limited value and can easily cause miscommunication.

The problem and my desire

I find that when communicating using only the average SQL elapsed time, my audience makes a number of assumptions which are many times incorrect. Plus I simply cannot glean very much from what the standard reports present. This is another way of saying my communication skills need some work and I need more data! I want to better understand, describe, and communicate SQL statement elapsed times. So in essence, my desire is to be able to responsibly provide more information with very limited inputs.

The value of a conforming SQL statement

Now if a SQL statement elapsed time conforms to a known statistical distribution, then with minimal inputs I will be able to responsibly say much more, communicate better, and serve my clients and students better. But even if it does not statistically conform, if a general pattern develops I may be able to say something like, "The typical elapsed time will most likely be much less than the average." Just saying that will be valuable to me.

It's not normal

Last November I blogged about how, given an average value, most DBAs will assume a value is just as likely to be greater than the average as less than it. This is another, albeit very simplified and not entirely true, way of saying the samples are normally distributed. One of my objectives in the initial "average" blog entry was to demonstrate that Oracle's wait events are not normally distributed. For example, given an average db file scattered read wait time of 10 ms, it is highly unlikely there are just as many waits greater than 10 ms as there are less than 10 ms. What we clearly saw in that blog entry was that more wait occurrences took less than 10 ms than took longer. In this blog entry we'll take a giant leap forward.

Check This Out For Yourself

I spent quite a bit of time preparing this blog entry. In part because I want others to be able to do exactly what I have done, that is, to check this out for themselves. So I created a tool kit that anyone can use to investigate what I'm writing about on their systems or they can use the tool kit on their test system using sample data the tool kit quickly generates. While all the actual data samples and the associated analysis tools in this blog entry can be found here, the data collection tool kit used to gather the sample data can be downloaded, installed, and used for free. To get the tool kit, go to http://www.orapub.com and then search for "sql distribution". Here's a link to help you get started.

The Experimental Design

This experiment requires two Oracle users each with their own Oracle SQL*Plus connection: the monitoring user and an application user. You can run this yourself (test or production system) and see the details by downloading and using the tool kit. The monitoring session quickly and repeatedly queries from v$sqlstats looking for refreshed rows related to a specific SQL statement, which is uniquely identified by its sql_id and plan_hash_value. The algorithm creatively (it's pretty cool, check it out) determines when the statement begins, ends, and also collects timing and resource consumption details.

It is limited, however; to simplify the data collection strategy, concurrent executions are averaged when they complete. To mitigate this factor, the chosen SQL statements were somewhat unlikely to be executed concurrently. Also, to reduce the likelihood of bogus data being recorded, in some cases potentially incorrect samples are discarded. My data collection tool is clearly not the ultimate, but it's the best I have at the moment, and I have taken specific steps to reduce the likelihood of bogus data being part of the final experimental sample set.

The results are inserted into the results table, op_sql_sample_tbl. When the data collection has finished, a simple query formats the results so they can be easily understood and also will feed nicely into a Mathematica notepad for analysis. The notepad file is also provided in the tool kit.

Below is an example of what a run result can look like. For readability I replaced the real sql_id and plan_hash_value with abc and 123 respectively.

Without access to Mathematica, you can quickly create a histogram by selecting the wall_ms_exe column, placing the values into a single line separated by commas, enclosing the line in curly braces, pasting that into http://www.wolframalpha.com, and pressing the "=" submit button. Here's what I mean:

Even with the limited sample data set above from one of my test systems, you'll get tons of statistical data, including the below histogram I snipped from WolframAlpha's on-line result page.

(I'm still amazed WolframAlpha is on-line and for free.) I will explain the results in the below analysis section.

Real Experimental Results

Real Production Data

While I could have used data from my in-house test systems, for this experiment I wanted to use real SQL statements from real production Oracle systems. I found two volunteers who, using my tool kit, gathered data for a number of SQL statements. A profound thanks go out to Dave Abercrombie and Garret Olin, who have both provided me with real performance data, and in the past as well! Dave is also my Mathematica mentor and statistical Jedi Master. He also writes a very insightful blog you may want to follow. While both Dave and Garret have helped me in this quest, they are in no way responsible for any mistakes or botches I may have made.

Lots of Production Data

Even using the snippet of data above along with WolframAlpha you get the gist of the experimental results. But as I demonstrated in my Important Statistical Distributions blog posting, looks can be deceiving! Plus it's always nice to get lots of data. I asked Dave and Garret for at least a hundred samples. This reduces the risk of anomalous data making a significant impact...not to mention people decrying the conclusions as bogus and insignificant.

Statistical Distributions

Before I get into the analysis of the experimental results, it's extremely important to understand a histogram, five common distributions, the histograms for these distributions, and a few key characteristics of each. These key statistical distributions are uniform, normal, exponential, poisson, and log normal. If you are not familiar with these, please reference my blog posting about Statistical Distributions. That blog entry was originally part of this post, but it grew too large and can stand on its own.

Analysis of Experimental Results

My colleagues collected four SQL statements from OLTP systems. I performed statistical fitness tests checking to see if the sample sets, statistically speaking (alpha=0.05) conform to a normal, exponential, poisson, or log normal distribution.

Statistically speaking, all four sample sets did not match any of the above distributions.

In every case, the log normal distribution was the closest match, but still it was not even close to statistically conforming.

If you want to check out the statistical work which includes lots of comments, math, and graphs, you can download the Mathematica notepad, a PDF of one of the experiments, data sets, etc. by clicking here. It's pretty cool.

Initial visual test passed nearly every time

I want to demonstrate just how easy it is to be deceived. The below histogram was taken from one of the four elapsed time sample sets (ref: Garr 8q, 97% of histogram data shown).

Looking at this histogram, the data looks exponential, perhaps poisson or log normal. Yet this sample data failed to match all of these distributions. Not only did the hypothesis test fail, but it was not even close! My lesson from this is: the next time someone tells me a histogram represents data that is exponentially, poisson, or log normally distributed, I will kindly demand to see the fitness (statistical hypothesis) test comparing the experimental data with the stated fitted distribution.

By the way, as you may recall from my Important Statistical Distributions blog entry, if the sample data is log normal, you can take the log of each sample, place the results into a histogram, and the resulting histogram will look normal. Well...using the data from the above histogram, I did just that, and the histogram below was the result!

This is a visually clear sign that our data set is not log normally distributed! But looking at the experimental data histogram, we could not visually determine this. Personally, I find this very interesting...but time constraints did not allow me to delve deeper.
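The take-the-log check can also be made numeric: if the data were log normal, the logged samples would be normal, so their skewness should sit near zero while the raw samples' skewness is large. A sketch with synthetic data (the notepad runs a full hypothesis test; this skewness comparison is just a quick screen):

```python
import math
import random

def skewness(xs):
    """Standard moment-based skewness estimator: m3 / m2^1.5."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

random.seed(11)
# Synthetic log normal data standing in for the elapsed time samples.
lognormal_data = [random.lognormvariate(2.0, 0.8) for _ in range(500)]

print(round(skewness(lognormal_data), 2))                         # strongly right-skewed
print(round(skewness([math.log(x) for x in lognormal_data]), 2))  # near zero
```

For truly log normal data the second number collapses toward zero; for the real sample set above, the logged histogram stayed visibly non-normal, confirming the data is not log normal.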

Beware of bind variables!

The data collection process identifies the statement based on the combination of a sql_id and plan_hash_value. That's good because there can be many plans for the same sql_id. This allows us to differentiate between execution plans! However, there can also be many bind variable combinations applied to the same execution plan! If you see multiple clear histogram peaks (modes), there are likely different bind variable sets. There can also be multiple bind variable sets that result in the same elapsed time; this will effectively stack a histogram bar. Based on my conversations with the data provider of the image below, he knows there are multiple bind variable sets related to the same execution plan. The histogram below confirmed what he thought was occurring.

Of the four data sets I received, there was one that I believe represents a situation far more common than we suspect. Its histogram is shown below (ref: Aber 31, 90% of histogram data shown).

This histogram above shows the elapsed times for a single SQL statement with a single execution plan, yet with what looks to be two or three different bind variable sets. Notice there are at least three clear modes (peaks): near zero, near 10000 ms, and near 70000 ms. The actual average elapsed time is around 57200 ms. Is stating the average elapsed time is 57.2 seconds a good way to communicate the elapsed time? If I add that the standard deviation is 104 seconds and we have 230 samples, it's obvious the elapsed times range all over the place and, of course, never go negative.

A better way to communicate the analysis is that users will tend to have results returned either around 10000 ms or around 70000 ms, but probably not around the average of 57200 ms. There is a huge difference between 10 seconds and 70 seconds. Depending on your audience, actually showing them this histogram will immediately convey the complexity of the situation. My point is, unless we actually gather the detailed data and plot a histogram, describing the situation in terms of an average can easily mislead your audience.
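To see how badly an average can mislead, here is a small Python sketch. The sample data is invented to mimic the two bind variable populations described above; it is not the actual "Aber 31" data.

```python
import random
import statistics

random.seed(1)
# Invented elapsed times (ms): one bind variable set clusters near
# 10,000 ms, another near 70,000 ms -- two clear modes.
fast = [random.gauss(10000, 800) for _ in range(80)]
slow = [random.gauss(70000, 4000) for _ in range(150)]
elapsed = fast + slow

mean = statistics.mean(elapsed)
median = statistics.median(elapsed)
print(f"mean   = {mean:9,.0f} ms")    # lands between the two peaks
print(f"median = {median:9,.0f} ms")  # lands inside the slow cluster
```

Almost no individual execution actually takes the mean elapsed time; reporting only the average would hide both user populations.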

Profiling

This also brings up another point. If I'm going to profile the SQL statement to tune it, which bind variable situation am I tuning for? Can I tune for both? The profile is likely to look very different based on the bind variables. I used to do a lot of SQL tuning and when I did, I always demanded real bind variables and I made sure everyone knew the bind variables I was using. Even then I knew that, when we go for 100% optimization, there are likely to be big winners...and also big losers. So we've got to be careful. Having a histogram of the actual elapsed times can help us understand the true situation.

Predictive possibilities

There is good news and there is bad news... I thought for sure the experimental data would statistically conform to either the exponential, poisson, or log normal distribution...but it wasn't even close. The big disappointment is I cannot predict the median (or other statistics) based only on the average SQL elapsed time. But the results are what they are.

While the comparison between the data sets and the distributions did not statistically match, in every case (unless bind variables caused a massive difference in the elapsed times) the median was far less than the mean. And when I assumed the data was log normal distributed, the predicted median value was higher than the actual median. That is, if I predict the median, the actual median is probably lower. Said another way, if the average is 1000 ms and the predicted median is 50 ms, the actual median (and what the user is more likely to experience) is probably less than 50 ms.

Pick one of your Statspack/AWR SQL Statements

Now let's apply this to a real life situation. Referencing one of my cleansed Statspack reports and then looking at the reported top CPU consumer, I see the following statistics during the one hour sample period/interval.

Over the one hour sample interval this SQL statement was executed 7000 times, which took about 497518.80 seconds in total; that is the wall time, or the elapsed time. This means that during this one hour sample period and for this SQL statement, the average elapsed time was 71.1 seconds.
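Here is that arithmetic, plus a hedged median guess. The ln(2) rule only holds if the elapsed times were exponentially distributed, which my sample sets were not, so treat it as a rough bound rather than a prediction.

```python
import math

executions = 7000
total_elapsed_s = 497518.80  # total elapsed time over the hour

avg = total_elapsed_s / executions  # average elapsed time per execution
# IF elapsed times were exponentially distributed (they were not, in
# every sample set I tested), the median would be mean * ln(2):
exp_median_guess = avg * math.log(2)

print(round(avg, 1))               # 71.1 seconds
print(round(exp_median_guess, 1))  # roughly half the average
```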

If the statement is using a single bind variable set, or the bind variables are not causing wild elapsed time swings, then you would expect the typical elapsed time to be less than 71 seconds and perhaps around half that. We can also expect the elapsed time to occasionally be much longer than 71 seconds.

If I needed more specific and more useful data, then I would use my data collection script to gather some data, plug it into my Mathematica analysis notepad, and crank out the histogram.

Unfortunately, because our analysis could not match the experimental data with a known statistical distribution, we cannot confidently and responsibly make precise predictions about the expected median and other statistics. That's too bad, but that's the way it is...for now.

If you really want to know...

If you want to sincerely understand the elapsed time characteristics of a key SQL statement, then you will need to gather some data. Running the statement a few times from within the application or perhaps using SQL*Plus will get you some results, but that is not what we are looking for here. We want to see the elapsed times for the SQL statement as it occurs in your real production system. The best way to do this is through some automated tool. But what tool?

Whatever that tool may be, it must gather multiple individual elapsed times and then draw conclusions. Gathering only the number of executions and the total elapsed time, even within a 30 minute interval, will only provide the average...and that's not what we're going for here.

You could use the tool that my colleagues and I used to gather the data I analyzed. But the sampling method (to get the best samples) consumes nearly 100% of a CPU core...not what I would call a production ready product! If I come up with a utility to gather this type of data with minimal overhead, I'll be sure to blog about it and offer it to other DBAs.

Conclusions

Based upon my analysis, I think it is reasonable to assume that, given a SQL statement (the text, that is, the sql_id), its elapsed time will not conform to any common statistical distribution. Why? Because for a given SQL statement there can be multiple execution plans, and within each execution plan there can be multiple bind variable combinations. Even when I could demonstrate the SQL being investigated was running with the same execution plan and the same bind variable set, as two of my sample sets showed, the elapsed times varied wildly and did not conform to the normal, exponential, poisson, or log normal distribution. While the distribution may look log normal, statistically it didn't pass the hypothesis test.

Here's a few points that I will take away from this specific work:

While this experiment did not prove SQL statement elapsed times never conform to a standard distribution, it certainly raises the possibility that they may not. Why? Because every sample set I have analyzed has shown its elapsed time distribution is not statistically normal, poisson, exponential, or log normal distributed. If someone states otherwise, respectfully ask to see their data and subsequent analysis.

Unless a SQL statement is using multiple bind variable sets, I would be very comfortable stating the median is less, and possibly considerably less, than the mean. All our analysis data sets (and my experimental ones also) show the median is far less than the mean; at least by 50%.

Just because a data set visually looks very much like another data set or statistical distribution does not mean it statistically matches...be careful.

If you really want to know the elapsed time characteristic of a key SQL statement, you must actually gather elapsed time samples...and lots of them. If you don't have access to Mathematica, send me your data and I will gladly perform the analysis for you!

I learned a lot from preparing this blog entry: My math skills have certainly increased, especially regarding hypothesis testing. My Mathematica skills have also dramatically increased...this can be seen by viewing the notepad file I developed and used to analyze the data. Developing the data collection tool kit was very satisfying; I've already used it to create another similar tool and I'm considering developing an advanced product version (stay tuned). My SQL elapsed time assumptions were seriously brought into question, and I came away with stronger conclusions about typical SQL elapsed times than I could reach previously. Not a hunch, but conclusions based on a solid statistical analysis.

Thanks for reading and I look forward to receiving data samples from you!

Wednesday, February 2, 2011


Important Statistical Distributions For The Serious Oracle DBA

These past two months I have been doing a tremendous amount of Oracle Database performance research work. I have been driven to achieve a deeper understanding of an "average," what it implies and its worth in Oracle Database performance analysis.

But, it's more than that.

Understanding statistical distributions is absolutely key to understanding the user experience, to communicating with users and your peers, to running simple experiments, and to carrying out my research.

I wrote this both as a reference for myself and for the Oracle DBA, which means I wrote it for you! I am really hoping it makes a lot of sense and will be useful in your career.

My Approach

My approach is to start by introducing the histogram. I'm doing this because it is a wonderful visual way to compare and contrast experimental sample sets and, of course, statistical distributions. Then I will move on and introduce five different distributions. I didn't just randomly pick five; I chose these five because they are important for our work. For each distribution I start by presenting a simple non-Oracle example. Then I get a little more technical by describing the inputs used to create the distribution, for example the average. I also include other interesting tidbits. Here we go...

Understanding a Histogram

When presented with a sample set, such as {1.5, 3.2, 2.6, 4.2, 3.8, 2.1, 5.1, 2.6, 6.5, 3.4, 4.2}, a fantastic way to visually grasp the data and get a gut feel for it is to create a histogram. The image below was created by simply copying and pasting the previous sample set directly into WolframAlpha...give it a try.

A histogram's vertical axis shows the number of occurrences which, in our case, is the number of samples. In our sample set there are 11 samples, which means there are 11 occurrences. Each of these 11 occurrences will be represented somewhere in the histogram.

The horizontal axis shows the values of the samples. Scanning our 11 sample values, you'll notice the minimum value is 1.5 and the maximum value is 6.5. The horizontal axis must include this range. For our sample set's histogram, the minimum value on the histogram is 1 and the maximum value is 7.

Our first sample is 1.5 and is represented on this histogram as the bottom (that is, first) occurrence on the first bar from the left. The second sample 3.2 is represented as the first occurrence on the third bar from the left. Notice there are three samples between 3 and 4, hence that histogram bar is 3 occurrences high. If you count the number of occurrences in the histogram, you'll notice there are 11, which is also the number of our samples!

What is interesting in this sample set's histogram is we can see the bulk of the samples sit to the left, with a somewhat long right tail. The mean, that is the average, is 3.6 and the median is 3.4. This tells us there are more samples to the left of the mean than to the right of the mean. Recognizing this difference is very important in Oracle performance analysis.
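If you'd like to reproduce these numbers without WolframAlpha, a few lines of Python do the job, including a crude text histogram:

```python
import statistics

samples = [1.5, 3.2, 2.6, 4.2, 3.8, 2.1, 5.1, 2.6, 6.5, 3.4, 4.2]

print(round(statistics.mean(samples), 1))  # 3.6
print(statistics.median(samples))          # 3.4

# A crude text histogram with unit-wide bins from 1 to 7:
for lo in range(1, 7):
    count = sum(1 for s in samples if lo <= s < lo + 1)
    print(f"{lo}-{lo + 1}: {'*' * count}")
```

Notice the 3-4 bin holds three occurrences and all 11 samples appear somewhere, exactly as described above.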

That's the end of my introduction to histograms. The key is every sample is represented somewhere on the graph. Next I'm going to introduce the five statistical distributions every Oracle performance analyst needs to know about: uniform, normal, exponential, poisson, and finally the log normal distribution.

The Uniform Distribution

Every programmer at some point has needed a random number. Most random number generators provide the ability to return a random number between two values, say 0 and 100. There is an important underlying assumption we typically don't think about. This assumption is any number is just as likely to be returned as any other number. Said another way, returning a 15 is just as likely as returning a 55. Said yet another way, there is no preference toward returning a specific number or a group of numbers. This is another way of saying the distribution of results is uniform...hence the uniform distribution.

Let's start with a specific quantity and numeric range of random numbers and place them into a histogram. The histogram below is based on a set of only 10 random real numbers between 0 and 100. I also specified there to be 10 histogram bins, that is, groups or buckets. I defined the random set of values by providing the minimum and maximum values.

By the way, in the figure above you should be able to count the occurrences, which should add up to the number of samples (10). That doesn't look very random, because it contains only 10 values. If I increase the number of samples from 10 to 10000, the histogram looks very different. Below is the histogram containing 10000 random real numbers between 0 and 100.

This is more like it and what we expected to see. A way of interpreting this histogram is that we are about as likely to pick any number between 0 and 100 as any other...that's because the sample set is full of random, uniformly distributed numbers.

The median for a uniform distribution is the same as the mean. If you were to sort all the samples and pick the one in the middle that would be the median and also the mean.
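A quick way to convince yourself: generate a uniform sample set and compare the mean and median. Here is a sketch in Python, though Mathematica or WolframAlpha works just as well:

```python
import random
import statistics

random.seed(7)
# 10000 uniformly distributed real numbers between 0 and 100.
samples = [random.uniform(0, 100) for _ in range(10000)]

mean = statistics.mean(samples)
median = statistics.median(samples)
# Both should land near the midpoint of the range, 50.
print(round(mean), round(median))
```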

The Normal Distribution

If I asked a group of people to measure the length of a physically present piece of wood down to the millimeter, I would receive a variety of answers. If I took all the results (my sample set) and placed them into a histogram, the result would be the classic bell curve, more formally known as the normal distribution.

The key thing to remember about a normal distribution is there are just as many samples less than the mean as there are greater than the mean. This is another way of saying the median is equal to the mean.

A normal distribution set of values is defined by its mean and standard deviation (a statistic that tells us about the dispersion of the samples). If I had a really cool random number generator, I could tell it to return a set of numbers that are normally distributed with a mean of m and a standard deviation of s. Thankfully, I do have a spiffy random number generator like this! I used a Mathematica command (actually called a symbol, not a command) to generate the sample sets used to create the below histograms!

The histogram below is based on a set of only 10 normally distributed real numbers with a mean of 50 and a standard deviation of 2. I also specified there to be 10 bins, that is, groups or buckets.

That doesn't look very normal! That's because there are only 10 samples, but if we increase the number of samples to 10000 it looks more, well...normal.

But still, it's not that smooth because I set the number of bins to 10. If I let Mathematica automatically set the number of bins, we see the classic looking normal distribution histogram. Remember these three "normal" distribution images have the same mean (average) and standard deviation.

The median for a normal distribution is the same as the mean. If you were to sort all the samples and pick the one in the middle, that would be both the median and the mean.
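The same mean-equals-median check works here. This Python sketch stands in for the Mathematica symbol I used to generate the histogram data:

```python
import random
import statistics

random.seed(7)
# 10000 normally distributed real numbers, mean 50, standard deviation 2.
samples = [random.gauss(50, 2) for _ in range(10000)]

mean = statistics.mean(samples)
median = statistics.median(samples)
# For a normal distribution the two agree (here, both near 50).
print(round(mean, 2), round(median, 2))
```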

The normal distribution is often used because it makes things easy: most people know what the histogram looks like, the math is more straightforward, and students are used to working with normally distributed data. People are drawn to symmetry...we want symmetry. If we see something that is not symmetrical, it's like we are forced to understand why...and that takes energy and time. So most people, including research scientists, tend to assume their data is normal. [Log-normal Distributions across the Sciences: Keys and Clues, BioScience, May 2001] In my Forecasting Oracle Performance book, you will notice I write many times, "...assuming the samples are normally distributed..." By stating this, I am saying the math I'm going to use will work well only when the sample set is normally distributed. If the distribution is not normal, while the math will crank out a result, it will not be as reliable.

As you dig deeper into Oracle performance analysis, you'll begin to realize that many performance related distributions are indeed not normal. Most people think that most distributions are normal, but as I'm demonstrating in my blog entries there are many situations where this is not true...especially when we're talking about Oracle performance topics.

The Exponential Distribution

Radioactive material decays exponentially. For example, suppose in the first hour a piece of radioactive material has a 5% chance of splitting, and therefore a 95% chance of not splitting. Because the decay occurs exponentially, within the second hour there will be a 2.5% chance the material will split and a 97.5% chance it will not. Within the third hour there will be a 1.25% chance the material will split and a 98.75% chance it will not. And on and on...unfortunately forever. Here is our sample set in WolframAlpha-ready format, followed by the plot:

plot {{1,5},{2,2.5},{3,1.25},{4,0.625},{5,0.313},{6,0.156},{7,0.078}}

Crack open a book about queuing theory or computing system performance analysis and you'll see the words exponential distribution. It's one of the phrases that computer folks throw around, only a few can talk about, and very few really understand. So in a few short paragraphs, I'm going to try and explain this (that's the "talk about" part) as clearly as I possibly can, without the aid of a white board and personal classroom interaction. If you want personal classroom interaction, check out my courses.

The exponential distribution is like most other distributions in that it is defined by a small set of parameters. For the uniform distribution, the sample set definition is x number of random numbers (i.e., integers, real, etc.) between a minimum and maximum value. For the normal distribution, the sample set definition is x number of normally distributed numbers (i.e., integers, real, etc.) having a mean of m and a standard deviation of s. For the exponential distribution, the sample set definition is x number of exponentially distributed numbers having an average of m (actually, the real input parameter is the rate, 1/m). There is only a single parameter!

Unlike a uniform or normal distribution, there are more samples less than the mean than greater than the mean! It follows that if you were to sort all the samples and pick the one in the middle (the median sample), its value would be less than the mean. You can see this visually in the figures below. In the radioactive decay example above, the mean is 1.42 and the median is 0.625.

Exponentially distributed sample sets are very common in computing system performance analysis. It is a given in capacity planning that the time between transaction arrivals (the subject of one of my next blog entries) and also the time it takes to service a transaction (not its waiting or queue time, but the actual service time) are exponentially distributed.

The histogram below is based on a set of only 10 exponentially distributed real numbers with a mean of 50. I also specified there to be 10 bins, that is, groups or buckets.

Because there are only 10 samples, the histogram is very awkward looking. But even with only 10 samples, it looks different than both the 10 sample uniform and normal distribution histograms shown above. This histogram is also eerily similar to the histogram near the top of this blog based on the 12 sample SQL execution elapsed times...woops...sorry...that's at the top of my next blog posting!

The below two figures are based on the exact same sample set. The only difference is how I defined the histogram bins.

The figure below is still an exponentially distributed set of values with a mean of 50 but I increased the number of samples from only 10 to 10000.

It is still chunky looking but that's because I set the number of bins to 10. I wanted to show you this because many times when doing experiments and performance analysis we don't have 10000 samples and so there may only be a few bins. Even with only 10 bins, the above pattern or look is what you might see. Letting Mathematica set the number of bins, we get the classic looking exponential distribution histogram.

Remember, all three of the above histograms have a mean of 50. The only differences are the number of samples and the number of histogram bins.

Remember, the above two images/histograms are based on the exact same sample set; the only difference is the bin size settings!

Let's investigate the median a bit. As you'll recall, if we sorted all the samples and picked the middle sample, that would be the median. Looking at the 10000 sample histogram directly above, what do the mean and the median look to be? Well, I told you the mean is 50, so what about the median? Because the median is based on the number of samples, not their values, in this case the median will be less than the mean. It's like those few large values have pulled the mean to the right and away from the median. While you can't see this in the above histogram graphic, there are samples with values of 300, 400, even 500. There are not a lot of them, but they are there and they are effectively pulling the mean toward them!

While the mean for the above histogram is 50, the median is about 35. So our hunch is correct in that the median is less than the mean.
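You can check that hunch numerically. For an exponential distribution the median works out to the mean times ln(2), about 0.69 of the mean. Here is a quick check in Python (standing in for Mathematica):

```python
import math
import random
import statistics

random.seed(7)
# 10000 exponentially distributed samples with a mean of 50.
samples = [random.expovariate(1 / 50.0) for _ in range(10000)]

mean = statistics.mean(samples)
median = statistics.median(samples)
theoretical_median = 50 * math.log(2)  # = 34.66

print(round(mean, 1))    # near 50
print(round(median, 1))  # near 35, well below the mean
print(round(theoretical_median, 2))
```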

While many Oracle performance distributions may look exponential, if we increase the bin size they begin to look poisson-like. So it's important we also investigate the poisson distribution.

Poisson Distribution

Suppose I'm doing a traffic study and need to estimate the average number of vehicles that pass by a specific free-flowing(1) destination each minute. I wake up one delightful morning, get my cup of coffee, arrive at my destination and start counting. Every 30 seconds I record the number of vehicles that passed by. Let's say I did this for 10 minutes, which means I would have 20 samples. The resulting sample set is poisson distributed. Assuming I actually did this, here are the actual 20 sample values, followed by the histogram:
{19, 19, 18, 10, 20, 16, 12, 18, 15, 15, 18, 18, 16, 20, 17, 14, 27, 19, 14, 17}

You can see a similar histogram by copying and pasting the above values (keep the curly braces) into WolframAlpha.

(1) By free-flowing I mean each vehicle that passes is not somehow dependent on another vehicle that passes. For example, if there is a traffic jam, an accident, or a stop sign nearby causing the vehicles to bunch up, then the vehicle arrivals would be related. This is another way of saying the arrivals must be independent, which must be true in a true poisson process. Additionally, the inter-arrival times (i.e., the time between each arrival) must be exponentially distributed.

The poisson histogram image above is very typical. Going left to right, there is a quick build up to the peak and then a slow decline. If we have enough samples, like the normal distribution, the mean will equal the median. So even though there are a few large values to the right of the mean, there are enough smaller values to the left of the mean to keep the mean and median "in check" and not swaying from each other. As you'll recall from above, this is very different compared to an exponential distribution. With an exponential distribution the far left histogram bar is the tallest, that is, it contains the most samples, and hence there is no build up.
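Using the 20 traffic counts from above, you can verify how close the mean and median stay for poisson-like data:

```python
import statistics

# The 20 thirty-second vehicle counts from the traffic study above:
counts = [19, 19, 18, 10, 20, 16, 12, 18, 15, 15,
          18, 18, 16, 20, 17, 14, 27, 19, 14, 17]

print(statistics.mean(counts))    # 17.1
print(statistics.median(counts))  # 17.5
```

Even with the outlier of 27, the mean and median stay within half a vehicle of each other; compare that with the exponential example, where the median sat some 30% below the mean.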

Take a look at the below two histograms. Are they exponential or poisson? It is natural to think the distribution on the left is exponential and the distribution on the right is poisson... But actually, they are both poisson.

When comparing the above two images, the image on the left (which contains the wider bins) looks more exponential than poisson. Surprisingly, they are both based on the exact same sample set! If you look closely, you'll notice the only difference is the histogram bin size. You can also be deceived when there are lots of samples, because a poisson distribution can look like a normal distribution! Just look at the image below.

I will get more into a poisson process and its distribution in my upcoming blog entry on SQL statement arrival rates. But for now, keep in mind that what may appear to be a normal or exponential distribution could actually be more poisson-like! They can be difficult to distinguish. The only way to test this is to perform a statistical hypothesis test.

As I will blog about soon, many Oracle performance distributions may look exponential or poisson, but they fail a statistical hypothesis test. There is yet another and lesser known distribution that many times is the best match for Oracle performance related sample sets. It's called the log normal distribution.

Log Normal Distribution

There are times when a skewed normal distribution occurs. By skew I mean the tallest histogram bar is not in the middle but to the left or the right. For example, a skew is likely when mean values are near zero, variances are large and perhaps extreme, and the sample values cannot be negative. When these types of conditions exist, the log normal distribution may best describe the sample set.

Personal income, reaction time to snake bites and bee stings, country GDP, and oil field reserves are all supposed to be log normal distributed. Humm...

Proving it to myself

To prove this to myself, I decided to give this a try using real country GDP and country oil reserve data. With my hands shaking in anticipation, I called once again on Mathematica. Mathematica has vast data resources that can be pulled onto my desktop and analyzed. It's scary-amazing. Within minutes I had the GDP and oil reserves for 231 countries at my fingertips. I'm not sure of the year, but that's really not important anyways.

I then placed the data into a histogram. Regardless of my histogram tweaks, the image always looked exponentially distributed. However, when I performed a statistical fitness test comparing the data samples with the distributions presented in this blog, only the log normal distribution was statistically similar. Every other distribution did not conform to the actual data. So I guess the experts were correct.

How to create a log normal data set

First, I'll state a simple definition: A sample set is log normal if log(sample x) is normally distributed. For example, if I have 100 log normal samples, apply log(sample[i]) to each one, and then create a histogram from the results, the histogram will be normally distributed!

Here is how I created a log normal sample set, taking a slightly different twist: for each sample x in a normal distribution with a given mean and standard deviation, apply exp(x) to it and place the result into another sample set. If you create a histogram of the new exp(x) samples, it will look like this (assuming the normal distribution samples have a mean of 3 and a standard deviation of 1.5):

Figure X.

It looks a lot like the exponential and poisson distributions! In fact, based on its two input parameters (the mean and standard deviation of its associated normal distribution), it can look like either one...especially if we mess with the histogram bin sizes and counts. I suspect this flexibility is what makes it a relatively good visual match for our experimental data.
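If you'd like to build such a sample set yourself, here is the recipe sketched in Python (standing in for the Mathematica symbol I used); it also demonstrates the log() round trip described earlier:

```python
import math
import random
import statistics

random.seed(7)
# Step 1: normally distributed samples, mean 3, standard deviation 1.5.
normal = [random.gauss(3, 1.5) for _ in range(1000)]
# Step 2: exponentiate each one to get a log normal sample set.
log_normal = [math.exp(x) for x in normal]

# Sanity check: taking the log recovers the normal parameters.
recovered = [math.log(x) for x in log_normal]
print(round(statistics.mean(recovered), 1))   # near 3
print(round(statistics.stdev(recovered), 1))  # near 1.5
```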

Now this is really cool! Recall the country GDP sample set I mentioned just above in the Proving it to myself section. Below is the country GDP histogram, with 80 bins and showing 90% of the data.

If I take the log of each country's GDP and create a histogram of the result, it looks like this:

Very cool, eh? So while the raw data histogram looks more exponential, when we apply the log function to the data and create a histogram, it looks pretty normal...which means visually our country GDP data is log normal distributed. And as I mentioned above, performing a statistical hypothesis test shows the data is statistically log normal as well.

How to lie using histograms

Interestingly, the image below is based on the exact same sample set as the Figure X histogram three images above. The only difference is I set the number of bins to 40 and displayed 90% of the data. This is why the horizontal axis only extends to around 70, not over 700.

Now that's more like it! Notice the three far left bars are not the tallest, and also notice how far the tail extends to the right. These are two key identifying characteristics of the log normal distribution. But I have to warn you, there are many log normal sample sets where the far left bars are the tallest (e.g., the country GDP data). The only way to really test if your data is log normal is to perform a hypothesis test. I will detail this in the next blog entry.

Predicting the median and mean

This is pretty cool: The actual data samples in the above histogram have an average of 53 and a median of 19. So there are enough large value samples to effectively pull the mean away from the median. If you recall, for both the normal and poisson distributions, the mean and median are equal.

Ready to be freaked out? For a log normal distribution, the median of its samples is supposed to be the constant e raised to the power of its associated normal distribution's average. If you recall from above, the log normal sample set was created from normally distributed samples with a mean of 3. Therefore, the median equation is:

median = e^m = e^3 = 2.718281828^3 = 20.09

Woah! Just above I said the actual median was 19, which is very close considering the sample set consists of only 1000 samples.

Let's try another freak-out thing: the mean of a log normal distribution is the constant e raised to the power of its normal distribution's average plus its standard deviation squared divided by two. Words are messy, so the equation is:

mean = e^(m+(s^2)/2) = 2.718281828^(3+(1.5^2)/2) = 61.87

Again, not perfect but close...although I would have liked it to be closer. I will delve into the application of this in subsequent blog postings.
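Here is a quick numeric check of both formulas, sketched in Python; the sample size of 100000 is arbitrary, just large enough for the sample statistics to settle down:

```python
import math
import random
import statistics

random.seed(7)
m, s = 3, 1.5  # parameters of the underlying normal distribution
samples = [math.exp(random.gauss(m, s)) for _ in range(100000)]

predicted_median = math.exp(m)            # e^3     = 20.09
predicted_mean = math.exp(m + s * s / 2)  # e^4.125 = 61.87

print(round(statistics.median(samples), 1), round(predicted_median, 2))
print(round(statistics.mean(samples), 1), round(predicted_mean, 2))
```

The sample median lands close to e^m, while the sample mean wanders a bit more because the long right tail makes it a noisy statistic, which mirrors what I saw above.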

Shifting sand

I mentioned above the input parameters to create a log normal distribution are the mean and standard deviation of its associated normal distribution. These two parameters are also sometimes referred to as the scale and shape parameters. If I mess with the shape parameter (i.e., standard deviation), this causes the histogram tail to either contract or extend far to the right. The scale parameter shifts the tallest histogram bar to the left or right.

Here's an example. In the image below the darkest red-ish color is the overlap of two histograms. The two histograms are colored purple and pink. The two data sets are only different in their scale parameter. You can see one of the differences is the tallest bar shifted to the right when the scale parameter was increased. Perhaps not that interesting, but it will be useful in future blog posts.

The log normal distribution is amazing. The reason I focused so much on it is because it is important for Oracle performance analysis. I haven't demonstrated this yet, but I will in subsequent posts...stay tuned.

Conclusions

As I dig deeper into Oracle performance analysis, I am forced to understand statistical distributions. There is just no way around it. Documenting my research and preparing to analyze experimental data produced the content for this posting. My hope is that I have conveyed a few key take-aways:

A clear understanding of a histogram.

By changing histogram characteristics (e.g., bin size, number of bins) you can make a sample set look like a desired distribution.

How common statistical distributions relate to our lives.

What common statistical distribution histograms look like.

In subsequent blog postings, I will analyze how various Oracle system happenings relate to these common statistical distributions. This will allow us to communicate more confidently, more correctly, and also perhaps make some interesting predictions.