R is without doubt the Open Source tool of choice for statistical analysis, it contains a huge variety of statistical analysis techniques – rivalled only by hugely expensive commercial products such as SAS and SPSS. I’ve been playing with R a bit lately, and – of course – working with data held in Oracle. In particular, I’ve been playing with data held in the Oracle dynamic performance views.

This post is a brief overview of installing R, connecting R to Oracle, and using R to analyse Oracle performance data.

Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.

> install.packages("RJDBC")

Using the StatET Eclipse plug-in

I use a free eclipse plug-in called StatET. It provides an editing environment and GUI console for the R system. The configuration steps are a little laborious, but it has a online getting started module that guides you through the steps. Once you have it installed, I doubt you’ll go back to the command line.

The library command loads the RJDBC module, which will provide connectivity to Oracle.

3

We create a driver object for the Oracle JDBC driver. The second argument is the location of the Oracle JDBC jar file, almost always $ORACLE_HOME/jdbc/lib/ojdbc6.jar.

6

Connect to the Oracle database using standard JDBC connections strings

7

Create an R dataset from the result set of a query. In this case, we are loading the contents of the V$SQL table.

10

The R “summary” package provides simple descriptive statistics for each variable in the provided dataset.

Basic R statistical functions

R has hundreds of statistical functions, in the above example we used “summary”, which prints descriptive statistics. The output is shown below; mean, medians, percentiles, etc:

Correlation

Statistical correlation reveals the association between two numeric variables. If two variables always increase or decrease together the correlation is 1; if two variables are absolutely random with respect of each other then the correlation tends towards 0.

cor prints the correlation between every variable in the data set:

cor.test calculates the correlation coefficient and prints out the statistical significance of the correlation, which allows you to determine if there is a significant relationship between the two variables. So does the number of sorts affect response time? Let’s find out:

The p-value is 0.19 which indicates no significant relationship – p values of no more than 0.05 (one chance in 20) are usually requires before we assume statistical significance.

On the other hand, there is a strong relationship between CPU time and Elapsed time:

Plotting

plot prints a scattergram chart. Here’s the output from plot(sqldata$ELA,sqldata$CPU):

Here’s a slightly more sophisticated chart using “smoothScatter”, logarithmic axes and labels for the axes:

Regression

Regression is used to draw “lines of best fit” between variables.

In the simplest case, we use the “lm” package to create a linear regression model between two variables (which we call “regdata” in the example). The summary function prints a summary of the analysis:

This might seem a little mysterious if your statistics is a bit rusty, but the data above tells us that there is a significant relationship between elapsed time (ELA) and physical reads (PHYS) and gives us the gradient and Y axis intercept if we wanted to draw the relationship. We can get R to draw a graph, and plot the original data by using the plot at abline functions:

Testing a hypothesis

One of the benefits of statistical analysis is you can test hypotheses about your data. For instance, what about we test the until recently widely held notion that the buffer cache hit rate is a good measure of performance. We might suppose if that were true that SQL statements with high buffer cache hit rates would show smaller elapsed times than those with low buffer cache hit rates. To be sure, there are certain hidden assumptions underlying that hypothesis, but for the sake of illustration let’s use R to see if our data supports the hypothesis.

Simple correlation is a fair test for this, all we need to do is see if there is a statistically signifcant correlation between hit rate and elapsed time. Here’s the analysis:

The correlation is close to 0, and the statistical significance way higher than the widely accepted .05 threshold for statistical significance. Statements with high hit ratios do not show statistically signficantly lower elasped times that SQLs with low hit ratios.

Conclusion

There’s tons of data in our Oracle databases that could benefit from statistical analysis – not the least the performance data in the dynamic performance views, ASH and AWR. We use statistical tests in Spotlight on Oracle to extrapolate performance into the future and to set some of the alarm thresholds. Using R, you have easy access to the most sophisticated statistical analysis techniques and as I hope I’ve shown, you can easily integrate R with Oracle data.

Reader Comments (6)

Guy,How does the RJDBC package compare to RODBC and RORACLE packages in terms of performance? Just curious since we are using a 64 bit trial version of Revolution R to connect to Oracle. SAS licensing for servers is expected to increase from $1.5M to $4M upon renewal and we need to find a substitute. We are using sas pass-thru to hit oracle tables currently.Thanks.

I haven't done any performance comparisons, though its something I am interested in. I did think that RJDBC seemed slower than you would expect - I wondered if it might not be doing array fetch. If I get a few spare mementos I'll check it out.

Interesting comment about SAS licensing, not the first time I've heard that!Guy

Guy,Thanks for the response to my post. Revolution just did a presentation for us last week and they are planning to incorporate RODBC into their product by the end of the summer. Regarding RORACLE, they stated the difficulty in getting time with Oracle due to the large number of vendors. Looking forward to your take on performance of the various R interface packages.

"For statistical analysis SAS is way ahead of any other software " Don't think so. I've been using SAS for 14 years and R the last 3 years. SAS doesn't offer advance statistical methods such as Support Vector Machines, Random Forests, etc. which are readily available in R. Also, Revolution Analytics is on the cusp of something extemely powerful with big data capabilities in R.