Sunday Dec 08, 2013

It’s getting to be that time of year
again. The Oracle BIWA Summit '14 will
be taking place January 14-16 at Oracle HQ Conference Center, Redwood Shores,
CA. Check out the detailed agenda.

BIWA Summit provides a wide range of
sessions on Business Intelligence, Warehousing, and Analytics, including novel
and interesting use cases of Oracle Big Data, Exadata, Advanced Analytics/Data
Mining, OBIEE, Spatial, Endeca and more! You’ll also have opportunities to
get hands-on experience with products in the Hands-on Labs, hear great customer case
studies, and attend talks by Oracle Technical Professionals and Partners. Meet
with technical experts on the technology you want and need to use.

Click HERE to read detailed abstracts
and speaker profiles. Use the SPECIAL DISCOUNT code ORACLE12c and
registration is only $199 for the 2.5 day technically focused Oracle user group
event.

Friday Oct 11, 2013

If you'd like to try Oracle R Enterprise, Oracle Partner Vlamis Software Solutions provides a quick and easy way to get started using a virtual machine (VM) loaded with all the software you require and hosted on Amazon Web Services (AWS). Follow this link and within a few clicks you'll have a "Remote Desktop" connection to the cloud with sample scripts for you to explore both the R language and Oracle R Enterprise, from both R and SQL.

Oracle R Enterprise, a component of the Oracle Advanced Analytics option,
makes the open source R statistical programming language and
environment ready for the enterprise and big data. It provides a comprehensive, database-centric environment for end-to-end analytical processes in R, with immediate deployment to production environments. R users can operationalize entire R scripts in production applications, thereby eliminating porting of R code to other languages or reinventing code to integrate R results into existing applications. Oracle R Enterprise allows users to seamlessly leverage Oracle Database as a high performance computing (HPC) environment for R scripts, providing data parallelism and resource management.

Monday Aug 12, 2013

Refreshing predictive models is a standard part of the
process when deploying advanced analytics solutions in production environments.
In addition, many predictive models need to be used in a real-time setting for scoring
customers, whether that is for fraud detection, predicting churn, or
recommending the next likely product. One of
the problems with using vanilla R is that real-time scoring often requires
starting an R engine for each score, or enabling some ad hoc mechanism for
real-time scoring, which can increase application complexity.

In this blog post, we look at how Oracle R Enterprise
enables:

Building models in-database on database data from R

Renaming in-database models for use by a stored procedure

Invoking the stored procedure to make predictions from SQL

Building a second model and swapping it with the original

Moving a model from development environment to production
environment

Building the model in R

So let’s start with building a generalized linear model (GLM)
in Oracle Database. For illustration purposes, we’ll use the longley data set from R – a
macroeconomic data set that provides a well-known example for a highly
collinear regression. In R, type ?longley
for the full description of the data set.

Using the following R script, we create the database table LONGLEY_TABLE from the longley data.frame
and then build the model using the in-database GLM algorithm. We’re predicting
the number of people employed using the remaining variables. Then, we view the model
details using summary and the
auto-generated fit.name. This fit.name corresponds to the name of the auto-generated
Oracle Data Mining (ODM) model in the database.
Next, we use the model to predict using the original data, just for a
confirmation that the model works as expected.

ore.connect("rquser", "my_sid", "my_host", "rquser_pswd", 1521, all=TRUE)

ore.create(longley, table="LONGLEY_TABLE")

mod.glm <- ore.odmGLM(Employed ~ ., data = LONGLEY_TABLE)

summary(mod.glm)
mod.glm$fit.name
predict(mod.glm, LONGLEY_TABLE)

While a user can refer to the ODM model by its name in fit.name, for example, when working
with it in SQL or the Oracle Data Miner GUI, this may not be convenient since
it will look something like ORE$23_123.
In addition, unless the R object mod.glm
is saved in an ORE datastore (an ORE
feature corresponding to R’s save
and load functions using ore.save and ore.load, but in the database), at the end of the
session, this object and corresponding ODM model will be removed.

In addition, we’ll want to have a common name for the model
so that we can swap an existing model with a new model without having to change
higher-level code. To rename an ODM model, we can use the PL/SQL statement
shown here, invoked from R using ore.exec.
Of course, this could also be done from any SQL interface, e.g., SQL*Plus, SQL
Developer, etc., just supplying the explicit SQL.
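For illustration, a minimal sketch of that call, assuming the auto-generated name is available in mod.glm$fit.name:

# rename the auto-generated ODM model to a stable, common name
ore.exec(paste("BEGIN DBMS_DATA_MINING.RENAME_MODEL('", mod.glm$fit.name,
               "', 'MY_GLM_MODEL'); END;", sep=""))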

So now, we have the ODM model named MY_GLM_MODEL. Keep in mind, after the model is renamed,
the original model no longer exists and the R object is invalid – at least from
the standpoint of being able to use it in functions like summary or predict.

Scoring data from a SQL procedure

As noted above, users can score in batch from R; however,
they can also score in batch from SQL. But we’re interested in real-time
scoring from the database using the in-database model. This can be done
directly in a SQL query by providing the input data in the query itself. This eliminates
having to write data to a database table and then doing a lookup to retrieve
the data for scoring – making it real-time.

The following SQL does just this. The WITH clause defines the input data,
selecting from dual. The SELECT
clause uses the model MY_GLM_MODEL
to make the prediction using the data defined by data_in.

WITH data_in as (select 2013    "Year",
                        234.289 "GNP",
                        235.6   "Unemployed",
                        107.608 "Population",
                        159     "Armed.Forces",
                        83      "GNP.deflator",
                        60.323  "Employed"
                   from dual)
SELECT PREDICTION(MY_GLM_MODEL USING *) "PRED"
  FROM data_in

While we could invoke the SQL directly, having a stored
procedure in the database can give us more flexibility. Here’s the stored
procedure version in PL/SQL.
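A minimal sketch of what such a procedure might look like, created here from R via ore.exec; the parameter names and procedure body are illustrative:

# A sketch only: wrap the scoring query in a PL/SQL procedure
# (parameter names are illustrative)
ore.exec("
  CREATE OR REPLACE PROCEDURE MY_SCORING_PROC (
    p_year         IN  NUMBER,
    p_gnp          IN  NUMBER,
    p_unemployed   IN  NUMBER,
    p_population   IN  NUMBER,
    p_armed_forces IN  NUMBER,
    p_gnp_deflator IN  NUMBER,
    p_pred         OUT NUMBER) AS
  BEGIN
    SELECT PREDICTION(MY_GLM_MODEL USING
             p_year         AS \"Year\",
             p_gnp          AS \"GNP\",
             p_unemployed   AS \"Unemployed\",
             p_population   AS \"Population\",
             p_armed_forces AS \"Armed.Forces\",
             p_gnp_deflator AS \"GNP.deflator\")
      INTO p_pred
      FROM dual;
  END;")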

Let’s say the model above has been in production for a
while, but has become stale – that is,
it’s not predicting as well as it used to due to changing patterns in the data.
To refresh it, we build a new model. For illustration purposes, we’re going to
use the same data (so an identical model will be produced, except for its
name).

mod.glm2 <- ore.odmGLM(Employed ~ ., data = LONGLEY_TABLE)

summary(mod.glm2)
mod.glm2$fit.name

To swap the models, we delete the existing model called MY_GLM_MODEL and rename the new model
to MY_GLM_MODEL. Again, we can do this from R using PL/SQL and through ore.exec.
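A sketch of that swap, assuming the new model's auto-generated name is available in mod.glm2$fit.name:

# drop the old model and rename the new one to the common name
ore.exec("BEGIN DBMS_DATA_MINING.DROP_MODEL('MY_GLM_MODEL'); END;")
ore.exec(paste("BEGIN DBMS_DATA_MINING.RENAME_MODEL('", mod.glm2$fit.name,
               "', 'MY_GLM_MODEL'); END;", sep=""))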

You may have noticed that this approach can introduce a brief period where no model is accessible - between the DROP_MODEL and RENAME_MODEL. A better approach involves the use of SYNONYMs. In general, synonyms provide both data independence and location transparency, being an alternative name for a table, view, sequence, procedure, stored function, and other database objects. We can use this in conjunction with our stored procedure above. First, create a synonym for the original scoring procedure.

CREATE or REPLACE SYNONYM MY_SCORING_PROC_SYM for MY_SCORING_PROC;

When invoking the procedure from your application, use the name MY_SCORING_PROC_SYM in place of MY_SCORING_PROC. Instead of renaming the model, create a second stored procedure, with a different name, e.g., MY_SCORING_PROC_2. The new procedure references the name of the newly built model internally.

When it is time to swap the models, invoke the following to change the procedures.

CREATE or REPLACE SYNONYM MY_SCORING_PROC_SYM for MY_SCORING_PROC_2;

Another benefit of this approach is that replaced models can still be kept should you need to revert to a previous version.

Moving an in-database model from one machine to another

In a production deployment, there’s often the need to move a
model from the development environment to the production environment. For
example, the data scientist may have built the model in a development / sandbox
environment and now needs to move it to the production machine(s).

In-database models provide functions EXPORT_MODEL and IMPORT_MODEL
as part of the DBMS_DATA_MINING
SQL package. See the 11g
documentation for details. These calls can be invoked from R, but we’ll
show this from SQL just to keep the flow easier to see.

From a SQL prompt, e.g., from SQL*Plus, connect to the
schema that contains the model. Create a DIRECTORY
object where the exported model file will be stored. List the model names
available to this schema, which should contain MY_GLM_MODEL.
Then, export the model:

CONNECT rquser/rquser_psw

CREATE OR REPLACE DIRECTORY rquserdir AS '/home/MY_DIRECTORY';

-- list the models available to rquser
SELECT name FROM dm_user_models;

-- export the model called MY_GLM_MODEL to a dump file in same schema
EXECUTE DBMS_DATA_MINING.EXPORT_MODEL ('MY_GLM_MODEL_out',
                                       'RQUSERDIR',
                                       'name = ''MY_GLM_MODEL''');

At this
point, you have the ODM model named MY_GLM_MODEL
in the file MY_GLM_MODEL_out01.dmp
stored in the file system under /home/MY_DIRECTORY.
This file can now be moved to the production environment and the model loaded
into the target schema.

Log into the
new schema and invoke IMPORT_MODEL.

CONNECT rquser2/rquser2_psw

EXECUTE DBMS_DATA_MINING.IMPORT_MODEL ('MY_GLM_MODEL_out01.dmp',
                                       'RQUSERDIR', 'name = ''MY_GLM_MODEL''',
                                       'IMPORT', NULL,
                                       'glm_imp_job', 'rquser:rquser2');

Summary

In this post, we’ve highlighted how to build an in-database
model in R and use it for scoring through SQL in a production, real-time
setting. In addition, we showed how it is possible to swap, or refresh, models
in a way that can leave your application code untouched. Finally, we highlighted
database functionality that allows you to move in-database models from one database
environment to another.

Users should note that all the functionality shown involving
SQL, or being invoked through ore.exec, can be easily wrapped in R functions that could ultimately become part of ORE. If
any of our readers are interested in giving this a try, we can post your
solution here to share with the R and Oracle community. For the truly
adventurous, check out the Oracle Database package DBMS_FILE_TRANSFER and consider wrapping the ability to
move model files from R as well.

Friday Jul 19, 2013

Oracle R Connector for Hadoop 2.2.0 is now available for download. The Oracle R Connector for Hadoop 2.x series has introduced numerous enhancements, which are highlighted in this article and summarized as follows:

ORCH 2.0.0
Analytic Functions: orch.lm, orch.lmf, orch.neural, orch.nmf
Oracle Loader for Hadoop (OLH) support
CDH 4.2.0
ORCHhive transparency layer

ORCH 2.1.0
Analytic Functions: orch.cor, orch.cov, orch.kmeans, orch.princomp, orch.sample (by percent)
Configurable delimiters in text input data files
Map-only and reduce-only jobs
Keyless map/reduce output
"Pristine" data mode for high performance data access
HDFS cache of metadata
Hadoop Abstraction Layer (HAL)

ORCH 2.2.0
Analytic Functions: orch.sample (by number of rows)
CDH 4.3.0
Full online documentation
Support for integer and matrix data types in hdfs.attach with detection of "pristine" data

In ORCH 2.0.0, we introduced four Hadoop-enabled analytic functions supporting linear regression, low rank matrix factorization, neural network, and
non-negative matrix factorization. These enable R users to immediately begin using advanced analytics functions on HDFS data using the MapReduce paradigm on a Hadoop cluster without having to design and implement such algorithms themselves.

While ORCH 1.x supported moving data between the database
and HDFS using sqoop, ORCH 2.0.0 supports the use of Oracle Loader for Hadoop (OLH) to move very large
data volumes from HDFS to Oracle Database in an efficient and high
performance manner.

In ORCH 2.1.0, we added several more analytic functions, including
correlation and covariance, clustering via K-Means, principal component
analysis (PCA), and sampling by specifying the percent of records to return.

ORCH 2.1.0 also brought a variety of features, including: configurable delimiters
(beyond comma delimited text files, using any ASCII delimiter), the ability to specify mapper-only and
reduce-only jobs, and the output of NULL keys in mapper and reducer functions.

To speed the loading of data into Hadoop jobs, ORCH introduced
“pristine” mode where the user guarantees that the data meets certain
requirements so that ORCH skips a time-consuming data validation step. “Pristine” data requires that numeric columns contain only numeric data,
that missing values are either R’s NA or the null string, and that all
rows have the same number of columns. This improves performance of hdfs.get on a 1GB file by a factor of 10.

ORCH 2.1.0 introduced the caching of ORCH metadata to improve the response time of ORCH functions, making hdfs.ls, hdfs.describe, and hdfs.mget between 5x and 70x faster.

The Hadoop Abstraction Layer, or HAL, enables ORCH to work on top of
various Hadoop versions or variants, including Apache, Hortonworks, and the
Cloudera Hadoop distributions CDH3 and CDH 4.x with MR1 and MR2.

ORCH 2.2.0

In the latest release, ORCH 2.2.0, we’ve augmented orch.sample to allow specifying the
number of rows in addition to the percentage of rows. CDH 4.3 is now supported, and ORCH functions provide
full online documentation via R's help function or ?. The function hdfs.attach now
supports integer and matrix data types and can detect pristine
data automatically. HDFS bulk directory upload and download performance was
also improved. Through the caching and automatic synchronization of ORCH metadata and file lists, the
responsiveness of HDFS metadata-related functions has improved by 3x over ORCH 2.1.0, which also improves performance of the hadoop.run and hadoop.exec functions. These improvements in turn bring a more
interactive user experience for the R user when working with HDFS.

Starting in ORCH 2.2.0, we introduced out-of-the-box tuning optimizations for
high performance and expanded HDFS caching to include the caching of file lists, which further improves performance of HDFS-related functions.

The function hdfs.upload now supports the option to upload multi-file directories in a single invocation, which optimizes the process. When downloading an HDFS directory, hdfs.download is optimized to issue a single
HDFS command to download files into one local temporary
directory before combining the separate parts into a single file.

The Hadoop Abstraction Layer (HAL) was extended to support
Hortonworks Data Platform 1.2 and Apache Hadoop 1.0. In addition, ORCH now allows the user to override the Hadoop Abstraction Layer version for use with unofficially supported distributions of Hadoop using system environment variables. This enables testing and certification of ORCH by other Hadoop distribution vendors.

Certification of ORCH on non-officially supported platforms can be done
using a separate test kit (available for download upon request: mark.hornick@oracle.com) that
includes an extensive set of tests for core ORCH functionality and that
can be run using the ORCH built-in testing framework. Running the tests
pinpoints the failures and ensures that ORCH is compatible with the
target platform.

Thursday Jul 18, 2013

This guest post from
Marcos Arancibia describes how to use Oracle R Enterprise to analyze Time Series data.

In this article, we give an overview of how to use Time Series Analysis against data stored in Oracle
Database, using the Embedded R Execution capability to send time series computations
to the Oracle Database server instead of processing at the client. We will also learn how to retrieve the final
series or forecasts to the client for plotting, forecasting, and
diagnostics.

One key thing to keep in mind when
using Time Series techniques with data that is stored in Oracle Database is the
order of the rows, or records. Because
of the parallel capabilities of Oracle Database, when queried for records, one
might end up receiving records out of order if an option for order is not
specified.

Simple Example using Stock Data

Let’s start with a simple Time
Series example. First we will need to connect to our Oracle Database using ORE.
Then, using the package TTR, we will access Oracle stock data from the Yahoo data
service, from January 1, 2008 to January 1, 2013, and push it to the database.

# Load the ORE library and connect to Oracle Database
library(ORE)
ore.connect("myuser","mysid","myserver","mypass",port=1521,all=TRUE)

library(TTR)

# Get data in XTS format
xts.orcl <- getYahooData("ORCL", 20080101, 20130101)

# Convert it to a data frame and get the date
# Make the date the index
df.orcl <- data.frame(xts.orcl)
df.orcl$date <- (data.frame(date=index(xts.orcl))$date)

# Create/overwrite data in Oracle Database
# in a table called ORCLSTOCK
ore.drop(table="ORCLSTOCK")
ore.create(df.orcl, table="ORCLSTOCK")

# IMPORTANT STEP!!!
# Ensure indexing is kept by date
rownames(ORCLSTOCK) <- ORCLSTOCK$date

# Ensure the data is in the DB
ore.ls()

# Review column names, data statistics and
# print a sample of the data
names(ORCLSTOCK)

>names(ORCLSTOCK)
 [1] "Open"        "High"        "Low"         "Close"       "Volume"
 [6] "Unadj.Close" "Div"         "Split"       "Adj.Div"     "date"

summary(ORCLSTOCK$Close)

>summary(ORCLSTOCK$Close)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.36   20.53   24.22   24.79   29.70   35.73

head(ORCLSTOCK)

>head(ORCLSTOCK)
                        Open     High      Low    Close   Volume
2008-01-02 01:00:00 21.74414 22.00449 21.58022 21.68629 44360179
2008-01-03 01:00:00 21.62843 22.28413 21.62843 22.28413 43600532
2008-01-04 01:00:00 21.95628 22.06235 21.01130 21.24272 46391263
2008-01-07 01:00:00 21.17523 21.67664 21.01130 21.45486 41527032
2008-01-08 01:00:00 21.44522 21.52236 20.38453 20.39417 45155398
2008-01-09 01:00:00 20.57738 20.91487 20.39417 20.83773 49750304
                    Unadj.Close Div Split Adj.Div       date
2008-01-02 01:00:00       22.49  NA    NA      NA 2008-01-02
2008-01-03 01:00:00       23.11  NA    NA      NA 2008-01-03
2008-01-04 01:00:00       22.03  NA    NA      NA 2008-01-04
2008-01-07 01:00:00       22.25  NA    NA      NA 2008-01-07
2008-01-08 01:00:00       21.15  NA    NA      NA 2008-01-08
2008-01-09 01:00:00       21.61  NA    NA      NA 2008-01-09

Pull data from the database for a simple plot

# Pull data from Oracle Database (only the necessary columns)
orcl <- ore.pull(ORCLSTOCK[,c("date","Close","Open","Low","High")])

# Simple plot with base libraries - Closing
plot(orcl$date, orcl$Close, type="l", col="red", xlab="Date", ylab="US$",
     main="Base plot: Daily ORACLE Stock Closing points")

# Simple plot with base libraries - Other Series
plot(orcl$date, orcl$Open, type="l", col="blue", xlab="Date", ylab="US$",
     main="Base plot: Daily ORACLE Stock: Open/High/Low points")
lines(orcl$date, orcl$High, col="green")
lines(orcl$date, orcl$Low, col="orange")
legend("topleft", c("Opening","High","Low"),
       col=c("blue","green","orange"), lwd=2, title="Series", bty="n")

A different plot option, using the package xts

library(xts)

# Pull data from Oracle Database (only the necessary columns)
orcl <- ore.pull(ORCLSTOCK[,c("date","Close","Open","Low","High")])

# Convert data to Time Series format
orcl.xts <- as.xts(orcl, order.by=orcl$date, dateFormat="POSIXct")

# Plot original series
plot(orcl.xts$Close, major.ticks='months', minor.ticks=FALSE,
     main="Time Series plot: Daily ORACLE Stock Closing points", col="red")

Simple Time Series: Moving Average Smoothing

We might be tempted to call functions like the Smoothing Moving Average from
open-source CRAN packages against Oracle Database tables, but those packages do
not know what to do with an "ore.frame". For that process to work correctly, we
can either load the data locally or send the process for remote execution on
the Database Server by using Embedded R Execution.

We will also explore the built-in Moving Average process from ore.rollmean() as
a third alternative.

ALTERNATIVE 1 - The first example pulls the data from Oracle Database into a ts
(time series) object first, for a Client-side smoothing process.
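A minimal sketch of this client-side approach, assuming the ORCLSTOCK proxy object created above and the TTR package installed locally:

# Pull the needed columns to the client and order by date
orcl <- ore.pull(ORCLSTOCK[,c("date","Close")])
orcl <- orcl[order(as.Date(orcl$date, format="%Y-%m-%d")),]

# Convert to a ts object and smooth with a 30-day simple moving average
library(TTR)
orcl.ts   <- ts(orcl$Close, frequency=365, start=c(2008,1))
orcl.ma30 <- ts(SMA(orcl.ts, n=30), frequency=365, start=c(2008,1))

# Plot the original and smoothed series
plot(orcl$date, orcl.ts, type="l", col="red", xlab="Date", ylab="US$",
     main="ORCL Stock Close CLIENT-side Smoothed Series n=30 days")
lines(orcl$date, orcl.ma30, col="blue")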

ALTERNATIVE 2 – In this alternative, we will use a Server-side example for
running the Smoothing via Moving Average, without bringing all data to the
client. Only the result is brought locally for plotting. Remember that the TTR
package has to be installed on the Server in order to be called.

# Server execution call using ore.tableApply
# Result is an ore.list that remains in the database until needed
sv.orcl.ma30 <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect = TRUE,
    function(dat) {
      library(TTR)
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      list(res1 <- ts(ordered$Close, frequency=365, start=c(2008,1)),
           res2 <- ts(SMA(res1, n=30), frequency=365, start=c(2008,1)),
           res3 <- ordered$date)
    }
  );

# Bring the results locally for plotting
local.orcl.ma30 <- ore.pull(sv.orcl.ma30)

# Plot two series side by side
# (the third element of the list is the date)
plot(local.orcl.ma30[[3]], local.orcl.ma30[[1]], type="l",
     col="red", xlab="Date", ylab="US$",
     main="ORCL Stock Close SERVER-side Smoothed Series n=30 days")

# Add smoothed series
lines(local.orcl.ma30[[3]], local.orcl.ma30[[2]], col="blue", type="l")

# Add legend
legend("topleft", c("Closing","Server MA(30) of Closing"),
       col=c("red","blue"), lwd=2, title = "Series", bty="n")

ALTERNATIVE 3 – In this alternative we will use a Server-side example with
the computation of Moving Averages using the native ORE in-Database functions,
without bringing data to the client. Only the result is brought locally for
plotting.

Just one line of code is needed to generate an in-Database computation of
Moving Averages and the creation of a new VIRTUAL column in the Oracle
Database. We will call this new column rollmean30.

We will use the function ore.rollmean(). The option align="right" makes the
MA look at only the past k days (30 in this case), or fewer, depending on the
point in time. This creates a small difference between this method and the
previous methods at the beginning of the series, since ore.rollmean() can
calculate the first sets of days using the smaller sets of data available,
while the other methods discard this data.
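A sketch of that one line, assuming the window size is given by k and the alignment by align as described above:

# In-database moving average, added as a new virtual column of the ore.frame
ORCLSTOCK$rollmean30 <- ore.rollmean(ORCLSTOCK$Close, k=30, align="right")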

Now that we have learned how to
execute these processes using Embedded R, we can start using other
methodologies required for Time Series using the same Server-side computation
and local plotting.

It is typical for an analyst to try to understand a Time Series better by
looking at some of the basic diagnostics, like the Seasonal Decomposition of
Time Series by Loess. These can be achieved by using the stl() command in the
following process:
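A sketch of that process, following the same server-side pattern used above (stl() runs in the database, and the decomposition is pulled locally for plotting):

sv.orcl.stl <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
    function(dat) {
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      ts.orcl <- ts(ordered$Close, frequency=365, start=c(2008,1))
      stl(ts.orcl, s.window="periodic")
    })

local.orcl.stl <- ore.pull(sv.orcl.stl)
plot(local.orcl.stl, main="Server-side STL Decomposition for Series ORCL")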

Another typical set of diagnostic
charts includes Autocorrelation and Partial Autocorrelation function plots.
These can be achieved by using the acf() command with the proper options in
Embedded R Execution, so computations happen at the Oracle Database server:
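A sketch of the server-side computation that produces the local.orcl.acf list used in the plot calls below; the object names follow the pattern of the earlier examples:

sv.orcl.acf <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
    function(dat) {
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      ts.orcl <- ts(ordered$Close, frequency=365, start=c(2008,1))
      list(acf(ts.orcl, plot=FALSE),
           acf(ts.orcl, type="partial", plot=FALSE))
    })

local.orcl.acf <- ore.pull(sv.orcl.acf)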

plot(local.orcl.acf[[1]], main="Server-side ACF Analysis for Series ORCL",
     col="blue", lwd=2)

plot(local.orcl.acf[[2]], main="Server-side PACF Analysis for Series ORCL",
     col="blue", lwd=5)

Simple Exponential Smoothing

Using the popular package “forecast”,
we will use the ses() function to calculate a 90-day horizon (h=90) into the
future, using the option criterion=MSE for the model. The package forecast needs
to be installed on the Oracle Database server R engine.

Then, we will bring the resulting
model locally for plotting. Remember to load the library “forecast” locally as
well, to be able to interpret the meaning of the ses() output when it’s brought
locally.
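A sketch of the ses() call, mirroring the holt() example shown next (the criterion option mentioned above is omitted here; initial="optimal" is used instead):

# Execute ses() call in the server
sv.orcl.ses <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
    function(dat) {
      library(forecast)
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      ts.orcl <- ts(ordered$Close, frequency=365, start=c(2008,1))
      ses(ts.orcl, h=90, initial="optimal")
    })

# Bring the resulting model locally for plotting
library(forecast)
local.orcl.ses <- ore.pull(sv.orcl.ses)
plot(local.orcl.ses, col="blue", fcol="red",
     main="ORCL Stock Close with Server-side SES Forecast")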

Using the popular package “forecast”,
we will use the holt() function to calculate a 90-day horizon (h=90) into the
future, requesting confidence intervals of 80 and 95%. Again, the
package “forecast” needs to be installed on the Oracle Database server R engine.

Then, we will bring the resulting
model locally for plotting. Remember to load the library forecast locally as
well, to be able to interpret the meaning of the holt() output when it’s
brought locally.

# Execute holt() call in the server
sv.orcl.ets <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
    function(dat) {
      library(forecast)
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      ts.orcl <- ts(ordered$Close, frequency=365, start=c(2008,1))
      res <- holt(ts.orcl, h=90, level=c(80,95), initial="optimal")
    }
  );

# Get resulting model from the server
# Since remote object contains a Holt Exponential Smoothing
# model from package forecast, load package locally as well
library(forecast)
local.orcl.ets <- ore.pull(sv.orcl.ets)
plot(local.orcl.ets, col="blue", fcol="red",
     main="ORCL Original Series Stock Close with Server-side Holt Forecast")

ARIMA – Auto-Regressive Integrated Moving Average

There are at least two options for fitting an ARIMA model to a Time Series. One
option is to use the package “forecast”, which allows for automatic ARIMA
fitting (auto.arima) to find the best parameters possible based on the series.

For more advanced users, the arima()
function in the “stats” package itself allows for choosing the model
parameters.

# ARIMA models on the server using auto.arima() from package forecast
arimaModel <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
    FUN = function(dat){
      # load forecast library to use auto.arima
      library(forecast)
      # sort the table into a temp file by date
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      # convert column into a Time Series
      # format ts(...) and request creation of an automatic
      # ARIMA model auto.arima(...)
      res <- auto.arima(ts(ordered$Close, frequency=365, start=c(2008,1)),
                        stepwise=TRUE, seasonal=TRUE)
    })

# Alternative using the arima() from package "stats"
arimaModel <-
  ore.tableApply(ORCLSTOCK[,c("date","Close")], ore.connect=TRUE,
    FUN = function(dat){
      # sort table into a temp file by date
      ordered <- dat[order(as.Date(dat$date, format="%Y-%m-%d")),]
      # convert column into a Time Series
      # format ts(...) and request creation of a specific
      # ARIMA model using arima(), for example an ARIMA(2,1,2)
      res <- arima(ts(ordered$Close, frequency=365, start=c(2008,1)),
                   order = c(2,1,2))
    })

# Load forecast package locally to use the model
# for plotting and producing forecasts
library(forecast)

# Show remote resulting Time Series model
> arimaModel
Series: ts(ordered$Close, frequency = 365, start = c(2008, 1))
ARIMA(2,1,0)

Coefficients:
          ar1      ar2
      -0.0935  -0.0192
s.e.   0.0282   0.0282

sigma^2 estimated as 0.2323:  log likelihood = -866.77
AIC=1739.55   AICc=1739.57   BIC=1754.96

# Get remote model using ore.pull for local prediction and plotting
local.arimaModel <- ore.pull(arimaModel)

# Generate forecasts for the next 15 days
fore.arimaModel <- forecast(local.arimaModel, h=15)

#
Use the following option if you need to remove scientific notation of

Wednesday Jun 12, 2013

R users have a few choices of how to connect to their Oracle
Database. The most commonly seen include: RODBC, RJDBC, and ROracle. However,
these three packages have significantly different performance and scalability
characteristics which can greatly impact your application development. In this
blog, we’ll discuss these options and highlight performance benchmark results
on a wide range of data sets.

If you use ROracle, we'd like to hear about your experience. Please take this brief survey.

By way of introduction, RODBC is an R package that
implements ODBC database connectivity. There are two groups of functions: the
largely internal odbc* functions implement low-level access to the
corresponding ODBC functions having a similar name, and the higher level sql*
functions that support read, save, copy, and manipulation of data between R
data.frame objects and database tables. Here is an example using RODBC:
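A minimal sketch of the RODBC pattern, paralleling the ROracle example below (the DSN and credentials are illustrative):

library(RODBC)

# connect via an ODBC data source name (DSN)
con <- odbcConnect("OracleDSN", uid="rquser", pwd="rquser")

# write an R data.frame to a table, query it, and read it back
sqlSave(con, test_table, tablename="TEST_TABLE", rownames=FALSE)
sqlQuery(con, "select count(*) from TEST_TABLE")
d <- sqlFetch(con, "TEST_TABLE")
odbcClose(con)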

The R package RJDBC is an implementation of the R DBI
package – database interface – that uses JDBC as the back-end connection to the
database. Any database that supports a JDBC driver can be used in connection
with RJDBC. Here is an example using RJDBC:
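A minimal sketch of the RJDBC pattern (the driver jar location and connection string are illustrative):

library(RJDBC)

# load the Oracle JDBC driver
drv <- JDBC("oracle.jdbc.OracleDriver", classPath="/path/to/ojdbc6.jar")
con <- dbConnect(drv, "jdbc:oracle:thin:@myhost:1521:mysid", "rquser", "rquser")

dbWriteTable(con, "TEST_TABLE", test_table)
dbGetQuery(con, "select count(*) from TEST_TABLE")
d <- dbReadTable(con, "TEST_TABLE")
dbDisconnect(con)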

The ROracle package is an implementation of the R DBI
package that uses Oracle OCI for high performance and scalability with Oracle
Databases. It requires Oracle Instant Client or Oracle Database Client to be
installed on the client machine. Here is an example using ROracle:

library(ROracle)

drv <- dbDriver("Oracle")
con <- dbConnect(drv, "rquser", "rquser")

dbWriteTable(con, "TEST_TABLE", test_table)
dbGetQuery(con, "select count(*) from TEST_TABLE")
d <- dbReadTable(con, "TEST_TABLE")

dbDisconnect(con)

Notice that since both RJDBC and ROracle implement the DBI
interface, their code is the same except for the driver and connection details.

To compare these interfaces, we prepared tests along several
dimensions:

Where an in-database capability as provided by Oracle R
Enterprise is not available, typical usage is to pull data to the R client for
subsequent processing. In Figure 1, we compare the execution time to pull 10, 100, and 1000 columns of data
from 1K, 10K, 100K, and 1M rows for BINARY_DOUBLE data on a log-log scale.
Notice that RJDBC does not scale to 100 columns x 1M rows, or above 1000 columns x
100K records. While RODBC and ROracle both scale to these volumes, ROracle is
consistently faster than RODBC: up to 2.5X faster. For RJDBC, ROracle is up to
79X faster.

Figure 1: Comparison of RJDBC, RODBC, and ROracle for BINARY_DOUBLE for
Select *

In Figure 2, we provide the range of results for RODBC,
ROracle, and RJDBC across all data types. Notice that only ROracle provides the
full range of scalability while providing superior performance in general.

ROracle is virtually always faster than RODBC: NUMBER data
up to 2.5X faster, VARCHAR2 data up to 142X faster, and time stamp data up to
214X faster. RODBC fails to process 1000 columns at 1M rows.

For RJDBC, ROracle is up to 13X faster on NUMBER data, 79X
faster on binary double data, and 3X faster for VARCHAR2 data (excluding the 25X speedup over the
smallest data set). Note that RODBC and RJDBC have a limit of 255 characters on
the length of VARCHAR2 columns. TIMESTAMP data is the one area where RJDBC
initially shines, but then fails to scale to larger data sets.

Figure 2: Comparing the three interfaces for select * from
<table>

Data set sizes represented in megabytes are captured in
Table 1 for all data types. With only minor variation, the data sizes are the
same across data types.

Table 1: Dataset sizes in megabytes

Creating database
tables from an R data.frame

Data or results created in R may need to be written to a
database table. In Figure 3, we compare the execution time to create tables
with 10, 100, and 1000 columns of data with 1K, 10K, 100K, and 1M rows for
BINARY_DOUBLE. Notice that in all three cases, RJDBC is slowest and does not
scale. RJDBC does not support the NUMBER or BINARY_DOUBLE data types, but uses
FLOAT(126) instead. ROracle scaled across the remaining data types, while RODBC
and RJDBC were not tested.

ROracle is 61X faster than RODBC for 10 columns x 10K
rows, with a median of 5X faster across all data sets. Compared to RJDBC, ROracle is 630X faster
on 10 columns x 10K rows, with a median of 135X faster across all data sets.
RJDBC did not scale to the 1M row data sets.

Depending on the application, any sub-second response time
may be sufficient. However, as depicted in Figure 4, ROracle introduces minimal
time to establish a database connection. ROracle is nearly 10X faster than RJDBC
and 1.6X faster than RODBC.

Figure 4: Database connection times for ROracle, RODBC, and RJDBC

In summary, for maximal performance and scalability, ROracle
can support a wide range of application needs. RJDBC has significant
limitations in both performance and scalability. RODBC can be more difficult to
configure on various platforms and while it largely scales to the datasets
tested here, its performance lags behind ROracle.

If you use ROracle, we'd like to hear about your experience. Please take this brief survey.

All tests were
performed on a 16 processor machine with 4 core Intel Xeon E5540 CPUs @ 2.53
GHz and 74 GB RAM. Oracle Database was version 11.2.0.4. For JDBC, the
following was modified before installing rJava.

rJava_0.9-4.tar.gz\rJava_0.9-4.tar\rJava\jri\bootstrap\JRIBootstrap.java
was modified to use 2GB:

In this white paper, we begin with "Beyond the Laptop" exploring the ability to run R code in the database, working with CRAN packages at the database server, operationalizing R analytics, and leveraging Hadoop from the comfort of the R language and environment.

Excerpt: "Oracle Advanced Analytics and Oracle R Connector for Hadoop combine the advantages of R with the power
and scalability of Oracle Database and Hadoop. R programs and libraries can be used in conjunction with these
database assets to process large amounts of data in a secure environment. Customers can build statistical models
and execute them against local data stores as well as run R commands and scripts against data stored in a secure
corporate database."

The white paper continues with three use cases involving Oracle Database and Hadoop: analyzing credit risk, detecting fraud, and preventing customer churn. The conclusion: providing analytics for the enterprise based on the R environment is here!

Tuesday May 28, 2013

Oracle R Enterprise provides a comprehensive,
database-centric environment for end-to-end analytical processes in R, with
immediate deployment to production environments. This message really resonates
with our customers who are interested in executing R functions on
database-resident data while seamlessly leveraging Oracle Database as a high-performance
computing (HPC) environment. The ability to develop and operationalize R
scripts for analytical applications in one step is quite appealing.

One frequently asked question is how to convert existing R code that accesses data in flat files or
the database to use Oracle R Enterprise. In this blog post, we talk about a few
scenarios and how to begin a conversion from existing R code to using Oracle R
Enterprise.

Consider the following scenarios:

Scenario 1: A stand-alone R script that generates its own
data and simply returns a result. Data is not obtained from the file system or
database. This may result from performing simulations where data is
dynamically generated, or perhaps data accessed from a URL on the internet.

Scenario 2: An R script that loads data from a flat file
such as a CSV file, performs some computations in R, and then writes the result
back to a file.

Scenario 3: An R script that loads data from a database
table, via one of the database connector packages like RODBC, RJDBC, or
ROracle, and writes a result back to the database – using SQL statements or
package functions.

Scenario 1

A stand-alone R script might normally be run on a user’s
desktop, invoked as a cron job, or even via Java to spawn an R engine and
retrieve the result, but we’d like to operationalize its execution as part of a
database application, invoked from SQL. Here’s a simple script to illustrate
the concept of converting such a script to be executed at the database server
using ORE’s embedded R execution. The script generates a data.frame with some random columns,
performs summary on that data
and returns the summary statistics, which are represented as an R table.

# generate data
set.seed(1)
n  <- 1000
df <- 3
x  <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

# perform some analysis
res <- summary(x)

# return the result
res

To convert this to use ORE, create a function with
appropriate arguments and body, for example:

myFunction1 <- function (n = 1000, df = 3, seed = 1) {
  set.seed(seed)
  x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))
  res <- summary(x)
  res
}

Next, load the ORE packages and connect to Oracle
Database using the ore.connect function. Using the all argument set to TRUE loads metadata for all the
tables and views in that schema. We then store the function in the R script
repository, invoking it via ore.doEval.

# load ORE packages and connect to Oracle Database
library(ORE)
ore.connect("schema","sid","hostname","password",port=1521, all=TRUE)

# load function into R script repository
ore.scriptDrop("myFunction-1")
ore.scriptCreate("myFunction-1", myFunction1)

# invoke using embedded R execution at the database server
ore.doEval(FUN.NAME="myFunction-1")

> ore.doEval(FUN.NAME="myFunction-1")
       a                 b                  c
 Min.   :   1.0   Min.   :-3.00805   Min.   : 0.03449
 1st Qu.: 250.8   1st Qu.:-0.69737   1st Qu.: 1.27386
 Median : 500.5   Median :-0.03532   Median : 2.36454
 Mean   : 500.5   Mean   :-0.01165   Mean   : 3.07924
 3rd Qu.: 750.2   3rd Qu.: 0.68843   3rd Qu.: 4.25994
 Max.   :1000.0   Max.   : 3.81028   Max.   :17.56720

Of course, we’re using default values here. To provide
different arguments, change the invocation with arguments as follows:

ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)

> ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2)
       a               b                  c
 Min.   :  1.0   Min.   :-2.72182   Min.   : 0.1621
 1st Qu.:125.8   1st Qu.:-0.65346   1st Qu.: 2.6144
 Median :250.5   Median : 0.04392   Median : 4.4592
 Mean   :250.5   Mean   : 0.06169   Mean   : 5.0386
 3rd Qu.:375.2   3rd Qu.: 0.79096   3rd Qu.: 6.8467
 Max.   :500.0   Max.   : 2.88842   Max.   :17.0367

Having successfully invoked this from the R client (my
laptop), we can now invoke it from SQL. Here, we retrieve the summary result, which
is an R table, as an XML string.

select *
  from table(rqEval(NULL, 'XML', 'myFunction-1'));

The
result can be viewed from SQL Developer.

The
following shows the XML output in a more structured manner.

What if we wanted to get the result to appear as a SQL
table? Since the current result is an R table
(an R object), we need to convert this to a data.frame
to return it. We’ll make a few
modifications to “myFunction-1” above. Most notable is the need to convert the table object in res to a data.frame. There are a variety of
ways to do this.

myFunction2 <- function (n = 1000, df = 3, seed = 1) {
  # generate data
  set.seed(seed)
  x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))

  # perform some analysis
  res <- summary(x)

  # convert the table result to a data.frame
  res.df  <- as.matrix(res)
  res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))
  names(res.sum) <- c('a','b','c')
  res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")
  res.sum <- res.sum[,c(4,1:3)]
  res.sum
}

# load function into R script repository
ore.scriptDrop("myFunction-2")
ore.scriptCreate("myFunction-2", myFunction2)

We’ll now modify the SQL statement to specify the format of
the result.

This type of result could be incorporated into any SQL
application accepting table or view input from a SQL query. That is particularly
useful in combination with OBIEE dashboards via an RPD.

Scenario 2

If you’ve been loading data from a flat file, perhaps a CSV
file, your R code may look like the following, where it builds a
model and writes that model to a file for future use, perhaps in scoring. It also generates a graph of the clusters
highlighting the individual points, colored by their cluster id, with the
centroids indicated with a star.

# read data
setwd("D:/datasets")
dat <- read.csv("myDataFile.csv")

# build a clustering model
cl <- kmeans(dat, 2)

# write model to file
save(cl, file="myClusterModel.dat")

# create a graph and write it to a file
pdf("myGraphFile.pdf")
plot(dat, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex=2)
dev.off()

The resulting PDF file contains the following image.

To convert this script for use in ORE, there are several options.
We’ll explore two: the first involving minimal change to use embedded R
execution, and the second leveraging in-database techniques. First, we’ll want
the data we used above in variable dat
to be loaded into the database.

# create the table using the R data.frame, resulting in an ore.frame named MY_DATA
ore.create(dat, "MY_DATA")

# assign the ID column as the row.names of the ore.frame
row.names(MY_DATA) <- MY_DATA$ID

In the first example, we’ll use embedded R execution and pass
the data to the function via ore.tableApply.
We’ll generate the graph, but simply display it within the function to allow embedded
R execution to return the graph as a result. (Note we could also write the
graph to a file in any directory accessible to the database server.) Instead of
writing the model to a file, which requires keeping track of its location, as
well as worrying about backup and recovery, we store the model in the database R
datastore using ore.save. All this
requires minimal change. As above, we could store the function in the R script
repository and invoke it by name – both from R and SQL. In this example, we
simply provide the function itself as argument.

myClusterFunction1 <- function(x) {
  cl <- kmeans(x, 2)
  ore.save(cl, name="myClusterModel", overwrite=TRUE)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:2, pch = 8, cex=2)
  TRUE
}

ore.tableApply(MY_DATA[,c('x','y')], myClusterFunction1,
               ore.connect=TRUE, ore.png.height=700, ore.png.width=700)

The ore.tableApply
function projects the x and y columns of MY_DATA as input and also specifies ore.connect as TRUE since we are using
the R datastore, which requires a database connection. Optionally, we can
specify control arguments to the PNG output. In this example, these are the height
and width of the image.

For the second example, we convert this to leverage the ORE Transparency
Layer. We’ll use the in-database K-Means algorithm and save the model in a
datastore named “myClusterModel”, as we did above. Since ore.odmKMeans doesn’t automatically
assign cluster ids (since the data may be very large or the assignments may not be required), the
scoring is done separately. Note, however, that the prediction results also
exist in the database as an ore.frame.
To ensure ordering, we also assign row.names
to the ore.frame pred. Lastly, we create the plot.
Coloring the nodes requires pulling the cluster assignments; however, the
points themselves can be accessed from the ore.frame.
The centroid points are obtained from cl$centers2
of the cluster model.

# build a clustering model in-database
cl <- ore.odmKMeans(~., MY_DATA, 2, auto.data.prep=FALSE)

# save model in database R datastore
ore.save(cl, name="myClusterModel", overwrite=TRUE)

# generate predictions to assign each row a cluster id, supplement with original data
pred <- predict(cl, MY_DATA, supp=c('x','y','ID'), type="class")

# assign row names to ensure ordering of results
row.names(pred) <- pred$ID

# create the graph
plot(pred[,c('x','y')], col = ore.pull(pred$CLUSTER_ID))
points(cl$centers2[,c('x','y')], col = c(2,3), pch = 8, cex=2)

We can also combine using the transparency layer within an
embedded R function. But we’ll leave that as an exercise to the reader.

Scenario 3

In this last scenario, the data already exists in the
database, and one of the database interface packages, such as RODBC, RJDBC, or
ROracle, is used to retrieve data from and write data to the database. We’ll
illustrate this with ROracle, but the same holds for the other two packages.

Converting this to ORE is straightforward. We’re already
connected to the database using ore.connect
from previous scenarios, so the existing table MY_RANDOM_DATA
is already available in the environment as an ore.frame. Executing ore.ls lists this table in the result,
so we can just start using it.

> ore.ls(pattern="MY_RAND")
[1] "MY_RANDOM_DATA"

# no need to retrieve the data, use the transparency layer to compute summary
res <- with(MY_RANDOM_DATA, summary(MY_RANDOM_DATA[a > 100,]))

# convert the table result to a data.frame for output as table
res.df  <- as.matrix(res)
res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))
names(res.sum) <- c('a','b','c')
res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")
res.sum <- res.sum[,c(4,1:3)]

# create the database table
ore.create(res.sum, "SUMMARY_STATS")
SUMMARY_STATS

As we did in previous scenarios, this script can also be
wrapped in a function and used in embedded R execution. This too is left as an
exercise to the reader.

Summary

As you can see from the three scenarios discussed here, converting
a script that accesses no external data, accesses and manipulates file data, or
accesses and manipulates database data can be accomplished with a few strategic
modifications. More involved scripts, of course, may require additional
manipulation. For example, if the SQL query performs complex joins and
filtering, along with derived column creation, the user may want to convert
this SQL to the corresponding ORE Transparency Layer code, thereby eliminating
reliance on SQL. But that’s a topic for another post.

Wednesday Apr 17, 2013

Overhauling analytics processes is becoming a recurring
theme among customers. A major telecommunication provider recently
embarked on overhauling their analytics process for customer surveys. They had three
broad technical goals:

Provide an agile
environment that empowers business analysts to test hypotheses based on
survey results

The ultimate goal is to derive greater value from survey
research that drives measurable improvements in survey service delivery, and as
a result, overall customer satisfaction.

This provider chose Oracle Advanced Analytics (OAA) to power
their survey research. Survey results and analytics are maintained in Oracle
Database and delivered via a parameterized BI dashboard. Both the database and
BI infrastructure are standard components in their architecture.

A parameterized BI dashboard enables analysts to create
samples for hypothesis testing by filtering respondents to a survey question
based on a variety of filtering criteria. This provider required the ability to
deploy a range of statistical techniques depending on the survey variables,
level of measurement of each variable, and the needs of survey research
analysts.

Oracle Advanced Analytics offers a range of in-database
statistical techniques complemented by a unique architecture supporting
deployment of open source R packages in-database to optimize data transport to
and from database-side R engines. Additionally, depending on the nature of
functionality in such R packages, it is possible to leverage data-parallelism
constructs available as part of in-database R integration. Finally, all OAA
functionality is exposed through SQL, the ubiquitous language of the IT
environment. This enables OAA-based solutions to be readily integrated with BI
and other IT technologies.

The survey application noted above has been in production
for 3 months. It supports a team of 20 business analysts and has already begun
to demonstrate measurable improvements in customer satisfaction.

In the rest of this blog, we explore the range of
statistical techniques deployed as part of this application.

At the heart of survey research is hypothesis testing. A completed customer satisfaction survey
contains data used to draw conclusions about the state of the world. In the survey
domain, hypothesis testing means comparing the significance of answers to specific
survey questions across two distinct groups of customers - such groups are
identified based on knowledge of the business and technically specified through
filtering predicates.

Hypothesis testing sets up the world as consisting of 2
mutually exclusive hypotheses:

a) Null hypothesis -
states that there is no difference in satisfaction levels between the 2 groups
of customers

b) Alternate
hypothesis - states that there is a significant difference in satisfaction levels
between the 2 groups of customers

Obviously only one of these can be true and the true-ness is
determined by the strength, probability, or likelihood of the null hypothesis
over the alternate hypothesis. Simplistically, the degree of difference
between, e.g., the average score from a specific survey question across two
customer groups could provide the necessary evidence in helping decide which
hypothesis is true.

In practice the process of providing evidence to make a
decision involves having access to a range of test statistics – a number
calculated from each group that helps determine the choice of null or alternate
hypothesis. A great deal of theory, experience, and business knowledge goes
into selecting the right statistic based on the problem at hand.

The t-statistic (available in-database) is a fundamental
function used in hypothesis testing that helps understand the differences in
means across two groups. When the t-values across 2 groups of customers for a
specific survey question are extreme then the alternative hypothesis is likely
to be true. It is common to set a critical value that the observed t-value
should exceed to conclude that the satisfaction survey results across the two
groups are significantly different. Other similar statistics available
in-database include F-test, cross tabulation (frequencies of various response
combinations captured as a table), related hypothesis testing functions such as
chi-square functions, Fisher's exact
test, Kendall's coefficients, correlation coefficients and a range of lambda
functions.
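As a concrete reference point, the common unequal-variance form of the two-sample t-statistic, comparing group means $\bar{x}_1$ and $\bar{x}_2$ with sample variances $s_1^2, s_2^2$ and group sizes $n_1, n_2$, is

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$

The in-database implementation may differ in details such as the variance assumption.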

If an analyst desires to compare across more than 2 groups
then analysis of variance (ANOVA) is a collection of techniques that is commonly
used. This is an area where the R package ecosystem is rich with several proven
implementations. The R stats package
has implementations of several test statistics, and its glm function allows analysis of count data
common in survey results, including building Poisson and log-linear models. R's MASS package implements a popular
survey analysis technique called iterative
proportional fitting. R's survey
package has a rich collection of features
(http://faculty.washington.edu/tlumley/survey/).

The provider was specifically interested in one function in
the survey package - raking (also known as sample balancing) - a process that assigns
a weight to each customer that responded to a survey such that the weighted
distribution of the sample is in very close agreement with other customer attributes,
such as the type of cellular plan, demographics, or average bill amount. Raking
is an iterative process that uses the sample design weight as the starting
weight and terminates when a convergence is achieved.
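A sketch of what raking looks like with the survey package; the variable names and margins are illustrative, and in the application this would run inside an embedded R script:

library(survey)

# resp: one row per respondent, with a design weight and attributes to rake on
design <- svydesign(ids=~1, weights=~design_wt, data=resp)

# population margins for two attributes (counts are illustrative)
plan.margin   <- data.frame(plan_type=c("prepaid","postpaid"), Freq=c(40000, 60000))
region.margin <- data.frame(region=c("north","south","east","west"),
                            Freq=c(25000, 25000, 25000, 25000))

# iterative proportional fitting of respondent weights to the margins
raked <- rake(design,
              sample.margins     = list(~plan_type, ~region),
              population.margins = list(plan.margin, region.margin))

# the adjusted weights
summary(weights(raked))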

For this survey application, R scripts that expose a wide
variety of statistical techniques - some in-database accessible through the
transparency layer in Oracle R Enterprise and some in CRAN packages - were
built and stored in the Oracle R Enterprise in-database R script repository.
These parameterized scripts accept various arguments that identify samples of
customers to work with as well as specific constraints for the various
hypothesis test functions. The net result is greater agility since the business
analyst determines both the set of samples to analyze as well as the
application of the appropriate technique to the sample based on the hypothesis
being pursued.

Monday Apr 15, 2013

If
you build an R lm model with a relatively
large number of rows, you may be surprised by just how large that lm model is and what impact it has on your
environment and application.

Why might you care about size? The most obvious is that the
size of R objects impacts the amount of RAM available for further R processing
or loading of more data. However, it also has implications for how much space
is required to save that model or the time required to move it around the
network. For example, you may want to move the model from the database server R
engine to the client R engine when using Oracle R Enterprise Embedded R
Execution. If the model is too large, you may encounter latency when trying to
retrieve the model or even receive the following error:

Perform summary component
access, like coefficients, inside the embedded R function and
return only what is needed

Save the model in a database R
datastore and manipulate that model at the database server to avoid
pulling it to the client

Reduce the size of the
model by eliminating large and unneeded components

In this blog post, we focus on the third approach and
look at the size of lm model
components, what you can do to control lm model size, and the implications for
doing so. With vanilla R, objects are the "memory" serving as the repository for repeatability. As a result, models
tend to be populated with the data used to build them to ensure model build repeatability.

When working with database tables, this "memory" is not needed
because
governance mechanisms are already in place to ensure either data
does not change or logs are available to know what changes took place.
Hence it is unnecessary
to store the data used to build the model into the model object.

Some of these components may appear deceptively small using
R’s object.size function. The following script builds an lm model to help reveal what R
reports for the size of various components. The examples use a sample of the ONTIME airline arrival and departure delays data set for domestic flights. The ONTIME_S data set is an ore.frame proxy object for data stored in an Oracle database and consists of 219932 rows and 26 columns. The R data.frame ontime_s is this same data pulled to the client R engine using ore.pull and is ~39.4MB.
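A sketch of the kind of build being discussed; the response column is an assumption, while the predictors follow from the coefficients shown later in this post:

# pull the ONTIME_S data to the client and build the lm model there
ontime_s <- ore.pull(ONTIME_S)
lm.fit.1 <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = ontime_s)
object.size(lm.fit.1)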

Note: The results
reported below use R 2.15.2 on Windows. Serialization of some components in the lm model
has been improved in R 3.0.0, but the implications are the same.

Using
the object.size function on the
resulting model, the size is about 55MB. If we are only scoring data with this model, it
seems like a lot of bloat for the few coefficients needed for
scoring. Also, moving this object over
a network will not be instantaneous. But is this the true size of the model?

A better way to determine just how big an object is, and
what space is actually required to store the model or time to move it across a
network, is the R serialize
function.

length(serialize(lm.fit.1,NULL))

[1] 65826324

Notice that the size reported by object.size is different from that of serialize – a difference of 11MB or ~20% greater.

What is taking up so much space? Let’s invoke object.size on each component of this lm model:

lapply(lm.fit.1, object.size)
$coefficients
424 bytes

$residuals
13769600 bytes

$effects
3442760 bytes

$rank
48 bytes

$fitted.values
13769600 bytes

$assign
56 bytes

$qr
17213536 bytes

$df.residual
48 bytes

$na.action
287504 bytes

$xlevels
192 bytes

$call
1008 bytes

$terms
4432 bytes

$model
6317192 bytes

The components residuals,
fitted.values, qr, model, and even na.action are large. Do we need all
these components?

The lm function
provides arguments to control some aspects of model size. This can be done, for
example, by specifying model=FALSE
and qr=FALSE. However, as we saw
above, there are other components that contribute heavily to model size.
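A sketch of the reduced build, using the same assumed formula as above:

lm.fit.2 <- lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = ontime_s,
               model = FALSE, qr = FALSE)
length(serialize(lm.fit.2, NULL))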

The resulting serialized model size is down to about ~52MB,
which is not significantly smaller than the full model. The difference from the result reported by object.size is now ~20MB, or 39% smaller.

Does removing these components have any effect on the usefulness
of an lm model? We’ll explore this using four commonly used
functions: coef, summary, anova, and predict.
If we try to invoke summary on lm.fit.2, the following error results:

summary(lm.fit.2)
Error in qr.lm(object) : lm object does not have a proper 'qr' component.
 Rank zero or should not have used lm(.., qr=FALSE).

The same error results when we try to run anova. Unfortunately, the predict function also fails with the
error above. The qr component is
necessary for these functions. Function coef
returns without error.

coef(lm.fit.2)
 (Intercept)     DISTANCE     DEPDELAY
 0.225378249 -0.001217511  0.962528054

If only coefficients are
required, these settings may be acceptable. However, as we’ve seen, removing the model and qr components, while each is large, still leaves a large
model. The really large components appear
to be the effects, residuals, and fitted.values. We can explicitly nullify them to remove
them from the model.
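A sketch of nullifying these components; here lm.fit.3 starts from the reduced model above, which is an assumption about the exact starting point:

lm.fit.3 <- lm.fit.2
lm.fit.3$effects       <- NULL
lm.fit.3$residuals     <- NULL
lm.fit.3$fitted.values <- NULL

object.size(lm.fit.3)              # reports ~295KB
length(serialize(lm.fit.3, NULL))  # ~24MB, dominated by the terms component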

Thinking the model size should be small, we might be surprised
to see the results above. The function object.size reports ~295KB, but
serializing the model shows 24MB, a difference of 23.8MB or 98.8%. What happened?
We’ll get to that in a moment. First,
let’s explore what effect nullifying these additional components has on the
model.

To answer this, we'll turn model and qr back on and focus on effects, residuals, and fitted.values. If we nullify effects, the anova results are invalid, but the other results are fine. If we nullify residuals, summary cannot produce residual and coefficient statistics, and it also produces an odd F-statistic along with a warning.

If we nullify fitted.values, summary produces an invalid F-statistic, issuing the warning:

Warning message:

In mean.default(f) :
argument is not numeric or logical: returning NA

However, there are no adverse effects on the results of the other three functions.

Depending on what we need from our model, some of these components could be eliminated. But let's continue looking at each remaining component, not with object.size, but with serialize. Below, we use lapply to compute the serialized length of each model component. This reveals that the terms component is actually the largest, despite object.size reporting only 4432 bytes above.

as.matrix(lapply(lm.fit.3, function(x) length(serialize(x, NULL))))
             [,1]
coefficients 130
rank         26
assign       34
df.residual  26
na.action    84056
xlevels      55
call         275
terms        24004509

If we nullify the terms component, the model becomes quite compact. (By the way, if we simply nullify terms, then summary, anova, and predict all fail.) Why is the terms component so large? It turns out it carries an environment object as an attribute. That environment contains the variable dat, which holds the original data of 219932 rows and 26 columns. R's serialize function includes this environment, which is why the model is so large; object.size ignores it.
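A hedged sketch of how to confirm this and, if the formula no longer needs to be re-evaluated, detach the environment. The variable name dat and the use of the .Environment attribute are assumptions based on the description above, and stripping the environment can break functions such as predict that evaluate the formula again.

e <- attr(lm.fit.3$terms, ".Environment")   # environment captured by the formula
ls(e)                                        # expect to see 'dat' listed here
length(serialize(lm.fit.3$terms, NULL))      # ~24MB: the environment and its data come along
attr(lm.fit.3$terms, ".Environment") <- globalenv()   # drop the captured copy of the data
length(serialize(lm.fit.3$terms, NULL))      # now only a few hundred bytes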

As shown above, an lm model can become quite large. At least for some applications, several of these components may be unnecessary, allowing the user to significantly reduce the size of the model, the space required to save it, and the time needed to transport it. Relying on Oracle Database to store the data, instead of embedding it in the R model object, allows a further significant reduction in model size.

Friday Jan 18, 2013

Oracle R Distribution Performance
Benchmarks

Oracle R Distribution provides
dramatic performance gains with MKL

Using the widely recognized R-benchmark-25.R test script, we compared the performance of Oracle R Distribution with and without the dynamically loaded, high performance Math Kernel Library (MKL) from Intel. The benchmark results show that Oracle R Distribution is significantly faster with the high performance library loaded. R users can immediately gain performance over open source R when analyzing data on 64-bit architectures, leveraging parallel processing within the specific R functions that invoke computations performed by these high performance libraries.

The community-developed test consists of matrix calculations and functions, program control, matrix multiplication, Cholesky Factorization, Singular Value Decomposition (SVD), Principal Component Analysis (PCA), and Linear Discriminant Analysis (LDA). Such computations form a core component of many real-world problems, often taking the majority of compute time. The ability to speed up these computations means faster results for faster decision making.

This benchmark was executed on a 3-node cluster, with 24 cores at 3.07GHz
per CPU and 47 GB RAM, using Linux 5.5.
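For readers who want to reproduce the comparison, a minimal sketch follows. It assumes the R-benchmark-25.R script has been downloaded locally and that the MKL thread count (e.g., the MKL_NUM_THREADS environment variable) is set in the shell before launching R; neither step is described in this post.

Sys.getenv("MKL_NUM_THREADS")   # confirm the thread count set before R was started
source("R-benchmark-25.R")      # runs the matrix, program-control, and function tests and prints timings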

In the first graph, we see significant performance improvements. For example, SVD with ORD plus MKL executes 20 times faster using 4 threads, and 29 times faster using 8 threads. For Cholesky Factorization, ORD plus MKL is 18 and 30 times faster for 4 and 8 threads, respectively.

In the second graph, we focus on the three longer-running tests. Matrix multiplication is 88 and 166 times faster for 4 and 8 threads, respectively. PCA is 30 and 50 times faster, and LDA is over 3 times faster.

This level of performance improvement can significantly reduce application execution time and make interactive, dynamically generated results readily achievable. Note that ORD plus MKL not only impacts performance on the client side, but also when used in combination with R scripts executed using Oracle R Enterprise Embedded R Execution. Such R scripts, executing at the database server machine, reap these performance gains as well.

Tuesday Oct 02, 2012

Oracle Server Technologies Curriculum has just released the Oracle R Enterprise Tutorial Series, which is publicly available on the Oracle Learning Library (OLL). This 8-part interactive lecture series with review sessions covers Oracle R Enterprise 1.1 and an introduction to Oracle R Connector for Hadoop 1.1.

Monday Sep 17, 2012

In this podcast interview with Michael Kane, Data Scientist and Associate Researcher at Yale University, Michael discusses the R statistical programming language, computational challenges associated with big data, and two data analysis projects he conducted: one on the stock market "flash crash" of May 6, 2010, and one tracking the transportation routes of bird flu H5N1. Michael also worked with Oracle on Oracle R Enterprise, a component of the Advanced Analytics option to Oracle Database Enterprise Edition. In the closing segment of the interview, Michael comments on the relationship between the data analyst and the database administrator, and how Oracle R Enterprise provides secure data management, transparent access to data, and improved performance to facilitate this relationship.

Friday Apr 13, 2012

Oracle just released the latest update to Oracle R Enterprise, version 1.1. This release
includes the Oracle R Distribution (based on open source R, version 2.13.2), an improved server installation, and much more. The key new features include:

Extended Server Support: New support for Windows 32 and 64-bit server components, as well as continuing support for Linux 64-bit server components

Wednesday Apr 04, 2012

Oracle recently updated ROracle to version 1.1-2 on CRAN with enhancements and bug fixes. The major enhancements include the introduction of support for Oracle Wallet Manager and for datetime and interval types.

Oracle Wallet support in ROracle allows users to manage public key security from the client R session. Oracle Wallet allows passwords to be stored and read by Oracle Database, allowing safe storage of database login credentials. In addition, we added support for datetime and interval types when selecting data, which expands ROracle's support for date data.
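As a hedged sketch, a wallet-based connection might look like the following. The connect identifier mywallet_alias is hypothetical, and passing empty username and password strings to let the wallet supply the credentials is the typical convention, not something spelled out in this post.

library(ROracle)
drv <- dbDriver("Oracle")
# Credentials are resolved from the Oracle Wallet for this tnsnames.ora alias,
# so no username or password appears in the R script.
con <- dbConnect(drv, username = "", password = "", dbname = "mywallet_alias")
dbGetQuery(con, "SELECT sysdate FROM dual")   # DATE values come back using the new datetime support
dbDisconnect(con)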

See the ROracle NEWS for the complete list of updates.

We encourage ROracle users to post questions and provide feedback on the Oracle R Forum.

In addition to being a high performance database interface to Oracle Database from R for general use, ROracle supports database access for Oracle R Enterprise.

Monday Apr 02, 2012

This Oracle R Enterprise (ORE) tutorial, on embedded R execution, is the third in a series to help users get started using ORE. See these links for the first tutorial on the transparency layer and second tutorial on the statistics engine. Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

Embedded R Execution refers to the ability to execute an R script at the database server, which provides several benefits: spawning multiple R engines in parallel for data-parallel operations, more efficient data transfer between Oracle Database and the R engine, leveraging a likely more powerful server with more CPUs and greater RAM, scheduling automated jobs, and taking advantage of open source R packages at the database server. Data aggregates are computed in parallel, significantly reducing computation time, without requiring sophisticated configuration steps.

ORE provides two interfaces for embedded R execution: one for R and one for SQL. The R interface enables interactive execution at the database server from the client R engine, e.g., your laptop. It also provides transparency for passing R objects to, and returning R objects from, the executed script. In the R interface, ore.doEval schedules execution of the R code with the database-embedded R engine and returns the results back to the desktop for continued analysis. User-defined R functions can run in parallel on each row, on sets of rows, or on each group of rows given a grouping column. The first two cases are covered by ore.rowApply, the third by the ore.groupApply function. ore.indexApply provides parallel simulation capability by invoking the script the number of times specified by the user. The R interface returns results to the client as R objects that can be passed as arguments to other R functions.
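For instance, a minimal ore.doEval call might look like the following sketch, assuming an Oracle R Enterprise connection is already established; the function body is arbitrary and runs in an R engine spawned at the database server.

res <- ore.doEval(function(n = 1000) {
  # executes in a database-spawned R engine
  x <- rnorm(n)
  summary(x)
})
ore.pull(res)   # bring the returned R object back to the client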

The SQL interface enables interactive execution from any SQL interface, like SQL*Plus or SQL Developer, but it also enables R scripts to be included in production, database-based systems. To enable execution of an R script from SQL, ORE provides SQL variants of ore.doEval, ore.tableApply, ore.rowApply, and ore.groupApply: the table functions rqEval, rqTableEval, rqRowEval, and rqGroupEval, respectively, as summarized in the table below. The SQL interface also allows results to be stored directly in the database.

R Interface Function (ore.*)   SQL Interface Function (rq*)   Purpose
ore.doEval                     rqEval                         Invoke stand-alone R script
ore.tableApply                 rqTableEval                    Invoke R script with full table input
ore.rowApply                   rqRowEval                      Invoke R script one row at a time, or multiple rows in "chunks"
ore.groupApply                 rqGroupEval                    Invoke R script on data indexed by grouping column
ore.indexApply                 N/A                            Invoke R script N times

In addition, the SQL interface enables R results to be stored in a database table for subsequent use in another invocation (think data mining model building and scoring), and it enables returning structured R results in a table. Results can also be returned as XML: the XML output can include structured data, such as data frames and other R objects, as well as graphs. This XML capability allows R graphs and structured results to be displayed in Oracle BI Publisher documents and OBIEE dashboards.

Embedded R Execution: R Interface

The following example uses the function ore.groupApply, one of several embedded R execution functions, to illustrate how R users can achieve data parallelism through the database. This example also illustrates that embedded R execution enables the use of open source packages. Here we see the use of the R package biglm.

We specify a column on which to partition the data. Each partition of the data is provided to the function through its first argument, in this case the function parameter dat. There is no need to pull data from the database into R: the R function is shipped to the database, which processes the partitions in parallel. Output results may be stored directly in the database or downloaded to R. Only when we want to see the results of these models do we need to retrieve them into R memory and invoke, for example, the summary function.
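A hedged sketch of the pattern just described, using ONTIME_S with illustrative column names (DEST as the grouping column, ARRDELAY modeled from DISTANCE and DEPDELAY); it assumes an established ORE connection and that the biglm package is installed on the database server.

modList <- ore.groupApply(
  X = ONTIME_S[, c("DEST", "ARRDELAY", "DISTANCE", "DEPDELAY")],
  INDEX = ONTIME_S$DEST,            # one partition, and one model, per destination airport
  function(dat) {
    library(biglm)
    biglm(ARRDELAY ~ DISTANCE + DEPDELAY, data = dat)
  })
mods <- ore.pull(modList)           # retrieve only when we want to inspect the models
summary(mods$BOS)                   # e.g., the model for Boston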

Whereas the previous example showed how to use embedded R execution from the R environment, we can also invoke R scripts from SQL. This next example illustrates returning a data frame from results computed in Oracle Database. We first create an R script in the database R script repository. The script is defined as a function that creates a vector of 10 elements, and returns a data frame with those elements in one column and those elements divided by 100 in a second column.
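A hedged sketch of what that repository function might look like; the name under which it is saved in the database R script repository (for example via sys.rqScriptCreate) is up to you and is referenced in the SQL invocation below.

function() {
  ID <- 1:10
  data.frame(ID = ID, RES = ID / 100)   # two columns: the values and the values divided by 100
}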

Once the script is created, we can invoke it through SQL. One of the SQL embedded R execution table functions available is rqEval. The first argument is NULL since we have no parameters to pass to the function. The second argument describes the structure of the result; any valid SQL query that captures the name and type of the resulting columns will suffice. The third argument is the name of the script to execute.

Embedded R scripts may generate any valid R object, including graphs. In addition, embedded R execution enables returning results from an R script as an XML string. Consider the following example that creates a vector from the integers 1 to 10, plots 100 random normal points in a graph, and then returns the vector. After creating the script in the database R script repository, we invoke the script using rqEval, but instead of specifying the form of the result in a SQL query, we specify XML.
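A hedged sketch of the graph-producing script described above; once stored in the R script repository, it is invoked with rqEval using 'XML' in place of the result-shape query.

function() {
  res <- 1:10
  plot(1:100, rnorm(100))   # 100 random normal points; captured as a PNG in the XML output
  res                       # the vector comes back inside the <value> tags
}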

While the actual graph looks like the following, the output from this query will be an XML string.

In the execution results shown below, the VALUE column returned is a string that contains first the structured data in XML format. Notice the numbers 1 through 10 set off by the <value> tags. This is followed by the image in PNG base 64 representation. This type of output can be consumed by Oracle Business Intelligence Publisher (BIP) to produce documents with R-generated graphs and structured content. Oracle BIP templates can also be used to expose R-generated content in Oracle Business Intelligence Enterprise Edition (OBIEE) web browser-based dashboards.

You can see additional examples of embedded R execution in action in the Oracle R Enterprise Training, session 4, Embedded R Script Execution. These examples will run as written in R 2.13.2 after installing Oracle R Enterprise. We'll be posting more examples using embedded R script execution in the coming months. In the meantime, questions are always welcome on the Oracle R Forum.

Wednesday Feb 29, 2012

The Oracle R Advanced Analytics team
is happy to announce the release of the ROracle 1.1-1 package on the
Comprehensive R Archive Network (CRAN). We’ve rebuilt ROracle from the
ground up, working hard to fix bugs and add optimizations. The new
version introduces key improvements for interfacing with Oracle Database
from open-source R.

We think ROracle 1.1-1 is a great step forward, allowing users to build
high performance and efficient R applications using Oracle Database.
Whether you are upgrading your existing interface or using it for the
first time, ROracle 1.1-1 is ready for download. If you have any questions or comments, please post on the Oracle R discussion forum. We'd love to hear from you!

Monday Feb 20, 2012

Oracle has released the Oracle R Distribution, an Oracle-supported
distribution of open source R. This is provided as a free download from Oracle. Support for Oracle R Distribution is provided to customers of the Oracle Advanced Analytics option and Oracle Big Data Appliance. The
Oracle R Distribution facilitates enterprise acceptance of R, since the
lack of a major corporate sponsor has made some companies concerned
about fully adopting R. With the Oracle R Distribution, Oracle plans to contribute bug fixes and relevant enhancements to open source R.

Oracle has already taken responsibility for and contributed modifications to ROracle, an Oracle Database interface (DBI) driver for R based on OCI. As ROracle is LGPL and used for Oracle Database connectivity from R, we are committed to ensuring it is the best package for Oracle connectivity.

Thursday Feb 16, 2012

In an enterprise, databases are typically where data reside. So where
data analytics are required, it's
important for R and the database to work well together. The more
seamlessly and naturally R users can access data, the easier it is to
produce results. R users may leverage ODBC, JDBC, or similar types of
connectivity to access database-resident
data. However, this requires working with SQL to formulate queries to
process or filter data in the database,
or to pull data into the R environment for further processing using R.
If R users, statisticians, or data analysts are unfamiliar with SQL or
database tasks, or don't have database access, they often consult IT for
data extracts.

Not having direct access to database-resident
data introduces delays in
obtaining data, and can make near real-time analytics impossible. In
some instances, users request data sets much larger than required to
avoid multiple requests to IT. Of course, this approach introduces costs
of exporting, moving, and storing data, along with the associated
backup, recovery, and security risks.

Oracle R Enterprise eliminates the need to know SQL to work with database-resident data.
Through the Oracle R Enterprise transparency layer, R users can access
data stored in tables and views as virtual data frames. Base R functions
performed on these "ore.frames" are overloaded to generate SQL which is transparently sent to Oracle Database
for execution - leveraging the database as a high-performance
computational engine.
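As a brief hedged sketch (table and column names are illustrative), working with an ore.frame looks like ordinary R, while the filtering and aggregation below execute as SQL in the database.

ore.sync(); ore.attach()                 # expose database tables and views as ore.frames
class(ONTIME_S)                          # "ore.frame"
dim(ONTIME_S)                            # dimensions computed in-database
delayed <- ONTIME_S[ONTIME_S$DEPDELAY > 15, c("DEST", "DEPDELAY")]
head(delayed)                            # only a handful of rows return to the client
aggregate(delayed$DEPDELAY, by = list(DEST = delayed$DEST), FUN = mean)   # runs in-database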

Check out Oracle R Enterprise for examples of the interface, documentation, and a link to download Oracle R Enterprise.

Friday Feb 03, 2012

For many in the Oracle community, the addition of R through Oracle R Enterprise could leave them wondering "What is R?"

R has been receiving a lot of attention recently, although it’s been around for over 15 years. R is an open-source language and environment for statistical computing
and data visualization, supporting data manipulation and
transformations, as well as sophisticated graphical displays. It's being taught in colleges and universities in courses on statistics and advanced analytics - even replacing more traditional statistical software tools. Corporate data analysts and statisticians often know R and use it in their daily work, either writing their own R functionality, or leveraging the more than 3400 open source packages. The Comprehensive R Archive Network (CRAN) open source packages support a wide range of statistical and data analysis capabilities. They also focus on analytics specific to individual fields, such as bioinformatics, finance, econometrics, medical image analysis, and others (see CRAN Task Views).

So why do statisticians and data analysts use R?

Well, R is a statistics language similar to SAS or SPSS. It’s a powerful, extensible environment, and as noted above, it has a wide range of
statistics and data visualization capabilities. It’s easy to install and use, and it’s free – downloadable from the CRAN R
project website.

In contrast, statisticians and data analysts typically don't know SQL and are not familiar with database tasks. R gives statisticians and data analysts access to a wide range of analytical capabilities in a natural statistical language, allowing them to remain highly productive. For example, writing R functions is simple and can be done quickly, and functions can return R objects that are easily passed to and manipulated by other R functions. By comparison, traditional statistical tools can make implementing functions cumbersome, such that programmers resort to macro-oriented programming constructs instead.
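For example, the small function sketched below (names are purely illustrative) returns a named vector that other functions can consume directly.

summStats <- function(x) c(mean = mean(x), sd = sd(x), n = length(x))
round(summStats(rnorm(1000)), 2)   # the result is just another R object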

So why do we need anything else?

R was conceived as a single user tool that is not multi-threaded. The
client and server components are bundled together as a single
executable, much like Excel.

R is limited by the memory and processing power of the machine where it
runs, but in addition, being single threaded, it cannot automatically leverage the CPU capacity
on a user’s multi-processor laptop without special packages and
programming.

However, there is another issue that limits R’s scalability…

R's approach to passing data between function invocations results in data duplication, which chews up memory faster. So, inherently, R is not good for big data or, depending on the machine and the tasks, even gigabyte-sized data sets.

This is where Oracle R Enterprise comes in. As we'll continue to discuss in this blog, Oracle R Enterprise lifts this memory and computational constraint found in R today by executing requested R calculations on data in the database, using the database itself as the computational engine. Oracle R Enterprise allows users to further leverage Oracle's engineered systems, like Exadata, Big Data Appliance, and Exalytics, for enterprise-wide analytics, as well as reporting tools like Oracle Business Intelligence Enterprise Edition dashboards and BI Publisher documents.

About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.