Thursday Sep 27, 2012

Today's guest post comes from Jason Feldhaus, a Consulting Member of Technical Staff in the TimesTen Database organization at Oracle. He shares with us a sample session using ROracle with the TimesTen In-Memory database.

Beginning in version 1.1-4, ROracle includes support for the Oracle TimesTen In-Memory Database, version 11.2.2. TimesTen is a relational database providing very low latency and high throughput through its memory-centric architecture. TimesTen is designed for low-latency, high-volume data, event, and transaction management. A TimesTen database resides entirely in memory, so no disk I/O is required for transactions and query operations. TimesTen is used in applications requiring very fast and predictable response time, such as real-time financial services trading applications and large web applications. TimesTen can be used as the database of record or as a relational cache database to Oracle Database.

ROracle provides an interface between R and the database, providing the rich functionality of the R statistical programming environment using the SQL query language. ROracle uses the OCI libraries to handle database connections, providing much better performance than standard ODBC.

The latest ROracle enhancements include:

Support for Oracle TimesTen In-Memory Database

Support for Date-Time using R's POSIXct/POSIXlt data types

RAW, BLOB and BFILE data type support

Option to specify number of rows per fetch operation

Option to prefetch LOB data

Break support using Ctrl-C

Statement caching support

TimesTen 11.2.2 contains enhanced support for analytics workloads and complex queries.
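A minimal ROracle session against TimesTen might look like the following sketch. All connection details (the DSN-style connect string, user name, password, and table name) are placeholders for your own configuration, not values from this post:

```r
library(ROracle)

drv <- dbDriver("Oracle")
# Connect through the TimesTen client library; "sampledb" and the
# credentials below are illustrative placeholders.
con <- dbConnect(drv, username = "appuser", password = "appuser",
                 dbname = "localhost/6625:sampledb")

# Standard DBI-style queries work as with Oracle Database.
res <- dbGetQuery(con, "SELECT COUNT(*) FROM my_table")
print(res)

dbDisconnect(con)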

Sunday Sep 23, 2012

Learn how Oracle R Enterprise is used to generate new insight and new value for the business, answering not only what happened, but why it happened. View this YouTube Oracle Channel video overview describing how analyzing big data using Oracle R Enterprise differs from other analytics tools at Oracle.

Oracle R Enterprise (ORE), a component of the Oracle Advanced Analytics Option, couples the wealth of analytics packages in R with the performance, scalability, and security of Oracle Database. ORE executes base R functions transparently on database data without having to pull the data out of Oracle Database. As an embedded component of the database, Oracle R Enterprise can run your R scripts and open source packages via embedded R execution, where the database manages the data served to the R engines and provides user-controlled data parallelism. The result is faster and more secure access to data. ORE also works with the full suite of in-database analytics, providing integrated results to the analyst.

Monday Sep 17, 2012

In this podcast interview with Michael Kane, Data Scientist and Associate Researcher at Yale University, Michael discusses the R statistical programming language, computational challenges associated with big data, and two data analysis projects he conducted: one on the stock market "flash crash" of May 6, 2010, and one on the tracking of transmission routes of the H5N1 bird flu. Michael also worked with Oracle on Oracle R Enterprise, a component of the Advanced Analytics option to Oracle Database Enterprise Edition. In the closing segment of the interview, Michael comments on the relationship between the data analyst and the database administrator and how Oracle R Enterprise provides secure data management, transparent access to data, and improved performance to facilitate this relationship.

ore.stepwise for high-performance stepwise regression

The deployment of ORE within the Oracle micro-processor tools environment introduced a technology that significantly expands our data analysis capabilities and opens the door to new applications.

Oracle R Enterprise (ORE) has recently been deployed in the Oracle micro-processor tools environment, replacing a popular commercial tool as a production engine for data analysis. Fit/response models are important components of the simulation flows in the Oracle micro-processor tools environment; such models are used for a variety of purposes ranging from library generation to design yield prediction and optimization. Several tools were targeted for the migration to ORE; these tools are constructed around an iterative loop processing hundreds of data structures. At each iteration, a simulator engine generates data for multiple figures of merit (targets), and a fit engine is called to construct response models for each target. The fit models are assembled into libraries for subsequent simulation within other flows or used on the fly.

A common characteristic of these models is that they need to express strong nonlinear relations between the targets and large sets of explanatory variables. Multiple interactions and many non-linear dependencies are considered as candidate effects for the model construction; they often result from an automatic generation procedure attempting to cover a large variety of anticipated relations between the dependent and independent variables. For example, for a case with O[10^2] main explanatory variables, the total number of candidate regressors, nregs_tot, can quickly rise to O[10^3-10^4]. Linear regression models with such a high number of terms are not only too expensive to use, as they potentially have a large number of terms with negligible coefficients, but are also likely to lead to instability and inaccuracy problems. For example, overfitting is an important concern, with models expressing fluctuations in the data rather than capturing the trend. Moreover, if strong quasi-linear interactions occur between large numbers of regressors, the variance of the calculated (model) coefficients can be massively inflated.

In order to reduce the size of the fit models while retaining significant dependencies, we use stepwise regression. Stepwise regression is an iterative fitting method which builds the model incrementally by adding and eliminating regressors (from a specified candidate set) using a pre-defined selection mechanism/test of statistical significance; the model converges when all regressors retained in the model satisfy the significance test criteria. Commercial and open source regression packages offer various flavors of stepwise regression which differ in multiple ways: the choice of the selection mechanism (F-test; information criteria such as AIC or BIC), the choice of 'direction' (forward, backward, or both), the flexibility for specifying the model scope, the significance threshold(s), the handling of interaction terms, etc.
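For readers unfamiliar with the mechanics, base R's step function illustrates the idea on a small, self-contained example (the mtcars data and the variable choices are ours, purely for illustration):

```r
# Stepwise selection with base R's step() on the built-in mtcars data.
# Start from an intercept-only model; the scope allows all main effects.
full <- lm(mpg ~ ., data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# direction = "both" adds and drops terms, here using AIC as the criterion.
fit <- step(null, scope = formula(full), direction = "both", trace = 0)

coef(fit)   # the retained regressors and their coefficients
```

step uses AIC by default; ore.stepwise, discussed below, instead uses F-test significance levels.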

ORE has developed a proprietary in-database algorithm for stepwise regression, ore.stepwise, which complements the functionality of R's step and, especially, offers a very significant performance improvement through faster, scalable algorithms and in-database execution. ore.stepwise allows the user to specify a model scope and stepwise direction, and uses the F-test for regressor selection, with the add.p and drop.p significance levels for adding and removing regressors while the model is iteratively constructed. The basic syntax is the following:

assign(sprintf("%s_model", modname),
       ore.stepwise(Lower_model, data = dataDB, scope = Scope_model,
                    direction = "both", add.p = ..., drop.p = ...))
The in-database ore.frame dataDB contains observations for several target models, mdlA, mdlB, ..., and a list with all target names (list_modnames) is assembled for iterative processing. For each target, a model scope is specified within lower and upper bounds. In the example above, the lower bound is the intercept, but the upper bound is customized so that each target model can be constructed from its own collection of regressors. The results shown in Figure 1 illustrate the performance difference between ORE's ore.stepwise and base R's step function for both a bi-linear and a fully quadratic model constructed from 34 independent variables and 10K data observations.

For the bi-linear model, ore.stepwise is approx. 65X faster than step, at similar R^2 and relative error.

Quadratic model:

method         R^2      Number of Regressors   mean(rel_error)   Elapsed Time (seconds)
step           0.9962   154                    1.05e-02          12600.0
ore.stepwise   0.9963   210                    1.04e-02          69.5

For the quadratic model, ore.stepwise is approx. 180X faster than step at similar R^2 and relative error.

Figure 1: Comparison of results for R's step function and ORE's ore.stepwise function for both bi-linear and quadratic models

ore.stepwise is faster than R's step by a factor of 66-180X. The larger the data set and the number of regressors, the greater the performance advantage we observed for ore.stepwise compared to R's step. The models produced by R's step and ore.stepwise have a different number of regressors because both the selection mechanisms and the handling of interaction terms differ. step favors the main terms: x1:x2 will be added only if x1 and x2 were previously added and, conversely, x1:x2 will be eliminated before x1 and x2 are eliminated, whereas ore.stepwise does not differentiate between main terms and interactions. With respect to collinearity, ore.stepwise detects strict linear dependencies and eliminates from the start the regressors involved in multi-collinear relations.

In summary, the ORE capabilities for stepwise regression far surpass similar functionality in tools we
considered as alternatives to ORE. The deployment of ORE within the Oracle micro-processor tools environment introduced a technology which significantly expands the data analysis capabilities through the R ecosystem combined with in-database high performance algorithms and opens the door to new applications. This technology leverages the flexibility and extensibility of the R environment and allows massive and complex data analysis sustained by the scalability and performance of the Oracle database for Big Data.

Alexandre Ardelea is a principal hardware engineer at Oracle Corporation. Alex has a PhD and MS in Physics from
Ecole polytechnique fédérale de Lausanne and post-doctoral research in
non-linear physics, CFD and parallel methods. Alex's specialities
include response surface modeling, optimization strategies for
multi-parametric/objective/constraint problems, statistical process
characterization, circuit analysis and RF algorithms.

Oracle R
Enterprise (ORE) implements a transparency layer on top of the R engine
that allows R computations specified in the R environment to be pushed for execution in Oracle Database. A mapping is established between a special R object called an ore.frame and a corresponding database table or view, allowing analytical functions such as ore.stepwise to be executed on these mapped objects in Oracle Database. The
overloaded functions in the ORE packages generate SQL statements in the
background, export the expensive computations to the database for
execution, and return results to the R environment. Here is a simple example using ore.stepwise with the longley data, which is shipped in the datasets package with R:

Thursday Aug 02, 2012

Modern computer
processors are adequately optimized for many statistical calculations,
but large data operations may require hours or days to return a result. Oracle R Enterprise (ORE), a set of R packages designed to process large data computations in Oracle Database, can run many R operations in parallel, significantly
reducing processing time. ORE supports parallelism through the
transparency layer, where the database is used as a computational
engine, and embedded R execution, where R scripts can be executed in a
data parallel manner.

The
backbone of parallel computing is breaking down a resource intensive
computation into chunks that can be performed independently, while
maintaining a framework that allows for the results of those independent
computations to be combined. Writing parallel code is typically trickier than writing serial code,
but this is simplified using ORE, as there is no need for the user to
create worker instances or combine results. Using the transparency
layer, users simply execute their ORE code and the database implicitly
manages the entire process, returning results for further processing.

With ORE,
each R function invocation that operates on an ORE object, such as
ore.frame, is translated to a SQL statement behind the scenes. This SQL,
which may be stacked after several function invocations, undergoes
optimization and parallelization when parsed and executed. This
technique enables deferred evaluation, but that's a topic for
another blog. Depending on the resource requirements of the statement,
the database decides if it should leverage parallel execution.

For embedded R
execution, database degree of parallelism settings help determine the
number of parallel R engines to start. When data parallel functions
execute in parallel, each unit of work is sent to a different R external
process, or extproc, at the database server. The
results are automatically collated and returned as R-proxy objects,
e.g., ore.frame objects, in the R interface and SQL objects in the SQL
interface, which can be processed further in R or by SQL functions.
The SQL functions enable the operationalization or productization of R scripts as part of a database-based application, in what we refer to as "lights out" mode.

In the ORE Transparency Layer,
where the database executes SQL generated from overloaded R functions,
parallelism is automatic, assuming the database or table is configured
for parallelism. Parallel computations in the transparency layer are
ideal for bigger data where functionality exists in the database.

Using Embedded R Script Execution, parallelism is enabled for row, group and index operations if specified using a function parameter or parallel cursor hint:

ore.groupApply and rqGroupEval* split the data into grouped partitions and invoke the R function on each partition in a separate engine at the database server.

ore.rowApply and rqRowEval split the data into row chunks and invoke the R function on each chunk in a separate engine at the database server.

ore.indexApply runs an R function x times, with each iteration of the function invoked in separate engine at the database server.
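As a sketch of the last of these, an ore.indexApply call for a simple Monte Carlo run might look like the following (the exact argument names, including parallel, are stated from memory and should be checked against the ORE documentation):

```r
# Run 10 independent simulations, each in a separate R engine at the
# database server; each invocation receives its index as the argument.
res <- ore.indexApply(10,
  function(i) {
    set.seed(i)          # per-invocation seed for reproducibility
    mean(rnorm(1000))    # a toy simulation result
  },
  parallel = TRUE)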

With embedded R execution, the expectation is that the database server machine has greater RAM and CPU capacity than the user's client machine, so executing R scripts at the server inherently allows larger data sets to be processed by an individual R engine.

In addition, users can include contributed R packages in their embedded R scripts. Consider an example using a sample of the airline on-time performance data
from Research and Innovative Technology Administration (RITA), which
coordinates the U.S. Department of Transportation (DOT) research
programs. The data sample consists of 220K records of U.S. domestic
commercial flights between 1987 and 2008.

We use the R
interface to embedded R to partition the airline data table (ONTIME_S)
by the DAYOFWEEK variable, fit a linear model using the biglm package, and then combine the results. Note: To run this example, the biglm package must be installed on both the database server and client machine.

The call to
ore.groupApply uses Oracle Database to partition the ONTIME_S table by
the categories in the DAYOFWEEK variable. Each category is sent to an R
engine at the database server machine to apply the R function in
parallel. The individual category results are combined in the returned
result. Using embedded R alleviates the typical memory problems
associated with running R serially because we are fitting only a single
partition, or day of the week, in memory of an R engine. Using a Linux
server with 8 GB RAM and 4 CPUs, fitting the model in parallel by
setting parallel = TRUE in the call to ore.groupApply, reduces the
processing time from approximately 30 seconds to 10 seconds.

If the goal is to
integrate the model results as an operationalized process, we can use
rqGroupEval, the SQL interface equivalent to ore.groupApply. We create a
script to set up the structure of the input and grouping column and
then run the script in SQL. The nature of pipelined table functions
requires that we explicitly represent the type of the result, captured
in the package, and create a function that includes the column used for
partitioning explicitly.

We use a
parallel hint on the cursor that is the input to our rqGroupEval
function to enable Oracle Database to use parallel R engines. In this
case, using the same Linux server, the processing time is reduced from
approximately 25 seconds to 7 seconds as we used 7 parallel R engines
(one for each day of the week) across a single server. Of course, a
real-world scenario may utilize hundreds of parallel engines across many
servers, returning results on large amounts of data in short period of
time.

*To enable execution of an R script in the SQL interface, ORE provides SQL variants of ore.doEval, ore.tableApply, ore.rowApply and ore.groupApply. These functions are rqEval, rqTableEval, rqRowEval and rqGroupEval. The ore.groupApply feature does not have a direct parallel in the SQL interface; we refer to rqGroupEval as a concept, but specific code is required to enable this feature. This is highlighted in the second example.

Wednesday Jun 27, 2012

Part II – Solving Big Problems with Oracle R Enterprise

In the first post in this series (see https://blogs.oracle.com/R/entry/solving_big_problems_with_oracle),
we showed how you can use R to perform historical rate of return
calculations against investment data sourced from a spreadsheet. We
demonstrated the calculations against sample data for a small set of
accounts. While this worked fine, in the real-world the problem is much
bigger because the amount of data is much bigger. So much bigger that
our approach in the previous post won’t scale to meet the real-world
needs.

From our previous post, here are the challenges we need to conquer:

The actual data that needs to be used lives in a database, not in a spreadsheet

The actual data is much, much bigger: too big to fit into the normal R memory space and too big to want to move across the network

The overall process needs to run fast: much faster than a single processor

The actual data needs to be kept secure: another reason to not move it from the database and across the network

And the process of calculating the IRR needs to be integrated
together with other database ETL activities, so that IRR’s can be
calculated as part of the data warehouse refresh processes

In this post, we will show how we moved from sample data
environment to working with full-scale data. This post is based on
actual work we did for a financial services customer during a recent
proof-of-concept.

...

On average, we performed 8,200 executions of our R function per second (911K accounts / 110s)

On average, we did 41,000 single time period rate of return
calculations per second (each of the 8,200 executions of our R function
did rate of return calculations for 5 time periods)

On average, we processed over 900,000 rows of database data in R per second (103m detail rows/110s)

...

R + Oracle R Enterprise: Best of R + Best of Oracle Database

This blog post series started by describing a real customer
problem: how to perform a lot of calculations on a lot of data in a
short period of time. While standard R proved to be a very good fit for
writing the necessary calculations, the challenge of working with a lot
of data in a short period of time remained.

This blog post series showed how Oracle R Enterprise enables R
to be used in conjunction with the Oracle Database to overcome the data
volume and performance issues (as well as simplifying the operations and
security issues). It also showed that we could calculate 5 time
periods of rate of returns for almost a million individual accounts in
less than 2 minutes.

Thursday Jun 21, 2012

Abstract:

This blog post will show how we used Oracle R Enterprise to tackle a customer’s big calculation problem across a big data set.

Overview:

Databases are great for managing large amounts of data in a central
place with rigorous enterprise-level controls. R is great for doing
advanced computations. Sometimes you need to do advanced computations
on large amounts of data, subject to rigorous enterprise-level
concerns. This blog post shows how Oracle R Enterprise, combining R with Oracle Database, enabled us to do some pretty sophisticated calculations across 1 million accounts (each with many detailed records) in minutes.

The problem:

A financial services customer of mine has a need to calculate the
historical internal rate of return (IRR) for its customers’ portfolios.
This information is needed for customer statements and the online web
application. In the past, they had solved this with a home-grown
application that pulled trade and account data out of their data
warehouse and ran the calculations. But this home-grown application was
not able to do this fast enough, plus it was a challenge for them to
write and maintain the code that did the IRR calculation.

Conceptually, MapReduce is similar to a combination of apply operations in R or GROUP BY in Oracle Database: transform the elements of a list or table, compute an index, and apply a function to the specified groups. The value of MapReduce in ORCH is the extension beyond a single process to parallel processing using modern architectures: multiple cores, processes, machines, clusters, data appliances, or clouds.
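The apply-operation analogy can be made concrete in plain R. Here tapply plays both roles: the grouping column acts as the map-side key, and the summary function acts as the reducer (the toy data below is ours, for illustration only):

```r
# "Map": key = carrier, value = arrival delay.
# "Reduce": mean of the values collected under each key.
delays <- data.frame(carrier  = c("AA", "AA", "UA", "UA", "UA"),
                     arrdelay = c(5, 15, 0, 10, 20))

tapply(delays$arrdelay, delays$carrier, mean)
#> AA UA
#> 10 10
```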

ORCH can be used on the Oracle Big Data Appliance or on non-Oracle Hadoop clusters. R users write mapper and reducer functions in R and execute MapReduce jobs from the R environment using a high-level interface. As such, R users are not required to learn a new language, e.g., Java, or environment, e.g., cluster software and hardware, to work with Hadoop. Moreover, functionality from R open source packages can be used in the writing of mapper and reducer functions. ORCH also gives R users the ability to test their MapReduce programs locally, using the same function call, before deploying on the Hadoop cluster. In the following example, we use the ONTIME_S data set typically installed in Oracle Database when Oracle R Enterprise is installed. ONTIME_S is a subset of the airline on-time performance data from the Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs. We're providing a relatively large sample data set (220K rows), but this example could be performed in ORCH on the full data set, which contains 123 million rows and requires 12 GB of disk space. This data set is significantly larger than R can process on its own using a typical laptop with 8 GB RAM. ONTIME_S is a database-resident table with metadata on the R side, represented by an ore.frame object.

> class(ONTIME_S)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"

ORCH includes functions for manipulating HDFS data. Users can move data between HDFS and the file system, R data frames, and Oracle Database tables and views. This next example shows one such function, hdfs.push, which accepts an ore.frame object as its first argument, followed by the name of the key column, and then the name of the file to be used within HDFS.
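A call of this shape would match that description; the HDFS file name "ontime_DB" is a placeholder, and the argument names reflect the ORCH interface as described here:

```r
# Push the ore.frame to HDFS, keyed on the destination airport column,
# storing it under the HDFS name "ontime_DB" (a placeholder name).
ontime.dfs <- hdfs.push(ONTIME_S, key = "DEST", dfs.name = "ontime_DB")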

The following R script example illustrates how users can attach to an existing HDFS file object, essentially getting a handle to the HDFS file. Then, using the hadoop.run function in ORCH, we specify the HDFS file
handle, followed by the mapper and reducer functions. The mapper function takes the key and value as arguments, which correspond to one row of data at a time from the HDFS block assigned to the mapper. The function keyval in the mapper returns data to Hadoop for further processing by the reducer.

The reducer function receives all the values associated with one key (resulting from the “shuffle and sort” of Hadoop processing). The result of the reducer is also returned to Hadoop using the keyval function. The results of the reducers are consolidated in an HDFS file, which can be obtained using the hdfs.get function.

The following example computes the average arrival delay for flights where the destination is San Francisco Airport (SFO). It selects the SFO airport in the mapper and the mean of arrival delay in the reducer.
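A sketch of that job follows. The mapper/reducer signatures and the keyval and hadoop.run calls follow the ORCH conventions described above; the column names DEST and ARRDELAY are assumed from the ONTIME_S schema:

```r
# Attach to the HDFS file created earlier (name is a placeholder).
dfs <- hdfs.attach("ontime_DB")

res <- hadoop.run(dfs,
  mapper = function(key, ontime) {
    # Keep only rows whose key (DEST) is SFO.
    if (key == "SFO") keyval(key, ontime)
  },
  reducer = function(key, vals) {
    # Average the non-missing arrival delays for the key.
    sumAD <- 0; count <- 0
    for (x in vals) {
      if (!is.na(x$ARRDELAY)) {
        sumAD <- sumAD + x$ARRDELAY
        count <- count + 1
      }
    }
    keyval(key, sumAD / count)
  })

hdfs.get(res)   # retrieve the consolidated reducer output
```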

Friday Apr 13, 2012

Oracle just released the latest update to Oracle R Enterprise, version 1.1. This release
includes the Oracle R Distribution (based on open source R, version 2.13.2), an improved server installation, and much more. The key new features include:

Extended Server Support: New support for Windows 32 and 64-bit server components, as well as continuing support for Linux 64-bit server components

Oracle has released an update to the Oracle R Distribution, an Oracle-supported distribution of open source R. Oracle R Distribution 2.13.2 now contains the ability to dynamically link the following libraries on both Windows and Linux:

The Intel Math Kernel Library (MKL) on Intel chips

The AMD Core Math Library (ACML) on AMD chips

To take advantage of the performance enhancements provided by Intel MKL or AMD ACML in Oracle R Distribution, simply add the MKL or ACML shared library directory to the LD_LIBRARY_PATH system environment variable. This automatically enables MKL or ACML to make use of all available processors, vastly speeding up linear algebra computations and eliminating the need to recompile R. Even on a single core, the optimized algorithms in the Intel MKL libraries are faster than using R's standard BLAS library.

Open-source R is linked to NetLib's BLAS libraries, but they are not multi-threaded and only use one core. While R's internal BLAS is efficient for most computations, it's possible to recompile R to link to a different, multi-threaded BLAS library to improve performance on eligible calculations. Compiling and linking R yourself can be involved, but for many, the significantly improved calculation speed justifies the effort. Oracle R Distribution notably simplifies the process of using external math libraries by enabling R to auto-load MKL or ACML. For R commands that don't link to BLAS code, taking advantage of database parallelism using embedded R execution in Oracle R Enterprise is the route to improved performance.

Wednesday Apr 04, 2012

Oracle recently updated ROracle to version 1.1-2 on CRAN with enhancements and bug fixes. The major enhancements include the introduction of support for Oracle Wallet Manager and datetime and interval types.

Oracle Wallet support in ROracle allows users to manage public key security from the client R session. Oracle Wallet allows passwords to be stored and read by Oracle Database, allowing safe storage of database login credentials. In addition, we added support for datetime and interval types when selecting data, which expands ROracle's support for date data.

See the ROracle NEWS for the complete list of updates.

We encourage ROracle users to post questions and provide feedback on the Oracle R Forum.

In addition to being a high performance database interface to Oracle Database from R for general use, ROracle supports database access for Oracle R Enterprise.

Monday Apr 02, 2012

This Oracle R Enterprise (ORE) tutorial, on embedded R execution, is the third in a series to help users get started using ORE. See these links for the first tutorial on the transparency layer and second tutorial on the statistics engine. Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

Embedded R Execution refers to the ability to execute an R script at the database server, which provides several benefits: spawning multiple R engines in parallel for data-parallel operations, more efficient data transfer between Oracle Database and the R engine, leverage a likely more powerful server with more CPUs and greater RAM, schedule automated jobs, and take advantage of open source R packages at the database server. Data aggregates are computed in parallel, significantly reducing computation time, without requiring sophisticated configuration steps.

ORE provides two interfaces for embedded R execution: one for R and one for SQL. The R interface enables interactive execution at the database server from the client R engine, e.g., your laptop. It also has transparency aspects for passing R objects and returning R objects. In the R interface, ore.doEval schedules execution of the R code with the database-embedded R engine and returns the results back to the desktop for continued analysis. User-defined R functions can run in parallel, either on each row, on sets of rows, or on each group of rows given a grouping column. The first two cases are covered by ore.rowApply, the third by the ore.groupApply function. ore.indexApply provides parallel simulation capability by invoking the script the number of times specified by the user. The R interface returns results to the client as R objects that can be passed as arguments to R functions.

The SQL interface enables interactive execution from any SQL interface, like SQL*Plus or SQL Developer, but it also enables R scripts to be included in production database-based systems. To enable execution of an R script in the SQL interface, ORE provides SQL variants of ore.doEval, ore.tableApply, ore.rowApply and ore.groupApply. These functions are rqEval, rqTableEval, rqRowEval and rqGroupEval. The SQL interface also allows for storing results directly in the database.

R Interface Function (ore.*)   SQL Interface Function (rq*)   Purpose
ore.doEval                     rqEval                         Invoke stand-alone R script
ore.tableApply                 rqTableEval                    Invoke R script with full table input
ore.rowApply                   rqRowEval                      Invoke R script one row at a time, or multiple rows in "chunks"
ore.groupApply                 rqGroupEval                    Invoke R script on data indexed by grouping column
ore.indexApply                 N/A                            Invoke R script N times

In addition, the SQL interface enables R results to be stored in a database table for subsequent use in another invocation (think data mining model building and scoring). It enables returning structured R results in a table. Results can also be returned as XML. The XML interface enables structured data, such as data frames and other R objects, as well as graphs, to be returned. The XML capability allows R graphs and structured results to be displayed in Oracle BI Publisher documents and OBIEE dashboards.

Embedded R Execution: R Interface

The following example uses the function ore.groupApply, one of several embedded R execution functions, to illustrate how R users can achieve data parallelism through the database. This example also illustrates that embedded R execution enables the use of open source packages. Here we see the use of the R package biglm.

We specify a column on which to partition the data. Each partition of the data is provided to the function through the first argument, in this case the function variable dat. There is no need to send data from the database to R: the R function is sent to the database, which processes the partitions in parallel. Output results may be stored directly in the database, or may be downloaded to R. Only when we want to see the results of these models do we need to retrieve them into R memory and apply, for example, the summary function.

Whereas the previous example showed how to use embedded R execution from the R environment, we can also invoke R scripts from SQL. This next example illustrates returning a data frame from results computed in Oracle Database. We first create an R script in the database R script repository. The script is defined as a function that creates a vector of 10 elements, and returns a data frame with those elements in one column and those elements divided by 100 in a second column.
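The function body behind such a script would be ordinary R. A minimal sketch of just that function (the name f is ours; in practice the body is registered in the database script repository rather than defined locally):

```r
# Build a vector of 10 elements and return a two-column data frame:
# the elements themselves, and the elements divided by 100.
f <- function() {
  x <- 1:10
  data.frame(x = x, y = x / 100)
}

res <- f()
head(res, 3)
#>   x    y
#> 1 1 0.01
#> 2 2 0.02
#> 3 3 0.03
```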

Once the script is created, we can invoke it through SQL. One of the SQL embedded R execution table functions available is rqEval. The first argument is NULL since we have no parameters to pass to the function. The second argument describes the structure of the result. Any valid SQL query that captures the name and type of resulting columns will suffice. The third argument is the name of the script to execute.

Embedded R scripts may generate any valid R object, including graphs. In addition, embedded R execution enables returning results from an R script as an XML string. Consider the following example that creates a vector from the integers 1 to 10, plots 100 random normal points in a graph, and then returns the vector. After creating the script in the database R script repository, we invoke the script using rqEval, but instead of specifying the form of the result in a SQL query, we specify XML.
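A sketch of this XML-returning invocation; the script name Example6 and the plot arguments are illustrative:

begin
  sys.rqScriptCreate('Example6',
    'function() {
       res <- 1:10
       plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)
       res
     }');
end;
/

select value
  from table(rqEval(NULL, 'XML', 'Example6'));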

While the actual graph looks like the following, the output from this query will be an XML string.

In the execution results shown below, the VALUE column returned is a string that contains first the structured data in XML format. Notice the numbers 1 through 10 set off by the <value> tags. This is followed by the image in PNG base 64 representation. This type of output can be consumed by Oracle Business Intelligence Publisher (BIP) to produce documents with R-generated graphs and structured content. Oracle BIP templates can also be used to expose R-generated content in Oracle Business Intelligence Enterprise Edition (OBIEE) web browser-based dashboards.

You can see additional examples of embedded R execution in action in the Oracle R Enterprise Training, session 4, Embedded R Script Execution. These examples will run as written in R 2.13.2 after installing Oracle R Enterprise. We'll be posting more examples using embedded R script execution in the coming months. In the meantime, questions are always welcome on the Oracle R Forum.

Tuesday Mar 13, 2012

Oracle provides the
Oracle R Distribution, an Oracle-supported distribution of open source
R. Support for Oracle R Distribution is provided to customers of the Oracle Advanced Analytics option and the Oracle Big Data Appliance.
The Oracle R Distribution facilitates enterprise acceptance of R, since
the lack of a major corporate sponsor has made some companies concerned
about fully adopting R. With the Oracle R Distribution, Oracle plans to
contribute bug fixes and relevant enhancements to open source R. Oracle
R Distribution also employs Intel's Math Kernel Library (MKL) to enable optimized, multi-threaded math routines, providing relevant R functions maximum performance on Intel hardware.

Oracle plans
to actively maintain interoperability with current R versions by
developing Oracle R Enterprise on the last stable point release of open
source R. With the
Oracle R Distribution, Oracle plans to expand support to the open source
community by contributing bug fixes and relevant enhancements to open
source R. All of these improvements will be made publicly available to the R community. Oracle has already released to the open source community an enhanced version of ROracle, which it now maintains. Support
for Oracle R Distribution, Oracle R Enterprise, and Oracle R Connector
for Hadoop is provided through standard Oracle Support channels for
licensed customers.

Wednesday Feb 29, 2012

The Oracle R Advanced Analytics team
is happy to announce the release of the ROracle 1.1-1 package on the
Comprehensive R Archive Network (CRAN). We’ve rebuilt ROracle from the
ground up, working hard to fix bugs and add optimizations. The new
version introduces key improvements for interfacing with Oracle Database
from open-source R.

We think ROracle 1.1-1 is a great step forward, allowing users to build
high performance and efficient R applications using Oracle Database.
Whether you are upgrading your existing interface or using it for the
first time, ROracle 1.1-1 is ready for download. If you have any questions or comments, please post on the Oracle R discussion forum. We'd love to hear from you!

Thursday Feb 23, 2012

This Oracle R Enterprise (ORE) statistics engine tutorial is the second in a series to help users get started using ORE. (See the first tutorial on the ORE transparency layer here.) Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

The ORE statistics engine is a database library consisting of native database statistical functions and new functionality added specifically for ORE. ORE intercepts R functions and pushes their execution to Oracle Database to perform computationally expensive transformations and computations. This allows R users to use the R client directly against data stored in Oracle Database, greatly increasing R's scalability and performance.

A variety of useful and common routines are available in the ORE statistics engine:

Significance Tests

Chi-square, McNemar, Bowker
Simple and weighted kappas
Cochran-Mantel-Haenszel correlation
Cramer's V
Binomial, KS, t, F, Wilcoxon

These R functions are overridden, such that when presented with ore.frame data, the function generates a SQL query that is transparently submitted to Oracle Database. This is the case for much of the base R and stats functionality, providing the benefit of employing the database as the computational engine and seamlessly breaking through R's memory barrier.

In this post, we introduce a simple yet typical data analysis using
functions from the ORE transparency layer. We begin by configuring the local R environment with a few simple commands.
Load the ORE library and connect to Oracle Database:

> library(ORE)
> ore.connect("USER", "SID", "HOST", "PASSWORD")

Invoking ore.sync synchronizes the metadata in the database schema with the R environment:

> ore.sync("SCHEMA")
> ore.ls()

Attaching the database schema provides access to views and tables so they can be manipulated from a local R session:
> ore.attach("SCHEMA")

We use the ONTIME_S data set typically installed in Oracle Database when ORE is installed. ONTIME_S is a subset of the airline on-time performance data from the Research and Innovative Technology Administration (RITA), which coordinates U.S. Department of Transportation (DOT) research programs. We're providing a relatively large sample data set (220K rows), but these examples could be performed in ORE on the full data set, which contains 123 million rows and requires 12 GB of disk space. This data set is significantly larger than R can process on its own. ONTIME_S is a database-resident table with metadata on the R side, represented by an ore.frame object.

> class(ONTIME_S)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"

We focus on two columns of data: one numeric column, DEPDELAY (actual departure delay in minutes), and a categorical column, ORIGIN (airport of origin).
A typical place to begin is, of course, looking at the structure of selected variables. We call the function summary to obtain summary statistics on the variable measuring departure delay.
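As a sketch of these calls (the exact output depends on the data loaded):

> with(ONTIME_S, summary(DEPDELAY))
> with(ONTIME_S, quantile(DEPDELAY, c(0.1, 0.25, 0.5, 0.75, 0.9),
+                         na.rm = TRUE))

Both functions are overloaded for ore objects, so the computations take place in the database rather than in client R memory.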

This shows us that the 10% quantile (-4) is 4 units away from the median, while the 90% quantile (26) is 26 units from the median. For a symmetric distribution, the two quantiles would be about the same distance from the median. A measure of the data spread is the interquartile range, the difference between the 25% and 75% quantiles. To allow computations on the data, we remove missing values by setting the na.rm parameter to TRUE, as we did above for the quantile function.

> with(ONTIME_S, IQR(DEPDELAY, na.rm = TRUE))
[1] 8

Other measures of data spread available are the variance and standard deviation.

Using in-database aggregation summaries, we can investigate the relationship between departure delay and origin a bit further.
We use aggregate to calculate the mean departure delay for each airport of origin. Results for the first five airports are displayed using the function head.

> ONTIME.agg <- aggregate(ONTIME_S$DEPDELAY,
+                         by = list(ONTIME_S$ORIGIN),
+                         FUN = mean)
> head(ONTIME.agg, 5)
  Group.1    x
1     ABE  216
2     ABI   29
3     ABQ 1392
4     ABY    9
5     ACK    2

Now that we have gained a basic impression and some insights into the ONTIME_S data, we may choose to view the data graphically. For example, we may want to get a visual impression of the distribution of departure delay. We use the hist function, which displays a histogram skewed to the positive side, presumably because flights rarely leave early.
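A sketch of the call (the bin count and labels are chosen for illustration):

> with(ONTIME_S, hist(DEPDELAY, breaks = 100,
+      main = "Histogram of Departure Delay", xlab = "Minutes"))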

After analyzing the data through exploratory methods in ORE, we proceed to a possible next step: confirmatory statistics.
Let's compute a Student's t-test using the origin and departure delay variables we examined earlier. The goal is to decide whether average departure delay of one airport is different from the average delay of another.

data:  DEPDELAY
D = 0.3497, p-value < 2.2e-16
alternative hypothesis: two-sided

At this point we could continue our data exploration by performing additional distribution tests, or proceed with the rich set of modeling functionality ORE offers.

ORE gives R users transparent
access to data stored in Oracle Database while leveraging Oracle Database as a
compute engine for scalability and high performance. We've only scratched the surface of ORE's statistical features - stay tuned for posts highlighting more advanced features of the statistical engine.

The Oracle R Enterprise User's Guide contains a number of examples demonstrating the functionality available in ORE. To view the documentation and training materials, visit our product page. Please feel free to visit our discussion forum and ask questions or provide comments about how we can help you and your team!

Monday Feb 20, 2012

Oracle has released the Oracle R Distribution, an Oracle-supported
distribution of open source R. This is provided as a free download from Oracle. Support for Oracle R Distribution is provided to customers of the Oracle Advanced Analytics option and Oracle Big Data Appliance. The
Oracle R Distribution facilitates enterprise acceptance of R, since the
lack of a major corporate sponsor has made some companies concerned
about fully adopting R. With the Oracle R Distribution, Oracle plans to contribute bug fixes and relevant enhancements to open source R.

Oracle has already taken responsibility for and contributed modifications to ROracle - an Oracle database interface (DBI) driver for R based on OCI. As ROracle is LGPL and used for Oracle Database connectivity from R, we are committed to ensuring this is the best package for Oracle connectivity.

Friday Feb 17, 2012

This Oracle R Enterprise (ORE) transparency layer tutorial is the first in a series to help users get started using ORE. Oracle R Enterprise is a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

Oracle R Enterprise (ORE) implements a transparency layer on top of the R engine that allows R computations to be executed in Oracle Database from the R environment. A mapping is established between a special R object called an ORE frame and a corresponding database table or view, allowing a wide range of R functionality to be executed on these mapped objects. The overloaded functions in the ORE packages generate SQL statements in the background, export the expensive computations to the database for execution, and return results to the R environment.

Here's a quick overview and an example using the airline on-time performance data from the Research and Innovative Technology Administration (RITA), which coordinates U.S. Department of Transportation (DOT) research programs. The data consists of 123 million records of U.S. domestic commercial flights between 1987 and 2008.

Configuring the R Environment

We begin by configuring the local R environment by executing a few simple commands. These commands may be saved in the .Rprofile file and executed during the initialization of each R session or typed directly into the R session.

Load the ORE library and connect to Oracle Database:

> library(ORE)
> ore.connect("USER", "SID", "HOST", "PASSWORD")

Syncing with the database synchronizes the metadata in the database schema with the R environment:

> ore.sync("SCHEMA")
> ore.ls()

Attaching the database provides access to views and tables so they can be manipulated from a local R session:

> ore.attach("SCHEMA")

Data Exploration

Travelers in the western United States may be interested in the mean arrival delay for the San Francisco, Los Angeles, and Seattle airports. Executing this R code produces a boxplot representing the delay profile for these selected airports:
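A sketch of the kind of code that produces such a plot; the destination airport codes and the boxplot call are illustrative:

> delay <- ONTIME_S[ONTIME_S$DEST %in% c("SFO", "LAX", "SEA"),
+                   c("DEST", "ARRDELAY")]
> boxplot(ARRDELAY ~ DEST, data = ore.pull(delay),
+         main = "Arrival Delay by Destination")

The filtering runs in the database; only the filtered subset is pulled to the client for plotting.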

The delay profile shows that, on average, arrival delays are greater in San Francisco than in Seattle and Los Angeles.

With this information, we proceed with fitting a linear model where arrival delay is modeled as a linear function of departure delay and destination. Fitting models in ORE requires minimal modification to traditional R syntax. Simply replace R's linear modeling function lm with ore.lm, and the remaining syntax is transparent:
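A sketch of the fit, assuming the columns ARRDELAY, DEPDELAY, and DEST in ONTIME_S:

> mod <- ore.lm(ARRDELAY ~ DEPDELAY + DEST, data = ONTIME_S)
> summary(mod)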

Many R functions compute and store more information than they report by default. Users can easily save these results to an object and extract the components they need. This allows the output of one function to be used as the input to another - a very powerful feature of the R programming environment. We will take advantage of this functionality by generating predictions from the linear model built in the database:
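As a sketch, assuming the fitted ore.lm model described above is stored in an object named mod:

> pred <- predict(mod, newdata = ONTIME_S)
> head(pred)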

Using ORE objects, users transparently remain in the R language, requiring minimal modifications to their existing R scripts. Although supported, it's not necessary to pull data from the database into R. This eliminates the need to manipulate memory-bound R objects on the user's desktop system. R users may access open-source R packages containing many standard and cutting-edge routines for data analysis.

To learn more about ORE offerings, including statistics and modeling features and advanced topics like the Oracle R Connector for Hadoop (ORCH), view the documentation and training materials on our product page. To encourage collaboration, we provide a discussion forum for ORE topics.

Thursday Feb 16, 2012

In an enterprise, databases are typically where data reside. So where
data analytics are required, it's
important for R and the database to work well together. The more
seamlessly and naturally R users can access data, the easier it is to
produce results. R users may leverage ODBC, JDBC, or similar types of
connectivity to access database-resident
data. However, this requires working with SQL to formulate queries to
process or filter data in the database,
or to pull data into the R environment for further processing using R.
If R users, statisticians, or data analysts are unfamiliar with SQL or
database tasks, or don't have database access, they often consult IT for
data extracts.

Not having direct access to database-resident
data introduces delays in
obtaining data, and can make near real-time analytics impossible. In
some instances, users request data sets much larger than required to
avoid multiple requests to IT. Of course, this approach introduces costs
of exporting, moving, and storing data, along with the associated
backup, recovery, and security risks.

Oracle R Enterprise eliminates the need to know SQL to work with database-resident data.
Through the Oracle R Enterprise transparency layer, R users can access
data stored in tables and views as virtual data frames. Base R functions
performed on these "ore.frames" are overloaded to generate SQL which is transparently sent to Oracle Database
for execution - leveraging the database as a high-performance
computational engine.

Check out Oracle R Enterprise for examples of the interface, documentation, and a link to download Oracle R Enterprise.

News Facts

Oracle R Enterprise delivers
enterprise class performance for users of the R statistical programming
language, increasing the scale of data that can be analyzed by orders of
magnitude using Oracle Database 11g.

R has attracted over two
million users since its introduction in 1995, and Oracle R Enterprise
dramatically advances capability for R users. Their existing R
development skills, tools, and scripts can now also run transparently
and scale against data stored in Oracle Database 11g.

Customer testing of Oracle R
Enterprise for Big Data analytics on Oracle Exadata has shown up to 100x
increase in performance in comparison to their current environment.

Oracle Data Mining, now part
of Oracle Advanced Analytics, helps enable customers to easily build and
deploy predictive analytic applications that help deliver new insights
into business performance.

Comprehensive In-Database Platform for Advanced Analytics

Oracle Advanced Analytics
brings analytic algorithms to data stored in Oracle Database 11g and
Oracle Exadata as opposed to the traditional approach of extracting data
to laptops or specialized servers.

With Oracle Advanced
Analytics, customers have a comprehensive platform for real-time
analytic applications that deliver insight into key business subjects
such as churn prediction, product recommendations, and fraud alerting.

By providing direct and
controlled access to data stored in Oracle Database 11g, customers can
accelerate data analyst productivity while maintaining data security
throughout the enterprise.

Powered by decades of Oracle
Database innovation, Oracle R Enterprise helps enable analysts to run a
variety of sophisticated numerical techniques on billion row data sets
in a matter of seconds, making iterative, speed-of-thought,
high-quality numerical analysis on Big Data practical.

Oracle R Enterprise
drastically reduces the time to deploy models by eliminating the need to
translate the models to other languages before they can be deployed in
production.

Oracle R Enterprise
integrates the extensive set of Oracle Database data mining algorithms,
analytics, and access to Oracle OLAP cubes into the R language for
transparent use by R users.

Oracle Data Mining provides
an extensive set of in-database data mining algorithms that solve a wide
range of business problems. These predictive models can be deployed in
Oracle Database 11g and use Oracle Exadata Smart Scan to rapidly score
huge volumes of data.

The tight integration between
R, Oracle Database 11g, and Hadoop enables R users to write one R
script that can run in three different environments: a laptop running
open source R, Hadoop running with Oracle Big Data Connectors, and Oracle Database 11g.

Supporting Quotes

“Oracle is committed to
meeting the challenges of Big Data analytics. By building upon the
analytical depth of Oracle SQL, Oracle Data Mining and the R
environment, Oracle is delivering a scalable and secure Big Data
platform to help our customers solve the toughest analytics problems,”
said Andrew Mendelsohn, senior vice president, Oracle Server
Technologies.

“We work with leading edge
customers who rely on us to deliver better BI from their Oracle
Databases. The new Oracle R Enterprise functionality allows us to
perform deep analytics on Big Data stored in Oracle Databases. By
leveraging R and its library of open source contributed CRAN packages
combined with the power and scalability of Oracle Database 11g, we can
now do that,” said Mark Rittman, co-founder, Rittman Mead.

Tuesday Feb 14, 2012

Following last week's press release, we wanted to post a series of demonstrations using Oracle R Enterprise. Stay tuned to learn more about Oracle R Enterprise (ORE),
a component in the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition.

The R programming language and environment was originally designed to hold data in memory, providing fast and efficient calculations by not
requiring the user's program to access information stored on the hard drive. However, modern data set sizes have grown faster than the
RAM available to hold them. Consequently, R users will often encounter errors similar to the following:

"cannot allocate vector of size xxx"

This error occurs because R requires the operating system to provide a block of memory large enough to contain the contents of the data file, and the operating system responds that not enough memory is available. The maximum amount of memory that can be accessed by 32-bit R is 3 GB. On 64-bit versions of R, larger objects may be created - theoretically up to 8 TB. However, the operating system imposes limitations on the resources available to a single process, and using such large objects may be unacceptably slow.

R programmers with big data sets work around memory limitations in a variety of ways. Some opt to analyze data samples, and
some divide the data into manageable batches, run jobs sequentially on a single processor, and then combine the results.
This is both costly and time-consuming. For R users who like the flexibility of the R language and the support of the R
community, the option to analyze and model large data sets in R is an exciting enhancement.

The Oracle R Enterprise framework allows R users to operate on tables and views directly from R in Oracle Database.
Instead of loading large data files into memory, the R engine processing is moved to the database, requiring minimal resources on
the user's system, regardless of the size of the data.

In this introductory series, we'll cover everything you need to know to get started with Oracle R Enterprise, including:

Part 1: The ORE transparency layer - a collection of R packages with functions to connect to Oracle Database and use R functionality in Oracle Database. This enables R users to work with data too large to fit into the memory of a user's desktop system, and leverage the scalable Oracle Database as a computational engine.

Part 2: The ORE statistics engine - a collection of statistical functions and procedures corresponding to commonly-used statistical libraries. The statistics engine packages also execute in Oracle Database.

Part 3: ORE SQL extensions supporting embedded R execution through the database on the database server. R users can execute R closures (functions) using an R or SQL API, while taking advantage of data parallelism. Using the SQL API for embedded R execution, sophisticated R graphics and results can be exposed in Oracle Business Intelligence EE dashboards and Oracle BI Publisher documents.

Part 4: Oracle R Connector for Hadoop (ORCH) - an R package that interfaces with the Hadoop Distributed File System (HDFS) and enables executing MapReduce jobs. ORCH enables R users to work directly with an Oracle Hadoop cluster, executing computations from the R environment, written in the R language and working on data resident in HDFS, Oracle Database, or local files.

But we won't stop there - expect to see posts discussing many new
features in 2012, including expanded platform support and an extended
set of analytics routines. Please come back frequently for updates that can help your organization mature in its implementation of in-database analytics.

About

The place for best practices, tips, and tricks for applying Oracle R Enterprise, Oracle R Distribution, ROracle, and Oracle R Advanced Analytics for Hadoop in both traditional and Big Data environments.