Everything about Oracle Data Mining, a component of the Oracle Advanced Analytics Option - News, Technical Information, Opinions, Tips & Tricks. All in One Place

Monday Feb 03, 2014

Data Miner provides Explorer node that produces descriptive
statistical data and histogram graph, which allows analyst to analyze input data
columns individually.Often time an analyst
is interested in analyzing the relationships among the data columns, so that he
can choose the columns that are closely correlated to the target column for
model build purpose.To examine
relationships among data columns, he can create scatter plots using the Graph
node.

For example, an analyst may want to build a regression model
that predicts the customer LTV (long term value) using the INSUR_CUST_LTV_SAMPLE
demo data.Before building the model, he
can create the following workflow with the Graph node to examine the
relationships between interested data columns and the LTV target column.

In the Graph node editor, create a scatter plot with an interested
data column (X Axis) against the LTV target column (Y Axis).For the demo, let’s create three scatter
plots using these data columns: HOUSE_OWNERSHIP, N_MORTGAGES, and
MORTGAGE_AMOUNT.

Here are the scatter plots generated by the Graph node.As you can see the HOUSE_OWNERSHIP and
N_MORTGAGES are quite positively correlated to the LTV target column.However, the MORTGAGE_AMOUNT seems less
correlated to the LTV target column.

The problem with the above approach is it is laborious to
create scatter plots one by one and you cannot examine relationships among
those data columns themselves.To solve
the problem, we can create a Scatterplot matrix graph as the following:

This is a 4 x4 scatterplot matrix of data column LTV,
HOUSE_OWNERSHIP, N_MORTGAGES, and MORTGAGE_AMOUNT.In the top row, you can examine the relationships
between HOUSE_OWNERSHIP, N_MORTGAGES, and MORTGAGE_AMOUNT against the LTV
target column.In the second row, you
can examine the relationships between LTV, N_MORTGAGES, and MORTGAGE_AMOUNT
against the HOUSE_OWNERSHIP column.In
the third and forth rows, you can examine the relationships of other columns
against the N_MORTGAGES, and MORTGAGE_AMOUNT respectively.

Let’s create the following workflow with the SQL Query node
to invoke the R script.Note: a Sample
node may be needed to sample down the data size (e.g. 1000 rows) for large data
set before it is used for charting.

Enter the following SQL statement in the SQL Query
editor.The rqTableEval is a R SQL
function that allows user to invoke R script from the SQL side.The first SELECT statement within the
function specifies the input data (LTV, HOUSE_OWNERSHIP, N_MORTGAGES, and
MORTGAGE_AMOUNT).The second SELECT
statement specifies the optional parameter to the R script, where we define the
graph title “Scatterplot Matrices”.The
output of the function is an XML document with the graph data embedded in it.

You can see what default R scripts are available in the R
Scripts tab.This tab is visible only
when the Oracle R Enterprise installation is detected.

Click the button
in the toolbar to invoke the R script to produce the Scatterplot matrix below.

You can copy the Scatterplot matrix image to a clipboard or
save it to an image file (PNG) for reporting purpose.To do so, right click on the graph to bring up
the pop-up menu below.

The Scatterplot matrix is also available in the Data Viewer
of the SQL Query node.To open the Data
Viewer, select the “View Data” item in the pop-up menu of the node.

The returning XML data is shown in the Data Viewer as shown
below.To view the Scatterplot matrix
embedded in the data, click on the XML data to bring up the icon
in the far right of the cell, and then click on the icon to bring up the viewer.