Better Decisions === Faster Stats

Data set

The Teradata add-on package for R

teradataR is a package (library) that allows R users to easily connect to Teradata, establish data frames (R data formats) to Teradata, and call in-database analytic functions within Teradata. This allows R users to work within their R console environment while leveraging the in-database functions developed with Teradata Warehouse Miner. The package provides 44 analytical functions and an additional 20 data connection and R infrastructure functions. In addition, we’ve added a function that will list the stored procedures within Teradata and provide the capability to call functions from R.

This package allows users of R to interact with a Teradata database. R is an open source language for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible. Users can use many statistical functions directly against the Teradata system without having to extract the data into memory.

Enhancements in this new 1.0.1 release include:

teradataR User Guide

addition of Mac OS X Package

addition of Red Hat Linux Package (added 2/23/12)

summary has been enhanced to run faster

JDBC support added to allow Windows or Mac users to run the package with JDBC

td.data.frame enhanced to allow support for manipulation to add columns and expressions

A new R package for Red Hat Linux has been added to the teradataR 1.0.1 release. This new package provides the same functionality as in the previously released Windows and Mac OS X packages, but is built for Red Hat Linux. This version was built and tested on Red Hat Linux 6.2 32-bit. (The R version for Red Hat Linux is 2.14.1)

Installing this package is the same as installing any normal R package: just extract it into your R library area, or use the install.packages command with the file path.

With plenty of prolific and enthusiastic developers, the number of packages for R is expected to grow tremendously. Statisticians and analysts using these packages will find innovative ways to use data to answer their research and business questions. And as organizations become more willing to rely on open-source software for mission-critical tasks, R is poised to become an essential tool for analyzing our complex world.

A description of some of the basics of decision trees. Simple, with hardly any math. I like the plots explaining the basic idea of entropy as a splitting criterion (although we actually calculate gain ratio differently than explained…)

Same as above, but this time for modeling with logistic regression.
Easy to read and covering all basic ideas together with some examples. If you are not familiar with the topic yet, part 1 (see below) might help.
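The entropy-based splitting idea mentioned for decision trees above can be illustrated with a small sketch. This uses the textbook definitions of entropy, information gain and gain ratio (gain divided by the entropy of the split itself), which, as noted, is not necessarily how the linked post or any particular tool computes it. The toy data and all function names are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on the attribute `values`."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted average entropy of the partitions produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by the split information (textbook definition)."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split anything
    return information_gain(values, labels) / split_info


# Hypothetical toy data: `outlook` separates the classes perfectly, `windy` not at all.
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

print(gain_ratio(outlook, play), gain_ratio(windy, play))
```

An attribute with a higher gain ratio is the better candidate for the next split, so in this toy example a tree would split on `outlook` first.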

e-LICO Architecture and Components

The goal of the e-LICO project is to build a virtual laboratory for interdisciplinary collaborative research in data mining and data-intensive sciences. The proposed e-lab will comprise three layers: the e-science and data mining layers will form a generic research environment that can be adapted to different scientific domains by customizing the application layer.

Drag a data set into one of the slots. It will be automatically detected as training data, test data or apply data, depending on whether it has a label or not.

Select a goal. The most frequent one is probably “Predictive Modelling”. All goals have comments, so you see what they can be used for.

Select “Fetch plans” and wait a bit to get a list of processes that solve your problem. Once the planning completes, select one of the processes (you can see a preview at the right) and run it. Alternatively, select multiple (selecting none means selecting all) and evaluate them on your data in a batch.

The assistant strives to generate processes that are compatible with your data. To do so, it performs a lot of clever operations: for example, it automatically replaces missing values when they exist and the learning algorithm requires it, and it normalizes the data when a distance-based learner is used.
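As a rough illustration of the kind of conditional preprocessing described above, here is a minimal sketch. It is not the assistant's actual planning logic: mean imputation and min-max scaling are assumed stand-ins for "replace missing values" and "normalization", and the function and parameter names are hypothetical.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]


def min_max_normalize(column):
    """Scale observed values to [0, 1]; missing entries (None) are left untouched."""
    observed = [x for x in column if x is not None]
    lo, hi = min(observed), max(observed)
    if hi == lo:
        return [None if x is None else 0.0 for x in column]
    return [None if x is None else (x - lo) / (hi - lo) for x in column]


def prepare(column, needs_complete_data, distance_based):
    """Apply only the preprocessing steps that the data and the learner call for."""
    if needs_complete_data and any(x is None for x in column):
        column = impute_mean(column)
    if distance_based:
        column = min_max_normalize(column)
    return column


# Hypothetical numeric attribute with one missing value.
print(prepare([1.0, None, 3.0], needs_complete_data=True, distance_based=True))
```

A learner that tolerates missing values and does not rely on distances would receive the column unchanged.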

You can install the extension directly by using the Rapid-I Marketplace instead of the old update server. Just go to the preferences and enter http://rapidupdate.de:8180/UpdateServer as the update URL

Of course, Rapid Miner has been one of the most professional open source analytics companies, and they have been doing it for a long time now. I am particularly impressed by the product map (see below) and the graphical user interface.

JMP Add-in: Multidimensional Scaling using R

This add-in creates a new menu command under the Add-Ins Menu in the submenu R Add-ins. The script will launch a custom dialog (or prompt for a JMP data table if one is not already open) where you can cast columns into roles for performing MDS on the data table. The analysis results in a data table of MDS dimensions and associated output graphics. MDS is a dimension reduction method that produces coordinates in Euclidean space (usually 2D or 3D) that best represent the structure of a full distance/dissimilarity matrix. MDS requires that the input be a symmetric dissimilarity matrix. Input to this application can either be data that is already in the form of a symmetric dissimilarity matrix, or the dissimilarity matrix can be computed from the input data (where dissimilarity measures are calculated between rows of the input data table in R).
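To make the input requirement concrete, the following sketch builds a Euclidean dissimilarity matrix from the rows of a small table and checks that it is symmetric with a zero diagonal. It only illustrates the preparation step; the add-in itself does this work in R, the MDS computation is not shown, and the function names and sample rows are hypothetical.

```python
import math


def dissimilarity_matrix(rows):
    """Pairwise Euclidean distances between the rows of a numeric table."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def is_symmetric_dissimilarity(matrix, tol=1e-9):
    """Check that a matrix is square, non-negative, symmetric, with a zero diagonal."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False
    for i in range(n):
        if abs(matrix[i][i]) > tol:
            return False
        for j in range(i + 1, n):
            if matrix[i][j] < 0 or abs(matrix[i][j] - matrix[j][i]) > tol:
                return False
    return True


# Hypothetical data table with three observations and two numeric columns.
rows = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = dissimilarity_matrix(rows)
print(d, is_symmetric_dissimilarity(d))
```

A matrix that fails this check (for example, one with asymmetric entries) would need to be corrected or recomputed from the raw data before being passed to MDS.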

Chernoff Faces Add-in

One way to plot multivariate data is to use Chernoff faces. For each observation in your data table, a face is drawn such that each variable in your data set is represented by a feature in the face. This add-in uses JMP’s R integration functionality to create Chernoff faces. An R install and the TeachingDemos R package are required to use this add-in.

Support Vector Machine for Classification

By simply opening a data table, specifying the X and Y variables, selecting a kernel function, and specifying its parameters in the user-friendly dialog, you can build a classification model using a Support Vector Machine. Please note that the R package ‘e1071’ should be installed before running this dialog. The package can be found at http://cran.r-project.org/web/packages/e1071/index.html.

Penalized Regression Add-in

This add-in uses JMP’s R integration functionality to provide access to several penalized regression methods. Methods included are the LASSO (least absolute shrinkage and selection operator), LARS (least angle regression), Forward Stagewise, and the Elastic Net. An R install and the “lars” and “elasticnet” R packages are required to use this add-in.

JMP Add-in: Univariate Nonparametric Bootstrapping

This script performs simple univariate, nonparametric bootstrap sampling by using the JMP to R Project integration. A JMP Dialog is built by the script where the variable you wish to perform bootstrapping over can be specified. A statistic to compute for each bootstrap sample is chosen and the data are sent to R using new JSL functionality available in JMP 9. The boot package in R is used to call the boot() function and the boot.ci() function to calculate the sample statistic for each bootstrap sample and the basic bootstrap confidence interval. The results are brought back to JMP and displayed using the JMP Distribution platform.
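The resampling procedure described above can be sketched in a few lines. This is an illustration of nonparametric bootstrapping and the basic bootstrap confidence interval in general, not a port of the add-in or of the R boot package: the quantile rule is simple linear interpolation (boot.ci uses its own rule, so its limits may differ slightly), and the sample data, function names and defaults are assumptions.

```python
import random
import statistics


def bootstrap_statistics(data, statistic, n_resamples=2000, seed=0):
    """Compute `statistic` on resamples drawn from `data` with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]


def quantile(sorted_values, q):
    """Quantile of pre-sorted values using linear interpolation."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def basic_bootstrap_ci(data, statistic, confidence=0.95, n_resamples=2000, seed=0):
    """Return the observed statistic and the basic bootstrap interval around it."""
    observed = statistic(data)
    replicates = sorted(bootstrap_statistics(data, statistic, n_resamples, seed))
    alpha = 1 - confidence
    lower_q = quantile(replicates, alpha / 2)
    upper_q = quantile(replicates, 1 - alpha / 2)
    # Basic interval: reflect the replicate quantiles around the observed value.
    return observed, (2 * observed - upper_q, 2 * observed - lower_q)


# Hypothetical sample; the statistic here is the mean.
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9]
observed, (low, high) = basic_bootstrap_ci(sample, statistics.mean)
print(observed, low, high)
```

Any function that maps a list of numbers to a single value (for example `statistics.median`) can be passed as the statistic, and fixing the seed keeps the interval reproducible between runs.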