
Great news! I am scheduled to teach a brand new class at UCLA Extension this coming Winter 2014 quarter, starting January 8. The class description is in the UNEX course catalog. As part of the course materials, I am providing all the source code from the required textbook, “The Art of R Programming” by Norman Matloff, via Dropbox. You’ll find a series of folders, one for each chapter, containing the R scripts for the examples found in the book, plus a few data files that were referenced in the book but never provided. Unfortunately, the author did not provide most of the sample source code on the publisher’s website for the book, so rather than have my students type in all the code, I’ve provided an annotated version to help the learning process. Feel free to grab the scripts even if you’re not taking my class.

Additionally, Matloff has made a 2009 draft of the book available on his personal website, so feel free to download it from there (PDF).

The field of data science is heating up fast. The following list of educational resources will let you join the data revolution by getting up to speed with data science.

Data science, along with machine learning, the driving force behind it, is the process of deriving added value from data assets. Commerce and research are being transformed by data-driven discovery and prediction. The skills required for data analytics at massive scale span a variety of disciplines and are not easy to obtain through conventional curricula. These include algorithms for machine learning (e.g., neural networks and clustering), parallel algorithms, basic statistical modeling (e.g., logistic regression and linear/non-linear regression), and proficiency with a complex ecosystem of tools and platforms.

Meetup groups

A good place to start is with meetup groups. Two of my favorite data science groups deal with the primary ingredients of data science work: R, which is the programming environment of choice for building algorithms, and machine learning. The LA area R user group is excellent; try to find one near you. The LA Machine Learning group has regular meetings that are extremely useful.

Open Courseware

The Massive Open Online Course (MOOC) movement is very active in the data science space and constitutes a superb educational resource. These free courses (some offer certifications) offer an excellent path toward obtaining the requisite background for becoming a data scientist. I’ve put together a Radical Data Science “pseudo degree program” for you to follow.

As interest in data science continues to grow, and as the shortage of talent becomes apparent, the timing is excellent to retool yourself and climb aboard the data science gravy train. If you know of any other good educational resources for data science and machine learning, please leave a note for all of us.

This article is the first in a series on using R to answer typical business questions with data science. The question answered here is: what is prediction and how does it work? To demonstrate the concept, I’ve chosen one of the oldest statistical techniques used for prediction: linear regression. Our example will use a widely available data source along with the R statistical language to show how it’s done. There’s no room here to give a class on statistics or R, but I’ll provide a couple of resources for educating yourself on each.

Our test case will involve all the steps a data scientist commonly follows when building a prediction model: finding an appropriate data set, doing a little exploratory analysis to fully understand the data, defining a linear model, fitting a regression line, making plots, and using the model for prediction. The only thing we won’t do here is “data munging,” the often tedious exercise of massaging a dirty data source into a clean version for machine learning purposes. For our example, we’ll use an already-clean data set.

Here is the R code required to install the UsingR package, which provides the galton data set, and load the data:

install.packages("UsingR")
library(UsingR)
data(galton)

The data set consists of 928 pairs of child and parent heights in inches. For unfamiliar data sets you typically explore the data with one or more of the following R commands (although there are many more ways to explore data sets in R):

head(galton) # Show first few records
tail(galton) # Show last few records
summary(galton) # Statistical summary
table(galton) # Cross-tabulate child and parent heights
hist(galton$child) # A visual distribution of values

We can also draw a simple scatter plot of child height versus parent height. This plot exhibits an ideal case for using regression: a cloud-shaped collection of data points.

plot(galton$parent, galton$child, pch=19, col="blue")

Fitting a Linear Model

Next, we can fit a line to the Galton data using a simple linear model. In R this is very easy to do: just use the lm function, where parent is the explanatory (independent) variable and child is the response variable (the one to be predicted).

lm1 <- lm(child ~ parent, data = galton)

If you display the contents of the lm1 model variable (just type lm1 at the RStudio command prompt), you’ll see it contains a number of items, including the coefficients of the linear model: the intercept 23.942 and the slope 0.646. These values are stored in the two-element vector lm1$coeff. The slope means that for each additional inch of parent height, the predicted child height increases by 0.646 inch. In other words, think back to high school algebra class when you graphed a line: child = 23.942 + 0.646 × parent. It is as simple as that!
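To make the connection to the line equation concrete, here is a minimal sketch that pulls the two coefficients out of a fitted model and rebuilds the fitted values by hand. So that it runs without the UsingR package, it fits the model to simulated stand-in data: the sim data frame, the seed, and the noise level are assumptions made up for illustration, not Galton’s measurements. With the real data you would use galton in place of sim.

```r
# Simulated stand-in for the Galton data (an assumption for illustration only).
set.seed(42)
parent <- round(rnorm(200, mean = 68, sd = 1.8), 1)
child  <- 24 + 0.65 * parent + rnorm(200, mean = 0, sd = 2.2)
sim    <- data.frame(parent = parent, child = child)

# Fit the simple linear model: child = intercept + slope * parent
fit <- lm(child ~ parent, data = sim)

# coef() returns a named numeric vector with two elements
b         <- coef(fit)
intercept <- b[["(Intercept)"]]
slope     <- b[["parent"]]

# Rebuild the fitted values by hand from the line equation y = b + m * x
by_hand <- intercept + slope * sim$parent

# Compare the hand-built line with the fitted values stored by lm()
print(all.equal(unname(fitted(fit)), by_hand))
```

Using coef(fit) instead of fit$coeff avoids relying on R’s partial matching of list names, but both return the same vector.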

Now let’s complete the picture by adding the regression line to the plot using the vector of fitted values found in lm1$fitted:

lines(galton$parent, lm1$fitted, col="red", lwd=3)

You can also plot the residual values (response values minus fitted values) using:

plot(galton$parent, lm1$residuals, pch=19, col="blue")
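As a quick check on what the residuals hold, the following sketch confirms that they are the observed responses minus the fitted values, and that with an intercept in the model they sum to zero up to floating-point error. It uses simulated stand-in data (the sim data frame and its parameters are assumptions for illustration, not the Galton data) so that it runs on its own.

```r
# Simulated stand-in for the Galton data (an assumption for illustration only).
set.seed(1)
sim <- data.frame(parent = rnorm(150, mean = 68, sd = 1.8))
sim$child <- 24 + 0.65 * sim$parent + rnorm(150, sd = 2.2)

fit <- lm(child ~ parent, data = sim)

# Residual = observed response minus fitted value
manual_resid <- sim$child - fitted(fit)
same <- isTRUE(all.equal(unname(manual_resid), unname(resid(fit))))

# With an intercept term, least-squares residuals sum to (numerically) zero
resid_sum <- sum(resid(fit))

# Residual standard error: the typical size of a miss, in the units of child
rse <- sigma(fit)

print(same)
print(rse)
```

A residual plot with no visible pattern around zero is a sign that a straight line is a reasonable description of the data.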

Using the Model to Predict

Now you can use the model to make predictions using new data. For example, say you have a parent height of 80 (which is outside the range of the data set we used to train the model) and you want to predict child height. Just use the coefficients of the model:

lm1$coeff[1] + lm1$coeff[2] * 80

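The predicted child height is about 75.6 inches. Because 80 inches lies outside the range of the training data, treat this extrapolation with caution. A larger training set generally gives more precise coefficient estimates, so refit the linear model as more data becomes available.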
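R’s predict function offers an alternative to multiplying out the coefficients yourself, and it can also report a prediction interval. The sketch below is a hedged illustration on simulated stand-in data (sim, its seed, and its parameters are assumptions, not the Galton data). Note that predict can only match new data by column name when the model was fitted with a formula on column names plus a data argument, as done here.

```r
# Simulated stand-in for the Galton data (an assumption for illustration only).
set.seed(7)
sim <- data.frame(parent = rnorm(300, mean = 68, sd = 1.8))
sim$child <- 24 + 0.65 * sim$parent + rnorm(300, sd = 2.2)

# Naming the columns in the formula (rather than writing sim$...) lets
# predict() match them against the columns of newdata.
fit <- lm(child ~ parent, data = sim)

new_parent <- data.frame(parent = 80)

# Point prediction straight from the coefficients
by_hand <- coef(fit)[[1]] + coef(fit)[[2]] * 80

# Same point prediction via predict(), plus a 95% prediction interval
pred <- predict(fit, newdata = new_parent,
                interval = "prediction", level = 0.95)
print(pred)

# TRUE when 80 lies outside the range of parent heights used for fitting
extrapolating <- 80 < min(sim$parent) || 80 > max(sim$parent)
print(extrapolating)
```

The fit column of the result matches the hand calculation, while lwr and upr give a sense of how uncertain a single predicted height is.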

Uses of Linear Regression

Businesses use linear regression forecasting techniques in a multitude of ways. For example, you can model Google pay-per-click advertising costs versus sales; just download your cost data from Google AdWords and build an Excel spreadsheet that matches costs to sales by time period. Another example might be using the linear model to predict productivity gains resulting from a training program. Uses of linear regression are very broad: manufacturing, supply chain, real estate, financial sector, and much more.

Educational Resources

To supplement this article, here are a couple of excellent resources for you to learn more about general statistics and R.

Here is a new webinar from Revolution Analytics that introduces the use of the R statistical programming environment for data mining. Presented by Joe Rickert, the seminar demos several examples of data mining with various R packages. Rickert’s slides are also available for download. Enjoy!