What is correlation?
From Wikipedia:
In statistics, dependence refers to any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence.
In layman's terms, correlation is a relationship between data attributes. For a quick refresher, in data mining, a dataset is made up of different attributes. We use these attributes to classify or predict a label. Some attributes have more "meaning" or influence over the label's value. As you can imagine, if you can determine the influence that specific attributes have over the label, you are in a better position to build a classification model because you will know which attributes to focus on.
In this example, I will use the kaggle.com Titanic data mining challenge dataset. This post will not uncover any information that is not readily available in the tutorial posted on kaggle.com.
Here are two screenshots. The first screenshot will show you some statistics about the dataset. The second screenshot will show a sample of the data.
Meta data view of the Titanic data mining challenge Training dataset
A data view of the dataset
The correlation matrix
Start by importing the Titanic training dataset into RapidMiner. You can use Read From CSV, Read From Excel, or Read from Database to achieve this step. Next, search for the "Correlation Matrix" operator and drag it onto the process surface. Connect the output port of the Titanic training dataset to the example input port of the Correlation Matrix operator. Your process should look like this.
Now run the process and observe the output.
You are presented with several result views. The first is the Correlation Matrix Attribute Weights view, which displays the "weight" of each attribute. The purpose of this tutorial is to explain a different view, so click on the Correlation Matrix view. This is a matrix of correlation coefficients, each of which measures the strength of the relationship between two of our attributes. An easy way to get started with the matrix is to notice that where an attribute intersects with itself, you have a dark blue cell with the value 1, which is the strongest possible value. This is because any attribute matched with itself is a perfect correlation. A correlation coefficient can be positive or negative. The sign tells you the direction of the relationship, not its strength: a negative value means that one attribute tends to decrease as the other increases. The further the coefficient is from 0 in either direction, the stronger the relationship between those two attributes. If we follow the top row of the matrix (survived), we will see which attributes have the strongest correlation with the label that we are trying to predict.
Just as the kaggle.com tutorial specifies, the attributes with the strongest correlation with the label (survived) are
sex(0.295), pclass(0.115), and fare(0.66)
Remember that the value as well as the color will help you to visually identify the stronger correlation between attributes.
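If you would like to see what sits behind a single cell of that matrix, here is a minimal sketch in C#. It assumes the cells are Pearson correlation coefficients, which is my understanding of what the operator reports but not something this post verifies. The class and method names (CorrelationSketch, Pearson, Matrix) are made up for illustration and are not part of RapidMiner.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of RapidMiner.
public static class CorrelationSketch
{
    // Pearson correlation coefficient between two equally long numeric columns.
    public static double Pearson(IList<double> x, IList<double> y)
    {
        if (x == null) throw new ArgumentNullException("x");
        if (y == null) throw new ArgumentNullException("y");
        if (x.Count != y.Count || x.Count < 2)
            throw new ArgumentException("Columns must have the same length (at least 2).");

        double meanX = x.Average();
        double meanY = y.Average();

        double covariance = 0.0, varianceX = 0.0, varianceY = 0.0;
        for (int i = 0; i < x.Count; i++)
        {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }

        // A constant column has no variance, so the coefficient is undefined.
        if (varianceX == 0.0 || varianceY == 0.0) return double.NaN;

        return covariance / Math.Sqrt(varianceX * varianceY);
    }

    // Full matrix: cell [i, j] holds the coefficient between column i and column j.
    public static double[,] Matrix(IList<IList<double>> columns)
    {
        int n = columns.Count;
        var matrix = new double[n, n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                matrix[i, j] = Pearson(columns[i], columns[j]);
        return matrix;
    }
}
```

Passing the same column twice gives a coefficient of 1 (the dark blue diagonal), and a column paired with its own reversal in a steadily increasing series gives -1. Non-numeric attributes such as sex would have to be encoded as numbers first. How RapidMiner does that encoding is not covered here.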
If you are working with a classification problem, I'm sure you can see how valuable the correlation matrix can be in showing you the relationships between your label and attributes. Such insights can provide a great start on where to focus your attention when building your classification model.
Thanks for reading and keep your eyes open for my next tutorial!

Greetings! And welcome to another wam bam, thank you ma'am, mind blowing, flex showing, machine learning tutorial here at refactorthis.net!
This tutorial is based on a machine learning toolkit called RapidMiner by Rapid-I. RapidMiner is a full-featured, Java-based, open source machine learning toolkit with support for all of the popular machine learning algorithms used in data analytics today. The library supports the following machine learning algorithms (to name a few):
k-NN
Naive Bayes (kernel)
Decision Tree (Weight-based, Multiway)
Decision Stump
Random Tree
Random Forest
Neural Networks
Perceptron
Linear Regression
Polynomial Regression
Vector Linear Regression
Gaussian Process
Support Vector Machine (Linear, Evolutionary, PSO)
Additive Regression
Relative Regression
k-Means (kernel, fast)
And much much more!!
Excited yet? I thought so!
How to create a decision tree using RapidMiner
When I first ran across screen shots of RapidMiner online, I thought to myself, "Oh boy.. I wonder how much this is going to cost...". The UI looked so amazing. It's like Visual Studio for Data Mining and Machine learning! Much to my surprise, I found out that the application is open source and free!
Here is a quote from the RapidMiner site:
RapidMiner is unquestionably the world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for the integration into own products. Thousands of applications of RapidMiner in more than 40 countries give their users a competitive edge.
I've been trying some machine learning "challenges" recently to sharpen my skills as a data scientist, and I decided to use RapidMiner to tackle the kaggle.com machine learning challenge called "Titanic: Machine Learning from Disaster". The data set is a CSV file that contains information on many of the passengers of the infamous Titanic voyage. The goal of the challenge is to take one CSV file containing training data (the training data contains all attributes as well as the label Survived) and a testing data file containing only the attributes (no Survived label) and to predict the Survived label of the testing set based on the training set.
Warning: Although I'm not going to provide the complete solution to this challenge, I warn you, if you are working on this challenge, then you should probably stop reading this tutorial. I do provide some insights into the survival data found in the training data set. It's best to try to work the challenge out on your own. After all, we learn by TRYING, FAILING, TRYING AGAIN, THEN SUCCEEDING. I'd also like to say that I'm going to do my very best to go easy on the THEORY of this post.. I know that some of my readers like to get straight to the action :) You have been warned..
Why a decision tree?
A decision tree model is a great way to visualize a data set to determine which attributes of a data set influenced a particular classification (label). A decision tree looks like a tree with branches, flipped upside down.. Perhaps a (cheesy) image will illustrate..
After you are finished laughing at my drawing, we may proceed....... OK
In my example, imagine that we have a data set that relates lifestyle to heart disease. Each row has a person, their sex, age, Smoker (y/n), Diet (good/poor), and a label Risk (Less Risk/More Risk). The data indicates that the biggest influence on Risk turns out to be the Smoker attribute, so Smoker becomes the first branch in our tree. For smokers, the next most influential attribute happens to be Age. For non-smokers, however, the data indicates that their diet has a bigger influence on the risk. The tree keeps branching until a classification is reached or the maximum "depth" that we establish is reached. So as you can see, a decision tree can be a great way to visualize how a decision is derived based on the attributes in your data.
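To make "biggest influence" a little less hand-wavy, here is a rough sketch of one common way to score attributes: information gain, which is the drop in label entropy after splitting on an attribute. This is only an illustration. RapidMiner's Decision Tree operator offers several splitting criteria and I am not claiming this is the one it uses. The rows are made up to mirror the heart disease example, and SplitSketch and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not RapidMiner's implementation.
public static class SplitSketch
{
    // Shannon entropy (in bits) of a collection of class labels.
    public static double Entropy(IEnumerable<string> labels)
    {
        var list = labels.ToList();
        if (list.Count == 0) return 0.0;
        return list.GroupBy(l => l)
                   .Select(g => (double)g.Count() / list.Count)
                   .Sum(p => -p * Math.Log(p, 2));
    }

    // Information gain of one attribute: entropy before the split
    // minus the size-weighted entropy of each branch after the split.
    public static double InformationGain(
        IList<Dictionary<string, string>> rows, string attribute, string label)
    {
        double before = Entropy(rows.Select(r => r[label]));
        double after = rows.GroupBy(r => r[attribute])
                           .Sum(g => (double)g.Count() / rows.Count
                                     * Entropy(g.Select(r => r[label])));
        return before - after;
    }

    // The attribute with the highest gain becomes the next branch.
    public static string BestAttribute(
        IList<Dictionary<string, string>> rows, IEnumerable<string> attributes, string label)
    {
        return attributes.OrderByDescending(a => InformationGain(rows, a, label)).First();
    }

    // Made-up rows in the spirit of the heart disease example above.
    public static List<Dictionary<string, string>> SampleRows()
    {
        return new List<Dictionary<string, string>>
        {
            Row("y", "good", "More Risk"),
            Row("y", "poor", "More Risk"),
            Row("y", "good", "More Risk"),
            Row("n", "good", "Less Risk"),
            Row("n", "poor", "More Risk"),
            Row("n", "good", "Less Risk")
        };
    }

    private static Dictionary<string, string> Row(string smoker, string diet, string risk)
    {
        return new Dictionary<string, string>
        {
            { "Smoker", smoker }, { "Diet", diet }, { "Risk", risk }
        };
    }
}
```

With these sample rows, BestAttribute picks Smoker over Diet, which matches the first branch of the drawing. Running the same scoring on just the non-smoker rows is how the tree would decide what to branch on next.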
RapidMiner and data modeling
Ready to see how easy it is to create a prediction model using RapidMiner? I thought so!
Create a new process
When you are working in RapidMiner, your project is known as a process. So we will start by running RapidMiner and creating a new process.
The version of RapidMiner used in this tutorial is version 5.3. Once the application is open, you will be presented with the following start screen.
From this screen you will click on New Process
You are presented with the main user interface for RapidMiner. One of the most compelling aspects of RapidMiner is its ease of use and intuitive user interface. The basic flow of this process is as follows:
Import your test and training data from CSV files into your RapidMiner repository. This can be found in the repository menu under Import CSV file
Once your data has been imported into your repository, the datasets can be dragged onto your process surface for you to apply operators
You will add your training data to the process
Next, you will add your testing data to the process
Search the operators for Decision Tree and add the operator
In order to use your training data to generate a prediction on your testing data using the Decision Tree model, you will add an "Apply Model" operator to the process. This operator has an input that you will connect to the output model of your Decision Tree operator. There is also an input that takes the "unlabeled" data from the output of your testing dataset.
You will attach the outputs of Apply Model to the results connectors on the right side of the process surface.
Once you have designed your model, RapidMiner will show you any problems with your process and will offer "Quick fixes", if any exist, that you can double click to resolve.
Once all problems have been resolved, you can run your process and you will see the results that you wired up to the results side of the process surface.
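If it helps to think of that flow in code, here is a tiny sketch of the same two stages: learn a model from labeled training rows, then apply it to unlabeled rows. To keep it short, the "tree" is only one level deep (a decision stump) and the attribute to split on is passed in by hand. StumpModel, Train, and Apply are names I made up for this sketch and have nothing to do with RapidMiner's internals.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical one-level "tree": predicts the majority label seen
// in training for each value of a single attribute.
public sealed class StumpModel
{
    private readonly string attribute;
    private readonly Dictionary<string, string> labelByValue;
    private readonly string fallbackLabel;

    private StumpModel(string attribute, Dictionary<string, string> labelByValue, string fallbackLabel)
    {
        this.attribute = attribute;
        this.labelByValue = labelByValue;
        this.fallbackLabel = fallbackLabel;
    }

    // "Decision Tree" step: learn from rows that carry the label.
    public static StumpModel Train(
        IList<Dictionary<string, string>> trainingRows, string attribute, string label)
    {
        if (trainingRows == null || trainingRows.Count == 0)
            throw new ArgumentException("At least one training row is required.");

        var byValue = trainingRows
            .GroupBy(r => r[attribute])
            .ToDictionary(g => g.Key, g => MajorityLabel(g.Select(r => r[label])));

        // Used when an unlabeled row has an attribute value never seen in training.
        string fallback = MajorityLabel(trainingRows.Select(r => r[label]));

        return new StumpModel(attribute, byValue, fallback);
    }

    // "Apply Model" step: predict a label for a row that has none.
    public string Apply(Dictionary<string, string> unlabeledRow)
    {
        string predicted;
        return labelByValue.TryGetValue(unlabeledRow[attribute], out predicted)
            ? predicted
            : fallbackLabel;
    }

    private static string MajorityLabel(IEnumerable<string> labels)
    {
        return labels.GroupBy(l => l).OrderByDescending(g => g.Count()).First().Key;
    }
}
```

The training rows play the part of the training dataset wired into the Decision Tree operator. Each call to Apply plays the part of one testing row flowing through Apply Model.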
Here are screenshots of the entire process for your review
Empty Process
Add the training data from the repository by dragging and dropping the dataset that you imported from your CSV file
Repeat the process and add the testing data underneath the training data
Now you can search in the operators window for Decision Tree operator. Add it to your process.
The way that you associate the inputs and outputs of operators and data sets is by clicking on the output of one item and connecting it by clicking on the input of another item. Here we are connecting the output of the training dataset to the input of the Decision Tree operator.
Next we will add the Apply model operator
Then we will create the appropriate connections for the model
Observe the quick fixes in the problems window at the bottom.. you can double click the quick fixes to resolve the issues.
You will be prompted to make a simple decision regarding the problem that was detected. Once you resolve one problem, other problems may appear. Be sure to resolve all problems so that you can run your process.
Here is the process after resolving all problems.
Next, I select the decision tree operator and I adjust the following parameters:
Maximum Depth: change from 20 to 5.
Check both boxes to make sure that the tree is not "pruned".
Once this has been done, you can Run your process and observe the results. Since we connected both the model as well as the labeled result to the output connectors of the process, we are presented with a visual display of our Decision Tree (model) as well as the Test data set with the prediction applied.
(Decision Tree Model)
(The example test result set with the predictions applied)
As you can see, RapidMiner lets you carry out complex data analysis and machine learning tasks with very little effort.
This concludes my tutorial on creating Decision Trees in RapidMiner.
Until next time,
Buddy James

In one of my previous posts called Machine learning resources for .NET developers, I introduced a machine learning library called numl.net. numl.net is a machine learning library for .NET created by Seth Juarez. You can find the library here and Seth's blog here. When I began researching the library, I learned quickly that one of Seth's goals in writing numl.net was to abstract away the complexities that stop many software developers from trying their hand at machine learning. I must say that in my opinion, he has done a wonderful job in accomplishing this goal!
Tutorial
I've decided to throw together a small tutorial to show you just how easy it is to use numl.net to perform predictions. This tutorial will use supervised learning by way of a decision tree. I will use the famous Iris data set, which contains data on three different types of Iris flowers and the measurements that define them. Before we get into code, let's look at some basic terminology.
With numl.net you create a POCO (plain old CLR object) to use for training as well as predictions. Some properties hold known values (features), which are used to predict the value of an unknown property (the label). numl.net makes identifying features and labels easy: you simply mark your properties with the [Feature] attribute or the [Label] attribute (there is also a [StringLabel] attribute). Here is an example of the Iris class that we will use in this tutorial.
using numl.Model;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace NumlDemo
{
    /// <summary>
    /// Represents an Iris in the famous Iris classification dataset (Fisher, 1936)
    /// Each feature property will be used for training as well as prediction. The label
    /// property is the value to be predicted. In this case, it's which type of Iris we are dealing with.
    /// </summary>
    public class Iris
    {
        //Length in centimeters
        [Feature]
        public double SepalLength { get; set; }

        //Width in centimeters
        [Feature]
        public double SepalWidth { get; set; }

        //Length in centimeters
        [Feature]
        public double PetalLength { get; set; }

        //Width in centimeters
        [Feature]
        public double PetalWidth { get; set; }

        //-- Iris Setosa
        //-- Iris Versicolour
        //-- Iris Virginica
        public enum IrisTypes
        {
            IrisSetosa,
            IrisVersicolour,
            IrisVirginica
        }

        //This is the label or value that we wish to predict based on the supplied features
        [Label]
        public IrisTypes IrisClass { get; set; }
    }
}
As you can see, we have a simple POCO Iris class, which defines four features and one label. The Iris training data can be found here. Here is an example of the data found in the file.
5.1,3.5,1.4,0.2,Iris-setosa
6.3,2.5,4.9,1.5,Iris-versicolor
6.0,3.0,4.8,1.8,Iris-virginica
The first four values are doubles that represent the features Sepal Length, Sepal Width, Petal Length, and Petal Width. The final value is the name of the Iris class, which we map to the enum that serves as the label we will predict.
We have the Iris class, so now we need a method to parse the training data file and generate a static List<Iris> collection. Here is the code:
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

namespace NumlDemo
{
    /// <summary>
    /// Provides the services to parse the training data files
    /// </summary>
    public static class IrisDataParserService
    {
        //provides the training data to create the predictive model
        public static List<Iris> TrainingIrisData { get; set; }

        /// <summary>
        /// Reads the trainingDataFile and populates the TrainingIrisData list
        /// </summary>
        /// <param name="trainingDataFile">File full of Iris data</param>
        public static void LoadIrisTrainingData(string trainingDataFile)
        {
            //if we don't have a training data file
            if (string.IsNullOrEmpty(trainingDataFile))
                throw new ArgumentNullException("trainingDataFile");

            //if the file doesn't exist on the file system
            if (!File.Exists(trainingDataFile))
                throw new FileNotFoundException("Training data file not found.", trainingDataFile);

            //initialize the training data set on first use
            if (TrainingIrisData == null)
                TrainingIrisData = new List<Iris>();

            //read the file one line at a time
            using (var fileReader = new StreamReader(trainingDataFile))
            {
                string fileLineContents;
                while ((fileLineContents = fileReader.ReadLine()) != null)
                {
                    //split the current line into an array of values
                    var irisValues = fileLineContents.Split(',');

                    //skip blank or malformed lines
                    if (irisValues.Length != 5)
                        continue;

                    double sepalLength, sepalWidth, petalLength, petalWidth;
                    Iris.IrisTypes irisClass;

                    //skip the line if any value fails to parse, rather than silently storing 0.0
                    if (!TryParseMeasurement(irisValues[0], out sepalLength) ||
                        !TryParseMeasurement(irisValues[1], out sepalWidth) ||
                        !TryParseMeasurement(irisValues[2], out petalLength) ||
                        !TryParseMeasurement(irisValues[3], out petalWidth) ||
                        !TryParseIrisClass(irisValues[4], out irisClass))
                        continue;

                    TrainingIrisData.Add(new Iris
                    {
                        SepalLength = sepalLength,
                        SepalWidth = sepalWidth,
                        PetalLength = petalLength,
                        PetalWidth = petalWidth,
                        IrisClass = irisClass
                    });
                }
            }
        }

        //the data file uses '.' as the decimal separator, so parse with the invariant culture
        private static bool TryParseMeasurement(string value, out double result)
        {
            return double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out result);
        }

        //maps the class name used in the file to the enum; unknown names are rejected
        private static bool TryParseIrisClass(string value, out Iris.IrisTypes irisClass)
        {
            switch (value.Trim())
            {
                case "Iris-setosa":
                    irisClass = Iris.IrisTypes.IrisSetosa;
                    return true;
                case "Iris-versicolor":
                    irisClass = Iris.IrisTypes.IrisVersicolour;
                    return true;
                case "Iris-virginica":
                    irisClass = Iris.IrisTypes.IrisVirginica;
                    return true;
                default:
                    irisClass = default(Iris.IrisTypes);
                    return false;
            }
        }
    }
}
This code is pretty standard. We simply read each line in the file, split the values out into an array, and populate a List<Iris> collection of Iris objects based on the data found in the file.
Now the magic
Using the numl.net library, we need only three classes to perform a prediction based on the Iris data set. We start with a Descriptor, which describes the class that we will learn from and predict. Next, we will instantiate a DecisionTreeGenerator, passing the descriptor to the constructor. Finally, we will create our prediction model by calling the Generate method of the DecisionTreeGenerator, passing the training data (IEnumerable<Iris>) to it. The Generate method will provide us with a model that we can use to perform our prediction.
Here is the code:
using numl;
using numl.Model;
using numl.Supervised;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace NumlDemo
{
    class Program
    {
        public static void Main(string[] args)
        {
            //get the descriptor that describes the features and label from the Iris training objects
            var irisDescriptor = Descriptor.Create<Iris>();

            //create a decision tree generator and teach it about the Iris descriptor
            var decisionTreeGenerator = new DecisionTreeGenerator(irisDescriptor);

            //load the training data
            IrisDataParserService.LoadIrisTrainingData(@"D:\Development\machinelearning\Iris Dataset\bezdekIris.data");

            //create a model based on our training data using the decision tree generator
            var decisionTreeModel = decisionTreeGenerator.Generate(IrisDataParserService.TrainingIrisData);

            //create an iris that should be an Iris Setosa
            var irisSetosa = new Iris
            {
                SepalLength = 5.1,
                SepalWidth = 3.5,
                PetalLength = 1.4,
                PetalWidth = 0.2
            };

            //create an iris that should be an Iris Versicolor
            var irisVersiColor = new Iris
            {
                SepalLength = 6.1,
                SepalWidth = 2.8,
                PetalLength = 4.0,
                PetalWidth = 1.3
            };

            //create an iris that should be an Iris Virginica
            var irisVirginica = new Iris
            {
                SepalLength = 7.7,
                SepalWidth = 2.8,
                PetalLength = 6.7,
                PetalWidth = 2.0
            };

            var irisSetosaClass = decisionTreeModel.Predict<Iris>(irisSetosa);
            var irisVersiColorClass = decisionTreeModel.Predict<Iris>(irisVersiColor);
            var irisVirginicaClass = decisionTreeModel.Predict<Iris>(irisVirginica);

            Console.WriteLine("The Iris Setosa was predicted as {0}",
                irisSetosaClass.IrisClass.ToString());
            Console.WriteLine("The Iris Versicolor was predicted as {0}",
                irisVersiColorClass.IrisClass.ToString());
            Console.WriteLine("The Iris Virginica was predicted as {0}",
                irisVirginicaClass.IrisClass.ToString());

            Console.ReadKey();
        }
    }
}
And that's all there is to it. As you can see, the model makes its predictions without you writing any math, only simple abstractions.
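Three hand-picked flowers are a nice demo, but if you want a rough number for how well a model does, a common approach is to hold back part of the data and count how many held-out labels the model gets right. Here is a minimal sketch of that idea. It is deliberately independent of numl.net, so EvaluationSketch, Split, and Accuracy are names I made up, and the prediction is passed in as a delegate instead of calling any library API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of numl.net.
public static class EvaluationSketch
{
    // Shuffles the items with a seeded Fisher-Yates shuffle and splits them
    // into a training part and a held-out part.
    public static void Split<T>(IEnumerable<T> items, double trainFraction, int seed,
        out List<T> train, out List<T> holdout)
    {
        if (trainFraction <= 0.0 || trainFraction >= 1.0)
            throw new ArgumentOutOfRangeException("trainFraction");

        var shuffled = items.ToList();
        var random = new Random(seed);
        for (int i = shuffled.Count - 1; i > 0; i--)
        {
            int j = random.Next(i + 1);
            T swap = shuffled[i];
            shuffled[i] = shuffled[j];
            shuffled[j] = swap;
        }

        int trainCount = (int)(shuffled.Count * trainFraction);
        train = shuffled.Take(trainCount).ToList();
        holdout = shuffled.Skip(trainCount).ToList();
    }

    // Fraction of held-out items whose predicted label equals the actual label.
    public static double Accuracy<T, TLabel>(
        IEnumerable<T> holdout, Func<T, TLabel> actual, Func<T, TLabel> predict)
    {
        int total = 0, correct = 0;
        foreach (var item in holdout)
        {
            // Read the actual label before predicting, in case predicting changes the item.
            TLabel expected = actual(item);
            TLabel predicted = predict(item);
            total++;
            if (EqualityComparer<TLabel>.Default.Equals(expected, predicted)) correct++;
        }

        if (total == 0) throw new ArgumentException("The held-out set is empty.");
        return (double)correct / total;
    }
}
```

With the types from this tutorial, actual would be something like iris => iris.IrisClass, and predict would wrap the decisionTreeModel.Predict<Iris>(...) call shown above, with the model generated from the training part only. I have not checked whether Predict modifies the object you hand it, so to be safe the predict delegate could build a fresh Iris that copies only the four features.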
I hope this has piqued your interest in the numl.net library for machine learning in .NET.
Feel free to post any questions or opinions.
Thanks for reading!
Buddy James

Assistant Professor Maxim Raginsky of the University of Illinois at Urbana-Champaign College of Engineering has received a $518,434 NSF CAREER Award to apply machine learning techniques to network analysis and discover how to make networks more efficient.
From the article
http://csl.illinois.edu/news/raginsky-receives-career-award-apply-information-theory-machine-learning-problems
“The overall design objective is to make sure that the network resources are allocated in a smart way, and each user receives only the data they need without significant waste of bandwidth or power,” said Raginsky, a member of Illinois' electrical and computer engineering faculty.
Raginsky uses ecological monitoring as an example. If someone is tracking a rare bird species in a specific habitat and wants to record how many of these birds fly in and out of the area, it would be a waste of resources to continuously stream video if what the person really wants is just the arrivals and departures of the birds. A big part of the problem is learning to detect events of interest and to reliably communicate only the data describing these events.
“So I want to make sure that only the relevant information gets to those who need it, despite the fact that everyone is using the same network and the kinds of information that are relevant to one user are different than the kinds of information that are relevant to somebody else,” Raginsky said.
These problems are messy and complex, and there is no hope to come up with an accurate model for all kinds of data being transmitted and received over networks because of the increasing size and complexity of both the networks and the data, Raginsky said. Machine learning offers a variety of tools for extracting predictively relevant information from observations, but to date most of the research on machine learning has not focused on the network aspect and all the resource constraints that it imposes.
This project will systematically explore what is and is not possible in these types of large networks with multiple learning agents, specifically identifying the effect of bandwidth limitations, losses, delays and lack of central coordination on the performance of statistical learning algorithms, thus helping develop efficient and robust coding/decoding schemes.
The NSF CAREER Award is awarded by the National Science Foundation specifically to “junior faculty members who demonstrate their roles through outstanding research and education,” according to NSF’s website.
Raginsky said that because these awards are for 5-year projects, the proposals take a lot of time and effort. “You propose to research something you’re really passionate about, and presumably you want to work on this topic even if it did not get funded,” Raginsky said. “So, when I heard about my proposal being recommended for funding, of course it was a relief. I will have a good time working on this problem.”
Raginsky is a member of the Decision and Control group at CSL.
I think that this is a wonderful problem domain in which machine learning can prove useful. Machine learning is a powerful set of technologies, and we have yet to even scratch the surface of what it can do for humankind. This goes to show you that there are other great uses besides targeted advertising systems, though that is where most of the jobs are at the moment.
Do you have any ideas as to some practical applications of machine learning that have yet to be tested?
Please share by leaving a comment.

About the author

My name is Buddy James. I'm a Microsoft Certified Solutions Developer from the Nashville, TN area. I'm a Software Engineer, an author, a blogger (http://www.refactorthis.net), a mentor, a thought leader, a technologist, a data scientist, and a husband. I enjoy working with design patterns, data mining, c#, WPF, Silverlight, WinRT, XAML, ASP.NET, python, CouchDB, RavenDB, Hadoop, Android(MonoDroid), iOS (MonoTouch), and Machine Learning. I love technology and I love to develop software, collect data, analyze the data, and learn from the data. When I'm not coding, I'm determined to make a difference in the world by using data and machine learning techniques. (follow me at @budbjames).