Note that, at the time of writing, Azure Machine Learning is in preview, so the details may change. However, the basic concepts should still apply.

Obtaining the data set

The University of California, Irvine (UCI) maintains a repository of machine learning data sets. We’ll use their data set of breast cancer cases from Wisconsin to build a predictive model that distinguishes between malignant and benign growths.

Create a new Azure Machine Learning workspace

Before we can start building our prediction model we need to create an ML workspace. Log into your Azure portal and scroll down the left-hand menu until you see the Machine Learning tab. Select it and click the New button at the bottom.

ML Tab

Configure the workspace as shown in the following screenshot. You’ll need to select a unique storage account name.

Create Workspace

Click the Create an ML workspace button and wait while Azure creates your workspace. Once created, the workspace will appear in the main list. Select it, then click Sign-in to ML Studio on the following page.

Create a new experiment

When you first sign in you’ll be presented with an empty list of “experiments”. An experiment is the workflow that defines and trains an ML model. Add a new experiment using the button at the bottom of the screen.

Create an Experiment

Select the Blank Experiment template. This results in a blank canvas ready for us to build our prediction model.

Experiment Template

Load the data

The first thing we need to do is access the breast cancer data set. As this is available on-line, we can use the ML Reader module to make it available in our experiment. We’re using a relatively small data set here, so reading it directly from the URL makes sense, but we could just as easily draw on a big data resource in Azure Storage, for example.

Search for the Reader module using the search control at the top-left. Drag the Reader module onto the experiment canvas and configure it as follows (using the URL from earlier):

Add Reader Module

Click on the Run button in the toolbar at the bottom of the screen. After a short delay, the Reader module will display a green check. This means that it has successfully read the data.
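The Reader module does all of this with drag and drop, but if you want to follow along outside ML Studio, the same load can be sketched in Python with pandas. The column names below follow the UCI documentation for this data set, and the inline sample contains just a few rows from the file; in practice you would pass the data set URL directly to read_csv:

```python
from io import StringIO

import pandas as pd

# Column names per the UCI documentation for the Wisconsin data set
columns = ["id", "clump_thickness", "cell_size_uniformity",
           "cell_shape_uniformity", "marginal_adhesion",
           "single_epithelial_size", "bare_nuclei", "bland_chromatin",
           "normal_nucleoli", "mitoses", "class"]

# A small inline sample; the full file lives at the UCI URL
sample = StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
)
df = pd.read_csv(sample, header=None, names=columns)
print(df.shape)  # (3, 11)
```

Like the Reader’s output, this gives a table of nine cell measurements plus the ID and class columns.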

Right-click on the “connection” circle at the bottom of the Reader module. Select “Visualize” from the pop-up menu.

Visualize Data

This displays the cancer data set in tabular form. Charts at the top of the columns summarize the data. We can see the class field on the far-left has two values, 2 and 4, representing benign and malignant growths, respectively. There are more benign cases in the data set than malignant ones.

Note that all the data, apart from the diagnosis (class) and ID variables, is in the same range (1–10).

Close the visualization to return to the experiment canvas.

Prepare the data

There are three problems with this data set.

The bare nuclei column has missing values in some cases

The arbitrary ID data isn’t relevant to the analysis, so we need to remove it

The class—i.e. benign or malignant—is represented by 2 and 4, respectively, which is hardly user-friendly

Removing cases with missing values

Some of the cases in the data set have missing values. For example:

1057013,8,4,5,1,2,?,7,3,1,4

We can remove these cases from the data set using the Missing Values Scrubber module. Search for the module, drag it onto the canvas and configure it as shown in the following screenshot. The key option is choosing Remove entire row for missing values. Join the output of the Reader module to the input of the Missing Values Scrubber module.

Missing Values Scrubber Module
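For comparison, the scrubber’s “Remove entire row” behaviour can be sketched in pandas. The raw file marks missing values with “?”, so the sketch first coerces those to NaN and then drops the affected rows (the three-row frame here is a hypothetical stand-in for the full data set):

```python
import pandas as pd

# Hypothetical stand-in: one row has a missing bare_nuclei value, as "?"
df = pd.DataFrame({
    "bare_nuclei": ["1", "10", "?"],
    "class": [2, 2, 4],
})

# "?" is not a number; coerce it to NaN, then drop the whole row,
# matching the scrubber's "Remove entire row" option
df["bare_nuclei"] = pd.to_numeric(df["bare_nuclei"], errors="coerce")
clean = df.dropna()
```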

Removing the ID field

The Project Columns module can be used to choose which fields to take forward into subsequent stages of the analysis. We can use it to exclude the ID field.

Add the Project Columns module to the canvas and configure it as follows:

Project Columns Module

The red exclamation mark on the module tells us we have more work to do. Click the Launch column selector button in the right-hand sidebar to choose the columns we wish to use. We want to include all columns except the ID column.

ID Column Dialog
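Outside ML Studio, projecting away a column is a one-liner in pandas. A minimal sketch, using a cut-down frame with illustrative values:

```python
import pandas as pd

# Cut-down frame with illustrative values
df = pd.DataFrame({
    "id": [1000025, 1002945],
    "clump_thickness": [5, 5],
    "class": [2, 2],
})

# Exclude the arbitrary ID column, keeping everything else
projected = df.drop(columns=["id"])
```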

Recode the class field values

At present, the class field—representing the diagnosis—takes values of either 2 or 4. Benign cases = 2, whereas malignant cases = 4. That’s not very intuitive, to say the least. So, we’re going to convert this field into a true/false value where true denotes that the growth is malignant. We’ll use the Apply Math Operation module to do this.

Configure the module as shown in the following screenshot. We want to compare (EqualTo) the class to 4, so that the result will be true when the growth is malignant. We don’t need the original data so we use the Inplace replacement output mode. Use the Launch column selector button to specify the class column.

Math Operations Module
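The EqualTo comparison maps neatly onto a vectorised equality test in pandas. This sketch replaces the column in place, just as the Inplace output mode does:

```python
import pandas as pd

df = pd.DataFrame({"class": [2, 4, 2, 4]})

# EqualTo 4: True for malignant, False for benign, replacing in place
df["class"] = df["class"] == 4
```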

Building the predictive model

Our prediction model is going to use logistic regression classification. We will need to teach it how to make diagnoses by presenting it with a number of examples. These examples are the cases in our newly-cleaned breast cancer data set.

As we have a binary output (true/false) we’ll use the Two-Class Logistic Regression module as our classification method. Its default settings are fine.

Logistic Regression Module

We also want to be able to evaluate our model by testing how well it predicts new cases. So, we’ll hold back some of the data to use for testing.

Let’s split the data into training and testing sets—70% of the data will be used for training and the remaining 30% for testing. This can be done using the Split module.

Split Module
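The same 70/30 split can be sketched with scikit-learn’s train_test_split; the arrays here are placeholders standing in for the cleaned data set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # placeholder feature matrix, 50 cases
y = np.array([0, 1] * 25)          # placeholder labels

# 70% training, 30% testing, matching the Split module's fraction
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
```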

Time to train the model using the…Train Model module. Connect the classification method and the training data to it, as in the following screenshot. Make sure that you specify the class column as the training output using the Launch column selector button.

Train Model Module
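Training amounts to fitting the chosen classifier on the training split. As a rough Python equivalent, here is logistic regression with its default settings on a toy, one-feature stand-in for the data (the real model trains on all nine cell-measurement columns):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the training split: one feature, True = malignant
X_train = np.array([[1.0], [2.0], [9.0], [10.0]])
y_train = np.array([False, False, True, True])

model = LogisticRegression()  # defaults, like the module's default settings
model.fit(X_train, y_train)
```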

Now that the model is trained, we’ll run the test data through it and see how well it performs. This is achieved using the Score Model module. Connect our newly-trained model and the test data to it.

Score Model Module

At this point we could run the model and launch the visualizer on the Score Model module’s output to see what diagnoses the model predicted from the test data. Comparing these with the actual diagnoses from the original data set would allow us to calculate the accuracy of the model.

However, this would be quite tedious—and ML provides us with a module that does this work for us. Drag an Evaluate Model module onto the canvas and wire it to the test results.

Evaluate Model Module
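Scoring and evaluating correspond to predicting on held-out cases and then comparing those predictions against the known diagnoses. A sketch with scikit-learn, again on toy stand-in data rather than the real cancer set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy data standing in for the cleaned cancer set (True = malignant)
X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y = np.array([False, False, False, True, True, True])
model = LogisticRegression().fit(X, y)

# Score Model: run cases through the trained model
preds = model.predict(X)

# Evaluate Model: compare predictions against the known diagnoses
acc = accuracy_score(y, preds)
cm = confusion_matrix(y, preds)  # rows: actual, columns: predicted
```

The confusion matrix is where the false negatives and false positives reported later come from: off-diagonal cells count the incorrect predictions.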

Now for the fun. Run the model using the toolbar button at the bottom of the screen. Watch the clocks on the modules turn to green checks as the analysis progresses.

When the analysis is complete visualize the output of the evaluation by right-clicking on the output node of the Evaluate Model module.

Visualize Results Menu

Among other data, this summarizes the number of correct and incorrect predictions made by the model.

Visualize Results

We can see that the accuracy of the model is 98%. It made two incorrect benign predictions (false negatives) and two incorrect malignant predictions (false positives).

Azure Machine Learning – What You’ve Learned and Where to Go From Here

You can see how easy it is to undertake machine learning projects in Azure. No programming is required—it’s all drag and drop. You can use other classification methods (e.g. neural networks) by dragging them onto the canvas and wiring them up to the Train Model module (replacing the current Two-Class Logistic Regression module).

Another significant benefit of using Azure Machine Learning is that you can publish your experiments as web services, allowing your web or mobile apps to make use of your predictive models, recommendation engines, etc. This is a “point and click” process initiated by the “Publish web service” button in the experiment toolbar.