WEBVTT
00:00:06.860 --> 00:00:10.190
Microsoft Azure machine learning
allows you to use R code in
00:00:10.240 --> 00:00:11.230
any experiment.
00:00:12.500 --> 00:00:15.850
You can take advantage of extensive,
familiar open-source R
00:00:15.900 --> 00:00:18.330
libraries in your existing
analytics code.
00:00:19.440 --> 00:00:23.460
To use R in an Azure ML experiment,
you need to use the execute
00:00:23.510 --> 00:00:24.830
R script module.
00:00:25.860 --> 00:00:26.650
Let's try it.
00:00:27.730 --> 00:00:30.460
Just paste your R code into
the script window.
00:00:31.230 --> 00:00:35.630
We'll use an existing sample data
set, the adult census dataset.
00:00:36.480 --> 00:00:40.340
It has features such as education,
gender, and marital status.
00:00:40.920 --> 00:00:43.870
In this example, we'll try to predict
if a person's income is
00:00:43.920 --> 00:00:46.870
greater than or less than $50,000.
00:00:47.660 --> 00:00:50.200
First, let's connect the
data to an input port.
00:00:51.530 --> 00:00:55.880
Note that the R module has three
input ports: two for datasets
00:00:55.930 --> 00:00:59.760
and a third for zip files. So you
can load additional R code,
00:00:59.810 --> 00:01:01.710
R packages and datasets.
00:01:02.970 --> 00:01:06.220
Now that we've connected our dataset,
we can use that dataset
00:01:06.270 --> 00:01:07.740
inside our script.
00:01:09.010 --> 00:01:12.390
Using the maml.mapInputPort function,
we'll map the first port to
00:01:12.440 --> 00:01:14.030
the dataset1 variable.
00:01:15.280 --> 00:01:18.560
Since we don't have a second dataset,
we can delete this line.
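In the Execute R Script module, the port-mapping step looks like this (maml.mapInputPort is Azure ML's helper and only exists inside the module; dataset1 is the default variable name from the script template):

```r
# Map the first input port to an R data frame.
# maml.mapInputPort exists only inside Azure ML's
# Execute R Script module, so this will not run locally.
dataset1 <- maml.mapInputPort(1)

# The default template also maps a second port; with only
# one dataset connected, that line is deleted:
# dataset2 <- maml.mapInputPort(2)
```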
00:01:19.800 --> 00:01:23.290
Let's also get rid of the sample code
that is in the script by default.
00:01:24.300 --> 00:01:27.760
Let's do some simple plots of
the data. Azure ML comes with
00:01:27.810 --> 00:01:31.580
many R packages already installed
but it's easy to add more.
00:01:32.740 --> 00:01:36.410
We'll need the ggplot2 and data.table
libraries. We can add
00:01:36.460 --> 00:01:39.390
those libraries as we would in RStudio.
00:01:40.240 --> 00:01:43.010
First, we need to clean up some
of the names of the data columns
00:01:43.060 --> 00:01:44.920
so that we can use them
in our R code.
00:01:46.080 --> 00:01:49.330
The dataset's column names use dashes,
which we can replace with dots.
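The renaming can be sketched in base R; the sample column names below (e.g. marital-status) are assumptions modeled on the adult census dataset:

```r
# Toy stand-in for the mapped dataset; the real one comes
# from maml.mapInputPort(1)
dataset1 <- data.frame(
  "education"      = c("Bachelors", "HS-grad"),
  "marital-status" = c("Married", "Never-married"),
  check.names = FALSE  # keep the dashes so we can fix them
)

# Replace dashes with dots so the names are valid R identifiers
names(dataset1) <- gsub("-", ".", names(dataset1), fixed = TRUE)

names(dataset1)  # "education" "marital.status"
```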
00:01:50.800 --> 00:01:53.840
Next, let's make a few histogram
and density plots using the
00:01:53.890 --> 00:01:55.100
ggplot function.
00:01:56.040 --> 00:01:59.760
Assign the output of ggplot to foo,
and then we can print foo.
00:02:00.940 --> 00:02:04.190
By doing this it will print the
plot to the second output port,
00:02:04.240 --> 00:02:07.640
which is where R console output
and graphics device output appear.
00:02:08.530 --> 00:02:11.470
The other output port is
for the results dataset.
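A minimal sketch of that plotting pattern, assuming ggplot2 is installed and the data has an age column after renaming:

```r
library(ggplot2)

# Stand-in data; in the module this comes from maml.mapInputPort(1)
dataset1 <- data.frame(age = c(23, 35, 47, 52, 31, 44, 28))

# Assign the plot to foo, then print it: the graphic goes to
# the module's second output port (the R Device port), together
# with any console output.
foo <- ggplot(dataset1, aes(x = age)) + geom_histogram(binwidth = 10)
print(foo)

# A density plot works the same way
print(ggplot(dataset1, aes(x = age)) + geom_density())
```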
00:02:12.290 --> 00:02:15.760
Let's change the code to output our
data with the updated column names.
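Returning the data frame through the first output port uses the companion helper (again, available only inside Azure ML):

```r
# Send dataset1, with its cleaned-up column names, out through
# the module's first output port (Result Dataset).
# maml.mapOutputPort only exists inside Azure ML's
# Execute R Script module.
maml.mapOutputPort("dataset1")
```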
00:02:17.170 --> 00:02:18.650
Let's see what this looks like.
00:02:24.640 --> 00:02:27.290
You can see we have all the plots
that we created just as you
00:02:27.340 --> 00:02:31.100
would see in RStudio. We can also
create new
00:02:31.150 --> 00:02:32.810
features in our dataset.
00:02:33.580 --> 00:02:37.310
Let's create a new dataset and
add average age by occupation
00:02:37.640 --> 00:02:40.280
and age as a ratio of the average
for that occupation.
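Those two features can be sketched in base R with ave (the transcript loads data.table, which offers an equivalent grouped computation; the column names here are assumptions):

```r
# Toy stand-in for the census data
dataset1 <- data.frame(
  age        = c(25, 35, 45, 30, 50),
  occupation = c("Sales", "Sales", "Tech", "Tech", "Tech")
)

# Average age within each person's occupation
dataset1$mean.age <- ave(dataset1$age, dataset1$occupation)

# Age as a ratio of that occupation's average
dataset1$age.ratio <- dataset1$age / dataset1$mean.age
```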
00:02:42.490 --> 00:02:46.240
Let's output the new dataset. You
can see we have two new columns
00:02:46.290 --> 00:02:50.790
and we can see basic statistics:
min, max, standard deviation,
00:02:50.840 --> 00:02:52.810
and the histograms for those columns.
00:02:54.960 --> 00:02:59.330
Next we need to train a model. First
we have to choose an algorithm.
00:02:59.910 --> 00:03:03.750
This is a classification
problem. Let's try the Two-Class
00:03:03.800 --> 00:03:05.110
Boosted Decision Tree algorithm.
00:03:06.240 --> 00:03:09.680
We need to split our data so that we
can save some data for testing.
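The Split module does this in the experiment graph; for intuition, the equivalent hold-out split in plain R looks like this (the 70/30 fraction is an assumption):

```r
set.seed(42)  # make the random split reproducible

# Toy data frame standing in for the census data
dataset1 <- data.frame(age = 1:10)

n <- nrow(dataset1)
train.idx <- sample(n, size = floor(0.7 * n))  # 70% for training
train <- dataset1[train.idx, , drop = FALSE]   # training rows
test  <- dataset1[-train.idx, , drop = FALSE]  # held-out rows

nrow(train)  # 7
nrow(test)   # 3
```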
00:03:11.970 --> 00:03:15.050
Then we need the Train Model module to
connect the classification algorithm
00:03:15.100 --> 00:03:15.900
and the data.
00:03:17.080 --> 00:03:20.040
We also need to tell the train module
which column contains the
00:03:20.090 --> 00:03:21.230
label for our data.
00:03:22.460 --> 00:03:25.810
Remember that we are predicting
income so this is our label.
00:03:26.960 --> 00:03:29.770
Next, we want to score the test
data that we held out using our
00:03:29.820 --> 00:03:30.680
trained model.
00:03:31.960 --> 00:03:35.400
Finally, we want to see metrics,
such as AUC, to determine
00:03:35.450 --> 00:03:36.770
how well our model did.
00:03:37.420 --> 00:03:39.340
Now we are ready to run the experiment.
00:03:43.630 --> 00:03:46.170
We can look at the results of
00:03:51.110 --> 00:03:53.650
the scoring. This algorithm predicts
the probability that the
00:03:53.700 --> 00:03:55.090
income is greater than 50K.
00:03:56.040 --> 00:04:00.730
A probability over 0.5 is labeled
as greater than 50K, and one
00:04:00.780 --> 00:04:02.550
under 0.5 as less than 50K.
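That thresholding rule is a one-liner in R (how a probability of exactly 0.5 is handled is an assumption here):

```r
scored.prob <- c(0.12, 0.87, 0.35, 0.66)  # example model outputs

# Label each prediction by the 0.5 cutoff
labels <- ifelse(scored.prob > 0.5, ">50K", "<=50K")

labels  # "<=50K" ">50K" "<=50K" ">50K"
```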
00:04:04.240 --> 00:04:06.930
Now we can visualize the
metrics for the scores.
00:04:08.020 --> 00:04:11.650
We've got the standard outputs like
the confusion matrix and accuracy.
00:04:12.840 --> 00:04:16.870
The AUC is 0.92, which is pretty
good. But we could go back
00:04:16.920 --> 00:04:19.570
and try to improve the results
by trying different parameters
00:04:19.620 --> 00:04:22.520
with parameter sweeping or trying
different algorithms.
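The Evaluate Model module computes these metrics for you; as a sanity check on what AUC means, the rank-based (Mann-Whitney) formulation can be sketched in base R:

```r
# AUC via the rank-sum formulation: the probability that a
# randomly chosen positive example scores higher than a
# randomly chosen negative example.
auc <- function(labels, scores) {
  r <- rank(scores)          # ranks of all scores, ties averaged
  n.pos <- sum(labels == 1)  # number of positives
  n.neg <- sum(labels == 0)  # number of negatives
  (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Perfectly separated scores give AUC = 1
auc(c(0, 0, 1, 1), c(0.1, 0.2, 0.8, 0.9))  # 1
```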
00:04:23.810 --> 00:04:26.520
Once you are happy with the results
of your model, you can create
00:04:26.570 --> 00:04:27.580
a web service.
00:04:28.770 --> 00:04:32.160
Next, we create the scoring experiment.
This simplifies our
00:04:32.210 --> 00:04:33.450
graph for publishing.
00:04:38.560 --> 00:04:41.740
We run the new graph and then publish
the web service. Now we
00:04:41.790 --> 00:04:51.660
have an API that we can call, passing
the features for a person
00:04:51.710 --> 00:04:55.780
to predict whether they make
more or less than $50,000.
00:04:55.960 --> 00:04:56.830
Thanks for watching.