Sebastian Thrun was the head of Google's self-driving car project.
He used supervised classification to train the car.
Supervised means that we give the system many correctly labeled examples, like lesson material handed to a student.
That's what a learner does: observe a teacher. For example, we watch our parents drive correctly, and then it's our turn to drive.
That's what machine learning does.
To win the DARPA Grand Challenge in 2005, Stanley, the car Thrun's team built at Stanford, observed thousands of miles of humans driving in the desert.

If we plot two features and the classes look linearly separable, it is easier to draw a conclusion about the next incoming point.
But if the plot looks like the second case, it can be unclear how to draw a conclusion. Then it's better to add new samples, or to manipulate the features (use a polynomial, log10, etc.).
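A small sketch of that feature manipulation, on made-up data: two classes whose values spread multiplicatively overlap on the raw scale, but after a log10 transform a single threshold separates them almost perfectly. The data and the threshold of 2.0 are illustrative assumptions, not from the course.

```python
import numpy as np

# Hypothetical 1-D feature: two classes with multiplicative spread,
# hard to split on the raw scale but clean after log10.
rng = np.random.default_rng(0)
class_a = 10 ** rng.normal(1.0, 0.3, 500)   # centered near 10
class_b = 10 ** rng.normal(3.0, 0.3, 500)   # centered near 1000

# On the log10 scale, one threshold halfway between the centers works.
threshold = 2.0
a_log = np.log10(class_a)
b_log = np.log10(class_b)

accuracy = (np.mean(a_log < threshold) + np.mean(b_log >= threshold)) / 2
print(f"separation accuracy after log10: {accuracy:.3f}")
```

The same idea applies to polynomial features: if a straight line can't separate the raw features, a transformed version of them sometimes can.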

Okay, that's enough general description of machine learning. Let's dig deeper into Naive Bayes. Bayes was actually a religious man trying to prove the existence of God; it's the algorithm's simplifying independence assumption that makes it "naive".

In [16]:

Image('nb-ud/Screen Shot 2014-11-19 at 12.35.35 PM.jpg')

Out[16]:

Naive Bayes will later produce a decision boundary like the one in the picture.
An incoming sample then gets its label from which side of the boundary it falls on when plotted in this graph.
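A minimal sketch of that idea with sklearn's GaussianNB, using two toy 2-D blobs as a stand-in for the course's terrain data (makeTerrainData isn't available here, so the blob centers are assumptions):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy stand-in for the terrain data: two Gaussian blobs in 2-D
# (think grade vs. bumpiness), labels 0 = fast, 1 = slow.
rng = np.random.default_rng(42)
fast = rng.normal([0.3, 0.3], 0.1, size=(200, 2))
slow = rng.normal([0.7, 0.7], 0.1, size=(200, 2))
X = np.vstack([fast, slow])
y = np.array([0] * 200 + [1] * 200)

# Fit Naive Bayes; it learns a boundary between the two blobs.
clf = GaussianNB()
clf.fit(X, y)

# Incoming samples get their labels from which side of the
# learned boundary they fall on.
print(clf.predict([[0.2, 0.25], [0.8, 0.75]]))  # expect [0 1]
```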

# %%writefile GaussianNB_Deployment_on_Terrain_Data.py
#!/usr/bin/python

""" Complete the code below with the sklearn Naive Bayes classifier
    to classify the terrain data.

    The objective of this exercise is to recreate the decision boundary
    found in the lesson video, and make a plot that visually shows the
    decision boundary """

from prep_terrain_data import makeTerrainData
from class_vis import prettyPicture, output_image
from ClassifyNB import classify
import numpy as np
import pylab as pl
from ggplot import *

features_train, labels_train, features_test, labels_test = makeTerrainData()

### the training data (features_train, labels_train) have both "fast" and "slow" points mixed
### in together--separate them so we can give them different colors in the scatterplot,
### and visually identify them
grade_fast = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii] == 0]
bumpy_fast = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii] == 0]
grade_slow = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii] == 1]
bumpy_slow = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii] == 1]

### train the classifier (this line was missing: clf is used below)
clf = classify(features_train, labels_train)

### draw the decision boundary with the test points overlaid
prettyPicture(clf, features_test, labels_test)
#output_image("test.png", "png", open("test.png", "rb").read())

In [62]:

%%writefile classify.py
from sklearn.naive_bayes import GaussianNB

def NBAccuracy(features_train, labels_train, features_test, labels_test):
    """ compute the accuracy of your Naive Bayes classifier """
    ### import the sklearn module for GaussianNB
    from sklearn.naive_bayes import GaussianNB

    ### create classifier
    clf = GaussianNB()

    ### fit the classifier on the training features and labels
    clf.fit(features_train, labels_train)

    ### use the trained classifier to predict labels for the test features
    pred = clf.predict(features_test)

    ### calculate and return the accuracy on the test data
    ### this is slightly different than the example,
    ### where we just print the accuracy
    ### you might need to import an sklearn module
    accuracy = clf.score(features_test, labels_test)
    return accuracy

That does in fact give us close to 90% accuracy in predicting our data.
It is always important to split your dataset into a training set and a test set (this course recommends 90:10, others 80:20), so that we know whether our learner is overfitting.

According to Sebastian Thrun, and here I quote, "Bayes Rule is perhaps the Holy Grail of probabilistic inference". It was found by Rev. Thomas Bayes, who was trying to infer the existence of God. What he didn't know back then is that he opened endless possibilities for the Artificial Intelligence we know today.

In [40]:

Image('nb-ud/Screen Shot 2014-11-19 at 2.19.16 PM.jpg')

Out[40]:

Here Bayes rule should infer the probability given the condition. Now let's put it into a quiz.

This question is tricky, especially since both specificity and sensitivity are 90%. Intuitively, given the test result is positive, we know we are in the shaded region (blue and red). The true positive is depicted by red. As an estimate, which answer best describes the ratio of the red shaded region to the total (red + blue) shaded region?

In [42]:

Image('nb-ud/Screen Shot 2014-11-19 at 2.25.56 PM.jpg')

Out[42]:

If we look at the graph, we actually observe the probability of cancer within the positive-test region, ignoring the rest of the population for now. Of all the people who test positive, we are calculating the fraction who actually have the disease. For this we also need the other side of a positive test: the people who don't have cancer but still test positive. Combining both gives the total probability of a positive test, independent of whether the person has the disease or not.

Here's the total probability for the problem

In [44]:

Image('nb-ud/Screen Shot 2014-11-19 at 2.40.25 PM.jpg')

Out[44]:

And here's the simpler intuition

In [46]:

Image('nb-ud/Screen Shot 2014-11-19 at 2.43.11 PM.jpg')

Out[46]:

By adding P(C|Pos) and P(notC|Pos) you get a total probability of 1.
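The arithmetic above can be sketched in a few lines. Sensitivity and specificity are both 90%, as stated in the quiz; the 1% prior P(C) is an assumption here (it is the value commonly used in this lesson, but it is not shown in the text above):

```python
# Bayes rule for the cancer test.
p_c = 0.01                 # prior P(C) -- assumed, not stated above
p_pos_given_c = 0.90       # sensitivity: P(Pos|C)
p_neg_given_notc = 0.90    # specificity: P(Neg|notC)
p_pos_given_notc = 1 - p_neg_given_notc

# Total probability of a positive test (the denominator):
# both branches -- has cancer and doesn't -- contribute.
p_pos = p_c * p_pos_given_c + (1 - p_c) * p_pos_given_notc

# Posteriors: normalizing by P(Pos) makes them sum to 1.
p_c_given_pos = p_c * p_pos_given_c / p_pos
p_notc_given_pos = (1 - p_c) * p_pos_given_notc / p_pos

print(f"P(Pos)      = {p_pos:.4f}")            # 0.1080
print(f"P(C|Pos)    = {p_c_given_pos:.4f}")    # 0.0833
print(f"P(notC|Pos) = {p_notc_given_pos:.4f}") # 0.9167
```

Under that assumed prior, even a positive result from a 90%-accurate test leaves only about an 8% chance of actually having the disease, because the healthy population is so much larger.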