Every algorithm is exposed in scikit-learn via an "Estimator" object. For instance, a linear regression is implemented as follows:

In [2]:

from sklearn.linear_model import LinearRegression

Estimator parameters: All the parameters of an estimator can be set when it is instantiated, and have suitable default values:

In [3]:

model = LinearRegression(normalize=True)
print(model.normalize)

True

In [4]:

print(model)

LinearRegression(copy_X=True, fit_intercept=True, normalize=True)

Estimated model parameters: When an estimator is fit to data, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object whose names end with an underscore:
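As a minimal sketch of this convention, the snippet below fits a linear regression and reads back the estimated slope and intercept. The synthetic data (an exact line, y = 2x + 1) is an assumption made for illustration and is not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2x + 1 exactly
X = np.arange(10, dtype=float)[:, np.newaxis]
y = 2 * X.squeeze() + 1

model = LinearRegression()
model.fit(X, y)

# Estimated parameters are stored in attributes ending with an underscore
print(model.coef_)       # one slope per feature
print(model.intercept_)  # the fitted intercept
```

Attributes such as coef_ and intercept_ only exist after fit() has been called, which is one way to tell them apart from the parameters passed at instantiation.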

In Supervised Learning, we have a dataset consisting of both features and labels.
The task is to construct an estimator which is able to predict the label of an object
given the set of features. A relatively simple example is predicting the species of
iris given a set of measurements of its flower.
Some more complicated examples are:

given a multicolor image of an object through a telescope, determine
whether that object is a star, a quasar, or a galaxy.

given a photograph of a person, identify the person in the photo.

given a list of movies a person has watched and their personal ratings
of those movies, recommend a list of movies they would like
(so-called recommender systems; a famous example is the Netflix Prize).

What these tasks have in common is that there are one or more unknown
quantities associated with the object which need to be determined from other
observed quantities.

Supervised learning is further broken down into two categories, classification and regression.
In classification, the label is discrete, while in regression, the label is continuous. For example,
in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a
classification problem: the label is from three distinct categories. On the other hand, we might
wish to estimate the age of an object based on such observations: this would be a regression problem,
because the label (age) is a continuous quantity.
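The distinction can be sketched on the iris data. In the snippet below, the classifier predicts a discrete species index, while the regressor predicts a continuous measurement. Treating petal width as a regression target, and the particular estimators chosen, are assumptions made purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Classification: the label is a discrete species index (0, 1, or 2)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
class_pred = clf.predict(X[:3])

# Regression: treat petal width (column 3) as a continuous label and
# predict it from the other three measurements (an illustrative choice)
reg = LinearRegression().fit(X[:, :3], X[:, 3])
value_pred = reg.predict(X[:3, :3])

print(class_pred.dtype, class_pred)  # integer class labels
print(value_pred.dtype, value_pred)  # floating-point estimates
```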

K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.
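A minimal sketch of kNN classification on the iris data follows. The choice of five neighbors and the measurements of the new flower are hypothetical values picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# The "reference database" is simply the stored training data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Hypothetical new flower: sepal length, sepal width,
# petal length, petal width (all in cm)
X_new = [[3.0, 5.0, 4.0, 2.0]]

pred = knn.predict(X_new)        # predominant class among the 5 neighbors
proba = knn.predict_proba(X_new)  # fraction of neighbors in each class

print(iris.target_names[pred])
print(proba)
```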

model = LinearRegression()
model.fit(X, y)

# Plot the data and the model prediction
X_fit = np.linspace(0, 1, 100)[:, np.newaxis]
y_fit = model.predict(X_fit)

plt.plot(X.squeeze(), y, 'o')
plt.plot(X_fit.squeeze(), y_fit);

Scikit-learn also has some more sophisticated models, which can respond to finer features in the data:

In [18]:

# Fit a Random Forest
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X, y)

# Plot the data and the model prediction
X_fit = np.linspace(0, 1, 100)[:, np.newaxis]
y_fit = model.predict(X_fit)

plt.plot(X.squeeze(), y, 'o')
plt.plot(X_fit.squeeze(), y_fit);

Whether either of these is a "good" fit or not depends on a number of things; we'll discuss details of how to choose a model later in the tutorial.

Explore the RandomForestRegressor object using IPython's help features (i.e. put a question mark after the object).
What arguments are available to RandomForestRegressor?
How does the above plot change if you change these arguments?

These class-level arguments are known as hyperparameters, and we will discuss how to select them later, in the model validation section.
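One way to explore hyperparameters programmatically is sketched below. The synthetic one-dimensional data and the specific values of n_estimators and max_depth are assumptions chosen for illustration; the tutorial's own X and y are defined elsewhere.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic 1-D data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.rand(50, 1)
y = np.sin(6 * X.squeeze()) + 0.1 * rng.randn(50)

# get_params() lists every hyperparameter with its current value
params = RandomForestRegressor().get_params()
print(sorted(params))

# Compare a forest of shallow trees with a forest of fully grown trees
shallow = RandomForestRegressor(n_estimators=50, max_depth=2, random_state=0)
deep = RandomForestRegressor(n_estimators=50, max_depth=None, random_state=0)
shallow.fit(X, y)
deep.fit(X, y)

# R^2 on the training data: deeper trees follow the training points more closely
print(shallow.score(X, y), deep.score(X, y))
```

A higher training score does not by itself mean the deeper forest is the better model, since it is measured on the same data used for fitting.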

Unsupervised Learning addresses a different sort of problem. Here the data has no labels,
and we are interested in finding similarities between the objects in question. In a sense,
you can think of unsupervised learning as a means of discovering labels from the data itself.
Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and
density estimation. For example, in the iris data discussed above, we can use unsupervised
methods to determine combinations of the measurements which best display the structure of the
data. As we'll see below, such a projection of the data can be used to visualize the
four-dimensional dataset in two dimensions. Some more involved unsupervised learning problems are:

given detailed observations of distant galaxies, determine which features or combinations of
features best summarize the information.

given a mixture of two sound sources (for example, a person talking over some music),
separate the two (this is called the blind source separation problem).

given a video, isolate a moving object and categorize it in relation to other moving objects which have been seen.

Sometimes the two may even be combined: e.g. unsupervised learning can be used to find useful
features in heterogeneous data, and then these features can be used within a supervised
framework.
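The two-dimensional projection of the iris data mentioned above can be sketched with principal component analysis (PCA). Using PCA here is an assumption for illustration; it is one of several dimensionality reduction methods available in scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data  # shape (150, 4); the labels are not used

# Project the four measurements onto the two directions of largest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```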

Clustering groups together observations that are homogeneous with respect to a given criterion, finding "clusters" in the data.

Note that these clusters will uncover relevant hidden structure in the data only if the criterion used highlights it.

In [20]:

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)  # Fixing the RNG in kmeans
k_means.fit(X)
y_pred = k_means.predict(X)

pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred, cmap='RdYlBu');

Scikit-learn strives to have a uniform interface across all methods,
and we'll see examples of these below. Given a scikit-learn estimator
object named model, the following methods are available:

Available in all Estimators

model.fit() : fit training data. For supervised learning applications,
this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)).
For unsupervised learning applications, this accepts only a single argument,
the data X (e.g. model.fit(X)).

Available in supervised estimators

model.predict() : given a trained model, predict the label of a new set of data.
This method accepts one argument, the new data X_new (e.g. model.predict(X_new)),
and returns the learned label for each object in the array.

model.predict_proba() : For classification problems, some estimators also provide
this method, which returns the probability that a new observation has each categorical label.
In this case, the label with the highest probability is returned by model.predict().

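model.score() : for classification or regression problems, most estimators implement
a score method. For classifiers the default score is the mean accuracy, which lies between 0 and 1;
for regressors it is the R^2 coefficient, which is at most 1 and can be negative for a very poor fit.
In both cases a larger score indicates a better fit.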

Available in unsupervised estimators

model.predict() : predict labels in clustering algorithms.

model.transform() : given an unsupervised model, transform new data into the new basis.
This also accepts one argument X_new, and returns the new representation of the data based
on the unsupervised model.

model.fit_transform() : some estimators implement this method,
which more efficiently performs a fit and a transform on the same input data.
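The methods listed above can be sketched side by side on the iris data. The particular estimators and their settings (three neighbors, three clusters, two components) are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: fit(X, y), then predict / predict_proba / score
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
labels = clf.predict(X[:5])
probs = clf.predict_proba(X[:5])  # shape (5, 3): one column per class
accuracy = clf.score(X, y)        # mean accuracy on the given data

# Unsupervised clustering: fit(X), then predict cluster labels
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
clusters = km.predict(X[:5])

# Unsupervised transformer: fit_transform on the training data,
# then transform to map further data into the same basis
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
X_new_2d = pca.transform(X[:5])

print(labels, clusters, accuracy)
print(probs.shape, X_2d.shape, X_new_2d.shape)
```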

An important piece of machine learning is model validation: that is, determining how well your model will generalize from the training data to future unlabeled data. Let's look at an example using the nearest neighbor classifier. This is a very simple classifier: it stores all training data, and for any new point returns the label of the closest training point.

With the iris data, it very easily returns the correct prediction for each of the input points:
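A minimal sketch of this check, assuming a KNeighborsClassifier with a single neighbor and a confusion matrix as the summary, is:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Fit on the full dataset, then predict on those same points
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)
y_pred = clf.predict(X)

print((y_pred == y).all())
print(confusion_matrix(y, y_pred))  # rows: true class, columns: predicted class
```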

For each class, all 50 training samples are correctly identified. But this does not mean that our model is perfect! Each training point is its own nearest neighbor, so a perfect score on the training data says nothing about how well the model generalizes to new data. We can estimate that by splitting our data into a training set and a testing set. Scikit-learn contains some convenient routines to do this:
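One such routine is train_test_split. The sketch below assumes its location in sklearn.model_selection (as in recent scikit-learn releases), the default 75/25 split, and a fixed random_state for repeatability; the exact counts in the confusion matrix depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples (the default) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the training portion only, then evaluate on unseen points
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows: true class, columns: predicted class
```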

This paints a better picture of the true performance of our classifier: apparently there is some confusion between the second and third species, which we might anticipate given what we've seen of the data above.

This is why it's extremely important to use a train/test split when evaluating your models. We'll go into more depth on model evaluation later in this tutorial.