
This is my first post on machine learning, and hopefully not the last. The
main goal of these posts is to serve as a quick reference for simple machine
learning problems and their solutions, while also helping me get a better
understanding of the field itself. That said, don’t take anything here for granted.

Given a set of inputs and outputs, we have to solve a system of linear
equations for \(\beta\). In vector form, with \(X\) denoting the matrix of
inputs and \(Y\) the vector of outputs, the system can be solved using linear
algebra; the solution takes the form:

\[\beta = (X'X)^{-1}X'Y\]

This particular solution finds the betas using the method of ordinary least
squares. If linear algebra is not an option, another way to find the betas is
to minimize the difference between the actual and predicted values using some
function optimization method, e.g. gradient descent.
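As a rough illustration, here is a minimal gradient descent sketch for the
least squares objective. The function name, learning rate, and iteration count
are my own choices for illustration, not values from any particular library:

```python
import numpy as np

def fit_gradient_descent(inputs, outputs, learning_rate=0.01, iterations=10000):
    """Minimize the mean squared error by gradient descent.

    The step size and iteration count are arbitrary illustrative
    defaults; real code would tune them or check for convergence.
    """
    X = np.array(inputs, dtype=float)
    y = np.array(outputs, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        # Gradient of the mean squared error with respect to beta.
        gradient = 2.0 / len(y) * X.T.dot(X.dot(beta) - y)
        beta -= learning_rate * gradient
    return beta
```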

In the examples below I am going to implement a simple linear regression model
that takes only a single input. This makes it much easier to visualize the data
and the model; however, the code can easily be extended to support more than one input.

In this example we are going to model sepal length using sepal width only, for
the setosa flower class.

```python
import csv

def read_iris_data(filename):
    """
    :Parameters:
        - `filename`: name of the file that contains the Iris data set.
    """
    sepalWidths = []
    sepalLengths = []
    with open(filename) as fd:
        reader = csv.DictReader(fd)
        for row in reader:
            # Widths will be our inputs, include intercept
            sepalWidths.append([1.0, float(row['sepalWidth'])])
            sepalLengths.append(float(row['sepalLength']))
    return sepalWidths, sepalLengths
```
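Note that read_iris_data() assumes the CSV file has a header row with
sepalWidth and sepalLength columns; if your copy of the Iris data set uses
different column names, adjust the keys accordingly.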

One thing to note here is that we are adding an intercept term to our inputs
(the column of ones). It’s not very important in this particular case, since
our inputs will never evaluate to 0, but in other cases it may be useful.

```python
import numpy as np

def fit(inputs, outputs):
    """
    Solve a set of linear regression equations for betas using
    linear algebra.

    :Parameters:
        - `inputs`: a list of inputs.
        - `outputs`: a list of outputs for given inputs.
    """
    X = np.mat(inputs)
    Y = np.mat(outputs).T
    x_trans_x = X.T * X
    if np.linalg.det(x_trans_x) == 0:
        raise Exception('Cannot invert singular matrix')
    return (x_trans_x.I * (X.T * Y)).A.flatten()
```
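Putting the two pieces together might look like this; the iris.csv filename is
an assumption on my part, so point it at wherever your copy of the data set
lives:

```python
# `iris.csv` is an assumed filename, not one fixed by the post.
inputs, outputs = read_iris_data('iris.csv')
betas = fit(inputs, outputs)
print(betas)  # [ 2.63900125  0.69048972]
```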

The fit() function expects a list of tuples as inputs and a list of output
values; the model parameters are returned as a flat array. If we run this
function on our data set we get the beta coefficients
[2.63900125, 0.69048972]. Thus, our regression function looks as follows:

\[\hat{y} = 2.63900125 + 0.69048972 x\]

where \(x\) is the sepal width and \(\hat{y}\) is the predicted sepal length.

The scikit-learn package comes with a linear model module which can be used to
solve linear regression problems.

```python
import numpy as np
from sklearn import linear_model

X = np.array(inputs)
Y = np.array(outputs)
# Don't include intercept since we already added it while loading
# data.
model = linear_model.LinearRegression(fit_intercept=False)
model.fit(X, Y)
```

The model parameters are stored in model.coef_; for our data set they are
array([2.63900125, 0.69048972]), matching the solution above. Predictions can
be made by calling the model.predict method.
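For example, a prediction for a sepal width of 3.5 (an arbitrary value picked
for illustration) could look like this; note that the input must include the
intercept column, since the model was trained with it:

```python
# The input row mirrors the training data layout: [intercept, sepal width].
new_input = np.array([[1.0, 3.5]])  # 3.5 is an arbitrary example width
predicted_length = model.predict(new_input)
```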

For simple cases that have only a single input we can plot the regression line
and the data points together to see how well our model fits the data.
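A minimal matplotlib sketch of such a plot, reusing the inputs, outputs, and
betas variables from the earlier snippets, might look like this:

```python
import matplotlib.pyplot as plt
import numpy as np

# Drop the intercept column; only the raw widths are plotted.
widths = [row[1] for row in inputs]
plt.scatter(widths, outputs)

# Draw the regression line across the observed range of widths.
line_x = np.linspace(min(widths), max(widths), 100)
plt.plot(line_x, betas[0] + betas[1] * line_x, color='red')

plt.xlabel('sepal width')
plt.ylabel('sepal length')
plt.show()
```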

The red line represents our original model, which I think fits fairly well.
However, if we look at our data set, the first data point looks like an
outlier. We can try building a model on a data set without it. The green
regression line represents the model trained with the outlier removed.
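Retraining without the outlier is straightforward; the sketch below assumes
the outlier is the very first record, as described above:

```python
# Drop the first record, which looks like an outlier, and refit.
betas_no_outlier = fit(inputs[1:], outputs[1:])
```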

By looking at the graphs it’s difficult to tell which model is better. We can
take a look at the coefficient of determination (\(R^2\)) for each of them.
The scikit-learn model provides a score method which can be used to obtain it.
For the first model we get an \(R^2\) of 0.551375580392, and for the second
one 0.547248091457. The closer \(R^2\) is to 1, the better our model fits the
data.
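A quick sketch of obtaining \(R^2\) through the score method, using the X and
Y arrays the model was trained on:

```python
# score() returns the coefficient of determination R^2.
r_squared = model.score(X, Y)
print(r_squared)  # 0.551375580392 for the model trained on the full data set
```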