Machine Learning – Basic Implementation (Part I – Linear Regression)

Abstract

This post is part of a series on basic implementations of specific machine learning algorithms to solve a problem. It walks you, step by step, through the processes you can tune, in order, to optimise a model.

Prerequisites

You should have a mathematical overview of linear regression and know the concepts behind the terms features, gradient descent, cost function, and so on.
We also assume that all data has been collected and cleaned before any implementation starts.

Deployment

1. Normalize data
Because features can come on very different scales, we should normalize the data before doing any visualisation. This is easy to do in Python with the numpy library:

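import numpy as np

def normalize_features(df):
    """
    Normalize the features in the data set.

    Returns the normalized data together with the per-feature mean (mu)
    and standard deviation (sigma) used for the scaling.
    """
    mu = df.mean()
    sigma = df.std()
    # A feature with zero standard deviation would cause a division by zero.
    if (sigma == 0).any():
        raise ValueError(
            "One or more features had the same value for all samples, and thus could "
            "not be normalized. Please do not include features with only a single value "
            "in your model."
        )
    # Reuse mu and sigma instead of computing them a second time.
    df_normalized = (df - mu) / sigma
    return df_normalized, mu, sigma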
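
The function above relies on DataFrame-style methods (df.mean(), df.std()). As a rough sketch of the same idea on a plain NumPy array, the snippet below scales each column and then reuses the stored mean and standard deviation on new samples. The helper name normalize_array and the sample numbers are made up for illustration; ddof=1 is chosen on the assumption that you want the sample standard deviation, which is what pandas computes by default.

```python
import numpy as np

def normalize_array(X):
    """Z-score each column of a 2-D array (hypothetical helper, not from the post)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    # ddof=1 gives the sample standard deviation (the pandas default).
    sigma = X.std(axis=0, ddof=1)
    if (sigma == 0).any():
        raise ValueError("At least one feature is constant and cannot be normalized.")
    return (X - mu) / sigma, mu, sigma

# Made-up training data: two features on very different scales.
X_train = np.array([[1.0, 1000.0],
                    [2.0, 3000.0],
                    [3.0, 2000.0]])
X_norm, mu, sigma = normalize_array(X_train)

# New samples should be scaled with the training mu and sigma,
# not with statistics computed from the new samples themselves.
X_new = np.array([[2.0, 2000.0]])
X_new_norm = (X_new - mu) / sigma
```

Returning mu and sigma alongside the normalized data is what makes this reuse possible when predicting on unseen samples later.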

2. Visualize data
Of course, the first step in choosing a model to train is to plot the data and see how x and y interact. Linear regression is an appropriate algorithm if your data (or the graph of your data) satisfies all of the criteria below:

The points should scatter around the best-fit line with the same standard deviation all along the line. If many points lie too far (high or low) from the best-fit line, linear regression is probably not the right model.

The measurement of X (the features) should be precise. Any imprecision in measuring X should be very small compared to the natural (for example, biological) variability of Y.

The data points should be independent of each other. That is, a change in one observation should not cause a change in the others.

X and Y should not be intertwined by construction. For example: midterm score vs. total score. Because the midterm score is a component used to calculate the total score, linear regression is not valid for such data.
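
The first criterion can also be checked numerically before (or alongside) looking at a plot. The sketch below is one possible way to do it, not a formal statistical test: it fits a straight line with np.polyfit, then compares the spread of the residuals in the lower and upper half of the x range. The function names, the synthetic data, and the idea of splitting at the median are all assumptions made for this illustration. The plotting helper imports matplotlib lazily, so the numeric part runs without it.

```python
import numpy as np

def residual_summary(x, y):
    """Fit y = slope * x + intercept and summarise the residual spread."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    # Split the residuals into a low-x half and a high-x half.
    order = np.argsort(x)
    half = len(x) // 2
    low, high = residuals[order[:half]], residuals[order[half:]]
    return {
        "slope": slope,
        "intercept": intercept,
        "residuals": residuals,
        "std_low": low.std(ddof=1),
        "std_high": high.std(ddof=1),
    }

def plot_fit(x, y, summary):
    """Scatter plot with the fitted line, plus a residual plot (needs matplotlib)."""
    import matplotlib.pyplot as plt
    xs = np.sort(x)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(x, y, s=10)
    ax1.plot(xs, summary["slope"] * xs + summary["intercept"], color="red")
    ax1.set_xlabel("x")
    ax1.set_ylabel("y")
    ax2.scatter(x, summary["residuals"], s=10)
    ax2.axhline(0.0, color="red")
    ax2.set_xlabel("x")
    ax2.set_ylabel("residual")
    plt.show()

# Synthetic data that does satisfy the criterion: constant noise around a line.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, 200)
summary = residual_summary(x, y)
```

If std_low and std_high differ a lot, or the residual plot shows a curve or a funnel shape, that is a hint that a straight line with constant scatter may not describe the data well.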

3. Gradient descent vs statsmodels OLS
Before talking about tuning the model, we start with a basic step: using gradient descent and statsmodels OLS to find a first set of parameters theta. This set of theta might not be the best, but it gives an overview of the steps we will use to tune the model later.
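
To make the comparison concrete, here is a minimal sketch of batch gradient descent for linear regression, checked against a closed-form least-squares solution. To keep the example dependent on NumPy only, the closed-form side uses np.linalg.lstsq as a stand-in for statsmodels OLS; both solve the same ordinary least-squares problem, so the coefficients should agree. The function names, the learning rate alpha, the iteration count, and the synthetic data are assumptions for this illustration, and the features are assumed to be normalized already.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=2000):
    """Batch gradient descent on the cost J(theta) = sum((X theta - y)^2) / (2m)."""
    m = len(y)
    Xb = np.column_stack([np.ones(m), X])  # prepend the intercept column
    theta = np.zeros(Xb.shape[1])
    cost_history = []
    for _ in range(num_iters):
        error = Xb @ theta - y
        cost_history.append((error @ error) / (2 * m))
        # The gradient of J with respect to theta is Xb^T error / m.
        theta -= (alpha / m) * (Xb.T @ error)
    return theta, cost_history

def ols_closed_form(X, y):
    """Least-squares solution via NumPy (stand-in for statsmodels OLS)."""
    Xb = np.column_stack([np.ones(len(y)), X])
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return theta

# Synthetic, already-normalized features with known coefficients.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 4.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 100)

theta_gd, costs = gradient_descent(X, y)
theta_ols = ols_closed_form(X, y)
```

When gradient descent has converged, theta_gd and theta_ols should be nearly identical. A cost history that decreases and then flattens suggests that alpha and num_iters are reasonable; a cost that grows usually means alpha is too large.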