Logistic regression with softmax function

The logistic regression model is a simple but popular generalized linear model. It is used for classification with binary or multiple classes. Here, we will implement this model in Python, test it on simulated data, and compare its performance with scikit-learn's logistic regression module.

Review of Logistic Regression

Logit function

The linear regression model builds a quantitative relation between the value of the explanatory variables $x$ and that of the dependent variable $y$.

$$y = w^\top x + b + \varepsilon \qquad (1)$$

However, it is not reasonable to apply this model to binary classification problems. Why? In that situation, the dependent variable is a qualitative dummy variable, usually 0 or 1, where 1 represents that the object belongs to one class and 0 to the other. But in linear regression, the dependent variable is quantitative and usually ranges from $-\infty$ to $+\infty$.

A natural transformation is to make the dependent variable the conditional probability of the object belonging to one class, which lies in the range $(0,1)$. Then a monotonic link function transforms that probability from $(0,1)$ to $(-\infty, +\infty)$, and we can build a linear regression on the new dependent variable.

The transformation we described above is the logit function:

$$\operatorname{logit}(p) = \ln\frac{p}{1-p} \qquad (2)$$

Then, we have a logistic regression model as

$$\ln\frac{p}{1-p} = w^\top x + b \qquad (3)$$

With formula (3), we have:

$$p = \frac{e^{w^\top x + b}}{1 + e^{w^\top x + b}} = \frac{1}{1 + e^{-(w^\top x + b)}} \qquad (4)$$

Now that we have the probability of the object belonging to one class, we could derive the probability of it belonging to the other class as $1 - p$ and estimate the parameters with MLE (Maximum Likelihood Estimation). However, we can also build two parallel logistic regressions, one per class, by using the softmax function and optimize the cross-entropy loss to estimate the parameters.

Softmax function

What is a softmax function? It is a generalization of the logistic function which takes a $K$-dimensional vector $z$ and normalizes each element of the vector into the range $(0,1)$ by

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \dots, K \qquad (5)$$

In our parallel logistic regression model, or softmax classifier, we have

$$P(y = 1 \mid x) = \frac{e^{w_1^\top x + b_1}}{e^{w_1^\top x + b_1} + e^{w_2^\top x + b_2}} \qquad (6)$$

$$P(y = 2 \mid x) = \frac{e^{w_2^\top x + b_2}}{e^{w_1^\top x + b_1} + e^{w_2^\top x + b_2}} \qquad (7)$$

or, for $K$ classes in general,

$$P(y = j \mid x) = \frac{e^{w_j^\top x + b_j}}{\sum_{k=1}^{K} e^{w_k^\top x + b_k}} \qquad (8)$$

The softmax function here normalizes the results from the two linear models and squashes them into the range $(0,1)$. However, the parameter pairs $(w_1, b_1)$ and $(w_2, b_2)$ are redundant: subtracting the same vector from both weight vectors (and the same scalar from both intercepts) leaves the probabilities unchanged, so we can acquire the same results by building only one logistic regression. Therefore, we include a regularization term in our loss function to restrict the magnitude of the parameters.
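The redundancy can be checked numerically. The sketch below (all values arbitrary, chosen only for illustration) shifts both weight vectors and intercepts by the same amount and confirms the softmax probabilities do not change:

```python
import numpy as np

def softmax(z):
    """Softmax of a score vector, as in formula (5)."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

x = np.array([0.5, -1.2])
W = np.array([[1.0, 2.0], [-0.5, 0.3]])   # one weight row per class
b = np.array([0.1, -0.4])
shift_w, shift_b = np.array([3.0, -7.0]), 2.5

p1 = softmax(W @ x + b)
p2 = softmax((W - shift_w) @ x + (b - shift_b))
print(np.allclose(p1, p2))  # the two parameterizations agree
```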

The cross-entropy loss, i.e. the negative log-likelihood, with L2 regularization is

$$J(W, b) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} 1\{y_i = j\} \log \hat{p}_{ij} + \frac{\lambda}{2} \sum_{j=1}^{K} \|w_j\|^2 \qquad (9)$$

where $\hat{p}_{ij} = P(y_i = j \mid x_i)$ is the predicted probability from formula (8) and $1\{\cdot\}$ is the indicator function.

The set $\Theta = \{W, b\}$ represents all the parameters.

With the theory above, we can start to build the model.

Code from scratch

Necessary modules

We first import the necessary modules:

import numpy as np
import random
import matplotlib.pyplot as plt

Simulate the data

We will simulate a data set for testing the performance of the model. For convenience, we only simulate two-dimensional data.
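A minimal sketch of this simulation step. The function name `simulate_data` and the two-Gaussian-cluster design are our assumptions, not necessarily the original setup:

```python
import numpy as np

def simulate_data(n_per_class=100, seed=0):
    """Draw two 2-D Gaussian clusters labelled 0 and 1."""
    rng = np.random.default_rng(seed)
    x0 = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(n_per_class, 2))
    x1 = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(n_per_class, 2))
    X = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return X, y
```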

The argument of the softmax function is a vector obtained from the affine transformation of the original data, and the returned value is a vector of the same dimension as the input, normalized by formula (5).
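A sketch of that softmax function. Subtracting the row maximum before exponentiating is a standard numerical-stability trick we add here, not something stated in the text:

```python
import numpy as np

def softmax(z):
    """Normalize a score vector (or a batch of row vectors) into probabilities."""
    z = np.atleast_2d(z)
    z = z - z.max(axis=1, keepdims=True)  # guard against overflow in exp
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```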

Calculate the gradient

To estimate the parameters of the model, we will use stochastic gradient descent to minimize our cost function. Thus, we need the gradient at each step. For the loss in formula (9), the gradients with respect to the parameters are

$$\nabla_{w_j} J = -\frac{1}{N} \sum_{i=1}^{N} x_i \left( 1\{y_i = j\} - \hat{p}_{ij} \right) + \lambda w_j$$

$$\nabla_{b_j} J = -\frac{1}{N} \sum_{i=1}^{N} \left( 1\{y_i = j\} - \hat{p}_{ij} \right)$$

The softmaxGradient function has five arguments and returns the value of the cost function and the gradients of W and b. We will use the output of this function to update W and b in the SGD step that follows.
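A sketch of softmaxGradient under our assumptions: the argument order `(W, b, X, y, lam)` and the shapes below are guesses consistent with the text, not the author's original signature:

```python
import numpy as np

def softmaxGradient(W, b, X, y, lam):
    """Return the regularized cross-entropy loss and its gradients.

    W: (K, d) weights, b: (K,) intercepts, X: (n, d) data,
    y: (n,) integer labels in {0, ..., K-1}, lam: regularization strength.
    """
    n = X.shape[0]
    scores = X @ W.T + b                       # (n, K) affine scores
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)          # softmax probabilities
    Y = np.zeros_like(p)                       # one-hot encoding of labels
    Y[np.arange(n), y] = 1.0
    loss = -np.sum(Y * np.log(p)) / n + 0.5 * lam * np.sum(W * W)
    gradW = (p - Y).T @ X / n + lam * W        # (K, d)
    gradb = (p - Y).sum(axis=0) / n            # (K,)
    return loss, gradW, gradb
```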

Stochastic gradient descent

Stochastic gradient descent, also known as incremental gradient descent, is a stochastic approximation of the gradient descent optimization method for minimizing an objective function written as a sum of differentiable functions. In other words, SGD searches for minima or maxima iteratively, updating the parameters from one (or a few) randomly chosen samples at a time.
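A minimal SGD loop in that spirit. It assumes a gradient function with the `softmaxGradient(W, b, X, y, lam)` signature sketched above; the learning rate, epoch count, and batch size are illustrative choices of ours:

```python
import numpy as np

def sgd(grad_fn, W, b, X, y, lam=0.01, lr=0.1, epochs=50, batch=10, seed=0):
    """Run mini-batch SGD, updating W and b in place and returning them."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)             # reshuffle each epoch
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            _, gW, gb = grad_fn(W, b, X[idx], y[idx], lam)
            W -= lr * gW                       # incremental update
            b -= lr * gb
    return W, b
```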

Compare with scikit-learn logistic regression module

To evaluate the performance of our self-constructed model, we will compare it with the scikit-learn LogisticRegression module by classifying our test data and comparing the training accuracy.
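A sketch of the scikit-learn side of that comparison. The inline data generation here is an assumed stand-in for the simulation step above, so the numbers are illustrative rather than the author's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Regenerate the two-cluster data (assumed to match the simulation step).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

clf = LogisticRegression().fit(X, y)
sk_acc = clf.score(X, y)                  # training accuracy
print(f"scikit-learn training accuracy: {sk_acc:.3f}")
```

The training accuracy of our self-made model can then be computed on the same `X`, `y` and placed side by side with `sk_acc`.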

They are pretty close, indicating that our self-made logistic regression is effective.

Decision boundary visualization

Besides evaluating the training accuracy, we can also plot the decision boundary of the logistic regression from our self-made model on the 2-D plane of the original data.
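A sketch of that plot. For a two-class softmax classifier, the boundary is the line where the two affine scores tie, $(w_2 - w_1)^\top x + (b_2 - b_1) = 0$. The `W` and `b` values below are illustrative placeholders standing in for the parameters learned by the SGD step:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                     # render off-screen
import matplotlib.pyplot as plt

W = np.array([[-1.0, -1.0], [1.0, 1.0]])  # placeholder fitted weights (K=2)
b = np.array([0.0, 0.0])                  # placeholder fitted intercepts

# Regenerate the two-cluster data (assumed to match the simulation step).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

# Boundary: (w2 - w1) . x + (b2 - b1) = 0, solved for the second coordinate.
w = W[1] - W[0]
c = b[1] - b[0]
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
ys = -(w[0] * xs + c) / w[1]

plt.scatter(X[y == 0, 0], X[y == 0, 1], c="red", label="class 0")
plt.scatter(X[y == 1, 0], X[y == 1, 1], c="blue", label="class 1")
plt.plot(xs, ys, "k-", label="decision boundary")
plt.legend()
plt.savefig("decision_boundary.png")
```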

The boundary has successfully captured the difference between the red class and the blue class.

Summary

Now, we have finished the construction of logistic regression from scratch. It has good performance but is much slower than the scikit-learn implementation, because scikit-learn uses more advanced optimization solvers. However, by doing this, we gain a better understanding of the structure of the logistic model, the factors that influence its performance, and its close relation to the softmax function. With this knowledge, we can build our own extended algorithms to solve real-world problems that do not fit any existing machine learning framework.