Machine Learning Algorithms Explained – Logistic Regression

Logistic regression is a supervised statistical model that predicts a categorical dependent variable. Categorical variables take on values that are names or labels, such as win/lose, healthy/sick or pass/fail. The model can also be used on dependent variables with more than two categories, in which case it is called multinomial logistic regression.

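Logistic regression is used to build a classification rule for a given dataset based on historical information that is divided into categories. In its standard (softmax) form, the model is:

$$P(Y = c \mid X) = \frac{\exp(\beta_{c,0} + \beta_{c,1} X_1 + \dots + \beta_{c,p} X_p)}{\sum_{k=1}^{K} \exp(\beta_{k,0} + \beta_{k,1} X_1 + \dots + \beta_{k,p} X_p)}$$

The terms are defined as follows:

$c = 1, \dots, K$ are all possible categories of the dependent variable $Y$.

$P(Y = c \mid X)$ is the probability that the dependent variable is equal to category $c$.

$\beta_{c,0}, \beta_{c,1}, \dots, \beta_{c,p}$ are the coefficients of the regression (with $\beta_{c,0}$ the intercept) which, when transformed, express the importance of each variable in explaining the probability.

$X_1, \dots, X_p$ are the independent variables.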
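To make the formula concrete, here is a minimal Python sketch that turns a set of coefficients and one observation into a probability for each category. The function name `class_probabilities` and the coefficient values are our own illustrative assumptions, not part of the fitted model discussed later.

```python
import math

def class_probabilities(x, coefficients):
    """Return P(Y = c | x) for every category c.

    x            -- values of the independent variables [x_1, ..., x_p]
    coefficients -- one row per category: [intercept, b_1, ..., b_p]
    """
    # Linear score per category: intercept + b_1*x_1 + ... + b_p*x_p
    scores = [row[0] + sum(b * v for b, v in zip(row[1:], x))
              for row in coefficients]
    # Subtracting the largest score avoids overflow in exp()
    # and leaves the resulting probabilities unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coefficients: 3 categories, 2 independent variables.
coefs = [[0.5, 1.0, -1.0],
         [0.0, 0.2, 0.3],
         [-0.5, -1.0, 1.2]]
probs = class_probabilities([1.5, 0.4], coefs)

# With two categories and the first row fixed at zero, the same function
# reduces to the familiar binary logistic (sigmoid) curve.
binary = class_probabilities([2.0], [[0.0, 0.0], [1.0, 0.5]])
```

The probabilities always sum to 1, so the category with the largest linear score is also the most probable one.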

We are going to use the iris dataset that we have used in previous blog posts to illustrate how logistic regression works. The data consists of 150 plants, each described by its species (there are three species in this dataset), sepal length and width, and petal length and width. The goal is to characterize each species using only the sepal and petal measurements. We are also going to build a classification rule that can determine the species of a new plant added to the dataset. Figure 1 illustrates the sepal and petal measurements of an iris plant.

To start, we must split the dataset into two subsets: training and testing. The training set makes up 60 percent of the full dataset and is used to fit the model. The testing set accounts for the other 40 percent and is used to check how well the fitted model predicts data it was not trained on.
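A 60/40 split like this one can be sketched in a few lines of Python. The helper `split_rows` and the fixed seed are assumptions made for the example; the stand-in list of 150 integers simply takes the place of the 150 iris records.

```python
import random

def split_rows(rows, train_fraction=0.6, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(150))  # stand-in for the 150 iris records
train, test = split_rows(rows)
print(len(train), len(test))  # 90 60
```

A plain shuffle does not guarantee that each species is equally represented in both subsets; splitting within each species separately would be one way to enforce that.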

Using the formula above, we fit the logistic regression model to the training data. In this case, the dependent variable is plant species, the number of categories is 3, and the independent variables are sepal length, sepal width, petal length and petal width. Figure 2 shows a subset of the data.

Figure 2. Iris dataset sample
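For readers curious about what "fitting" involves, the sketch below estimates the coefficients by plain gradient descent on the negative log-likelihood. It is not the code behind the results in this post: the function names, learning rate and number of passes are our own choices, and the nine two-feature points are made-up toy data rather than iris measurements.

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fit_multinomial(X, y, n_classes, lr=0.1, epochs=500):
    """Batch gradient descent on the mean negative log-likelihood.

    Returns one row per class: [intercept, b_1, ..., b_p].
    """
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for x, label in zip(X, y):
            xb = [1.0] + list(x)  # leading 1.0 multiplies the intercept
            probs = softmax([sum(w * v for w, v in zip(row, xb)) for row in W])
            for c in range(n_classes):
                # Gradient term: predicted probability minus the 0/1 target
                err = probs[c] - (1.0 if c == label else 0.0)
                for j, v in enumerate(xb):
                    grad[c][j] += err * v
        for c in range(n_classes):
            for j in range(n_features + 1):
                W[c][j] -= lr * grad[c][j] / len(X)
    return W

def predict(W, x):
    """Return the index of the class with the highest linear score."""
    xb = [1.0] + list(x)
    scores = [sum(w * v for w, v in zip(row, xb)) for row in W]
    return scores.index(max(scores))

# Made-up, well-separated toy points (NOT iris data), three per class.
X = [[-2.0, -1.0], [-2.2, -0.8], [-1.8, -1.1],
     [2.0, -1.0], [2.1, -0.9], [1.9, -1.2],
     [0.0, 2.0], [0.2, 2.1], [-0.1, 1.8]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
W = fit_multinomial(X, y, n_classes=3)
fitted = [predict(W, x) for x in X]
```

In practice a statistics or machine learning library would do this step, usually with a faster optimizer and optional regularization.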

In Table 1, we present the estimated coefficient for each independent variable in each of the three plant species. As the estimates show, petal length and width are the most influential variables in the characterization process. As a result, those two variables stand out in the feature-importance plot for each species (Figure 3).

Table 1. Estimation of coefficients

Figure 3. Feature importance

Next, we build a confusion matrix to check the performance of the model. This matrix compares the known category of each iris plant in the test dataset with the category predicted by the fitted model. Our goal is for the predicted category to be the same as the real one. In Table 2, we see that the model is performing relatively well: only two versicolor plants were incorrectly classified.

Table 2. Confusion matrix
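Building such a matrix is straightforward. The following sketch counts how often each known category is paired with each predicted category; the label lists are hypothetical and are not the test set used for Table 2.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are the known categories, columns the predicted ones."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

labels = ["setosa", "versicolor", "virginica"]
# Hypothetical labels for illustration only.
actual = ["setosa", "setosa", "versicolor", "versicolor",
          "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica",
             "versicolor", "virginica", "virginica"]

matrix = confusion_matrix(actual, predicted, labels)
# matrix == [[2, 0, 0], [0, 2, 1], [0, 0, 2]]

# Accuracy is the share of plants on the diagonal (correct predictions).
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
```

Off-diagonal cells show which species get confused with which; here one versicolor plant is predicted as virginica.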

Based on these results, the model classifies the iris plants in the test set well. However, as mentioned before, we must now formulate a classification rule. The first step in calculating the probability that a new iris plant belongs to a given category is to multiply the values of the independent variables for the new plant by the coefficient estimates in Table 1. The values for a new iris plant are shown below in Table 3:

Table 3. Independent variable values for a new iris plant

Then, we calculate the probability of the iris plant falling into each category using the formula referenced earlier. The results (Table 4) indicate that the iris described above most likely belongs to the virginica species.

Table 4. Probability of belonging to each species for the new iris plant
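The whole classification rule can be sketched as a small Python function: compute a score per species, convert the scores to probabilities, and pick the most probable species. The coefficients and the new plant's measurements below are hypothetical placeholders, not the values from Table 1 or Table 3.

```python
import math

# Hypothetical coefficients per species, NOT the estimates from Table 1:
# [intercept, sepal length, sepal width, petal length, petal width]
COEFFICIENTS = {
    "setosa":     [1.0, 0.4, 1.2, -2.0, -1.0],
    "versicolor": [2.0, 0.3, -0.2, -0.1, -0.8],
    "virginica":  [-3.0, -0.7, -1.0, 2.1, 1.8],
}

def species_probabilities(measurements, coefficients=COEFFICIENTS):
    """Return {species: probability} for one plant."""
    scores = {
        species: row[0] + sum(b * v for b, v in zip(row[1:], measurements))
        for species, row in coefficients.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {species: math.exp(s - top) for species, s in scores.items()}
    total = sum(exps.values())
    return {species: e / total for species, e in exps.items()}

def classify(measurements, coefficients=COEFFICIENTS):
    """Classification rule: pick the species with the highest probability."""
    probs = species_probabilities(measurements, coefficients)
    return max(probs, key=probs.get)

# Hypothetical new plant: sepal length, sepal width, petal length, petal width.
new_plant = [6.5, 3.0, 5.5, 2.0]
probabilities = species_probabilities(new_plant)
predicted_species = classify(new_plant)  # "virginica" with these made-up numbers
```

Returning the full set of probabilities alongside the predicted label is useful because it shows how confident the rule is, not just which species it picks.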