Research techniques and education

KMeans Clustering in Python

Kmeans clustering is a technique in which the examples in a dataset our divided through segmentation. The segmentation has to do with complex statistical analysis in which examples within a group are more similar the examples outside of a group.
The application of this is that it provides the analysis with various groups that have similar characteristics which can be used to cater services to in various industries such as business or education. In this post, we will look at how to do this using Python. We will use the steps below to complete this process.

Data preparation

Determine the number of clusters

Conduct analysis

Data Preparation

Our data for this examples comes from the sat.act dataset available in the pydataset module. Below is some initial code.

You can see there are six variables that will be used for the clustering. Next, we will turn to determining the number of clusters.

Determine the Number of Clusters

Before you can actually do a kmeans analysis you must specify the number of clusters. This can be tricky as there is no single way to determine this. For our purposes, we will use the elbow method.

The elbow method measures the within sum of error in each cluster. As the number of clusters increasings this error decrease. However, a certain point the return on increasing clustering becomes minimal and this is known as the elbow. Below is the code to calculate this.

The graph indicates that 3 clusters are sufficient for this dataset. We can now perform the actual kmeans clustering.

KMeans Analysis

The code for the kmeans analysis is as follows

km=KMeans(3,init='k-means++',random_state=3425)
km.fit(df.values)

We use the KMeans function and tell Python the number of clusters, the type of, initialization, and we set the seed with the random_state argument. All this is saved in the objet called km

The km object has the .fit function used on it with df.values

Next, we will predict with the predict function and look at the first few lines of the modified df with the .head() function.

You can see we created a new variable called predict. This variable contains the kmeans algorithm prediction of which group each example belongs too. We then printed the first five values as an illustration. Below are the descriptive statistics for the three clusters that were produced for the variable in the dataset.

It is clear that the clusters are mainly divided based on the performance on the various test used. In the last piece of code, gender is used. 1 represents male and 2 represents female.

We will now make a visual of the clusters using two dimensions. First, w e need to make a map of the clusters that is saved as a dictionary. Then we will create a new variable in which we take the numerical value of each cluster and convert it to a sting in our cluster map dictiojnary.

Next, we make a different dictionary to color the points in our graph.

d_color={'Weak':'y','Average':'r','Strong':'g'}

Here is what is happening in the code above.

We set the ax object to a value.

A for loop is used to go through every example in clust_map.values so that they are colored according the color

Lastly, a plot is called which lines upo the perf and clust values for color.

The groups are clearly separated when looking at them in two dimensions.

Conclusion

Kmeans is a form of unsupervised learning in which there is no dependent variable which you can use to assess the accuracy of the classification or the reduction of error in regression. As such, it can be difficult to know how well the algorithm did with the data. Despite this, kmeans is commonly used in situations in which people are trying to understand the data rather than predict.