Introduction:

Cluster analysis is a valuable machine learning tool, with applications across virtually every discipline. These range from analyzing satellite images of agricultural areas to identify different types of crops, to finding themes across the billions of social media posts that are broadcast publicly every month. Clustering is often used together with natural language processing (NLP). NLP allows us to turn unstructured text data into structured numerical data. This is how we transform human language into the computer-friendly inputs required to apply machine learning.

For a straightforward application of how these two techniques can be applied in sequence, we've chosen an easily accessible public data set: the 'Top 50 Songs of 2015' by Genius, the digital media company. In our example, we look at how songs can be broken into similar clusters by their lyrics, and then conduct further statistical analysis on the resulting clusters.

Machine learning and other data science methodologies increasingly confer great powers upon those who practice them. They are not, however, without limitations. The misapplication or misinterpretation of these methodologies can lead not only to a reduced ability to extract insights, but also to categorically false or misleading conclusions. We examine potential mishaps in the transformation and analysis of the data at various stages of this study.

This study explores the use of cluster analysis to answer the following questions:

Can you use machine learning models to analyze the lyrics of a set of songs, and cluster them into similar kinds of songs?

Do word count or breadth of vocabulary hold predictive value for a song's rank, within that set?

Do the averages of word count and breadth of vocabulary for each cluster of songs hold predictive value for the average rank of songs in that cluster?

Notes:

This analysis was inspired by the thoughtful cluster analysis of Reddit's top 50 subreddits, made available by Ari Morcos.

Data was sourced from Genius.com's user-curated list of top songs for 2015, available here.

Methodology:

We performed this analysis in Python, using pandas, scikit-learn, and NLTK to model the data. Graphs were plotted with matplotlib and/or seaborn.

Steps:

Clean and tokenize each song’s lyrics

Create a dictionary containing every unique word appearing in the data, and its frequency

Select the words which appear more than twice per song on average (only the top 36 out of 3,126 unique words)

Assign the normalized frequency of each of these top 36 words to a vector, for each song

Calculate the Euclidean distance between each pair of songs, and project these values into a matrix

Perform principal component analysis (PCA) on the matrix

Feed the matrix with reduced dimensionality into a K-Means clustering algorithm*

* We first fed the data into an affinity propagation model, but found the model unsuited to the data.
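The steps above can be sketched end-to-end in Python. Everything here is a toy stand-in: the `songs` dict is three invented snippets rather than the real Genius lyrics, and this tiny corpus keeps far fewer components and clusters than the study did.

```python
import re
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

# Hypothetical stand-in lyrics; the real data was sourced from Genius
songs = {
    "song_a": "love love love baby night love",
    "song_b": "money money love night club love",
    "song_c": "love night night night baby night night dance",
}

# 1. Clean and tokenize each song's lyrics
tokens = {t: re.findall(r"[a-z']+", lyr.lower()) for t, lyr in songs.items()}

# 2. Dictionary of every unique word and its frequency across the corpus
freq = Counter(w for ws in tokens.values() for w in ws)

# 3. Keep words appearing more than twice per song on average
#    (the study kept the top 36 of 3,126 words this way)
top_words = [w for w, c in freq.items() if c > 2 * len(songs)]

# 4. Normalized frequency of each top word, per song
X = np.array([[ws.count(w) / len(ws) for w in top_words]
              for ws in tokens.values()])

# 5. Pairwise Euclidean distance matrix between songs
D = pairwise_distances(X, metric="euclidean")

# 6. PCA on the matrix (toy data, so far fewer components than the study's 9)
reduced = PCA(n_components=2).fit_transform(D)

# 7. K-Means on the reduced matrix (2 clusters here vs the study's 8)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
```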

Further Analysis:

After clustering, we analyzed the relationships between word count, vocabulary, songs, and the clusters the songs were assigned to. Results are below.

Project Results and Analysis:

After preprocessing the data and determining the similarity coefficient for each pair of songs, we projected these values into a matrix:

Cooler colors are more similar, while hotter colors are more dissimilar.

Interestingly, Major Lazer's "Lean On" is so lyrically dissimilar from every other song that it accounts for the top five most dissimilar pairings, each against a different song.

Principal Component Analysis

Unfortunately, PCA does not help as much here as it may in other cases. We require the first 9 vectors in our new matrix of principal components to account for only just over 60% of the variance.
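The "components needed for X% of variance" figure can be read directly off a fitted scikit-learn PCA. A sketch on random stand-in data (only the 50×36 shape mirrors the study; the real matrix needed 9 components for just over 60%):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 36))  # 50 songs x 36 top-word features (shape only)

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance hits 60%
n_needed = int(np.searchsorted(cumvar, 0.60) + 1)
```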

Clustering the songs

The Reddit analysis that inspired this project used an affinity propagation algorithm to perform its clustering. Affinity propagation does not require you to specify the number of clusters up front. This means that, unlike with K-Means, the number of clusters is NOT arbitrary. It is therefore often a great choice, when it works.

We, however, found K-Means to be a better choice for this dataset. The affinity propagation model kept outputting very uneven class distributions: we'd get roughly 8 to 12 clusters, 3 of which would contain about 90% of the songs, while the remaining clusters had just 1 or 2 songs each.

This kind of clustering isn't necessarily bad. If many songs are similar to one another, and a few are very dissimilar to all the others, then the model is just doing its job and showing us those relationships. We wanted to analyze the summary statistics for each class, though, and the large number of single-song classes made that difficult.

After some initial tweaking, we fit a K-Means model to the first 9 dimensions of our PCA transform. We found 8 clusters to be an effective tradeoff between the number of classes and the distribution of songs within those classes.

Of the 8 classes we get, five have six or more songs within them.
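Assuming a `reduced` matrix like the study's 50×9 PCA output (random stand-in data here), the fit and the resulting class distribution look like:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
reduced = rng.normal(size=(50, 9))  # stand-in for the 50x9 PCA-reduced matrix

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(reduced)
sizes = np.bincount(km.labels_, minlength=8)  # songs per class
```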

Now that we've clustered the Genius Top 50 Songs of 2015 into eight groups, how can we visually represent them?

If we plot the first 3 principal components along 3 axes, we can already start to see some clustering. This plot is actually more illustrative than might be expected: even though these 3 dimensions only account for ~25% of the variance, we can see a reasonable degree of grouping.

Besides lyrics, what do classes share in common?

First, let's visualize some of our distributions:

1. On the left, we have a box plot showing the distribution of words per song.

The mean is just over 400 words.

The standard deviation is large relative to the mean, and word count is distributed fairly evenly around it.

The first and third quartiles range between ~300 and ~600 words.

2. On the right, we have a histogram depicting the number of songs which fall into bins characterized by word quantity.

The distribution is right-skewed, due to a single song with over 1400 words (3.5x the mean)
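Statistics like those above can be read straight off pandas' `describe`. A sketch on a right-skewed stand-in distribution (lognormal values chosen only to mimic the shape, not the study's real counts):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Right-skewed stand-in for words-per-song (not the real Genius data)
word_counts = pd.Series(np.round(rng.lognormal(mean=6.0, sigma=0.4, size=50)))

stats = word_counts.describe()       # count, mean, std, quartiles, min/max
q1, q3 = stats["25%"], stats["75%"]  # first and third quartiles
```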

Next, let's see whether there is any correlation between classes and the average number of unique words within each class.

There is no reason there *should* be, as songs were clustered by the words themselves, rather than the number of words.

Still, there may be some correlation. Perhaps songs with similar lyrics share a similar breadth of vocabulary (number of unique words per song).
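For concreteness, "word count" and "breadth of vocabulary" are just the token count and the distinct-token count per song; a minimal sketch on a hypothetical token string:

```python
# Hypothetical one-song token string, not from the real dataset
lyric_tokens = "love love baby night love night".split()

word_count = len(lyric_tokens)         # total words: 6
unique_words = len(set(lyric_tokens))  # breadth of vocabulary: 3
```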

There is no correlation between the number of words or number of unique words, and the class a song was assigned to.

This graph is somewhat misleading, because it implies the possibility of some kind of linear relationship between classes and the number of words or unique words (or anything else at all). In reality, the class labels themselves are meaningless; they are arbitrarily assigned by the K-Means algorithm. The information they contain relates to the ways in which songs with the same label relate to one another.

This graph itself, though, conveys valuable information.

If there were a correlation between classes and number of unique words, we would expect to see some kind of clustering along the horizontal axis, for each class along the vertical axis.

There is none of that whatsoever, indicating a complete lack of any correlation between classes and the number of words for songs within them. In fact, the R-Squared (not pictured here) between classes and unique words is ~0.01, or essentially zero.

Word Count & Unique Words Vs Song Ranking

Is Song Rank a function of the number of unique words in a song? Is Song Rank a function of a song's word count?

No. With R-Squared values of 0.0037 and 2.4e-6, there is absolutely no correlation between the number of unique words - or total words - in a song and the song's ranking.
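An R² like these can be computed from scipy's `linregress`. Toy numbers below - the ranks have the real shape (1-50), but the vocabulary counts are random stand-ins, not the study's data:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
rank = np.arange(1, 51)                         # ranks 1..50
unique_words = rng.integers(100, 300, size=50)  # random stand-in vocab counts

fit = linregress(rank, unique_words)
r_squared = fit.rvalue ** 2  # near zero when the variables are unrelated
```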

Is there a correlation between the AVERAGE rank per class, and the AVERAGE word count or unique word count per class?

Note that we dropped 3 clusters here. These clusters had only 1, 1, and 3 songs respectively. If we kept those clusters, single songs might exert far too much leverage on the rankings for each cluster. The remaining 5 classes contain 45 of our 50 songs - or 90% of the data. Each of these has between 6 and 12 songs in it.
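The per-cluster averaging and the small-cluster filter amount to a pandas groupby. The frame below is a toy stand-in with invented values, and its filter threshold is scaled down to fit the toy data (the study kept only clusters with 6+ songs):

```python
import pandas as pd

# Toy frame: one row per song (cluster labels, ranks, and counts are invented)
df = pd.DataFrame({
    "cluster":      [0, 0, 0, 1, 1, 1, 2],
    "rank":         [3, 10, 20, 5, 25, 40, 50],
    "word_count":   [400, 350, 500, 600, 550, 700, 300],
    "unique_words": [150, 140, 180, 200, 210, 220, 120],
})

# Drop tiny clusters that would let single songs dominate the averages
cluster_sizes = df.groupby("cluster")["rank"].transform("size")
kept = df[cluster_sizes > 2]

# Per-cluster averages used in the regression of avg rank on avg word counts
per_cluster = kept.groupby("cluster")[["rank", "word_count", "unique_words"]].mean()
```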

Yes. There does appear to be a correlation (13% and 34% of variance).

This finding seems to imply that, while a song's word count or vocabulary breadth holds no value as a predictor of its ranking, the average word count or breadth of vocabulary for each kind* of song may actually hold some predictive value for that kind of song's average ranking.

* The easiest analogy to kinds of songs is genre. But these "genres" are not defined by popularly-agreed divisions or titles (e.g. rap, country, rock). The "genres" here are determined** by the songs' lyrical similarities - or dissimilarities - to one another.

** It would be interesting to compare the classifications made by analysis of the lyrics, to the genres the songs actually do fall under. The question here is, "Do songs that are clustered together tend to fall into the same genre? If so, how significant is the correlation?" This, however, is another job for another time.

Conclusion

1. Songs can be effectively clustered together by analysis of their lyrics.

2. In the context of individual songs, there is no relationship between song ranking and number of words, or unique words, in a song.

3. After clustering songs into similar kinds of songs, relationships do emerge:

1. There is a weak inverse correlation between the average number of words in a cluster, and the average ranking.

* This can account for ~13% of the variance.

2. There is a stronger - but still not huge - inverse correlation between the average number of unique words in a cluster, and the average ranking.

* This can account for ~34% of the variance

Considerations:

1. Small sample size: 50 songs, and resultant 8 classes (5 of which we kept - accounting for 90% of the songs)

* It is neat, however, to see that there does at least *appear* to be a relationship between word count / vocabulary, and ranking

2. Subjective rankings.

* Genius describes their process as such: "Contributors voted on an initial poll, spent weeks discussing revisions and replacements, and elected to write about their favorite tracks."

* While it does seem that the Genius community at large was polled, and that poll determined the songs that were ultimately selected, the actual ranking was not necessarily reflective of the community at large. Rather, a select group of individual contributors had the final say.

3. Relatively few words per song.

* The average song had between ~300 and ~600 words. This led to a relatively *small* incidence of repeated words between songs. Of the total 3,126 unique words found across the entire dataset, only the top 36 appeared more than 100 times, cumulatively. This means that nearly 99% of the words appeared *less* than twice per song. This gives tremendous leverage to a small number of words.

* While the statistical significance of regression techniques would be improved by a larger dataset, that would still not likely change the reality that 1% of the words accounts for 100% of the leverage in our model. Indeed, the analysis of Reddit's top 50 subreddits that this project was inspired by was a clustering of entire *forums* - with an associated plethora of data to draw from.