d3 stacked area

In this fourth and final part of topic discovery in black metal lyrics, we’ll address the issue of assigning topics to record labels based on the lyrical content of their black metal releases. In other words, we want to find out if a given label has a tendency to release bands that write about a particular theme. We’ll also investigate the temporal evolution of these topics, that is, how the usage of topics in black metal lyrics has changed through the years. This should shed some light on whether lyrical content has remained the same over time.

In order to address these questions, I turned once more to topic modeling. This machine learning technique was mentioned in parts I, II and III of this post, so knock yourself out reading those. If that does not appeal to you, let’s sum things up by saying that topic modeling aims to automatically infer (i.e., with minimum human intervention) the topics underlying a collection of texts. “Topic” in this context is defined as a set of words that (co-)occur in the same context and are semantically related, somehow.

Instead of using the topic model built for parts II and III, I generated a new one after some (sensible, I hope) cleaning of the data set. This pre-processing involved, among other things, removing lyrics that were not fully translated to English and lyrics with fewer than 5 words. In the end, I reduced the data set to 72666 lyrics (how ominous!) and generated a topic model of 30 topics with the Stanford Topic Modeling Toolbox (STMT).

As in previous attempts, 2 or 3 of these 30 topics seemed quite generic (they were composed of words that could occur in any context) or just plain noisy garbage, but for the most part the topics are quite concise. I’m listing those I found the most interesting/intriguing. For each of them I added a title (between parentheses) that tentatively describes the overall tone of the topic:

One nice feature of the STMT is the ability to “slice” the data with respect to the topics. When slicing the data by date, for example, one can infer what percentage of lyrics in a given year falls into each topic.
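To make the idea of slicing a bit more concrete, here is a minimal Python sketch of the computation. It is not how the STMT does it internally: I'm assuming each lyric has already been reduced to a single dominant topic, whereas the toolbox may well aggregate full per-lyric topic distributions. The function name `topic_share_by_slice` and the toy data are made up for illustration.

```python
from collections import Counter, defaultdict

def topic_share_by_slice(records):
    """Compute the percentage of lyrics per topic within each slice.

    records: iterable of (slice_key, topic) pairs, one pair per lyric.
    Assumption: every lyric carries exactly one (dominant) topic.
    Returns {slice_key: {topic: percentage}}.
    """
    counts = defaultdict(Counter)
    for key, topic in records:
        counts[key][topic] += 1

    shares = {}
    for key, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[key] = {t: 100.0 * n / total for t, n in topic_counts.items()}
    return shares

# Toy data, invented for this example (not taken from the real data set).
lyrics = [
    (1992, "Pain, Sorrow & Suffering"),
    (1992, "Warriors & Battles"),
    (1992, "Warriors & Battles"),
    (1992, "The Divine"),
    (2006, "Pain, Sorrow & Suffering"),
    (2006, "The Divine"),
]

shares = topic_share_by_slice(lyrics)
print(shares[1992]["Warriors & Battles"])  # 50.0
```

The slice key is just a value attached to each lyric, so the same function works for years, record labels, or anything else.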

In order to observe the temporal evolution of some of these 30 topics between 1980 and 2014, I chose to use an NVD3 stacked area chart instead of just plotting twenty-something lines (which would be impossible to understand given the inevitable overlapping). The final result looks very neat and tidy, but can also be misleading and give the impression that all the topics are rising and falling at the same points in time. This is not true: when inspecting the stacked area chart below, remember that what represents a topic in a given year is the thickness of its band at that point, not the height of its upper edge. You can also deselect all topics (in the legend, top-right corner) except the one you want to examine, or simply click its area in the graph.
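A small Python sketch shows why stacking can mislead. This is only an illustration of the general stacking idea, not NVD3's actual code, and the topic names and numbers are invented: each band sits on top of the running total of the bands beneath it, so a perfectly flat topic still appears to go up and down whenever the topics below it do.

```python
def stack_layers(series):
    """Turn per-topic yearly values into stacked (lower, upper) band edges.

    series: dict mapping topic -> list of yearly percentages (equal lengths).
    Layers are stacked in the dict's insertion order.
    """
    n_years = len(next(iter(series.values())))
    baseline = [0.0] * n_years
    bands = {}
    for topic, values in series.items():
        upper = [b + v for b, v in zip(baseline, values)]
        bands[topic] = list(zip(baseline, upper))
        baseline = upper  # the next layer sits on top of this one
    return bands

# Invented numbers: "Topic B" is constant at 4% in all three years.
series = {
    "Topic A": [5.0, 10.0, 5.0],
    "Topic B": [4.0, 4.0, 4.0],
}
bands = stack_layers(series)

upper_edges = [upper for _, upper in bands["Topic B"]]
thickness = [upper - lower for lower, upper in bands["Topic B"]]
print(upper_edges)  # [9.0, 14.0, 9.0]  looks like a peak in the middle year
print(thickness)    # [4.0, 4.0, 4.0]   the topic itself never changed
```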

It seems that “Pain, Sorrow & Suffering” is consistently the most prevalent topic, peaking at 10.3% somewhere around 2006. “Fucking” has a peak in 1992, and “Warriors & Battles” represents more than 20% of the topic assignments in 1986. For the most part, the topic assignment percentages seem to stabilize after 1992/93 (after the Norwegian boom or second wave or whatever it’s called).

And finally, when slicing the data set by record label, the output can be interpreted as the percentage of black metal releases by a given label that falls into each topic. After doing precisely that for record labels that have a minimum of 10 black metal releases, I selected a few labels and plotted, for each one, the percentage of releases that were assigned to the topics with some degree of confidence. The resulting plot is huge, so I removed a few generic topics for the sake of clarity. Hovering the mouse over a topic title pops up a set of words that represent it. Similarly, hovering the mouse over a record label name turns the circles into percentages. The larger a circle’s radius, the higher the percentage of that label’s releases that were assigned to the corresponding topic.
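For the curious, the label-level numbers can be sketched in Python along these lines. Take it as a rough approximation under my own assumptions: the input layout (one label, topic and confidence score per release), the function name `label_topic_percentages` and the 0.5 confidence cut-off are all hypothetical, since “some degree of confidence” above is not tied to a specific value.

```python
from collections import Counter, defaultdict

def label_topic_percentages(releases, min_releases=10, min_confidence=0.5):
    """Percentage of each label's releases confidently assigned to each topic.

    releases: iterable of (label, topic, confidence) triples, one per release.
    Labels with fewer than min_releases releases are skipped.
    min_confidence is an assumed cut-off, chosen only for illustration.
    """
    by_label = defaultdict(list)
    for label, topic, confidence in releases:
        by_label[label].append((topic, confidence))

    result = {}
    for label, items in by_label.items():
        if len(items) < min_releases:
            continue  # too few releases to say anything meaningful
        confident = Counter(t for t, c in items if c >= min_confidence)
        # Percentages are relative to ALL of the label's releases.
        result[label] = {t: 100.0 * n / len(items) for t, n in confident.items()}
    return result

# Invented toy data with placeholder label names.
releases = (
    [("Label X", "The Divine", 0.9)] * 3
    + [("Label X", "Mind & Reality", 0.8)] * 5
    + [("Label X", "The Divine", 0.2)] * 2   # low confidence, not counted
    + [("Label Y", "The Divine", 0.9)] * 2   # only 2 releases, label skipped
)

percentages = label_topic_percentages(releases)
print(percentages["Label X"]["The Divine"])  # 30.0
print("Label Y" in percentages)              # False
```

Because low-confidence assignments are dropped from the counts but not from the total, the percentages for a label do not have to add up to 100.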

Some results that stand out: it seems that more than 20% of Depressive Illusions’ releases were assigned to “Pain, Sorrow & Suffering”. The top three topics for End All Life (which has released albums by Abigor, Blacklodge and Mütiilation, to name a few) are “Mind & Reality”, “Pain, Sorrow & Suffering” and “Chaos, Universe & Cosmos”. Also, almost 1/4 of all Norma Evangelium Diaboli’s releases (which include Deathspell Omega, Funeral Mist and Katharsis) seem to pertain to “The Divine” topic.

And that’s it for now. I’m done with topic modeling until I have the time and patience to fine-tune the overall representation of the data and the algorithm’s parameters. In the next few weeks I’ll turn to other unsupervised machine learning techniques, such as clustering, to discover hidden relationships between bands.