Woods of Infinity

In part II of this post, we explored a topic model built for the whole black metal lyrics data set (if you don’t know what a topic model is, read this as well, but to sum things up let’s just say topic modeling is a process that enables discovery of the “meaning” underlying a document, with minimal human intervention). In said post we analyzed 1) the relationship between topics, and 2) the importance of individual words in characterizing each topic, by means of a force-directed graph, which (let’s face it) is a bit of a bubbly mess.

To better understand the second point stated above, I decided to build a zoomable treemap. In it, each large box (distinguished from the surrounding boxes by a label and a distinct color) represents a topic, i.e. a set of words that are somehow related and occur in the same context(s). When you click on a label, the map zooms into that topic and presents its ten most relevant words. For example, by clicking on “Coldness”, you’ll see the top 10 terms that compose it (“ice”, “frost”, “snow” and so on). The area of each word is proportional to its importance in characterizing the topic: in our “Coldness” example, “cold” occupies a larger area than the rest, being the most relevant word in this context.

Similarly, the total area of each topic is proportional to its incidence in the black metal lyrics data set. For example, “Fire & Flames” has a larger area than “Mind & Reality” or “Universe & Cosmos”, making it more likely to occur when inferring the topics that characterize a song.
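To make the two proportionality rules concrete, here is a minimal Python sketch of how the data behind such a treemap could be laid out. It is not the code used for the map above: the function name `build_treemap_hierarchy`, the input layout and the numbers are all made up for illustration, and the nested `name`/`children`/`value` structure is just one common way of feeding hierarchical data to a treemap layout.

```python
def build_treemap_hierarchy(topics, top_n=10):
    """Turn topic data into a nested structure suitable for a treemap.

    `topics` is a hypothetical input of the form
    {label: {"weight": topic_incidence, "words": {word: importance}}}.
    """
    children = []
    for label, info in topics.items():
        # Keep only the most relevant words, most important first.
        top_words = sorted(info["words"].items(),
                           key=lambda kv: kv[1], reverse=True)[:top_n]
        word_total = sum(weight for _, weight in top_words)
        children.append({
            "name": label,
            # Each word's value is its share of the topic's weight, so the
            # word areas add up to the topic area.
            "children": [
                {"name": word, "value": info["weight"] * weight / word_total}
                for word, weight in top_words
            ],
        })
    return {"name": "topics", "children": children}


# Made-up numbers, purely for illustration.
example = build_treemap_hierarchy({
    "Coldness": {"weight": 0.3, "words": {"cold": 3.0, "ice": 1.0}},
    "Fire & Flames": {"weight": 0.7, "words": {"fire": 1.0}},
})
```

Because the word values inside a topic sum to that topic's weight, a treemap layout that sizes boxes by `value` would give both properties at once: topic boxes sized by incidence, and word boxes sized by importance within the topic.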

By the way, these topic labels were chosen manually. Unfortunately I couldn’t devise an automated process that would do that for me (if anyone has an inkling of how to do this, let me know), so I had to pick meaningful and (I hope) reasonably representative titles for each set of words. In most cases, like the aforementioned “Coldness”, the concept behind the topic is evident. There are, however, a few cases where I had to be a bit more creative because the meaning of the topic is not so obvious (“Urban Horror” comes to mind).

There are also two topics which are quite generic, with terms that could occur in almost any context, so they’re simply labeled “Non-descriptive”.

As mentioned in part II of this post, one goal of this whole mess is to find out which lyrics “embody” a specific topic. Given that the topic model sees the lyrical content of a song as a mixture of topics, we’re interested in discovering lyrics that are composed solely (or almost entirely, let’s say more than 90%) of a single topic. Using the topic inference capabilities of the Stanford Topic Modeling Toolbox I did just that, selecting at least 3 representative lyrics for 14 of the topics above. They’re displayed in the collapsible tree below.
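The selection step can be sketched in a few lines of Python. This is only an illustration of the 90% rule, not the actual pipeline: it assumes the inferred topic proportions have already been loaded into a plain dictionary (the toolbox’s own output format is not reproduced here), and `representative_lyrics` is a hypothetical helper name.

```python
def representative_lyrics(doc_topics, threshold=0.9):
    """Group songs by the single topic that dominates them.

    `doc_topics` is a hypothetical input of the form
    {song_title: {topic_label: proportion}}, with proportions summing to 1.
    """
    by_topic = {}
    for title, mixture in doc_topics.items():
        # Find the topic with the largest share in this song.
        topic, share = max(mixture.items(), key=lambda kv: kv[1])
        # Keep the song only if that topic makes up more than `threshold`.
        if share > threshold:
            by_topic.setdefault(topic, []).append((share, title))
    # Within each topic, list the most "pure" songs first.
    return {
        topic: [title for _, title in sorted(songs, reverse=True)]
        for topic, songs in by_topic.items()
    }
```

Songs whose dominant topic falls at or below the threshold are simply left out, so a topic with no sufficiently "pure" lyrics will not appear in the result at all.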

For the most part the lyrics seem to have a high degree of correlation with the topic assigned to them: for instance, Immortal’s “Mountains of Might” fits the “Coldness” topic fairly well (surprise, surprise…) and Vondur’s cover of an Elvis Presley song obviously falls into the heart stuff category. But there is one intriguing result: after reading Woods of Infinity’s “A Love Story”, I was expecting it to have the “Dreams & Stuff from the Heart” topic assigned to it. It falls in the “Fucking” topic instead, so maybe the algorithm detected something (creepy) between the lines.