(Mental) maps of texts

Earlier in the summer, I was thinking about the way that words distribute inside of long texts – the way they slosh around, ebb and flow, clump together in some parts but not others. Some words don’t really do this at all – they’re spaced evenly throughout the document, and their distribution doesn’t say much about the overall structure of the text. This is certainly true for stopwords like “the” or “an,” but it’s also true for words that carry more semantic information but aren’t really associated with any particular content matter. For example, think of words like “completely” or “later” – they’re generic terms, free-agents that could be used in almost any context.

Other words, though, have a really strong semantic focus – they occur unevenly, and they tend to hang together with other words that orbit around a shared topic. For example, think of a long novel like War and Peace, which contains dozens of different conceptual threads. There are battles, dances, hunts, meals, duels, salons, parlors – and, in the broadest sense, the “war” sections and the “peace” sections. Some words are really closely associated with some of these topics but not others. If you open to a random page and see words like “Natasha,” “Sonya,” “mother,” “love,” or “tender,” it’s a pretty good bet that you’re in a peace-y section. But if you see words like “Napoleon,” “war,” “military,” “general,” or “order,” it’s probably a war section. Or, at a more granular level, if you see words like “historian” or “clock” or “inevitable,” there’s a good chance it’s one of those pesky historiographic essays.

To borrow Franco Moretti’s term, I was looking for a way to operationalize these distributions – some kind of lightweight, flexible statistic that would capture the structure of the locations of a term inside a document, ideally in a way that would make it easy to compare it with the locations of other words. I started poking around, and quickly discovered that if you know anything about statistics (I really don’t, so take all of this with a grain of salt), there’s a really simple and obvious way to do this – a kernel density estimate, which takes a collection of observed data points and works backward to approximate a probability density function that, if you sampled it the same number of times, would produce more or less the same set of data.

Kernel density estimation (KDE) is really easy to reason about – unlike the math behind something like topic modeling, which gets complicated pretty fast, KDE is basically just simple arithmetic. Think of the text as a big X-axis, where each integer corresponds to a word position in the text – “novel time,” as Matt Jockers calls it. So, for War and Peace, the text would stretch from the origin to the X-axis offset of 573,064, the number of words in the text. Then, any word can be plotted just by laying down ticks on the X-axis at each location where the word shows up in the document. For example, here’s “horse” in War and Peace:

An obvious first step is to create a simple histogram:

A kernel density estimate is the same idea, except, instead of just chopping the X-axis up into a set of bins and counting the points, each point is represented as a “kernel” function. A kernel is just some kind of weighting function that models a decay in intensity around the point. At the very simplest, it could be something like the uniform kernel, which just converts the point into a rectangular region over the X-axis, but most applications use something smoother like the Epanechnikov or Gaussian functions. The important thing, though, is that the kernel transforms the point into a range or interval of significance, instead of just a one-dimensional dot. This is nice because it maps well onto basic intuitions about the “scope” of a word in a text. When you come across a word, where exactly does it have significance? Definitely right there, where it appears, but not just there – it also makes sense to think of a kind of “relevance” or “meaning energy” that dissipates around the word, slowly at first across the immediately surrounding words and then more quickly as the distance increases.

Anyway, once all of the kernels are in place, estimating the density function is just a matter of stepping through each position on the X-axis and adding up the values of all the kernel functions at that particular location. This gives a composite curve that captures the overall distribution of the term. Here’s “horse” again:
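The summing-up step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the plots in this post: the function names, the Gaussian kernel, the bandwidth value, and the choice to evaluate the curve on an evenly spaced grid (rather than at every single word position) are all assumptions made for the example.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: the weight given to a point u bandwidths away."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def term_density(offsets, text_length, bandwidth, samples=100):
    """Estimate where a term is concentrated along the text.

    offsets: word positions at which the term occurs (hypothetical input).
    Returns (xs, density), where density is rescaled to sum to 1 over the grid.
    """
    if not offsets:
        raise ValueError("the term never occurs in the text")
    step = text_length / samples
    xs = [(i + 0.5) * step for i in range(samples)]
    # At each sample position, add up the kernels centered on every occurrence.
    raw = [sum(gaussian_kernel((x - o) / bandwidth) for o in offsets) for x in xs]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("bandwidth is too small for this sampling grid")
    return xs, [value / total for value in raw]


# Toy example: a term that occurs three times near the start of a 1,000-word text.
xs, density = term_density([100, 120, 130], text_length=1000, bandwidth=50)
peak = xs[density.index(max(density))]
```

Rescaling the sampled values so that they sum to 1 is a convenience: it keeps curves for frequent and infrequent words on the same footing, which matters as soon as two words are compared.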

This makes it possible to visually confirm the earlier intuitions about the groups of words that tend to hang together in the text. Here’s the peace-y cluster from above:

And the war-y cluster:

And all together, which shakes out the contours of the two general categories. When one goes up, the other goes down:

“More like this”

These are fun to look at, but the real payoff is that this makes it easy to compute a really precise, fine-grained similarity score that measures the extent to which any two words appear in the same locations in the text. Since the end result is always just a regular density function, we can make use of any of the dozens of statistical tests that measure the closeness of two distributions (see this paper for a really good survey of the options). One of the simplest and most efficient ways to do this is just to measure the size of the geometric overlap between the two distributions. This gives a score between 0 and 1, where 0 would mean that the two words appear in completely different parts of the text, and 1 would mean that the words appear in exactly the same places (i.e., they’re the same word). For example, how similar is “horse” to “rode”?

Very close – their density functions have about an 80% overlap, which puts “rode” just a bit closer than “galloped,” which weighs in at ~0.78:

Or, at the opposite end of the spectrum, words that show up in very different parts of the document will have much less overlap, and the score will edge towards 0. For example, battles and dances don’t have much to do with each other:
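As a rough sketch of the overlap score: if two density curves are sampled on the same grid and each is rescaled to sum to 1, the shared area is just the sum of the pointwise minimums. The function name and the three toy distributions below are made up for illustration and are not taken from War and Peace.

```python
def overlap(p, q):
    """Shared area under two discretized densities sampled on the same grid.

    Both inputs are assumed to be non-negative and to sum to 1.
    """
    if len(p) != len(q):
        raise ValueError("densities must be sampled on the same grid")
    return sum(min(a, b) for a, b in zip(p, q))


# Made-up densities over four equal segments of a text.
first = [0.5, 0.3, 0.2, 0.0]
second = [0.4, 0.4, 0.1, 0.1]   # concentrated in roughly the same places as `first`
third = [0.0, 0.0, 0.2, 0.8]    # concentrated at the other end of the text

close_score = overlap(first, second)
far_score = overlap(first, third)
self_score = overlap(first, first)
```

For inputs that each sum to 1, this score is the same thing as one minus the Bray-Curtis dissimilarity between the two curves.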

This points to an interesting next step – for any given word, you can compute its similarity score with every other word in the text, and then sort the results in descending order to create a kind of “more-like-this” list. For example, here are the twenty words that distribute most closely with “Napoleon,” all clearly related to war, conquest, power, etc:

Or, at the other end of the spectrum, “Natasha” sits atop a stack of very Natasha-esque words related to family, emotion, youth, and general peace-time happiness (with the exception of “sad,” which, presumably, reflects the unhappy endings with Anatole and Andrei):

By skimming off the strongest links at the top of the stack, you end up with a custom little “distribution topic” for the word, a community of siblings that intuitively hang together. It’s sort of like a really simple, “intra-document” form of topic modeling.
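The “more-like-this” list is just a sort over pairwise scores. Here is a minimal sketch, assuming each term’s density has already been sampled on a shared grid and rescaled to sum to 1; the `more_like_this` name, the term labels, and the numbers are all hypothetical.

```python
def overlap(p, q):
    """Shared area under two discretized densities that each sum to 1."""
    return sum(min(a, b) for a, b in zip(p, q))


def more_like_this(anchor, densities, top_n=20):
    """Rank every other term by its overlap with `anchor`, strongest first."""
    base = densities[anchor]
    scores = [
        (term, overlap(base, density))
        for term, density in densities.items()
        if term != anchor
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Made-up densities for four placeholder terms.
densities = {
    "a": [0.5, 0.5, 0.0, 0.0],
    "b": [0.4, 0.4, 0.1, 0.1],
    "c": [0.0, 0.0, 0.5, 0.5],
    "d": [0.25, 0.25, 0.25, 0.25],
}
ranked = more_like_this("a", densities)
ranked_terms = [term for term, _ in ranked]
```

Note that the evenly spread term "d" lands in the middle of the list: it overlaps moderately with everything, which is the same behavior that makes generic words act as binding agents later on.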

Twisty little passages

The cool thing about this, though, is that it makes it possible to traverse the internal topic structure of the document, instead of just sliding back and forth on the linear axis of words. For example, once you’ve computed the sibling community for “napoleon,” you can then do the same thing for any of the other words in the stack. If you take the second word, for example – “war” – and compute its sibling community, you’ll see many of the same words again. But, since the distribution of “war” is a bit different, other terms will start to creep into view. Each time you do this, the semantic field will shift to center most closely on the anchoring word at the top of the stack. Over time, you start to traverse into completely different domains of meaning. Each sibling community is like a room in a maze, and each of the words is like a door that leads into an adjacent room that occupies a similar but slightly different place in the overall organization of the document.

This fascinates me because it de-linearizes the text – which, I think, is closer to the form it takes when it’s staged in the mind of a reader. Texts are one-dimensional lines, but we don’t really think of texts as lines – or at least not just as lines. We think of them as landscapes, diagrams, networks, maps – clusters of characters, scenes, ideas, emotional valences, and color palettes, all set in relation to one another and wired up in lots of different ways. Notions of “proximity” or “closeness” become divorced from the literal, X-axis positions of things in the document. In War and Peace, for example, I think of the battles at Borodino and Austerlitz as being very “close” to one another, in the sense that they’re the two major military set pieces in the plot. In fact, though, they’re actually very “distant” in terms of where they actually appear in the text – they’re separated by about 300,000 words, and their density functions only have an overlap of ~0.32, meaning, essentially, that they don’t overlap with each other about 70% of the time:

So, how to operationalize that “conceptual” closeness? It turns out that this can be captured really easily just by building out a comprehensive network that traces out all of the connections between all the words at once. The basic idea here – converting a text into a network – is an old one. Lots of projects have experimented with representing a text as a social network, a set of relationships between characters who speak to one another or appear together in the same sections of the text. And lots of other projects have looked into different ways of representing all the terms in a text, like I’m doing here. Back in 2011, a really interesting project called TexTexture devised a method for visualizing the relationships between words that appear within a 2- or 5-word radius in the document. As I’ll show in a moment, though, I think there are some interesting advantages to using the density functions as the underlying statistic – the distributions tease out a kind of architectural “blueprint” of the document, which often maps onto the cognitive experience of the text in interesting ways.

Anyway, once we’ve laid down all the piping to compute the little distribution topic for a word, the last step is just to do this for all of the words, and then shovel the strongest connections into the network. For example, if we take the top 10 strongest links, the “napoleon” topic would result in these edges:

Once this is in place, we get access to the whole scientific literature of graph-theoretic concepts, and the conceptual relationship between “austerlitz” and “borodino” falls out really easily – we can use Dijkstra’s algorithm to get the shortest path between the two, which, unsurprisingly, makes just a single hop through the word “battle”:

'austerlitz' -> 'battle' -> 'borodino'

With a path length of ~1.12, which puts “borodino” as the 17th closest word to “austerlitz” out of the 1000 most frequent words in the text, closer than 98% of the list, even though they only co-occur about 30% of the time:
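The network-building and shortest-path steps can be sketched with nothing but the standard library. Two things here are assumptions rather than details from this post: the way a similarity score is turned into an edge length (taken to be 1 minus the score), and the helper names `build_graph` and `shortest_path`. The terms and scores in the example are placeholders.

```python
import heapq


def build_graph(similar, top_n=10):
    """Turn per-term ranked lists into an undirected weighted graph.

    similar: {term: [(other_term, score), ...]}, strongest links first.
    Edge length is 1 - score, which is an assumption of this sketch.
    """
    graph = {}
    for term, ranked in similar.items():
        graph.setdefault(term, {})
        for other, score in ranked[:top_n]:
            distance = 1.0 - score
            graph[term][other] = distance
            graph.setdefault(other, {})[term] = distance
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns (path, length), or (None, inf) if unreachable."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if target not in best:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, best[target]


# Made-up scores: "x" and "y" barely overlap directly, but both overlap with "hub".
similar = {
    "x": [("hub", 0.6), ("y", 0.1)],
    "y": [("hub", 0.6), ("x", 0.1)],
}
graph = build_graph(similar)
path, length = shortest_path(graph, "x", "y")
```

In the toy graph, the direct edge between "x" and "y" has length 0.9, so the two-hop route through "hub" (0.4 + 0.4) wins, which mirrors the single hop through a shared neighbor described above.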

Mapping the maze

This is useful as a confirmation that the network is capturing something real about the text. But it’s sort of like stumbling through one little passage in the labyrinth with a torch, tracing out a single thread of connection in the document. What you really want is to be able to zoom back and see a bird’s-eye view of the entire thing at once, to wrap your head around the complete set of relations that bind all of the words together. This is a perfect job for any of the off-the-shelf network layout algorithms, which treat all of the nodes as “particles” that repel one another by default, but which are bound together by a set of attractive forces exerted by the edges that connect them. Force Atlas 2 in Gephi works well – War and Peace unfolds into a huge, spindly triangle:

War to the left, peace to the right, and history on top, between the two. Of course, the “on top” has no meaning in and of itself, since the orientation of the layout is random – here and elsewhere, I’ve rotated the final render to make it easy on the eyes. What does have meaning, though, is the relative position of the words, the relationships between the regions – that history is “between” war and peace, in this case.

This makes it possible to position different elements of text as they relate to the high-level categories – kind of like a big, nerdy, literary “where’s Waldo.” For example, look at the huge amount of space between “Napoleon” and “Bonaparte,” which I would have expected to hang together pretty closely. “Napoleon” sits along the top left shoulder of the triangle, along the gradient between “battle” and “history,” in the middle of a section related to military strategy and tactics (“military,” “plan,” “campaign,” “men,” “group”). Whereas “Bonaparte” is way down at the bottom of the triangle, almost exactly in the middle of the gradient running between war and peace, just shy of a cluster of words related to the aristocratic salon (“Anna,” “Pavlovna,” “sitting,” “laughing”) and right next to “company,” which has the perfect polysemy to bind the two sides together – social company to the right, and the military company to the left. The two names enact different roles in the text – “Napoleon” is the man himself, winning battles and participating in the abstract notion of history, and “Bonaparte” is the Russian imagination of the man, a name whispered at parties in Moscow and St. Petersburg. Pierre, meanwhile, shows up near the connection point with the history cluster, surrounded by words of spiritual anxiety and questing – “doubt,” “soul,” “time,” “considered.” Anatole is in the furthest reaches of the peace section, right next to “visitors” (he was one) and “daughters” (he eloped with one). Rostov and Andrei (Andrew, in the translation) are at the bottom center, right near “Bonaparte” in the bridge between war and peace. The women and children, meanwhile, are almost completely confined to the peace cluster – Natasha, Marya, Sonya, Anna, Helene, along with basically all words about or related to women – “lady,” “girl,” “mother,” “countess,” “daughter,” etc. Women essentially instantiate peace, and have very little interaction with history or war – it’s almost as much War and Women as War and Peace.

Also, take a look at the gradients that run between the conceptual extremes – the means by which the different sections transmute into one another. For example, look again at the bottom center of the network, near “Bonaparte,” right where war crosses over into peace. How is that transition actually accomplished? If you look closely, there’s a cluster of terms right between the two related to the body and physical contact – “lips,” “hand,” “fingers,” “touched,” “eyes,” “face,” “shoulders,” “arm,” “foot,” “legs.” Which, it seems, are used to describe both the physicality of military life and the niceties of Russian high society – the embraces, clasps, arms over shoulders, pats on backs, etc. War becomes peace by way of the body, which is subject both to the violence of war and the sensuality of peace. Or, more broadly, look at the left and right sides of the triangle, the gradients running from war to history on the left and peace to history on the right. In both cases, these are also gradients from concrete to general, specific to abstract. On the right side, the individual women and children that represent the furthest extreme of the peace corner give way to a cluster of terms about family in general – “children,” “wife,” “husband,” “family” – before rising up into the history cluster by way of “life” and “live.” On the left side, terms related to the specifics of battle – “guns,” “flank,” “line,” “borodino,” “battle” – give way to Napoleon’s cluster of words related to strategy and tactics – “plan,” “military,” “campaign,” “strength,” “number” – which then join the history section by way of “direction.” It’s a big diagram of the idea of the text.

The world of Concord is at the bottom – “civilization,” “enterprise,” “comforts,” “luxury,” “dollars,” “fashion.” As you move up, this gives way to Thoreau’s narrative about his attempt to build his own, simplified version of this world – “roof,” “built,” “dwelling,” “simple.” Which in turn bleeds into the world of his day-to-day existence at Walden, anchored around the word “day” – “hoeing” the field, “planting beans,” “singing” to himself, “sitting”, “thinking.” Then the network crosses over completely into the world of the pond – “water,” “surface,” “depth,” “waves,” and “walden.” Remarkably, at the very top of the network, along with “lake” and “shore,” is “boat,” which is eerily similar to the “raft” on top of the Odyssey – the most extreme removal from human civilization, the smallest outpost of habitable space. Both enact the same opposition – between a world of men on land, and a world of solitude out in the midst of some kind of watery wilderness.

The Divine Comedy looks almost exactly like Walden, except Concord/Walden is replaced with hell / heaven, with, fittingly enough, “christ” perched on top of the whole thing:

It’s kind of like reading literary x-rays (hopefully not tea leaves). Here’s Notes from Underground, which, like the text, splits along the center into two sections – the existentialist rant of “Underground” on the left, the adventures with Zverkov and Liza from “Apropos of the Wet Snow” on the right:

Here’s the Origin of Species, which I’ve only read in small parts. But, it’s actually interesting to do this with a text that you don’t know, and see what you can infer about the organization of the document. Origin of Species gives a lot of structure to chew on:

Failures, null results

The big weakness with this, of course, is that it doesn’t work nearly as well with texts that don’t naturally split up into these kinds of cleanly-defined sections. For example, Leaves of Grass:

It’s more scrambled, less differentiated, less obviously “accurate” than the tidy triangle of War and Peace or the cosmological pillar of the Divine Comedy. If you squint at it for a few minutes, it starts to assemble into some recognizable constellations of meaning, but it’s much more of an interpretive exertion to make a case for how the lines should be drawn. Two regions of meaning are fairly clear – on top, a section about war (“soldiers,” “battle,” “camp,” “armies,” “war”), and, at the bottom left, a big, diffuse shoulder of the network related to the body, sensuality, sexuality – “neck,” “fingers,” “limbs,” “flesh,” “kiss,” “touch,” “hand,” “happy.” The right side of the network doesn’t hold together as well, but, if this post weren’t already much too long, I’d argue that lots of things on the right side converge on a shared preoccupation about time – “eidolons,” from the inscription of the same name about how the actions and accomplishments of people are ground into shadows over time; “pioneers,” from “Pioneers! O Pioneers,” one of the triumphalist narratives about the inevitability of American expansion in the west; and a cluster of terms related to geopolitics and deep time – “universe,” “nation,” “modern,” “centuries,” “globe,” “liberty,” “kings,” “America,” “mighty.” This is Whitman looking back at Europe and forward to what he sees as an American future, both in a political and cultural sense but also in terms of his own relationship, as a poet, to literary and intellectual tradition. It’s Whitman thinking about how things change over time. (If you buy this, the war/body/time triad starts to look interestingly similar to war/peace/history).

But, this is much more of a stretch – it’s muddled, less legible. In one way, this probably just reflects something true about Leaves of Grass – it’s more finely chopped, more heterogeneous, more evenly mixed than something like War and Peace. But I think this is also exposing a weakness in the technique – my intuition is that there are, actually, some really distinct topic clusters that should be surfaced out of Leaves of Grass, and I wonder if they’re getting buried by the simplistic way that I’m picking the words that get included in the network. Right now, I just take the top X most frequent words (excluding stopwords), and compute the relations among just those words. The problem with this, I think, is that it doesn’t do anything to filter out words that are very evenly distributed – words that aren’t “typical” of any particular topic. Which, since they’re similar to everything, act like binding agents that lock down the network and prevent it from differentiating into a more useful map of the document. This happens to a lesser degree in all of the networks, which tend to have a big clump of words in the center that don’t really get pulled out towards the more conceptually focused regions at the edges. Or, to borrow again from the terminology of topic modeling, I wonder if there’s a way to automatically pick the words that anchor the most “interpretable” or “coherent” distribution topics – the terms that serve as the most reliable markers for whether or not a given topic is “active” at some given point in the text. In War and Peace, for example, “battle” should score very highly, since it’s way off at the edge of the war region, but words like “make” or “began” should get low scores, since they end up right in the middle of the network and don’t glom onto any particular thread of meaning in the text. You want the “clumpy” words – terms that appear very frequently, but also very unevenly.

Anyway, this was fun, but I’m still just stabbing around in the dark, for the most part. The visualizations are interesting as a type of critical “deformance,” to borrow a word from Jerry McGann, Lisa Samuels, and Steve Ramsay – derivative texts, the products of an algorithmic sieve that can confirm or challenge intuitions about the originals. In the long run, though, I’m actually more interested in the question of whether this kind of network information could be tapped to get a more general understanding of the “shapes” of texts in the aggregate, at the scale of hundreds or thousands of documents instead of just a handful. Would it be possible to move beyond the visual close reading, fun as it might be, and find a way to classify and compare the networks, in the same way that something like the Bray-Curtis dissimilarity makes it possible to operationalize the distributions of the individual words?

For example, the Divine Comedy and Walden just look almost identical to the eye – but how to capture that quantitatively? What exactly is the underlying similarity? Is it something real, or is it just a coincidence? Could it be boiled out as some kind of portable, lightweight, network-theoretical measurement? Maybe some notion of the “width” or “breadth” of the text – the maximum distance between nodes, the extent to which the text traverses across a semantic space without looping back on itself? If this is computable – what other texts are “wide” or “long” in this way? Do they cohere at a literary level? Do they all peddle in conceptual opposites, like the Divine Comedy and Walden – heaven/hell, nature/wilderness, land/water? Maybe they’re all “travelogues,” in a conceptual sense – narratives about a continuous, gradual movement from one pole to another, which prevents the network from folding back on itself and wiring up a shorter distance between the most removed terms? What other new taxonomies and measurements of “shape” might be possible? Maybe some measure of “modularity” or “clumpiness,” the extent to which a text separates out into discrete, self-contained little circuits of meaning? How many shapes are there? Does every text have a unique shape, or are they all variations on a small set of archetypes (the line, the triangle, the loop, etc.)? If patterns do exist, and if they could be quantified – (how) do they correlate with author, plot type, genre, period, region, nationality, etc.?

This is very cool; thanks for doing it and writing it up. I share your sense that it would be great to have a way in which to compare a whole bunch of these maps to suss out something like their range of forms. Seems as though there must be a graph-theoretic approach to the problem (or maybe a bunch of them), but that’s pretty far afield for me. Sort of related: Have you seen Simon Pröll’s piece in LLC last year on clustering maps by formal similarity? [http://llc.oxfordjournals.org/content/28/1/108] Maybe applicable?

dclure

Hey Matthew,

Very interesting, thanks for that link! I bet it would be possible to take the set of coordinate points produced by the layout algorithm and do a 2D kernel density estimation [1] to get a kind of abstract profile of the network, and then use something like the technique described in that paper to compute the similarity between networks. But, possibly a showstopper – I can’t really think of a way that you could standardize the orientation of the networks, which is random (unlike spatial data, where everything is anchored onto a fixed grid). For instance, the two long-and-skinny texts in this post (Walden and the Divine Comedy) came out of Gephi at totally different orientations, and I had to manually rotate them to get the conceptual poles to line up. It almost becomes a computer vision problem – do these two shapes _look_ the same? And if so, how to rotate them so that they’re most closely aligned? I know nothing about this, but I wonder if something like OpenCV [2] would have something useful.

Since infrequency is a (coarse) measure of salience, or frequency in a document as compared to in a document collection (TF-IDF), how about filtering words above some frequency threshold for the work/document, or maybe below some ratio of doc frequency to a corpus of the English language (http://corpus.byu.edu/). Sort of TF-IDF but with the collection all of English. Or build a collection of a few contemporary works…thinking out loud here. Nice work – love the idea of document shapes, though not sure what one does with them yet. Great start!