
(Note: the above shot is an update; see Kristen Long’s comment below.)

You’ve seen tableau one, a word (or tag) cloud, before: a now-standard, next-big-thing journo tool, this one floated into the firmament by Kristen Long of the Politico web site.

Tableau two scrapes the same data – the text of President Obama’s September 8 jobs bill exhortation – but here makes primitive resort to your correspondent’s key-word spreadsheet (which I continue to refine), introduced in the previous post. The obvious question, then: how do the two modes of presentation compare?

Sure – you’ll accuse me of irredeemable, self-leaning bias, and I’ll have to take that hit; and while I’m at it, I’ll add another, incriminating mea culpa: I don’t think well visually, either. Ok, then; that means you’ll have to humour, or even better, patronize, me as we move along. But hey – it’s my blog.

But enough about me. We know perfectly well what it is the cloud means to tell us, but:

With what precision is relative word incidence conveyed? It’s clear the cloud invokes two presentational parameters – word size and shading – in the service of the comparisons. Are we entitled to assume as a result that sizing is properly proportioned, i.e., does a word appearing 12 times in the speech grow twice as large as a word appearing six? And how do the blacks and grays comport with one another? I see no apparent rule at work for daubing this or that hue across this or that word. Perhaps there is no rule, and the colors merely subserve some fetching aesthetic variety. But is this something about which we should be wondering?
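To make the proportionality question concrete, here is a minimal sketch in Python of two sizing rules a cloud could follow. Both are assumptions for illustration only; I don’t know which rule, if either, the Politico cloud applies, and the function names and the 72-point ceiling are hypothetical.

```python
import math

# Two hypothetical sizing rules; neither is known to be the one
# used for the cloud discussed in this post.

def font_size_linear(count, max_count, max_size=72.0):
    """Font height grows in direct proportion to the word count."""
    return max_size * count / max_count

def font_size_area(count, max_count, max_size=72.0):
    """Font height grows with the square root of the count, so the
    word's rough area (height times width) tracks the count instead."""
    return max_size * math.sqrt(count / max_count)

# Compare a word used 6 times with one used 12 times.
for count in (6, 12):
    linear = font_size_linear(count, 12)
    area = font_size_area(count, 12)
    print(count, round(linear, 1), round(area, 1))
```

Under the linear rule the six-count word stands half as tall as the twelve-count word (36 points against 72); under the area rule it stands about 51 points tall. A reader can’t tell from the image alone which convention is in play, and that is the imprecision at issue.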

More subtly perhaps, we could ask as well about word positioning. It’s clear that the words above are not made to stack in upright, size-hierarchical relation, but if not that, then what? If the design intent is merely random, ok – but what then do we make of that presentational gestalt, and its news-informational value?

The word cloud conceit surmises a way to look at categorical data – essentially qualitative information, of the classic gender-religion-ethnicity stripe. These data can’t be directly scaled – maleness can’t be “more” of anything than femaleness – though of course they can be counted. Here the data are words, and because they aren’t mapped to either of the great existential properties of space or time – that is, they aren’t forced into place – the words can prance anywhere across the viz, owing to designer discretion. By way of counter-example: a data viz of crimes mapped to latitudes and longitudes necessarily slots the crimes into the coordinates at which they’ve been perpetrated. Word clouds incur no such compulsion, thus vesting placement decisions in the designer alone. Where, then, do the words go? The point is that they can go anywhere – and that is the point of the question.

My prosaic little spreadsheet-driven word count, on the other hand, does just that – it counts word frequencies rather determinately, and sorts them too. Pushing the matter to the retro breaking point, you could even do this:

Now that’s not cool. But what does the reader want and need to know?
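For readers who would rather script it than spreadsheet it, the count-and-sort idea can be sketched in a few lines of Python. This is not my spreadsheet’s actual routine, just an illustration under simple assumptions: the helper name `word_counts` is hypothetical, and the sample sentence is invented for the demonstration, not taken from the speech.

```python
import re
from collections import Counter

def word_counts(text):
    """Return (word, count) pairs, most frequent first."""
    # Lower-case the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

# Invented sample text, not a quotation from the speech.
sample = "jobs now, jobs here, pass this jobs bill now"

for word, n in word_counts(sample):
    print(f"{word:<6}{n}")
```

The output is a plain two-column list headed by "jobs" at 3 and "now" at 2: no shading, no placement decisions, and nothing left to the designer’s discretion.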

There’s something curiously interstitial about word clouds. Driven by a thrall with word frequencies – typically the province of academics and game designers – the clouds yet give pride of place to the imagery festooning the data. We’re given a beautiful arrangement of nuts and bolts – but for whom?

Do a Google Images search for word clouds and you’ll see what I mean. Or wend your way to any of the word cloud-construction sites – e.g., wordle, tagxedo, worditout – and check out the possibilities. It’s fun to make word clouds, and there’s nothing wrong with that – but what are journalists doing with them?

True – Kristen Long’s cloud is a good deal more sober than those espoused by the novelty sites, and it delivers its macro point, to be sure. But don’t the basic informational questions remain there to be asked? (Also look at Wikipedia’s examples in its entry on clouds, including a nice key-word comparison between George Bush’s 2002 State of the Union speech and Barack Obama’s 2011 oration, the words sorted in alphabetical order.) How do clouds contend with the demands for reportorial precision?

Needless to say, all of the above points to the much larger question about the visual portrayal of data that might be otherwise delivered through more plebeian means, e.g., spreadsheets. Sorry – I can’t answer the question here, only ask it. But are journalists – properly seeding the clouds?

4 Responses to “Cloudspotting: A Word about Word Clouds”

Hi, I agree with everything you’ve said here re the imprecision of word clouds, but there is something deeper that is wrong with them. Word clouds cannot convey meaning. They are certainly suggestive, but stripped of the context which gives words their semantic content, we are forced to provide our own... by guessing! It should be obvious, but it’s very rarely said: if you want to understand a text, READ IT, don’t count the words! I believe that the necessity for understanding large datasets by aggregating and categorising has blinded people to the fact that natural language texts are not large datasets. Their meaning is intrinsic and a wordcloud is as emotive as it is non-representative. Sorry, rant over, but word clouds can ONLY be decorative, and are ALWAYS obscuring the information, not revealing it.

Hi and thanks for your note. No real argument with anything you’ve said, other than to suggest that given the nature of a speech or text excerpt one could perhaps infer semantic intent, more or less. Word frequencies are also a staple of literary analysis, but your point would apply there, too.

Hey, Kristen Long here, the creator of the chart and former POLITICO information designer. I agree with you about the lack of depth in data, but at the time I was also working on a tight deadline between getting the speech from the White House and sending the page to print. I might’ve done it differently with more time. The point, really, was to provide that macro view of the speech and fill some space on that page in the paper because we didn’t have that many good photos either.

Hi and many thanks for your note, and the explanation. How’d you find out about the blog? True – the “job” “jobs” conflation can be troublesome, though I think I disambiguated them properly in my spreadsheet breakout. You’re quite right about “middle class”, which I now count at 4 instances. The problem is that my search routine rids the text of non alpha/numeric characters, and so I ended up with “middleclass” in the searchable text, and didn’t think to look for that. Note too that MS Word’s word count deems “middle-class” one word.
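The “middleclass” slip is easy to reproduce. Here is a minimal Python sketch, assuming a cleanup step that simply deletes every non-alphanumeric character, set beside an alternative that swaps such characters for a space. Both function names are hypothetical, and the first only approximates what my spreadsheet routine does.

```python
import re

def strip_non_alnum(text):
    """Delete everything except letters, digits and spaces
    (an assumed approximation of the routine described above)."""
    return re.sub(r"[^A-Za-z0-9 ]", "", text)

def punctuation_to_space(text):
    """Replace each run of non-alphanumeric characters with one space,
    so that hyphenated pairs stay two separate words."""
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip()

phrase = "middle-class families"
print(strip_non_alnum(phrase))       # middleclass families
print(punctuation_to_space(phrase))  # middle class families
```

The second approach keeps “middle” and “class” findable as a two-word phrase, though it would also split any hyphenated term that ought to be counted as a single word.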