The Segway of Digital Searching

It could be a failure of my imagination, but I’m having a very hard time seeing what would be useful about Google’s new Ngram Viewer. It’s being presented in the Times as some sort of breakthrough, and I suppose it is–it’s never been possible to do this before. But then it was never possible to ride around on a two-wheeled, upright, self-propelled conveyance before either, and that didn’t turn out to be as earthshaking as proclaimed.

So I’m a historian–I go and enter the word “slavery,” and waddya know: they used it a lot in the 19th century, and it peaked in 1860. Intriguing! Why would that be, I wonder?[1. This is snark: it is of course right around the year the Civil War broke out.] This would seem to be an instance of using elaborate methods to establish the already known.

It’s interesting to know that slavery peaks in 1860, makes a mild comeback, then declines till the 70s, when it bumps up again. I’m guessing this is the effect of the TV show Roots. Which is another problem with the data–you don’t know the relative frequency of the word, because you don’t know how many books were published in each era, or what might have inspired sudden increased usage.

So let’s try word pairings: “slavery” and “freedom.” Wow, Freedom really experienced a bull market in the late 40s, and practically ran off the charts at the same time as the Roots boomlet, then it declined.

Yawn. The really interesting thing isn’t how often the word appears, it’s what it’s made to mean in the text–1,000 monkeys mindlessly typing the word freedom would not elucidate the word’s meaning, but they would show you a spike in the word’s usage.

With a click you can see that “women,” in comparison with “men,” is rarely mentioned until the early 1970s, when feminism gained a foothold. The lines eventually cross paths about 1986.

Are we really to believe that feminism drove the usage of the word “women”? I mean, it’s just as likely to be the opposite: medical discourse on women and hysteria; conservative arguments about women’s naturally blah blah etc. nature; sentimental novels designed to appeal to women readers but bearing no relation whatever to feminism as Cohen means it–it tells you almost nothing to know that the word increases in frequency.

Can I prove that? Let’s see–I’ll try searching for “women” and “hysteria.” Hysteria is a long flat line, way down there at the bottom, while “women” moves around like stock prices. There must be no connection between women and hysteria.

Ok, so we just missed the enormous psychosexual literature about women and neurasthenia, the anxiety about education and hysteria; we entirely missed the work of Sigmund Freud (probably a good thing); we don’t understand that such a thing as “The Yellow Wallpaper” exists, or the phenomenon it documented.

In this sense it’s worse than useless: it allows Cohen, a smart woman working under deadline pressure, to confirm a really vapid cliche.

It’s not hard to make something like this massively more useful for humanist types: word pairings within the same text. How many times do “women” and “hysteria” appear in a single text? Or proximity searching–find “women” near “hysteria.” Or show me what words “women” is most often paired with–that would be extremely useful.
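That kind of proximity search is not exotic: as a minimal sketch (in Python, over an invented toy sentence rather than any real corpus), tallying the collocates of a target word within a small window takes only a few lines:

```python
from collections import Counter

def collocates(tokens, target, window=4):
    """Count words appearing within `window` tokens of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), i + window + 1
            counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

# Toy text, invented for illustration only.
text = ("the doctor diagnosed the women with hysteria and "
        "prescribed rest for the women suffering hysteria").split()
print(collocates(text, "women").most_common(3))
```

Run over a real full-text corpus instead of one sentence, the same tally would show directly how often “women” keeps company with “hysteria”–exactly what the frequency chart cannot.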

But if you enter word pairs in Ngram Viewer it actually gives you the wrong impression–it gives you separate lines which only intersect in frequency. In that sense it does more harm than good–it reinforces some kind of odd idea that words bear no relation to other words. (While I don’t know for a fact that there are linguists who believe that, I’d risk a significant sum betting that there are.)

And as Dan Cohen points out, we need to get to the texts right away–from the point of view of a historian, as a tool for getting to actual texts, Ngram Viewer is just a distraction. It’s no better than just entering “women” in the Google Books search box with a date range specified.

It’s not really all that useful to learn that the word “gangster” shot up in the 1920s and peaked in 1940: I learned that from Bugs Bunny cartoons.

If this post has an irked tone it’s because so much money and hype and attention has been paid to something so obtuse.

If someone wants to show me what can be done with this tool, I’m eager to learn.

I want to defend the ngrams data and suggest that many of the things which are irking you are as much a function of the primitive search interface as they are of the data.

The reason to be excited is that, right now, so much data is available. What we can do with this stunningly enormous bag of glyphs (I would not even use the term “word” yet) is indeed primitive. But I do think it already opens itself to possibilities you haven’t fully allowed. Searching on big abstract nouns is, at best, like trying to read braille through a burlap sack. But searching on other terms is more productive.

No new knowledge or research agenda here; but I’d be loath to say that even this primitive visualization is worthless. (I would further qualify the results and insist that what is of interest is the trend; even the relative heights can be confounded by other people with the same surname or names with unusual diacritics.) I, like Dan Cohen and Benjamin Schmidt, was impressed by the way in which the Science paper attempted to examine censorship and suppression by comparing ngram curves in different languages across the same historical period.

Or consider the wonderful “beft/best” search, which someone much cleverer than I came up with. (Dan Cohen mentions this example with reference to Danny Sullivan’s post.) That one image confirms what we already know about book and typographical history. But it’s hard to imagine a more compelling visualization of this fact, isn’t it?

(I have turned off the smoothing in each case which, I think, prevents the ugliness of the underlying data from being prematurely obscured.)

My point? The data itself remains non-ideal, the OCR is, let’s be kind and say “imperfect.” The absence of any link back to the texts from which the ngrams are extracted hampers research. There are reasons to question the quality of the metadata which provides the dates (or the justification/reasoning about how to handle multiple editions of a single work, etc). The lack of any real sense of precisely what books are being searched is a bigger problem still. The absence of periodicals, newspapers, etc, is an enormous lack. (All these points are eloquently made by Mark Davies of the Corpus of Historical American English.) Heck, I have deeper reservations than these about this sort of quantification of culture.

But to judge this data based on its current state, and the currently usable interface is premature. The analogy is not to the Segway (we can do it, so what?), but to the first combustion engines (yeah it runs, but it doesn’t take us anywhere).

But the data itself is available. There is nothing but will and know-how (and the frighteningly large processing requirements) preventing someone from taking the 4-gram or 5-gram data and making it queryable in precisely the fashion you describe: show me most frequent collocates of “hysteria.” This seems worthy of (not uncritical) celebration and support.
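As a sketch of what that query might look like (Python, assuming the tab-separated layout of the released files–ngram, year, match_count, and so on–and a hypothetical shard filename; nothing here is part of any official tool), a crude collocate tally over a 5-gram file:

```python
from collections import Counter

def collocates_from_5grams(path, target, min_year=1800):
    """Tally words that share a 5-gram with `target`, weighted by the
    ngram's match_count. Assumes tab-separated rows of the form:
    ngram <TAB> year <TAB> match_count <TAB> ..."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            words = fields[0].lower().split()
            year, matches = int(fields[1]), int(fields[2])
            if year >= min_year and target in words:
                for w in words:
                    if w != target:
                        counts[w] += matches
    return counts

# Hypothetical usage against one downloaded shard:
# print(collocates_from_5grams("5gram-shard-0.tsv", "hysteria").most_common(20))
```

The “frighteningly large processing requirements” are real–the 5-gram files run to many gigabytes–but the logic itself is this simple.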

The “beft, best” search tells me nothing about typographical history: it merely tells me that Google can’t OCR some pre-1820 texts properly. Spellings such as “beft” did not exist until relatively modern times, after the tall lower-case “s” (extending full height and below the baseline) fell out of use, and eventually people became unable to recognise it as an “s” and mistook it for an “f” instead. (If you look at books using that form of “s,” you’ll note that “f” is quite a different form.)

This is similar to the way that we now see signs saying “Ye Olde Shoppe”; having lost the letter thorn, people now substitute an entirely different letter with a similar shape.

That the database won’t find words because it contains mis-OCR’d data is not encouraging.

I think the “beft,best” search does tell you something; it tells you when the shift occurred, so long as you know to look for it. The Wikipedia page for the long s has added a similar image (“last,laft”) to illustrate (in a general way) when the change occurred. This is not new information. But it is, to my mind, a remarkably successful illustration of how even a simple search on very rudimentary data can illuminate or access aspects of the historical print record which the database itself was not designed to address. The knowledge isn’t “new” to scholars; but it is “new” in the sense that it is extractable from the ngrams data without being deliberately put in.

I’m not denying the many problems with the data. If you switch to American English, the data gets sparser (in part, no doubt, because fewer books were printed in America in 1700), making any comparison of English and American printing impossible.
(And the complexities of how the British and American corpuses were separated are another issue entirely.)

You call this mis-OCR’d data. To my mind, I prefer the data in this state. If the language model the OCR used “knew” about typographical history, it would make a search for “best” easier. But it would actually reduce the number of things one could learn from the data using this basic interface.

(1) Metadata: Google Books has many texts tagged with the wrong date, due perhaps to plain error, perhaps to the use of foundation dates of publications for all of their content, and perhaps to the use of later editions of earlier texts. As an example, a Google bigram for “Nelson Mandela” between 1885 and 1910 reveals a few peaks, and clicking through the links below shows that these are indeed references to the former president of South Africa (who was born in 1918), all misdated.

Ought to have added: the medial-s OCR issue is not such a problem in a case like “beft / best” (where it might be revealing), but what about “fend / send”, “fore / sore” and similar cases, where both forms are English words?

Actually, the search could be done just as well, or even better, by allowing a search for “best,beſt”. (Yes, Unicode does have a long-S code point—if those two words look the same to you it’s due to the font you’re using).
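As a sketch of how such a dual search could be generated mechanically (Python; the simplified rule below–lowercase “s” set as long “ſ” everywhere except word-finally–is an assumption that ignores the finer historical conventions):

```python
# U+017F is LATIN SMALL LETTER LONG S ("ſ").
LONG_S = "\u017f"

def long_s_variants(word):
    """Return the modern spelling plus its long-s form: every
    non-final lowercase 's' becomes 'ſ' (the long s was not
    used at the end of a word)."""
    body, last = word[:-1], word[-1:]
    return [word, body.replace("s", LONG_S) + last]

print(long_s_variants("best"))  # ['best', 'beſt']
```

A search interface could expand each query term this way behind the scenes, leaving the underlying data untouched.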

Corrupting your source database in order to enable a particular type of narrow-interest search through one particular interface is, well, “not good practice.” (I have much stronger words for it than that, actually, but I won’t use them here.)

As I explained in my comment on the next blog post, the value in this project is not in the sample interface to the database, but in the database itself. And I have confirmed (by downloading and checking it) that the database is corrupt. In the first (number 0) file of 1-grams alone there are 842 instances of the string “beft” (as all or part of a word) and none of “beſt”.
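That check is easy to reproduce once a file is downloaded; a sketch (Python, with a hypothetical shard filename–the count of 842 is the commenter’s, not something verified here) of counting rows whose ngram field contains a given substring:

```python
def count_ngrams_containing(path, needle):
    """Count rows of a tab-separated ngram file whose first
    (ngram) field contains `needle` as a substring."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if needle in line.split("\t", 1)[0]:
                n += 1
    return n

# Hypothetical usage against the first 1-gram shard:
# count_ngrams_containing("googlebooks-1gram-0.csv", "beft")
# count_ngrams_containing("googlebooks-1gram-0.csv", "be\u017ft")
```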

I realize I neglected to make the point about the effects of corruption should one choose to use only the ASCII character set, thus forcing the decision of whether to substitute (short) “s” or “f” for the long “s”. If you want to see the problems that causes, try searching for “sin,fin”. Essentially, the two words are mostly indistinguishable before 1800.

[…] too, so I don’t need to link to it. Just one thing, though: for pity’s sake, people, quit running it down because dinking around with it for less than a week hasn’t produced superlatively original […]

[…] with, but, as I’m sure all my hermeneutically suspicious readers know, there are plenty of objections to taking the findings seriously. The team of non-digital-humanist scientists behind it have since […]

[…] wariness over the prospects for a humanist mining of the corpus. (See, for example, Dan Cohen and Mike O’Malley for historians who are cautiously optimistic and crankily skeptical respectively. The response in […]