Monthly Archives: December 2010

[Note: I got bogged down on reading and reporting on the culturomics paper a little too closely, and this is a reboot.]

Peter Norvig, one of the co-authors of the culturomics paper and the director of research at Google, was also a co-author of another significant article with the suggestive title, “The Unreasonable Effectiveness of Data”. The invention of the term “culturomics” suggests a scientific programme for attacking questions of culture that stresses statistical models based on large amounts of data, a programme that has been very successful both academically and commercially for linguistics and artificial intelligence, to say nothing of the informatics approaches of genomics and proteomics upon which the term culturomics is based. The slogan “every time I fire a linguist, the performance of my system goes up” (a slight misattribution of something Frederick Jelinek, a pioneering computational linguist, said) is another restatement of this. Among the bets and assumptions made by this approach are:

(1) The goal of science is better-engineered systems, which have practical, commercializable outcomes.
(2) Models must be empirically testable, with precise, independent, repeatable evaluation metrics and procedures.
(3) Simple quantitative models based on large amounts of data will perform better, in the senses of (1) and (2), than complex qualitative models based on small amounts of data.

The successes attributable to big data programmes include effective search engines, speech interfaces, and automated translation. Google’s rigorous approach to big data affects nearly every aspect of its business: core search for starters, but even more important are the big data approaches behind Google’s ability to make money on its search, as well as to decrease its operating costs.

Studies of culture are currently, for the most part, either done using complex, qualitative models, or based on relatively small amounts of data. The Google N-gram data is, perhaps, an opening salvo in an attack on qualitative/small data approaches to studies of culture, to be replaced with quantitative/big data approaches. The quantitative/big data programme has been “unreasonably effective” in overturning how linguists, artificial intelligence, and cognitive science researchers approach their field and get their projects funded. The bet, here, is that the same will occur in other culture studies.

There are many problems with the example experiments described in the culturomics paper. The experiments are often not described in enough detail to be replicable. Proof is often by example rather than by large-scale evaluation metrics. The proofs often resemble just-so stories (explanations without adequate controls) or unsurprising results (for example, that the Nazis suppressed and censored writers with whom they disagreed is borne out by the data). The scope of the experiments is often very limited (for example, the section on the “evolution of grammar” laughably describes only the changes occurring in a small subset of strong/weak verbs).

Because this is an overview paper, it may be that some of the important details are missing for reasons of space. Some of these things are addressed in the supporting materials, but by no means all. For example, something as basic as the methods used for tokenization (how the successive strings of characters in the digital copies of the books of the corpora are split into countable tokens) is not defined well enough to be repeatable. How, for example, does the system tokenize “T.S. Eliot”? Is this tokenized the same way as “T. S. Eliot” or “TS Eliot”? Based on the sample N-gram viewer, it appears that, to find mentions of T.S. Eliot, the search string “TS Eliot” (similarly WH Auden, CS Lewis) must be used. The supplemental and related supplemental material give many details, but in the end refer to proprietary tokenization routines used at Google.
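A toy sketch makes the repeatability worry concrete. The tokenizer below is a deliberately naive whitespace splitter of my own invention, not Google’s proprietary routine; it shows how three spellings of the same name land in different n-gram entries, so their counts never add up.

```python
# Hypothetical illustration (NOT Google's tokenizer): naive whitespace
# splitting, with punctuation left attached to tokens.
def tokenize(text):
    """Split on whitespace; periods stay glued to their letters."""
    return text.split()

for variant in ["T.S. Eliot", "T. S. Eliot", "TS Eliot"]:
    print(variant, "->", tokenize(variant))
# "T.S. Eliot" becomes the 2-gram ("T.S.", "Eliot"), while
# "T. S. Eliot" becomes a 3-gram ("T.", "S.", "Eliot") -- the same
# person's mentions are scattered across different n-gram entries.
```

Any published tokenization scheme would at least let a reader predict which of these forms is countable; without one, the query “TS Eliot” working in the viewer can only be discovered by trial and error.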

And yet, there are some useful ideas here. Because the N-gram data is time-stamped, looking at some kinds of time-varying changes is possible. The idea of measuring the half-life of changes is a powerful one, and the varying amounts of time it takes to fall to half-life are interesting in their analysis of “fame” (in reality, their analysis of name mentions). Seeing how some verbs are becoming strong in the face of a general tendency towards regularization is interesting. And the lexicographic estimates seem very valuable (if not very “culturomic” to me).
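The half-life idea is simple enough to sketch. Here is a minimal reconstruction of the notion (my own reading, on invented data): find the peak year of a name’s mention frequency, then count the years until the frequency first falls to half the peak.

```python
# Minimal sketch of a "half-life of fame" measure, on invented data:
# years until mention frequency first drops to half its peak value.
def half_life(years, freqs):
    peak_i = max(range(len(freqs)), key=freqs.__getitem__)
    half = freqs[peak_i] / 2
    for i in range(peak_i, len(freqs)):
        if freqs[i] <= half:
            return years[i] - years[peak_i]
    return None  # never fell to half within the observed span

years = list(range(1900, 1910))
freqs = [1, 3, 9, 10, 8, 6, 5, 4, 3, 2]  # hypothetical mentions per million
print(half_life(years, freqs))  # peak in 1903 at 10; falls to 5 in 1906 -> 3
```

Comparing this number across cohorts of names (as the paper does across birth decades) is where the time-stamping earns its keep.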

A danger in the approach of culturomics is that, by focusing on what can be processed on a large scale, and measured with precision, interesting scientific questions will be left unexplored, perhaps especially when those questions are not of obvious economic benefit. Engineers build better engines, not necessarily better theories.

Having said all of this, I remain optimistic about the release of the Google N-gram data, even as I resist the term and approach suggested by culturomics. Yes, Google needs to provide better descriptions of the data, and continue to clean up the data and metadata (as well as describe and report on what “cleaned up” means; some of this is, in fact, described in the supplementary data [4]), and to be much more transparent about access to the underlying documents, when permissible by law. It would be very useful for Google to provide algorithmic access to the data rather than just make the (very large) data sets available. But these data can be mined for interesting patterns and trends, and it will be interesting to see what researchers do with them. Let’s just call it time-stamped N-gram data, though, and eschew the term culturomics.

I’ve now had a chance to read the Science Express article describing “culturomics,” that is, the “Quantitative Analysis of Culture Using Millions of Digitized Books,” recently published by researchers at Google and several high-profile academic and commercial institutions. The authors claim to have created a corpus of approximately four percent of all books ever published (over five million books), supplying time-stamped ngram data based on 500 billion words in seven languages, primarily English, from the 1500s until roughly the present, primarily more recently. The “-omics” of culturomics is by analogy to genomics and proteomics: that is, high-speed analysis of large amounts of data. As someone who has done some work with ngrams on the web (with data provided both by Bing, my employer, and earlier data provided by Google), this work is of great interest to me. Digitized books are, of course, a different animal from the web (and from queries made to search engines on the web), and so it is of interest for this reason too. The addition of time-stamps makes some kinds of time-based analyses possible as well; this is the kind of data we have not had before (Bing’s ngram data, assuming they continue to provide older versions, might eventually do so).

There have been a number of criticisms of the culturomics programme, many of them well-founded. It is worthwhile to describe a few of these. First, there are problems with the meta-data associated with the books that Google has digitized. As a result, it is unclear how accurate the time-stamps are. This is not addressed in the article, although it has been a well-known problem. Certainly, the authors could have sampled the corpus and given estimates of the accuracy of the time-stamp data. Related to this is the lack of a careful description of the genres and dialects represented (partly a failure in the meta-data, again). Second, there are systematic errors in the text scans; this is especially true for older books, which often used typographic conventions and fonts not common today (and, one assumes, errors made due to language models based on modern texts rather than pre-modern ones). Consider, for example, the “long s” previously used in many contexts in English; this is often read as an “f” instead of a long s. Incidentally, according to the tokenization goals of the project, the right thing to do would be to record the long s, not regularize it to the modern, standard “s”; otherwise, it becomes more difficult to track the decline of the long s, except by guessing at OCR errors. The whole notion of what constitutes a countable thing in these corpora (that is, the tokenization rules for generating the 1-grams) is given short shrift in this article, although it is a fairly important issue.
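The long-s point can be shown in two lines. This is my own illustration: if tokenization regularizes the long s (Unicode U+017F, “ſ”) to a modern “s”, the archaic and modern spellings collapse into one countable 1-gram, and the long s’s decline can no longer be read off the counts directly.

```python
# My illustration: regularizing the long s (U+017F) destroys the very
# distinction one would need in order to track its decline over time.
def regularize(token):
    return token.replace("\u017f", "s")

old_spelling = "la\u017ft"  # 18th-century "laſt"
print(regularize(old_spelling))  # -> "last": merged with the modern form
```

Keeping the raw character (and regularizing only at query time, if at all) would preserve both the count and the typographic history.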

Third, the presentation of the Google Labs ngram viewer has made it overly easy to tell “just so” stories. For example, consider the contrast between “Jesus” and “Christ” from 1700 to 2008. It’s easy to tell this just-so story: People talked about Jesus or Christ before the Revolution, but then not so much in the run-up to the War and its aftermath. But, with the Great Awakening, a large number of books were published, with “Christ” much more common than “Jesus.” Over time, due to increasing secularization of the United States, people wrote about Jesus or Christ less and less. The individualistic Evangelical explosion of the last thirty years has started to reverse the trend, with “Jesus” (a more personal name) becoming a more popular name than “Christ” (a less personal name). Natalia Cecire describes this, better and more succinctly, as “Words for Snowism.” Cecire also views the Ngram Viewer as a guilty pleasure; as epistemic candy.

The Science Express article describes several experiments made with the Google books data, and it is worth spending time examining these, because this gives good hints as to what the data are likely to be good for; where the culturomics programme is likely to head.

The first set of experiments describes attempts at estimating the size of the English lexicon, and its growth over time. The authors describe (qualitatively, for the most part) their sampling technique for determining whether a 1-gram token was an English word form: it had to appear more than once per billion words; a sample of these common potential word forms was manually annotated with respect to whether they were truly English word forms or something else (like a number, a misspelling, or a foreign word). Sample sizes, procedure, inter-rater reliability, etc., were not reported; an important flaw, in my opinion. They show, for example, that the English vocabulary has increased by over 70% in the past 50 years, and contrast this to the size of printed dictionaries. This first set of experiments will be of great interest to lexicographers; indeed, it is just this gap that commercial enterprises like Wordnik are trying to fill. It is hard to see how this says much about “culture,” except as fodder for lexicographical historiography or lexicographical evangelism: there are many words that are not in “the dictionary”; get used to it.
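The sampling technique, as I understand it, can be sketched in a few lines. This is my reconstruction of the idea, not the authors’ code, and the annotation step here is a toy stand-in (the real work used human annotators): sample frequent 1-gram types, label each as a true word form or not, and scale the labeled fraction up to the full type count.

```python
# Hedged sketch (my reconstruction, not the authors' code) of estimating
# lexicon size by sampling candidate 1-gram types and scaling up the
# fraction judged to be true English word forms.
import random

def estimate_lexicon_size(candidate_types, is_word, sample_size, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(candidate_types, sample_size)
    frac = sum(is_word(t) for t in sample) / sample_size
    return round(frac * len(candidate_types))

# Toy stand-in for manual annotation: tokens with digits are not words.
def is_word(token):
    return token.isalpha()

# 1,000 candidate types, half of them word forms.
types = ["the", "burnt", "1865", "c3po"] * 250
print(estimate_lexicon_size(types, is_word, sample_size=100))
```

It is exactly the unreported details of this procedure (sample size, annotator agreement, the frequency threshold’s effect) that determine how much to trust the resulting lexicon-size curves.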

The second set of experiments purports to describe “the evolution of grammar,” but, not surprisingly, only attacks a very small subset of lexical grammar: the change in use of strong and weak verbs in English. Given the time-stamped, word-form based data, it is relatively simple to check the movement from “burnt” to “burned,” and to compare and contrast this with other strong and weak verbs. One wishes for more description of how they reach their conclusions, for example, that “high-frequency irregulars, which are more readily remembered, hold their ground better.” This reasonable statement is “proved” by a single contrast between finded/found and dwelled/dwelt. Some of the conclusions are based on dialect: there are differences between American English and British English. Unfortunately, the actual scope of the differences is “proved” by a single contrast between the use of burned/burnt in American and British English. Again, knowing the accuracy of dialect assignment, the list of verbs used, etc., would be very useful.
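The underlying measurement is straightforward, which is part of why one wishes the paper had reported it systematically. Here is the kind of statistic involved, on invented counts (not the paper’s data): the regular share of a verb pair is the regular form’s count divided by the pair’s total.

```python
# Invented counts for illustration: tracking the regularization of
# "burnt" -> "burned" as the regular share burned / (burned + burnt).
counts = {
    1850: {"burned": 30, "burnt": 70},
    1900: {"burned": 55, "burnt": 45},
    1950: {"burned": 75, "burnt": 25},
    2000: {"burned": 90, "burnt": 10},
}

def regular_share(year):
    c = counts[year]
    return c["burned"] / (c["burned"] + c["burnt"])

for year in sorted(counts):
    print(year, f"{regular_share(year):.2f}")  # rises toward 1.0
```

Computing this for a published list of verb pairs, per dialect, with error bars from the meta-data accuracy, would turn the paper’s single contrasts into an actual result.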

Some years are hinge years; a door swinging from one place to another. 2010 has been a hinge year for us—Jane, Mark, Will and Bess Fitzgerald, with Jane and Mark shifting life stages and transitions for the church that meets in our home. Along the way, though, we’ve had many joys and normal engagements. Here is a bit of the way life has swung for us this year. …

Deresiewicz decries social networking sites, especially Facebook, especially in the way that relationships between users of Facebook are called “friends.” Writing a status message is “like pornography.” We have no time for stories. Our relationships are commercialized. We are overwhelmed with the number of false friends. Facebook “seduces” us. It’s a “mirage.” Facebook friends are simulacra of friends. These sites “have falsified our understanding of intimacy itself.” Friendship is devolving.

His argument runs something like this: in the good old days, we had real friends; you know, like Jonathan and David. Then, a lot of things changed—Democracy! Capitalism! Equality! Industrialization! Mass Media! Friendship wasn’t just between male non-homosexuals anymore. And now Facebook is changing things again. It has disadvantages—it affords shallow communications more than deep ones, it’s built for commercial purposes, and so our social relationships are deformed toward commercial ends—and therefore, friendship is doomed. “We have given our hearts to machines, and now we are turning into machines.” (As further proof: a “recent book on the sociology of modern science” notes that at a networking event (that is, an event whose main purpose is to make social connections), “There do not seem to be any singletons—disconsolately lurking at the margins—nor do dyads appear, except fleetingly.” In other words: at an event created so people could meet other people, people were meeting other people instead of being alone or with one other person.)

It is unfortunate that Deresiewicz took out the reactionary essay template and started filling in the blanks. I think he could have described the advantages and disadvantages, the risks and opportunities of social networking sites. He could have helped us mitigate the bad and amplify the good. Instead, he crafted a classic rant, full of guilt-by-association logical fallacies and arguments from silence. He could have helped us understand better what it could mean to be a friend—its limits and extents—instead of just looking backward to the days of “10-page missives.” (You know, in the happy days before universal literacy and a 40-hour work week. Hmm, I’d actually be willing to wager a significant sum of money that the per capita production of the modern equivalent of 10-page missives—what, about 2000 words?—is greater now than in even the days of the Bloomsbury Group.)

I have a few more than 500 “friends” on Facebook. Like many people I know, I would call some of them friends and some of them Facebook friends. By calling someone a “Facebook friend,” I am signaling some attenuation in the meaning of “friend.” Many of these people are co-workers or former co-workers. These people I call co-workers or former co-workers. A few, I am glad to say, I can also call friend. A few are former students; these I call former students. Many of these are people I sing with throughout the country (avocationally, I enjoy Sacred Harp singing); these I call people I sing with. Some of them are also my friends. Some are members of my family; these I call my wife, my son, my daughter, my niece, my brother, etc. A few are people who I know through work, colleagues or potential clients or employers, or people whose career I might be able to further, or who might be able to further my career. Most of these I would call a Facebook friend, but probably not a friend.

You see what I am doing, I think. That Facebook decided to call this connection friend does not imply that I adopt these connections as friendships. Facebook’s adoption of the term friend has more to do with some of the social and cultural factors which Deresiewicz, in his calmer moments, describes well, than with Facebook friend changing the very nature of friendship. And I don’t think I am especially aware; the existence of the expression Facebook friend in and of itself signals this, as I stated previously.

Some have argued strongly that social networking sites like Facebook should have a more nuanced public ontology of relationships; a teacher’s primary school students are not “friends” the way the teacher’s co-workers are, and they shouldn’t have access to the same pictures, statuses, etc. [2]. Fair enough; this is a real issue. In point of fact, Facebook has recently announced a revision of the user profile [3], including something called “Featured Friends”:

You can now highlight the friends who are important to you, such as your family, best friends or teammates. Create new groups of friends, or feature existing friends lists. I opted to feature my Ultimate Frisbee teammates, giving the rest of my friends a way to learn more about that part of my life.

Also included are more structured ways to describe one’s work and school history, and other facets of one’s life. But, one must say, this is likely mostly to benefit Facebook; most of the people in my social network either already know or do not care that Dan Fitzgerald is my brother, for example; and that Daniel E Fitzgerald is the Facebook account he has. But the people who do care are Facebook itself and the companies to which it sells advertising and other network data. Knowing that this account belongs to my brother instead of just my friend is of great economic value to them—of course, by this I mean the social network, now annotated with relationship labels.

I suspect there will be, and perhaps should be, a pushback against this further structuring of relationship labeling. I suspect we may even long for the day when we just labeled people as (Facebook) friend, bleached of the depth of meaning I have in my real friendships and other relationships.

I entitled this essay “computer-assisted friendship” because, in general, I am very glad that networking and computer technology have made it easier to maintain friendships and relationships with a wider variety of people, from my past and my present, who live near me and who live far away from me, at levels ranging from transactional to superficial to amusing to deep engagement, in ways not so easily supported by the technologies of paper, pen, highway transportation systems, and a postal service.

So, I’m grateful for all these “friends” and friends: The girl from my high-school Christian band lives on a farm not so far away. Interesting. The singing friend with whom I share jokes, car rides, and nurturing wisdom. I’m grateful. The singing friend from a red state with whom I never discussed politics, but who posts anti-tax and pro-America messages. We’re learning how to engage and disagree respectfully. The woman whom I have barely met, but who is a recent widow and who posts long and heart-breaking weblog posts as well as Auburn University football fandom. A call to prayer, and a new team to know about. The artist whose work is on our walls. I’m glad to know about her trips to Africa and where she’s selling her art. The pastor in San Francisco who worries that technology will destroy friendship? I hope this essay will lessen his worries.

“Stentorian” derives from Stentor, a herald of the Greeks during the Trojan War. This post is a response to Robert L Vaughn’s post on stentorian. I thought stentorian meant “in a grand rhetorical style,” but I think it does mean just “powerfully loud,” so Sacred Harp or black gospel music could be said to be sung in a typically stentorian manner, I think. But I’m not quite sure: there are not many musical examples. But “The Stentorian Harp” would be a cool name for a shape note songbook.