“Distant Reading” and Web Archiving

The following is a guest post by Andrea Fox, Web Archiving Intern at the Library of Congress.

Andrea Fox

When Abbie Grotke of Web Archiving took me on for an internship, I thought well of myself for a few minutes until realizing I had no clue what Web Archiving was or what it wanted from me. Abbie didn’t seem to mind I had no digital background and was majoring in linguistics (studying over the summer allowed me a break in winter classes). She may have had misgivings after our phone interview.

Abbie: So, this internship isn’t really related to linguistics, but you say you’re interested in archival work and organization. Do you have good computer skills?

Me: Yes, I do the computer.

Abbie: Right. And how about your experience with detail-oriented work?

Me: Oh, I’ve worked with a number of details.

Abbie: …Okay. Back to your earlier question about what artifacts we’re archiving. You do understand you’ll be helping with archiving the web itself and not (as those unfamiliar with digital futures might conclude from our department title) archiving by means of the web?

Me: Yes, absolutely. I am in no way hiding the fact I didn’t know that until this moment.

This dialog, if not factual, gives an impression of my feelings at the time. Though born in DC, I’ve lived in isolated areas for most of my life. I wasn’t nervous about working in the city, but knowing now that I am capable of becoming disoriented in the one block between Eastern Market and the Eastern Market Metro, perhaps I should’ve been.

Once I met the Web Archiving team, my fears dissolved. They introduced me to concepts quickly but patiently and gave me feedback as I completed data cleanup tasks to prepare archived resources for access. Scrolling through thousands of collection entries to catch repeated or inconsistent titles turned up several interesting finds. In a series of websites of U.S. election candidates, for instance, I noticed such contenders as Vermin Supreme, Jon Trailerpark Jackson, and Kinky Friedman.

In the meantime I also worked with Michael Neubert. Remembering my linguistics major, Mr. Neubert encouraged me to look into computational analysis of digitized texts. He suggested I start by exploring how researchers can use a collection like Chronicling America to perform a “distant reading” of thousands of pages, finding patterns an individual reader cannot.

The report (pdf) begins with an exploration of the Google Ngram Viewer, a tool that graphs word and phrase frequencies in Google Books’ collection over time. I played around with different languages for a while, working off of linguistic trends I’d read of and wanted to test.
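To make concrete what a frequency-over-time graph measures, here is a minimal sketch of the underlying arithmetic: for each year, count how often a word occurs and divide by the total number of words published that year. This is not how the Ngram Viewer itself is implemented; the function name `yearly_relative_frequency` and the toy corpus are assumptions made up for illustration.

```python
from collections import Counter
import re


def yearly_relative_frequency(documents, word):
    """Return {year: share of all tokens in that year equal to `word`}."""
    totals = Counter()  # total tokens seen per year
    hits = Counter()    # occurrences of `word` per year
    target = word.lower()
    for year, text in documents:
        tokens = re.findall(r"\w+", text.lower())
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token == target)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Hypothetical toy corpus of (publication year, text) pairs, not real data.
toy_corpus = [
    (1900, "the town held a meeting about the revolution"),
    (1900, "the language of the town"),
    (1950, "revolution revolution and the town"),
]

frequencies = yearly_relative_frequency(toy_corpus, "revolution")
for year, share in frequencies.items():
    print(year, round(share, 4))
```

Dividing by the yearly total is what lets years with very different amounts of text be compared on one axis.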

Spelling changes often demonstrate gratifying patterns. Here you can see how connoisseur, known to modern English and old French, subsequently changes to connaisseur in modern French:

Figure 1. Borrowing of French connoisseur (blue) into English (green) and subsequent change of the French spelling to connaisseur (red). (Experiment with the original graph at http://tinyurl.com/qfc8byj).

Another transformation occurred after the English used their beef from the French bœuf to form beefsteak, a word that was then reborrowed in an altered form into French:

Figure 2. Borrowing of French bœuf into English beef, followed by reborrowing of English compound beefsteak into French bifteck (both multiplied to show detail). (Original graph at http://tinyurl.com/q6vkcr2).

I couldn’t have made these comparisons without previous knowledge of words that have undergone change. Researchers who specialize in analyzing texts on a large scale, however, could potentially automate and expand these types of linguistic searches with more advanced tools, drawing conclusions larger than the words themselves (read the paper (pdf) produced by the Google Ngram Viewer team). Though the Ngram Viewer provides a relatively shallow view (the user can’t see the original context of the searched words), its scope allows for unusual insights.
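As one small illustration of what automating such a search might look like, the sketch below takes yearly frequencies for an older and a newer spelling and reports the first year in which the newer one overtakes the older. The function name `first_crossover_year` is hypothetical, and the numbers are invented for the example; they are not taken from the Ngram Viewer.

```python
def first_crossover_year(old_series, new_series):
    """Return the first year in which the newer spelling is more frequent
    than the older one, or None if that never happens."""
    shared_years = sorted(set(old_series) & set(new_series))
    for year in shared_years:
        if new_series[year] > old_series[year]:
            return year
    return None


# Hypothetical relative frequencies, invented purely for illustration.
old_spelling = {1800: 0.00030, 1820: 0.00025, 1840: 0.00012, 1860: 0.00004}
new_spelling = {1800: 0.00001, 1820: 0.00008, 1840: 0.00015, 1860: 0.00022}

crossover = first_crossover_year(old_spelling, new_spelling)
print(crossover)
```

Run over a long list of candidate word pairs, a check like this could flag likely spelling shifts without the researcher needing to know each one in advance.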

Moving beyond spelling changes, I searched for revolution in eight languages (nine groups of texts when you count British English) to get a superficial idea of which countries discussed the term when. The result looks like a fluke:

The major spike in Chinese corresponds with the Cultural Revolution, which began in 1966. What shocked me is how the Chinese frequency dwarfs that of the other languages: at its peak in 1969, the simplified Chinese 革命 appears nearly twenty times as often as its closest competitor, the German Revolution. As a benchmark, the word “the” hovers around 5 percent usage in English. “Have,” “I,” and “for” (0.35 to 0.6 percent) appear roughly as often as 革命 in Chinese (0.35 percent). (For comparison, see the same graph excluding Chinese at http://tinyurl.com/opxqlxs).

Barring a double meaning not listed in the dictionary, an imbalance in the Chinese texts, or some bizarre mechanism behind Google Books, it seems only a censorship bubble of strictly policed word choice could cause this disproportion. I found a similar asymmetry in the graphs of leader and censorship. Though not nearly as exaggerated as the spike in revolution, these findings suggest a political agenda has skewed the results.

Several other words, listed at the end of this post, show upturns, though some during different periods. I’ve included pairs of words in which only one demonstrates a strong pattern. I used neutral words, such as town and language, as controls. Apart from a tendency for Hebrew to rank highly with more concrete words—though not as highly and narrowly as Chinese—these words don’t seem to demonstrate significant patterns.

I’ve chosen to share these Google Ngrams because they are visually appealing and user-friendly (you can click on any one of the chart links above or in the list below and adjust the input phrase(s), time period, and number of languages for new results). More advanced and research-driven tools, however, have already been put to use on bodies of text, including the Library of Congress’ Chronicling America collection, that do not have the same access restrictions as does Google Books. The technique of topic modeling, for instance, can point to trends in ChronAm’s digitized newspapers by identifying distinct “topics,” each with its own cluster of related words, based on the likelihood of these words appearing near one another.
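The intuition behind grouping words by how often they appear near one another can be sketched with plain co-occurrence counting. To be clear, this is not a topic model (real ones, such as latent Dirichlet allocation, fit a probabilistic model), and it is not the method used on Chronicling America; the function names and the four sample "pages" below are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations
import re


def cooccurrence_counts(documents, stopwords=frozenset()):
    """Count how many documents each unordered word pair appears in together."""
    pair_counts = Counter()
    for text in documents:
        words = set(re.findall(r"\w+", text.lower())) - stopwords
        # Sorting makes each pair a canonical (alphabetical) tuple.
        for pair in combinations(sorted(words), 2):
            pair_counts[pair] += 1
    return pair_counts


def related_words(pair_counts, word, top_n=3):
    """Return the words that most often share a document with `word`."""
    neighbors = Counter()
    for (first, second), count in pair_counts.items():
        if first == word:
            neighbors[second] += count
        elif second == word:
            neighbors[first] += count
    return [neighbor for neighbor, _ in neighbors.most_common(top_n)]


# Hypothetical snippets standing in for digitized newspaper pages.
pages = [
    "election candidate votes campaign",
    "candidate campaign speech votes",
    "harvest wheat prices market",
    "wheat market prices railroad",
]

counts = cooccurrence_counts(pages)
print(related_words(counts, "wheat", top_n=2))
```

Even in this toy version, the words surrounding "wheat" and those surrounding "candidate" fall into separate clusters, which is the kind of grouping a topic model formalizes.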

There is tremendous potential to conduct these types of analysis on web archive collections to help researchers navigate large quantities of available material. When you let the machine page or scroll through thousands of pages and begin the pattern-making process, you can better focus on those patterns–whether expected or unexpected–and decide which ones warrant your attention.

After three months, I’ve accomplished more than I thought I would. I still don’t understand one in four words in the weekly Web Archiving meetings, but I’ll chalk it up to lack of technical background and smile smugly again.
