Computer science is a fast-changing field of research. Can we track some of those changes over time? To answer this or any other question, experimental computer science researchers have essentially three options:

Write a program.

Search the Internet.

Write a program that searches the Internet.

We'll take the last approach in this post. We can pick a term and look at its term frequency, that is, the fraction of abstracts containing the term, as a function of time. Conveniently, the ACM Digital Library and Guide have an advanced search that we can exploit to analyze term frequency across decades of abstracts. We'll start with a generic term, performance:
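The counting itself is simple once the abstracts are in hand. Here's a minimal sketch, assuming the abstracts have already been retrieved and grouped by year (the function and data-structure names are my own, not part of any ACM API):

```python
def term_frequency(abstracts_by_year, term):
    """Fraction of abstracts mentioning `term`, per year.

    `abstracts_by_year` maps a year to a list of abstract strings.
    Matching is a simple case-insensitive substring test.
    """
    term = term.lower()
    freq = {}
    for year, abstracts in abstracts_by_year.items():
        hits = sum(1 for a in abstracts if term in a.lower())
        freq[year] = hits / len(abstracts) if abstracts else 0.0
    return freq
```

A real version would also want word-boundary matching so that, say, "performances" and "performance" are counted consistently.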

Let me say now, in case it becomes unclear later, that all the data presented here is actually derived from the ACM's database; it is not made up. (As for the interpretation of the data, well, ...) Moving on, we can plot a couple other terms, which and that:

The naive reader will conclude that conjunctions went out of fashion in the mid-'80s and came back after the dot-com crash. However, the inappropriately perspicacious reader realizes that this conclusion is subtly flawed, because the word "that" might be a pronoun, adjective, or adverb, not just a conjunction. Fortunately, the astute reader realizes we can look at relative instead of absolute frequencies to help clean up the data set.

The next group of figures are normalized by the term performance. For example, the blue line in the first plot below shows freq(mobile) / freq(performance).
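The normalization is just a pointwise ratio of two frequency series. A sketch, using hypothetical per-year frequency dictionaries like those produced above:

```python
def relative_frequency(freq_x, freq_base):
    """freq(x) / freq(base) per year, skipping years where the
    base term never appears (to avoid division by zero)."""
    return {year: freq_x[year] / freq_base[year]
            for year in freq_x
            if year in freq_base and freq_base[year] > 0}
```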


Now some "matchups", where x v. y = freq(x) / freq(y).

distributed v. centralized

flat v. hierarchical

Berkeley v. Stanford

Us v. Them

They're winning, but we're gaining on Them.

Now our final, and perhaps most important, plot.

Good v. Evil

While computer science is solidly on the side of good, several abstracts in this search were disturbingly megalomaniacal, such as J. R. Landry's "Can computing professionals be the unintentional architects of evil information systems?", in ACM SIGMIS-CPR, 2008. The author "discusses how the technical rational paradigm supports the creation of systems that embody administrative evil" and aims to "determine if information systems can harm or be evil, the frequency of harm, and response to harm by designers and users."

We can be thankful that according to its abstract, the paper is only "a research-in-progress".

During a recent trial, both attorneys handling a case before Judge Timothy M. Frawley misused the word "acclimated" while questioning a witness. Frawley ... waited for a break in the proceedings and, outside the presence of the jury and witness, alerted the attorneys to the correct use of the word.

Given two words, can we connect them by a chain of synonyms? For example: minuscule — little — short — poor — wretched — ugly — frightful — tremendous — enormous.



In these chain graphs, each node is a word and an edge connects each pair of displayed words that can be synonymous. The raw data on semantic relationships comes from Princeton's WordNet project.
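Finding such a chain is a shortest-path search on the synonym graph. Here's a sketch using breadth-first search, assuming the graph is a dictionary mapping each word to its set of synonyms (the representation is my assumption, not WordNet's own format):

```python
from collections import deque

def synonym_chain(graph, start, goal):
    """Shortest chain of synonyms from `start` to `goal` via BFS.

    `graph` maps each word to a set of its synonyms (assumed
    symmetric). Returns the chain as a list of words, or None if
    the two words are not connected.
    """
    if start == goal:
        return [start]
    parent = {start: None}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        for syn in graph.get(word, ()):
            if syn not in parent:
                parent[syn] = word
                if syn == goal:
                    # Walk the parent pointers back to the start.
                    chain = [syn]
                    while parent[chain[-1]] is not None:
                        chain.append(parent[chain[-1]])
                    return chain[::-1]
                queue.append(syn)
    return None
```

BFS guarantees the chain found is as short as possible, which matters here: longer chains drift through ever more dubious senses of each word.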

Many word pairs are not linked. How large is the largest connected component—that is, the largest set of words such that any pair in the set can be connected by a chain of synonyms? An interesting graph-theoretic question.

The answer is that the largest component has 25,105 words, or 17% of all words in the database. Meanwhile, the second largest component is over six hundred times smaller, with only 38 words.
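Computing those component sizes is a standard graph traversal. A sketch, again assuming an adjacency-dictionary representation of the synonym graph:

```python
def component_sizes(graph):
    """Sizes of the connected components, largest first.

    `graph` maps each word to an iterable of its neighbors
    (assumed symmetric). Uses an iterative depth-first search
    to avoid recursion limits on large components.
    """
    seen = set()
    sizes = []
    for node in graph:
        if node in seen:
            continue
        # Explore the component containing `node`.
        stack, size = [node], 0
        seen.add(node)
        while stack:
            word = stack.pop()
            size += 1
            for nb in graph.get(word, ()):
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        sizes.append(size)
    return sorted(sizes, reverse=True)
```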

Is this structure bizarre? Actually it's roughly what one would expect knowing random graph theory, which I'll now attempt to explain in one paragraph. A classic (Erdős–Rényi) random graph consists of n nodes connected by a bunch of edges chosen uniformly at random. Suppose the number of edges is such that on average, each node has d neighbors (for us, d synonyms). Imagine exploring a connected component by starting at one node and expanding outward: first looking at the nodes that are one step away from the starting point, then two steps away, and so on. What happens to the size of this frontier? If d < 1, the frontier tends to shrink by a certain percentage in each step, so with high probability it dies out before the component gets very large. On the other hand, if d > 1, the frontier tends to expand by a certain percentage in each step. Chances are pretty good that the frontier just gets bigger and bigger until the component includes a good fraction of all n nodes in the graph. That's the giant component. The second largest component must be very small, specifically O(log n) nodes, since otherwise it would likely intersect with, and thus be absorbed by, the giant component.

Our graph of English has d = 2.92 synonyms per word. But of course English is not a random graph—which, with the same d, would have a largest component of 93% of its nodes and a second largest of about 5 nodes. Considering that we get within an order of magnitude without modeling any of the structure of the language except the number d, this is not so bad. And WordNet doubtless does not perfectly represent English: for example, it's plausible that common words are better annotated with synonymy relations than, say, abulia or vituperation.
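The 93% figure for a random graph with d = 2.92 can be checked by simulation. Here's a sketch that samples a G(n, p) graph with p = d / (n - 1), so the mean degree is d, and measures its largest component (n = 1000 is my choice, small enough to run quickly):

```python
import random

def giant_component_fraction(n, d, seed=0):
    """Sample an Erdos-Renyi G(n, p) graph with mean degree d and
    return the fraction of nodes in its largest component."""
    rng = random.Random(seed)
    p = d / (n - 1)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    # Iterative DFS over all components, tracking the largest.
    seen = [False] * n
    best = 0
    for s in range(n):
        if seen[s]:
            continue
        stack, size = [s], 0
        seen[s] = True
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if not seen[w]:
                    seen[w] = True
                    stack.append(w)
        best = max(best, size)
    return best / n
```

For d > 1 the expected giant-component fraction s solves s = 1 - e^(-d*s); at d = 2.92 that gives s of roughly 0.93, consistent with the figure above.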

I was led to think of this topic during conversation with some folks from UCL and CMU. But others have built a graph out of WordNet too (after all, it is called WordNet). A paper by Kamps, Marx, Mokken, and de Rijke, "Using WordNet to Measure Semantic Orientations of Adjectives", scores words based on their relative distance to "good" vs. "bad".