Edited by Hugh McGuire and Brian O'Leary

Making Books Out of Words (Erin McKean)

Erin McKean is the founder of Wordnik.com. Previously, she was the editor in chief for American Dictionaries at Oxford University Press, and the editor of the New Oxford American Dictionary, 2E. You can find her on Twitter at: @emckean.

Traditionally, the writer’s reference toolkit has consisted of a dictionary and a thesaurus (sometimes augmented with books of quotations or allusions). These are books that are used to make other books, allowing writers (in principle) to check the strength and suitability of their words before committing them to sentences and paragraphs, before completing their explanations and narratives.

But just as some aerialists work without a net, many writers work without these tools, or are told that they ought to. Hemingway said, “Actually if a writer needs a dictionary he should not write,”[1] and Simon Winchester (himself the author of two books about the Oxford English Dictionary) called the thesaurus “a tool for the lexically lazy.”[2] And the lamentations over the shortcomings of the automatic spell-checker[3] have been loud and long.

It’s no wonder that some have warned against the use of these tools: static paper (or static electronic) dictionaries and thesauruses (and books of quotations and allusions) are by their very nature inclined to be out-of-date, incomplete, and inadequate, because they have only been able to track the uses and meanings of words from a relatively small selection of previously published works. (Even the current online version of the Oxford English Dictionary includes only 3 million citations; the first edition of the OED cited only 4500 works.) The chain of events by which a new lexical creation is captured—a writer creates or changes the meaning of a word, publishes a work in which the new word is used, a dictionary editor sees it, a dictionary is published including it—is so long, and has so many points of breakage, that it’s a wonder new words are ever found, or that new dictionaries are ever published.

Because they’ve long been creatures of print, dictionaries and thesauruses are also necessarily over-compressed, reducing words to their broadest applications and their lowest common denominators. In a print-based dictionary, there’s no way to include every possible word of English, much less to account for every possible context in which a word could be used; nor can a printed thesaurus give enough information about a word’s possible synonyms and antonyms to be truly helpful to even the most conscientious writer.

But when all books are “truly digital, connected, ubiquitous,” there won’t be a need for the traditional (inadequate, static) dictionary or thesaurus: the collected sea of words will itself be “the dictionary” (with a little computational help). The dictionary will no longer be a separate thing (or two separate things, dictionary and thesaurus). The dictionary will be a ubiquitous metalayer on top of all digital text, matching content and context to answer questions of both production and comprehension (or mere curiosity).

A true dictionary metalayer would be instantaneously and continuously updated, near-infinite, multilayered, context-driven and context-rich, interactive, and, eventually, no longer a separate thing, but an intuitive part of reading and writing. At some point it could even be a push rather than a pull technology, learning from readers’ and writers’ behavior, glossing unknown words and phrases or automatically and transparently suggesting alternatives to overused adjectives. Ideally, it would be accessible from every text, both atomic and interconnected, like the Internet itself.

For any word, the reader or writer could call up the “nearest” (or most relevant) information about that word. They could look for other examples of that word in the same book, by the same author (or by writers similar to the author), in the same genre, in books on the same or similar subjects or by the same publisher, or in texts published in the same year or in the same geographic area. Readers or writers could look for other examples where the word shares context: for instance, all the sentences where apples are golden and appear near the names of figures in Greek mythology.
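As a rough illustration of that kind of lookup, here is a minimal Python sketch. It assumes a tiny in-memory corpus; the `Passage` record, its metadata fields, the `find_examples` function, and the sample sentences are all invented for the example and do not describe any existing system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    # Hypothetical record: one sentence plus a little metadata about its source.
    text: str
    author: str
    genre: str
    year: int

def tokens(text):
    """Lowercase word tokens with surrounding punctuation stripped."""
    return [w.strip(".,;:!?\"'()").lower() for w in text.split()]

def find_examples(corpus, word, near=(), **metadata):
    """Passages that contain `word`, also contain at least one word from
    `near` (if given), and match every metadata filter exactly."""
    hits = []
    for p in corpus:
        toks = set(tokens(p.text))
        if word.lower() not in toks:
            continue
        if near and not toks & {n.lower() for n in near}:
            continue
        if all(getattr(p, key) == value for key, value in metadata.items()):
            hits.append(p)
    return hits

# Made-up sentences, for illustration only.
corpus = [
    Passage("Hera gave Zeus a tree of golden apples.", "A", "myth", 1900),
    Passage("The apples in the orchard were golden by October.", "B", "farming", 1900),
    Passage("Atalanta stooped for the golden apples.", "C", "myth", 1950),
]

greek = find_examples(corpus, "apples", near=("Zeus", "Atalanta"))
print([p.author for p in greek])  # ['A', 'C']
```

A real metalayer would need proper tokenization and a notion of "nearness" richer than exact metadata matches; the sketch only shows how word, context, and source filters can combine in a single query.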

This would be a two-way street, allowing readers and writers to comment, vote, recommend, and advocate for and against any particular usage, meaning, collocation, phrase, or quotation. This engagement would give every usage enthusiast (or “grammar Nazi”) the chance to opine on the acceptability of “impact” as a verb or “funner” as a comparative adjective. Connections between words could be weighted by use (both frequency of being written and frequency of being read, annotated, commented upon, or looked up) and context (a thesaurus where distinctions between words could be visualized more concretely, making the process of selecting a synonym more accurate).

Instead of the insistently obdurate suggestions of current spellchecking programs (and “grammar check,” both of which are sorely lacking in what can only be called “theory of mind”), we could have context-driven fuzzy matching, able to differentiate between “there” and “their” and equally capable of understanding proper names and novel combinations of morphemes (knowing “fallacious” and the prefix “omni-,” it shouldn’t choke on “omnifallacious”). Fuzzy matching techniques (such as the wonderfully named “bursty n-grams”[4]) could help trace the development of ideas (or nab plagiarists). Techniques for identifying lexical and rhetorical patterns could automatically highlight and link sentences that are famous quotations (or that ought to be). Allusions (which could be thought of as lexicalized bits of history) could be made explicit (or, of course, left opaque, to keep from spoiling the thrill of discovery).
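The “omnifallacious” case can be sketched very simply. The Python below is not the context-driven fuzzy matching imagined above, only a plain prefix split; the word and prefix lists and the `analyze` function are assumptions made up for the example.

```python
# Toy lexicon and prefix list, invented for illustration.
KNOWN_WORDS = {"fallacious", "potent", "present", "there", "their"}
KNOWN_PREFIXES = {"omni", "anti", "un", "re"}

def analyze(word):
    """Classify a word as known, as a novel prefix + known word, or as unknown."""
    w = word.lower()
    if w in KNOWN_WORDS:
        return ("known", w)
    # Try longer prefixes first so the most specific split wins.
    for prefix in sorted(KNOWN_PREFIXES, key=len, reverse=True):
        base = w[len(prefix):]
        if w.startswith(prefix) and base in KNOWN_WORDS:
            return ("novel", prefix, base)
    return ("unknown", w)

print(analyze("omnifallacious"))  # ('novel', 'omni', 'fallacious')
print(analyze("omnifalacious"))   # ('unknown', 'omnifalacious')
```

A checker built this way would accept the novel coinage while still flagging the misspelling, which is the distinction the paragraph asks for; handling suffixes, compounds, and proper names would take considerably more than this.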

This information layer would incorporate traditional sources, dictionaries, and thesauruses both generalized and specialized, but it would also treat every text as a source of lexical information: sentences where words are the subject of the writing, not the raw material for writing about things, are relatively abundant in the language (Wordnik.com explicitly searches for these sentences in text–we call them “free-range definitions”). Some are very didactic:

“An aguantadora is someone who puts up with “it” and keeps going—“it” being whatever life throws our way.”[5]

Others are more parenthetical:

“The symposium will feature a new volume of 52 essays about association copies—books once owned or annotated by the authors—and ruminations about how they enhance the reading experience.”[6]

Some include etymological information:

““Absorptive capacity,” a term coined in the late twentieth century, refers to the general ability to recognize the value of new information, choose what to adopt, and apply it to innovation.”[7]

[The one thing traditional dictionaries do that computational techniques find difficult is etymology: if you think of meaning as the demographic data around words, etymology is the genealogy of words, and requires specialized human investigation.]
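To make the idea of hunting for definition-like sentences concrete, here is a small Python sketch using two surface patterns. The patterns and the `free_range_definitions` function are assumptions invented for this example; they are not a description of how Wordnik actually finds such sentences, and a usable system would need many more patterns and far better filtering.

```python
import re

# Hypothetical pattern 1: "An X is someone/something/a/an/the ..."
DIDACTIC = re.compile(
    r'\b(?:An?|The)\s+(?P<term>[\w-]+)\s+is\s+(?:someone|something|a|an|the)\b',
    re.IGNORECASE,
)
# Hypothetical pattern 2: a quoted term followed by "a term coined ..."
COINED = re.compile(
    r'[“"](?P<term>[^”"]+?),?[”"],?\s+a\s+term\s+coined\b',
    re.IGNORECASE,
)

def free_range_definitions(sentence):
    """Return (term, kind) pairs for definition-like patterns in `sentence`."""
    found = []
    match = DIDACTIC.search(sentence)
    if match:
        found.append((match.group("term"), "didactic"))
    match = COINED.search(sentence)
    if match:
        found.append((match.group("term"), "coined"))
    return found

# Shortened versions of two of the examples quoted above.
print(free_range_definitions(
    "An aguantadora is someone who puts up with “it” and keeps going."))
# [('aguantadora', 'didactic')]
print(free_range_definitions(
    "“Absorptive capacity,” a term coined in the late twentieth century, "
    "refers to the general ability to recognize the value of new information."))
# [('Absorptive capacity', 'coined')]
```

The parenthetical style of definition (a term followed by a gloss set off by dashes) is harder to catch with a simple pattern and is left out of the sketch.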

It would also be possible to incorporate sentiment information, both explicit and implied: a search for “I love the word X,” for example, turns up words like “burp,” “curiosities,” “douchebag,” and “Shmorg”; “I hate the word X” turns up “retard,” “abstinence,” “hubby,” “willpower,” and “moist.” (A similar technique is used by the site sucks-rocks.com.[8])
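A search of this kind is easy to approximate. The Python sketch below tallies explicit “I love the word X” and “I hate the word X” statements; the regular expression, the `word_sentiment` function, and the sample sentences are assumptions made up for illustration, and nothing here reflects how sucks-rocks.com or any other site is implemented.

```python
import re
from collections import Counter

# Matches "I love the word X" / "I hate the word X", with an optional opening quote.
PATTERN = re.compile(
    r'\bI\s+(love|hate)\s+the\s+word\s+["“‘]?([\w-]+)',
    re.IGNORECASE,
)

def word_sentiment(texts):
    """Score each word: +1 for every 'love' statement, -1 for every 'hate'."""
    scores = Counter()
    for text in texts:
        for verb, word in PATTERN.findall(text):
            scores[word.lower()] += 1 if verb.lower() == "love" else -1
    return scores

# Invented sample sentences.
samples = [
    'I love the word "burp".',
    "Honestly, I hate the word moist.",
    "i hate the word “moist” so much",
]
scores = word_sentiment(samples)
print(scores["burp"], scores["moist"])  # 1 -2
```

Implied sentiment, as opposed to these explicit declarations, would need real sentiment analysis rather than a single pattern.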

Because dictionaries and encyclopedias differ mainly in scale (you can think of a dictionary as a specialist encyclopedia limited to words-as-things, instead of things-as-things), our sea of text could be mined to find encyclopedia-style facts, as well. (This is obviously akin to the semantic web.[9]) By looking for obviously factual statements (or factual statements about imaginary things, e.g. “Unicorns are beautiful creatures,” “John Carter is a Civil War veteran who was transported to Mars,” etc.) encyclopedia articles could be augmented with explicitly-sourced statements (or created automatically where human editors were not motivated to create them).

Discovery of related content, not just information about particular words, is a logical extension (and the original impetus behind most text digitization projects). Topical indexes and bibliographies would be bigger, more indexable, and more dynamic, although certainly more prone to problems of information-gluttony and the “just look at one more source” problem.

The traditional questions we have about words (what they mean, who uses them, and how) are not the only ones we can answer by fishing in the sea of words. Large-scale text analysis can be used to answer wider-scale questions about language and about culture. More than 5 million digitized books from Google Books have been released (in the form of n-gram counts[10]) for the express purpose of giving researchers (admittedly blunt and primitive) tools to investigate ideas as represented by words.

For instance, we can look at the relative trends in the use of –ess forms (like proprietress, ambassadress, etc.) as a proxy for the changing roles of women. A paper published in December 2010 in Science outlined this new field of “culturomics,” which its authors defined as “the application of high-throughput data collection and analysis to the study of human culture.” The authors also estimated that 52% of the English lexicon, “the majority of the words used in English books,” was “lexical dark matter,” not found in traditional reference books.
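The arithmetic behind such a trend line is simple: raw yearly counts for a set of word forms are summed and then divided by the total number of words published that year. The Python sketch below shows the calculation; the function names are hypothetical and every number in it is invented for illustration, not taken from the Google Books data.

```python
from collections import Counter

def combine(*series):
    """Add up several year -> count series (for example, several -ess forms)."""
    total = Counter()
    for counts in series:
        total.update(counts)
    return dict(total)

def per_million(counts, totals):
    """Convert raw yearly counts to occurrences per million words."""
    return {year: 1_000_000 * counts.get(year, 0) / totals[year]
            for year in sorted(totals)}

# Invented numbers, for illustration only.
totals = {1900: 2_000_000, 1950: 4_000_000, 2000: 10_000_000}
proprietress = {1900: 30, 1950: 15, 2000: 4}
ambassadress = {1900: 10, 1950: 5, 2000: 1}

trend = per_million(combine(proprietress, ambassadress), totals)
print(trend)  # {1900: 20.0, 1950: 5.0, 2000: 0.5}
```

Normalizing by each year's total matters because the number of books published grows over time; a raw count can rise even while a word's share of the language falls.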

Treating all digital text as a single sea of words would allow for more playfulness as well as more scholarship. Imagine navigating from one text to another via chains of connected words, going from:

“To be sure they all sleep together in one apartment, but you have your own hammock, and cover yourself with your own blanket, and sleep in your own skin,”

in Moby Dick, to a sentence ending with the same phrase (your own skin) in Uncle Tom’s Cabin:

“When you’ve been here a month, you’ll be done helping anybody; you’ll find it hard enough to take care of your own skin!”

Games could be built around getting from one text to the next in the shortest number of steps with the longest phrases, or in finding the most unlikely connections between far-flung texts (by date or place published, or by topic, or by political sentiments of the authors). Imagine games where extra points could be awarded for playing words used by Nabokov or Shakespeare or Nora Roberts (or where prizes in games could be sponsored by publishers to increase discovery or awareness of their books and authors). The possibilities for Mad Libs, crossword puzzles, and word-search jumbles are endless.
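The basic move in such a game, finding the longest phrase two passages share, can be sketched in a few lines of Python. The function names are hypothetical, and the approach (a standard longest-common-run comparison over words) is one reasonable choice among several rather than a prescribed method.

```python
import re

def words(text):
    """Lowercase words, treating curly and straight apostrophes alike."""
    return re.findall(r"[a-z']+", text.lower().replace("’", "'"))

def longest_shared_phrase(a, b):
    """Longest run of consecutive words that appears in both texts."""
    wa, wb = words(a), words(b)
    best_len, best_end = 0, 0
    # prev[j] holds the length of the shared run ending at wa[i-2], wb[j-1].
    prev = [0] * (len(wb) + 1)
    for i in range(1, len(wa) + 1):
        cur = [0] * (len(wb) + 1)
        for j in range(1, len(wb) + 1):
            if wa[i - 1] == wb[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return " ".join(wa[best_end - best_len:best_end])

moby = ("To be sure they all sleep together in one apartment, but you have "
        "your own hammock, and cover yourself with your own blanket, and "
        "sleep in your own skin,")
cabin = ("When you’ve been here a month, you’ll be done helping anybody; "
         "you’ll find it hard enough to take care of your own skin!")
print(longest_shared_phrase(moby, cabin))  # your own skin
```

Comparing every pair of sentences this way would be far too slow across a whole library; at scale the same links would more plausibly come from a shared index of phrases.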

Although one of the strengths of digital text is the possibility of human-driven annotation, bookmarking and sharing, the downside of human-driven data is the possibility of bias and limited attention span. In his accompanying essay, “Why Digital Books Will Become Writable,” Terry Jones asks:

“Why bother with the complexities of semantics or natural language understanding when, if you simply let them, users will happily tell you what they’re interested in and what web pages are about?”

But users often ignore “what everyone knows” and concentrate on the unusual. The OED faced this same problem early on. Readers focused on rare and learned words, ignoring the workhorses of English vocabulary. Editor James Murray complained that “Thus of Abusion, we found in the slips about 50 instances: of Abuse not five….”[11] By using statistical techniques, rather than pure crowdsourcing of items of interest, it’s possible to see not only hotspots of attention, but gaps and lacunae in coverage. This use parallels what the Berkman Center for Internet & Society does with their Media Cloud[12] project, which tracks coverage of issues in the news.
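One simple statistical check of this kind compares how much attention a word has received with how often it actually occurs. The Python sketch below flags words whose ratio of reader citations to corpus occurrences falls below a threshold. The function name, the threshold, and the corpus counts are invented for illustration; only the rough imbalance between the two slip counts echoes Murray's complaint (the sketch assumes 4 slips for “abuse,” since he says only “not five”).

```python
def coverage_gaps(slips, corpus_counts, min_ratio):
    """Words whose citations-per-occurrence ratio is below `min_ratio`,
    listed from most neglected to least."""
    ratios = {word: slips.get(word, 0) / count
              for word, count in corpus_counts.items() if count > 0}
    return sorted((w for w, r in ratios.items() if r < min_ratio),
                  key=ratios.get)

# Slip counts loosely modeled on Murray's example; corpus counts are invented.
slips = {"abusion": 50, "abuse": 4}
corpus_counts = {"abusion": 60, "abuse": 9000}

print(coverage_gaps(slips, corpus_counts, min_ratio=0.01))  # ['abuse']
```

The common word is flagged and the rare one is not, which is exactly the gap that volunteer readers, drawn to the unusual, tend to leave behind.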

This metalayer over the sea of words could drive all sorts of other tools: hotspot maps of texts and genres and topics, as well as instant visualizations of patterns of reading and writing and changes in ways of referring to things. (When does the whiz kid inventor turn into the captain of industry? When does being a locavore no longer need a parenthetical explanation?) There are certain opportunities (as well as certain problems) that only become apparent at scale. Will we find forgotten gems or drown in megabytes of irrelevance?

There are obviously technical issues involved in creating and navigating this sea of words–there will never be one single vat in which we store every text. But making every little bucket and cup able to share aggregate statistics and indexes and metadata across an information layer would go a long way towards creating the universal grasp of knowledge that readers and writers have longed for since Milton, apocryphally thought to be the last man in Europe who had–or could have–read everything there was to read.

All contributors to this collection maintain the copyright on their contributions to this book.

Printed in the United States of America. Published by O’Reilly Media, Inc, 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

This book is available online at: https://book.pressbooks.com.

The print, ebook and web versions of this book were produced and typeset using PressBooks.com, a single-file-source book production tool that outputs EPUB, typeset PDF, and web versions of all books. For more information, visit http://pressbooks.com.

Production Editor: Dan Fauxsmith

July 2012: First Edition.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.