Ben Zimmer’s recent post on “MySpace”, “Misplace” and spell-checkers is very interesting. As noted in his post, the word MySpace is now in the lexicon of the
Office 2007 spellchecker. I guess nobody will complain about that addition. It
shows that the tools evolve just as our vocabulary does, and that solving the
Cupertino issues he regularly describes on Language Log is not only an
algorithmic problem, but also a problem related to the coverage of the
dictionary.

As you know, a spell-checker has
two main functions: it should spot mistakes, and it should also suggest
the most likely word form to replace the erroneous input. Computing the
suggestions is usually an algorithmic process based upon the concept of “edit
distance”, which counts the character manipulations needed to turn a correct
word into an incorrectly spelled one. Deleting, adding, transposing or
replacing a character are the most common manipulations.
Here are examples of such manipulations (the word to the right of the arrow is
flagged with a red squiggle):

Deleting a character: information → infomation

Adding a character: developing → developping

Transposing 2 characters: believe → beleive

Replacing a character: independent → independant
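
To make the four manipulations concrete, here is a minimal sketch of an edit distance that counts each of them as one step. It is the textbook "optimal string alignment" variant, written for illustration only; the function name edit_distance is my own, and nothing here is meant to describe how the Office speller is actually implemented.

```python
def edit_distance(source: str, target: str) -> int:
    """Number of single-character deletions, insertions, substitutions and
    adjacent transpositions needed to turn `source` into `target`.

    Illustrative textbook algorithm (optimal string alignment), not the
    algorithm of any particular product.
    """
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of target[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a character
                dist[i][j - 1] + 1,         # add a character
                dist[i - 1][j - 1] + cost,  # replace a character (or keep it)
            )
            # Transposing two adjacent characters counts as one step.
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


examples = [
    ("information", "infomation"),    # deleting a character
    ("developing", "developping"),    # adding a character
    ("believe", "beleive"),           # transposing 2 characters
    ("independent", "independant"),   # replacing a character
]
for correct, misspelled in examples:
    print(correct, "->", misspelled, "distance:", edit_distance(correct, misspelled))
```

Each of the four pairs above comes out at a distance of 1, since every misspelling is a single manipulation away from the correct word.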

When a word is not found in the
speller dictionary, the speller looks for the nearest candidates in terms of
edit distance, and that distance is used to compute the order of the
suggestions. In addition to “edit distance”, which is a general,
language-independent concept, some language-specific knowledge may also be
used to fine-tune the order of suggestions. There can be a specific rule saying,
for instance, that some users have problems with the letters “gh”, which they
sometimes mix up with “f”. If you write “rouf”, the edit distance mechanism
is responsible for the suggestion “roof” appearing in the first position, but
you will also see “rough” in the list of suggestions offered by the speller,
even though, in terms of edit distance, getting from “rough” to “rouf” takes
more manipulations (deleting “gh” and adding “f”). This is
based upon an English-specific typology of errors which enables us to take
frequent mistakes into account.
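
As a rough sketch of how a language-specific rule could sit on top of edit distance, the snippet below ranks candidates from a toy word list and then adds words reachable through a “gh”/“f” confusion rule. Everything in it is hypothetical: the names TOY_DICTIONARY, CONFUSION_PAIRS and suggest, the distance threshold, and the score given to rule-based candidates are assumptions made for the example, not a description of the real speller.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (delete, add, replace), one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                     # delete
                current[j - 1] + 1,                  # add
                previous[j - 1] + (char_a != char_b),  # replace or keep
            ))
        previous = current
    return previous[-1]


# Hypothetical data, for illustration only.
TOY_DICTIONARY = {"roof", "rough", "round", "route"}
CONFUSION_PAIRS = [("f", "gh"), ("gh", "f")]  # (what was typed, what was meant)


def suggest(word, dictionary, max_distance=1):
    scores = {}
    # Step 1: language-independent candidates, scored by edit distance.
    for candidate in dictionary:
        distance = levenshtein(word, candidate)
        if distance <= max_distance:
            scores[candidate] = distance
    # Step 2: language-specific candidates. Apply each confusion rule at
    # every position and keep the results that are real words. They are
    # ranked just after the plain edit-distance hits (assumed scoring).
    for typed, meant in CONFUSION_PAIRS:
        start = word.find(typed)
        while start != -1:
            variant = word[:start] + meant + word[start + len(typed):]
            if variant in dictionary and variant not in scores:
                scores[variant] = max_distance + 0.5
            start = word.find(typed, start + 1)
    return sorted(scores, key=lambda candidate: (scores[candidate], candidate))


print(suggest("rouf", TOY_DICTIONARY))  # ['roof', 'rough']
```

With this toy list, “roof” is found by edit distance alone (one replacement), while “rough”, which is two manipulations away, only makes the list because of the confusion rule.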

Ben cites one of his readers who points out that the latest Word spellchecker gives misplace
as the first suggestion for Mysplace. That is true and is in fact
expected, since going from Misplace to Mysplace is done by replacing
the “i” with a “y” (a one-character change only).

The
distance is longer to transform MySpace into Mysplace (turning
the capital “S” into a lower-case “s” and adding “l”, so two changes). This is why MySpace
appears as the second suggestion.
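
The arithmetic can be checked with a few lines of code. The sketch below uses a plain textbook edit distance and treats a change of case as an ordinary one-character replacement, which is what the reasoning above implies; whether the real speller weighs case changes exactly this way is an assumption on my part. The last column shows what the distances would be if case were ignored.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (delete, add, replace), one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                     # delete
                current[j - 1] + 1,                  # add
                previous[j - 1] + (char_a != char_b),  # replace or keep
            ))
        previous = current
    return previous[-1]


typed = "Mysplace"
candidates = ["MySpace", "Misplace"]

# Order the candidates by their case-sensitive distance to the typed word.
for candidate in sorted(candidates, key=lambda c: levenshtein(c, typed)):
    with_case = levenshtein(candidate, typed)
    without_case = levenshtein(candidate.lower(), typed.lower())
    print(candidate, with_case, without_case)
# Misplace 1 1
# MySpace 2 1
```

Counting the capital “S” as a change is what separates the two candidates here: if case were ignored, both would be a single manipulation away from Mysplace.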

Note
that MySpace is listed as the first suggestion when you type Mispace in
Office 2007, and not as the second one, as suggested by Ben’s reader (see
screenshot below):

Of course, it will always be up to
the writer to decide whether they really meant MySpace or something
else. In any case, I don’t think this is a “Cupertino” issue, since there is no
automatic replacement. As you know, the Cupertino issue affected the
Word speller in the 1997 version, over 10 years ago, and was due to the AutoReplace
function. Many things have changed since then and the Office proofing tools
have improved a lot, for instance with the introduction in Office 2007 of a contextual
speller. The speller does its job when it flags the mistake in Mispace, and it also does its job when it suggests the most likely corrections. I would argue that if the user unfortunately clicks on “Misplace” in this list when they meant “MySpace” and had written “Mispace”, the tool cannot really be blamed, can it? ;-)