Appendix A The the-the Function

Sometimes when you you write text, you duplicate words—as with “you
you” near the beginning of this sentence. I find that most
frequently, I duplicate “the”; hence, I call the function for
detecting duplicated words, the-the.

As a first step, you could use the following regular expression to
search for duplicates:

\\(\\w+[ \t\n]+\\)\\1

This regexp matches one or more word-constituent characters followed
by one or more spaces, tabs, or newlines, and then an exact repetition
of that whole match. However, it does not detect duplicated words on
different lines, since the ending of the first word, the end of the
line, is different from the ending of the second word, a space. (For
more information about regular expressions, see Regular Expression
Searches, as well as Syntax of Regular Expressions in The GNU Emacs
Manual, and Regular Expressions in The GNU Emacs Lisp Reference
Manual.)
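
One way to check this behavior is to evaluate the regexp against a
couple of sample strings with string-match. This is only a sketch: the
variable name my-the-the-first-try and the sample strings are my own
assumptions for illustration.

```elisp
;; Hypothetical variable holding the first-attempt regexp from the text.
(defvar my-the-the-first-try "\\(\\w+[ \t\n]+\\)\\1"
  "First attempt: a word plus its trailing whitespace, repeated exactly.")

;; Same line: "the " is followed by an identical "the ", so the
;; regexp matches, starting at index 4.
(string-match my-the-the-first-try "see the the cat")
;; ⇒ 4

;; Different lines: the first "the" ends with a newline and the second
;; with a space, so the back reference \1 cannot match.
(string-match my-the-the-first-try "see the\nthe cat")
;; ⇒ nil
```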

You might try searching just for duplicated word-constituent
characters, but that does not work, since the pattern detects doubles
such as the two occurrences of ‘th’ in ‘with the’.

Another possible regexp searches for word-constituent characters
followed by non-word-constituent characters, reduplicated. Here,
‘\\w+’ matches one or more word-constituent characters and
‘\\W*’ matches zero or more non-word-constituent characters.

\\(\\(\\w+\\)\\W*\\)\\1

Again, not useful.
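
To see one reason why, note that ‘\\W*’ may match zero characters, so
a doubled letter inside a single word counts as a match. The following
sketch illustrates this; the variable name my-the-the-second-try and
the sample strings are assumptions for illustration.

```elisp
;; Hypothetical variable holding the second regexp from the text.
(defvar my-the-the-second-try "\\(\\(\\w+\\)\\W*\\)\\1"
  "Second attempt: word characters plus optional non-word characters, repeated.")

;; A genuine duplicate is found, starting at index 0 ...
(string-match my-the-the-second-try "the the cat")
;; ⇒ 0

;; ... but so is the doubled "t" in "letter": the inner group matches
;; "t", \W* matches nothing, and \1 matches the second "t".
(string-match my-the-the-second-try "letter")
;; ⇒ 2
(match-string 0 "letter")
;; ⇒ "tt"
```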

Here is the pattern that I use. It is not perfect, but good enough.
‘\\b’ matches the empty string, provided it is at the beginning
or end of a word; ‘[^@ \n\t]+’ matches one or more occurrences of
any characters that are not an @-sign, space, newline, or tab.

\\b\\([^@ \n\t]+\\)[ \n\t]+\\1\\b

One can write more complicated expressions, but I found that this
expression is good enough, so I use it.
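
As a rough check of this pattern, you can again evaluate it with
string-match. In this sketch, the variable name my-the-the-regexp and
the sample strings are assumptions for illustration. Because the
whitespace sits outside the group, a duplicate split across two lines
is found, and the ‘\\b’ at each end keeps the pattern from matching
parts of words.

```elisp
;; Hypothetical variable holding the pattern from the text.
(defvar my-the-the-regexp "\\b\\([^@ \n\t]+\\)[ \n\t]+\\1\\b"
  "Pattern for a word repeated after spaces, tabs, or newlines.")

;; A duplicate split across two lines is detected, starting at index 4.
(string-match my-the-the-regexp "see the\nthe cat")
;; ⇒ 4

;; The "th" shared by "with" and "the" is not reported, since the
;; leading \b requires the match to begin at a word boundary.
(string-match my-the-the-regexp "with the cat")
;; ⇒ nil

;; "the theory" is not reported either, since the trailing \b requires
;; the repeated text to end at a word boundary.
(string-match my-the-the-regexp "in the theory")
;; ⇒ nil
```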

Here is the the-the function, as I include it in my
.emacs file, along with a handy global key binding: