You know Google Book Search—and, realistically, Google itself
(which includes book results)—has been touted as a fast, easy way to check for
plagiarism. Just do a phrase search for a distinctive sentence or long phrase,
and if you get a match you should check further for likely plagiarism.

Some writers seem to take it a little further. Plug in a
distinctive five-word phrase; if there’s a match you’ve got a plagiarist.
That’s nonsense. Few five-word phrases are all that distinctive, and I’m not
sure it’s reasonable to call five borrowed words plagiarism.

Still, the general approach seems sound. With several
million digitized books and billions of other text sources, and with phrase
searching that appears able to accept fairly long phrases, it’s a good first
step (if only a first step).

Distinctive Sentences

But what’s distinctive? How do you identify sentences that are
good candidates for checking? Let’s turn that around: Aren’t most sentences
common enough that they’re useless for detecting plagiarism?

I can’t take credit for that idea—although, in one of those
“accidental plagiarism” situations, I’d nearly forgotten the seed of it.
Fortunately, Google comes to the rescue. Paul Collins wrote “Dead plagiarists
society” on November 21, 2006 at Slate (www.slate.com/id/2153313).
Collins focuses on Google Book Search, noting the extent to
which it had already aided folks in uncovering published plagiarism. He offers
some examples and suggests that we may see a lot of newly detected plagiarism
in the future.

But wait, you might ask, don't people accidentally repeat each
other's sentences all the time? It seems to me that this should not be
unusual. Yet try plugging that last sentence word by word into Google Book
Search, and watch what happens.

It: Rejected—too many hits to count

It seems: 11,160,000 matches

It seems to: 3,050,000

It seems to me: 1,580,000

It seems to me that: 844,000

It seems to me that this: 29,700

It seems to me that this should: 237

It seems to me that this should not: 20

It seems to me that this should not be: 9

It seems to me that this should not be unusual: 0

It seems to me that this should not be unusual is itself
... unusual.

There’s much more to Collins’ thoroughly entertaining
article—and I already discussed the article in January (Cites & Insights
8.1). How do I know that? Because I tried Collins’ 10-word sentence in
Google…and Cites & Insights 8.1 comes up third in a list of seven
hits. (GBS still shows zero hits.)
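
The word-by-word test can be sketched in Python. This is only an illustration: `growing_prefixes`, `first_zero_hit_length` and `reported_count` are hypothetical names, and the lookup table simply replays the counts quoted above rather than querying Google or Google Book Search. The rejected one-word query is represented by infinity, which is an assumption made for the sketch.

```python
def growing_prefixes(sentence):
    """Return the 1-word, 2-word, ... prefixes of a sentence, in order."""
    words = sentence.split()
    return [" ".join(words[:n]) for n in range(1, len(words) + 1)]


def first_zero_hit_length(sentence, count_matches):
    """Return the prefix length (in words) at which matches first hit zero.

    count_matches is any callable mapping a phrase to a hit count.
    Returns None if every prefix still has at least one match.
    """
    for n, phrase in enumerate(growing_prefixes(sentence), start=1):
        if count_matches(phrase) == 0:
            return n
    return None


# Counts quoted in the article, keyed by prefix length in words.
REPORTED_COUNTS = {2: 11_160_000, 3: 3_050_000, 4: 1_580_000, 5: 844_000,
                   6: 29_700, 7: 237, 8: 20, 9: 9, 10: 0}


def reported_count(phrase):
    """Stand-in for a phrase search: look up the quoted count by length.

    Assumption: the rejected one-word query is treated as infinitely many hits.
    """
    return REPORTED_COUNTS.get(len(phrase.split()), float("inf"))


SENTENCE = "It seems to me that this should not be unusual"
print(first_zero_hit_length(SENTENCE, reported_count))  # 10
```

Passing a different `count_matches` callable (for example, one that counts occurrences in a local text collection) would reuse the same loop without changing it.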

When I discussed this in January 2008, I didn’t entirely buy
the notion:

I would note that this is probably not the case for descriptive
nonfiction sentences, at least taken one at a time: After all, there are only
so many ways to state a fact. (That sentence, not including “after all,”
appears twice in a Google search—both discussions of plagiarism—but not in
Google Book Search.)

The first hit for Collins’
phrase is now a post at Althouse, a blog by Ann Althouse, a law professor.
She titles the post “Hasn’t it all been said before?” and follows that with
“No. Everything is actually amazingly new:” and the same stuff I
quoted—properly attributed and linked, of course. The second commenter on the
post argues against this proposition, saying in part:

The argument that a similar string of words necessarily
proves plagiarism is a statistically naive argument.

There are relatively few
commonly used words in English; it is highly unlikely that one can come up with
a phrase or sentence that has not been used before, even, possibly, for the
same subject matter. There are, for example, only so many ways that one can
discuss Hamlet's ambivalence…

The next commenter quoted the phrase beginning “it is highly…”
and responded:

Actually it is highly likely that any given sentence you
speak has never been used before, unless the sentence is short and about a
common subject. It just seems like the same sentences get reused a lot because
our brains are amazingly efficient at distilling sentences down to their core
meanings, which do get reused regularly.

The next commenter took the direct approach, searching “Ann has
too much time on her hands.” No match. Another commenter searched the three
phrases in the long sentence above (“There are…English,” “it is
highly…sentence” and “that has not…subject matter”)—and didn’t find matches for
any of them.

A later commenter points out the truth behind the
statistics. Even if you assume a truly tiny vocabulary, the number of
combinations in a sentence gets very big very fast. If you assume a mere 1,000
words (I’ve seen 2,000 to 6,000 words cited as the smallest plausible
vocabularies for people to communicate effectively in English), you can
construct one billion three-word sentences, one trillion four-word
sentences and, shall we say, an exceedingly large number of nine-word
sentences. (One billion billion billion, or 10 to the 27th power—an
octillion different sentences using American wording.) As the commenter notes,
“Not all sequences of valid English words are valid English sentences, but what
you lose for that reason is peanuts, relatively speaking.” True: If 99.9% of
combinations are invalid, that would still leave 10 to the 24th
nine-word sentences…for an unrealistically small subset of English. Have a
6,000-word vocabulary? The numbers mount up a lot faster: Allowing nonsense
combinations, you could have 10 million times as many nine-word sentences. I
write a lot, but I won’t write even a billion sentences in my lifetime (almost
certainly not even ten million)—and most of my sentences are a lot more than
four words long.
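
The arithmetic in that comment is easy to check. The sketch below is just a calculator for the figures quoted above; `sequence_count` is a hypothetical helper name, and the 99.9% invalid share is the illustrative figure from the paragraph, not a measured value.

```python
def sequence_count(vocabulary_size, words_per_sentence):
    """Number of ordered word sequences, valid English or not."""
    return vocabulary_size ** words_per_sentence


three_word = sequence_count(1_000, 3)   # one billion
four_word = sequence_count(1_000, 4)    # one trillion
nine_word = sequence_count(1_000, 9)    # 10 to the 27th power

# Keep only 0.1% of the nine-word sequences (99.9% assumed invalid).
nine_word_valid = nine_word // 1_000    # 10 to the 24th power

# How many times more nine-word sequences a 6,000-word vocabulary allows.
growth = sequence_count(6_000, 9) // nine_word   # 6**9 = 10,077,696

print(three_word, four_word, nine_word, nine_word_valid, growth)
```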

Expanding the Test Cases

Somehow, my English background trumped my math background—and I
found it hard to believe most non-literary sentences would be all that unique.
So I thought I’d run a slightly larger experiment, using random sentences from
a body of writing by an author who doesn’t strive for clever phrasing—in other
words, pretty ordinary sentences.

Fortunately, I could locate a writer who doesn’t use many
fancy words, doesn’t strive for literary effect, could provide a bunch of
paragraphs in machine-readable form—and wouldn’t take offense at being called a
writer of ordinary sentences. All I had to do was look in the mirror.

The process was simple enough. I set up a basic
spreadsheet, opened Google, and started copying in the first sentence of each
paragraph from an issue of Cites & Insights heavy on essays, where
odd proper names and the like would be less likely to skew the results. I
expected to find matches for at least 25% of the sentences—after all, this is
commonplace nonfiction writing. (How common is that last five-word phrase?
Apparently not as common as I’d expect: A Google phrase search yields zero
results.)

Skewing the research

Almost as soon as I began the process (I’d planned to search 100
sentences in all, taking up to 18 words at a time and checking Google results
to see how many different authors were involved, counting up to five) I ran
into trouble.

I was consistently coming up with one author: Me. Even on
shorter sentences. This made no sense. I deleted a couple of sentences that
used the word “liblogs.” No help—and although some sentences used “blogs,”
surely there have been millions of sentences written using that word by now.

The ones and zeros (portions of Cites & Insights don’t
seem to be indexed by Google, although most of it is) kept on coming. After 20
or so, I started deliberately skewing the research toward “indistinct”
sentences. I omitted sentences with proper nouns and sentences with nouns much
more unusual than “librarians.” I started selecting smaller portions of
sentences. And I set up a parallel column, taking the first eight words of
sentences and retesting those: Surely I’d get lots of matches then!
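
The two columns can be pictured with a short Python sketch. It is an illustration only: the actual work was done by hand in a spreadsheet, the function names are hypothetical, and splitting on whitespace is an assumed definition of a "word."

```python
def make_test_phrases(sentence, full_limit=18, short_limit=8):
    """Return (full_phrase, short_phrase) for one sentence.

    full_phrase keeps at most full_limit words. short_phrase keeps the first
    short_limit words, or is None when the sentence is already that short.
    """
    words = sentence.split()
    full_phrase = " ".join(words[:full_limit])
    if len(words) > short_limit:
        short_phrase = " ".join(words[:short_limit])
    else:
        short_phrase = None
    return full_phrase, short_phrase


def as_phrase_query(phrase):
    """Wrap a phrase in double quotes, the usual exact-phrase search syntax."""
    return '"' + phrase + '"'


full, short = make_test_phrases(
    "It occurred to me that I'd probably be quite natural in a similar role"
)
print(as_phrase_query(full))
print(as_phrase_query(short))
```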

I also moved beyond that
issue of Cites & Insights to some unedited copy for this issue
(being unedited, it was even more likely to be humdrum) and unedited drafts for
an early “Crawford Files” column and a “disContent” column. All the while, I
was avoiding distinctive nouns and what I thought of as distinctive writing.

And still there were few matches—a few more in eight-word
subsets, but even there not all that many.

Adding more authors

After 130 sentences or so (I
was fascinated enough to enlarge the sample) I decided to broaden the range of
authorship a bit. That had actually happened already: A few of the sentences
were quotations from blog posts.

I took a handful of well-known liblogs and tried ten
sentences from each—again, avoiding sentences that were inherently distinctive
because of the terminology. In doing so, I noted that Dorothea Salo’s writing
is distinctive even in short bursts, as are the posts of a few of the other
bloggers I sampled.

I went outside the library field for one essay from New
York (on the death of traditional publishing)—and then I tried something
different: Wikipedia, frequently faulted for plagiarism. I took one
essay that seemed like a good candidate (and included distinctive words in this
case), “Jeremy Bentham,” and another essay on a fairly obscure topic, “Theosophy.”
I did find cases that had the feel of plagiarism—but, except for the definition
of theosophy (where an edit war seems to keep inserting a plagiarized
definition), the cases of possible plagiarism seemed to be the other way
around: Other websites using Wikipedia text without attribution. I’m not
saying Wikipedia’s free of plagiarism; I’m saying I didn’t find obvious
instances in the two articles (out of several million) and 18 sentences (out of
hundreds) that I tried, except for the one definition. In the end, I removed
the Wikipedia tests from the overall sample, replacing them with 18
others from my own unedited drafts in order to maintain overall coherence.

The Results: Round One

I tested 300 full or partial sentences—most twice. Forty-four
test phrases were eight words or fewer; for the other 256, I also tested the
first eight words.

Ten phrases out of 300—just over three percent—showed up
more than once in Google, not counting attributed quotations. Here’s the full
list—after all, it’s short!

·A funny thing happened on the way to this column

·On the other hand, it’s a lot of work

·I swear I don’t do this on purpose

·Let’s go a little further

·Times change, and change again.

·We learn in many ways

·That time came and went

·They’re free to express their opinions within reason

·That does not mean print is dead

·Improved technology cuts both ways

The first seven phrases were used by at least five different
writers. The last three were used by two writers each.

Only two of these phrases are longer than eight words, and
the first is an awfully convenient way to start an offbeat column (I believe it
always appears as the first sentence in an article). I’ll take credit for six
of the ten phrases so ordinary they couldn’t possibly represent plagiarism (the
first, fourth through sixth, ninth and tenth).

The short ones

Half of the non-unique phrases are only five words long—and, as
it happens, none of the five-word phrases I tested turned out to be unique.
Remember that I skewed selections toward non-uniqueness—I mean, “we learn in
many ways” and “let’s go a little further” (both my sparkling prose) are so
ordinary they come close to cliché status.

But there were also three four-word phrases—and all three of
them tested as unique:

·Delayed commentary makes sense

·This is a scattered essay

·Is blogging scholarly communication

So did all but one of the six six-word sentences:

·Formal language does not grant authority

·Personal attacks undermine reasoned arguments

·But who cares about my conclusions

·Blind posts can damage honest discussion

·I’m ignoring all sorts of context

And all eight of the seven-word sentences and initial
phrases:

·Was this genuine controversy or incited controversy?

·Citations are tricky, so many different formats

·She plays us a few of the clips

·It’s time common sense prevailed in Washington

·I’m a reasonably well-read, well-informed, well-educated person

·Collegiality and professionalism are perfectly fine qualities

·Blogging does have a real intellectual value

·That seems unlikely as a general situation

I wrote all six of the six-word sentences—but only one of the
seven-word sentences, which came from seven different sources.

In all, 22 of the 300 test cases (7%) were sentences shorter
than eight words, with another 22 (7%) exactly eight words. Two of the
eight-word sentences showed up more than once, but 91% were unique. It’s
certainly true that most non-unique test cases were eight words or fewer (eight
of ten), but also true that most short sentences were still unique (81%).
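
For anyone who wants to check the percentages, here is a small Python sketch that recomputes them from the counts given above. The constant names are hypothetical, and the shares are shown to one decimal place, so they differ slightly from the whole-number figures in the text.

```python
def share(part, whole):
    """Percentage of whole represented by part, to one decimal place."""
    return round(100 * part / whole, 1)


TOTAL = 300
NON_UNIQUE = 10              # phrases found from more than one writer
UNDER_EIGHT = 22             # test cases shorter than eight words
EXACTLY_EIGHT = 22           # test cases exactly eight words long
NON_UNIQUE_EIGHT = 2         # eight-word cases that were not unique
NON_UNIQUE_SHORT = 8         # non-unique cases of eight words or fewer

short_cases = UNDER_EIGHT + EXACTLY_EIGHT            # 44

print(share(NON_UNIQUE, TOTAL))                                   # 3.3
print(share(UNDER_EIGHT, TOTAL))                                  # 7.3
print(share(EXACTLY_EIGHT - NON_UNIQUE_EIGHT, EXACTLY_EIGHT))     # 90.9
print(share(short_cases - NON_UNIQUE_SHORT, short_cases))         # 81.8
```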

Length distribution

Here’s the distribution for the 256 test phrases longer than
eight words:

·Nine words: 31 cases

·Ten words: 45 cases

·Eleven words: 30 cases

·Twelve words: 51 cases

·Thirteen words: 32 cases

·Fourteen and fifteen words: 25 cases each

·Sixteen words: eight cases

·Seventeen words: six cases

·Eighteen words: three cases.
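
As a quick consistency check, the distribution can be totaled in a few lines of Python. The case counts come straight from the list above; the word total, mean and most common length are derived here and are not figures stated in the article.

```python
# Phrase length in words -> number of test cases, from the list above.
LENGTH_COUNTS = {9: 31, 10: 45, 11: 30, 12: 51, 13: 32,
                 14: 25, 15: 25, 16: 8, 17: 6, 18: 3}

total_cases = sum(LENGTH_COUNTS.values())
total_words = sum(length * count for length, count in LENGTH_COUNTS.items())
mean_length = total_words / total_cases
most_common_length = max(LENGTH_COUNTS, key=LENGTH_COUNTS.get)

print(total_cases)         # 256
print(total_words)         # 3096
print(mean_length)         # 12.09375
print(most_common_length)  # 12
```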

A sampling of “unique” sentences and phrases

Here are a dozen of the phrases and sentences so unusual they
don’t show up anywhere in Google’s corpus—or if they do, it’s only in the
source from which I quoted and in sources properly quoting that one:

·It occurred to me that I’d probably be quite natural in a similar role

·People are angry and confused, searching for meaning and otherwise unclear how to respond

·There is much of interest in the specific results

·If you haven’t witnessed this type of behavior in person

·I think the answer is still yes, at least some of the time

·We disagree on a number of issues—and do so agreeably

I’m reluctant to label any of these as ordinary text, since some
of them come from other people. They’re all clear and straightforward (which
may not be ordinary at all). I’m guessing the authors would not suspect
plagiarism if they happened to see these sentences in someone else’s writing.
(I included sentences from the following blogs in this test—and, with few
exceptions, it would be easy to find the original: The aardvark speaks,
Blogwithoutalibrary, Catalogablog, Caveat lector, Free range librarian, Librarian.net,
Mamamusings, LibrarianInBlack, Off the Mark, Lorcan Dempsey’s blog, Open
stacks, The travelin’ librarian, The medium is the message, The shifted
librarian, Tame the web and Walking paper.)

Conclusions

Common language isn’t nearly as common as you might think—or,
rather, “ordinary” sentences seem to be unique a great deal more often than I
would have anticipated.

Is it reasonable to suggest that nine of ten sentences (nine
words or longer) are unique? I have no idea. That held in this small
sampling, which deliberately excluded most sentences I thought likely to be
unique; in fact, the share was more than 19 of 20.

Sure, if you look at the statistics, that seems likely. But
it feels wrong, at least to me.

Is it a matter of
vocabulary? I couldn’t resist that question. The 300 text samples total 3,392
words—and include 1,219 different words, with no normalization except
for capitalization. That’s a modest vocabulary.
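
A count like that can be sketched in Python as follows. This is an assumed reconstruction, not the tool actually used: `vocabulary_stats` is a hypothetical name, words are split on whitespace, and only capitalization is normalized, so attached punctuation would still make two otherwise identical words count as different.

```python
def vocabulary_stats(samples):
    """Return (total_words, distinct_words) for a list of text samples.

    Words are split on whitespace and lowercased; nothing else is normalized.
    """
    words = []
    for sample in samples:
        words.extend(sample.split())
    distinct = {word.lower() for word in words}
    return len(words), len(distinct)


# Three of the sample phrases quoted earlier in the article.
samples = [
    "We learn in many ways",
    "Improved technology cuts both ways",
    "That time came and went",
]
print(vocabulary_stats(samples))  # (15, 14)
```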

The Results: Round Two

Roughly five-sixths of these sentences are at least nine words
long. So, I suspect, are most sentences in everyday written and spoken English.
In Collins’ single test, uniqueness occurred at the tenth word. What happens
with the 300 samples in this run if they’re truncated to eight words?

More of them show up from more than one author—35 more, for
a total of 45 out of 300 tests, or 15%. (Naturally, the ten tests that weren’t
unique at all weren’t unique at an eight-word limit.)

Of those commonly occurring briefer phrases, 28 have at
least five different occurrences. I kept looking through the first hundred
results (if there were that many) before giving up. Three more had four
sources, four had three each, and ten had only two sources.
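
The tallying rule (count distinct authors for a phrase, stopping at five) and the round-two totals can be sketched like this. The function and variable names are hypothetical, and the author lists in the usage lines are made-up placeholders rather than data from the study.

```python
def capped_source_count(authors, cap=5):
    """Number of distinct authors for a phrase, counted no higher than cap."""
    return min(len(set(authors)), cap)


# Sources per phrase (5 means five or more) -> number of phrases, round two.
ROUND_TWO = {5: 28, 4: 3, 3: 4, 2: 10}

non_unique = sum(ROUND_TWO.values())
still_unique = 300 - non_unique

print(non_unique)                     # 45
print(still_unique)                   # 255
print(round(100 * non_unique / 300))  # 15

# Placeholder author lists, for illustration only.
print(capped_source_count(["a", "b", "a"]))   # 2
print(capped_source_count(list("abcdefg")))   # 5
```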

Some of the briefer phrases that weren’t unique:

·But if I want to go back to something

·I also found it interesting that there were

·I can’t tell you how exciting it is

·I come down strongly on the side of

·I don’t think I would have said that

·I have to say I had no idea

·It really got me thinking about how

·Oh, and for those of you unfamiliar with

·We need to treat each other with dignity

·What is very interesting to me is that

·We have a small but excellent group of

·Fall is always a hectic time of year

·You can read the fine print yourself, but

·If you see something that is not just

·It occurred to me that I’d probably be

·So a few weeks ago we started the

·We have been having some internal discussion about

Some “unique” shorter phrases

A dozen of the 255 test
phrases that still showed up once (or not at all) when limited to eight words
or fewer:

·A sort of sewing kit for my life

·Absurd and even dangerous as it may be

·All the while he has argued for a

·And, later, the clear suggestion that increased plagiarism

·As a profession how do we find, identify

·At first, I was just thrilled to see

·Copyright helps maintain a balance between the needs

·He makes the interesting point that although a

·Humans may be flawed, but we have discovered

·Mixing the old and the young, the established

·That won’t reassure those who prefer to worry

Conclusions

The same as before, I think—although at a lower level. Even
relatively short sentences seem to be unusual most of the time. On the order of
85% in this sample, and I suspect that percentage would be higher in a truly
random sample.

Uniqueness and Plagiarism

While this is only an anecdotal study, I find it mildly
convincing. Our sentences are much more varied than I expected, even when we’re
not striving for literary excellence. (To those whom I’ve quoted here—always
without attribution—who do strive for literary excellence and
distinctive phrasing in each and every blog post: My apologies. It’s all good
writing, or at least I’ll grant that for the 55% of these samples that
other people wrote.)

What this little study does not show: That one
duplicated sentence is evidence of plagiarism. There are, indeed, only so many
ways to discuss Hamlet’s ambivalence. Or are there? Searching the words
“Hamlet” and “ambivalence” in Google yields a claimed 57,000 results—and I
didn’t spot any obvious duplications of phrasing in the first hundred.

What I believe may be
true: If you’re suspicious that a clumsy plagiarist has cut-and-pasted without
paraphrasing, almost any medium-length sentence may suggest you should
check further. It may be entirely innocent. But it seems surprisingly uncommon
for the same, say, 11-word string to show up more than once.

Cites & Insights: Crawford at Large, Volume 8, Number 11, Whole Issue 108, ISSN 1534-0937, a
journal of libraries, policy, technology and media, is written and produced by
Walt Crawford, Director and Managing Editor of the PALINET Leadership Network.

All original material in this work is licensed under the Creative Commons
Attribution-NonCommercial License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc/1.0 or send a letter to Creative
Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.