An example of a large-scale application of this computer program was discussed in a series of newspaper articles in June 2007. The articles won first place in the 2007 Philip Meyer investigative journalism awards. A 2011 ICFIS conference presentation dealt with some of the practical issues in the statistical detection of copying.

The features of the program, which is called SCheck, are explained in the file ReadMeSCheck.pdf. An example of the output for a class with a 'tough' exam is sample00.pdf; unfortunately, this extent of cheating is not unusual.

SCheck Availability

For leasing terms for institutional use of this software and for consulting on analyses, please contact the author. Assistance with cheating detection may be available to university researchers as time permits. A simplified and easy-to-use demo version of SCheck is available as a free, but time-limited, download. To obtain the download link, please send me an email at wesolows@mcmaster.ca with the subject "SCheck demo download". Please include your affiliation, that is, the educational institution where you are an instructor or administrator.

A Brief Overview of Statistical Detection of Cheating on Multiple Choice Exams

It is an unfortunate fact that some examinees/students cheat by copying or collusion on multiple choice tests and examinations. Statistical methods for detecting such cheating have existed for more than 30 years, as has statistical detection software, but both remain unknown to the great majority of instructors who use multiple choice questions. These detection methods can be divided roughly into two categories. Model-based detection methods are founded on modeling non-cheating (and sometimes cheating) behavior and using statistical tests of significance to identify student pairs suspected of academic dishonesty. Outlier-based detection methods compute one or more indices and flag pairs whose responses show unusual (outlier) characteristics attributed to suspected copying. In general, the two types of approach agree closely on the suspected pairs they identify, provided the evidence is strong; in marginal cases the methodologies often differ. Properly used, model-based detection methods can control false positives (students incorrectly indicated as cheating) to any desired, extremely low, estimated level of probability. Probability calculations for outlier methods may be more difficult or questionable. Researchers in this area can also be divided into two types: those who mainly study the statistical properties of detection methods and those who attempt practical application and urge cheating prevention. The latter group inevitably encounters more controversy.
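To make the model-based idea concrete, here is a minimal sketch in the spirit of such methods. It is not SCheck's actual algorithm; the independence model, the frequency-based matching probabilities, and the normal approximation are all illustrative assumptions.

    # A minimal sketch of a model-based similarity test; NOT SCheck's
    # algorithm. Under the hypothesis of independent work, the probability
    # that two students give the same answer to a question is estimated
    # from the empirical answer frequencies in the class.
    import math
    from collections import Counter

    def pair_p_value(responses, i, j):
        """Approximate one-sided p-value for the number of identical
        answers between students i and j. `responses` is a list of
        equal-length answer strings, one per student."""
        n_students = len(responses)
        n_items = len(responses[0])
        # Per question: P(two independent students match) is the sum over
        # options of the squared relative frequency of that option.
        match_probs = []
        for q in range(n_items):
            freqs = Counter(r[q] for r in responses)
            match_probs.append(sum((c / n_students) ** 2
                                   for c in freqs.values()))
        observed = sum(a == b for a, b in zip(responses[i], responses[j]))
        # Normal approximation to the Poisson-binomial count of matches,
        # with a continuity correction.
        mu = sum(match_probs)
        sigma = math.sqrt(sum(p * (1 - p) for p in match_probs))
        z = (observed - 0.5 - mu) / sigma
        return 0.5 * math.erfc(z / math.sqrt(2))

A usable method must do considerably more, for example account for the differing abilities of the two students and correct for the number of pairs examined (see the postscript below).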

The reliability of statistical cheating detection has been demonstrated by a long history of application at testing institutions and at several universities. This experience has yielded some interesting observations. Common intuitive objections, such as the claim that unusual or significant similarity between student responses can result from 'studying together' rather than cheating, have been shown to be unfounded. If it were true that factors such as studying together (or having a common educational or cultural background, etc.) could lead to strong similarities in responses, then statistical detection would have been discredited long ago: because all possible pairs are examined and many students share such common characteristics, there would have been a huge number of students with statistically unexplainable, strong similarities of responses who could not have cheated because they wrote the test, for example, in different rooms. No such examples have been observed. However, it has been observed that when proper precautions against cheating are implemented, statistically unusual similarities virtually disappear.

The popular assumption that about five feet of separation between writing desks, together with reasonably alert invigilation, virtually prevents cheating has been shown to be false. The detected cheating rate on multiple choice tests has been variously reported to be between 3% and 10%, even under test-room conditions. Due to limitations in detection, the actual cheating rate is likely to be considerably higher. Despite this, it is notable that neither statistical cheating detection nor even certain known and very effective cheating prevention measures are prevalent among educational institutions. The prevention measures, in order of importance, are multiple versions of tests (different orders of questions and/or answers), assigned randomized seating, and, as a developing requirement, measures against electronic communication between students during tests. This widespread lack of prevention is somewhat curious because precautions against cheating on multiple choice examinations are generally far less intrusive on students than more publicized methods, such as Turnitin, used to combat plagiarism.

Finally, it should be
noted that copying is not the only way that some examinees and their
helpers cheat. Examples of methods that may be immune to similarity
analysis include using imposters to write tests, bringing in unauthorized
aids, stealing and distributing answer keys, copying from examinees with
perfect or nearly perfect answers, and altering responses after the test
is written.


Postscript: Comments on Bad Methodology in Statistical Detection

It is not infrequent that instructors, when confronted with a suspected cheating situation, invent their own methodology on the spot. Usually this consists of some simple way of using the number of wrong answers that two students have in common. The index could be a count of such 'wrong matches', a proportion, a run length, a ratio to other counts, or some similar measure. The distribution of the index is then often plotted over all pairs to demonstrate the outlier status of the suspected student pair. Probability calculations offered to support the case are frequently incorrect or based on over-simplified assumptions.
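For concreteness, one such improvised index might look like the following sketch (a hypothetical illustration of the kind of index described above, not a recommended method):

    def wrong_match_proportion(ans_a, ans_b, key):
        """Proportion of the questions both students answered incorrectly
        on which their (wrong) answers are identical. Intuitively
        appealing, but confounded, as explained below."""
        both_wrong = [(a, b) for a, b, k in zip(ans_a, ans_b, key)
                      if a != k and b != k]
        if not both_wrong:
            return 0.0
        return sum(a == b for a, b in both_wrong) / len(both_wrong)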

Many such indices, or measures, have been tried and are frequently rediscovered. Unfortunately, while these measures are intuitively appealing and usually appear to confirm blatant cheating, they can easily produce many false positives and lead to dangerous misuse of statistical detection. One reason is that the number of matches can depend on the overall ability of the student pair (their marks), on the number of choices on each question, on the difficulty of the questions, and on the popularity of particular wrong answers. Indices without a sound theoretical basis do not incorporate such considerations properly and can be 'fooled'.

For example, a rather bad simple index is the percentage of the questions both students answered incorrectly on which their wrong answers match. Another bad methodology is to plot the number of matching answers against the longest matching string of answers and to look for outliers. Obviously, an honest pair of 'brilliant' students with perfect scores would have the highest number of matches and the longest string of matching answers, so student ability confounds the interpretation of outliers. The other factors mentioned above confound such simple methodologies as well.
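A small simulation (with purely hypothetical parameters) makes the ability confound visible: among honest, independently answering students, both the number of matching answers and the longest matching run grow with the pair's ability.

    import random

    def simulate_student(p_correct, n_items=50, n_options=4):
        # Answer 0 is the keyed answer; a wrong answer picks a distractor
        # uniformly at random.
        return [0 if random.random() < p_correct
                else random.randint(1, n_options - 1)
                for _ in range(n_items)]

    def longest_run(xs, ys):
        best = cur = 0
        for x, y in zip(xs, ys):
            cur = cur + 1 if x == y else 0
            best = max(best, cur)
        return best

    random.seed(1)
    for ability in (0.5, 0.7, 0.9):
        pairs = [(simulate_student(ability), simulate_student(ability))
                 for _ in range(2000)]
        matches = [sum(x == y for x, y in zip(a, b)) for a, b in pairs]
        runs = [longest_run(a, b) for a, b in pairs]
        print(f"ability {ability}: mean matches "
              f"{sum(matches) / len(matches):.1f}, mean longest run "
              f"{sum(runs) / len(runs):.1f}")

Honest pairs of strong students dominate both axes of the 'matches versus longest run' plot, exactly where a naive outlier hunt would look.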

It is often not understood that class size influences what is unusual when indices are calculated for all pairs. It is also not understood that the threshold of excessive similarity must be higher when a pair is selected by scanning all possible pairs than when it is selected because of a triggering event such as a report of suspicious behavior.
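One standard way to see the class-size effect (an illustrative calculation, not necessarily the correction SCheck applies) is the Bonferroni bound: scanning a class of n students examines n(n-1)/2 pairs, so holding the chance of any false positive at a fixed level requires an ever smaller per-pair threshold as the class grows.

    def per_pair_alpha(n_students, family_alpha=0.001):
        """Bonferroni-style per-pair threshold that keeps the probability
        of any false positive among all scanned pairs near family_alpha."""
        n_pairs = n_students * (n_students - 1) // 2
        return family_alpha / n_pairs

    for n in (30, 100, 500):
        print(n, per_pair_alpha(n))  # 30 -> ~2.3e-06, 500 -> ~8.0e-09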

These factors severely limit the reliability and validity of simple indices, whether as evidence of academic dishonesty or as a method of estimating the extent of cheating.