
George O. Wesolowsky, Ph.D.

Professor Emeritus of Management Science, McMaster University, Hamilton, Ontario, Canada

(905) 525-9140, Ext. 23948, wesolows@mcmaster.ca

Free Statistics lectures: 26 narrated PowerPoint lectures in introductory statistics

Software: Statistical detection of cheating (copying, collusion) on multiple choice tests and examinations. This software is based on the method described in: Wesolowsky, G.O. (2000), "Detecting Excessive Similarity in Answers on Multiple Choice Exams", Journal of Applied Statistics, Vol. 27, 909-921.

An example of a large-scale application of this computer program was discussed in a series of newspaper articles in June 2007. The articles won first place in the 2007 Philip Meyer investigative journalism awards. A 2011 ICFIS conference presentation (copy) dealt with some of the practical issues in the statistical detection of copying.

The features of the program, which is called SCheck, are explained in the file ReadMeSCheck.pdf. An example of the output for a class with a 'tough' exam is sample00.pdf; unfortunately, this extent of cheating is not unusual.

SCheck Availability: For leasing terms for institutional use of this software, and for consulting on analyses, please contact the author. Assistance with cheating detection may be available to university researchers as time permits. A simplified, easy-to-use demo version of SCheck is available as a free but time-limited download. To obtain the link, please send me an email at wesolows@mcmaster.ca with the subject "SCheck demo download". Please include your affiliation, i.e., the educational institution where you are an instructor or administrator.

A Brief Overview of Statistical Detection of Cheating on Multiple Choice Exams

It is an unfortunate fact that some examinees cheat by copying or collusion on multiple choice tests and examinations. Statistical methods for detecting such cheating have existed for more than 30 years, as has statistical detection software, but both remain unknown to the great majority of instructors who use multiple choice questions. Detection methods can be divided roughly into two categories. Model-based methods model non-cheating (and sometimes cheating) behavior and use statistical tests of significance to identify student pairs suspected of academic dishonesty. Outlier-based methods provide indices, usually more than one, that flag unusual (outlier) characteristics in the responses attributed to suspected pairs. When the evidence is strong, the two approaches agree closely on which pairs are suspect; in marginal cases they often differ. Properly used, model-based methods can control false positives (students incorrectly indicated as cheating) to any desired, extremely low, estimated probability. Probability calculations for outlier methods may be more difficult or questionable.

Researchers into this detection methodology can also be divided into two types: those who mainly study the statistical properties of detection methods, and those who attempt practical application and urge cheating prevention. The latter group inevitably encounters more controversy.
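To make the model-based idea concrete, here is a minimal Python sketch. It is an illustration under simplified assumptions, not the method of the paper cited above (which also conditions on student ability and question-level answer behavior): it treats answers as independent draws from class-wide answer frequencies and computes the probability that a pair matches on at least as many answers as observed.

    # Minimal sketch of a model-based similarity test. Illustrative only:
    # NOT the Wesolowsky (2000) method; here answers are independent draws
    # from class-wide choice frequencies.

    def match_probabilities(freqs_per_question):
        """For each question, the probability that two independent examinees
        give the same answer: the sum over choices of p_choice squared."""
        return [sum(p * p for p in freqs) for freqs in freqs_per_question]

    def p_value_at_least_k_matches(match_probs, k):
        """P(at least k matches) when per-question matches are independent
        Bernoulli trials with differing probabilities (a Poisson binomial),
        computed exactly by dynamic programming."""
        dist = [1.0]  # dist[m] = P(exactly m matches so far)
        for p in match_probs:
            new = [0.0] * (len(dist) + 1)
            for m, pm in enumerate(dist):
                new[m] += pm * (1 - p)
                new[m + 1] += pm * p
            dist = new
        return sum(dist[k:])

    # Hypothetical example: 40 questions, 4 choices each, with the class-wide
    # choice frequencies below; the observed pair matched on 32 answers.
    freqs = [[0.55, 0.25, 0.12, 0.08]] * 40
    print(p_value_at_least_k_matches(match_probabilities(freqs), 32))

Real procedures must do considerably more, in particular accounting for the fact that all pairs are scanned, a point taken up in the postscript below.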

The reliability of statistical cheating detection has been demonstrated by a long history of application at testing institutions and at several universities. This experience has yielded some interesting observations. Common intuitive objections, such as the claim that unusual or significant similarity between student responses can result from 'studying together' rather than cheating, have been shown to be unfounded. If factors such as studying together (or sharing an educational or cultural background) could produce strong similarities in responses, statistical detection would have been discredited long ago: because all possible pairs are examined and many students share such characteristics, there should have been a huge number of pairs with statistically unexplainable similarities who could not have cheated because they wrote the test in, for example, different rooms. No such examples have been observed. It has, however, been observed that when proper precautions against cheating are implemented, statistically unusual similarities virtually disappear.

The popular assumption that about five feet of separation between writing desks, plus reasonably alert invigilation, virtually prevents cheating has been shown to be false. The detected cheating rate on multiple choice tests has been variously reported at between 3% and 10%, even under test-room conditions; due to limitations in detection, the actual rate is likely considerably higher. Despite this, neither statistical cheating detection nor certain known and very effective prevention measures are prevalent among educational institutions. Prevention measures, in order of importance, are multiple versions of tests (different orders of questions and/or answers), assigned randomized seating, and, as a developing requirement, measures against electronic communication between students during tests. This widespread lack of prevention is somewhat curious, because precautions against cheating on multiple choice examinations are generally far less intrusive on students than more publicized tools, like Turnitin, used to combat plagiarism.
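The first two prevention measures are simple to automate. The following Python sketch (data structures and names are hypothetical, for illustration only) scrambles question and answer order to produce a test version, and assigns randomized seating.

    # Illustrative sketch (hypothetical data structures) of the two main
    # prevention measures: multiple test versions and randomized seating.
    import random

    def make_version(questions, rng):
        """A test version with question and answer order shuffled; records
        where each question's correct answer moved to."""
        version = []
        for q in rng.sample(questions, len(questions)):  # shuffled question order
            choices = q["choices"][:]
            rng.shuffle(choices)                         # shuffled answer order
            version.append({"text": q["text"],
                            "choices": choices,
                            "key": choices.index(q["correct"])})
        return version

    def assign_seats(student_ids, seats, rng):
        """Random one-to-one assignment of students to seats."""
        return dict(zip(student_ids, rng.sample(seats, len(student_ids))))

    rng = random.Random()
    questions = [{"text": f"Q{i}", "choices": ["A", "B", "C", "D"], "correct": "A"}
                 for i in range(1, 41)]
    version = make_version(questions, rng)
    seating = assign_seats([f"s{i}" for i in range(1, 31)],
                           [f"R{r}C{c}" for r in range(1, 6) for c in range(1, 7)],
                           rng)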

Finally, it should be noted that copying is not the only way that some examinees and their helpers cheat. Examples of methods that may be immune to similarity analysis include using imposters to write tests, bringing in unauthorized aids, stealing and distributing answer keys, copying from examinees with perfect or nearly perfect answers, and altering responses after the test is written.


Postscript: Comments on bad methodology in statistical detection

Not infrequently, instructors confronted by a suspected cheating situation invent their own methodology on the spot. Usually this consists of some simple way of using the number of wrong answers that two students have in common: a count of such 'wrong matches', a proportion, a run length, a ratio with other counts, or some other such index. The distribution of the index is then often plotted for all pairs, and the outlier status of the suspected pair is demonstrated. Probability calculations offered in support are frequently incorrect or based on over-simplified assumptions.
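For concreteness, here are two of the ad hoc indices just described, sketched in Python. These are the measures this postscript argues against, shown only to make the discussion precise.

    # Two of the ad hoc indices described above, made concrete. These are
    # the measures this postscript warns against, not recommended practice.

    def wrong_match_count(a, b, key):
        """Questions both students answered wrong with the SAME wrong answer."""
        return sum(1 for x, y, k in zip(a, b, key) if x == y and x != k)

    def longest_matching_run(a, b):
        """Length of the longest streak of consecutive identical answers."""
        best = run = 0
        for x, y in zip(a, b):
            run = run + 1 if x == y else 0
            best = max(best, run)
        return best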

Many such indices, or measures, have been tried and are frequently rediscovered. Unfortunately, while they are intuitively appealing and usually appear to confirm blatant cheating, they can easily produce many false positives and lead to dangerous misuse of statistical detection. One reason is that the number of matches depends on the overall ability of the student pair (their marks), the number of choices on each question, the difficulty of each question, and the popularity of its particular wrong answers. Indices without a sound theoretical basis do not incorporate these factors properly and can be 'fooled'.

For example, a rather bad simple index is the percentage of errors in common that match. Another bad methodology is to plot the number of matching answers against the longest matching string of answers and to look for outliers. Obviously, an honest pair of 'brilliant' students with perfect scores would have the highest number of matches and the longest string of matching answers, so student ability confounds the interpretation of outliers. The other factors mentioned above confound such simple methodologies in the same way.
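The ability confound is easy to demonstrate by simulation. In the sketch below, under a deliberately simplified model (each student answers correctly with a fixed probability and otherwise picks a wrong choice at random), two honest strong students match far more often than two honest weak ones:

    # Simulation of the confound described above, under a deliberately
    # simplified model: honest high-ability pairs dominate the "outliers"
    # of a matches-vs-longest-run plot without any cheating.
    import random

    def simulate_student(ability, key, rng, n_choices=4):
        """Answer correctly with probability = ability; otherwise pick a
        wrong choice at random (question difficulty ignored for brevity)."""
        return [k if rng.random() < ability
                else rng.choice([c for c in range(n_choices) if c != k])
                for k in key]

    def matches(a, b):
        return sum(x == y for x, y in zip(a, b))

    rng = random.Random(7)
    key = [rng.randrange(4) for _ in range(40)]
    weak = [simulate_student(0.50, key, rng) for _ in range(2)]
    strong = [simulate_student(0.95, key, rng) for _ in range(2)]
    print("weak honest pair:  ", matches(*weak), "matches out of 40")
    print("strong honest pair:", matches(*strong), "matches out of 40")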

It is often not understood that class size influences what is unusual when indices are calculated for all pairs. Nor is it generally understood that the threshold of excessive similarity must be higher when a pair is selected by scanning all possible pairs than when it is selected by a triggering event, such as a report of suspicious behavior.
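The scanning effect can be quantified with a standard multiple-comparisons argument: with n students there are n(n-1)/2 pairs, so the per-pair threshold must shrink accordingly. The conservative Bonferroni correction below is one illustration of the idea, not the adjustment any particular software uses.

    # Why class size matters: scanning all pairs multiplies the chance of a
    # coincidence. A conservative Bonferroni correction illustrates this.

    def per_pair_alpha(overall_alpha, n_students):
        """Per-pair threshold so the chance of ANY false positive among
        all n*(n-1)/2 scanned pairs stays below overall_alpha."""
        n_pairs = n_students * (n_students - 1) // 2
        return overall_alpha / n_pairs

    # A pair flagged beforehand by a proctor's report can be tested at
    # 0.001 directly; the same similarity found by scanning a 300-student
    # class must clear a far smaller threshold:
    print(per_pair_alpha(0.001, 300))  # 0.001 / 44850, about 2.2e-08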

These factors severely limit the reliability and validity of simple indices either as evidence of academic dishonesty or as a method of estimating the extent of cheating.