When two binary variables are attempts by two individuals to measure the
same thing, you can use Cohen's Kappa (often simply called Kappa) as a
measure of agreement between the two individuals.

Kappa measures the percentage of data values in the main diagonal of the
table and then adjusts these values for the amount of agreement that could be
expected due to chance alone.

Two raters are asked to classify objects into categories 1 and 2. The table
below contains cell probabilities for a 2 by 2 table.

To compute Kappa, you first need to calculate the observed level of
agreement

This value needs to be compared to the value that you would expect if the
two raters were totally independent,

The value of Kappa is defined as
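
In one common notation (an assumption here, since the symbols used in this page's table are not reproduced), let $p_{ij}$ be the probability that the first rater chooses category $i$ and the second rater chooses category $j$, with row totals $p_{i+}$ and column totals $p_{+j}$. The observed agreement, the agreement expected under independence, and Kappa can then be sketched as:

```latex
\begin{aligned}
P_O &= p_{11} + p_{22} \\
P_E &= p_{1+}\,p_{+1} + p_{2+}\,p_{+2} \\
\kappa &= \frac{P_O - P_E}{1 - P_E}
\end{aligned}
```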

The numerator represents the discrepancy between the observed probability
of agreement and the probability of agreement under the assumption of an
extremely bad case, independence. The denominator, one minus the expected
agreement, is the largest value this discrepancy can take. Independence
implies that the pair of raters agree about as often as a pair of people who
effectively flip coins to make their ratings.

The maximum value for Kappa occurs when the observed level of agreement is
1, which makes the numerator as large as the denominator. As the observed
probability of agreement declines, the numerator declines. It is possible for
Kappa to be negative, but this does not occur often. In such a case, you
should interpret the value of Kappa to imply that there is no effective
agreement between the two raters.
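
The behavior described above can be illustrated with a minimal Python sketch. The function name `cohen_kappa` and the three probability tables are hypothetical, made up for illustration only; the calculation assumes the usual definition of Kappa as observed agreement minus expected agreement, divided by one minus expected agreement.

```python
def cohen_kappa(p):
    """Kappa for a 2x2 table of cell probabilities.

    p[i][j] is the probability that the first rater picks category i
    and the second rater picks category j (hypothetical layout).
    """
    observed = p[0][0] + p[1][1]                   # main diagonal
    row = [p[0][0] + p[0][1], p[1][0] + p[1][1]]   # first rater's marginals
    col = [p[0][0] + p[1][0], p[0][1] + p[1][1]]   # second rater's marginals
    expected = row[0] * col[0] + row[1] * col[1]   # agreement under independence
    # Raises ZeroDivisionError if expected agreement is exactly 1.
    return (observed - expected) / (1 - expected)


# Made-up tables, not data from any study.
perfect = [[0.5, 0.0], [0.0, 0.5]]          # raters always agree
independent = [[0.25, 0.25], [0.25, 0.25]]  # agreement at chance level
opposed = [[0.1, 0.4], [0.4, 0.1]]          # raters mostly disagree

kappa_perfect = cohen_kappa(perfect)
kappa_independent = cohen_kappa(independent)
kappa_opposed = cohen_kappa(opposed)

print(round(kappa_perfect, 3), round(kappa_independent, 3), round(kappa_opposed, 3))
```

The first table reaches the maximum of 1, the second gives a Kappa of 0, and the third gives a negative Kappa because the raters agree less often than chance would predict.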

How to interpret Kappa

Kappa is always less than or equal to 1. A value of 1 implies perfect
agreement and values less than 1 imply less than perfect agreement.

In rare situations, Kappa can be negative. This is a sign that the two
observers agreed less than would be expected just by chance.

It is rare that we get perfect agreement. Different people have different
interpretations as to what is a good level of agreement. Here is one possible
interpretation of Kappa, provided on page 404 of Altman DG (1991) Practical
Statistics for Medical Research. London, England: Chapman and Hall.

Poor agreement = Less than 0.20

Fair agreement = 0.20 to 0.40

Moderate agreement = 0.40 to 0.60

Good agreement = 0.60 to 0.80

Very good agreement = 0.80 to 1.00
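
The list above can be turned into a small lookup, sketched here in Python. The function name `altman_label` is hypothetical, and the tie-breaking rule is an assumption: the list does not say which category a value exactly on a boundary (such as 0.40) belongs to, so this sketch places it in the higher category.

```python
def altman_label(kappa):
    """Return the verbal label for a Kappa value, following the list above.

    Assumption: a value exactly on a boundary (0.20, 0.40, 0.60, 0.80)
    is assigned to the higher category.
    """
    if kappa > 1:
        raise ValueError("Kappa cannot exceed 1")
    if kappa < 0.20:
        return "Poor"       # includes negative values
    if kappa < 0.40:
        return "Fair"
    if kappa < 0.60:
        return "Moderate"
    if kappa < 0.80:
        return "Good"
    return "Very good"


for value in (-0.1, 0.21, 0.5, 0.85):
    print(value, altman_label(value))
```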

An example of Kappa

In an examination of self-reported prescription use and prescription use
estimated by electronic medical records

The following table represents the diagnosis of biopsies from 40 patients
with self-reported malignant melanoma. The rows represent the first
pathologist's diagnosis and the columns represent the second pathologist's
diagnosis. Compute Kappa.

Again, this is only a fair level of agreement. Notice that even though the
pathologists agree 70% of the time, they would be expected to have almost as
large a level of agreement (62%) just by chance alone.
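
As a rough check using only the rounded percentages quoted above (so the exact value from the full table may differ slightly), taking Kappa as observed agreement minus expected agreement, divided by one minus expected agreement, gives:

```latex
\kappa \approx \frac{0.70 - 0.62}{1 - 0.62} = \frac{0.08}{0.38} \approx 0.21
```

This falls just inside the 0.20 to 0.40 band labeled fair agreement.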

Using SPSS to compute Kappa

As before, select ANALYZE | DESCRIPTIVE STATISTICS | CROSSTABS from the
SPSS menu. In the dialog box, click on the STATISTICS button and then select
the Kappa option box.

At the bottom of the page is what the SPSS output would look like.

Further reading

I have a lot of references for kappa and the intraclass correlation
coefficient that I need to sort through.

Here's an interesting question related to this topic: Bill asks how to
determine if a sample size is adequate for estimating an intraclass
correlation.

The simplest approach is to see if the confidence interval that you have
produced (or will produce) is sufficiently narrow to meet your needs. The
confidence interval formulas are messy, but if you want to pursue this
further, Shoukri and Edge have a book that may help.

Here's an email that might make the basis for an Ask Professor Mean
question.

I met with you at the start of my dissertation and found your advice
very helpful. I am in the process of finishing up my data and have a quick
question that I thought you might could help with. I did behavioral
observations for my study, and had one person code all the data, and
another person code 20% of the data for reliability. I would like to use
the Kappa equation to determine the reliability between my coders. I know I
need to calculate four numbers: 1) total number agreements the behavior
occurred; 2) total number agreements the behavior did not occur; 3) number
of times coder A said yes and coder B said no, and 4) number of times coder
A said no and Coder B said yes. My question is what do I do with those
numbers to get a Kappa score? I know SPSS will do it if I enter all the
data--but that would be hundreds of data points per subjects, and would
take much longer than calculating it by hand. Any information you could
provide would be greatly appreciated. Thanks! Rebecca
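
The four counts described in the email are enough to compute Kappa without entering every observation. A minimal Python sketch follows; the function name `kappa_from_counts`, its parameter names, and the example counts are all hypothetical and made up for illustration, not taken from the email.

```python
def kappa_from_counts(both_yes, both_no, a_yes_b_no, a_no_b_yes):
    """Cohen's Kappa from the four cells of a 2x2 agreement table.

    both_yes    -- both coders said the behavior occurred
    both_no     -- both coders said it did not occur
    a_yes_b_no  -- coder A said yes, coder B said no
    a_no_b_yes  -- coder A said no, coder B said yes
    """
    n = both_yes + both_no + a_yes_b_no + a_no_b_yes
    observed = (both_yes + both_no) / n            # proportion of agreements
    a_yes = (both_yes + a_yes_b_no) / n            # how often coder A said yes
    b_yes = (both_yes + a_no_b_yes) / n            # how often coder B said yes
    # Agreement expected if the two coders were independent.
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    # Raises ZeroDivisionError if n is 0 or expected agreement is exactly 1.
    return (observed - expected) / (1 - expected)


# Made-up counts for illustration only.
example = kappa_from_counts(both_yes=40, both_no=30, a_yes_b_no=20, a_no_b_yes=10)
print(round(example, 2))
```

With these made-up counts the coders agree on 70 of 100 observations, half of which would be expected by chance, so the sketch returns a Kappa of about 0.4.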

Lantz CA and Nebenzahl E (1996) Behavior and interpretation of the
kappa statistics: resolution of the two paradoxes. Journal of Clinical
Epidemiology 49:431.

This page was written by Steve Simon while working at Children's Mercy
Hospital. Although I do not hold the copyright for this material, I am
reproducing it here as a service, as it is no longer available on the
Children's Mercy Hospital website.