Thursday, March 28, 2013

Introduction

This time it is about Cohen's Kappa, a measure of inter-rater agreement or reliability. Suppose we have two raters who each assign the same subjects or objects to one of a fixed number of categories. The question then is: How well do the raters agree in their assignments? Kappa provides a measure of association; the largest possible value is one, the smallest is minus one, and it has a corresponding statistical test for the hypothesis that agreement is only by chance. Cohen's kappa and similar measures are widely used in, among other fields, medicine and biostatistics. In one class of applications, raters are doctors, subjects are patients and the categories are medical diagnoses. Cohen's Kappa provides a measure of how well the two sets of diagnoses agree, and a hypothesis test of whether the agreement is purely random.
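To make the definition concrete, here is a minimal sketch of unweighted kappa in plain Python. The function name `cohens_kappa_simple` and the counts in the table are made up for illustration; this is not the statsmodels implementation and it does not include the hypothesis test.

```python
def cohens_kappa_simple(table):
    """Unweighted Cohen's kappa for a square table of counts.

    table[i][j] is the number of subjects that rater A assigned to
    category i and rater B assigned to category j.
    (Hypothetical helper, for illustration only.)
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # observed agreement: share of subjects on the diagonal
    p_observed = sum(table[i][i] for i in range(k)) / n
    # agreement expected by chance, from the two raters' marginals
    p_chance = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)


# made-up counts: two raters, two categories, 50 subjects
table = [[20, 5],
         [10, 15]]
kappa = cohens_kappa_simple(table)
print(round(kappa, 4))  # 0.4

# the two extreme cases for this kind of table
kappa_perfect = cohens_kappa_simple([[25, 0], [0, 25]])    # complete agreement
kappa_opposite = cohens_kappa_simple([[0, 25], [25, 0]])   # complete disagreement
```

In this example the raters agree on 70% of the subjects, while 50% agreement would be expected by chance given the marginals, so kappa is (0.7 - 0.5) / (1 - 0.5).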

Most of the following focuses on weighted kappa, and the interpretation of different weighting schemes. In the last part, I add some comments about R, which provided me with several hours of debugging, since I'm essentially an R newbie and have not yet figured out some of its "funny" behavior.

Monday, March 25, 2013

Introduction

Statistical tests are often grouped into one-sample, two-sample and k-sample
tests, depending on how many samples are involved in the test. In k-sample
tests the usual Null hypothesis is that a statistic, for example the mean
as in a one-way ANOVA, or the distribution in goodness-of-fit tests, is the
same in all groups or samples. The common test is the joint test that all
samples have the same value, against the alternative that at least one sample
or group has a different value.

However, often we are not just interested in the joint hypothesis that all
samples are the same, but we would also like to know for which pairs of
samples the hypothesis of equal values is rejected. In this case we conduct
several tests at the same time, one test for each pair of samples.

This results in a multiple testing problem,
and we should correct our test distribution or p-values to account for this.
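As a rough illustration of the problem, the sketch below counts the pairs for k = 4 samples and applies a Bonferroni correction to a set of p-values. Note the assumptions: the p-values are made up, the helper `bonferroni` is hypothetical, and Bonferroni is used only because it is the simplest correction to write down; it is not Tukey's HSD.

```python
from itertools import combinations


def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


groups = ["a", "b", "c", "d"]
pairs = list(combinations(groups, 2))  # k * (k - 1) / 2 = 6 pairs for k = 4

# made-up raw p-values, one per pairwise test (illustration only)
raw_pvalues = [0.004, 0.03, 0.2, 0.01, 0.6, 0.04]
adjusted = bonferroni(raw_pvalues)

alpha = 0.05
n_reject_raw = sum(p < alpha for p in raw_pvalues)
n_reject_adjusted = sum(p < alpha for p in adjusted)

for pair, p_raw, p_adj in zip(pairs, raw_pvalues, adjusted):
    print(pair, p_raw, round(p_adj, 3), p_adj < alpha)
```

With these made-up numbers, four of the six pairs would be rejected at the 5% level without any correction, but only one after the adjustment.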

I mentioned some of the one- and two-sample tests in statsmodels before. Today,
I just want to look at pairwise comparison of means. We have k samples
and we want to test for each pair whether the mean is the same, or not.

Instead of adding more explanations here, I just want to point to
an R tutorial
and also the brief description on Wikipedia. A search for "Tukey HSD" or
multiple comparison on the internet will find many tutorials and explanations.

The following are examples in statsmodels and R interspersed with a few explanatory comments.

Sunday, March 17, 2013

Last week I merged a branch of mine into statsmodels that contains large parts of
basic power calculations and some effect size calculations. The documentation is in
this section.
Some parts are still missing, but I think I have worked enough on this for a while.

(Adding the power calculation for a new test now takes approximately:
3 lines of real code, 200 lines of wrapping it with mostly boilerplate and docstrings,
and 30 to 100 lines of tests.)

The first part contains some information on the implementation. In the second part, I compare
the calls to the function in the R pwr package to the calls in my (statsmodels') version.

I am comparing it to the pwr package because I ended up writing almost all unit tests
against it. The initial development was based on the SAS manual,
I used the explanations on the G-Power website for F-tests, and some parts were initially
written based on articles that I read. However, during testing I adjusted the options
(and fixed bugs) so that I was able to match the results to pwr. I think pwr has
just the right level of abstraction and ease of use, so I ended up with code that is
pretty close to it.
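For readers who have not seen a power calculation before, here is a minimal sketch of what the "real code" part computes in the simplest case. It assumes a two-sided one-sample z-test with a normal approximation; the function name `ztest_power` is hypothetical, and this is neither the statsmodels nor the pwr interface.

```python
from math import sqrt
from statistics import NormalDist


def ztest_power(effect_size, nobs, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect_size is the standardized distance between the alternative
    and the null, (mean_alt - mean_null) / std.
    (Hypothetical helper, normal approximation only.)
    """
    std_normal = NormalDist()
    crit = std_normal.inv_cdf(1 - alpha / 2)   # two-sided critical value
    shift = effect_size * sqrt(nobs)           # mean of the statistic under the alternative
    # probability that the test statistic lands in either rejection region
    return std_normal.cdf(shift - crit) + std_normal.cdf(-shift - crit)


power = ztest_power(effect_size=0.5, nobs=32)
power_more_obs = ztest_power(effect_size=0.5, nobs=64)
size = ztest_power(effect_size=0.0, nobs=32)   # no effect: power equals alpha
print(round(power, 3), round(power_more_obs, 3), round(size, 3))
```

The tests for small samples use the noncentral t distribution instead of the normal, so the numbers from this sketch will be close to, but not the same as, the t-test results.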

Friday, March 15, 2013

Effect Size

I have been working on and off for a while now on adding statistical power
calculations to statsmodels. One of the topics I ran into is the
effect size.

At the beginning, I wasn't quite sure what to make of it. While I was
working on power calculations, it just seemed to be a convenient way of
specifying the distance between the alternative and the null hypothesis.
However, there were references that sounded like it's something special
and important. This was my first message to the mailing list.

Scaling Issues

Today I finally found some good motivating quotes:

"In the behavioral, educational, and social sciences (BESS), units of measurement are many
times arbitrary, in the sense that there is no necessary reason why the measurement instrument
is based on a particular scaling. Many, but certainly not all, constructs dealt with in the
BESS are not directly observable and the instruments used to measure such constructs do
not generally have a natural scaling metric as do many measures, for example, in the physical
sciences."

and

"However, effects sizes based on raw scores are
not always helpful or generalizable due to the lack of natural scaling metrics and multiple
scales existing for the same phenomenon in the BESS. A common methodological suggestion
in the BESS is to report standardized effect sizes in order to facilitate the interpretation of
results and for the cumulation of scientific knowledge across studies, which is the goal of
meta-analysis (<...>). A standardized effect size is an effect size that describes the size of the effect
but that does not depend on any particular measurement scale."

The two quotes are from the introduction in "Confidence Intervals for Standardized Effect Sizes: Theory, Application, and Implementation" by Ken Kelley http://www.jstatsoft.org/v20/a08 .
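The point about scaling can be checked with a small sketch. It computes Cohen's d with a pooled standard deviation for two made-up samples, then again after changing the unit of measurement by a factor of 100. The data and the helper name `cohens_d` are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


# made-up measurements on an arbitrary scale (illustration only)
treated = [5.1, 6.0, 5.6, 6.4, 5.9]
control = [4.8, 5.2, 5.0, 5.9, 5.1]

# the same measurements reported in a unit that is 100 times smaller
treated_rescaled = [100 * v for v in treated]
control_rescaled = [100 * v for v in control]

raw_diff = mean(treated) - mean(control)
raw_diff_rescaled = mean(treated_rescaled) - mean(control_rescaled)
d_raw = cohens_d(treated, control)
d_rescaled = cohens_d(treated_rescaled, control_rescaled)

print(raw_diff, raw_diff_rescaled)   # raw effect depends on the unit
print(d_raw, d_rescaled)             # standardized effect does not
```

The raw difference in means changes by the factor of 100, while the standardized effect size is the same up to floating point error.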

Large parts of the literature that I was browsing or reading on this are in psychology journals.
This can also be seen in the list of references in the Wikipedia page on effect size.

One additional part that I found puzzling was the definition of "conventional" effect sizes by Cohen.
"For Cohen's d an effect size of 0.2 to 0.3 might be a "small" effect, around 0.5 a "medium" effect and 0.8 to infinity, a "large" effect." (sentence from the Wikipedia page)

"Small" what? small potatoes, small reduction in the number of deaths, low wages? or, "I'm almost indifferent" (+0 on the mailing lists)?

Where I come from

Now it's clearer why I haven't seen this in my traditional area, economics and econometrics.

Although economics falls into BESS, in the SS part, it has a long tradition of getting a common scale, money. Physical units also show up in some questions.

National Income Accounting tries to measure the economy with money as a unit. (And if something doesn't have a price associated with it, then it's ignored by most. That's another problem. Or, we make a price.) There are many measurement problems, but there is also a large industry to figure out common standards.

Effect sizes have a scale that is "natural":

What's the increase in lifetime salary, if you attend business school?

What's the increase in sales (in Dollars, or physical units) if you lower the price?

What's the increase in sales if you run an advertising campaign?

What's your rate of return if you invest in stocks?

Effects might not be easy to estimate, or cannot be estimated accurately, but we don't need a long debate about what to report as effect.

(ii) I started my last round of work on this because I was looking at effect size as a distance measure for a chi-square goodness-of-fit test. When the sample size is very large, small deviations from the Null Hypothesis will cause a statistical test to reject the Null, even if the effect, the difference from the Null, is for all practical purposes irrelevant. My recent preferred solution to this is to switch to equivalence tests or something similar: not testing the hypothesis that the effect is exactly zero, but testing whether the effect is "small" or not.
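The large-sample issue can be illustrated with a small sketch. It uses Cohen's w as the effect size for a goodness-of-fit test and assumes, for simplicity, that the observed frequencies match the alternative probabilities exactly, so that the chi-square statistic equals nobs * w**2. The probabilities and the helper name `cohens_w` are made up for illustration.

```python
from math import sqrt


def cohens_w(probs_null, probs_alt):
    """Cohen's w: distance between alternative and null cell probabilities."""
    return sqrt(sum((p1 - p0) ** 2 / p0 for p0, p1 in zip(probs_null, probs_alt)))


# made-up example: four equally likely cells under the null,
# and a tiny deviation from that under the alternative
probs_null = [0.25, 0.25, 0.25, 0.25]
probs_alt = [0.26, 0.24, 0.25, 0.25]
w = cohens_w(probs_null, probs_alt)

# 5% critical value of the chi-square distribution with 3 degrees of freedom
CRIT_CHI2_DF3 = 7.815

# chi-square statistic if the observed frequencies equal probs_alt exactly
chi2_by_nobs = {nobs: nobs * w ** 2 for nobs in (100, 10_000, 1_000_000)}
for nobs, chi2_stat in chi2_by_nobs.items():
    print(nobs, round(w, 4), round(chi2_stat, 2), chi2_stat > CRIT_CHI2_DF3)
```

The effect size stays at about 0.028 throughout, which is far below the conventional "small" value, but the test statistic grows linearly with the sample size and eventually exceeds the critical value.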

(iii) I have several plans for blog posts (cohens_kappa, power onion) but never found the quiet time or urge to actually write them.