Thursday, May 17, 2012

In my post on the decline effect in linguistics, the question came up of how I've calculated the exponents for the Exponential Model in my papers. I think this is a point worth clarifying, but it's not likely to be interesting to a broad audience. You have been forewarned.

To recap as briefly as possible: in English, when a word ends in a consonant cluster whose final consonant is /t/ or /d/, that /t/ or /d/ is sometimes deleted. This deletion can affect a whole host of different words, but the ones which have been of most interest to the field are the regular past tense (e.g., packed), the semiweak past tense (e.g., kept) and morphologically simplex words (e.g., pact), which I'll call mono. Other morphological cases which can be affected, and which I believe have occasionally and erroneously been categorized with the semiweak, are no-change past tense (e.g., cost), "devoicing" (or something) past tense (e.g., built), stem changing past tense (e.g., found), etc. For the sake of this post, I'm only looking at the main three cases: past, semiweak, and mono.

Now, Guy (1991) came up with a specific proposal where if you described the proportion of pronounced /t d/ for past as p, for semiweak as p^j, and for mono as p^k, then j = 2 and k = 3. It is specifically whether or not j = 2 and k = 3 that I'm interested in here. If you've calculated the proportions of pronounced /t d/ for each grammatical class, you can calculate j as log(semiweak) / log(past) and k as log(mono) / log(past). The trick is in how you decide to calculate those proportions.
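To make the arithmetic concrete, here is a minimal Python sketch of the log-ratio calculation. The retention rates are made up for illustration (they are not from the Buckeye data), and the function name `exponent` is my own.

```python
import math

def exponent(p_class: float, p_past: float) -> float:
    """Return x such that p_past ** x == p_class."""
    return math.log(p_class) / math.log(p_past)

# Hypothetical retention rates, chosen so the exponents come out at 2 and 3.
p_past = 0.9
p_semiweak = p_past ** 2
p_mono = p_past ** 3

j = exponent(p_semiweak, p_past)
k = exponent(p_mono, p_past)
print(round(j, 6), round(k, 6))
```

Because both logs are negative for proportions between 0 and 1, the ratio is positive, and a class with a lower retention rate than past gets an exponent greater than 1.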

For this post, you can play along at home. Here's code to get set up. It'll load the Buckeye data I've been using, and do some data prep.

So, how do you calculate the rate at which /t d/ are pronounced at the end of the word when you have a big data set from many different speakers? Traditional practice within sociolinguistics has been to just pool all of the observations from each grammatical class across all speakers.

So you come out with j = 1.91, k = 3.1, which is a pretty good fit to the proposal of Guy (1991).

The problem is that this isn't really the best way to calculate proportions like this. Some words are super frequent, and they therefore get more "votes" in the proportion of their grammatical class. And some speakers talk more than others, so they get more "votes" towards making the overall proportions look more similar to their own. One approach to ameliorate this is to first calculate the proportion for each word within a grammatical class within a speaker, then average those to get a proportion for each grammatical class within a speaker, then average across speakers within each grammatical class. Here's the code for this nested proportion approach.

All of a sudden, we're down to j = 1.34 and k = 2.05, and I haven't even dipped into mixed-effects models black magic yet.
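As a rough illustration of why the two approaches can diverge, here is a small Python sketch that computes pooled and nested proportions on a handful of invented tokens. Everything in it is an assumption for illustration: the speakers, words, and counts are hypothetical, and I am assuming that each nesting step is a plain unweighted mean.

```python
import math
from collections import defaultdict
from statistics import mean

def make_tokens(speaker, word, gram, n_retained, n_deleted):
    """Build (speaker, word, gram, retained) tuples; retained is 1 or 0."""
    return ([(speaker, word, gram, 1)] * n_retained
            + [(speaker, word, gram, 0)] * n_deleted)

# Hypothetical toy data: one very frequent mono word with heavy deletion.
tokens = (
    make_tokens("s1", "just", "mono", 10, 30)
    + make_tokens("s1", "pact", "mono", 4, 1)
    + make_tokens("s1", "packed", "past", 9, 1)
    + make_tokens("s2", "just", "mono", 2, 2)
    + make_tokens("s2", "pact", "mono", 3, 1)
    + make_tokens("s2", "packed", "past", 4, 1)
)

def pooled(tokens):
    """One proportion per class, every token counted equally."""
    by_class = defaultdict(list)
    for _, _, gram, retained in tokens:
        by_class[gram].append(retained)
    return {g: mean(v) for g, v in by_class.items()}

def nested(tokens):
    """Word within speaker, then speaker within class, then class."""
    by_word = defaultdict(list)
    for spk, word, gram, retained in tokens:
        by_word[(spk, gram, word)].append(retained)
    by_speaker = defaultdict(list)
    for (spk, gram, _), v in by_word.items():
        by_speaker[(spk, gram)].append(mean(v))
    by_class = defaultdict(list)
    for (_, gram), v in by_speaker.items():
        by_class[gram].append(mean(v))
    return {g: mean(v) for g, v in by_class.items()}

for name, props in (("pooled", pooled(tokens)), ("nested", nested(tokens))):
    k = math.log(props["mono"]) / math.log(props["past"])
    print(name, round(props["past"], 3), round(props["mono"], 3), round(k, 2))
```

In this toy data the frequent word drags the pooled mono proportion down, so the pooled estimate of k comes out much larger than the nested one.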

But when it comes to modeling the proposal of Guy (1991), calculating the proportions is really just a means to an end. I asked Cross Validated how to directly model j and k, and apparently you can do so using a complementary log-log link. So here is the mixed effects model for j and k directly.

Now we're a bit closer back to the original pooled proportions estimates, j = 1.57, k = 3.19.
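The reason a complementary log-log link gets at the exponents directly is that cloglog(x) = log(-log(1 - x)). If the modeled outcome is deletion, so that x = 1 - p^j, then cloglog(x) = log(j) + log(-log(p)), and the coefficient for a grammatical class relative to past is log(j). The Python sketch below only checks that identity numerically; it does not fit a mixed effects model, and it assumes deletion as the response and past as the reference level, with made-up values for p and j.

```python
import math

def cloglog(x: float) -> float:
    """Complementary log-log link: log(-log(1 - x))."""
    return math.log(-math.log(1.0 - x))

p = 0.9   # hypothetical retention rate for past
j = 2.0   # hypothetical exponent for semiweak

deletion_past = 1.0 - p
deletion_semiweak = 1.0 - p ** j

# On the cloglog scale, the difference between classes is log(j).
beta = cloglog(deletion_semiweak) - cloglog(deletion_past)
print(math.exp(beta))
```

Exponentiating the difference on the link scale recovers j, which is why the fixed effects from such a model can be read as estimates of j and k.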

My personal conclusion from all this is that the apparent j = 2, k = 3 pattern is driven mostly by the lexical effects of highly frequent words. This table recaps all of the results, plus the estimates of two more models: one with just a by-speaker random intercept, and a flat model, which looks just like the maximum likelihood estimate of the fully pooled approach, because it is.

| Method | j | k |
|---|---|---|
| Pooled | 1.91 | 3.1 |
| Nested | 1.34 | 2.05 |
| ~Gram+(Gram\|Speaker)+(1\|Word) | 1.38 | 2.11 |
| ~Gram+(Gram\|Speaker) | 1.57 | 3.19 |
| ~Gram+(1\|Speaker) | 1.84 | 3.14 |
| ~Gram | 1.91 | 3.1 |

The lesson is that it can matter a lot how you calculate your proportions.

Wednesday, May 16, 2012

It seems to me that in the past few years, the empirical foundations of the social sciences, especially Psychology, have been coming under increased scrutiny and criticism. For example, there was the New Yorker piece from 2010 called "The Truth Wears Off" about the "decline effect," or how the effect size of a phenomenon appears to decrease over time. More recently, the Chronicle of Higher Education had a blog post called "Is Psychology About to Come Undone?" about the failure to replicate some psychological results.

These kinds of stories are concerning at two levels. At the personal level, researchers want to build a career and reputation around establishing new and reliable facts and principles. We definitely don't want the result that was such a nice feather in our cap to turn out to be wrong! At a more principled level, as scientists, our goal is for our models to approximate reality as closely as possible, and we don't want the course of human knowledge to be diverted down a dead end.

Small effects

But, I'm a linguist. Do the problems facing psychology face me? To really answer that, I first have to decide which explanation for the decline effect I think is most likely, and I think Andrew Gelman's proposal is a good candidate:

The short story is that if you screen for statistical significance when estimating small effects, you will necessarily overestimate the magnitudes of effects, sometimes by a huge amount.

I've put together some R code to demonstrate this point. Let's say I'm looking at two populations, and unknown to me as a researcher, there is a small difference between the two, even though they're highly overlapping. Next, let's say I randomly sample 10 people from each population, do a t-test for the measurement I care about, and write down whether or not the p-value < 0.05 and the estimated size of the difference between the two populations. Then I do this 1000 more times. Some proportion (approximately equal to the power of the test) of the t-tests will have successfully identified a difference. But did those tests which found a significant difference also accurately estimate the size of the effect?

For the purpose of the simulation, I randomly generated samples from two normal distributions with standard deviation 1, and means 1 and 1.1. I did this for a few different sample sizes, 1000 times each. This figure shows how many times larger the estimated effect size was than the true effect for tests which found a significant difference. The size of each point shows the probability of finding a significant difference for a sample of that size.
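For anyone who wants to poke at this without R, here is a rough Python sketch of the same kind of simulation. It is not the code behind the figure. To stay within the standard library it swaps the t-test for a two-sided z-test that treats the standard deviation as known, and the function name, seed, and sample sizes in the loop are my own choices.

```python
import random
from statistics import NormalDist

def simulate(n, true_diff=0.1, sd=1.0, alpha=0.05, reps=1000, seed=1):
    """Repeat a two-sample z-test and summarise the significant results."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd * (2 / n) ** 0.5          # standard error of the mean difference
    significant = []
    for _ in range(reps):
        a = [rng.gauss(1.0, sd) for _ in range(n)]
        b = [rng.gauss(1.0 + true_diff, sd) for _ in range(n)]
        diff = sum(b) / n - sum(a) / n
        if abs(diff) / se > z_crit:
            significant.append(diff)
    if not significant:
        return {"power": 0.0, "type_m": float("nan"), "type_s": float("nan")}
    return {
        # share of runs that reached significance
        "power": len(significant) / reps,
        # how many times larger than the truth the significant estimates are
        "type_m": sum(abs(d) for d in significant) / len(significant) / true_diff,
        # share of significant estimates with the wrong sign
        "type_s": sum(d < 0 for d in significant) / len(significant),
    }

for n in (10, 50, 200):
    print(n, simulate(n))
```

With 10 observations per group, a difference has to be larger than about 0.88 to clear the 0.05 threshold in this setup, so every significant estimate overshoots the true difference of 0.1 several times over.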

So, we can see that for small sample sizes, the test has low power. That is, you are not very likely to find a significant difference, even though there is a true difference (i.e., you have a high rate of Type II error). Even worse, though, is that when the test has "worked," and found a significant difference when there is a true difference, you have both Type M (magnitude) and Type S (sign) errors. For small sample sizes (between 10 and 50 samples each from the two populations), the estimated effect size is between 5 and 10 times greater than the real effect size, and the sign is sometimes flipped!

Just choosing a smaller p-value threshold will help you out insofar as you will be less likely to conclude that you've found a significant difference when there is a true difference (i.e., you ramp up your Type II error rate by reducing the power of your test), but that doesn't do anything to ameliorate the size of the Type M errors when you do find a significant difference. This figure facets by different p-value thresholds.

So do I have to worry?

So, I think how much I ought to worry about the decline effect in my research, and linguistic research in general, is inversely proportional to the size of the effects we're trying to chase down. If the true size of the effects we're investigating is large, then our tests are more likely to be well powered, and we are less likely to experience Type M errors.

And in general, I don't think the field has exhausted all of our sledgehammer effects. For example, Sprouse and Almeida (2012) [pdf] successfully replicated somewhere around 98% of the syntactic judgments from the syntax textbook Core Syntax (Adger 2003) using experimental methods (a pretty good replication rate if you ask me), and in general, the estimated effect sizes were very large. So one thing seems clear. Sentence 1 is ungrammatical, and sentences 2 and 3 are grammatical.

(1) *What did you see the man who bought?

(2) Who did you see who bought a cow?

(3) Who saw the man who bought a cow?

And the difference in acceptability between these sentences is not getting smaller over time due to the decline effect. The explanatory theories for why sentence 1 isn't grammatical may change, and who knows, maybe the field will decide at some point that its ungrammaticality is no longer a fact that needs to be explained, but the fact that it is ungrammatical is not a moving target.

Maybe I do need to worry

However, there is one phenomenon that I've looked at that I think has been following a decline effect pattern: the exponential pattern in /t d/ deletion. For reasons that I won't go into here, Guy (1991) proposed that if the rate at which a word final /t/ or /d/ is pronounced in past tense forms like packed is given as p, the rate at which it is pronounced in semi-irregular past tense forms like kept is given as p^j, and the rate at which it is pronounced in regular words like pact is given as p^k, then j = 2, k = 3.

Here's a table of studies, and their estimates of j and k, plus some confidence intervals. See this code for how I calculated the confidence intervals.

| Study | Year | Dialect | j estimate [lower, upper] | k estimate [lower, upper] |
|---|---|---|---|---|
| Guy | 1991 | White Philadelphia | 2.37 [1.17, 4.74] | 2.75 [1.86, 4.26] |
| Santa Ana | 1992 | Chicano Los Angeles | 1.76 [1.35, 2.29] | 2.91 [2.51, 3.39] |
| Bayley | 1994 | Tejano San Antonio | 1.51 [1.11, 2.08] | 2.99 [2.52, 3.59] |
| Tagliamonte & Temple | 2005 | York, Northern England | 1.12 [0.66, 1.85] | 1.43 [1.04, 1.96] |
| Smith & Durham & Fortune | 2009 | Buckie, Scotland | 0.64 [0.24, 1.36] | 2.33 [1.53, 3.59] |
| Fruehwald | 2012 | Columbus, OH | 1.38 [0.76, 2.48] | 1.93 [1.59, 2.35] |

I should say right off the bat that none of these studies is a perfect replication of Guy's original study. They have different sample sizes, coding schemes, and statistical approaches. Mine, in the last row, is probably the most divergent, as I directly modeled and estimated the reliability of j and k using a mixed effects model, while the others calculated p^j and p^k and compared them to the maximum likelihood estimates for words like kept and pact.

But needless to say, estimates of j and k have not hovered nicely around 2 and 3.