Nathan Danneman (a co-author and one of my graduate students from Emory) recently sent me a New Yorker article from 2010 about the “decline effect,” the tendency for initially promising scientific results to get smaller upon replication. Wikipedia can summarize the phenomenon as well as I can:

In his article, Lehrer gives several examples where the decline effect is allegedly showing. In the first example, the development of second generation anti-psychotic drugs, reveals that the first tests had demonstrated a dramatic decrease in the subjects’ psychiatric symptoms. However, after repeating tests this effect declined and in the end it was not possible to document that these drugs had any better effect than the first generation anti-psychotics.

Experiments done by Jonathan Schooler were trying to prove that people describing their memories were less able to remember them than people not describing their memories. His first experiments were positive, proving his theory about verbal overshadowing but repeated studies showed a significant declining effect.

In 1991, Danish zoologist Anders Møller discovered a connection between symmetry and sexual preference of females in nature. This sparked a huge interest in the topic and a lot of follow-up research was published. In three years following the original discovery, 90% of studies confirmed Møller’s hypothesis. However, the same outcome was published in just four out of eight research papers in 1995, and only a third in next three years.

Why would a treatment that shows a huge causal effect in an experiment seem to get weaker when that experiment is repeated later on? “‘This was profoundly frustrating,’ he [Schooler] says. ‘It was as if nature gave me this great result and then tried to take it back.’”

Regression to the mean: As the number of data points increases, we expect the average values to regress to the true mean…and since often the initial work is done on the basis of promising early results, we expect more data to even out a fortuitously significant early outcome.

The file drawer effect: Results that are not significant are hard to publish, and end up stashed away in a cabinet. However, as a result becomes established, contrary results become more interesting and publishable.

These are common, well-known and well-understood phenomena. But as far as I know, no one’s really tried to formally assess the impact of these phenomena or to propose any kind of diagnostic of how susceptible any particular result is to these threats to inference.

Let’s start with a simple example. Suppose that the data generating process is , where . If we repeatedly generate data sets of size 1000 out of this DGP, run an appropriate linear model , and save only those estimated coefficients that are statistically significant in a one-tailed test, .

In short, we find that none of the statistically significant results are near the actual coefficient of 0.5. In fact, the statistically significant coefficients are biased upward (the mean coefficient is 1.20 1.008 in this simulation). This makes sense: only the largest slopes are capable of overcoming the intrinsic noise in the DGP and being detected at this sample size (1000).

What does this mean? Well… the estimator is not itself intrinsically biased: if you plotted all the coefficients from our 1000 simulated samples, they would be normally distributed around the true mean of 0.5 with appropriate variance. But we’re not talking about the distribution of an estimator given a true value, ; we’re talking about the distribution of scientifically notable, publishable results . This is the distribution of results we expect to see in journal articles and in the media. And that distribution is biased because the scientific review process requires that results reach a certain signal-to-noise ratio (viz., a p-value smaller than 0.05) before they deserve scientific attention: .

In short: when you look at what’s in journal articles or scientific magazines, you’re looking at results that are biased. Not that this is a terrible thing: we are imposing this bias so as to avoid printing a stream of chance results or reams of uninformative non-effects (do we need to know that the price of tea in China is not related to rainfall in London?).

In fact, given the size of the underlying effect, I can tell you precisely how large we should expect that bias to be. The analysis below is for a normally distributed with a standard error of 0.5.

So: the smaller the true coefficient, the larger we expect a statistically significant (and positive) estimate of that coefficient to be. The lesson is relatively straightforward: comparatively small relationships are very likely to be overestimated in the published literature, but larger relationships are more likely to be accurately estimated.

The bias plot I just produced is neat, but the x-axis shows the true beta value—which, of course, we cannot know in advance. It explains the overall pattern of overestimated effects in journal articles and medical studies, but doesn’t give us a way to assess any particular result. It would be better, I think, for us to have some guess about the probability that the coefficient we are seeing in a journal article is biased upward. I think we can get that, too.

This block of code illustrates the gist of how I would proceed. I simulate a data set (of size 1000) out of the DGP , , and superimpose the posterior distribution of (using the frequentist procedure, and therefore implying a flat prior) from that regression onto the bias curve computed for this application.

The idea is to use the posterior distribution of as our best guess about the underlying state of the world, then to infer the expected bias in the published literature on the basis of this distribution. In principle, this can be done with any appropriate data set for the study of ; for example, I could imagine collecting data, running an analysis, finding a statistically insignificant , and then constructing this plot to determine the distribution of this relationship that I will see published in extant or even future studies! But it’s probably more likely that someone will look at a published posterior, and then infer something about the likelihood that this published result overstates the actual effect.

That is, we would calculate .

The inner integral is the expected bias of the estimated coefficient given its true value, and the outer integral calculates this expectation over the posterior.

I created some R code to calculate this expectation, and then applied it to the posterior distribution above.

The result: an average bias of 0.352 0.299. What this tells us is that, given the range of true values of with which this result is consistent, we would on average expect it to overstate the true value by about 40% 33%.

Note again: the estimator itself is not biased, or is anything about the data set wrong. What this tells us is that, given the “gating” process in publishing that is imposed by statistical significance testing, this particular published result is likely to overstate the true value of . If all posteriors were equally likely to be published, the distribution of published estimates would be symmetric about and there would be no bias.

But not all results are equally susceptible: larger and more certain published results are more likely to accurately represent the actual value of because deviations of that are smaller and larger than are equally likely to be published, as we see by computing the expected bias for a coefficient of 2 instead of 0.895: the expected bias drops to 0.023 0.016.

So, why do effects weaken over time? As PZ Myers said, “However, as a result becomes established, contrary results become more interesting and publishable.” What that means is that the statistical significance threshold for publication disappears, and the publication distribution shifts from to ; as we have already seen, the first distribution is going to exaggerate the size of effects whereas the second will reflect them accurately, and thus subsequent publications will indicate a weakened effect.

What I hope that I’ve done here is to create a diagnostic tool to identify possibly problematic results before the replication happens.

Comments and criticisms welcome! I’m still figuring out whether any of this makes any sense.

[Update 2/27/2013, 10:27 AM CST: I made a couple of very small edits to the text, source code, and pictures. The text edits were minor LaTeX errors and clarifications. I also changed the code so as to reflect the standard error of the estimated result, rather than a close but not quite , in the third plot. Blogging at midnight be hard, yo.]

Does integrating prior information directly into the analysis, via the usual Bayesian techniques, address the problem in such a way that we can simply read the information directly off a plot of the posterior distribution?

If the answer isn’t just “employ Bayesian methods and look at a posterior,” is there a better way of communicating how much we can learn (in a scientific sense) ?

To answer the first question: it all depends, I suppose, on exactly how we think about Bayesian statistics. There’s no question in my mind that the rational belief distribution about is given by Bayes’ rule, and thus the posterior is in some sense the “right” set of beliefs about a relationship given priors and evidence. And yet…

It’s extremely common, in both frequentist and Bayesian circles, to report 95% confidence intervals (credible regions in the Bayesian parlance, i.e. when they refer to posteriors that integrate a prior). Several methodologists in multiple disciplines have proposed reporting 95% CIs as an alternative to traditional hypothesis testing with p-values or t-ratios. It’s an idea that makes a lot of sense to me, in that it does a better job of communicating the true degree of uncertainty in a coefficient estimate and (perhaps) steers us away from cutpoint-style judgments.

However, the coverage of 95% credible regions with a properly specified prior is still surprisingly uninformative about the underlying state of the world. To demonstrate this, I’ve created an R script that simulates data from a simple linear model:

where . I generated data from two states of the world, and . Note that I will be assuming that the state of the world is a point, but that our uncertain beliefs about that world are a probability distribution.

I generate 5,000 data sets with 100 observations each from the two DGPs, then calculate the proportion of the time that the 95% credible region of a properly formulated posterior distribution actually covers zero. I do this for four different normal prior distributions, all centered on but with different levels of uncertainty in the prior (different standard deviations on the normal prior).

I then calculated:

This gives the proportion of the time that when the 95% credible region excludes zero, a measure of how strongly informative that indicator is of the true state of the world. The result is plotted below.

This graph tells us the pattern of coverage of 95% CIs we would expect to see if the true coefficient () is on the x-axis. As prior beliefs are more certain, the coverage of 95% credible intervals becomes more dispositive. Putting this another way: the ratio of true positives to false positives is improved by narrower priors in this demonstration, such that 95% credible regions that exclude zero are more strongly dispositive. This results may be curious to some: why are tighter priors on the null associated with stronger inferences away from the null? The reason is that tighter priors make it harder to exclude 0 from the 95% CIs, which makes such a result more informative. That is: 0 is excluded less often when , but even less often when . It’s just the size-power tradeoff in another guise!

So: for reasonably weak coefficients or reasonably uncertain priors, it appears that we do not learn very much about the state of the world from a 95% CI that excludes zero. Even when the true , a 95% CI that excludes zero is still very consistent with no effect whatsoever. Specifying stronger priors does improve the situation, but that only makes sense if the priors reflect actual certainty about the truth of the null. If we are uncertain about , we would be better off incorporating that uncertainty into the prior and then recognizing that a 95% credible interval is not especially informative about the state of the world.

What to do? Perhaps there’s a way to communicate how much information any given result contains. I have a procedure that I think makes sense. I’ll start in the frequentist framework: no priors (or flat priors, if you prefer). Rather than changing the certainty of the prior, I’ll adjust the underlying chance that the world is one where the true value of is zero:

This is a frequentist way of approaching uncertainty about the state of the world: we say that the data generating process is drawn from a population of DGPs where there is some proportion for which and a complementary proportion for which it isn’t. We then look at the sample of cases from this population where the 95% CI excludes zero, and see how much of this sample includes cases for which . An informative result is one that is highly inconsistent with the null hypothesis—no matter how likely the null was to be drawn a priori.

If this all sounds a bit confusing, that’s probably because frequentist resampling logic is itself a little confusing. Or because I’ve messed something up.

Use Monte Carlo analysis to simulate 2,000 data sets from the data-generating process assuming that , and 2,000 data sets from the data-generating process assuming that .

Compute the proportion of the time that the 95% CI excludes zero in both cases.

Compute | CI excludes zero) for a variety of different underlying proportions of , then graph one against the other.

This is the procedure used to create the plot depicted below.

This graph was generated from an analysis of a single data set, sample size 100, generated out of the DGP , . As you can see, we do learn something from this analysis—but not as much as we might wish. If the underlying state of the world has a even a moderate proportion of samples where , then we expect a fair proportion of results whose CIs exclude zero to come from cases where . In short, if the underlying “population” of data generating processes includes even a moderate proportion of null DGPs, then finding a result where the 95% CI excludes zero doesn’t tell us much about our particular draw. For example, if there’s a 50% chance of drawing a null DGP from the population, then we expect about 20% of the cases where the 95% CI excludes zero to be null DGPs. Another way of putting this: if you have a result where the 95% CI excludes zero, and you consider this result a random draw out of the subpopulation of all cases where the 95% CI excludes zero, there’s a pretty good shot that you’ve randomly drawn a null DGP. This is far from the usual 5% threshold we use for “statistical significance.”

As you would expect, results get better if a larger sample size is used. If I repeat the analysis above with a data set of size 1000…

The sharper bend in the curve is reflective of the fact that these results are more convincing: the narrower variance in the estimated results in narrower 95% CIs which in turn are better at detecting true positives. Consequently, even when the null is likely a priori, a 95% CI that excludes zero is highly inconsistent with a truthful null hypothesis. Of course, if this is an initial study or one that uses an inductive search procedure—such that we expect the DGP to come from a population of mostly null hypotheses—even this finding is not wholly dispositive. In a population of 80% null DGPs, about 15% of the samples that exclude 0 from the 95% CIs will be non-null.

The procedure can be adapted to Bayesian inference with priors instead of frequentist resampling… which makes the interpretation a little more straightforward.

Use Bayesian methods (e.g., MCMCregress) to recover parameter estimates for and associated 95% confidence intervals using your prior of choice.

Use Monte Carlo analysis to simulate 2,000 data sets from the data-generating process assuming that , and 2,000 data sets from the data-generating process assuming that .

Run Bayesian analyses on all the Monte Carlo data sets using a mean-zero, specified-variance prior () to compute the proportion of the time that when the 95% CI excludes zero.

Repeat steps 2-3 for a range of values, computing the proportion each time.

Plot the proportion of the time that when the 95% CI excludes zero against the values of .

This graph (for the same data and analysis as the earlier plot) looks like this:

The x-axis shows the diffuseness of the prior belief distribution about (the precision, where higher numbers indicate narrower, less diffuse priors), and the y-axis shows the proportion of (simulated) cases for which out of the total number of cases where 0 is not a part of the 95% credible interval. As you can see, more diffuse priors lead to a less-informative 95% credible interval, just as one would expect from the previous examples.

As far as I can tell, the disadvantages of the fully Bayesian method are computational (it takes forever to compute all these points) and precision-related (the computational time means that fewer draws are used to compute each point in the graph, leading to greater error).

In conclusion: the plot that I’ve proposed might be a valid way to communicate to a reader precisely how much information is contained in a statistical result. One common theme: if we have diffuse priors (or expect that our analysis comes out of a population with mostly null DGPs), a single statistical result doesn’t individually say much. Even a good one! But, as more studies are conducted and our priors become narrower (or our knowledge of the population of DGPs indicates fewer nulls), each result becomes more important and informative.

I’ve signed up my blog to be a part of the R-bloggers aggregator site, a “blog of blogs” specifically about statistical analysis in the R programming environment. I highly recommend subscribing to its RSS feed, if you have a reader. There are a lot of interesting posts that I find there each week!

Many political scientists think about the possibility that outcomes are path-dependent, or that the power of X to change Y depends on a contextual factor Z. But in my reading of the literature, not very many (quantitative) political scientists are thinking about the possibility that path dependence changes the association between X and Y.

Jackie Demeritt and I have put together a couple of papers about this idea, which we call state-dependence. The concept is pretty simple: the marginal effect of on depends on the current state of the system, . (It’s also true, as a consequence of Young’s theorem, that the relationship between and depends on .)

I think that this idea is already pervasive in the thinking of social scientists, but not necessarily recognized as such and translated into an appropriate empirical model. For example, economists have known for years that the effect of monetary policy () on growth () depends on the current state of the economy (): the Fed is really good at using interest rates to control inflation and growth during boom times, but terrible at using interest rates to decrease unemployment/increase growth during recessions. The problem is so well-known that it comes with its own analogy: interest rates are like a string, good at “pulling” down an overheated economy but bad at “pushing” up a depressed one.

Our first paper works out the methodological implications of this idea, proposes an accompanying empirical model, verifies that this model works at detecting and modeling state-dependence, and then does a quick example application to presidential approval ratings and macroeconomic performance. In brief, we find that simply interacting the lagged dependent variable with the independent variable of interest works:

Furthermore, the statistical significance of is a decent dispositive indicator of state dependence. The standard techniques for presenting marginal effects in interaction models are accurate and effective at presenting the nature and degree of state-dependence in the DGP. There are far more details in the paper, including the surprising fact that this model creates latent interaction among all the variables of the model, and I hope you’ll check it out.

The second paper applies our state-dependence model to the question of whether “naming and shaming” by international human rights groups (e.g., the UN Commission on Human Rights) actually hurts foreign aid receipts for the condemned state. Somewhat surprisingly, we find evidence that condemnation tends to reinforce pre-existing relationships in the international network. That is, weak ties that are mostly humanitarian and symbolic become weaker (aid from these relationships falls) and strong ties that are based on mutual political benefit become stronger (aid from these relationships stays the same or rises). Although obviously not much can be made out of our one finding (actually two findings, since we verify the relationship in two different data sets), that’s a pretty interesting statement about the role that morality plays in international politics.

I hope you’ll find some time to take a look at our papers and offer us helpful feedback!

Just like a lot of political science departments, here at Rice a group of faculty and students meet each week to discuss new research in political methodology. This week, we read a new symposium in Political Analysis about the pre-registration of studies in political science. To briefly summarize, several researchers argued that political scientists should be required, or at least encouraged, to publicly announce their data collection and analysis plans in advance. The idea is that allowing researchers to adjust their analysis plans after collecting the data allows for some degree of opportunism in the analysis, potentially allowing researchers to find statistically significant relationships even if none exist. As usual, we don’t have any original ideas in political science: this is something that medical researchers started doing after evidence suggested that false positives were rather common in the medical literature.

To me, the discussion of study registration raises a more fundamental question: what can we hope to learn from a single data analysis? It’s a question whose answer ultimately depends on even deeper epistemological questions about how we know things in science, and how new discoveries are made. And there’s no way I can answer such a question in a short blog post. Suffice it to say that I am skeptical that we can arrive at any conclusion on the basis of a single study, even if it is pre-registered and perfectly conducted.

But there is a closely related question that I think can be answered in a short blog post. Nathan Danneman and I have recently written a paper arguing that combining assessment of the substantive robustness of a result along with its statistical significance reduces the false positive rate. In short, we find that when a relationship doesn’t really exist, it’s quite unlikely that a sample data set will show results that are both substantively robust and statistically significant. (Substantive robustness is technically defined in our paper, but for the present it suffices to note that the substantive robustness of a result is related to its size and certainty.)

There is one thing that we don’t ask: how much can we learn from a statistically significant result, given its size and its statistical significance? I’ll consider the case of a basic linear regression, . I want to know the probability that is equal to zero given that the result is statistically significant:

Now in some ways, this is an unsatisfying statement of the problem: the probability that any point estimate is true is . What I’ve done is to simplify the problem by partitioning the space of possibilities into two discrete choices: and . This roughly corresponds to the two possibilities in a frequentist hypothesis test: we can conclude that the result is not consistent with the null hypothesis, or we can conclude that it is. We could increase the complexity of the problem by (for example) using two intervals on the continuous parameter space, and , where k is the minimum size threshold for some hypothesis of concern, but nothing about the point I’m about to make would change (I think!). So for expositional purposes, this simplification is fine.

Bayes’ rule tells us that:

Here, is our prior belief that the null hypothesis is true.

Suppose we have a data set with 100 observations, run an analysis, and get an estimate . How should we update our belief about the probability of the null? Well… we need to calculate and , and we can do that using a Monte Carlo simulation. I ran 1000 simulations of data creation, linear modeling, and hypothesis test under two conditions: , where was a constant, and . I then calculated the updated probability of the null using my results and Bayes’ rule. I chose a range of values for to simulate a range of possible signal strengths in the data generating process. The true DGP was , where . The R code for this simulation is here.

The results are depicted in the graph below.

Now, take a look at this. Our maximum likelihood, squared-error-minimizing guess about is , and so we can interpret this graph as the updated belief that one should have about the probability of the null hypothesis upon seeing a result of the size on the X-axis. If you find that , and you believed that there was a 50/50 chance of the null being false before you started the analysis, you should now think that there’s a 60/40 chance of the null being false. That means there’s a 40% chance that the null is true, given your result! That’s pretty amazing.

And that’s with a rather liberal prior! If, like most political scientists, you start out being much more skeptical—a 90% chance that the null is true—even finding a wouldn’t get you anywhere near to a 95% posterior belief that the null is false.

So, what can we conclude? First, a small magnitude but statistically significant result contains virtually no important information. I think lots of political scientists sort-of intuitively recognize this fact, but seeing it in black and white really underscores that these sorts of results aren’t (by themselves) all that scientifically meaningful. Second, even a large magnitude, statistically significant result is not especially convincing on its own. To be blunt, even though such a result moves our posterior probabilities a lot, if we’re starting from a basis of skepticism no single result is going to be adequate to convince us otherwise.

And this brings me back to my first point: study pre-registration. I hope that my little demonstration has helped to convince you that no single study can be much evidence of anything, even under comparatively ideal conditions. And so putting restrictions on the practice of science to guarantee the statistical purity of individual studies seems a little misguided to me, if those restrictions are likely to constrain scientists’ freedom to create and explore. Pre-registration by its very nature is going to incentivize people to create tests of existing theories and inhibit them from searching their data sets for new and interesting relationships. Perhaps we’d be more open to their doing that if we knew that the marginal contribution of any study to “conclusiveness” is very small, and so it’s more important to ensure that these studies are creative than to ensure that they are sound implementations of Popperian falsification. Bayesian learning about scientific conclusions is going to take place in a literature, not in a paper.

The New Atlantis published a recent retrospective on Kuhn's Structure of Scientific Revolutions that's well worth a look. I think the article rightly notes that social scientists have eagerly embraced Kuhn's ideas, in the sense that the average political scientist is quite likely to have at least spent a graduate seminar meeting or two discussing the relationship between Kuhn's thesis and the practice of social science.

Yet I can't help but think that it gets several important points wrong. For instance:

Despite these criticisms, many social scientists embraced — or perhaps appropriated — Kuhn’s thesis. It enabled them to elevate the status of their work. The social sciences could never hope to meet the high standards of empirical experimentation and verifiability that the influential school of thought called positivism demanded of the sciences. But Kuhn proposed a different standard, by which science is actually defined by a shared commitment among scientists to a paradigm wherein they refine and apply their theories. Although Kuhn himself denied the social sciences the status of paradigmatic science because of their lack of consensus on a dominant paradigm, social scientists argued that his thesis could still apply to each of those competing paradigms individually. This allowed social scientists to claim that their work was scientific in much the way Kuhn described physics to be.

I'm willing to be persuaded otherwise, but my subjective impression is that Popper has been far more influential on (quantitative) political science than Kuhn. It's Popperian falsificationism that informs the scientific process that we teach to our undergrads in the standard research methods curriculum. Popperian falsificationism drove the AJPS's previous policy of requiring explicit statements of hypothesis, test, and result in the abstract. Popperian falsificationism informs criticisms of non-experimental empirical social science offered by Green, inter alia.

Some things the author says are kinda true, but I think are taken in a misleading direction:

A scientific way of thinking permeated the writings of Auguste Comte and Karl Marx, and by the end of the century, with the work of Max Weber and Émile Durkheim, the era of social science had begun in earnest. Many of the early social scientists came to view society in terms of contemporary physics; they adopted the Enlightenment belief in science as the source of progress, and considered physics the archetypical science. They understood society as a mechanism that could be engineered and adjusted. These early social scientists began to deem philosophical questions irrelevant or even inappropriate to their work, which instead became about how the mechanism of society operated and how it could be fixed. The preeminence of physics and mechanistic thinking was passed down through generations of social scientists, with qualitative characterization considered far less valuable and less “scientific” than quantitative investigations. Major social scientific theories, from behaviorism to functionalism to constructivism and beyond, tacitly think of man and society as machines and systems.

I'm not an expert on Weber and Durkheim, but I have read some of their methodological writings. Durkheim wrote an entire book about the philosophical underpinnings of his method. Weber's method of verstehen, interpretive understanding, launched a thousand essays about qualitative interpretationalism in political science–most of them decidedly opposed to positivism in the social sciences. That's not to say that many quantitative political scientists don't cling to a somewhat philosophically backward version of epistemological positivism; I suspect that many, maybe even most, do. But methodologists, the teachers and developers of quantitative methods in political scientists, are ceaselessly ragging on people to take a more sophisticated view (just take a look at this naturalistic but non-logical-positivist book just released by Kevin Clarke).

Still, some of the things the author says are pretty interesting. For instance:

A recent paper in the journal Theory in Biosciences perfectly encapsulates the desire for a more biological perspective in the social sciences, arguing for “Taking Evolution Seriously in Political Science.” The paper outlines the deterministic dangers in the view of social systems as Newtonian machines, as well as the problems posed by the reductionist belief that elements of social systems can be catalogued and analyzed. By contrast, the paper argues that approaching social sciences from an evolutionary perspective is more appropriate philosophically, as well as more effective for scientific explanation. This approach allows us to examine the dynamic nature of social changes and to explain more consistently which phenomena last, which disappear, and which are modified, while still confronting persistent questions, such as why particular institutions change.

Reading over a preprint of the article, it makes some interesting (if slightly superficial) observations about the difference between an explanandum that's amenable to repeatable experimentation and one which is data-driven and factual but largely based on uncontrolled observational evidence. The latter is not unscientific, but certainly not idiographic (a new word I learned from this month's Perspectives on Politics). But I'd like to hear a more thorough and more philosophically developed epistemological framework that relates the practice of evolutionary biologists to some foundational perspective on the world in such a way that we would expect one to lead to a better understanding of the other. That would, I think, be necessary before I could get behind using evolutionary biologists as a model for political science.

The Theory in Biosciences article does say:

Many political scientists today are searching for a better understanding of the mechanisms of political change. The problem analytically, is that most political science models are static. For rational choice, this is due to the theoretical argument that any given institutional setting will eventually reach an equilibrium in which “no one has the incentive to change his or her choice” (Levi 1997: 27). Consequently the only source of change is exogenous. As Levi argues, “it is obvious that choices change regularly and constantly. . . To understand these changes requires a set of hypotheses concerning what exogenous shocks or alterations to the independent variables will have what effects on the actions of the individuals under study” (Levi, 1997: 28).39 Given the foundational assumptions and logic of rational choice, “endogenous institutional change appears,” as Hall and Taylor observe, “to be a contradiction in terms.”

Now that's something I completely agree with… and yet, I see very little research being done on truly dynamic theoretical models in political science. Dynamic statistical models, sure. But they can't substitute for institutional political theories that seamlessly integrate change over time into their explanatory framework.

The basic idea was to figure out which of these three model types did a better job of predicting which passengers would survive on the Titanic based on their personal characteristics; these included characteristics like sex, age, the class of the ticket (first, second, or third). For each model, the blogger estimated the model on half the sample (the “training data”) and then predicted the probability of survival for the other half of the sample (the “test data”). Any passenger predicted to have a >50% probability of survival was classified as being predicted to survive. The blogger then determined what proportion of predicted survivors actually survived.

The result, copied from the original blog post:

I think this is a flawed way to assess the predictive power of the model. If a passenger is predicted to have a 50% probability of survival, we should expect this passenger to die half the time (in repeated samples); classifying the person as a “survivor,” a person with a 100% probability of survival, misinterprets the model's prediction. For example, suppose (counterfactually) that a model classified half of a test data set of 200 passengers as having a 10% chance of survival and the other half as having a 60% chance of survival. The 50% threshold binary classification procedure expects there to be 100 survivors, and for all of the survivors to be in the portion of the sample with >50% predicted probability of survival. But it's more realistic to assume that there would be 10 survivors in the 10% group, and 60 survivors in the 60% group, for a total of 70 survivors. Even if the model's predictions were totally accurate, the binary classification method of assessment could easily make its predictions look terrible.

Andrew Pierce and I just published a paper in Political Analysis making this argument. In that paper, we propose assessing a model's predictive accuracy by constructing a plot with the predicted probability of survival on the x-axis, and the empirical proportion of survivors with that predicted probability on the y-axis. The empirical proportion is computed by running a lowess regression of the model's predicted probability against the binary (1/0) survival variable, using the AIC to choose the optimal bandwidth, and then extracting the lowess prediction from the model. We created an R package to perform this process automatically (and to implement a bootstrapping approach to assessing uncertainty in the plot), but this package is designed for the assessment of in-sample fit only. So, I have to construct them manually for this example. The code for everything I've done is here.

Here's the plot for the GLM model:

As you can see, the logit model actually does a very good job of predicting the probability of passenger survival in the test data. It slightly underpredicts the probability of death for passengers who are unlikely to die, and slightly overpredicts the probability of death for the other passengers. But the predicted probabilities for near-certain (Pr(survive) near 0) and nearly impossible (Pr(survive) near 1) deaths, which are most of the data set, are quite accurately predicted.

The random forest model does a perfectly reasonable job of predicting outcomes, but not markedly better:

The pattern of over- and under-predictions is very similar to that of the GLM model. In fact, if you plot the logit predictions against the random forest predictions…

You can see that there are comparatively few cases that are classified much differently between the two models. The primary systematic difference seems to be that the random forest model takes cases that the logit predicts to have a low but positive probability of survival, and reclassifies them as zero probability of survival. I put in the dark vertical and horizontal lines to show which data points the binary classification procedure would deem “survivors” for each model; there are a few observations that are categorized differently by the two models (in the upper left and lower right quadrants of the plot), but most are categorized the same.

Finally, the conditional inference tree model does classify things quite differently, but not in a way that substantially improves the performance of the model:

I've jittered the CTREE predictions a bit so that you can see the data density. The tree essentially creates five categories of predictions, but doesn't appreciably improve the predictive performance inside of those categories above the logit model.

Comparing the GLM logit predictions to the ctree predictions…

…you see the categorizations more clearly. Of course, you can just look at the CART plot to see how these categories are created:

I have to admit, that is a pretty sweet plot.

In conclusion, comparatively ancient GLM methods do surprisingly well on this problem when compared to the CART methods. If anything, the CART methods apparently suppress a decent amount of heterogeneity in probability forecasts that the GLM models uncover. But all of the models have the same basic strengths and weaknesses, in terms of predictive accuracy. And if the heterogeneity of the GLM predictions reflects signal and not noise–and my plots seem to suggest that it is signal–the GLM predictions might well be better for forecasting survival in individual cases.

Maybe some day, I will get around to creating a version of my R package that does out-of-sample forecasts! That way, I could get assessments of the uncertainty around the plot as well.

Larry Wasserman suggests re-naming “causal inference” methods to “formal inference” methods to avoid confusion, which I fully support (though it might make the formal theorists in Political Science mad). But what I really like is his suggestion to rename “the bootstrap” to “the shotgun,” mostly because I want to teach a method called “shotgunning” in my methods classes.

It's winter break… and as any academic knows, breaks are “a good time to get work done.”

For the Christmas break, many of us have to travel home to see family members. One of the great privileges of being an academic is that you don't necessarily need to be in your office to get research and administrative work accomplished. But for methods types like me, much of our work requires either (a) running large scale simulations to test the performance of a new estimator, or (b) running complex models on large data sets. In either case, significant computing resources are required… the kind of resources that can't be easily checked in a carry-on.

But one thing I usually do have with me is my iPad. It's light, portable, and it has a 4G internet connection (and WiFi, of course) so that I can almost always get online. And I realized that, if you add a keyboard and a Remote Desktop app to the mix, that's all that's required to get serious work done.

You only need a few things:

(A) Your work computer must be connected to the internet, and must be have some sort of remote access software installed. Fortunately, there are tons of really great remote access options available out there, many of them free.

RDP, the Windows Remote Desktop protocol, is a great option and free for anyone with Windows. If you have Windows 7, like I do, you can find out how to enable RDP connections here. Make sure that you have passwords enabled on all the administrator accounts (all of which will get RDP access once you enable it).

VNC Server is available for free on Unix-alikes, such as Ubuntu, and there are apparently free options for Macintosh OS X and Windows as well. But my experience has been that RDP is much better for Windows users.

LogMeIn and Hamachi are paid services that have proprietary software to help make Remote Desktop connections more secure and more convenient. But RDP and VNC connections can just as secure, if you use Virtual Private Networks and/or SSL tunneling to connect.

(B) If your computer is at work on a secure network, such as on a university's network, you need to connect your iPad to the Virtual Private Network of your university. The way you do this will vary from case to case, but generally speaking your university should have instructions on how to connect to its VPN. The iPad can connect to a VPN natively through iOS. You may also need to ask the university's IT services to assign your computer to a static IP address on the network.

(C) If your computer is at home and behind a router, connecting remotely is possible but a little tricky. There are several things you need to do:

You will probably need to enable port forwarding on your router so that the iPad will be able to connect to the computer that you wish to share. If your router has an option for VPN connections, it may be preferable to set up the VPN rather than port forwarding.

You will need to set the computer you wish to connect to up on a static IP on your home network, through your router.

You will need to connect to the public IP address of your home router, which unfortunately changes from time to time. So, you will need to set your router up with a service like DynDNS so that you can point the router at a static web address (something like esarey.dyndns.org).

(D) Finally, you need to get connection software for your iPad. I personally use Jump for iPad, which is working great for me so far. A keyboard is also probably essential to get real work done, and so far I've really liked the Zagg ProFolio+.

And that's it! If everything's working well, you should be able to use Jump to connect to your computer's IP address, and have an iPad screen that looks a little something like this:

Now you can get a little programming done while you're home for the holidays!