Political scientists tend to think about causality in terms of mechanisms. In this paper we argue that non-parametric structural equation models are consistent with how many empirical political
scientists think about causality and are consistent with the powerful and well-respected Neyman-Rubin Causal Model. Furthermore, using examples
we demonstrate that two important practical questions are more easily addressed within the mechanistic framework: What (if any) set or sets
of conditioning variables will allow the identification of average causal effects in a regression or matching model? When unmeasured confounding is present, what (if any) adjustment will non-parametrically identify the average causal effect?

Over the last twenty years, a literature spanning several fields of applied statistics has analyzed how to identify and estimate causal effects of a nonrandomized treatment when an instrumental variable (IV) is available. But researchers often have multiple treatments that might interact with one another and want to estimate either the direct or joint effect of these treatments. This paper introduces a set of novel estimands for instrumental variables with multiple treatments and multiple instruments. These estimands are similar to previous IV estimands as they are ``local'' to strata defined by the joint compliance status across the treatments. Furthermore, I show that these estimands are nonparametrically identified under standard instrumental variable assumptions. The paper further develops nonparametric estimators for these quantities and assesses their performance relative to classic parametric approaches like two-stage least squares. Finally, I demonstrate the method through an empirical application to a voter mobilization field experiment with (1) a telephone treatment and (2) an in-person canvassing treatment.

A basic feature of many field experiments is that investigators are only able to randomize clusters of individuals -- such as households, communities, firms, medical practices, schools, or classrooms -- even when the individual is the unit of interest. To recoup some of the resulting efficiency loss, many studies pair similar clusters and randomize treatment within pairs. Other studies (including almost all published political science field experiments) avoid pairing, in part because some prominent methodological articles claim to have identified serious problems with this 'matched-pair cluster-randomized' design. We prove that all such claims about problems with this design are unfounded. We then show that the estimator for matched-pair designs favored in the literature is appropriate only in situations where matching is not needed. To address this problem without modeling assumptions, we generalize Neyman's (1923) approach and propose a simple new estimator with much improved statistical properties. We also introduce methods to cope with individual-level noncompliance, which most existing approaches incorrectly assume away. We show that from the perspective of, among other things, bias, efficiency, or power, pairing should be used in cluster-randomized experiments whenever feasible; failing to do so is equivalent to discarding a considerable fraction of one's data. We develop these techniques in the context of a randomized evaluation we are conducting of the Mexican Universal Health Insurance Program.

Understanding causal mechanisms is a fundamental goal of social science research. Demonstrating whether one variable causes a change in another is often insufficient, and researchers seek to explain why such a causal relationship arises. Nevertheless, little is understood about how to identify causal mechanisms in empirical research. Many researchers either informally talk about possible causal mechanisms or attempt to quantify them without explicitly stating the required assumptions. Some assert that process tracing in detailed case studies is the only way to evaluate causal mechanisms. Others contend the search for causal mechanisms is so elusive that we should instead focus on causal effects alone. In this paper, we show how to learn about causal mechanisms from experimental and observational studies. Using the potential outcomes framework of causal inference, we formally define causal mechanisms, present general identification and estimation strategies, and provide a method to assess the sensitivity of one's conclusions to the possible violations of key identification assumptions. We also propose several alternative research designs for both experimental and observational studies that may help identify causal mechanisms under less stringent assumptions. The proposed methodology is illustrated using media framing experiments and observational studies of incumbency advantage.

We analyze a natural experiment to answer the longstanding question of
whether the name order of candidates on ballots affects election outcomes.
Since 1975, California law has mandated randomizing the ballot order with a
lottery, where alphabet letters would be shaken vigorously and selected
from a container. Previous studies, relying overwhelmingly on non-randomized
data, have yielded conflicting results about whether ballot order effects
even exist. Using improved statistical methods, our analysis of statewide
elections from 1978 to 2002 reveals that in general elections ballot order
has a significant impact only on minor party candidates and candidates
for nonpartisan offices. In primaries, however, being listed first benefits
everyone. In fact, ballot order might have changed the winner in roughly
nine percent of all primary races examined. These results are largely
consistent with a theory of partisan cuing. We propose that all
electoral jurisdictions randomize ballot order to minimize ballot effects.

In his landmark article, Neyman (1923) introduced randomization-based inference in analyzing experiments under the completely randomized design. Under this framework, Neyman considered the statistical estimation of the sample average treatment effect and derived the variance of the standard estimator using the treatment assignment mechanism as the sole basis of inference. In this paper, I extend Neyman's analysis to randomized experiments under the matched-pair design where experimental units are paired based on their pre-treatment characteristics and the randomization of treatment is subsequently conducted within each matched pair. I study the variance identification for the standard estimator of average treatment effects and analyze the relative efficiency of the matched-pair design over the completely randomized design. I also show how to empirically evaluate the relative efficiency of the two designs using experimental data obtained under the matched-pair design. My randomization-based analysis clarifies some of the important questions raised in the literature and identifies a hidden and yet implausible assumption that is made for the efficiency analysis in a widely used textbook. Finally, the analytical results are illustrated with numerical and empirical examples.
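
As a rough illustration of the setting described above, the following sketch computes the usual within-pair difference-in-means estimate and the common textbook variance estimate based on the sample variance of the pair differences. The function name `matched_pair_estimate` and the example outcomes are hypothetical, and the variance formula is the conventional one rather than necessarily the result derived in the paper.

```python
from statistics import mean, variance


def matched_pair_estimate(pairs):
    """Difference-in-means estimate for a matched-pair design.

    pairs: list of (treated_outcome, control_outcome), one tuple per pair.
    Returns (tau_hat, var_hat), where var_hat is the conventional estimate:
    the sample variance of the within-pair differences divided by the
    number of pairs. This is an assumed textbook formula for illustration.
    """
    diffs = [y_treated - y_control for y_treated, y_control in pairs]
    n_pairs = len(diffs)
    if n_pairs < 2:
        raise ValueError("need at least two pairs to estimate a variance")
    tau_hat = mean(diffs)
    var_hat = variance(diffs) / n_pairs  # variance() uses the n - 1 denominator
    return tau_hat, var_hat


# Hypothetical outcomes for four matched pairs.
pairs = [(5.0, 3.0), (4.0, 4.0), (6.0, 3.0), (7.0, 6.0)]
tau_hat, var_hat = matched_pair_estimate(pairs)
```

Only within-pair differences enter the calculation, which is why pairing on strong predictors of the outcome tends to reduce the variance of the estimate.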

Political scientists are often interested in understanding whether state laws alter individual-level behavior. For example, states often alter their election procedures, which can increase or decrease the cost of voting. In this example, it is important to understand whether these changes alter turnout since changes in costs may disproportionally affect those at the margin of voting. Analysts have typically used one of two different regression-based research designs to estimate whether changes in state laws increase or decrease turnout. In both instances, voters from states without a change in laws are used as counterfactuals for the voters who experience a change in election law. Here, we carefully examine the assumptions behind both research designs and study their plausibility. Next, we outline a series of research design elements that can be used in addition to the usual designs. These research design elements allow the analyst to better understand the role of unobserved confounders, which is obscured in standard research designs. Using these design elements, we demonstrate that what appears to be clear-cut evidence from the usual research designs is often a function of confounding. We argue that to truly understand how changes in voting costs alter turnout, a different research design is required. Future work must rely on a research design that makes comparisons among voters who live within the same state. Our work has implications beyond turnout to any investigation of how state-level treatments alter individual behavior.

Many important papers studying cross-national outcomes such as political regime type or economic development exploit treatment variables generated by either geological or pre-modern historical processes. A general and major problem with these treatments, however, derives from their heavy regional concentration. Despite not being caused by other variables that independently affect the dependent variable, due to geological or historical accidents, variables such as oil or settler mortality claimed to be exogenous are nonetheless highly correlated with potential confounders that impede drawing causal inferences. With the goal of eliminating bias by controlling for observables, many papers studying variables such as these use parametric procedures to control for regional dummies. While estimation techniques such as ordinary least squares (OLS) provide a seemingly straightforward methodological fix, OLS also obscures particular shortcomings of the data, and imposes strong assumptions to combine information across regions. The current paper takes a closer look at these assumptions and provides examples from top political science and economic journals to show how disaggregating the data can either help to support or to severely qualify existing results.

Using new robust matching methods for making causal inferences from
survey data, I demonstrate that there are profound differences
between how voters behave in mature democracies versus how they
behave in new ones. The problems of voter ignorance and
inattentiveness are not as serious in mature democracies as many
analysts have suggested but are of grave concern in new democracies.
Citizens in mature democracies are able to accomplish something that
citizens in fledgling democracies are not: inattentive and poorly
informed citizens are able to vote like their better informed
compatriots and hence need to pay little attention to political
events such as election campaigns in order to vote as if they were
attentive. The results from the U.S. (which rely on various
National Election Studies) and Mexico (2000 Panel Study) are
reported in detail. Results from other countries are briefly
reported.

Many randomized experiments suffer from the ``truncation-by-death'' problem where potential outcomes are not defined for some subpopulations. For example, in medical trials, quality-of-life measures are only defined for surviving patients, and various skip-pattern questions are analyzed in social science survey experiments. In this paper, I
derive the sharp bounds on causal effects under various assumptions. My identification analysis is based on the idea that the ``truncation-by-death'' problem can be formulated as the contaminated
data problem. The proposed analytical techniques can be applied to other settings in causal inference including the estimation of direct and indirect effects and the analysis of three-arm randomized
experiments with noncompliance.

We present a method that largely automates the search for systematic treatment effect heterogeneity in large-scale experiments. We introduce an estimator recently proposed in the statistical learning literature, Bayesian Additive Regression Trees (BART), to model treatment effects that vary as a function of covariates. BART has two important advantages over commonly employed parametric modeling strategies: it automates the search for treatment-covariate interactions and models them in a very flexible manner. To increase the reliability and credibility of the resulting conditional average treatment effect estimates, we suggest the use of a split sample analysis, which randomly divides the data into two equally-sized parts. The first part is used to search for systematic treatment effect heterogeneity; the second part is used to confirm the results. This approach permits a relatively unstructured exploration of systematic treatment effect heterogeneity while avoiding the pitfalls of data dredging and multiple comparisons. We illustrate the value of our approach by offering two empirical examples, a survey experiment on Americans' support for social welfare spending and a voter mobilization field experiment. In both applications, our approach provides robust insights into the nature and extent of systematic treatment effect heterogeneity.
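
The split sample step described above can be sketched as follows. This is a minimal illustration only: the function name `split_sample`, the seed, and the use of unit indices are assumptions, and the BART modeling itself is not shown.

```python
import random


def split_sample(n_units, seed=0):
    """Randomly divide unit indices 0..n_units-1 into two halves.

    Returns (explore, confirm): the first half is meant for searching for
    treatment effect heterogeneity, the second for confirming what was found.
    With an odd n_units the second half receives the extra unit.
    """
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    indices = list(range(n_units))
    rng.shuffle(indices)
    half = n_units // 2
    return sorted(indices[:half]), sorted(indices[half:])


explore, confirm = split_sample(10, seed=1)
```

Fixing the split before any modeling is what guards the confirmation half against the data dredging the abstract warns about.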

In this article, we develop the theoretical properties of the propensity
function which is a generalization of the propensity score of Rosenbaum
and Rubin (1983). Methods based on the propensity score have long been
used for causal inference in observational studies; they are easy to use
and can effectively reduce the bias caused by non-random treatment
assignment. Although treatment regimes need not be binary in practice, the
propensity score methods are generally confined to binary treatment
scenarios. Two possible exceptions were suggested by Joffe and Rosenbaum
(1999) and Imbens (2000) for ordinal and categorical treatments,
respectively. In this article, we develop theory and methods which
encompass all of these techniques and widen their applicability by
allowing for arbitrary treatment regimes. We illustrate our propensity
function methods by applying them to two data sets; we estimate the effect
of smoking on medical expenditure and the effect of schooling on wages. We
also conduct Monte Carlo experiments to investigate the performance of our
methods.

We attempt to clarify, and suggest how to avoid, several serious misunderstandings about and fallacies of causal inference in experimental and observational research. These issues concern some of the most basic advantages and disadvantages of each basic research design. Problems include improper use of hypothesis tests for covariate balance between the treated and control groups, and the consequences of using randomization, blocking before randomization, and matching after treatment assignment to achieve covariate balance. Applied researchers in a wide range of scientific disciplines seem to fall prey to one or more of these fallacies, and as a result make suboptimal design or analysis choices. To clarify these points, we derive a new four-part decomposition of the key estimation errors in making causal inferences. We then show how this decomposition can help scholars from different experimental and observational research traditions better understand each other's inferential problems and attempted solutions.
(This paper is forthcoming in the Journal of the Royal Statistical Society, but we have some time for revisions and would value any comments anyone might have. This is a revised and much more general version of an earlier paper, "The Balance Test Fallacy in Causal Inference".)

Interference between units may pose a threat to unbiased causal inference in randomized controlled experiments. Although the assumption of no interference is essential for causal inference, few options are available for testing this assumption. This paper presents the first reliable ex post method for detecting interference between units in randomized experiments. Naive estimators of interference that attempt to exploit the proximity of units may be biased because simple randomization of units into treatment does not imply simple randomization of proximity to treated units. However, through a randomization-based approach, the confounding associated with these naive estimators may be circumvented entirely. With a test statistic of the analyst's choice, a conditional randomization test allows for the calculation of the exact significance of the causal dependence of outcomes on the treatment status of other units. The efficacy and robustness of the method are demonstrated through simulation studies and, using this method, interference between units is detected in a field experiment designed to assess the effect of mailings on voter turnout.

This essay examines several alternative theories of causality from the
philosophy of science literature and considers their implications for
methods of empirical social inquiry. In particular, I argue that the
epistemology of counterfactual causality is not the only logic of causal
inference in social inquiry, and that different methods of research
appeal to different models of causal inference. As these models are
often philosophically inter-dependent, a more eclectic understanding of
causation in empirical research may afford greater methodological
versatility and provide a more complete understanding of causality.
Some common statistical critiques of small-N research are then
considered from the perspective of mechanistic causal theories, and
alternative strategies of strengthening causal arguments in small-N
research are discussed.

Experiments, unlike observational studies, are rarely criticized for yielding invalid causal inferences. However, I identify measurement error as a threat to causal inference in experiments. In particular, acquiescence bias, a common and substantial source of measurement error within surveys, may be correlated with experimental manipulations. Using data from a survey experiment embedded in a Deliberative Poll, I find that acquiescence bias causes significant measurement error and that the bias differs before and after deliberation. I conclude that even experimental researchers should heed the recommendation by questionnaire design researchers to refrain from asking agree/disagree questions completely and instead ask only construct-specific questions to avoid this threat to validity.

This paper proposes a research design for evaluating the effect of Republican candidates' immigration stances on House election outcomes. It develops a measure of immigration stance which is based on the text of each candidate's issue statement. With this as the treatment, propensities to support a harsh line on immigration are calculated for each candidate based on a variety of covariates that also may influence election outcomes. In this way, a research design is developed before election outcomes are observed. Thus, this project clearly reflects the advice of Rubin, who argues that the research design ought to be set before the outcome is even observed.

We address a major discrepancy in matching methods for causal inference in observational data. Since these data are typically plentiful, the goal of matching is to reduce bias and only secondarily to keep variance low. However, most matching methods seem designed for the opposite problem, guaranteeing sample size ex ante but limiting bias by controlling for covariates through reductions in the imbalance between treated and control groups only ex post and only sometimes. (The resulting practical difficulty may explain why many published applications do not check whether imbalance was reduced and so may not even be decreasing bias.) We introduce a new class of "Monotonic Imbalance Bounding" (MIB) matching methods that enables one to choose a fixed level of maximum imbalance, or to reduce maximum imbalance for one variable without changing it for the others. We then discuss a specific MIB method called "Coarsened Exact Matching" (CEM) which, unlike most existing approaches, also explicitly bounds through ex ante user choice both the degree of model dependence and the causal effect estimation error, eliminates the need for a separate procedure to restrict data to common support, meets the congruence principle, is approximately invariant to measurement error, works well with modern methods of imputation for missing data, is computationally efficient even with massive data sets, and is easy to understand and use. This method can improve causal inferences in a wide range of applications, and may be preferred for simplicity of use even when it is possible to design superior methods for particular problems. We also make available open source software which implements all our suggestions.
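
The core idea of Coarsened Exact Matching can be sketched in a few lines: coarsen each covariate into bins, match exactly on the coarsened values, and keep only strata that contain both treated and control units. The sketch below is illustrative and is not the authors' software; the function name `coarsened_exact_match`, the fixed-width binning rule, and the example data are all assumptions.

```python
from collections import defaultdict
from math import floor


def coarsened_exact_match(units, bin_widths):
    """Group units by coarsened covariates and keep usable strata.

    units: list of dicts with keys 'id', 'treated' (bool), 'x' (tuple of floats).
    bin_widths: one positive width per covariate (an assumed coarsening rule;
    in practice the analyst chooses the coarsening for each variable).
    Returns {stratum_key: [unit ids]} for strata that contain at least one
    treated and at least one control unit; all other units are dropped.
    """
    strata = defaultdict(list)
    for unit in units:
        key = tuple(floor(x / w) for x, w in zip(unit["x"], bin_widths))
        strata[key].append(unit)

    matched = {}
    for key, members in strata.items():
        has_treated = any(m["treated"] for m in members)
        has_control = any(not m["treated"] for m in members)
        if has_treated and has_control:
            matched[key] = [m["id"] for m in members]
    return matched


# Hypothetical data: covariates are (age, years of education).
units = [
    {"id": 1, "treated": True, "x": (23.0, 12.0)},
    {"id": 2, "treated": False, "x": (27.0, 14.0)},
    {"id": 3, "treated": True, "x": (45.0, 16.0)},
    {"id": 4, "treated": False, "x": (61.0, 9.0)},
    {"id": 5, "treated": False, "x": (48.0, 17.0)},
    {"id": 6, "treated": True, "x": (35.0, 8.0)},
]
matched = coarsened_exact_match(units, bin_widths=(10.0, 4.0))
```

Because the bin widths are set before matching, the largest covariate difference allowed within a stratum is chosen ex ante, while the number of retained units is only known afterwards. Units 4 and 6 fall in strata without a counterpart and are dropped, which is how the restriction to common support happens without a separate step.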

We introduce a new ``Monotonic Imbalance Bounding'' (MIB) class of matching methods for causal inference with a surprisingly large number of attractive statistical properties. MIB generalizes and extends in several new directions the only existing class, ``Equal Percent Bias Reducing'' (EPBR), which is designed to satisfy weaker properties and only in expectation. We also offer strategies to obtain specific members of the MIB class, and analyze in more detail a member of this class, called Coarsened Exact Matching, whose properties we analyze from this new perspective. We offer a variety of analytical results and numerical simulations that demonstrate how members of the MIB class can dramatically improve inferences relative to EPBR-based matching methods.

Can randomized experiments at the individual level help assess the persuasive effects of campaign tactics? In the contemporary U.S., vote choice is not observable, so one promising research design to assess persuasion involves randomizing appeals and then using a survey to measure vote intentions. Here, we analyze one such field experiment conducted during the 2008 presidential election in which 56,000 registered voters were assigned to persuasion in person, by phone, and/or by mail. Persuasive appeals by canvassers had two unintended consequences. First, they reduced responsiveness to the follow-up survey, lowering the response rate sharply among infrequent voters. Second, various statistical methods to address the resulting biases converge on a counter-intuitive conclusion: the persuasive canvassing reduced candidate support. Our results allow us to rule out even small effects in the intended direction, and illustrate the backlash that persuasion can engender.

We introduce a set of new Markov chain Monte Carlo algorithms for Bayesian
analysis of the multinomial probit model. Our Bayesian representation of
the model places a new, and possibly improper, prior distribution directly
on the identifiable parameters and thus is relatively easy to interpret
and use. Our algorithms, which are based on the method of marginal data
augmentation, involve only draws from standard distributions and dominate
other available Bayesian methods in that they are as quick to converge as
the fastest methods but with a more attractive prior specification.

The fundamental problem of causal inference is that an individual cannot be simultaneously observed in both the treatment and control states (Holland 1986). The propensity score methods that compare the treatment and control groups by discarding the unmatched units are now widely used to deal with this problem. In some situations, however, it is possible to observe the same individual or unit of observation in the treatment and control states at different points in time. Such data have the structure that is often referred to as time-series-cross-sectional (TSCS) data. While multilevel modeling is often applied to analyze TSCS data, this paper proposes that synthesizing the propensity score methods and multilevel modeling is preferable. The paper conducts a Monte Carlo simulation with 36 different scenarios to test the performance of the two combined methods. The results show that synthesizing propensity score matching with multilevel modeling performs better in that this approach yields less biased and more efficient estimates. An empirical case study that reexamines the model of Przeworski et al. (2000) on democratization and development also shows the advantage of this synthesis.

Political scientists frequently use instrumental variables estimators to estimate the Local Average Treatment Effect (LATE), or the average treatment effect among those who comply with treatment assignment. However, the LATE is often not the causal estimand of interest; researchers may instead be interested in the Sample Average Treatment Effect (SATE), or the average treatment effect for the entire sample. We first introduce the compliance score, a pre-treatment covariate that reflects a unit's probability of treatment compliance, to researchers in political science. We posit a maximum likelihood estimation technique for predicting compliance scores even in the presence of two-sided non-compliance. We then develop a new technique, inverse compliance score weighting, that, in conjunction with a standard IV estimator, will allow researchers to easily estimate the SATE. Finally, we estimate both the LATE and SATE for a randomized experiment designed to measure the effects of media exposure and reach striking substantive conclusions.
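
For orientation, the standard IV (Wald) estimate of the LATE with a binary instrument is the intent-to-treat effect on the outcome divided by the effect of assignment on treatment receipt. The sketch below shows only that standard estimator under the usual IV assumptions. The function name `wald_late` and the data are hypothetical, and the compliance score estimation and inverse compliance score weighting proposed in the paper are not reproduced here.

```python
from statistics import mean


def wald_late(z, d, y):
    """Wald estimate of the LATE for a binary instrument.

    z: assignment indicators (0/1), d: treatment received (0/1), y: outcomes.
    Returns (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0]).
    """
    y_assigned = [yi for zi, yi in zip(z, y) if zi == 1]
    y_not_assigned = [yi for zi, yi in zip(z, y) if zi == 0]
    d_assigned = [di for zi, di in zip(z, d) if zi == 1]
    d_not_assigned = [di for zi, di in zip(z, d) if zi == 0]

    itt_outcome = mean(y_assigned) - mean(y_not_assigned)
    first_stage = mean(d_assigned) - mean(d_not_assigned)
    if first_stage == 0:
        raise ValueError("assignment has no effect on treatment receipt")
    return itt_outcome / first_stage


# Hypothetical data with two-sided noncompliance.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 1, 0, 0, 0, 0, 1]
y = [3.0, 4.0, 5.0, 2.0, 1.0, 2.0, 1.0, 4.0]
late = wald_late(z, d, y)
```

In this example one assigned unit goes untreated and one unassigned unit takes the treatment, so the denominator is 0.5 rather than 1 and the estimate applies only to compliers.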

Most of the literature on grassroots campaigning focuses on mobilizing potential supporters to turn out to vote. The actual ability of partisan campaigns to boost support by changing voter preferences is unclear. We present the results of a field experiment the Australian Council of Trade Unions (ACTU) ran during the 2013 Australian Federal Election. The experiment was designed to minimize the conservative (the Coalition) vote as part of one of the largest and most extensively documented voter persuasion campaigns in Australian history. Union members who were identified as undecided voters in over 30 electorates were targeted with appeals by direct mail and phone banks. Because of the presence of compulsory voting in Australia, we are able to identify the effects of voter persuasion independently of voter turnout. We find that direct mail, the most extensively used campaign strategy in Australia, has little effect on voter persuasion. Direct human contact, on the other hand, seems to be an effective tool for voter persuasion. Among undecided voters who actually receive direct contact via phone call, we find a ten percentage point decrease in the Coalition vote. From a methodological standpoint, we use various methods to account for multiple treatment arms, measured treatment noncompliance in one of the treatments, and missing outcome and covariate data. The field experiment also provides a good lesson in conducting and saving broken experiments in the presence of planning uncertainty and implementation failures.

Statistical analysis requires a probability model: commonly, a model
for the dependence of outcomes $Y$ on confounders $X$ and a
potentially causal variable $Z$. When the goal of the analysis is to
infer $Z$'s effects on $Y$, this requirement introduces an element
of circularity: in order to decide how $Z$ affects $Y$, the analyst
first determines, speculatively, the manner of $Y$'s dependence on
$Z$ and other variables. This paper takes a statistical perspective
that avoids such circles, permitting analysis of $Z$'s effects on
$Y$ even as the statistician remains entirely agnostic about the
conditional distribution of $Y$ given $X$ and $Z$, or perhaps even
denies that such a distribution exists. Our assumptions instead
pertain to the conditional distribution $Z \vert X$, and the role of
speculation in settling them is reduced by the existence of random
assignment of $Z$ in a field experiment as well as by
poststratification, testing for overt bias before accepting a
poststratification, and optimal full matching. Such beginnings pave
the way for ``randomization inference'', an approach which, despite
a long history in the analysis of designed experiments, is
relatively new to political science and to other fields in which
experimental data are rarely available.
The approach applies to both experiments and observational studies.
We illustrate this by applying it to analyze A. Gerber and
D. Green's New Haven Vote 98 campaign. Conceived as both a
get-out-the-vote campaign and a field experiment in political
participation, the study assigned households to treatment and
desired to estimate the effect of treatment on the individuals
nested within the households. We estimate the number of voters who
would not have voted had the campaign not prompted them to --- that
is, the total number of votes attributable to the interventions of
the campaigners --- while taking into account the non-independence
of observations within households, non-random compliance, and
missing responses. Both our statistical inferences about these
attributable effects and the stratification and matching that
precede them rely on quite recent developments from statistics; our
matching, in particular, has novel features of potentially wide
applicability. Our broad findings resemble those of the original
analysis by \citet{gerbergreen00}.
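
To give a flavor of randomization inference in its simplest form, the sketch below computes an exact two-sided p-value for the sharp null hypothesis of no effect by enumerating every possible assignment of a fixed number of units to treatment. It assumes complete randomization of individual units and a difference-in-means test statistic. The function name and data are hypothetical, and the household-level assignment, matching, and attributable-effect inference described above are not reproduced.

```python
from itertools import combinations
from statistics import mean


def exact_randomization_pvalue(outcomes, treated_idx):
    """Exact two-sided p-value under the sharp null of no treatment effect.

    Under the sharp null the outcomes are fixed whatever the assignment, so
    the statistic can be recomputed for every way of choosing the same number
    of treated units. Feasible only for small samples (assumed here).
    """
    n = len(outcomes)
    m = len(treated_idx)

    def diff_in_means(treated):
        t = [outcomes[i] for i in range(n) if i in treated]
        c = [outcomes[i] for i in range(n) if i not in treated]
        return mean(t) - mean(c)

    observed = abs(diff_in_means(set(treated_idx)))
    as_extreme = 0
    total = 0
    for assignment in combinations(range(n), m):
        total += 1
        # Small tolerance so that ties with the observed value are counted.
        if abs(diff_in_means(set(assignment))) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total


# Hypothetical outcomes; units 3, 4 and 5 were assigned to treatment.
outcomes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
p_value = exact_randomization_pvalue(outcomes, treated_idx=[3, 4, 5])
```

No model for the outcomes is needed here: the reference distribution comes entirely from the known assignment mechanism.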

Using panel data and matching techniques, we exploit a rare change in communication flows -- the endorsement switch to the Labour Party by several prominent British newspapers before the 1997 United Kingdom general election -- to study the persuasive power of the news media. These unusual events provide an opportunity to test for news media persuasion while avoiding methodological pitfalls that have plagued previous studies. By comparing readers of newspapers that switched endorsements to similar individuals who did not read these newspapers, we estimate that these papers persuaded a considerable share of their readers to vote for Labour. Depending on the statistical approach, the point estimates vary from about 10 percent to as high as 25 percent of readers. These findings provide rare, compelling evidence that the news media exert a powerful influence on mass political behavior.

Matching is an increasingly popular method of causal inference in observational data, but applications of it are often poorly executed. We address this problem by providing a graphical approach for choosing among the numerous possible matching solutions generated by three methods: the venerable "Mahalanobis Distance Matching" (MDM), the commonly used "Propensity Score Matching" (PSM), and a newer approach called "Coarsened Exact Matching" (CEM). In the process of using our approach, we also discover that PSM often approximates random matching, both in real applications and in data simulated by the processes for which PSM theory was designed. Moreover, contrary to conventional wisdom, random matching is not benign: it (and thus PSM) can degrade inferences relative to not matching at all. We find that MDM and CEM do not have this problem, and in practice CEM usually outperforms the other two approaches. However, with our comparative graphical approach, focus is on choosing a matching solution for a particular application, which is what may improve inferences, rather than the particular method used to generate it. The easy-to-follow procedures we describe thus enable researchers to improve the application of any one of these methods, to choose among them and from the various matching solutions generated by any one method, and ultimately to increase the validity and extent of causal information extracted from their data.
Link to paper: http://gking.harvard.edu/files/psparadox.pdf

Genetic matching is a new method for performing multivariate matching
which uses an evolutionary search algorithm to determine the weight
each covariate is given. The method utilizes an evolutionary algorithm
developed by Mebane and Sekhon (1998; Sekhon and Mebane 1998) that
maximizes the balance of observed potential confounders across matched
treated and control units. The method is nonparametric and does not
depend on knowing or estimating the propensity score, but the method
is greatly improved when a known or estimated propensity score is
incorporated. Genetic matching reliably reduces both the bias and the
mean square error of the estimated causal effect even when the
property of equal percent bias reduction (EPBR) does not hold. When
this property does not hold, matching methods---such as Mahalanobis
distance and propensity score matching---often perform poorly. Even if the EPBR property
does hold and the propensity score is correctly specified, in finite samples, estimates based on
genetic matching have lower mean square error than those based on the
usual matching methods. We present a reanalysis of the LaLonde (1986)
job training dataset which demonstrates the benefits of genetic
matching and which helps to resolve a longstanding debate between
Dehejia and Wahba (1999, 2002); Dehejia (2005) and Smith and Todd
(2001, 2005a,b) over the ability of matching to overcome LaLonde's
critique of nonexperimental estimators. Monte Carlos are also
presented to demonstrate the properties of our method.

What, if anything, should one infer about the causal effect of a binary treatment on a binary outcome from a $2 \times 2$ cross-tabulation of non-experimental data? Many researchers would answer ``nothing'' because of the likelihood of severe bias due to the lack of adjustment for key confounding variables. This paper shows that such a conclusion is unduly pessimistic. Because the complete data likelihood under arbitrary patterns of confounding factorizes in a particularly convenient way, it is possible to parameterize this general situation with four easily interpretable parameters. Subjective beliefs regarding these parameters are easily elicited and subjective statements of uncertainty become possible. This paper also develops a novel graphical display called the confounding plot that quickly and efficiently communicates all patterns of confounding that would leave a particular causal inference relatively unchanged.

Matching methods are widely used to adjust for possibly confounded treatment assignment when making causal inferences. The success of the matching adjustment depends on generating as much equivalence as possible between the distribution of pre-treatment covariates in the treated and control groups. In numerous articles across a diverse variety of academic fields that use matching, researchers evaluate the degree of equivalence by conducting hypothesis tests, most commonly the $t$-test for the mean difference of each of the covariates in the two matched groups. We demonstrate that these hypothesis tests are fallacious and discuss better alternatives.
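One way to see the problem is the following minimal sketch. It uses made-up covariate values, and the function names are illustrative assumptions, not code from the paper. The two-sample t-statistic depends on sample size as well as on the mean difference, so discarding observations can make a balance test "pass" even though the imbalance itself is unchanged.

```python
import math
import statistics


def mean_difference(treated, control):
    """Difference in covariate means between the treated and control groups."""
    return statistics.mean(treated) - statistics.mean(control)


def welch_t_statistic(treated, control):
    """Two-sample t-statistic using the unequal-variance standard error."""
    standard_error = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return mean_difference(treated, control) / standard_error


# Hypothetical covariate values (illustrative only).
treated_small = [1.0, 2.0, 3.0, 4.0, 5.0]
control_small = [0.0, 1.0, 2.0, 3.0, 4.0]

# The same values repeated ten times: identical means, ten times the data.
treated_large = treated_small * 10
control_large = control_small * 10

for label, treated, control in [
    ("large sample", treated_large, control_large),
    ("small sample", treated_small, control_small),
]:
    print(
        label,
        "mean difference:", mean_difference(treated, control),
        "t-statistic:", round(welch_t_statistic(treated, control), 3),
    )
```

In this toy example the mean difference is 1.0 in both samples, while the t-statistic falls from about 3.5 to about 1.0 when nine tenths of the data are dropped. A stopping rule based on the t-test would therefore reward the loss of observations rather than any improvement in balance.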

Does access to foreign media facilitate the diffusion of protest in authoritarian regimes? Apparently for the first time, I test this hypothesis by exploiting a natural experiment in communist East Germany. I take advantage of the fact that West German television broadcasts could be received in most but not all parts of East Germany and conduct a matched analysis in which counties without access to West German television are matched to a comparison group of counties with West German television. Comparing these two groups of East German counties, I find no evidence that West German television affected the speed or depth of protest diffusion during the 1989 East German revolution.

Of late there has been a renewed interest in natural experiments as a method for drawing causal inferences from observational data. One form of natural experiment exploits variation in geography where units in one geographic area receive a treatment while units in another area do not. In this kind of geographic natural experiment, the hope is that assignment to treatment via geographic location creates as-if random variation in treatment assignment. When this happens, adjustment for baseline covariates is unnecessary. In many applications, however, some adjustment for baseline covariates may be necessary due to strategic sorting around the border between treatment and control areas. As such, analysts may wish to combine identification strategies--using both spatial proximity and covariates--for more plausible inferences. Here we explore how to utilize spatial proximity as well as covariates in the analysis of geographic natural experiments. We contend that standard statistical tools are ill-equipped to exploit covariates as well as variation in treatment assignment that is a function of spatial proximity. We use a mixed integer programming matching algorithm to flexibly incorporate information about both the discontinuity and observed covariates, which allows us to minimize spatial distance while preserving balance on observed covariates. We argue that combining information about both covariates and the discontinuity creates a method of estimation that can be informally thought of as doubly robust. We demonstrate the method with data on ballot initiatives and turnout in Milwaukee, WI.

In this paper, we demonstrate how to effectively design and analyze randomized experiments, which are becoming increasingly common in political science research. Randomized experiments provide researchers with an opportunity to obtain unbiased estimates of causal effects because the randomization of treatment guarantees that the treatment and control groups are on average equal in both observed and unobserved characteristics. Even in randomized experiments, however, complications can arise. In political science experiments, researchers often cannot force subjects to comply with treatment assignment or to provide the information necessary for the estimation of causal effects. Building on the recent statistical literature, we show how to make statistical adjustments for these noncompliance and nonresponse problems when analyzing randomized experiments. We also demonstrate how to design randomized experiments so that the potential impact of such complications is minimized.
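As a point of reference for the noncompliance problem, here is a minimal sketch of the standard Wald (instrumental variable) ratio for the complier average causal effect. This is a textbook estimator swapped in for illustration, not necessarily the adjustment developed in the paper. It assumes randomized assignment, the exclusion restriction, and monotonicity, and the data and function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def wald_cace(assigned, received, outcome):
    """Ratio of the intention-to-treat effect on the outcome to the
    intention-to-treat effect on treatment receipt."""
    outcome_assigned = [y for z, y in zip(assigned, outcome) if z == 1]
    outcome_control = [y for z, y in zip(assigned, outcome) if z == 0]
    receipt_assigned = [d for z, d in zip(assigned, received) if z == 1]
    receipt_control = [d for z, d in zip(assigned, received) if z == 0]

    itt_outcome = mean(outcome_assigned) - mean(outcome_control)
    itt_receipt = mean(receipt_assigned) - mean(receipt_control)
    if itt_receipt == 0:
        raise ValueError("assignment did not change treatment receipt")
    return itt_outcome / itt_receipt


# Hypothetical experiment: 8 subjects, half assigned to treatment,
# and only half of those assigned actually receive it.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
received = [1, 1, 0, 0, 0, 0, 0, 0]
outcome = [1, 1, 0, 0, 0, 0, 1, 0]

print("Estimated complier average causal effect:",
      wald_cace(assigned, received, outcome))
```

Here the intention-to-treat effect on the outcome is 0.25 and assignment raises treatment receipt by 0.5, so the ratio is 0.5. The estimate applies only to compliers, and it is undefined when assignment has no effect on receipt.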

This paper uses British Household Panel Survey data to estimate the effects of divorce and widowhood on political attitudes and political behavior. In contrast to previous research, which mostly relied on cross-sectional data, a matched propensity score analysis does not find any effects of transitions out of marriage on policy preferences, party identification, and vote choice. The results also show that divorce (but not widowhood) substantially reduces electoral participation. Some preliminary evidence suggests that this effect of divorce on turnout is partially attributable to the increased residential mobility that accompanies divorce.

The propensity score plays a central role in a variety of settings for causal inference. In particular, matching and weighting methods based on the estimated propensity score have become increasingly common in observational studies. Despite their popularity and theoretical appeal, the main practical difficulty of these methods is that the propensity score must be estimated. Researchers have found that slight misspecification of the propensity score model can result in substantial bias of estimated treatment effects. In this paper, we introduce covariate balancing propensity score (CBPS) estimation, which simultaneously optimizes the covariate balance and the prediction of treatment assignment. We exploit the dual characteristics of the propensity score as a covariate balancing score and the conditional probability of treatment assignment and estimate the CBPS within the generalized method of moments or empirical likelihood framework. We find that the CBPS dramatically improves the poor empirical performance of propensity score matching and weighting methods reported in the literature. We also show that the CBPS can be extended to a number of other important settings, including the estimation of generalized propensity score for non-binary treatments, causal inference in longitudinal settings, and the generalization of experimental and instrumental variable estimates to a target population.

Missing data are frequently encountered in the statistical analysis of randomized experiments. In this article, I propose statistical methods that can be used to analyze randomized experiments with a nonignorable missing binary outcome where the missing-data mechanism may depend on the unobserved values of the outcome variable itself. I first introduce an identification strategy for the average treatment effect and compare it with the existing alternative approaches in the literature. I then derive the maximum likelihood estimator and its asymptotic properties, and discuss possible estimation methods. Furthermore, since the proposed identification assumption is not directly verifiable from the data, I show how to conduct a sensitivity analysis based on the parameterization that links the key identification assumption with the causal quantities of interest. Then, the proposed methodology is extended to the analysis of randomized experiments with noncompliance. Although the method introduced in this article may not directly apply to randomized experiments with non-binary outcomes, I briefly discuss possible identification strategies in more general situations. Finally, I apply the proposed methodology to analyze data from the German election experiment and the influenza vaccination study, which originally motivated the methodological problems addressed in this article.

In this case study of the impact of West German television on public support for the East German communist regime, we evaluate the conventional wisdom in the democratization literature that foreign mass media undermine authoritarian rule. We exploit formerly classified survey data and a natural experiment to identify the effect of foreign media exposure using instrumental variable estimators. Contrary to conventional wisdom, East Germans exposed to West German television were more satisfied with life in East Germany and more supportive of the East German regime. To explain this surprising finding, we show that East Germans used West German television primarily as a source of entertainment. Behavioral data on regional patterns in exit visa applications and archival evidence on the reaction of the East German regime to the availability of West German television corroborate this result.

If an experimental treatment is experienced by both treated and control group units, tests of hypotheses about causal effects may be difficult to conceptualize let alone execute. In this paper, we show how counterfactual causal models may be written and tested when theories suggest spillover or other network-based interference among experimental units. We show that the ``no interference'' assumption need not constrain scholars who have interesting questions about interference. We offer researchers the ability to model theories about how treatment given to some units may come to influence outcomes for other units. We further show how to test hypotheses about these causal effects, and we provide tools to enable researchers to assess the operating characteristics of their tests given their own models, designs, test statistics, and data. The conceptual and methodological framework we develop here is particularly applicable to social networks, but may be usefully deployed whenever a researcher wonders about interference between units. Interference between units need not be an untestable assumption; instead, interference is an opportunity to ask meaningful questions about theoretically interesting phenomena.

One of the most common identification strategies in political science is selection on observables. Under this strategy, analysts assume that they observed enough covariates to make treatment status as-if random. Adjustments are then made for observed confounders through statistical methods such as regression or matching. Under adjustment methods such as matching or inverse probability weighting, coefficients for control variables are treated as nuisance parameters and are not directly estimated. This is in direct contrast to regression approaches where estimated parameters are observed for all covariates. Analysts often find it tempting to give a causal interpretation to all the parameters in such regression models, which is not possible under the controls as nuisance parameter approach. In this paper, we illustrate the dangers of treating all the parameters in a regression model as causal parameters. Using Directed Acyclic Graphs, we show how even if some effects are identified in a regression model, many estimated parameters do not represent causal effects or may be direct effects. The general recommendation is for analysts to attempt to identify a single effect and limit interpretation of models to that effect.

In this paper we demonstrate empirically that incumbency is a source of contamination in Germany's mixed electoral system. Using a quasi-experimental research design that allows for causal inference under a weaker set of assumptions than the regression models commonly used in the electoral systems literature, we find that incumbency causes a gain of $1.4$ to $1.7$ percentage points in PR vote shares. We also present simulations of Bundestag seat distributions to demonstrate that contamination effects caused by incumbency are sufficiently large to trigger significant shifts in parliamentary majorities.

It has long been understood that the presence of the ballot initiative process leads to different outcomes among states. In general, extant research has found that the presence of ballot initiatives tends to increase voter turnout and depress state revenues and expenditures. I reconsider this possibility and demonstrate that past findings are an artifact of incorrect research design. Failure to account for differences in states often leads to a confounding association between ballot initiatives and voter turnout and fiscal policy. Here, I conduct an observational study based on a counterfactual model of inference to analyze the effects of ballot initiatives. The resulting research design leads to two analyses. First, I utilize the synthetic case control method, which allows me to compare over time outcomes in states with initiatives to states without initiatives while accounting for pretreatment baseline differences across states. Second, I use matching to assess voter turnout differences across metro areas along state boundaries with and without ballot initiatives. In both analyses, I find that ballot initiatives rarely have spillover effects on voter turnout and state fiscal policy.

This paper presents randomization-based methods for estimating average causal effects under arbitrary interference of known form. We present conservative estimators of the randomization variance of the average treatment effects estimators and a justification for confidence intervals based on a normal approximation. Examples relevant to research in environmental protection, networks experiments, "viral marketing," two-stage disease prophylaxis trials, and stepped-wedge designs are presented.

A distinctive feature of a clustered observational study is its multilevel or nested data structure arising from the assignment of treatment, in a non-random manner, to groups or clusters of units or individuals. Examples are ubiquitous in the health and social sciences including patients in hospitals, employees in firms, and students in schools. What is the optimal matching strategy in a clustered observational study? At first thought, one might start by matching clusters of individuals and then, within matched clusters, continue by matching individuals. But, as we discuss in this paper, the optimal strategy is the opposite: first match individuals and, once all possible combinations of matched individuals are known, then match clusters. In this paper we use dynamic and integer programming to implement this strategy and extend optimal matching methods to hierarchical and multilevel settings. In particular, our method attempts to replicate a paired clustered randomized study by finding the largest sample of matched pairs of treated and control individuals within matched pairs of treated and control clusters that is balanced according to specifications given by the user. Our method directly balances covariates both at the cluster and individual levels and does not require estimating the propensity score, although the propensity score can be balanced as an additional covariate. We illustrate our method on a case study of the comparative effectiveness of public versus private voucher schools in Chile, a question of intense policy debate in the country at the present.

Building on an idea in Abadie and Gardeazabal (2003), this article investigates the application of synthetic control methods to comparative case studies. We discuss the advantages of these methods and apply them to study the effects of Proposition 99, a large-scale tobacco control program that California implemented in 1988. We demonstrate that following Proposition 99 tobacco consumption fell markedly in California relative to a comparable synthetic control region. We estimate that by the year 2000 annual per-capita cigarette sales in California were about 26 packs lower than what they would have been in the absence of Proposition 99. Given that many policy interventions and events of interest in social sciences take place at an aggregate level (countries, regions, cities, etc.) and affect a small number of aggregate units, the potential applicability of synthetic control methods to comparative case studies is very large, especially in situations where traditional regression methods are not appropriate. The methods proposed in this article produce informative inference regardless of the number of available comparison units, the number of available time periods, and whether the data are individual (micro) or aggregate (macro). Software to compute the estimators proposed in this article is available at the authors' web pages.
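The core idea of a synthetic control can be sketched in a deliberately simplified form. The example below assumes only two comparison units and matches on pre-treatment outcomes alone, so the weight has a closed form. The article's method uses many comparison units, additional predictors, and a constrained optimizer. All numbers and function names are hypothetical.

```python
def two_donor_weight(treated_pre, donor_a_pre, donor_b_pre):
    """Weight w in [0, 1] on donor A (1 - w on donor B) that minimizes the
    squared pre-treatment gap between the treated unit and the combination."""
    numerator = sum(
        (y - b) * (a - b)
        for y, a, b in zip(treated_pre, donor_a_pre, donor_b_pre)
    )
    denominator = sum((a - b) ** 2 for a, b in zip(donor_a_pre, donor_b_pre))
    if denominator == 0:
        raise ValueError("donors are identical in the pre-treatment period")
    # The objective is quadratic in w, so clipping the unconstrained
    # minimizer to [0, 1] gives the constrained minimizer.
    return min(1.0, max(0.0, numerator / denominator))


def synthetic_path(weight, donor_a, donor_b):
    """Outcome path of the weighted combination of the two donors."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(donor_a, donor_b)]


# Hypothetical pre-treatment outcomes (three periods).
treated_pre = [10.0, 12.0, 14.0]
donor_a_pre = [8.0, 10.0, 12.0]
donor_b_pre = [12.0, 14.0, 16.0]

# Hypothetical post-treatment outcomes (one period).
treated_post = [11.0]
donor_a_post = [14.0]
donor_b_post = [18.0]

weight = two_donor_weight(treated_pre, donor_a_pre, donor_b_pre)
synthetic_post = synthetic_path(weight, donor_a_post, donor_b_post)
gaps = [y - s for y, s in zip(treated_post, synthetic_post)]

print("weight on donor A:", weight)
print("post-treatment gap (treated minus synthetic):", gaps)
```

In this toy setup the treated unit's pre-treatment path is exactly the average of the two donors, so the weight is 0.5 and the post-treatment gap of -5 is read as the estimated effect. Restricting the weights to be nonnegative and to sum to one keeps the synthetic unit an interpolation of the comparison units and avoids extrapolation.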

Causal mediation analysis is routinely conducted by applied researchers in a variety of disciplines including epidemiology, political science, psychology, and sociology. The goal of such an analysis is to investigate alternative causal mechanisms by examining the roles of intermediate variables that lie in the causal path between the treatment and outcome variables. In this paper, we first prove that under a particular version of the sequential ignorability assumption, the average causal mediation effect (ACME) is nonparametrically identified. We compare our identifying assumption with those proposed in the literature. Some practical implications of our identification result are also discussed. In particular, the popular estimator based on the linear structural equation model (LSEM) can be interpreted as an ACME estimator if the linearity and no-interaction assumptions are satisfied in addition to the proposed assumption. We show that this assumption can easily be relaxed within the framework of LSEM. Second, we consider a simple nonparametric estimator of the ACME in order to relax distributional and functional form assumptions. We also discuss a more general nonparametric approach. Third, we propose a new sensitivity analysis that can be easily implemented by applied researchers within the standard LSEM framework. Like the existing identifying assumptions, the proposed assumption may be too strong in many applied settings. Thus, sensitivity analysis is essential in order to examine the robustness of empirical findings to the possible existence of an unmeasured confounder. Finally, we apply the proposed methods to a randomized experiment from political psychology.

We demonstrate four techniques that utilize case studies to improve causal inference within the Rosenbaum [2002, 2009] approach to observational studies. This approach accommodates small to medium sample sizes in a nonparametric framework and does not require the elicitation of Bayesian priors. First, we show that this approach allows case studies to ameliorate the effects of poorly measured outcomes, sometimes reducing p-values. Second, we show that qualitative information can be incorporated in an analysis and presented as qualitative confidence intervals. Third, we demonstrate that a standard technique of comparative case studies can improve sensitivity analysis within this framework, sometimes reducing the sensitivity of p-values to unmeasured confounders. Finally, we demonstrate that qualitative information on the heterogeneity of treatments can be used to check the robustness of p-values. We illustrate these methods by examining the effect of not having a runoff provision on opposition harassment in transitional presidential elections in 1990s sub-Saharan Africa.

Estimating causal interaction effects is essential for the exploration of heterogeneous treatment effects. In the presence of multiple treatment variables with each having several levels, researchers are often interested in identifying the combinations of treatments that induce large additional causal effects beyond the sum of separate effects attributable to each treatment. We show, however, that the standard definition of causal interaction effect, typically estimated with the standard linear regression or ANOVA, suffers from the lack of invariance to the choice of baseline condition and the difficulty of interpretation beyond two-way interaction. We propose an alternative definition of causal interaction effect, called the marginal treatment interaction effect, whose relative magnitude does not depend on the choice of baseline condition while maintaining an intuitive interpretation even for higher-order interaction. The proposed approach enables researchers to effectively summarize the structure of causal interaction in high dimensions by decomposing the total effect of any treatment combination into the marginal effects and the interaction effects. We also establish the identification condition and develop an estimation strategy for the proposed marginal treatment interaction effects. Our motivating example is conjoint analysis where the existing literature largely assumes the absence of causal interaction. Given a large number of interaction effects, we apply a variable selection method to identify significant causal interaction. Our exploratory analysis of a survey experiment on immigration preferences reveals substantive insights the standard conjoint analysis fails to discover.

Sekhon (2006; 2004a) and Diamond and Sekhon (2005) propose a matching method, called Genetic Matching, which algorithmically maximizes the balance of covariates between treatment and control observations via a genetic search algorithm (Sekhon and Mebane 1998). The method is neutral as to what measures of balance one wishes to optimize. By default, cumulative probability distribution functions of a variety of standardized statistics are used as balance metrics and are optimized without limit. The statistics are not used to conduct formal hypothesis tests, because no measure of balance is a monotonic function of bias in the estimand of interest and because we wish to maximize balance. Descriptive measures of discrepancy generally ignore key information related to bias which is captured by probability distribution functions of standardized test statistics. For example, using several descriptive metrics, one is unable reliably to recover the experimental benchmark in a testbed dataset for matching estimators (Dehejia and Wahba 1999). And these metrics, unlike those based on optimized distribution functions, perform poorly in a series of Monte Carlo sampling experiments just as one would expect given their properties.