Category Archives: Econometrics

A “causal empiricist” turn has swept through economics over the past couple of decades. As a result, many economists are primarily interested in internally valid treatment effects according to the causal models of Rubin, meaning they are interested in credible statements of how some outcome Y is affected if you manipulate some treatment T given some covariates X. That is, to the extent that the full functional form Y=f(X,T) is impossible to estimate because of unobserved confounding variables or similar, it turns out to still be possible to estimate some feature of that functional form, such as the average treatment effect E(f(X,1))-E(f(X,0)). At some point, people like Angrist and Imbens will win a Nobel prize not only for their applied work, but also for clarifying precisely what various techniques are estimating in a causal sense. For instance, an instrumental variable regression under a certain exclusion restriction (let’s call this an “auxiliary assumption”) estimates the average treatment effect along the local margin of people induced into treatment. If you try to estimate the same empirical feature using a different IV, and get a different treatment effect, we all know now that there wasn’t a “mistake” in either paper, but rather that the margins upon which the two different IVs operate may not be identical. Great stuff.

This causal model emphasis has been controversial, however. Social scientists have quibbled because causal estimates generally require the use of small, not-necessarily-general samples, such as those from a particular subset of the population or a particular set of countries, rather than national data or the universe of countries. Many statisticians have gone even further, suggesting that multiple regression with its linear parametric form does not take advantage of enough data in the joint distribution of (Y,X), and hence better predictions can be made with so-called machine learning algorithms. And the structural economists argue that the parameters we actually care about are much broader than regression coefficients or average treatment effects, and hence a full structural model of the data generating process is necessary. We have, then, four different techniques to analyze a dataset: multiple regression with control variables, causal empiricist methods like IV and regression discontinuity, machine learning, and structural models. What exactly does each of these estimate, and how do they relate?

Peter Aronow and Cyrus Samii, two hotshot young political economists, take a look at old fashioned multiple regression. Imagine you want to estimate y=a+bX+cT, where T is a possibly-binary treatment variable. Assume away any omitted variable bias, and more generally assume that all of the assumptions of the OLS model (linearity in covariates, etc.) hold. What does that coefficient c on the treatment indicator represent? This coefficient is a weighted combination of the individual estimated treatment effects, where more weight is given to units whose treatment status is not well explained by covariates. Intuitively, if you are regressing, say, the probability of civil war on participation in international institutions, then if a bunch of countries with very similar covariates all participate, the “treatment” of participation will be swept up by the covariates, whereas if a second group of countries with similar covariates all have different participation status, the regression will put a lot of weight toward those countries since differences in outcomes can be related to participation status.
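To make the weighting concrete, here is a minimal sketch on made-up data, not anything from Aronow and Samii's paper. It assumes a single binary covariate x, a binary treatment t, and unit-level effects that differ across the two covariate groups; the group sizes, effects, and helper names (`solve`, `ols`) are all hypothetical. In this saturated toy case the coefficient on t coincides with the average of unit effects weighted by the squared residual from regressing t on the covariates.

```python
# Hypothetical toy data: two covariate groups of 100 units each.
# Group x=0: treatment is close to a coin flip (50 treated), true effect 1.
# Group x=1: treatment is nearly determined by x (95 treated), true effect 3.
rows = []  # each row: (x, t, unit_effect)
for x, n_treated, effect in [(0, 50, 1.0), (1, 95, 3.0)]:
    for i in range(100):
        rows.append((x, 1 if i < n_treated else 0, effect))

# Noise-free outcome: y = 2 + 0.5*x + unit_effect * t
y = [2.0 + 0.5 * x + eff * t for x, t, eff in rows]

def solve(a, b):
    """Solve the square linear system a z = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(design, outcome):
    """OLS coefficients from the normal equations (X'X) beta = X'y."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, outcome)) for i in range(k)]
    return solve(xtx, xty)

# Regress y on (1, x, t); the last coefficient is c.
c_hat = ols([[1.0, x, t] for x, t, _ in rows], y)[2]

# Regress t on (1, x); the squared residuals act as implicit weights.
g0, g1 = ols([[1.0, x] for x, _, _ in rows], [t for _, t, _ in rows])
weights = [(t - (g0 + g1 * x)) ** 2 for x, t, _ in rows]
weighted_effect = sum(w * r[2] for w, r in zip(weights, rows)) / sum(weights)

ate = sum(eff for _, _, eff in rows) / len(rows)
print(round(c_hat, 4), round(weighted_effect, 4), ate)
```

In this toy setup the regression coefficient comes out near 1.32 even though the average effect across all units is 2: the group whose treatment status is almost fully explained by x carries only about 16% of the weight despite being half the sample.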

This turns out to be quite consequential: Aronow and Samii look at one paper on FDI and find that even though the paper used a broadly representative sample of countries around the world, about 10% of the countries accounted for more than 50% of the weight in the treatment effect estimate, with very little weight on a number of important regions, including all of the Asian tigers. In essence, the sample was general, but the effective sample once you account for weighting was just as limited as some of the “nonrepresentative samples” people complain about when researchers have to resort to natural or quasinatural experiments! It turns out that similar effective vs. nominal representativeness results hold even with nonlinear models estimated via maximum likelihood, so this is not a result unique to OLS. Aronow and Samii’s result matters for interpreting bodies of knowledge as well. If you replicate a paper adding in an additional covariate, and get a different treatment effect, it may not reflect omitted variable bias! The difference may simply result from the additional covariate changing the effective weighting on the treatment effect.

So the “externally valid treatment effects” we have been estimating with multiple regression aren’t so representative at all. When, then, is old fashioned multiple regression controlling for observable covariates a “good” way to learn about the world, compared to other techniques? I’ve tried to think through this in a uniform way; let’s see if it works. First consider machine learning, where we want to estimate y=f(X,T). Assume that there are no unobservables relevant to the estimation. The goal is to estimate the functional form f nonparametrically but to avoid overfitting, and statisticians have devised a number of very clever ways to do this. The proof that they work is in the pudding: cars drive themselves now. It is hard to see any reason why, if there are no unobservables, we wouldn’t want to use these machine learning/nonparametric techniques. However, at present the machine learning algorithms people use literally depend only on data in the joint distribution (X,Y), and not on any auxiliary assumptions. To interpret the marginal effect of a change in T as some sort of “treatment effect” that can be manipulated with policy, if estimated without auxiliary assumptions, requires some pretty heroic assumptions about the lack of omitted variable bias which essentially will never hold in most of the economic contexts we care about.

Now consider the causal model, where y=f(X,U,T) and you are interested in what would happen with covariates X and unobservables U if treatment T were changed to a counterfactual. All of these techniques require a particular set of auxiliary assumptions: randomization requires the SUTVA assumption that treatment of one unit does not affect the outcome of another unit, IV requires the exclusion restriction, diff-in-diff requires the parallel trends assumption, and so on. In general, auxiliary assumptions will only hold in certain specific contexts, and hence by construction the result will not be representative. Further, these techniques are very limited in that they can’t recover every conditional aspect of y, but rather recover only summary statistics like the average treatment effect. Techniques like multiple regression with covariate controls, or machine learning nonparametric estimates, can draw on a more general dataset, but as Aronow and Samii pointed out, the marginal effect on treatment status they identify is not necessarily effectively drawing on a more general sample.

Structural folks are interested in estimating y=f(X,U,V(t),T), where U and V are unobserved, and the nature of the unobserved variables V is affected by t. For example, V may be inflation expectations, T may be the interest rate, y may be inflation today, and X and U are observable and unobservable country characteristics. Put another way, the functional form of f may depend on how exactly T is modified, through V(t). This Lucas Critique problem is assumed away by the auxiliary assumptions in causal models. In order to identify a treatment effect, then, additional auxiliary assumptions generally derived from economic theory are needed in order to understand how V will change in response to a particular treatment type. Even more common is to use a set of auxiliary assumptions to find a sufficient statistic for the particular parameter desired, which may not even be a treatment effect. In this sense, structural estimation is similar to causal models in one way and different in two. It is similar in that it relies on auxiliary assumptions to help extract particular parameters of interest when there are unobservables that matter. It is different in that it permits unobservables to be functions of policy, and that it uses auxiliary assumptions whose credibility leans more heavily on non-obvious economic theory. In practice, structural models often also require auxiliary assumptions which do not come directly from economic theory, such as assumptions about the distribution of error terms which are motivated on the basis of statistical arguments, but in principle this distinction is not a first order difference.

We then have a nice typology. Even if you have a completely universal and representative dataset, multiple regression controlling for covariates does not generally give you a “generalizable” treatment effect. Machine learning can try to extract treatment effects when the data generating process is wildly nonlinear, but has the same nonrepresentativeness problem and the same “what about omitted variables” problem. Causal models can extract some parameters of interest from nonrepresentative datasets where it is reasonable to assume certain auxiliary assumptions hold. Structural models can extract more parameters of interest, sometimes from more broadly representative datasets, and even when there are unobservables that depend on the nature of the policy, but these models require auxiliary assumptions that can be harder to defend. The so-called sufficient statistics approach tries to retain the former advantages of structural models while reducing the heroics that auxiliary assumptions need to perform.

Aronow and Samii is forthcoming in the American Journal of Political Science; the final working paper is at the link. Related to this discussion, Ricardo Hausmann caused a bit of a stir online this week with his “constant adaptation rather than RCT” article. His essential idea was that, unlike with a new medical drug, social science interventions vary drastically depending on the exact place or context; that is, external validity matters so severely that slowly moving through “RCT: Try idea 1”, then “RCT: Try idea 2”, is less successful than smaller, less precise explorations of the “idea space”. He received a lot of pushback from the RCT crowd, but I think for the wrong reason: the constant iteration is less likely to discover underlying mechanisms than even an RCT, as it is still far too atheoretical. The link Hausmann makes to “lean manufacturing” is telling: GM famously (Henderson and Helper 2014) took photos of every square inch of their joint venture plant with NUMMI, and tried to replicate this plant in their other plants. But the underlying reason NUMMI and Toyota worked has to do with the credibility of various relational contracts, rather than the (constantly iterated) features of the shop floor. Iterating without attempting to glean the underlying mechanisms at play is not a rapid route to good policy.

Angus Deaton, the Scottish-born, Cambridge-trained Princeton economist, best known for his careful work on measuring the changes in wellbeing of the world’s poor, has won the 2015 Nobel Prize in economics. His data collection is fairly easy to understand, so I will leave larger discussion of exactly what he has found to the general news media; Deaton’s book “The Great Escape” provides a very nice summary of what he has found as well, and I think a fair reading of his development preferences is that he much prefers the currently en vogue idea of just giving cash to the poor and letting them spend it as they wish.

Essentially, when one carefully measures consumption, health, or generic characteristics of wellbeing, there has been tremendous improvement indeed in the state of the world’s poor. National statistics do not measure these ideas well, because developing countries do not tend to track data at the level of the individual. Indeed, even in the United States, we have only recently begun work on localized measures of the price level and hence the poverty rate. Deaton claims, as in his 2010 AEA Presidential Address (previously discussed briefly on two occasions on AFT), that many of the measures of global inequality and poverty used by the press are fundamentally flawed, largely because of the weak theoretical justification for how they link prices across regions and countries. Careful non-aggregate measures of consumption, health, and wellbeing, like those generated by Deaton, Tony Atkinson, Alwyn Young, Thomas Piketty and Emmanuel Saez, are essential for understanding how human welfare has changed over time and space, and are a deserving rationale for a Nobel.

The surprising thing about Deaton, however, is that despite his great data-collection work and his interest in development, he is famously hostile to the “randomista” trend which proposes that randomized control trials (RCT) or other suitable tools for internally valid causal inference are the best way of learning how to improve the lives of the world’s poor. This mode is most closely associated with the enormously influential J-PAL lab at MIT, and there is no field in economics where you are less likely to see traditional price theoretic ideas than modern studies of development. Deaton is very clear on his opinion: “Randomized controlled trials cannot automatically trump other evidence, they do not occupy any special place in some hierarchy of evidence, nor does it make sense to refer to them as “hard” while other methods are “soft”… [T]he analysis of projects needs to be refocused towards the investigation of potentially generalizable mechanisms that explain why and in what contexts projects can be expected to work.” I would argue that Deaton’s work is much closer to more traditional economic studies of development than to RCTs.

To understand this point of view, we need to go back to Deaton’s earliest work. Among Deaton’s most famous early papers was his well-known development of the Almost Ideal Demand System (AIDS) in 1980 with Muellbauer, a paper chosen as one of the 20 best published in the first 100 years of the AER. It has long been known that individual demand equations which come from utility maximization must satisfy certain properties. For example, a rational consumer’s demand for food should not depend on whether the consumer’s equivalent real salary is paid in American or Canadian dollars. These restrictions turn out to be useful in that if you want to know how demand for various products depends on changes in income, among many other questions, the restrictions of utility theory simplify estimation greatly by reducing the number of free parameters. The problem is in specifying a form for aggregate demand, such as how demand for cars depends on the incomes of all consumers and prices of other goods. It turns out that, in general, aggregate demand generated by utility-maximizing households does not satisfy the same restrictions as individual demand; you can’t simply assume that there is a “representative consumer” with some utility function and demand function equal to that of each individual agent. What form should we write for aggregate demand, and how congruent is that form with economic theory? Surely an important question if we want to estimate how a shift in taxes on some commodity, or a policy of giving some agricultural input to some farmers, is going to affect demand for output, its price, and hence welfare!

Let q(j)=D(p,c,e) say that the quantity of j consumed in aggregate is a function of the prices of all goods p and the total consumption (or average consumption) c, plus perhaps some random error e. This can be tough to estimate: if D(p,c,e)=Ap+e, where demand is just a linear function of relative prices, then we have a k-by-k matrix to estimate, where k is the number of goods. Worse, that demand function is also imposing an enormous restriction on what individual demand functions, and hence utility functions, look like, in a way that theory does not necessarily support. The AIDS of Deaton and Muellbauer combines two facts, that Taylor expansions approximately linearize nonlinear functions and that individual demand can be aggregated even when heterogeneous across individuals if the restrictions of Muellbauer’s PIGLOG papers are satisfied, to show a functional form for aggregate demand D which is consistent with aggregated individual rational behavior and which can sometimes be estimated via OLS. They use British data to argue that aggregate demand violates testable assumptions of the model and hence factors like credit constraints or price expectations are fundamental in explaining aggregate consumption.
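For reference, the usual textbook statement of the AIDS share equations is sketched below. The notation is an assumption layered onto this post rather than taken from it: w_i is the budget share of good i, p_j the price of good j, c total expenditure (matching the c above), and P a price index. Replacing P with the Stone index P* is what makes the system linear in parameters and hence estimable equation by equation via OLS.

```latex
% AIDS budget-share equations (standard form; symbols are this note's assumption)
w_i = \alpha_i + \sum_{j} \gamma_{ij} \log p_j + \beta_i \log\!\left(\frac{c}{P}\right)

% Translog price index
\log P = \alpha_0 + \sum_{k} \alpha_k \log p_k
       + \frac{1}{2} \sum_{k} \sum_{j} \gamma_{kj} \log p_k \log p_j

% Stone index used in the linear approximation
\log P^{*} = \sum_{k} w_k \log p_k

% Restrictions implied by utility theory
\sum_i \alpha_i = 1, \quad \sum_i \gamma_{ij} = 0, \quad \sum_i \beta_i = 0
  \quad \text{(adding up)}
\sum_j \gamma_{ij} = 0 \quad \text{(homogeneity)}, \qquad
\gamma_{ij} = \gamma_{ji} \quad \text{(symmetry)}
```

Homogeneity is the formal version of the "American or Canadian dollars" example: scaling all prices and expenditure by the same factor leaves every budget share unchanged.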

This exercise brings up a number of first-order questions for a development economist. First, it shows clearly the problem with estimating aggregate demand as a purely linear function of prices and income, as if society were a single consumer. Second, it shows the importance of how we measure the overall price level in figuring out the effects of taxes and other policies. Third, it combines theory and data to convincingly suggest that models which estimate demand solely as a function of current prices and current income are necessarily going to give misleading results, even when demand is allowed to take on very general forms as in the AIDS model. A huge body of research since 1980 has investigated how we can better model demand in order to credibly evaluate demand-affecting policy. All of this is very different from how a certain strand of development economist today might investigate something like a subsidy. Rather than taking observational data, these economists might look for a random or quasirandom experiment where such a subsidy was introduced, and estimate the “effect” of that subsidy directly on some quantity of interest, without concern for how exactly that subsidy generated the effect.

To see the difference between randomization and more structural approaches like AIDS, consider the following example from Deaton. You are asked to evaluate whether China should invest more in building railway stations if they wish to reduce poverty. Many economists trained in a manner influenced by the randomization movement would say, well, we can’t just regress the existence of a railway on a measure of city-by-city poverty. The existence of a railway station depends on both things we can control for (the population of a given city) and things we can’t control for (subjective belief that a town is “growing” when the railway is plopped there). Let’s find something that is correlated with rail station building but uncorrelated with the random component of how rail station building affects poverty: for instance, a city may lie on a geographically-accepted path between two large cities. If certain assumptions hold, it turns out that a two-stage “instrumental variable” approach can use that “quasi-experiment” to generate the LATE, or local average treatment effect. This effect is the average benefit of a railway station on poverty reduction, at the local margin of cities which are just induced by the instrument to build a railway station. Similar techniques, like difference-in-difference and randomized control trials, under slightly different assumptions can generate credible LATEs. In development work today, it is very common to see a paper where large portions are devoted to showing that the assumptions (often untestable) of a given causal inference model are likely to hold in a given setting, then finally claiming that the treatment effect of X on Y is Z. That LATEs can be identified outside of a purely randomized context is incredibly important and valuable, and the economists and statisticians who did the heavy statistical lifting on this so-called Rubin model will absolutely and justly win an Economics Nobel sometime soon.
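The railway logic can be sketched with invented numbers; none of these figures come from Deaton or from any real study. The sketch assumes a binary instrument z (the city lies on a path between two large cities), three hypothetical types of city, and no "defiers" (no city that builds a station only when it is off the path). With one binary instrument, the two-stage estimate reduces to the Wald ratio of the reduced form to the first stage.

```python
# Hypothetical cities grouped by how they respond to the instrument z
# (z = 1 if the city lies on a path between two large cities).
# Entry: (type, cities per instrument arm, baseline poverty, station effect)
CITY_TYPES = [
    ("always_taker", 20, 20.0, -1.0),  # builds a station whatever z is
    ("never_taker", 50, 30.0, -8.0),   # never builds a station
    ("complier", 30, 25.0, -4.0),      # builds a station only if z = 1
]

def builds_station(kind, z):
    """Treatment status t as a function of city type and instrument."""
    if kind == "always_taker":
        return 1
    if kind == "never_taker":
        return 0
    return z

rows = []  # (z, t, poverty)
for z in (0, 1):
    for kind, count, baseline, effect in CITY_TYPES:
        t = builds_station(kind, z)
        rows.extend([(z, t, baseline + effect * t)] * count)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def arm_gap(column):
    """Difference in a column's mean between the z = 1 and z = 0 arms."""
    return (mean(r[column] for r in rows if r[0] == 1)
            - mean(r[column] for r in rows if r[0] == 0))

wald = arm_gap(2) / arm_gap(1)  # reduced form divided by first stage

# Naive comparison of cities with and without stations (confounded by type).
naive = (mean(y for _, t, y in rows if t == 1)
         - mean(y for _, t, y in rows if t == 0))

total = sum(count for _, count, _, _ in CITY_TYPES)
ate = sum(count * effect for _, count, _, effect in CITY_TYPES) / total

print(f"Wald/IV: {wald:.2f}  naive: {naive:.2f}  ATE: {ate:.2f}")
```

By construction the Wald ratio recovers the compliers' effect (-4), which differs both from the naive comparison and from the population average effect (-5.4). That gap between the LATE and the average effect is the crux of the debate described here.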

However, this use of instrumental variables would surely seem strange to the old Cowles Commission folks: Deaton is correct that “econometric analysis has changed its focus over the years, away from the analysis of models derived from theory towards much looser specifications that are statistical representations of program evaluation. With this shift, instrumental variables have moved from being solutions to a well-defined problem of inference to being devices that induce quasi-randomization.” The traditional use of instrumental variables was that after writing down a theoretically justified model of behavior or aggregates, certain parameters – not treatment effects, but parameters of a model – are not identified. For instance, price and quantity transacted are determined by the intersection of aggregate supply and aggregate demand. Knowing, say, that price and quantity were (a,b) today, and are (c,d) tomorrow, does not let me figure out the shape of either the supply or demand curve. If price and quantity both rise, it may be that demand alone has increased pushing the demand curve to the right, or that demand has increased while the supply curve has also shifted to the right a small amount, or many other outcomes. An instrument that increases supply without changing demand, or vice versa, can be used to “identify” the supply and demand curves: an exogenous change in the price of oil will affect the price of gasoline without much of an effect on the demand curve, and hence we can examine price and quantity transacted before and after the oil supply shock to find the slope of supply and demand.
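A small numerical sketch of this classical identification argument follows, with every number invented for illustration. It assumes linear demand and supply curves, a demand shock u, and a supply shifter z that does not enter the demand equation. Regressing quantity on price mixes the two curves, while using z as an instrument traces out the demand curve.

```python
# Hypothetical linear market (all coefficients are made up):
#   demand: q = 100 - 2*p + u        (u = demand shock)
#   supply: q = 10 + 3*p + 20*z      (z = exogenous supply shifter)
DEMAND_SLOPE, SUPPLY_SLOPE = -2.0, 3.0

def equilibrium(u, z):
    """Price and quantity at which demand equals supply."""
    p = (100 + u - 10 - 20 * z) / (SUPPLY_SLOPE - DEMAND_SLOPE)
    q = 100 + DEMAND_SLOPE * p + u
    return p, q

# Demand shocks are balanced within each value of z, so cov(u, z) = 0.
observations = [(z, *equilibrium(u, z)) for z in (0, 1) for u in (-10, 10)]
zs, ps, qs = zip(*observations)

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

ols_slope = cov(qs, ps) / cov(ps, ps)  # regress quantity on price
iv_slope = cov(qs, zs) / cov(ps, zs)   # instrument price with z

print(ols_slope, iv_slope)
```

In this toy market the plain regression gives a slope of 0.5, which is neither the demand slope (-2) nor the supply slope (3), while the instrumented estimate returns -2. Here the instrument recovers a parameter of a model written down in advance, which is the older use of IV that Deaton has in mind.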

Note the difference between the supply and demand equation and the treatment effects use of instrumental variables. In the former case, we have a well-specified system of supply and demand, based on economic theory. Once the supply and demand curves are estimated, we can then perform all sorts of counterfactual and welfare analysis. In the latter case, we generate a treatment effect (really, a LATE), but we do not really know why we got the treatment effect we got. Are rail stations useful because they reduce price variance across cities, because they allow for increasing returns to scale in industry to be utilized, or some other reason? Once we know the “why”, we can ask questions like, is there a cheaper way to generate the same benefit? Is heterogeneity in the benefit important? Ought I expect the results from my quasiexperiment in place A and time B to still operate in place C and time D (a famous example being the drug Opren, which was very successful in RCTs but turned out to be particularly deadly when used widely by the elderly)? Worse, the whole idea of LATE is backwards. We traditionally choose a parameter of interest, which may or may not be a treatment effect, and then choose an estimation technique that can credibly estimate that parameter. Quasirandom techniques instead start by specifying the estimation technique and then hunt for a quasirandom setting, or randomize appropriately by “dosing” some subjects and not others, in order to fit the assumptions necessary to generate a LATE. It is often the case that even policymakers do not care principally about the LATE, but rather they care about some measure of welfare impact which rarely is immediately interpretable even if the LATE is credibly known!

Given these problems, why are random and quasirandom techniques so heavily endorsed by the dominant branch of development? Again, let’s turn to Deaton: “There has also been frustration with the World Bank’s apparent failure to learn from its own projects, and its inability to provide a convincing argument that its past activities have enhanced economic growth and poverty reduction. Past development practice is seen as a succession of fads, with one supposed magic bullet replacing another—from planning to infrastructure to human capital to structural adjustment to health and social capital to the environment and back to infrastructure—a process that seems not to be guided by progressive learning.” This is to say, the conditions necessary to estimate theoretical models are so stringent that development economists have been writing noncredible models, estimating them, generating some fad of programs that is used in development for a few years until it turns out not to be silver bullet, then abandoning the fad for some new technique. Better, the randomistas argue, to forget about external validity for now, and instead just evaluate the LATEs on a program-by-program basis, iterating what types of programs we evaluate until we have a suitable list of interventions that we feel confident work. That is, development should operate like medicine.

We have something of an impasse here. Everyone agrees that on many questions theory is ambiguous in the absence of particular types of data, hence more and better data collection is important. Everyone agrees that many parameters of interest for policymaking require certain assumptions, some more justifiable than others. Deaton’s position is that the parameters of interest to economists by and large are not LATEs, and cannot be generated in a straightforward way from LATEs. Thus, following Nancy Cartwright’s delightful phrasing, if we are to “use” causes rather than just “hunt” for what they are, we have no choice but to specify the minimal economic model which is able to generate the parameters we care about from the data. Glen Weyl’s attempt to rehabilitate price theory and Raj Chetty’s sufficient statistics approach are both attempts to combine the credibility of random and quasirandom inference with the benefits of external validity and counterfactual analysis that model-based structural designs permit.

One way to read Deaton’s prize, then, is as an award for the idea that effective development requires theory if we even hope to compare welfare across space and time or to understand why policies like infrastructure improvements matter for welfare and hence whether their beneficial effects will remain when moved to a new context. It is a prize which argues against the idea that all theory does is propose hypotheses. For Deaton, going all the way back to his work with AIDS, theory serves three roles: proposing hypotheses, suggesting which data is worthwhile to collect, and permitting inference on the basis of that data. A secondary implication, very clear in Deaton’s writing, is that even though the “great escape” from poverty and want is real and continuing, that escape is almost entirely driven by effects which are unrelated to aid and which are uninfluenced by the type of small bore, partial equilibrium policies for which randomization is generally suitable. And, indeed, the best development economists very much understand this point. The problem is that the media, and less technically capable young economists, still hold the mistaken belief that they can infer everything they want to infer about “what works” solely using the “scientific” methods of random- and quasirandomization. For Deaton, results that are easy to understand and communicate, like the “dollar-a-day” poverty standard or an average treatment effect, are less virtuous than results which carefully situate numbers in the role most amenable to answering an exact policy question.

Let me leave you three side notes and some links to Deaton’s work. First, I can’t help but laugh at Deaton’s description of his early career in one of his famous “Notes from America”. Deaton, despite being a student of the 1984 Nobel laureate Richard Stone, graduated from Cambridge essentially unaware of how one ought to publish in the big “American” journals like Econometrica and the AER. Cambridge had gone from being the absolute center of economic thought to something of a disconnected backwater, and Deaton, despite writing a paper that would win a prize as one of the best papers in Econometrica published in the late 1970s, had essentially no understanding of the norms of publishing in such a journal! When the history of modern economics is written, the rise of a handful of European programs and their role in reintegrating economics on both sides of the Atlantic will be fundamental. Second, Deaton’s prize should be seen as something of a callback to the ’84 prize to Stone and ’77 prize to Meade, two of the least known Nobel laureates. I don’t think it is an exaggeration to say that the majority of new PhDs from even the very best programs will have no idea who those two men are, or what they did. But as Deaton mentions, Stone in particular was one of the early “structural modelers” in that he was interested in estimating the so-called “deep” or behavioral parameters of economic models in a way that is absolutely universal today, as well as being a pioneer in the creation and collection of novel economic statistics whose value was proposed on the basis of economic theory. Quite a modern research program! Third, of the 19 papers in the AER “Top 20 of all time” whose authors were alive during the era of the economics Nobel, 14 have had at least one author win the prize. Should this be a cause for hope for the living outliers, Anne Krueger, Harold Demsetz, Stephen Ross, John Harris, Michael Todaro and Dale Jorgenson?

For those interested in Deaton’s work beyond this short essay, his methodological essay, quoted often in this post, is here. The Nobel Prize technical summary, always a great and well-written read, can be found here.

Essentially every economist reports Huber-White robust standard errors, rather than traditional standard errors, in their work these days, and for good reason: heteroskedasticity, or heterogeneity in error variance across observations, can lead to incorrect standard error calculations. Generally, robust standard errors are only used to ensure that the parameter of interest is “real” and not an artifact of random statistical variation; the value of the parameter itself is unbiased under heteroskedasticity in many models as long as the model itself is correctly specified. For example, if data is generated by the linear process y=Xb+e, then the estimated parameter b is unbiased even if the OLS assumption that e is homoskedastic is violated. Many researchers just tag a “,robust” onto their Stata code and hope this inoculates them from criticism about the validity of their statistical inference.
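To show what the two kinds of standard error actually compute, here is a minimal sketch for a one-regressor model on invented data. The function name `slope_standard_errors` and the data are hypothetical, and the robust formula uses an n/(n-2) small-sample scaling (often labeled HC1), which I take to be comparable to the default robust option in common packages; treat that correspondence as an assumption.

```python
import math

def slope_standard_errors(x, y):
    """Classical and HC1-style robust standard errors for b in y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    # Classical: one common error variance shared by every observation.
    classical = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    # Huber-White sandwich: each observation keeps its own squared residual.
    meat = sum((xi - x_bar) ** 2 * e * e for xi, e in zip(x, resid))
    robust = math.sqrt(n / (n - 2) * meat / sxx ** 2)
    return classical, robust

# Made-up design: two observations at each x, with errors of opposite sign.
x = [-2, -2, -1, -1, 0, 0, 1, 1, 2, 2]
signs = [1, -1] * 5

# Same error spread everywhere versus spread that grows with |x|.
y_homo = [1 + 2 * xi + s for xi, s in zip(x, signs)]
y_hetero = [1 + 2 * xi + s * abs(xi) for xi, s in zip(x, signs)]

homo_classical, homo_robust = slope_standard_errors(x, y_homo)
het_classical, het_robust = slope_standard_errors(x, y_hetero)
print(homo_classical, homo_robust)
print(het_classical, het_robust)
```

With a constant error spread the two numbers agree; once the spread grows with |x| the robust figure is roughly 30% larger in this example. The slope estimate itself is the same in both cases, which is the sense in which the parameter is unaffected while the inference is not.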

King and Roberts point out, using three very convincing examples from published papers, that robust standard errors have another much more important use. If robust errors and traditional errors are very different, then researchers ought to try to figure out what is causing the heteroskedasticity in their data since, in general, tests like Breusch-Pagan or White’s Test cannot distinguish between model misspecification and fundamental heteroskedasticity. Heteroskedasticity is common in improperly specified models, e.g., when OLS is estimated on data that is truncated at zero.

Nothing here should be surprising. If you are a structural economist, then surely you find the idea of estimating any function other than the precise form suggested by theory (which is rarely OLS) to be quite strange; why would anyone estimate any function other than the one directly suggested by the model, where indeed the model gives you the overall variance structure? But King and Roberts show that such advice is not often heeded.

They first look at a paper in a top International Relations journal, which suggested that small countries receive excess foreign aid (which seems believable at first glance; I spent some time in East Timor a few years ago, a tiny country which seemed to have five IGO workers for every resident). The robust and traditional standard errors diverged enormously. Foreign aid flow amounts are super skewed. Taking a Box-Cox transformation gets the data looking relatively normal again, and rerunning the estimation on the transformed data shows little difference between robust and traditional standard errors. In addition to fixing the heteroskedasticity, transforming the specified model flips the estimated parameter: small countries receive less foreign aid than other covariates might suggest.

King and Roberts then examine a top political science publication (on trade agreements and foreign aid), where again robust and traditional errors diverge. Some diagnostic work finds that a particular detrending technique assumed homogeneous across countries fits much better if done heterogeneously across countries; otherwise, spurious variation over time is introduced. Changing the detrending method causes robust and traditional errors to converge again, and as in the small country aid paper above, the modified model specification completely flips the sign on the parameter of interest. A third example came from a paper using Poisson to estimate overdispersed (variance exceeds the mean) count data; replacing Poisson with the more general truncated negative binomial model again causes robust and traditional errors to converge, and again completely reverses the sign on the parameter of interest. Interesting. If you insist on estimating models that are not fully specified theoretically, then at least use the information that divergent robust standard errors give you about whether your model is sensible.
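
The diagnostic logic can be sketched with a toy simulation. This is not a replication of any of the three papers: the data generating process (an outcome that is linear in logs, and therefore skewed in levels), the helper name `se_ratio`, and the use of a plain log transform as a stand-in for Box-Cox (log is the λ = 0 case) are all my own assumptions. The sketch computes the ratio of robust (HC0) to traditional standard errors for the same regressor, once with the outcome in levels and once in logs.

```python
import math
import random

def se_ratio(x, y):
    """Robust (HC0) over traditional standard error for the slope of y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    b = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    traditional = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    robust = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return robust / traditional

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(5000)]
# Assumed process: homoskedastic and linear in logs, hence skewed in levels.
log_y = [xi + 0.5 * rng.gauss(0, 1) for xi in x]
y = [math.exp(v) for v in log_y]

ratio_levels = se_ratio(x, y)      # misspecified: straight line fit to skewed levels
ratio_logs = se_ratio(x, log_y)    # correctly specified: straight line fit to logs
print(f"levels: {ratio_levels:.2f}  logs: {ratio_logs:.2f}")
```

In this setup the ratio should sit well above one in levels and close to one in logs, which is the pattern King and Roberts suggest reading as a warning about the specification rather than as a nuisance to be patched with ", robust".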

Firms poach engineers and researchers from each other all the time. One important reason to do so is to gain access to the individual’s knowledge. A strain of theory going back to Becker, however, suggests that if, after the poaching, the knowledge remains embodied solely in the new employee, it will be difficult for the firm to profit: surely the new employee will have an enormous amount of bargaining power over wages if she actually possesses unique and valuable information. (As part of my own current research project, I learned recently that Charles Martin Hall, co-inventor of the Hall-Heroult process for aluminum smelting, was able to gather a fortune of around $300 million after he brought his idea to the company that would become Alcoa.)

In a resource-based view of the firm, then, you may hope to not only access a new employee’s knowledge, but also spread it to other employees at your firm. By doing this, you limit the wage bargaining power of the new hire, and hence can scrape off some rents. Singh and Agrawal break open the patent database to investigate this. First, use name and industry data to try to match patentees who have an individual patent with one firm at time t, and then another patent at a separate firm some time later; such an employee has “moved”. We can’t simply check whether the receiving firm cites this new employee’s old patents more often, as there is an obvious endogeneity problem. First, firms may recruit good scientists more aggressively. Second, they may recruit more aggressively in technology fields where they are already planning to do work in the future. This suggests that matching plus diff-in-diff may work. Match every patent to another patent held by an inventor who never switches firms, attempting to find a second patent with very similar citation behavior, inventor age, inventor experience, technology class, etc. Using our matched sample, check how much the propensity to cite the mover’s patent changes compared to the propensity to cite the stayer’s patent. That is, let Joe move to General Electric. Joe had a patent while working at Intel. GE researchers were citing that Intel patent once per year before Joe moved. They were citing a “matched” patent once per year. After the move, they cite the Intel patent 2 times per year, and the “matched” patent 1.1 times per year. The diff-in-diff then suggests that moving increases the propensity to cite the Intel patent at GE by (2-1)-(1.1-1)=.9 citations per year, where the first difference helps account for the first type of endogeneity we discussed above, and the second difference for the second type of endogeneity.
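
The Joe example is just arithmetic, and can be written out as a tiny Python sketch. The function name `diff_in_diff` and its argument names are mine; the four citation rates are the ones from the example above.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change for the mover's patent minus change for the matched stayer's patent."""
    return (treated_after - treated_before) - (control_after - control_before)

# GE citations per year to Joe's Intel patent and to the matched stayer's patent.
effect = diff_in_diff(
    treated_before=1.0, treated_after=2.0,
    control_before=1.0, control_after=1.1,
)
print(round(effect, 2))  # 0.9 citations per year
```

In the paper's setting this difference would be computed for every mover/stayer pair in the matched sample and then averaged; the single pair here is only meant to show what each subtraction is doing.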

What do we find? It is true that, after a move, the average patent held by a mover is cited more often at the receiving firm, especially in the first couple years after a move. Unfortunately, about half of new patents which cite the new employee’s old patent after she moves are made by the new employee herself, and another fifteen percent or so are made by previous patent collaborators of the poached employee. What’s worse, if you examine these citations by year, even five years after the move, citations to the pre-move patent are still highly likely to come from the poached employee. That is, to the extent that the poached employee had some special knowledge, the firm appears to have simply bought that knowledge embodied in the new employee, rather than gained access to useful techniques that quickly spread through the firm.

Three quick comments. First, applied econometrician friends: is there any reason these days to do diff-in-diff linearly rather than using the nonparametric “changes-in-changes” of Athey and Imbens 2006, which allows recovery of the entire distribution of effects of treatment on the treated? Second, we learn from this paper that the mean poached research employee doesn’t see her knowledge spread through the new firm, which immediately suggests the question of whether there are certain circumstances in which such knowledge spreads. Third, this same exercise could be done using all patents held by the moving employee’s old firm – I may be buying access to general techniques owned by the employee’s old firm rather than the specific knowledge represented in that employee’s own pre-move patents. I wonder if there’s any difference.

Nate Hilger is on the market from Harvard this year. His job market paper continues a long line of inference that is probably at odds with mainstream political intuition. Roughly, economists generally support cash rather than in-kind transfers because people tend to be the best judges of the optimal use of money they receive; food stamps are not so useful if you really need to pay the heat bill that week. That said, if the goal is to cause some behavior change among the recipient, in-kind transfers can be more beneficial, especially when the cash transfer would go to a family while the in-kind transfer would go to a child or a wife.

Hilger managed to get his hands on the full universe of IRS data. I’m told by my empirically-minded friends that this data is something of a holy grail, with the IRS really limiting who can use the data after Saez proved its usefulness. IRS data is great because of the 1098T: colleges are required to file information about their students’ college attendance so that the government can appropriately dole out aid and tax credits. Even better, firms that fire or lay off workers file a 1099G. Finally, claimed dependents on the individual tax form let us link parents and children. That’s quite a trove of data!

Here’s a question we can answer with it: does low household income lower college attendance, and would income transfers to poor families help reduce the college attendance gap? In a world with perfect credit markets, it shouldn’t matter, since any student could pledge the human capital she would gain as collateral for a college attendance loan. Of course, pledging one’s human capital turns out to be quite difficult. Even if the loans aren’t there, a well-functioning and comprehensive university aid program should insulate the poor from this type of liquidity problem. Now, we know from previous studies that increased financial aid has a pretty big effect on college attendance among the poor and lower middle class. Is this because the aid is helping loosen the family liquidity constraint?

Hilger uses the following trick. Consider a worker who is laid off. This is only a temporary shock, but this paper and others estimate a layoff lowers discounted lifetime earnings by an average of nearly $100,000. So can we just propensity match laid off and employed workers when the child is college age, and see if the income shock lowers attendance? Not so fast. It turns out that matching on whatever observables we have, children whose fathers are laid off when the child is 19 are also much less likely to attend college than children whose fathers are not laid off, even though age 19 would be after the attendance decision is made. Roughly, a father who is ever laid off is correlated with some nonobservables that lower college attendance of children. So let’s compare children whose dads are laid off at 17 to children whose dads are laid off from a similar firm at age 19, matching on all other observables. The IRS data has so many data points that this is actually possible.

What do we learn? First, consumption (in this case, on housing) spending declines roughly in line with the lifetime income hypothesis after the income shock. Second, there is hardly any effect on college attendance and quality: attendance for children whose dads suffer the large income shock falls by half a percentage point. Further, the decline is almost entirely borne by middle class children, not the very poor or the rich: this makes sense since poor students rely very little on parental funding to pay for college, and the rich have enough assets to overcome any liquidity shock. The quality of college chosen also declines after a layoff, but only by a very small amount. That is, the Engel curve for college spending is very flat: families with more income tend to spend roughly similar amounts on college.

Policy-wise, what does this mean? Other authors have estimated that a $1000 increase in annual financial aid increases college enrollment by approximately three percentage points (a particularly strong effect is found among students from impoverished families); the Kalamazoo experiment shows positive feedback loops that may make the efficacy of such aid even higher, since students will exert more effort in high school knowing that college is a realistic financial possibility. Hilger’s paper shows that a $1000 cash grant to poor families will likely improve college attendance by .007 to .04 percentage points depending on whether the layoff is lowering college attendance due to a transitory or a permanent income shock. That is, financial aid is orders of magnitude more useful in raising college attendance than cash transfers, especially among the poor.
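
To put a rough number on "orders of magnitude", here is the back-of-the-envelope comparison using only the figures quoted above; the variable names are mine, and the calculation simply takes the two sets of estimates at face value as comparable per-$1000 effects.

```python
aid_effect_pp = 3.0                      # pp of enrollment per $1000 of financial aid
cash_effect_pp = (0.007, 0.04)           # pp of attendance per $1000 of cash (low, high)

ratio_min = aid_effect_pp / cash_effect_pp[1]   # aid vs. the most optimistic cash effect
ratio_max = aid_effect_pp / cash_effect_pp[0]   # aid vs. the least optimistic cash effect
print(f"aid is roughly {ratio_min:.0f}x to {ratio_max:.0f}x as effective per dollar")
```

So on these estimates a dollar of aid moves attendance somewhere between about 75 and about 430 times as much as a dollar of cash.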

November 2012 working paper (No IDEAS version). My old Federal Reserve coworker Christopher Herrington is also on the job market, and has a result suggesting the importance of Hilger’s finding. He computes a DSGE model of lifetime human capital formation, and considers the counterfactual where the US has more equal education funding (that is, schools that are centrally funded rather than well-funded in rich areas and poorly-funded in poor areas). Around 15% of eventual earnings inequality – again taking into account many general equilibrium effects – can be explained by the high variance of US education funding. As in Hilger, directly altering the requirement that parents pay for school (either through direct payments at the university level, or by purchasing housing in rich areas at the primary level) can cure a good portion of our growing inequality.

Rebecca Diamond, on the market from Harvard, presented this interesting paper on inequality here on Friday. As is well-known, wage inequality increased enormously from the 1970s until today, with the divergence fairly well split between higher wages at top incomes and higher incomes to higher educated workers. There was simultaneously a great amount of locational sorting: the percentage of a city’s population which is college educated ranges from 15% in the Bakersfield MSA to around 45% in Boston, San Francisco and Washington, DC. Those cities that have attracted the highly educated have also seen huge increases in rent and housing prices. So perhaps the increase in wage inequality is overstated: these lawyers and high-flying tech employees are getting paid a ton, but also living in places where a 2,000 square foot house costs a million dollars.

Diamond notes that this logic is not complete. New York City has become much more expensive, yes, but its crime rate has gone way down, the streets are cleaner, the number of restaurants per capita has boomed, and the presence of highly educated neighbors and coworkers is good for your own productivity in the standard urban spillover models. It may be that wage inequality is underestimated using wage alone if better amenities in cities with lots of educated workers more than compensate for the higher rents.

How to sort this out? If you read this blog, you know the answer: data alone cannot tell you. What we need is a theory of high and low education workers’ location choice and a theory of wage determination. One such theory lets you do the following. First, find a way to identify exogenous changes in labor demand for some industry in cities, which ceteris paribus will increase the wages of workers employed in that industry. Second, note that workers can choose where to work, and that in equilibrium they must receive the same utility from all cities where they could be employed. Every city has a housing supply whose elasticity differs; cities with less land available for development because of water or mountains, and cities with stricter building regulations, have less elastic housing supply. Third, the amenities of a city are endogenous to who lives there; cities with more high education workers tend to have less crime, better symphonies, more restaurants, etc., which may be valued differently by high and low education workers.

Estimating the equilibrium distribution of high and low skill workers takes care. Using an idea from a 1991 paper by Bartik, Diamond notes that some shocks hit industries nationally. For instance, a shock may hit oil production, or hit the semiconductor industry. The first shock would increase low skill labor demand in Houston or Tulsa, and the second would increase high skill labor demand in San Jose and Boston. This tells us what happens to the labor demand curve. As always, to identify the intersection of demand and supply, we also need to identify changes in labor supply. Here, different housing supply elasticity helps us. A labor demand shock in a city with elastic housing supply will cause a lot of workers to move there (since rents won’t skyrocket), with fewer workers moving if housing supply is inelastic.
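
The Bartik idea is easy to sketch: predict each city's labor demand shock by weighting national industry growth by the city's initial industry mix. The code below is a bare-bones illustration of that shift-share construction, not Diamond's actual specification; the function name `bartik_shock`, the industry shares, and the growth rates are all invented for the example.

```python
def bartik_shock(local_shares, national_growth):
    """Predicted local labor demand growth: sum of (local share x national growth)."""
    return sum(share * national_growth[ind] for ind, share in local_shares.items())

# Hypothetical national growth rates by industry.
national_growth = {"oil": 0.10, "semiconductors": 0.20, "other": 0.00}

# Hypothetical initial employment shares by city (each row sums to one).
city_shares = {
    "Houston":  {"oil": 0.30, "semiconductors": 0.05, "other": 0.65},
    "San Jose": {"oil": 0.00, "semiconductors": 0.40, "other": 0.60},
}

shocks = {city: bartik_shock(s, national_growth) for city, s in city_shares.items()}
for city, shock in shocks.items():
    print(f"{city}: predicted labor demand growth {shock:.1%}")
```

The identifying assumption is that the national shock is unrelated to anything else going on in the particular city; in applied work the national growth rate is often computed leaving out the city in question so that local conditions do not feed back into the instrument.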

Estimating the full BLP-style model shows that, in fact, we are underestimating the change in well-being inequality between high and low education workers. The model suggests, no surprise, that both types of workers prefer higher wages, lower rents, and better amenities. However, the elasticity of college worker labor supply to amenities is much higher than that of less educated workers. This means that highly educated workers are more willing to accept lower after-rent wages for a city with better amenities than a less educated worker. Also, the only way to rationalize the city choices of highly educated workers over the time examined is with endogenous amenities; if well-being depends only on wages and rents, then highly educated workers would only have moved where they ended up moving if they didn’t care at all about housing prices. Looking at smaller slices of the data, immigrant workers are much more sensitive to wages: they spend less of their income on housing, and hence care much more about wages when deciding where to live. In terms of spillovers, a 1% increase in the ratio of college educated workers to other workers increases college worker productivity by a half percentage point, and less educated worker productivity by about .2 percentage points.

Backing out the implied value of amenities in each MSA, the MSAs with the best amenities for both high and low education workers include places like Los Angeles and Boston; the least desirable for both types include high-crime Rust Belt cities. Inferred productivity by worker type is very different, however. While both types of workers appear to agree on which cities have the best and worst amenities, the productivity of high skill workers is highest in places like San Jose, San Francisco and New York, whereas productivity for low skill workers is particularly high in San Bernardino, Detroit and Las Vegas. The differential changes in productivity across cities led to re-sorting of different types of workers, which led to differential changes in amenities across cities. The observed pattern of location choices by different types of workers is consistent with a greater increase in well-being between high and low education workers, even taking into account changes in housing costs, than that implied by wage alone!

The data requirements and econometric skill involved in this model are considerable, but it should allow a lot of other interesting questions in urban policy to be answered. I asked Rebecca whether she looked at the welfare impacts of housing supply restrictions. Many cities that have experienced shocks to high education labor demand are also cities with very restrictive housing policies: LA, San Francisco, Boston, DC. In the counterfactual world where DC allowed higher density building, with the same labor demand shocks we actually observed, what would have happened to wages? Or inequality? She told me she is working on a similar idea, but that the welfare impacts are actually nontrivial. More elastic housing supply will cause more workers to move to high productivity cities, which is good. On the other hand, there are spillovers: housing supply restrictions form a fence that makes a city undesirable to low education workers, and all types of workers appear to both prefer highly educated workers and the amenities they bring. Weighing the differential impact of these two effects is an interesting next step.

November 2012 working paper (No IDEAS version). Fittingly on the week James Buchanan died, Diamond also has an interesting paper on rent extraction by government workers on her website. Roughly, government workers like to pay themselves higher salaries. If they raise taxes, private sector workers move away. But when some workers move out, the remaining population gets higher wages and pays lower rents as long as labor demand slopes down and housing supply slopes up. If housing supply is very inelastic, then higher taxes lead workers to leave, which leads to a large decrease in housing costs, which stops the outflow of migration. So if extractive governments are trading off higher taxes against a lower population after the increase, they will ceteris paribus have higher taxes when housing supply is less elastic. And indeed this is true in the data. Interesting!

Job market talks for 2012 have concluded at many schools, and therefore this is my last post on a job candidate paper. This is also the only paper I didn’t have a chance to see presented live, and for good reason: Melissa Dell is clearly this year’s superstar, and I think it’s safe to assume she can have any job she wants, and at a salary she names. I have previously discussed another paper of hers – the Mining Mita paper – which would also have been a mindblowing job market paper; essentially, she gives a cleanly identified and historically important example of long-run effects of institutions a la Acemoglu and Robinson, but the effect she finds is that “bad” institutions in the colonial era led to “good” outcomes today. The mechanism by which historical institutions persist is not obvious and must be examined on a case-by-case basis.

Today’s paper is about another critical issue: the Mexican drug war. Over 40,000 people have been killed in drug-related violence in Mexico in the past half-decade, and that murder rate has been increasing over time. Nearly all of Mexico’s domestic drug production, principally pot and heroin, is destined for the US. There have been suggestions, quite controversial, that the increase in violence is a result of Mexican government policies aimed at shutting down drug gangs. Roughly, some have claimed that when a city arrests leaders of a powerful gang, the power vacuum leads to a violent contest among new gangs attempting to move into that city; in terms of the most economics-laden gang drama, removing relatively non-violent Barksdale only makes it easier for violent Marlo.

But is this true? And if so, when is it true? How ought Mexico deploy scarce drugfighting resources? Dell answers all three questions. First, she notes that the Partido Acción Nacional is, for a number of reasons, associated with greater crackdowns on drug trafficking in local areas. She then runs a regression discontinuity on municipal elections – which vary nicely over time in Mexico – where PAN barely wins versus barely loses. These samples appear balanced according to a huge range of regressors, including the probability that PAN has won elections in the area previously, a control for potential corruption at the local level favoring PAN candidates. In a given municipality-month, the probability of a drug-related homicide rises from 6 percent to 15 percent following a PAN inauguration after such a close election. There does not appear to be any effect during the lame duck period before PAN takes office, so the violence appears correlated to anti-trafficking policies that occur after PAN takes control. There is also no such increase in cases where PAN barely loses. The effect is greatest in municipalities on the border of two large drug gang territories. The effect is also greatest in municipalities where detouring around that city on the Mexican road network heading toward the US is particularly arduous.
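
For readers who haven't run a regression discontinuity, the core comparison looks like the sketch below. It is deliberately crude: a difference in means within a narrow window around the cutoff, on simulated data, rather than whatever local regression specification Dell actually estimates. The function name `rd_difference`, the bandwidth, and the simulated vote margins are assumptions; the 6 and 15 percent probabilities are borrowed loosely from the numbers above to serve as the simulated truth.

```python
import random

def rd_difference(margins, outcomes, bandwidth):
    """Mean outcome just above the cutoff (margin > 0) minus mean just below it."""
    above = [y for m, y in zip(margins, outcomes) if 0 < m <= bandwidth]
    below = [y for m, y in zip(margins, outcomes) if -bandwidth <= m < 0]
    return sum(above) / len(above) - sum(below) / len(below)

rng = random.Random(2)
# Simulated PAN vote margin: positive means PAN won the municipal election.
margins = [rng.uniform(-0.5, 0.5) for _ in range(20000)]
# Simulated indicator for any drug-related homicide in a municipality-month.
outcomes = [1 if rng.random() < (0.15 if m > 0 else 0.06) else 0 for m in margins]

jump = rd_difference(margins, outcomes, bandwidth=0.05)
print(f"estimated jump at the cutoff: {jump:.3f}")
```

The design choice that matters is the bandwidth: only elections decided by a hair are compared, on the argument that which side of the cutoff a municipality lands on is then as good as random.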

These estimates are interesting, and do suggest that Mexican government policy is causally related to increasing drug violence, but the more intriguing question is what we should do about this. Here, the work is particularly fascinating. Dell constructs a graph where the Mexican road network forms edges and municipalities form vertices. She identifies regions which are historical sources of pot and poppyseed production, and identifies ports and border checkpoints. Two models on this graph are considered. In the first model, drug traffickers seek to reach a US port according to the shortest possible route. When PAN wins a close election, that municipality is assumed closed to drug traffic and gangs reoptimize routes. We can then identify which cities are likely to receive diverted drug traffic. Using data on drug possession arrests above $1000 – traffickers, basically – she finds that drug confiscations in the cities expected by the model to get traffic post-elections indeed rise 18 to 25 percent, depending on your measure. This is true even when the predicted new trafficking routes do not have a change in local government party: the change in drug confiscation is not simply PAN arresting more people, but actually does seem like more traffic along the route.
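
The first model amounts to rerunning a shortest-path search with one vertex removed. A minimal Python sketch, using Dijkstra's algorithm on a four-node toy network, is below; the network, its distances, and the function name `shortest_route` are invented for illustration and have nothing to do with the actual Mexican road data.

```python
import heapq

def shortest_route(graph, source, target, closed=frozenset()):
    """Dijkstra's algorithm; vertices listed in `closed` cannot be entered."""
    best = {source: 0.0}
    prev = {}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, length in graph[node].items():
            if nbr in closed:
                continue
            new = dist + length
            if new < best.get(nbr, float("inf")):
                best[nbr] = new
                prev[nbr] = node
                heapq.heappush(queue, (new, nbr))
    if target not in best:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[target]

# Hypothetical network: a producing region, two municipalities, one US port.
roads = {
    "farm": {"A": 1.0, "B": 2.0},
    "A": {"farm": 1.0, "port": 1.0},
    "B": {"farm": 2.0, "port": 2.0},
    "port": {"A": 1.0, "B": 2.0},
}

before = shortest_route(roads, "farm", "port")
after = shortest_route(roads, "farm", "port", closed={"A"})  # PAN wins in A
print(before, after)
```

Before the close election the cheapest route runs through A; once A is closed, traffic is predicted to divert through B, which is exactly the kind of municipality where the model says confiscations should rise.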

A second model is even nicer. She considers the equilibrium where traffickers try to avoid congestion. That is, if all gangs go to the same US port of entry, trafficking is very expensive. She estimates a cost function using pre-election trafficking data that is fairly robust to differing assumptions about the nature of the cost of congestion, and solves for the Wardrop equilibrium, a concept allowing for relatively straightforward computational solutions to congestion games on a network. The model in the pre-election period for which parameters on the costs are estimated very closely matches actual data on known drug trafficking at that time – congestion at US ports appears to be really important, whereas congestion on internal Mexican roads doesn’t matter too much. Now again, she considers the period after close PAN elections, assuming that these close PAN victories increase the cost of trafficking by some amount (results are robust to the exact amount), and re-solves the congestion game from the perspective of the gangs. As in the simpler model, drug trafficking rises by 20 percent or so in municipalities that gain a drug trafficking route after the elections. Probability of drug-related homicides similarly increases. A really nice sensitivity check is performed by checking cocaine interdictions in the same city: they do not increase at all, as expected by the model, since the model maps trafficking routes from pot and poppy production sites to the US, and cocaine is only transshipped to Mexico via ports unknown to the researcher.
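
The equilibrium concept is simple to see with two routes. In the sketch below each route's cost is linear in the traffic it carries, which is my simplifying assumption and not the cost function Dell estimates; the function name `two_route_equilibrium` and all the numbers are likewise hypothetical. In equilibrium no trafficker can lower her cost by switching, so any route that is used must have the same cost.

```python
def two_route_equilibrium(a1, b1, a2, b2, total):
    """Split `total` traffic over two routes with cost_i = a_i + b_i * flow_i.

    Returns (flow1, flow2, cost1, cost2) where the costs of used routes are equal.
    """
    # Solve a1 + b1*f1 = a2 + b2*(total - f1), then clip to the feasible range.
    f1 = (a2 - a1 + b2 * total) / (b1 + b2)
    f1 = min(max(f1, 0.0), total)
    f2 = total - f1
    return f1, f2, a1 + b1 * f1, a2 + b2 * f2

# Before the election: route 1 is cheaper when empty but congests faster.
before = two_route_equilibrium(a1=1.0, b1=2.0, a2=2.0, b2=1.0, total=3.0)
# After a crackdown raises the fixed cost of route 1 by 1.5.
after = two_route_equilibrium(a1=2.5, b1=2.0, a2=2.0, b2=1.0, total=3.0)
print(before)
print(after)
```

Unlike the shortest-path model, a crackdown here does not shift all traffic at once: some flow stays on the now more expensive route because the alternative gets congested, and the equilibrium cost rises for everyone.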

So we know now that, particularly when a territory is on a predicted trafficking route near the boundary of multiple gang territories, violence will likely increase after a crackdown. And we can use the network model to estimate what will happen to trafficking costs if we set checkpoints to make some roads harder to use. Now, given that the government has resources to set checkpoints on N roads, with the goal of increasing trafficking costs and decreasing violence, where ought checkpoints be set? Exact solutions turn out to be impossible – this “vital edges” problem is NP-hard and the number of edges is in the tens of thousands – but approximate algorithms can be used, and Dell shows which areas will benefit most from greater police presence. The same model, as long as data is good enough, can be applied to many other countries. Choosing trafficking routes is a problem played often enough by gangs that if you buy the 1980s arguments about how learning converges to Nash play, then you may believe (I do!) that the problem of selecting where to spend government counter-drug money is amenable to game theory using the techniques Dell describes. Great stuff. Now, between the lines, and understand this is my reading and not Dell’s claim, I get the feeling that she also thinks that the violence spillovers of interdiction are so large that the Mexican government may want to consider giving up altogether on fighting drug gangs.
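
To see why exact solutions are out of reach, consider the brute-force approach on a toy network: try every set of k roads, add a checkpoint penalty to each, and keep the set that raises the traffickers' cheapest route the most. This is a sketch of the shortest-path version of the problem only, not Dell's congestion model or her approximation algorithm; the network, the penalty, and the function names are made up.

```python
import heapq
import math
from itertools import combinations

def shortest_cost(edges, source, target):
    """Cheapest source-to-target cost on an undirected graph {(u, v): length}."""
    graph = {}
    for (u, v), length in edges.items():
        graph.setdefault(u, []).append((v, length))
        graph.setdefault(v, []).append((u, length))
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best[node]:
            continue  # stale queue entry
        for nbr, length in graph.get(node, []):
            new = dist + length
            if new < best.get(nbr, float("inf")):
                best[nbr] = new
                heapq.heappush(queue, (new, nbr))
    return float("inf")

def best_checkpoints(edges, source, target, k, penalty):
    """Exhaustively pick the k edges whose penalty raises the cheapest route most."""
    best_set, best_cost = None, -1.0
    for chosen in combinations(sorted(edges), k):
        trial = dict(edges)
        for edge in chosen:
            trial[edge] += penalty
        cost = shortest_cost(trial, source, target)
        if cost > best_cost:
            best_set, best_cost = chosen, cost
    return best_set, best_cost

# Hypothetical four-road network with two parallel routes to the port.
roads = {("farm", "A"): 1.0, ("A", "port"): 1.0, ("farm", "B"): 2.0, ("B", "port"): 2.0}
one = best_checkpoints(roads, "farm", "port", k=1, penalty=5.0)
two = best_checkpoints(roads, "farm", "port", k=2, penalty=5.0)
print(one, two)

# Illustrative only: number of 10-road subsets of a 20,000-road network.
print(math.comb(20000, 10))
```

With one checkpoint the traffickers just switch routes, so the cost rises only to that of the second-best route; with two, the best plan is to put one on each route. On a real network the number of candidate sets explodes combinatorially, which is why approximation is the only practical option.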

http://econ-www.mit.edu/files/7484 (Nov 2011 Working Paper. I should note that this year is another example of strong female presence at the top of the economics job market. The lack of gender diversity in economics is problematic for a number of reasons, but it does appear things are getting better: Heidi Williams, Alessandra Voena, Melissa Dell, and Aislinn Bohren, among others, have done great work. The lack of socioeconomic diversity continues to be worrying, however; the field does much worse than fellow social sciences at developing researchers hailing from the developing world, or from blue-collar family backgrounds. Perhaps next year.)

Ignore the title of this article; it is simply a nice rhetorical trick to get economists to start using modern tools of the type Judea Pearl has developed to discuss causality. Economists know Haavelmo (winner of the ’89 Nobel) for his “simultaneous equations” paper, in which he notes that regression cannot identify supply and demand simultaneously from a series of (price, quantity) bundles for the simple reason that regression of intersections between supply and demand won’t identify whether the supply curve or the demand curve has shifted. Theoretical assumptions about what changes to the economy affect demand and what affect supply – that is, economics, not statistics – solve the identification problem. (A side note: there is some interesting history on why econometrics comes about as late as it does. Economists until the 40s or so, including Keynes, essentially rejected statistical work in social science. They may have done so with good reason, though! Theories of stochastic processes that were needed to make sense of inference on non-IID variables like an economics time series weren’t yet developed, and economists rightly noted the non-IIDness of their data.)

Haavelmo’s other famous paper is his 1944 work on the probabilistic approach to economics. He notes that a system of theoretical equations is of interest not because of the regression estimate itself, but because of the counterfactual where we vary one parameter keeping another one the same. That is, if we have in our data a joint distribution of X and Y, we are interested in more than simply that joint distribution; rather, we are interested in the counterfactual world where we could control one of those two variables. This is explicitly outside the statistical relationship between X and Y.

With Haavelmo 1944 as a suitable primer, Pearl presents the basic idea of his Structural Causal Models (SCM). This consists of a model M (usually a set of structural equations), a set of assumptions A (omitted factors, exclusion restrictions, correlations, etc.), a set of queries for the model to answer, and some data which is perhaps generated in accordance with A. The outputs are the logical implications of A, a set of data-dependent claims concerning model-dependent magnitudes or likelihoods of each of the queries, and a set of testable implications of the model itself answering questions like “to what extent do the model assumptions match the data?” I’ll ignore in this post, and Pearl generally ignores in the present paper, the much broader question of when failures of that question matter, and further what the word “match” even means.

What’s cool about SCM and causal calculus more generally is that you can answer a bunch of questions without assuming anything about the functional form of relationships between variables – all you need are the causal arrows. Take a model of observed variables plus unobserved exogenous variables. Assume the latter to be independent. The model might be that X is a function of Y, W and an unobserved variable U1, Y is a function of V, W and U2, V is a function of U3 and W is a function of U4. You can draw a graph of causal arrows relating any of these concepts. With that graph in hand, you can answer a huge number of questions of interest to the econometrician. For instance: what are the testable implications of the model if only X and W are measured? Which variables can be used together to get an unbiased estimate of the effect of any one variable on another? Which variables must be measured if we wish to measure the direct effect of any variable on any other? There are many more, with answers found in Pearl’s 2009 textbook. Pearl also comes down pretty harshly on experimentalists of the Angrist type. He notes correctly that experimental potential-outcome studies also rely on a ton of underlying assumptions – concerning external validity, in particular – and at heart structural models just involve stating those assumptions clearly.
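
The example graph above is small enough to work through in code. The sketch below encodes only the observed variables (the U's are independent and each feeds a single variable, so they create no extra paths) and lists the "backdoor" paths from a treatment to an outcome, meaning the paths that begin with an arrow pointing into the treatment. The function name `backdoor_paths` is mine, and this is only path enumeration: it does not implement Pearl's full d-separation test, which also has to handle colliders.

```python
# Structural model from the text: X = f(Y, W, U1), Y = f(V, W, U2), V = f(U3), W = f(U4).
parents = {"X": ["Y", "W"], "Y": ["V", "W"], "V": [], "W": []}

def backdoor_paths(parents, treatment, outcome):
    """List paths from treatment to outcome that start with an arrow INTO treatment."""
    # Build the undirected skeleton of the graph.
    neighbours = {node: set(pa) for node, pa in parents.items()}
    for child, pa in parents.items():
        for p in pa:
            neighbours[p].add(child)
    paths = []

    def walk(path):
        node = path[-1]
        if node == outcome:
            paths.append(path)
            return
        for nxt in sorted(neighbours[node]):
            if nxt not in path:
                walk(path + [nxt])

    for p in sorted(parents[treatment]):
        walk([treatment, p])
    return paths

print(backdoor_paths(parents, "Y", "X"))  # [['Y', 'W', 'X']]
print(backdoor_paths(parents, "V", "X"))  # []
```

The only backdoor path from Y to X runs through the common cause W, so in this graph measuring and conditioning on W is enough to recover the effect of Y on X; V has no parents among the observed variables, so its effect on X needs no adjustment at all. No functional form was assumed anywhere, which is the point.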

Worth a look – and if you find the paper interesting, grab the 2009 book as well.

Christopher Sims, a winner of yesterday’s Nobel, wrote this great little comment in the JEP last year that has been making the blog rounds recently (hat tip to Andrew Gelman and Dan Hirschman). It’s basically a broadside against the “Identification Mafia”/quasinatural experiment type of economics that is particularly prevalent these days in applied micro and development.

The article is short enough that you should read it yourself, but the basic point is that a well-identified causal effect is, in and of itself, insufficient to give policy advice. For instance, if smaller class sizes lead to better outcomes in a quasinatural experiment, we might reasonably wonder why this happens. That is, if I were a principal and I created some small classes and some very large classes – and this is what universities do with the lecture hall/seminar model – am I better off than if I used equal classes all around? A well-estimated structural model can tell you. A simple identified quasinatural experiment cannot. And this problem does not even rely on expectations feedback and other phenomena that make many “experiments” in macro less than plausible.

Two final notes. First, let’s not go overboard: well-identified models and RCTs are good things! But, good internal validity is not an excuse to ignore external validity. Well-identified empirics that, through their structural setup, allow counterfactuals to be discussed and allow comparison with the rest of the literature are quite clearly the future of empirical economics. Second, as Sims notes, computers are very powerful now and growing more so. There is little excuse in the year 2011 for avoiding nonlinear/nonparametric structures if we believe them to be at all important.

Stories of “social contagion” are fairly common in a number of literatures now, most famously in the work of Christakis & Fowler. Those two authors have won media fame for claims that, for example, obesity is contagious the same way a cold is contagious. The basic point is uncontroversial – surely everyone believes in peer effects – but showing the size of the contagion in a rigorous statistical way is controversial indeed. In the present paper, Shalizi and Thomas point out, in what is really a one-line proof once everything is written down properly, that contagion cannot be distinguished from latent homophily in real world data.

Consider the authors’ bridge-jumping example. Joey jumps off a bridge, then Ian does. What might be going on? It may be peer pressure, in which case breaking the social tie between Golden Child Ian and Bad Apple Joey would break the contagion and keep Ian from jumping. It might be, though, that both are members of a thrill-seeking club, whose membership roll is public and can therefore be conditioned on; call this manifest homophily. But it may be that Joey and Ian met on a rollercoaster, and happen to have a shared taste for thrillseeking which is not observable by the analyst; call this latent homophily. More generally, social networks form endogenously based on shared interests: how do I know whether obesity is contagious or whether people who like going out for steak and potatoes are more likely to be friends?

A nice way to analyze this problem is in terms of graphical causal models. For some reason, economists (and as far as I know, other social scientists) generally are unaware of the causal model literature, but it is terribly useful whenever you want to reason about when and in what direction causality can flow, given some basic assumptions. Judea Pearl has a great book which will get you up to speed. The homophily/contagion problem is simple. If past outcomes are informative about current period outcomes, and unobserved traits are informative about each other when two people share a social tie, and those unobserved traits are informative about current period outcomes, then when two people share a social tie, Joey’s period t-1 outcome will be statistically linked to Ian’s period t outcome, even if there is no actual contagion. That is, no matter what observable data we condition on, we cannot separate a direct effect of the social tie on outcomes from the indirect path identified in the previous sentence.
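The argument can be checked with a small simulation. This is a sketch under assumptions of my own, not the setup in Shalizi and Thomas: outcomes are linear, each person has one standard-normal latent trait, friends' traits have correlation `rho`, and there is by construction zero contagion. The regression of Ian's period-t outcome on his own and Joey's period t-1 outcomes still puts a clearly positive coefficient on Joey.

```python
import random

def simulate_pairs(n, rho, seed):
    """Simulate n friend pairs with NO contagion: each person's outcome depends
    only on their own lagged outcome, their own latent trait, and noise."""
    rng = random.Random(seed)
    joey_lag, ian_lag, ian_now = [], [], []
    for _ in range(n):
        z_joey = rng.gauss(0, 1)
        # Homophily: Ian's latent trait is correlated with Joey's (correlation rho).
        z_ian = rho * z_joey + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        y_joey_0 = z_joey + rng.gauss(0, 1)
        y_ian_0 = z_ian + rng.gauss(0, 1)
        y_ian_1 = 0.5 * y_ian_0 + z_ian + rng.gauss(0, 1)  # no Joey term here
        joey_lag.append(y_joey_0)
        ian_lag.append(y_ian_0)
        ian_now.append(y_ian_1)
    return joey_lag, ian_lag, ian_now

def partial_slope(target, control, regressor):
    """OLS coefficient on `regressor` when `target` is regressed on an
    intercept, `control`, and `regressor`."""
    n = len(target)
    def center(v):
        m = sum(v) / n
        return [x - m for x in v]
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    t, c, r = center(target), center(control), center(regressor)
    scc, srr, scr = dot(c, c), dot(r, r), dot(c, r)
    sct, srt = dot(c, t), dot(r, t)
    return (scc * srt - scr * sct) / (scc * srr - scr ** 2)

def apparent_contagion(rho, n=20000, seed=0):
    joey_lag, ian_lag, ian_now = simulate_pairs(n, rho, seed)
    return partial_slope(ian_now, ian_lag, joey_lag)

print("homophilous ties (rho=0.8):", round(apparent_contagion(0.8), 3))
print("randomly assigned ties (rho=0):", round(apparent_contagion(0.0), 3))
```

With `rho = 0`, which corresponds to a randomly assigned network, the spurious coefficient falls to roughly zero. Controlling for Ian's own lagged outcome does not help in the homophilous case because that outcome is only a noisy proxy for his latent trait.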

Christakis and Fowler, in a 2007 paper, offered a way around this problem: take advantage of asymmetry. If Joey reports Ian as his best friend, but Ian does not report Joey as his best friend, then the influence from Ian to Joey should be stronger than the influence from Joey to Ian. Shalizi and Thomas show that the asymmetry trick requires fairly strict assumptions about how latent homophily affects outcomes.

So what can be done? If all variables of interest for predicting outcomes and social ties were known to the researcher, then we could certainly distinguish between contagion and latent homophily, since there would then be no latent homophily. Even if not all relevant latent variables are known to the analyst, it may still be possible to make some progress by constructing Manski-style bounds using well-known properties of, for example, linear causal models. If the social network did not possess homophily – for instance, if the relevant network were randomly assigned in an experiment – then we are also OK. One way to approximate that no-homophily case is to control for latent variables by using statistical techniques to identify clusters of friends who seem likely to share unobservable interests; work in this area is far from complete, but interesting.