On my way to becoming an economist

Category Archives: Data analysis

I just submitted my final paper for the panel data class. Sharing it here. Comments and feedback are welcome!

I. Introduction

This paper provides a critical review of econometric approaches to modeling technology adoption decisions in developing countries. These decisions include the choice of whether or not to adopt a particular technology (e.g. high yielding variety seeds) and the choice of input levels conditional on the technology used.

The developing-country setting presents two additional challenges to identifying the determinants of technology adoption. First, imperfect access to credit and insurance introduces correlation between lagged productivity shocks and current input choices, violating the strict exogeneity assumption commonly maintained in panel data models. Second, the prevalence of informal networks highlights the importance of incorporating learning and externalities into the analysis.

Following Foster and Rosenzweig (1995), suppose we are interested in what determines the adoption of high yielding variety (HYV) seeds by farmers in developing countries. Two broad sources of uncertainty drive differences in technology adoption behavior. First, farmers may know the returns to HYV seeds but not the optimal levels of inputs; a farmer therefore needs to experiment with different input levels once she decides to use HYV seeds. Second, there may be uncertainty about the profitability of the new technology itself. This source of uncertainty can be especially relevant when the technology is new (Conley and Udry 2010). Although the two sources of uncertainty may co-exist, we focus on one at a time given the complexity of the problem.

II. Input Choice as Technology Adoption

In this section, we assume that the returns to technology adoption depend on how close actual input levels are to the optimal input levels, i.e., a target-input model. Foster and Rosenzweig (1995) use this framework to examine how farmers' HYV adoption decisions depend on their own and their neighbors' experience. In their framework, the expected profit of farmer j at time t is
where $\eta_h$ is the yield from HYV seeds, $\eta_{ha}$ is the loss associated with using less suitable land as more HYVs are planted, $A_j$ is the total amount of land, $H_{jt}$ is the amount of land planted with HYVs, $\sigma_{\theta jt}^{2}$ is the updated variance of the mean optimal input level, and $\sigma_{u}^2$ is the variance of the error term in target input use (relative to the mean optimal input). The updating of the variance term depends on learning from own and neighbors' experience and is the focus of section IV.
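The profit expression is easiest to read as a trade-off between scale and uncertainty. As a rough sketch (my own reconstruction from the terms defined above, not Foster and Rosenzweig's exact expression):

$$\mathbb{E}[\pi_{jt}] \;=\; \eta_h H_{jt} \;-\; \eta_{ha}\,\frac{H_{jt}^{2}}{A_j} \;-\; \left(\sigma_{\theta jt}^{2} + \sigma_{u}^{2}\right) H_{jt}$$

Expected profit rises with HYV acreage, falls as less suitable land is drawn in, and falls with the remaining uncertainty about the optimal input level.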

In the empirical analysis, the authors estimate the profit function adding farmers' education as an additional covariate:
where $S_{jt}$ is the cumulative number of parcels planted by farmer j up to time t, $\bar{S}_{-jt}$ is the average of the cumulative experience of neighboring farmers, $\rho$’s are precision terms of own and neighbors’ experience as signals of optimal input levels. Two approaches are used for estimation.
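Before turning to estimation, the learning channel behind $\sigma_{\theta jt}^{2}$ can be sketched as Bayesian precision pooling: own and neighbors' cumulative experience each add precision at rates given by the $\rho$'s. The functional form and parameter values below are my own illustration, not the paper's estimates:

```python
import numpy as np

def posterior_variance(sigma0_sq, rho_own, rho_nbr, S_own, S_nbr):
    """Updated variance of beliefs about the mean optimal input level.

    Each unit of own experience (S_own) adds precision rho_own; each unit
    of average neighbor experience (S_nbr) adds precision rho_nbr.
    """
    precision = 1.0 / sigma0_sq + rho_own * S_own + rho_nbr * S_nbr
    return 1.0 / precision

# Uncertainty falls as cumulative experience S_jt and neighbors' experience grow.
v_early = posterior_variance(1.0, rho_own=0.5, rho_nbr=0.2, S_own=1, S_nbr=0)
v_late = posterior_variance(1.0, rho_own=0.5, rho_nbr=0.2, S_own=10, S_nbr=5)
```

The key comparative static is that the posterior variance is decreasing in both $S_{jt}$ and $\bar{S}_{-jt}$, which is what makes experience valuable in the model.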

The first approach uses IV and fixed effects to estimate a first-order reduced-form approximation of equation (2). Instrumental variables address correlation between 1) contemporaneous profit shocks and production decisions, and 2) lagged profit shocks and contemporaneous adoption (potentially arising from credit constraints). Fixed effects eliminate individual-level heterogeneity $\mu_i$. If we maintain the assumption that input decisions are predetermined, the IV approach addresses the concern that strict exogeneity is violated. Note that predeterminedness implies that the profit shocks in first differences exhibit first-order autocorrelation but are uncorrelated at all longer lags. This seems a reasonable assumption if we believe the profit shocks are unanticipated and not persistent over time. Because Foster and Rosenzweig (1995) do not describe the nature of the profit shocks, it is difficult to evaluate the validity of the predeterminedness assumption.
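The logic can be seen in a small simulation: first-differencing removes $\mu_i$, and a lagged level of the input instruments the differenced input, which is valid under predeterminedness (serially uncorrelated shocks). The data-generating process below is an illustrative assumption, not the authors' specification:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 5000, 5, 1.0

mu = rng.normal(size=(N, 1))               # individual fixed effect
e = rng.normal(size=(N, T))                # iid profit shocks
x = np.zeros((N, T))
for t in range(1, T):
    # Predetermined input: responds to last period's shock, violating strict exogeneity
    x[:, t] = 0.5 * x[:, t - 1] + 0.5 * e[:, t - 1] + mu[:, 0] + rng.normal(size=N)
y = beta * x + mu + e

# First differences remove mu; Delta e_t is correlated with Delta x_t,
# so instrument Delta x_t with the lagged level x_{t-1}.
dx = (x[:, 2:] - x[:, 1:-1]).ravel()
dy = (y[:, 2:] - y[:, 1:-1]).ravel()
z = x[:, 1:-1].ravel()

beta_iv = np.cov(z, dy)[0, 1] / np.cov(z, dx)[0, 1]
beta_ols = np.cov(dx, dy)[0, 1] / np.var(dx)
```

OLS on the differenced equation is biased here because $\Delta x_{it}$ contains $e_{i,t-1}$ through the predetermined input rule, while the level $x_{i,t-1}$ is uncorrelated with $\Delta e_{it}$.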

The second approach uses nonlinear IV with fixed effects to obtain structural estimates of the profit function. Equation (2) is differenced over time and estimated using a standard nonlinear IV procedure. This approach is subject to the same concerns as the first.

III. Discrete Technology Adoption Decisions

In this section, the outcome variable equals one if the individual adopts the technology in period t. Because technology adoption contributes to accumulated experience, adoption in the current period may change the returns to the technology in subsequent periods in complicated ways.

Foster and Rosenzweig (1995) examine HYV adoption using reduced-form predictions from the structural model. But without solving the decision rules, they are unable to estimate the structural parameters. To address this limitation, we might use nonlinear panel data models with stronger distributional assumptions on the error terms (e.g. a logistic distribution) and conditional maximum likelihood estimators. This, however, rules out serial correlation in the error terms and might be unrealistic. An alternative is Manski's conditional maximum score estimator. This approach achieves identification from "switchers", but observing enough individuals who switch between adopting and not adopting a given technology can be challenging: new technologies often involve fixed costs, so adoption decisions tend to be persistent.
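The switcher logic can be made concrete for two periods: conditional on adopting in exactly one period, the probability that the adoption occurs in period 2 is a logit in the within-person change of the covariate, so the fixed effect drops out. The data-generating process and the simple Newton solver below are my own illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta = 20000, 1.0

alpha = rng.normal(size=N)              # individual fixed effects
x = rng.normal(size=(N, 2))             # covariate in periods 1 and 2
u = rng.uniform(size=(N, 2))
eps = np.log(u) - np.log1p(-u)          # standard logistic errors via inverse CDF

y = (beta * x + alpha[:, None] + eps > 0).astype(int)

# Keep "switchers": adopted in exactly one of the two periods
sw = y.sum(axis=1) == 1
d = y[sw, 1]                            # 1 if the switch was into adoption in t=2
dx = x[sw, 1] - x[sw, 0]

# Newton's method for the one-parameter conditional logit P(d=1|switcher)=Lambda(b*dx)
b = 0.0
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-b * dx))
    score = np.sum((d - p) * dx)
    hess = np.sum(p * (1 - p) * dx ** 2)
    b += score / hess
```

Only the switchers contribute to the likelihood, which is exactly why persistence in adoption (few switchers) makes this estimator imprecise in practice.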

Suri (2011) provides an alternative framework for examining why farmers make different adoption decisions. She uses information on the correlation between productivity differences and the productivity of a technology among farmers who use both technologies to project productivity levels for farmers who use only one. More specifically, she assumes profits for farmer i with productivity

She estimates the following equation for yields:
Based on the primitives of the model, the reduced-form coefficients can be mapped back to the structural parameters.

The identifying assumption is mean independence of the composite error $(\tau_i+\epsilon_{it})$ and the comparative advantage component $\theta_i$ from the histories of the regressors. Translated into assumptions on what drives hybrid switching behavior, this requires that the unobserved time-varying variables driving the switching not be correlated with yields. Chamberlain's (1982) correlated random effects approach is used for estimation. Dependence of the unobserved $\theta_i$'s on the endogenous input $h_{it}$ is accounted for using the linear projection of $\theta_i$ on the full history of inputs and their interactions. Structural parameters are recovered from the reduced-form estimates.
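A stripped-down version of the Chamberlain device (far simpler than Suri's actual model, and purely illustrative): generate heterogeneity that loads on the input history, then control for the full history in a pooled regression so the coefficient on the current input recovers the structural slope.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, beta = 4000, 4, 1.0

h = rng.normal(size=(N, T))                         # input histories
lam = np.array([0.4, 0.3, 0.2, 0.1])                # projection coefficients
theta = h @ lam + rng.normal(scale=0.5, size=N)     # heterogeneity correlated with history
e = rng.normal(size=(N, T))
y = beta * h + theta[:, None] + e

# Pooled regression of y_it on h_it plus the full history (h_i1, ..., h_iT):
cur = h.ravel()                                     # current input, stacked over (i, t)
hist = np.repeat(h, T, axis=0)                      # full history, repeated for each t
X = np.column_stack([cur, hist, np.ones(N * T)])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
beta_cre = coef[0]

# Naive pooled OLS ignoring the history is contaminated by theta's loading on h
beta_naive = np.cov(cur, y.ravel())[0, 1] / np.var(cur)
```

Once the history is in the regression, the remaining part of $\theta_i$ is uncorrelated with the inputs by construction, which is the CRE identifying assumption in miniature.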

The correlated random effects approach lowers the bar for identification, and it seems reasonable in Suri's setting to assume that individual-level heterogeneity is uncorrelated with productivity shocks once the history of input decisions is controlled for. Moreover, the focus of Suri (2011) is to identify the \emph{cross-sectional} heterogeneity in productivity and its consequences for hybrid seed adoption. It is unclear whether this focus warrants the use of CRE models.

IV. Learning in Technology Adoption

Recent literature on technology adoption highlights the importance of learning from own experience and the experiences of informal network members.

Conley and Udry (2010) collect data on social interactions and address the unobserved-variable problem in studying learning effects in technology diffusion, in their case pineapple planting. In their model, risk-neutral farmers each have a single plot and maximize current expected profits by choosing a discrete-valued input $x_{it}$ at time t. Pineapple output, realized five periods after the input decision, is
where the $\epsilon$'s are unobserved productivity shocks, iid with mean 0 and variance 1, and $\omega_{it}$ captures spatially and serially correlated shocks to the marginal product that are observed by the farmer but not the econometrician. Farmers do not know the function $f$ but learn about it through a learning rule.
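Schematically (my paraphrase of the description above, not Conley and Udry's exact specification), the production relation has the shape

$$y_{i,t+5} \;=\; f\!\left(x_{it}, \omega_{it}\right) + \epsilon_{i,t+5},$$

with $\omega_{it}$ shifting the marginal product of $x$; the farmer learns about $f$ from the input-output pairs realized on her own plot and in her network.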

Identification exploits the specific timing of plantings, which creates opportunities for information transmission. Variation in planting decisions generates a sequence of dates at which new information may be revealed to the farmer. Conditional on measures of growing conditions, Conley and Udry isolate events in which new productivity information is revealed to the farmer. They then investigate whether new information is associated with changes in farmers' input use that are consistent with social learning. A logistic regression is used to estimate how farmers' input decisions respond to the actions and outcomes of other farmers in their information networks (data collected by the authors).

The baseline regression model is
where $M_{it}$ is an index of good news about input levels constructed from inputs and profits five years ago and now. The identifying assumption is that, conditional on measures of changes in growing conditions $\Gamma_{it}$ and other farm-level characteristics, the information measure $M_{it}$ is uncorrelated with unobserved determinants of growing conditions and therefore of input use. A significant, positive $\beta_1$ is evidence for social learning.
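Schematically (my rendering from the description above, not their exact equation), the logistic specification is something like

$$\Pr\left(\text{input change}_{it}\right) \;=\; \Lambda\!\left(\beta_0 + \beta_1 M_{it} + \Gamma_{it}'\gamma\right),$$

where $\Lambda(\cdot)$ is the logistic CDF.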

An important limitation of this approach is that it ignores the endogenous formation of informal networks and their potential dynamic changes. To study learning effects in technology adoption, we need a better understanding of how informal networks form and of the nature of learning in order to evaluate whether the identification assumptions are realistic.

While I was doing a literature review on migration, I saw an interesting new article on inferring internal migration patterns from mobile phone usage data in Rwanda. The author Joshua Blumenstock, a professor in the information school at the University of Washington, has quite a few interesting projects using mobile phone data for policy evaluation in developing countries.

IZA just published this new panel data set tracking unemployed individuals in Germany.

The IZA Evaluation Dataset Survey (IZA ED Survey) is a novel panel survey that tracks the employment history, behavior and individual traits of a large, representative cohort of individuals. The IZA ED Survey covers a panel of 18,000 individuals who registered as unemployed at the Federal Employment Agency in Germany between June 2007 and May 2008. The individuals were interviewed up to four times over a span of three years, starting at their entry into unemployment. These data allow researchers to observe dynamics with respect to individual and labor market characteristics during the early stage of unemployment, as well as to track long-run outcomes. The survey records information on labor market activities, ALMP (Active Labor Market Policy) participation, migration background, search behavior, ethnic and social networks, psychological factors, cognitive and non-cognitive abilities, attitudes and preferences. Its large sample of individuals entering unemployment, combined with its broad set of variables and its measurement of unemployment dynamics, offers many new perspectives for empirical labor market research.

A detailed description can be found here. Seems like a good resource for labor economists.

Programming is fun and rewarding. It is similar to writing in that you reach clarity and elegance through constant revision. As a beginner in Matlab, I'm starting a learning log to record my thoughts along the way. This time, my thoughts come from homework problems for my Demand Estimation class and a dynamic programming problem in my RA work.

1. Be clear about the steps you need to take before you write down any code. If you have a model to guide your analysis, make your code as consistent with the model as possible. Sometimes a tree structure or flow chart can help you think more clearly. Once you start writing the code, it is very easy to get lost in the details (e.g. vector dimensions).

2. In a loop, have a clear idea of the relationship between variables and when and where a variable needs to be defined. For example, empty vectors/matrices to store estimates should be defined before the estimates are produced.

3. Use vectors where possible to make calculations more efficient. For someone like me who is spoiled by straightforward "programming" in Stata, this is something I need to learn.
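Tips 2 and 3 side by side, written in Python/NumPy since that is easiest to share here (the Matlab idioms are analogous: preallocate with `zeros`, then prefer elementwise array operations over loops):

```python
import numpy as np

n = 100_000
x = np.linspace(0.0, 1.0, n)

# Tip 2: preallocate storage before the loop instead of growing it element by element.
out_loop = np.empty(n)
for i in range(n):
    out_loop[i] = x[i] ** 2 + 3.0 * x[i]

# Tip 3: the vectorized version says the same thing in one line and runs far faster.
out_vec = x ** 2 + 3.0 * x
```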

Learning a new programming language is like getting to know a new friend. Over time you learn about her strengths and weaknesses, and how she can complement you to make your work more productive. Starting next week I will be taking a course on entry games taught by Professor Allan Collard-Wexler. Looking forward to learning more about programming and IO theory in the next seven weeks!

Researchers using Chinese data are often disappointed by the inability of Stata to display Chinese characters correctly. The solution from the most reliable source I can find online:

Most modern software (OS and applications) work with Unicode. Stata does not work with Unicode. Unicode encodes characters with 2 or more bytes. In Stata each character must be 1 byte only. You need to make sure the input CSV file is encoded in a codepage proper for your region, presumably 1252.
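One note on the quote: codepage 1252 is for Western European locales; for Simplified Chinese the relevant codepage is GBK (Windows codepage 936). A quick re-encoding pass over a CSV might look like this (file names and sample contents are illustrative placeholders):

```python
# Illustrative round trip: write a small UTF-8 CSV with Chinese text,
# then re-encode it to GBK (codepage 936) so a non-Unicode program can read it.
sample = "\u7701\u4efd,\u4eba\u53e3\n\u5317\u4eac,100\n"
with open("data_utf8.csv", "w", encoding="utf-8") as f:
    f.write(sample)

with open("data_utf8.csv", encoding="utf-8") as src:
    text = src.read()
with open("data_gbk.csv", "w", encoding="gbk", errors="replace") as dst:
    dst.write(text)
```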

It’s actually simpler than that. If you’re using Windows 8 like I do, the steps are as follows:

1. Go to Control Panel->Language->Advanced Settings.

2. Click into “Apply language settings to the welcome screen, system accounts, and new user accounts”.

3. In “Administrative” tab, under “language for non-Unicode programs”, change it to Chinese (Simplified). You might need to change system locale if your computer wasn’t initially set to be “located in China”.

Note that you don't need to change your preferred language or system display language. The above steps should work for other languages as well. Hopefully my note can benefit other researchers.

P.S. I didn’t want my first post in the semester to be this technical, but this kind of reflects what’s on my mind.

Markus Mobius from Microsoft Research New England gave a talk at our department on social media and news consumption on Wednesday. This is his joint work with Susan Athey and Jeno Pai. Using big data scraped from toolbar records, tweets, Facebook posts and other social media usage, they attempt to explain how social media affects the preferences and trends of news consumption.

Most adult internet users have more than one social media account, and social media websites are driving the traffic to news websites. The prevalence of social media prompts Mobius and his coauthors to ask the following questions: How does social media affect the composition of news consumption? Does it increase the demand for particular types of news relative to others? Does it lead to bias in the news and polarization in opinions?

With these questions in mind, they collected data on media access from major social media websites in April and May 2013. Their primary focus is on relative comparisons between categories of news consumption, so the time window is deliberately restricted. They find that the composition of social media users explains a great deal of the patterns in media consumption; social media as a channel for accessing news doesn't seem to change people's demand for particular genres of news relative to others.

One caveat of their research is that they can only access toolbar records of Internet Explorer users, so their sample is highly selected (probably mostly old-fashioned and non-tech-savvy people). Mobius also promoted Microsoft Research at the end of the talk. PhD students can apply for summer intern positions where they choose and finish a topic over a summer. If I'm not mistaken, the data may be used for dissertation purposes as well. People with PhD degrees can take post-doc positions, and I'm a bit surprised to learn that a development economist who just got his PhD from UC Berkeley is doing a post-doc there before joining the Harvard economics department.