Saturday, October 6, 2012

Economists as Data Scientists

In his principles of economics textbooks, Greg Mankiw proposes
that economists are like scientists in that they develop theories and
subsequently gather data to test those theories empirically. Econometrics is
the empirical side of economics. In general, econometrics focuses on testing
hypotheses about causes and effects. The goal is typically to derive estimators
with desirable properties that are appropriate for making inferences. As
described by Tim Harford:

"econometricians set themselves the
task of figuring out past relationships. Have charter schools improved
educational standards? Did abortion liberalisation reduce crime? What has been
the impact of immigration on wages?"

Some of the tools of econometrics include linear regression,
logit/probit models, instrumental variables, and time series.

Data Scientists

As presented by Drew Conway, data science is a
combination of hacking skills, math and statistics knowledge, and substantive
expertise.

A recent post on the Harvard Business
Review blog gives some practical examples of what a data
scientist can do:

"They can suck
data out of a server log, a telecom billing file, or the alternator on a
locomotive, and figure out what the heck is going on with it. They create new
products and services for customers. They can also interface with carbon-based
lifeforms — senior executives, product managers, CTOs, and CIOs. You need
them."
- "Can You Live Without a Data Scientist?", Harvard Business Review

"Database
and data manipulation or how to shuffle data around and move things
from place to place; statistics and statistical analysis; machine
learning; visualization, or how to present data in a meaningful way; and
communication or being able to describe what’s going on."

While data scientists certainly rely on a strong foundation
in statistics, and may in fact use some of the same tools of inferential
statistics used by econometricians, they most often follow a
different path. As described by Leo Breiman:

"There are two
cultures in the use of statistical modeling to reach conclusions from data."

The traditional statistical/econometric culture:

"assumes that the
data are generated by a given stochastic data model."

vs. the machine learning/data mining culture:

"uses algorithmic
models and treats the data mechanism as unknown."

Because of the nature of their data and the problems they solve,
data scientists very often turn to algorithmic methods. These problems
typically do not call for the kind of estimators with desirable properties,
and the empirically sound inferences they support, that econometricians seek.
Often the concern is simply to make accurate predictions or to
discover informative patterns in the data.

Economist Scott Nicholson (Chief Data Scientist at Accretive Health
and formerly at LinkedIn) comments on the differences between economists
and data scientists:

"In terms of applied work,
economists are primarily concerned with establishing causation. This is
key to understanding what influences individual decision-making, how
certain economic and public policies impact the world, and tells a much
clearer story of the effects of incentives. With this in mind,
economists care much less about the accuracy of the predictions from
their econometric models than they do about properly estimating the
coefficients, which gets them closer to understanding causal effects. At
Strata NYC 2011, I summed this up by saying: If you care about prediction,
think like a computer scientist, if you care about causality, think like an
economist."

The algorithms used by data scientists come from the machine
learning and data mining paradigm, and often include neural networks, decision
trees, support vector machines, association rules, and others.

These approaches may not be very familiar to economists, but
their training in statistics and mathematics makes these techniques very
accessible. Take, for instance, logistic regression. This technique is very
familiar to most economists, and is in fact often used by data scientists
to solve classification problems. Moreover, as Peter Kennedy describes in
A Guide to Econometrics, neural networks (with logistic activation functions)
can be thought of as a weighted average of logit functions. And if the
econometrician understands how logistic regression parameters are estimated
(by maximum likelihood, with estimation implemented via Newton's method),
it is not that difficult to grasp gradient descent or even the
backpropagation algorithm used in neural networks.
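
As a rough illustration of how close these ideas are, the sketch below fits the same logit model twice: once with Newton's method, as an econometrics package typically would, and once with plain gradient descent on the average negative log-likelihood, the basic building block behind backpropagation. Everything here is an assumption made for illustration and is not from the sources cited above: the simulated data, the function names (`fit_newton`, `fit_gradient_descent`), the learning rate, and the iteration counts.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, the inverse of the logit link."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_newton(X, y, n_iter=25):
    """Maximum likelihood logit estimates via Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        score = X.T @ (y - p)                       # gradient of the log-likelihood
        info = X.T @ (X * (p * (1.0 - p))[:, None])  # negative Hessian, X'WX
        beta = beta + np.linalg.solve(info, score)   # Newton update
    return beta

def fit_gradient_descent(X, y, lr=0.5, n_iter=5000):
    """Same objective, minimized with first-order gradient descent."""
    beta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (p - y) / n   # gradient of the average negative log-likelihood
        beta = beta - lr * grad    # step downhill; no Hessian needed
    return beta

# Simulated data (hypothetical): an intercept and one regressor.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.5])
y = (rng.uniform(size=n) < sigmoid(X @ true_beta)).astype(float)

beta_newton = fit_newton(X, y)
beta_gd = fit_gradient_descent(X, y)
print("Newton:          ", beta_newton)
print("Gradient descent:", beta_gd)
```

Both routines climb the same likelihood surface. Newton's method uses curvature information and needs only a handful of iterations, while gradient descent takes many cheap steps, which is the trade-off that matters once a model has too many parameters for the Hessian to be practical.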

Similarly, just as econometrics is written in the language of
calculus and linear algebra, so is machine learning (for more details, see the
popular machine learning text The Elements of
Statistical Learning: Data Mining, Inference, and Prediction). Some of the
mathematical concepts used in advanced microeconomic theory (inner products,
separating and supporting hyperplanes, and quadratic programming, for example)
are also very useful when it comes to understanding support vector machines.
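
To see where those concepts show up, here is the standard textbook hard-margin formulation of a support vector machine. It is included only as a sketch and is not taken from any of the sources cited in this post; the notation (w, b, alpha, and labels y_i in {-1, +1}) is assumed for illustration.

```latex
% Hard-margin SVM for linearly separable data (x_i, y_i), y_i \in \{-1, +1\}.
% Primal: a quadratic program whose solution defines the separating
% hyperplane \langle w, x \rangle + b = 0.
\[
\min_{w,\,b} \; \tfrac{1}{2}\,\langle w, w \rangle
\quad \text{subject to} \quad
y_i \left( \langle w, x_i \rangle + b \right) \ge 1, \qquad i = 1, \dots, n.
\]
% Dual: also a quadratic program; the data enter only through inner products.
\[
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
 - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
   \alpha_i \alpha_j \, y_i y_j \, \langle x_i, x_j \rangle
\quad \text{subject to} \quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.
\]
```

The primal is a quadratic objective with linear constraints, and the observations for which the constraint holds with equality (the support vectors) lie on the two margin hyperplanes on either side of the separating one.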

In conclusion, most economists trained in econometrics have
two of the three elements that comprise data science: substantive expertise
(economic theory) and knowledge of mathematics and statistics. Supplementing
their quantitative skills with hacking skills (data management, manipulation,
and cleaning, plus loop and array processing, via a language like SAS/SQL,
MATLAB, or R) and familiarity with machine learning algorithms would open the
door for many trained in economics and statistics to employ their skills as
data scientists.