Posts categorized "Interview"

After the NBA Hackathon (see report here), I caught up with the winning team in the business analytics competition, DataBucket, composed of Barbara Zhan and Harold Li.

Junkcharts: Congratulations on winning the business analytics competition at the NBA Hackathon. As a judge, I was very impressed by how much work you were able to do in 24 hours. Did you sleep, or did you work all the way through?

DataBucket: We slept for around 5 hours in the early mornings, but also took breaks every few hours just to relax our minds and recalibrate.

JC: The problem you chose to tackle is defining "entertainment value" for any NBA game. That's a huge problem for 24 hours. How did you allocate your time?

DB: We spent the first few hours planning our course of action, and really debating how to evaluate "entertainment value." Without a good metric, any sort of analysis would be fruitless. We also decided on our methodology (a time-series regression approach) and the features we wanted for our model.

Afterwards, we divided and conquered, cleaning / scraping the various datasets to get the variables that we wanted. Once we had a clean dataset, we ran regressions and played with features to get the most accurate and most intuitive results.

Once we were confident with our model, we spent time building out the Tableau dashboard that visualized those entertainment values. It was important for us to come up with a tool that was engaging, interactive, flexible and informative, so we spent significant time designing our visualization.

JC: How did you allocate work between the two team members?

DB: It was a joint effort! We came up with the initial plan after an hour or so of joint brainstorming. Barbara took the lead on the feature engineering / data modeling sections, coming from a quantitative hedge fund background where she knew a ton about regressions and the assumptions behind them, but both of us were highly involved in data cleaning and modeling. Harold took the lead on the visualization / presentation component, since he comes from an analytics background where storytelling and communicating results in a business context is vital. He created a Tableau dashboard that showcased our resulting entertainment metric, which updated over time, and lent a crucial "cool" factor to our presentation.

JC: Tell me about your backgrounds.

DB: We both majored in Operations Research and Financial Engineering at Princeton. Harold is a data scientist at Blue Apron, and previously worked at Goldman Sachs as a quantitative strategist. Barbara is a quantitative researcher at Two Sigma.

JC: I heard you guys have a blog called DataBucket. What's the origin of the name?

DB: When we were both at Princeton, we thought it would be fun to use our data science skills to answer questions we were interested in. Our first article sought to quantify the clutchness of NBA players, so we called it DataBucket to honor the basketball-related heritage of the blog!

JC: Your team chose to work on the problem of defining the entertainment value of an NBA game. You incorporated data from Instagram into your solution. Can you explain what data you pulled from Instagram and how you used them?

DB: The prompt suggested that we incorporate creativity into our project, so we decided to use alternative data. Barbara was familiar with the Instagram API, having used it before for the DataBucket blog, and scraped counts of hashtags related to each player's name as a proxy for player popularity. Harold was familiar with Google Trends, which he used to scrape timely data on search terms that would be most relevant to a blockbuster game (e.g., NBA on TNT).
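[Edit: For readers who want to try something similar, here is a minimal sketch that pulls Google Trends interest for a search term using the unofficial pytrends package. The package, search term, and timeframe are my own assumptions for illustration; this is not the team's actual scraping code.]

```python
# Illustrative only: fetch Google Trends interest for a search term with the
# unofficial pytrends client. Term and timeframe are assumptions.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)
pytrends.build_payload(kw_list=["NBA on TNT"], timeframe="today 3-m")
interest = pytrends.interest_over_time()   # DataFrame indexed by date
print(interest.head())
```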

JC: One of the highlights of your presentation is the decision-making tool you created for the manager. What tools did you use to build it?

DB: We used Tableau to visualize our dataset. Given the time constraints, Tableau was the easiest tool to create something interactive and visually appealing without much effort.

Running the regression and cleaning the data were done in Python and R; we used whatever we were most comfortable with and went with it!
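[Edit: For readers curious what a bare-bones version of such a pipeline might look like in Python, here is a minimal sketch. The feature names, the TV-rating proxy for entertainment value, and the use of scikit-learn are assumptions for illustration only, not the team's actual code.]

```python
# Minimal sketch (not the team's code): regress a proxy for "entertainment value"
# on hypothetical game-level features, then score each game.
import pandas as pd
from sklearn.linear_model import LinearRegression

games = pd.DataFrame({
    "lead_changes":    [12, 3, 18, 7, 10],
    "final_margin":    [2, 21, 1, 9, 5],
    "star_popularity": [0.8, 0.3, 0.9, 0.5, 0.6],  # e.g. scaled Instagram hashtag counts
    "search_interest": [64, 20, 88, 35, 50],       # e.g. Google Trends index
    "tv_rating":       [3.1, 1.2, 3.8, 1.9, 2.4],  # proxy target for entertainment value
})

X = games.drop(columns="tv_rating")
y = games["tv_rating"]

model = LinearRegression().fit(X, y)
games["entertainment_score"] = model.predict(X)
print(games[["entertainment_score"]])
print(dict(zip(X.columns, model.coef_.round(3))))
```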

JC: If you had a chance to do one thing differently in the Hackathon, what would it be?

DB: We definitely wanted to explore Twitter data a bit more. While Instagram is a good indicator of player popularity, Twitter is more of a real-time platform that captures the sentiment of a particular game more accurately, but we couldn't hack the API in time.

On another note, we would have loved to forecast game-level entertainment value for this upcoming season instead of validating our model for this past season.

Cathy O'Neil may need no introduction to blog readers. She's the author of the hard-hitting MathBabe blog, and she shares my passion for explaining how data analysis really works. She is co-author of the recent book Doing Data Science (link), with Rachel Schutt. Cathy has a varied career spanning academia and industry, as she explains below.

***

KF: How did you pick up your impressive statistical reasoning skills?

CO:

Thanks for the flattery, but I wouldn't call my skills impressive. I've always done my best thinking by assuming I understand nothing and starting from scratch. The best I can say about myself is that I have learned how to think abstractly and a few cool methods, or better yet rules of thumb, that help me get at very basic information.

What I know about thinking abstractly happened mostly during my mathematical training, first at math camp in high school, then as an undergrad in a highly welcoming and mathematically vibrant community at UC Berkeley in the early 1990s, and then during grad school at Harvard and to some extent my post-doc at MIT and my two-year Assistant Professor stint at Barnard College.

To be honest most of the last few years of being a "grownup" academic was spent learning non-math stuff like how to teach and write letters of recommendation.

Then I learned a bunch explicitly in the realm of statistical reasoning when I first got to D.E. Shaw, from my boss Steve Strong. Since then I feel like I've just been corroborating what Steve explained to me early on, which is that people fool themselves into thinking they understand stuff they don't.

KF: How would you rate the relative importance of academia and real-world experience in training your data interpretation skills?

CO:

I'd say that, on the whole, learning to think abstractly has been at least as important to me as rules of thumb, and certainly more important than a given algorithm or technique. For example, from my experience working in industry, the most common mistake is answering the wrong question, not using the wrong technique.

I routinely tell people that, as a mathematician, you are a professional learner, with an added advantage of getting used to being wrong and feeling stupid. I'm sure the same can be said about other disciplines, but I'll stick to what I think I know on that score.

KF: What advice would you give to a young graduate with a BS in a quantitative field: get an advanced degree in Statistics, or go find a job in analytics?

CO:

I don't think it's a waste of time to get a Ph.D. and then an industry job, because although you're not honing specific skills in your future line of work, you are honing brain paths and habits of mind which don't come easy under time pressure and/or with money on the line. And of course, there are some people who love the feeling of getting things to work so much that they don't have patience for the thesis thing, and that makes sense for those people, as long as they don't give short shrift to the high-level perspective.

KF: What is your pet peeve with published data interpretations?

CO:

I'm a huge complainer about everyone and everything, in spite of the fact that I think data and data analysis techniques are powerful and can and should be used for good.

I guess if I had to pinpoint my single most massive peeve, which really cannot be termed "pet," it would have to be hiding perverse incentives (and almost all incentives are perverse in some way) behind what people present as "objective truth". In my experience, outside of the world of sports where everything is transparent (except steroid use), there is always some opacity and gaming going on and someone's either making money off of it, gaining status from its publication, or wielding power through it.

And come to think of it, you've asked me the wrong question altogether. My biggest peeve with data interpretations is how many aren't published at all. For example, the Value-Added Model for teachers is being used to evaluate teachers but I can't seem to get my hands on the source code to save my life. Not to mention the NSA models.

[Edit: David Spiegelhalter also complained about what studies don't get published but for a different kind of concern. See this interview. In the recent furor over Google Flu Trends, the researchers expressed dissatisfaction that the underlying algorithm isn't properly documented in the public domain.]

***

KF: Which source(s) do you turn to for reliable data analysis?

CO:

I don't trust anything or anyone, including my own analysis. Everything comes with caveats. Having said that I usually trust people more when they are open about their caveats. On the other hand, even admitting that opens me up to being fooled by people who write up fake caveats to seem trustworthy. It's really an endless loop.

So for example, I like raw data, especially when I know how the data was gathered. Look at this gif, which shows a map of death penalty executions. In some sense that's as good as it gets, but it is also misleading, since there are way more people in, say, California (38M in 2013) than in Nevada (3M in 2013), so even though the two states look similar on the map, they really aren't.

Bottom line is, never trust anything until you've checked it, and even then only trust your own memory of it for about 20 minutes.

KF: What advice do you have for the average reader of this blog? Surely, checking everything they read is not too realistic.

CO:

Of course, we don't have time to check everything. My suggestion is to remain skeptical of anything that you haven't checked through. And of course, don't confuse skepticism with cynicism, but also don't confuse skepticism with evangelism.

KF: Thank you so much. I've really enjoyed our conversation.

PS. I subsequently wrote about the chart that Cathy referenced in this interview. See here.

I am excited to chat with Professor David Spiegelhalter, who is no stranger to our UK audience, or to our statistics colleagues. Perhaps his best-known contribution is the DIC criterion for model selection, introduced in a paper he wrote with collaborators. He holds the impressive title of Winton Professor for the Public Understanding of Risk at the University of Cambridge (link). He also writes a blog called Understanding Uncertainty (link), and as the accompanying photos show, is someone who knows how to enjoy life.

I mean, a statistician who appeared on Winter Wipeout (link to Youtube for spectacular splashes at 15:12 and 16:10), who'd have thought? Yes, Wipeout is that obstacle course show held over a pool of water. He also made this rather more educational Youtube (link).

***

KF: How did you pick up your impressive statistical reasoning skills?

DJS:

Well, that's your label and not mine. I started off doing pure maths, but that got too hard; then I did mathematical statistics, and that got both too hard and too boring, and so for years now I have preferred getting involved in real problems that people are trying to handle using data.

But generally the data are messy, incomplete, and not as relevant as desired. While some technical insights are vital, I think any skill comes mainly from an apprenticeship of dealing with many problems, making many mistakes, trying to explain things, and far too much time spent critiquing studies.

KF: What is your pet peeve with published data interpretations?

DJS:

That's easy to identify - it's a non-scientific approach to science reporting. I have a naive view that scientists should do investigations to answer a question, and they should be pleased whatever the answer. But it seems clear from many publications that some researchers set out to prove a point, and do everything they can to do so: in the worst instances they write an inflated abstract, the journal puts out a press release, and the media lap it up.

I feel the public get fed a diet of highly selected and biased studies (often, ironically, on diet) that have gone through so many filters that they become very unrepresentative of the bulk of research conducted. In my more cynical old-man moments, I would say that the very fact that a study is reported in the media is a reason to ignore it - almost certainly you would not have heard about it if the results had been different.

KF: That last point sounds counterintuitive. Let's take a diet example. The media has been telling us new research suggests that four or more cups of coffee each day is great for you. If the research result were null, surely it wouldn't get picked up by the media. Why would that be a bad thing?

DJS:

Say the media tells us four cups or more of coffee every day is great for you, and I judge that, if the study had shown no effect of coffee, it would not have been press-released and the media would not have picked it up. This probably means there is an unknown number of studies out there that showed the opposite to what I am being told by the media, but I am not hearing about them because they are not newsworthy enough. Therefore ignore the media. It also saves a lot of time.
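[Edit: David's point about selective reporting can be made concrete with a toy simulation, which I add here for illustration. The numbers are invented; the selection mechanism is the point.]

```python
# Toy simulation: the true effect of "coffee" is zero, but if only studies that
# find a significant benefit get reported, the reported effects look impressive.
import numpy as np

rng = np.random.default_rng(1)
true_effect, n, n_studies = 0.0, 50, 2000

# each study reports the mean of n noisy observations
estimates = rng.normal(true_effect, 1.0, size=(n_studies, n)).mean(axis=1)
se = 1.0 / np.sqrt(n)
reported = estimates[estimates > 1.96 * se]   # only "coffee is great" findings make the news

print(f"average estimate across all studies:     {estimates.mean():+.3f}")
print(f"average estimate among reported studies: {reported.mean():+.3f} "
      f"({reported.size} of {n_studies})")
```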

KF: That's rather sobering. Which source(s) do you turn to for reliable data analysis?

DJS:

These would tend to be individuals and teams that I know and trust: Andrew Gelman (link to my interview) comes to mind, and there are other great scientists whose opinions I value. I also respect people who are trying to produce good odds for future events, without pushing for one side or another. A purely financial interest produces objectivity, and so sports-betting sites are good examples - it will be interesting to see how 538 develop.

KF: Thank you very much for sharing your insights.

***

David and Michael Blastland just published a new book called "The Norm Chronicles", which I had a chance to preview. It's an idiosyncratic look at the idiosyncratic risks of modern living.

Yesterday, Larry Cahoon, a 29-year veteran at the Census Bureau, answered some questions. The rest of the interview is printed below.

***

KF: How can a data analyst improve his or her skills?

LC:

I have to say my best training has been many, many hours of just playing with statistics, playing with graphics, and reading the analysis others have done. The more data I see, the more analysis I do, the more graphics I look at and produce, the more I learn about how to look at data and how to see the pitfalls in the data analysis. This is getting down in the mud and dealing with real data with all of its warts. My wife tells people I'm a statistician through and through as every time she looks at my computer, there is a graphic of some type on the screen.

Finally, I have always been an avid reader of Science Fiction. No one can read Science Fiction without being forced to consider any problem from different perspectives and take into consideration differing assumptions. This in turn has helped me develop the ability to question the assumptions being made in any data analysis.

***

KF: What are your pet peeves with published data interpretations?

LC:

I seem to return again and again to the same issues with the analyses I see in the newspapers, online, and just about anywhere. The most basic problem is one of incomplete analysis.

We see so many papers and news reports where a data difference is observed and then, based on no data whatsoever, the author goes off with an entire line of speculation. This line of thinking then frequently ends up with the claim that these two things are correlated, and therefore we have cause and effect.

The media then fans these reports by writing a story without asking basic questions, such as whether the data itself is any good or whether there is any evidence for the claims that are being made. The media acts as if the claims have been proven, especially in how they headline the story.

My second pet peeve is what I call an emphasis on a one-dimensional world. This is usually reflected in simple statements like: A causes B. The world is much more complex than that. Those who investigate airline accidents have been telling us for some time that there is seldom just one cause for each accident. Rather, there are a number of causes. We need to carry that knowledge over to our statistical analysis and reporting.

***

KF: Which source(s) do you turn to for reliable data analysis?

LC:

I can't say that I have any favorite source for data analysis. If forced to name one, I would say that I tend to like the work of the Pew Research Center (link). Their surveys seem to be well designed, the questions they ask well thought out, and the analysis something I can trust.

I like the data that is available from the Federal Government. But the government agencies rightly avoid most detailed data analysis in an effort to remain nonpartisan.

***

KF: Thank you so much for your time. We're lucky that you continue to blog in your retirement.

A short while ago, I introduced Larry Cahoon's blog, GoodStatsBadStats. He started the blog almost two years ago after retiring from the US Census Bureau (site offline due to government shutdown), where he spent 29 years working on the statistical design of most of the household surveys conducted by the Bureau. Cahoon received his PhD from Carnegie Mellon University.

Cahoon spent the final seven years of his career working on the 2010 decennial census. He was involved in almost all areas of census design and operations. His work included focusing on the many issues of both over coverage and under coverage in the census, the effectiveness of the publicity campaign, and the use of the American Community Survey as a replacement for the long form of the census.

Larry was very gracious and offered great insights and long answers. I have divided the interview into two parts. Part 2 will appear tomorrow.

***

KF: What are the key skills of statistical reasoning?

LC:

There are three things on the top of my list when I think about statistical reasoning skills. A solid knowledge of statistical methods and principles is a necessary starting point. I extend this to include the ability to think in terms of probability. This is a way of thinking or a way of viewing the world at the most basic level.

Equally important is a solid foundation in logical reasoning. The third piece is recognizing and dealing with the assumptions that must be made in any statistical analysis. One needs to be able to look at the problem from a multitude of perspectives and ask if the assumptions being made are good ones. Can I make a logical argument why they are true and why a contrary set of assumptions is false? The ability to consider alternate assumptions and confounding factors that may not be obvious is very important to any statistical analysis.

KF: Was graduate school useful training for your career?

LC:

In my own background, my first year of graduate school just about drove me crazy as I did nothing but statistics. Most of it was extremely theoretical work. I have always said that the master's degree they gave me after that year was in some sense worthless as I got very little exposure to the practical side of the statistical profession.

But what I did learn from the full immersion was to think in terms of probability. I came to see statistics not just as theory but as a way of thinking. Thinking in probability terms became second nature. It helped that I was trained in a Bayesian environment long before it reached the levels of use that we now see.

I saw statistical testing as a sometimes useful but limited tool. It became clear to me that much more important questions are what is the best estimate and what is the decision rule that would come out of the analysis. It was just as important to know how that data is going to be used as it is to know how the analysis is to be done.

KF: How did your work at the Census Bureau influence you?

LC:

To do good statistics, knowledge of the subject matter it is being applied to is critical. I also learned early on that issues of variance and bias in any estimate are actually more important than the estimate itself. If I don't know things like the variability inherent in an estimate and the bias issues in that estimate, then I really don't know very much.

A favorite saying among the statisticians at the Census Bureau where I worked is that the biases are almost always greater than the sampling error. So my first goal is always to understand the data source, the data quality and what it actually measures.

But I also still have to make decisions based on the data I have. The real question then becomes: given the estimate on hand, what I know about the variance of that estimate, and the biases in that estimate, what decision am I going to make?

KF: I want to reiterate this point to my readers who are not statisticians. In data analysis, we are using available data (the sample) to make a general statement, say, using the responses of subjects enrolled in a clinical trial to describe the effect of a new drug on all potential patients. Imagine you are trying to hit the bull's eye. Shotgun #1 produces a wide scatter around the target while Shotgun #2 produces a narrow scatter, but the average shot lands wide of the target. We say that #1 has high variance and low bias while #2 has high bias but low variance. Both types of errors contribute to the shot being off the bull's eye.

Because the Census Bureau typically uses large samples, the sampling error (variance) is very manageable. What is hard to control is bias, meaning the entire sample is not representative of the population under study. This is Shotgun #2.
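To make the shotgun analogy concrete, here is a small simulation (my own toy example, not Cahoon's): Shotgun #1 is unbiased but noisy, Shotgun #2 is precise but aims off-target, and both end up with shots away from the bull's eye.

```python
# Shotgun #1: low bias, high variance. Shotgun #2: high bias, low variance.
import numpy as np

rng = np.random.default_rng(0)
target = 0.0

shots_1 = rng.normal(loc=target,       scale=3.0, size=10_000)
shots_2 = rng.normal(loc=target + 2.0, scale=0.5, size=10_000)

for name, shots in [("Shotgun #1", shots_1), ("Shotgun #2", shots_2)]:
    bias = shots.mean() - target
    variance = shots.var()
    mse = np.mean((shots - target) ** 2)   # mse = bias^2 + variance
    print(f"{name}: bias={bias:+.2f}, variance={variance:.2f}, mse={mse:.2f}")
```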

LC continues:

As I matured in my career, I learned that diversity matters. The greater the exposure and the more diverse the exposure to real world statistics, the better the practitioner would become. So while I worked for many years in the area of survey design, when I went to the annual statistical meetings, I always made the effort to maximize my exposure to other areas of statistics. I would always return home with a few textbooks on areas of statistics outside of what I was actually working on at the job.

A curious nature with a desire to continue learning is essential. Today a good training route is to read as many statistical blogs as you can find the time for. Especially important is to seek out and read the work of those who disagree with me. This forces me to think much more critically about the work I am doing.

In the first chapter of my first book, Numbers Rule Your World (link), I explored the concept of variability using a pair of examples, one of which was Disney's FastPass virtual reservation system. Truly grasping the ins and outs of variability is one of the most important objectives for a budding statistician (or data scientist). In the discussion, I highlighted the work of Len Testa, whose website, TouringPlans.com, provides custom, computer-optimized itineraries for saving time in Disney's theme parks. Testa's team does exemplary work in applying mathematical models to solve a practical problem. I'm glad to present an interview with Testa today.

KF: The problem of hitting a sequence of destinations in the fewest steps has a long history. Lots of people have worked on it. The most famous of these problems, the Traveling Salesman, even makes it into the mainstream press. However, most of this work is highly theoretical. Your touring plans, to me, are a shining example of amazing applied work that makes a lot of people happier. How does your work differ from the others?

LT:

A lot of times when you're trying to analyze data to solve a particular problem, you can approach it either from the perspective of "management" - the people controlling the process - or from the side of the "consumer."

The thing we try to model is optimal movement through a theme park. That is, if you're a customer and you want to ride 10 attractions, in what order should you visit them to minimize your wait in line?

The first time we approached this problem, we tried to figure out all of the things you need to know if you're running a theme park: ride speed, how many vehicles to have on the track, when to schedule entertainment to draw people to other parts of the park, and so on. It was complicated.

Then we looked at it from the point of view of the consumer. Consumers, it turns out, have a lot less information about how a theme park is run. About the only thing they really have is the posted wait time at every attraction. But it turns out that the posted wait time is really a synthesis of all the small decisions a theme park manager makes, so that's all you need. It's also a lot simpler to model.

KF: That is a really great answer. I hope all the budding data analysts out there are listening. Simplicity is a beauty.
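[Edit: To make the consumer-side framing concrete, here is a toy sketch: pick an order for a few attractions using only posted wait times. The attractions, the waits, and the brute-force approach are invented for illustration; this is not TouringPlans' actual algorithm.]

```python
# Toy version of the consumer's problem: minimize total time in line using only
# (hypothetical) posted wait times that vary by hour of day.
from itertools import permutations

RIDE_MINUTES = 10               # assume each attraction itself takes ~10 minutes
posted_wait = {                 # minutes of posted wait, by hour of day
    "Space Mountain":  {9: 20, 10: 45, 11: 70, 12: 80},
    "Splash Mountain": {9: 10, 10: 30, 11: 60, 12: 75},
    "Haunted Mansion": {9: 15, 10: 20, 11: 25, 12: 40},
}

def total_wait(order, start_hour=9):
    clock, waited = start_hour * 60, 0
    for ride in order:
        hour = min(12, clock // 60)        # clamp to the hours we have data for
        wait = posted_wait[ride][hour]
        waited += wait
        clock += wait + RIDE_MINUTES
    return waited

best = min(permutations(posted_wait), key=total_wait)
print(best, "->", total_wait(best), "minutes in line")
```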

***

KF: What is your pet peeve with published data interpretations?

LT:

Lack of context, especially around economic or political analyses. Yeah, an $860 billion stimulus package sounded like a lot of money in 2008. But in relation to a $15 trillion economy, it's what, 6%? But all of the discourse was on the raw number, not its size relative to the economy.

***

KF: Do you have other tips for doing great applied data work?

LT:

I find reasoning by analogy to be a powerful way to understand and explain things. For example, if you're putting off a flight to Europe because you're afraid of a plane crash, which is a 1-in-500,000 chance, then why would you ever drive to work, where the odds of dying are an order of magnitude worse? So a lot of it is "If I'm willing to accept X then I should be willing to accept Y" type of thing.

Another helpful thing is being able to apply Bayes Theorem, especially when you're trying to make a business case for something. I remember one time we were trying to get funding to re-do some computer system (at another job), and we calculated a probability of 80% that if we made these changes, we'd succeed in reducing customer problems and lower future operating costs. Some people looked at that as a 1-in-5 chance of failure. I pointed out that we were making decisions every day with a lot less than a documented 80% chance of success.

Closer to home, I like the mix of statisticians we have now at TouringPlans.com. They have different styles they use to approach problems, so it's useful to hear two different views. And when they agree, you know you've got a decent shot of being right.

KF: Statisticians disagree? They don't know the truth. Readers: you heard it on this blog first! Len, thank you so much for your time.

Professor Andrew Gelman is a pioneer in statistics blogging. His blog is one of my regular reads, a mixture of theoretical pieces, applied work, psychological musing, rants about unethical academics, advocacy of statistical graphics, and commentary on literature. He's one of the few statisticians who gets opinion pieces published in the New York Times. His expertise is statistics in politics, but I also enjoy his work on the stop-and-frisk policies of the NYPD, debunking of evolutionary psychology and ESP research, etc. (I also recommend our collaborative piece on the Freakonomics franchise.) He is a co-author of Bayesian Data Analysis, one of the most influential textbooks on Bayesian statistics; the third edition is out now (link).

Here is my interview with Andrew.

***

KF: Andrew, you have impeccable credentials, degrees from MIT and Harvard, Professor at Columbia, Fellow of ASA and IMS, etc., but in my experience, having degrees doesn't automatically prepare one to do great applied data work like you have done. What's your secret?

AG:

Here's one thing regarding "great applied work": Ask yourself the question: What makes a statistician look like a hero? You might think that the answer would be, Extracting a small faint signal from noise. But I don't think so. I think that a statistician looks like a hero by studying large effects.

Statisticians have been studying ESP for decades, trying to tease out tiny signals amid masses of noise, and they just look like chumps. But in the projects I've worked on that have been successful, we've been aiming at big fat targets--things like incumbency in elections, or the effects of redistricting, or predicting home radon levels (not such a hard task; radon levels vary a lot by geography), or measuring the number of friends people have, etc etc etc. In these problems, my statistical successes have often come from methods that have allowed the combination of information from different sources. Often what is important about a statistical method is not what it does with the data, but rather what data it uses. Good methods have the flexibility to incorporate more information into the analysis.

I've picked up skills over the years, and I'd definitely say I'm better at data analysis and statistical reasoning than I used to be. On the other hand, whenever I've tried to design an experiment for my substantive research, I've failed miserably. I've had lots of success in my research in social science and public health, but almost all involving the analysis of existing data. I attribute my inability to design an experiment to a combination of lack of practice and lack of natural talent. Measurement is central to statistics and is a completely different thing than data analysis.

***

KF: What is your pet peeve about published data interpretations?

AG:

I've called it the lure of certainty. It's a problem with researchers and with consumers of research as well: they don't want to acknowledge uncertainty and variation. There's lots of talk about the treatment effect without a sense that it can and will vary, that a small positive effect in one setting might be negative in another, and that existing data might not be enough to determine its sign, even in the context of the data collection. I get so frustrated when people take a "p less than .05" statistically significant result as definitive evidence.

Here's an example. Recently the journal Psychological Science published some papers claiming scientific results based on college students and participants in the online Mechanical Turk system. I criticized these results for lack of generality and the response was that this is standard practice, that it's naive to criticize a psychology study for having a nonrepresentative sample.

In this case, though, I think the naive criticism is correct. If you're estimating an effect with large average size and small variation, then it's ok to use a nonrepresentative sample. But in these social psychology examples, we have every reason to believe that main effects are small and interactions are large--that is, effects could be positive in some groups and negative in others. In such settings, a nonrepresentative sample can kill you. But people don't want to think about this, because it's more comfortable to think about effects as "real" or "not real" without acknowledging variation.

***

KF: What sources do you turn to for reliable data analyses?

AG:

I rely on various colleagues of mine such as Don Rubin, Jennifer Hill, Eric Loken, John Carlin, and others. These people are very busy, though, so I've set up Rubin and Hill emulators in my head to work on problems when the main CPU in my head gets stuck.

David Walker and I debate whether statisticians should be a part of Big Data in Significance (link - requires registration). Significance is a joint publication of the Royal Statistical Society and the American Statistical Association, and a great source of applied statistics. Walker fears Big Data is pure froth while I outline several problems in which statisticians can play a key role.

For those shopping at Barnes and Noble, Numbersense is inexplicably and incorrectly classified as "business reference". I have had no luck so far getting either McGraw-Hill or B&N to classify the book properly.

Today's Numbersense Pros features Dean Baker, an economist and co-director of the Center for Economic and Policy Research. His blog, Beat the Press, is my first stop for economics news. What attracts me is his data-first attitude. Every post on his site is about interpreting economic data--there is none of the data-free theorizing that is typical on other economics blogs.

Baker adopts a cynical, almost satirical, tone, which is fun for readers but perhaps unamusing to the targets of his scorn. To give you a flavor, one of his recent posts is titled "Larry Summers Thought the Stock Bubble was Cool and Missed the Housing Bubble". Another post on how to fix unemployment begins with "there are two types of people in the world: those who make complicated things simple, and those who make simple things complicated."

Here is the interview.

KF: How did you pick up your impressive statistical reasoning skills? What sets you apart from other economists?

DB:

I don't know that my skills are so impressive, but insofar as I do things differently from other economists, I think it is the result of thinking more about the data, meaning how it is gathered and its underlying patterns, rather than trying to beat it up with complex statistical analysis. Certainly our econometric techniques have improved substantially over the last 2-3 decades, but I think it will be rare that a new technique will allow us to recognize a trend or relationship that we couldn't see with simpler tools.

***

KF: What is your pet peeve with published data interpretations?

DB:

People often fail to consider the most basic issues about the data they are looking at. I recently saw a news article commenting on how the Case-Shiller data released in July showed that the housing market was not being hurt by the jump in interest rates at the end of May and beginning of June. In fact, the Case-Shiller data released in July told us nothing about the housing market in June. The data was a 3-month average for the period between March and May. Furthermore, the data is based on closings. Since it typically takes 4-8 weeks between a signed contract and a closing, most of the sales that went into the May data were contracted in the period from January to April.

There are many other cases I could cite like this in which people analyzing the data have not even thought for a minute about the nature of the data they are looking at. It really helps to know what the data are.

***

KF: Which source(s) do you turn to for reliable data analysis?

DB:

That one is not easy. There are plenty of economists whose work I respect and when I see them write on a topic I can be confident that they have serious things to say, but in terms of regular analysis of the economy, I can't say that there is anyone who I think is especially good.

KF: Thank you. You are indeed one of the few people who really know the economic data. We are very lucky that you blog.

Today, I'm debuting a series of interviews called "Numbersense Pros". These are profiles of people who I turn to for well-reasoned data analyses. If you follow this blog, or read my new book, you'll understand why I say "well-reasoned" as opposed to "correct" or "true". The best way to develop your own Numbersense is to learn from others. In the interviews, I ask how they learned to analyze data, and what they read for inspiration.

The first interviewee is Felix Salmon, currently the finance blogger for Reuters. You can see his blog here. He comments mostly on economics and finance related matters, and sometimes makes funny videos (for example, here). I am a big fan of his writing.

***

KF: How did you pick up your impressive statistical reasoning skills?

FS:

The first cousin of statistics is probability, which I learned from being taught backgammon at an early age. Then there's this thing called Further Applied Mathematics which is studied at high school in the UK but which is pretty college-level stuff by US standards. That's about it, really.

KF: If I may push you a little further, I'd say most of the people who took Math at school do not develop the ability to judge data in real life; what differentiates you from the crowd?

FS:

The Maths A-levels were split into Pure and Applied. I was reasonably good at both, but I just breezed through the probability and statistics bit of Applied -- I found it by far the easiest part of the whole shebang. To put it another way: it's not something I learned, it's something I've always had an intuitive feel for. And I'm good at smelling bad numbers. When I'm proof-reading or editing somebody else's work, or when I'm reading something on the internet, I can just *tell* when a number is wrong. A few days ago, at a conference, a woman got up on stage and said "you can now fit all the world's music on a single $600 disk drive". And I just *knew*, without looking anything up, that it wasn't true. Don't ask me how.

KF: What is your pet peeve with published data interpretations?

FS:

The data I encounter most often is market data, where my biggest pet peeve is that no one ever stops to think whether they're looking at signal or noise. Specifically, market journalism seems obsessed with one-day moves, even though they (ought to) know that one-day moves are nearly always noise rather than signal.

KF: Which sources do you turn to for reliable data analysis?

FS:

The iterative blogosphere. Any one blog post can easily be erroneous. But if you get a real blogospheric conversation going, the end result is likely to be pretty robust and sophisticated.