Posts categorized "Causation creep"

In our latest Statbusters column for the Daily Beast, we read the research behind the claim that "standing reduces odds of obesity". Especially at younger companies, it is trendy to work at standing desks because of findings like this. We find a variety of statistical issues that call for better studies.

For example, the observational dataset used provides no clue as to whether sitting causes obesity or obesity leads to more sitting. Further, as explained in the column, what you measure, and even more importantly, what you don't measure, makes or breaks the analysis.

These lessons are highly relevant to anyone working with "big data" studies.

I like this passage very much; it really drives home the point that good analytics requires intuition:

The hours of waiting [during draft meetings] were often filled with watching film of prospects. It helped me refine my analysis, as I soaked up details from scouts that I never would have seen on my own. ("Rewind that. ... Did you see his foot placement there, getting ready for the rebound? That's NBA ready.") During one of these sessions, we were watching film of Syracuse point guard Jonny Flynn. I mentioned that, based on the rate at which he collected steals, he was likely a good defender. But one of the scouts explained that Flynn's steal total was likely higher than other point guards' because Syracuse played mostly zone defense, which allowed guards to attack the ball more. I checked that insight against the data and it seemed true, so I adjusted my defensive statistics to account for the dominant style of defense used by a player's team.

I'm glad to hear that the style of play is included in the models. I cringe every time I hear a (usually English) football (i.e. soccer) commentator claiming that a team "deserves" to be in the lead because it is dominating the time of possession when, in fact, the other team is using a counter-attack strategy. When the other team ekes out a 1-0 victory on a sneak attack, the commentator is left speechless.

Alamar sees the next big challenge in NBA analytics as deriving value from the SportVU data. What is SportVU? Alamar tells us they installed cameras everywhere that "capture the coordinates of 10 players plus the ball 25 times every second." This is the typical "Big Data" scenario--data is collected without any design or any research question in mind. It raises a few intriguing questions:

The granularity of such data (here, 25 times a second, that is to say, four hundredths of a second apart) can be made arbitrarily small. When have we reached the point of picking up just background noise?

The very act of relating such data as "predictors" to outcomes such as scoring statistics presupposes a model in which the precise movements of the players or the ball are correlated with those outcomes. Whether we like it or not, any resulting analysis will take on a causal interpretation--this is what separates trivia from actionable insight. Is this type of predictor the most relevant to explaining outcomes? If we are not careful, we may come to believe this story simply because it is the one we started with.

Consider this paragraph from a FiveThirtyEight article about the small-schools movement (my italics):

Hanushek calculated the economic value of good and bad teachers, combining the “quality” of a teacher — based on student achievement on tests — with the lifetime earnings of an average American entering the workforce. He found that a very high-performing teacher with a class of 20 students could raise her pupils’ average lifetime earnings by as much as $400,000 compared to an average teacher. A very low-performing teacher, by contrast, could have the opposite effect, reducing her students’ lifetime earnings by $400,000.

If I had told you that students who performed higher on achievement tests have higher average lifetime earnings by as much as $400,000 compared to students who performed average on achievement tests, you'd not be surprised--unless you are a skeptic of achievement tests. This is evidence that achievement test scores predict lifetime earnings.

Now, this is not what Hanushek (or the journalist) wants you to believe. He (or she) said that high-performing teachers "could raise" pupils' average lifetime earnings. Two logical jumps are made in one breath here: one is the use of student achievement scores as a proxy for "teacher quality"; the second is "causation creep" (i.e. allowing a causal interpretation to creep into correlational evidence), which is signaled by the use of weasel words like "could" and "may".

The use of proxy measures is the source of many "statistical lies". One tool I use is "proxy unmasking": substitute the metric that was actually measured for the proxy metric. So in this example, when I see "high-performing teacher," I substitute "high-performing student," since the observed data measured students, not teachers. The sound you hear is air rushing out of the hyperventilated argument.

This is part 3 of my response to Gelman's post about the DST/heart attacks study. The previous parts are here and here.

One of the keys to vetting any Big Data/OCCAM study is taking note of the decisions made by the researchers in conducting the analysis. Most of these decisions involve subjective adjustments or unverifiable assumptions. Neither of those things is inherently bad--indeed, any analysis one comes across is likely to involve one or, more likely, both. As consumers of such analyses, we must be aware of what those decisions are.

The authors selected a period of time to study. For the research paper, this was January 2010 to September 2013. The database has existed since 1998, and it wasn't explained why the other years are irrelevant. Moreover, in the poster presentation, the analysis was based on March 2010 to November 2012, a different but overlapping period. In any case, it is assumed that what happened during those months is representative.

Heart attack admissions are assumed to be a reliable indicator of heart attacks. (Now, it is true that in the publications, the researchers explain that they use admissions requiring PCI as a proxy for heart attacks but as per usual, the reporters drop the modifiers, thus becoming complicit in "story time": selling us one bill of goods (admissions) and then delivering another (heart attacks).)

What happened at "non-federal" hospitals is assumed to be the same as what happened at other hospitals.

What happened in Michigan is assumed to be representative of what happened in 47 other states. Also assumed is the lack of similar effect in the two states that do not change their clocks.

Cases of heart attack admission that did not result in PCI are not tracked by the data collector, and are assumed to be unimportant.

The data is assumed to be correct. Procedures to collect data and to define cases are assumed to be consistent across all participating hospitals.

The sample size is really small. There were a total of four Spring Forwards and three Fall Backs in the data.

What annoyed Andrew: no adjustment is made for multiple comparisons, which amounts to assuming that the observed effect is not a chance finding. This is a strong assumption. (A minimal sketch of such an adjustment appears below, after this list of assumptions.)

The effect of DST (if it exists) is assumed to be linear in the number of days from the DST time shift. In other words, the patients admitted on the Tuesday after Spring Forward are assumed to be twice as exposed to DST as the patients admitted on the Monday. It's hard for me to get my head around this assumption.

If enough of these assumptions or modeling decisions bother you, you should ignore the study and move on.
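To see what the multiple-comparisons critique refers to, here is a minimal sketch of the simplest adjustment (Bonferroni). The p-values below are invented for illustration; they are not from the study.

```python
# Minimal Bonferroni sketch with made-up p-values (one hypothetical test per
# day of the week); not from the actual study.
p_values = [0.03, 0.20, 0.45, 0.07, 0.60, 0.01, 0.35]
alpha = 0.05
adjusted_alpha = alpha / len(p_values)   # 0.05 / 7 is roughly 0.007

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"comparison {i}: p = {p:.2f} -> {verdict} at the Bonferroni-adjusted level")
```

Note how a nominally "significant" p-value of 0.03 no longer clears the bar once the number of comparisons is taken into account.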

***

This study, like many others, is a perfect illustration of story time. In such studies, the researchers present some data analyses that tie factor A with outcome B; frequently, neither factor A nor outcome B is directly measured so the researchers start to speculate about a web of causation. Sleepy readers may not realize that much of the discussion is pure speculation while the result from the data analysis is extremely limited.

Story time occurs right here:

Our study corroborates prior work showing that Monday carries the highest risk of AMI. This may be attributed to an abrupt change in the sleep–wake cycle and increased stress in relation to the start of a new work week.

The first sentence is based on their data but the second statement is pure speculation. There is absolutely nothing in this study to confirm or invalidate the claim that the sleep-wake cycle of the average Michigan resident who presents himself or herself to the hospital ward was disrupted, or that the stress experienced by said resident has increased.

***

The fallacy of causation creep shows up as well. The authors said, "Our data argue that DST could potentially accelerate events that were likely to occur in particularly vulnerable patients and does not impact overall incidence."

If the DST effect is merely a correlation, and not a cause, it would not follow that by changing DST, one can affect the outcome. The only way the above statement holds is if one interprets the correlation as causation. Their "data" have done no arguing; it is the humans who are making this claim.

***

For those mathematically inclined, here is the description of the statistical model used in estimating the "trend" of heart attacks (recall that the gap between the actual counts and this trend is claimed as the DST effect):

This model allowed for a cubic trend in numeric date as well as seasonal factors reflecting weekday (Monday–Friday), monthly (January–December) and yearly (2010–2013) effects. The model also adjusted for the additional hour on the day of each fall time change, as well as the loss of an hour on the day of spring time changes through the inclusion of an offset term.

... The impact of the spring and fall time changes on AMI incidence adjusting for seasonality and trend was assessed through the addition of indicator variables reflecting the days following spring and fall time changes as predictors to the initial trend/seasonality model.

I'm a bit confused by this description, which implies that the weeks of the DST time shifts are included in the model used to predict the trend and seasonality. When this model prediction is subsequently compared to the actual admission counts in the week after the DST time shift to compute relative risk, aren't they just looking at the residuals of the model fit?
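This is not the authors' code, but the following is a minimal sketch of the kind of model the description suggests: a Poisson GLM on daily counts with a cubic trend, weekday, month and year factors, an offset for the 23- and 25-hour days, and indicator variables for the week after each time change. The counts here are randomly generated placeholders; only the structure is meant to be illustrative.

```python
# A sketch of the described trend/seasonality model, NOT the authors' code.
# Daily counts are random placeholders; only the model structure matters here.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
dates = pd.date_range("2010-01-01", "2013-09-30", freq="D")
df = pd.DataFrame({"date": dates, "count": rng.poisson(30, size=len(dates))})

spring = pd.to_datetime(["2010-03-14", "2011-03-13", "2012-03-11", "2013-03-10"])
fall = pd.to_datetime(["2010-11-07", "2011-11-06", "2012-11-04"])

df["t"] = (df["date"] - df["date"].min()).dt.days / 365.25     # numeric date, in years
df["weekday"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year
df["hours"] = 24 - df["date"].isin(spring) + df["date"].isin(fall)  # 23- and 25-hour days

def week_after(changes):
    # indicator for the seven days following a time change
    days = [d + pd.Timedelta(days=k) for d in changes for k in range(1, 8)]
    return df["date"].isin(days).astype(int)

df["post_spring"] = week_after(spring)
df["post_fall"] = week_after(fall)

model = smf.glm(
    "count ~ t + I(t**2) + I(t**3) + C(weekday) + C(month) + C(year)"
    " + post_spring + post_fall",
    data=df,
    family=sm.families.Poisson(),
    offset=np.log(df["hours"] / 24.0),   # adjusts for the short and long days
).fit()
print(model.summary())
```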

Also conspicuously absent is any mention of a hospital effect or a geographical effect or a patient demography effect, all of which I'd think are possible predictors of heart-attack admissions.

As others binge-watch Netflix, I binge-read Gelman posts, while riding a train with no wifi and a dying laptop battery. (This entry was written two weeks ago.)

Andrew Gelman is statistics’ most prolific blogger. Gelman-binging has become a necessity since I have not managed to keep up with his accelerated posting schedule. Earlier this year, he began publishing previews of future posts, one week in advance, and one month in advance.

Also, I have been stubbornly waiting for the developers of my former favorite RSS reader to work out an endless parade of the most elementary bugs, after they launched a new site in response to Google Reader shutting down. Not having settled on a new RSS tool has definitely shrunk the volume of my reading.

I only managed to go through about a week’s worth of posts because the recent pieces interest me a lot.

Gelman links to Lior Pachter's review of what he calls "quite possibly the worst paper I've read all year".

This bit deserves further mocking: when the researchers fail to achieve conventional 5% significance, they draw conclusions based on a "trend towards significance". This sleight of hand happens frequently in practice as well, where the phrase "directional result" is used.

When an observed effect, as in this case, is not statistically significant, the implication is that the signal is not large enough to distinguish from background noise. When the researcher then says "but I still see a signal", said researcher is ignoring the uncertainty around the point estimate, pretending that the noise doesn't exist. The researcher is in effect making a decision using the point estimate alone. Anyone who has taken Stats 101 should know not to rely on a point estimate by itself.

One great tenet of statistical thinking is the recognition that the observed data sample is merely one of many possible things that could have happened. The confidence interval is an attempt to capture the range of possibilities, and the much-maligned tests of significance represent an attempt to reduce such analysis to one statistic. It achieves simplicity at the expense of nuance.
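To make this concrete with invented numbers: suppose the point estimate of an effect is 0.8 with a standard error of 0.5. The 95% confidence interval is roughly 0.8 ± 1.96 × 0.5, or (-0.2, 1.8). Declaring a "trend towards significance" reads only the 0.8 and ignores that the interval comfortably includes zero, and even negative effects.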

This cannabis study is also a great example of what I've been calling "causation creep". The authors are well aware that they have merely found an instance of correlation (arguably not even that, but let's grant it for the sake of argument), but when they start narrating their finding, they cannot help but use causal language.

The title of the paper is "Cannabis use is quantitatively associated with...", and yet the lead author told USA Today: "Just casual use appears to create changes in the brain in areas you don't want to change."

Causation creep is actually endemic in academic publishing of observational studies, and I don't want to single these authors out.

Gelman has been on this one for a while. The offensive paper looked at the correlation between hurricane damage and the gender of the names we give these hurricanes. I didn’t find it worth spending my time studying this line of research but I’m assuming that the problem is considered interesting because they claim to have found a “natural experiment” in that the gender is effectively “randomly assigned” to the hurricanes as they appear.

I have been quite irritated over the years by this type of research, encouraged by the fad of Freakonomics. Even if they did find a natural experiment, what is that experiment about? Instead of spending research hours on correlating damage with naming conventions, why not spend the precious time looking for real causes of hurricane damage? You know, like weather patterns, currents, physical phenomena, human-induced climate changes, human decisions to live in high-risk areas, etc.?

I should note that much of Steven Levitt's original work that launched this field deals with real problems, like crime rates. It's just that many of his followers have gone astray.

Matt Novak debunks an article in Vox which repeats the assertion by the tech industry that new technologies have been adopted much more quickly in recent years than in the past. Vox is not the only place where you see this assertion. We have all seen variations of the chart shown on the right.

Novak puts on a statistician's hat and asks how the data came about. This type of chart is particularly prone to errors since many different studies across different eras are needed.

What Novak found: the invention dates of older technologies (like TV and radio) were defined by their invention in the laboratory, while recent technologies (such as the Internet and mobile phones) were dated from their commercialization. Needless to say, adoption is expected to be slow when the technologies were not yet available to consumers!

Needless to say, anyone who cites this chart or its conclusion from here on out should be publicly shamed.

Gelman nicely distills one of the central messages in my Numbersense book (Get it here). All data analyses require assumptions; assumptions are subjective; making assumptions is not a sin; clarifying one's assumptions and vigorously testing them is what makes a good analysis. Go read this post.

Gelman was surprised by a recent paper in which the researchers found that 42% of their sample purchased detergent on their most recent trip to the store. This reminds me of the section of Numbersense (Get it here) in which I described a study in which some marketing professors had mystery shoppers track people in a supermarket and, within seconds of their placing groceries in their trolleys, ask them how much the items cost. The error rate was quite shocking.

There is another big problem with this research design. People's memory of what they purchased depends on how long ago that "most recent" trip was. I also wonder how online purchasing affects this sort of study as I typically don't count going to a website as "a trip to the supermarket". It seems like some sort of prequalification is needed but prequalification always restricts the generalizability of any finding.

Andrew gently mocks both of these commonly used procedures. The discussion of outlier detection is buried in the comments section, so if you are interested, you should scroll below the fold. Gelman's annoyance with outlier detection is semantic, but the semantics are important, and they align with my own practice. Like Gelman, I don't consider every extreme value an outlier.

Stepwise regression is a suboptimal procedure, and Gelman prefers modern techniques like the lasso. But lots of practitioners use stepwise because the procedure is "intuitive", that is to say, one can explain it to a non-technical person without their eyes glazing over. The discussion below the post is worth reading.
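For readers who want to see the contrast, here is a minimal sketch of the lasso alternative on synthetic data, using scikit-learn; it is illustrative only and not tied to any particular analysis.

```python
# Minimal lasso sketch on synthetic data (illustrative only). Instead of adding
# or dropping predictors one at a time (stepwise), the lasso shrinks the
# coefficients of irrelevant predictors toward exactly zero.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                               # 10 candidate predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)     # only two of them matter

Xs = StandardScaler().fit_transform(X)                       # lasso is scale-sensitive
model = LassoCV(cv=5).fit(Xs, y)
print(model.coef_.round(2))                                  # most coefficients end up at 0
```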

To add to my prior post, having now read the published paper on the effect of DST on heart attacks, I can confirm that I disagree with the way the publicist hired by the journal messaged the research conclusion. And some of the fault lies with the researchers themselves who appear to have encouraged the exaggerated claim.

Here is the summary of the research as written up by the researchers themselves. First I note the following conclusion:

and right before, they write this explanation of the "timing" effect:

So indeed, if I were to believe the research, someone may have a heart attack on Monday instead of Tuesday "as a result of" daylight savings time in the spring. And wait a minute, by reversing this change in the fall, we seemingly postpone some heart attacks by two days. Hence my assertion that even if true, the phenomenon is not interesting.

In fact, I think this study provides negative evidence toward the idea that DST causes heart attacks. Here is how the authors describe their hypothesis:

The new data show no statistical difference in overall heart attack (admissions) for either period. That is their main result.

***

In this post, I want to discuss the challenges of this type of research. The underlying data is OCCAM (see definition here): it is observational in nature, it has no controls, it is seemingly complete (for "non-federal" hospitals in Michigan), and it is adapted and merged (as explained in the prior post).

Start with the raw data, in which there is a blip observed on the Monday after Spring Forward. This problem is one of reverse causation: we see a blip, and now we want to explain it.

Spring Forward is put forward as a hypothetical "cause" of this blip. But, we should realize that there is an infinity of alternative causes.

Seasonality is clearly something that needs to be considered. Is it normal to see an increase in admissions from Sunday (weekend) to Monday? To establish how unusual that blip is, we need to manufacture a "control," because none exists in the data.

In the poster presentation, the researchers use a simple control: what happened the week before? (This is known as a pre-post analysis.) The red line shown on the chart would suggest that a jump on Monday is unusual. This chart is a reproduction of the two charts from the poster but superimposed.

One can complain that the pre-1-week control is too simplistic. What if the week before was anomalous? A natural way forward is to use more weeks of data in the control. In the published paper, the researchers abandon the pre-1-week control, and basically use several years of data to establish a trend.

But this effort is complicated by the substantial variability in the data over time:

(I can't explain why the counts here are so much lower than the counts given in the post-DST week line in the first chart. In the paper, they describe the range of daily counts as 14 to 53.)

So expanding the window of analysis is double-edged. On the one hand, we guard against the one week prior to Spring Forward being an anomaly; on the other hand, we include other weeks of the year that are potentially not representative of the period immediately prior to Spring Forward.

The researchers do not simply average the prior weeks--they actually produce a statistical adjustment on the raw data, and call that the "trend model prediction". This is a very appealing concept. What we really want to know (but can't) is the "counterfactual": the number of cases if there were no DST time change.

In the next chart (reproduced from their paper), the "trend" line is what the authors claim the counterfactual counts would have been. They then compare the red line to the blue line (actual counts) and make claims about excess cases.
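As a minimal sketch (with invented numbers, not the paper's), the comparison amounts to computing excess cases and relative risk from the actual counts and the trend-model counterfactual:

```python
# Invented numbers for illustration; Mon..Sun admissions in the post-DST week.
import numpy as np

actual    = np.array([32, 28, 27, 30, 26, 25, 24])   # observed counts (hypothetical)
predicted = np.array([25, 27, 27, 29, 26, 25, 24])   # trend-model counterfactual (hypothetical)

excess = actual - predicted            # claimed "excess cases"
relative_risk = actual / predicted     # e.g. Monday: 32 / 25 = 1.28
print(excess)
print(relative_risk.round(2))
```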

***

Of course, the devil is in the details. If you're going to make predictions about the counterfactual, the reader has to gain confidence in the assumptions you use to create those predictions.

One way to understand the statistical adjustment is to plot the raw data and the adjusted data side by side. Unfortunately we don't have the raw data. We do have the one week of pre-DST data from the poster. So I compare that to the "trend".

This chart raises two questions. First, the predicted counts in the paper are about 30% higher than the counts in the pre-DST week from the poster. Second, the pre-week distribution of counts by day matches the "trend" poorly.

While the pre-count is not expected to match the predicted "trend" perfectly, I'd expect that the post-counts should match since both the poster and the paper address what happens the week after the DST time change.

Strangely enough, the counts in the paper are 35% higher than those in the poster for the post-DST week! I'm not sure what to make of this: maybe they have expanded the definition of what counts as "hospital admissions for AMI requiring PCI".

The attempt to establish a control by predicting the counterfactual is a good idea. Given the subjectivity of such adjustments, researchers should be rigorous in explaining the effect of the adjustments. Stating the methodology or the equations involved is not sufficient. The easiest way to explain the adjustments is to visualize the unadjusted versus the adjusted data. The direction and magnitude of the adjustments should make sense.
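A minimal sketch of that visual check, assuming a data frame with hypothetical columns for the raw and adjusted daily counts:

```python
# Plot unadjusted counts next to the adjusted ("trend") values so the reader
# can see the direction and magnitude of the adjustment. File and column names
# are hypothetical; this is not the authors' data.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("daily_admissions.csv", parse_dates=["date"])  # hypothetical file
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df["date"], df["raw_count"], alpha=0.6, label="unadjusted counts")
ax.plot(df["date"], df["adjusted_count"], lw=2, label="adjusted (trend) values")
ax.set_xlabel("date")
ax.set_ylabel("admissions for AMI requiring PCI")
ax.legend()
plt.tight_layout()
plt.show()
```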

***

Going back to the problem of reverse causation: seasonality, trend, and DST are only three possible causes of the Monday blip. Analysts must make an effort to rule out all other plausible explanations, such as bad data (e.g. every time the time changes, some people forget to move their clocks).

As I am testing your patience again with the length of this post, I will put my remaining comments in a third post.

Andrew Gelman discusses a paper and blog post by Ian Ayres on the Freakonomics blog. Their main result is summarized as:

We find that a ten percentage-point increase in state-level female sports participation generates a five to six percentage-point rise in the rate of female secularism, a five percentage-point increase in the proportion of women who are mothers, and a six percentage-point rise in the proportion of mothers who, at the time that they are interviewed, are single mothers.

Andrew finds these claims implausible; so do I.

Ayres uses the econometrics methodology called instrumental variables regression to support these claims. Since the data is observational, and, as Andrew pointed out, there wasn't even a period of time in which one could find exposed and unexposed populations (since the Title IX regulation was federal), one must treat such regression results with a heavy dose of skepticism.

It is useful to understand that causal claims are possible here only if we accept all the assumptions of the instrumental variables method.
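For readers unfamiliar with the method, here is a minimal two-stage least squares sketch on synthetic data. The variable names and the data-generating process are hypothetical, not from Ayres's paper; the point is only that the causal claim rests entirely on the instrument being valid.

```python
# Minimal 2SLS sketch on synthetic data (hypothetical setup, not the paper's).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000
z = rng.normal(size=n)                        # instrument (assumed exogenous)
u = rng.normal(size=n)                        # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)          # endogenous "treatment"
y = 0.0 * x + 2.0 * u + rng.normal(size=n)    # true causal effect of x is zero

naive = sm.OLS(y, sm.add_constant(x)).fit()                  # confounded: picks up u
x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues     # stage 1
iv = sm.OLS(y, sm.add_constant(x_hat)).fit()                 # stage 2

print("naive OLS slope:", round(naive.params[1], 2))   # far from zero
print("2SLS slope:", round(iv.params[1], 2))           # near the true value of zero
# Note: naive second-stage standard errors are not the correct 2SLS errors.
```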

Moreover, plausibility is helped by the ability to outline the causal pathways. It should be obvious that more females competing in college sports does not directly cause more females to become secular. The data on sports competition and on secularism come from different sources, and this presents a hairy problem. The analysis would have been more convincing if it had found that, among the women who participated in college sports, more became secular; what the analysis actually linked was a higher participation rate and higher secularism among all women in the state.

What is it about sports participation that would cause people to become secular? (The visual evidence from professional American sports would lead me to hypothesize the opposite--that sports participation may be associated with higher religiosity!) Is this specific to women? Did male secularism increase as sports participation by men went up?

As Andrew pointed out, the magnitude of the estimated effect seems too large to believe. I'd prefer to see these effects reported at more realistic increments. A jump of 10 percentage points in participation is very drastic. For example, according to the chart here (the one titled "a dramatic, 40-year rise"), the percentage of women participating in high school sports moved just 2 percent from 1995 to 2011.

***

Andrew is right that this is an instance of "story time". And we are not saying that statisticians should not tell stories. Story-telling is one of our responsibilities. What we want to see is a clear delineation of what is data-driven and what is theory (i.e., assumptions). The plausibility of a claim depends on the strength of the data, plus whether we believe the parts of the theory that are assumed.

Reader Daniel T. is unhappy about this analysis of the intraday Internet usage by OS and device types. He doesn't like their choice of index, which I'll get to in a second post. (Link appears here when ready.)

There is something else wrong with this type of analysis.

Let's do a thought experiment. If you are a marketer interested in the diurnal variability in Internet usage, what are some of the factors you might investigate? My list would include whether the user is logging in from work or from home; whether the user is working or unemployed or on vacation; whether the user is male or female, young or old, a student, retired, etc.

But OS is exactly what the blogger analyzed, and thousands of marketers around the world do so on a daily basis. That's because they are using what data they can get their hands on. Web log data are adapted, that is to say, they were collected by engineers for the purpose of debugging, and now they are used by marketers to explain consumer behavior. It's not hard to see why such data cannot tell the full story.

This goes back to the O and the A in my OCCAM framework for Big Data (link). Web log data is the prototypical example of data collected by tracking devices indiscriminately without purposeful design, and then adapted to marketing applications.

***

One way to cope with using adapted data is to be clear about our model of the world. Assume OS really does affect Internet usage. How does OS affect Internet usage? Are you assuming that the features of an OS directly condition a user's behavior? Or are you assuming that the choice of OS is an indicator of the type of user?

Another way to cope with adapted data is to find or collect the data you really want (e.g. demographics, occupation) rather than analyzing data you don't understand. Recall Sean Taylor's advice to collect your own data (link).

Yesterday, Larry Cahoon, a 29-year veteran at the Census Bureau, answered some questions. The rest of the interview is printed below.

***

KF: How can a data analyst improve his or her skills?

LC:

I have to say my best training has been many, many hours of just playing with statistics, playing with graphics, and reading the analysis others have done. The more data I see, the more analysis I do, the more graphics I look at and produce, the more I learn about how to look at data and how to see the pitfalls in the data analysis. This is getting down in the mud and dealing with real data with all of its warts. My wife tells people I’m a statistician through and through as every time she looks at my computer, there is a graphic of some type on the screen.

Finally, I have always been an avid reader of Science Fiction. No one can read Science Fiction without being forced to consider any problem from different perspectives and take into consideration differing assumptions. This in turn has helped me develop the ability to question the assumptions being made in any data analysis.

***

KF: What are your pet peeves with published data interpretations?

LC:

I seem to return again and again to the same issues with the analyses I see in the newspapers, online, and just about anywhere. The most basic problem is one of incomplete analysis.

We see so many papers and news reports where a data difference is observed and then the author goes off with an entire line of speculation without any data to justify that speculation. This line of thinking then frequently ends up with the claim that these two things are correlated, and therefore we have cause and effect.

The media then fans these reports by writing a story without asking basic questions, such as whether the data itself is any good or whether there is any evidence for the claims that are being made. The media acts as if the claims have been proven – especially in how they headline the story.

My second pet peeve is what I call an emphasis on a one-dimensional world. This is usually reflected in simple statements like: A causes B. The world is much more complex than that. Those who investigate airline accidents have been telling us for some time that there is seldom just one cause for each accident. Rather, there are a number of causes. We need to carry that knowledge over to our statistical analysis and reporting.

***

KF: Which source(s) do you turn to for reliable data analysis?

LC:

I can’t say that I have any favorite source for data analysis. If forced to name one, I would say that I tend to like the work of the Pew Research Center (link). Their surveys seem to be well designed, the questions they ask well thought out, and the analysis something I can trust.

I like the data that is available from the Federal Government. But the government agencies rightly avoid most detailed data analysis in an effort to remain nonpartisan.

***

KF: Thank you so much for your time. We're lucky that you continue to blog in your retirement.

A short while ago, I introduced Larry Cahoon's blog, GoodStatsBadStats. He started the blog almost two years ago after retiring from the US Census Bureau (site offline due to government shutdown), where he spent 29 years working on the statistical design of most of the household surveys conducted by the Bureau. Cahoon received his PhD from Carnegie Mellon University.

Cahoon spent the final seven years of his career working on the 2010 decennial census. He was involved in almost all areas of census design and operations. His work included focusing on the many issues of both overcoverage and undercoverage in the census, the effectiveness of the publicity campaign, and the use of the American Community Survey as a replacement for the long form of the census.

Larry was very gracious and offered great insights and long answers. I have divided the interview into two parts. Part 2 will appear tomorrow.

***

KF: What are the key skills of statistical reasoning?

LC:

There are three things on the top of my list when I think about statistical reasoning skills. A solid knowledge of statistical methods and principles is a necessary starting point. I extend this to include the ability to think in terms of probability. This is a way of thinking or a way of viewing the world at the most basic level.

Equally important is a solid foundation in logical reasoning. The third piece is recognizing and dealing with the assumptions that must be made in any statistical analysis. One needs to be able to look at the problem from a multitude of perspectives and ask if the assumptions being made are good ones. Can I make a logical argument why they are true and why a contrary set of assumptions is false? The ability to consider alternate assumptions and confounding factors that may not be obvious is very important to any statistical analysis.

KF: Was graduate school useful training for your career?

LC:

In my own background, my first year of graduate school just about drove me crazy as I did nothing but statistics. Most of it was extremely theoretical work. I have always said that the master’s degree they gave me after that year was in some sense worthless as I got very little exposure to the practical side of the statistical profession.

But what I did learn from the full immersion was to think in terms of probability. I came to see statistics not just as theory but as a way of thinking. Thinking in probability terms became second nature. It helped that I was trained in a Bayesian environment long before it reached the levels of use that we now see.

I saw statistical testing as a sometimes useful but limited tool. It became clear to me that much more important questions are what is the best estimate and what is the decision rule that would come out of the analysis. It was just as important to know how that data is going to be used as it is to know how the analysis is to be done.

KF: How did your work at the Census Bureau influence you?

LC:

To do good statistics, knowledge of the subject matter it is being applied to is critical. I also learned early on that issues of variance and bias in any estimate are actually more important than the estimate itself. If I don’t know things like the variability inherent in an estimate and the bias issues in that estimate, then I really don’t know very much.

A favorite saying among the statisticians at the Census Bureau where I worked is that the biases are almost always greater than the sampling error. So my first goal is always to understand the data source, the data quality and what it actually measures.

But I also still have to make decisions based on the data I have. The real question then becomes: given the estimate on hand, what I know about the variance of that estimate, and the biases in that estimate, what decision am I going to make?

KF: I want to reiterate this point to my readers who are not statisticians. In data analysis, we use available data (the sample) to make a general statement--say, using the responses of subjects enrolled in a clinical trial to describe the effect of a new drug on all potential patients. Imagine you are trying to hit the bull's eye. Shotgun #1 produces a wide scatter around the target, while Shotgun #2 produces a narrow scatter but the average shot lands wide of the target. We say that #1 has high variance and low bias while #2 has high bias but low variance. Both types of errors contribute to the shot being off the bull's eye.

Because the Census Bureau typically uses large samples, the sampling error (variance) is very manageable. What is hard to control is bias, meaning the entire sample is not representative of the population under study. This is Shotgun #2.
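A minimal simulation of the shotgun analogy, with made-up numbers, shows how the two kinds of error combine:

```python
# Shotgun #1: unbiased but high variance. Shotgun #2: low variance but biased.
# Numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
target = 0.0
shots_1 = rng.normal(loc=target, scale=3.0, size=10_000)
shots_2 = rng.normal(loc=target + 2.0, scale=0.5, size=10_000)

for name, shots in [("Shotgun #1", shots_1), ("Shotgun #2", shots_2)]:
    bias = shots.mean() - target
    variance = shots.var()
    mse = np.mean((shots - target) ** 2)      # mse is roughly bias**2 + variance
    print(f"{name}: bias={bias:.2f}, variance={variance:.2f}, mse={mse:.2f}")
```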

LC continues:

As I matured in my career, I learned that diversity matters. The greater the exposure and the more diverse the exposure to real world statistics, the better the practitioner would become. So while I worked for many years in the area of survey design, when I went to the annual statistical meetings, I always made the effort to maximize my exposure to other areas of statistics. I would always return home with a few textbooks on areas of statistics outside of what I was actually working on at the job.

A curious nature with a desire to continue learning is essential. Today a good training route is to read as many statistical blogs as you can find the time for. Especially important is to seek out and read the work of those who disagree with me. This forces me to think much more critically of the work I am doing.