Posts categorized "Statbusters"

Daniel Engber at Slate reviews the latest attempt to kill the messengers - an article in the Boston Globe by a Harvard biologist. Sounds like the NYT Magazine article by Susan Dominus that I discussed here.

The common threads are (a) the unscientific use of selected anecdotes to paint a picture of "mobs," a picture an easy Web search will quickly disprove, as Engber did; (b) the citation of a few colorful adjectives as the entire proof of bad behavior, while conveniently ignoring similar language used to denigrate the reformers (again easily found online), a practice known as cherry-picking and widely seen as unscientific; (c) the use of personal attacks to condemn others for personal attacks; and (d) no engagement with the scientific substance being debated, only a focus on personalities.

From the start, the big problem with "power pose" is that its most important scientific claims cannot be replicated. Nothing has changed despite the many thousands of words used to "call off the revolutionaries."

Andrew and I warned you about "power poses" in Slate some time ago (link).

Breaking news is that Dana Carney, a co-author of the paper that claimed the benefits of the power pose, has now confirmed that she no longer believes in the power pose. She is actively discouraging researchers from this "waste of time and resources."

Here is her statement (PDF link), which is well worth reading in full. This is a courageous statement.

The statement discloses a variety of tricks used to game p-values so that they meet the publishable 0.05 threshold. Everyone suspects someone else of playing such tricks, but it is rare for anyone to actually confess to them.

The highlights are:

Initially, the primary DV of interest was risk taking. We ran subjects in chunks and checked the effect along the way. It was something like 25 subjects run, then 10, then 7, then 5. Back then this did not seem like p-hacking. It seemed like saving money (assuming your effect size was big enough and p-value was the only issue)

Unfortunately, I have witnessed this type of p-hacking in industry all too often. In fact, many, many people run so-called A/B tests until they reach significance. There are many problems with what Carney described above. Imagine an effect size that is small (close to zero). As the samples accumulate, the measured effect will fluctuate around zero. If you keep checking, the p-value will eventually dip below 0.05 by chance, and then you stop. Further, they were shrinking the batch size as the experiment continued, which means more frequent peeks at a still-noisy estimate, making it even more likely that the p-value crosses the threshold by chance!

It's tough for me to believe that she wasn't aware that stopping when you hit p=0.05 is p-hacking but that's what she is saying.
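To see the mechanics, here is a minimal simulation (a sketch, not a reconstruction of Carney's study): two groups with a true effect of exactly zero, subjects added in shrinking chunks like those in the quote, and a t-test run after every chunk, stopping as soon as p drops below 0.05. The false-positive rate comes out comfortably above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def run_until_significant(batch_sizes, alpha=0.05):
    """Add batches of subjects to two groups with zero true effect,
    testing after each batch and stopping as soon as p < alpha."""
    a = np.array([])
    b = np.array([])
    for n in batch_sizes:
        a = np.concatenate([a, rng.normal(0, 1, n)])
        b = np.concatenate([b, rng.normal(0, 1, n)])
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True            # "significant" -- stop and write it up
    return False                   # never crossed the threshold

batches = [25, 10, 7, 5, 5, 5]     # shrinking chunks, as in the quote
trials = 5000
false_positives = sum(run_until_significant(batches) for _ in range(trials))
print(false_positives / trials)    # noticeably above the nominal 0.05
```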

For the risk-taking DV: One p-value for a Pearson chi square was 0.052 and for the Likelihood ratio it was 0.05. The smaller of the two was reported... I had found evidence that it is more appropriate to use "Likelihood" when one has smaller samples and this was how I convinced myself it was OK.

She's focused here on the researcher degree of freedom issue. The larger problem is the magic dust that seems to sprinkle off p=0.05. If that is the chosen threshold for significance, and my result is right on the cusp, I would be very skeptical of this result. I don't think 0.052 is the better number - they are both bad.
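For the curious, here is what those two tests look like side by side on a made-up 2x2 table; scipy's chi2_contingency computes both the Pearson chi-square and, via the lambda_ option, the likelihood-ratio (G) test. The numbers below are purely illustrative, not Carney's data; the point is simply that two defensible tests on the same table give slightly different p-values, and picking the smaller one is a choice, not evidence.

```python
import numpy as np
from scipy.stats import chi2_contingency

# A made-up 2x2 table (condition x took-the-risk), NOT the study's data
table = np.array([[13, 8],
                  [ 6, 15]])

chi2, pearson_p, dof, _ = chi2_contingency(table, correction=False)
g, lr_p, _, _ = chi2_contingency(table, correction=False,
                                 lambda_="log-likelihood")

print(f"Pearson chi-square p = {pearson_p:.3f}")
print(f"Likelihood-ratio (G-test) p = {lr_p:.3f}")
# The two tests give slightly different p-values on the same data;
# reporting whichever one happens to land under 0.05 is a researcher
# degree of freedom, not extra evidence.
```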

The self-reported DV was p-hacked in that many different power questions were asked and those chosen were the ones that "worked".

Many A/B testing platforms come with a battery of hundreds of metrics automatically computed for each test. No further comment needed.
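Well, perhaps one quick sketch of why that battery of metrics is an invitation to p-hack: score a do-nothing treatment on 200 made-up metrics and a handful will come up "significant" anyway.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_metrics, n_per_arm = 200, 1000

# An A/B test where the treatment truly does nothing, scored on 200 metrics
significant = 0
for _ in range(n_metrics):
    control = rng.normal(0, 1, n_per_arm)
    treatment = rng.normal(0, 1, n_per_arm)   # identical distribution
    if stats.ttest_ind(control, treatment).pvalue < 0.05:
        significant += 1

print(f"{significant} of {n_metrics} metrics 'significant' by chance alone")
# Roughly 200 * 0.05 = 10 spurious winners to cherry-pick from.
```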

As of today, the TED talk on "power poses" is still going strong. It has accumulated 36 million "views" and the official description does not mention Dana Carney's retraction.

In our latest Statbusters column for the Daily Beast, we read the research behind the claim that "standing reduces odds of obesity". Especially at younger companies, it is trendy to work at standing desks because of findings like this. We find a variety of statistical issues calling for better studies.

For example, the observational dataset used provides no clue as to whether sitting causes obesity or obesity leads to more sitting. Further, as explained in the column, what you measure, and even more importantly what you don't measure, makes or breaks the analysis.
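A small simulation makes the direction-of-causality problem concrete. In the made-up world below, the causality runs entirely from obesity to sitting, yet the standard cross-sectional regression still reports a sizable "effect" of sitting on BMI; a snapshot alone cannot tell the two stories apart. (All numbers are invented for illustration.)

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical data-generating story: higher BMI -> more hours of sitting
# (the reverse of the headline's causal claim)
bmi = rng.normal(26, 4, n)
sitting_hours = 6 + 0.3 * (bmi - 26) + rng.normal(0, 1.5, n)

# A cross-sectional analysis regresses BMI on sitting anyway
slope, intercept = np.polyfit(sitting_hours, bmi, 1)
print(f"apparent 'effect' of one extra hour of sitting: +{slope:.2f} BMI units")
# A clear positive "effect" appears even though, in this simulated world,
# sitting caused nothing -- the snapshot cannot distinguish the two stories.
```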

These lessons are highly relevant to anyone working with "big data" studies.

In this week's Statbusters, my column with Andrew Gelman in the Daily Beast, we take note of Slate's recent rant about "wasteful" anti-smoking advertising, and demonstrate how to think about cost-benefit analysis. The key point: if you are going to make an extreme claim, you had better have some numbers to back it up.

These numbers can be approximate, and based on (potentially dubious) Googled data. Not every analysis needs to be super precise.

In the first two chapters of Numbersense, I discuss how people game statistics, and why gaming is inevitable. I have also written about the placebo effect before. Another article, this one from BBC News, has appeared covering the same topic -- the industry doesn't like the fact that more and more drugs fail to clear the "placebo" hurdle, and it thinks the problem is that the placebo effect is mysteriously increasing over time.

What is new in that BBC News item is the extensive conversations with people who run clinical trials. They reveal a variety of tricks they use to game the numbers.

In this week's Statbusters (link), we discuss two recent widely-shared articles, one on deaths while taking selfies, and the other on the gender gap in income among graduates of top-tier universities.

The common element between these two pieces is a reductionist analysis that looks at the correlation between a single variable X and an outcome Y, when the outcome Y is affected by a multitude of variables. For example, it is reported that female graduates of Princeton or Harvard or Stanford earn $50,000 less annually than male graduates ten years after graduation. The two groups being compared do not differ just by gender, despite what that statement implies.

Note, however, that there is nothing wrong with the computation. Similarly, the deaths linked to taking selfies indeed outnumber deaths by shark attacks. The problem is that the analysis is misleading. It causes readers to come to the wrong conclusions.
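To illustrate how a single-variable comparison can mislead even when the computation is correct, here is a toy simulation (invented numbers, not the actual salary data): pay is driven by industry, industry choice differs by gender, and the raw male-female gap looks large even though men and women in the same industry earn the same.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical illustration (not the actual study data): industry choice
# differs by gender, and industry -- not gender -- drives pay
is_female = rng.random(n) < 0.5
in_finance = rng.random(n) < np.where(is_female, 0.2, 0.5)
salary = 100_000 + 80_000 * in_finance + rng.normal(0, 20_000, n)

raw_gap = salary[~is_female].mean() - salary[is_female].mean()
finance_gap = (salary[~is_female & in_finance].mean()
               - salary[is_female & in_finance].mean())
print(f"raw male-female gap:     ${raw_gap:,.0f}")
print(f"gap within finance only: ${finance_gap:,.0f}")
# The raw gap largely reflects who works in which industry,
# not a like-for-like comparison of men and women.
```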

For this week's Statbusters (link), we opine on that astounding report from a few weeks ago about how Google could manipulate the next elections by biasing search results. We walk you through our vetting process, starting with face validity ("the magnitude of the reported effect is too large to be believed!").

The crux of the article is the experimental design. You start with a group of people who have no prior opinion of the candidates (e.g., showing Americans Australian candidates), then give them only one source of information about those candidates, then check whether that information shifted their opinions. Moreover, the manipulation was rigged for maximum impact: in the baseline setting, the entire first page of search results favors one candidate.

As we said, the result of these experiments is still interesting but not as interesting as the breathless headlines proclaim.

One issue that didn't make it to the article but I should mention here is the ethics of one of the experiments, which sought to manipulate a real-life election in India. Recall the outcry surrounding the Facebook experiments that supposedly manipulated the emotions of people via feed messages. I didn't think much of that controversy (link). However, I'm surprised there isn't a bigger outcry surrounding this Indian experiment, which manipulated political voting behavior. The researchers said the IRB accepted their argument that their sample size was tiny relative to the size of the Indian electorate. I am not comfortable with the concept that ethics is relative to sample size.

On Labor Day, our new Statbusters column appeared. This one concerns a popular news story from some weeks ago claiming that science has proven there are four types of drunks. The "four" refers to the four "clusters" formed by running a cluster analysis algorithm. But the number four is chosen by the analyst. Some algorithms won't run unless the analyst specifies the number of clusters; others will happily generate the best-fitting structure for any number of clusters you ask for. Cluster analysis is great for exploring and understanding the data, but it cannot confirm that there are precisely four types of drunks!
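Here is a small illustration of the point with scikit-learn's k-means (made-up data, not the drunk-personality survey): the algorithm returns exactly as many clusters as you ask for, no matter what structure is or isn't in the data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Made-up survey-style data with no built-in cluster structure
X = rng.normal(0, 1, size=(500, 10))

for k in (2, 3, 4, 5, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"asked for k={k} clusters, got {len(set(km.labels_))} "
          f"(inertia {km.inertia_:.0f})")
# The algorithm dutifully returns exactly as many clusters as you specify;
# nothing in the output certifies that "four types" is real structure.
```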

The media often removes the uncertainty of science in the name of "popularizing." The entire article is here.

In the newest column for the Daily Beast, Andrew and I look at the media's fascination with expressing large numbers as daily numbers. (link) In short, you should divide by 365 only when the metric actually scales with time, and be careful if the metric is not evenly distributed across time. We discuss the following headlines: "Air pollution in China is killing 4,000 per day" and "Periscope users view 40 years of video per day".
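Some back-of-envelope arithmetic behind those headlines (my own rough conversions, for illustration only):

```python
# "4,000 per day" converted back into an annual total
deaths_per_day = 4_000
print(f"{deaths_per_day * 365:,} deaths per year")    # about 1.46 million

# Dividing an annual total by 365 assumes the deaths accrue evenly over
# the year; if pollution (and the mortality tied to it) spikes in winter,
# the flat daily figure hides that unevenness.

# "40 years of video per day" restated in more familiar units
minutes_watched_per_day = 40 * 365 * 24 * 60
print(f"{minutes_watched_per_day:,} minutes watched per day")   # ~21 million
```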

In our newest column, we take on the recent media obsession with companies that make robots that hire people. (link)

As with most articles about data science, the journalists failed to dig up any evidence that these robots work, other than glowing quotes from the people selling them. We point out a number of challenges that such algorithms must overcome in order to generate proper predictions. We also discuss why measuring the outcomes of these predictions is so hard: one problem is that we have no objective standard for what makes someone the "correct" hire; another is that the action we take based on a prediction affects the very outcome being predicted.
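To make the second problem concrete, here is a toy simulation (a hypothetical setup, not any vendor's system): once the algorithm's picks determine who gets hired, we only ever observe outcomes for the people it selected, so the feedback data is shaped by its own decisions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

# Hypothetical world: candidates have an unobserved true quality,
# and the screening algorithm sees only a noisy score of it
true_quality = rng.normal(0, 1, n)
score = true_quality + rng.normal(0, 1, n)

# Only the top-scoring 20% are hired, so only their outcomes are observed
hired = score > np.quantile(score, 0.8)

print(f"mean quality of hires (observed):     {true_quality[hired].mean():.2f}")
print(f"mean quality of rejects (never seen): {true_quality[~hired].mean():.2f}")
# We never learn how the rejected candidates would have performed, so the
# feedback data reflects the algorithm's own decisions -- there is no
# clean benchmark for whether it picked the "correct" hires.
```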