Posts categorized "Medicine"

A report came out from Stanford School of Medicine about a study of Apple Watch's health monitoring features. Some headline writers are proclaiming that "finally, there is proof that these watches benefit our health!" For example, Apple Watch Stanford Study Shows How It Can Save Lives (link).

When you read the official story, you will learn the following facts about the study:

The research is funded by Apple

It was a purely observational study that followed about 400,000 people who wear Apple Watches

Participants must own both an Apple Watch and an iPhone to be eligible (plus meeting other criteria)

There was no "control" group - they did not follow anyone who did not use Apple Watch or use any other health monitoring wearables

Every participant is self-selected

The device issued warnings to only 0.5 percent of the participants (~ 2,160)

Those who received a warning were directed to a video consultation, and the doctor decided whether or not to send the participant an ECG patch, which was used to establish the "ground truth". Only about 30 percent were sent patches, and of those, about 70 percent (roughly 450) returned the patches for analysis.

Only those who had ECG data were analyzed. One third of these were shown to have experienced "atrial fibrillation" (irregular heartbeat). This means that two-thirds got false alarms. But if we also count as false alarms the 70 percent who were not sent patches after the video consultation, then out of every 100 warnings, only about 7 were validated.

There is no discussion of false negatives: did any of the 99.5 percent who did not receive warnings experience irregular heartbeats?

We do know that if there were significant false negatives, then more warnings would have to be sent, which pushes up false alarms.

Despite the headlines, any lives saved were extrapolated.
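The funnel described above can be sketched in a few lines of arithmetic, using the percentages reported in the study (all counts are approximate):

```python
# Back-of-envelope funnel for the Apple Watch study, using the
# reported percentages; every count here is approximate.
participants = 400_000
warnings = participants * 0.005      # ~0.5% received a warning
sent_patch = warnings * 0.30         # ~30% of those were sent an ECG patch
returned = sent_patch * 0.70         # ~70% returned the patch
confirmed = returned / 3             # ~1/3 showed atrial fibrillation

validated_per_100 = 100 * confirmed / warnings
print(round(validated_per_100))      # ~7 validated warnings per 100 issued
```

The point of the sketch is that each stage of the funnel multiplies down: three reasonable-sounding percentages compound into a 7 percent validation rate.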

There are some major methodological limitations to this study.

Firstly, the study design prevents drawing conclusions of the type "People wearing Apple Watches .... compared to those who did not wear Apple Watches." It does not include anyone not wearing Apple Watches.

Secondly, it's difficult to interpret the accuracy metrics. Is a two-thirds false-alarm rate a good or bad number? Is 0.5 percent receiving warnings a reasonable proportion given the demographic and health characteristics of the study population?

Hopefully, this study is just the beginning, and more rigorous studies are being planned.

I don’t agree with Daniel Engber’s conclusions in his Slate article about the measles “crisis”, but he did his research and there is a lot to chew on. You don’t have to agree with him to find the article thought-provoking.

There is one paragraph which everyone should read. It’s a celebration of science, and how it saved lives. (Daniel used this story for a different purpose: he argued that we never “eradicated” measles, and therefore, the anti-vaxxers could never have reversed some mythical victory.)

During the most recent, major wave of measles infection in the U.S., between 1989 and 1991, close to 56,000 people fell ill and more than 100 people died...The 1989–91 epidemic was large enough and deadly enough to cast light on two pressing problems: First, that a single vaccine dose was not sufficient to protect children, and second, that black and Latino children, especially those living in urban areas, were less likely to be vaccinated, and thus more vulnerable to the disease.

Efforts were made to address both issues in the years that followed. A second measles shot was recommended for all children, while the federal government ramped up efforts to provide free vaccination to high-risk groups. The plan worked. By 1994, vaccine coverage for measles was closing in on 90 percent. The number of cases reported every year soon dropped from the thousands into the hundreds, and then into the tens. It was in response to this decline that experts from the CDC announced that measles had been “eliminated” from the U.S.

It's a great example of finding the drivers behind the data, and executing actions that successfully changed the numbers.

***

From the rest of the article, here are some useful tidbits:

Vaccines work because of a phenomenon called “herd immunity,” which is a type of wisdom of the crowd. Diseases spread when people interact with each other. If both sides are vaccinated, the risk of spreading is much, much lower. Thus, the higher the proportion vaccinated, the lower the risk for everybody. The threshold desired by health authorities is 93 to 95 percent.
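A quick sketch of where a threshold like that comes from, using the standard herd-immunity formula 1 - 1/R0; the R0 range for measles below is a commonly cited figure, not one from the article:

```python
# Standard herd-immunity threshold: the immune fraction needed so that
# each infected person passes the disease on to fewer than one other.
def herd_immunity_threshold(r0: float) -> float:
    return 1 - 1 / r0

# Measles is commonly cited with a basic reproduction number R0 of 12-18
# (an assumption for illustration, not a figure from the article).
low = herd_immunity_threshold(12)
high = herd_immunity_threshold(18)
print(round(low, 2), round(high, 2))  # 0.92 0.94
```

That 92-94 percent range lines up with the 93-95 percent target quoted by health authorities.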

In the past several decades, at a national level, the vaccination rate has stayed around 91 percent. So it is below the desired level but seems close enough not to cause alarm.

Discredited research by a Dr. Wakefield ignited the anti-vaxxer movement. The publication of such research allows people to confirm their prior beliefs, and it often is hard to dislodge such beliefs, even when the research has been invalidated.

In some localities, e.g. the Somali community in Minnesota, the vaccination rate has dropped drastically, to below 50%. These are isolated communities, so in aggregate the national vaccination level has not changed.

The number of measles cases, while small, has shown signs of increasing. There have been triple-digit case counts in 7 of the last 10 years, but in only two years of the previous decade.

The fatality rate of measles is extremely low - 11 deaths in 18 years, a rate said to be similar to that of being killed by scorpions. Those who read Chapter 2 or 4 of Numbers Rule Your World (link) will recognize the need to think about the cost of errors.

Axios has an informative article about obesity, and the various remedies such as exercising, diets, and so on. Their headline is: "Health and wellness are booming, but we're fatter than ever." They have compiled some data, shown in a triplet of graphs:

The problem of obesity is complex, and fascinating from a data perspective. I devoted an entire chapter of Numbersense (link) to issues around measuring obesity.

There is much more underneath the surface than what is presented here. Let me unpack the layers of complexity.

Correlation is not Causation

The simplest issue to explain, if only because statisticians have been screaming about it forever. If you look at the obesity chart and the gym chart, it is entirely accurate to say that gym membership has been rising in lockstep with the obesity rate during this decade. Both metrics rose by roughly 20%, so it is very tempting to argue that going to gyms makes you fatter.

Of course, if you draw that conclusion, you've just been disinvited from the party of statisticians.

Ecological Fallacy

Here's the disturbing bit: the charts are also compatible with the opposite conclusion - that gym membership reduces obesity. This is an example of why it's so hard to interpret observational data.

Note that the data analyst collapsed a 2x2 matrix into two aggregate rates. Imagine four types of people: those with or without gym membership, crossed with those who are obese or not obese. When you're aware of the four types, you should realize that the rate of obesity, aggregated across gym membership, is not a great metric. It's pretty obvious that the obesity rate of gym members is lower than that of non-members. The average rate paints both groups with the same brush.

In the same way, gym membership, aggregated across obese and not obese people, is not a great metric.

You can reasonably assume that the obesity rate for gym members is lower than the average obesity rate: for example, if the average is 25%, then perhaps the obesity rate for gym members is 15%.

It's possible that the 15% rate has not changed over time, but if the obesity rate of the non-gym-members increases, the overall obesity rate will increase (note that there are five times as many non-gym-members as gym members). The 15% rate for gym members could even have improved, and the overall obesity rate could still rise to 30% - it just requires the non-gym-members to get even more obese.

When aggregating the rates, some information is lost, and that weakens our ability to draw conclusions about individuals.
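A tiny numerical sketch of the aggregation effect described above; the 15%/25%/30% rates and the five-to-one population split follow the illustrative numbers in the text:

```python
def overall_rate(member_rate: float, nonmember_rate: float,
                 member_share: float) -> float:
    """Population obesity rate, aggregated across gym membership."""
    return member_share * member_rate + (1 - member_share) * nonmember_rate

# Gym members are 1/6 of the population (five non-members per member).
share = 1 / 6

# Gym members hold steady at 15% obese while non-members worsen:
before = overall_rate(0.15, 0.27, share)  # ~0.25
after = overall_rate(0.15, 0.33, share)   # ~0.30
print(round(before, 2), round(after, 2))
```

The overall rate climbs from 25% to 30% even though the gym members' own rate never moved, which is exactly the information lost in aggregation.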

Indirect Metrics

Gym membership is not the same as gym usage. The gym's ability to influence obesity would require usage, not just membership.

CDC Diet Recommendation

The bit about the CDC complaining that people don't consume the recommended levels of fruits and vegetables makes me wonder if their problem formulation is overly simplistic. The dietary guidelines appear to be an optimization of nutritional benefits. But the real problem is to maximize nutritional benefits under a budget constraint. Each item in the basket of recommended foods delivers an amount of benefits at a level of cost. The total cost can't exceed the household budget.

For anyone taking a traditional class on optimization, "the diet problem" is often the first problem discussed. Here is one exposition of the diet problem.
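As a sketch of that formulation, here is a toy instance of the classic diet problem solved as a linear program; the foods, costs, and nutrient values are made up for illustration:

```python
from scipy.optimize import linprog

# Toy diet problem: choose servings of three hypothetical foods to
# minimize cost while meeting minimum daily calories and protein.
costs = [0.50, 0.80, 0.30]       # $ per serving (made up)
calories = [400, 300, 150]       # kcal per serving (made up)
protein = [3, 10, 2]             # grams per serving (made up)

# linprog minimizes c @ x subject to A_ub @ x <= b_ub, so the
# "at least this much nutrition" constraints are negated.
A_ub = [[-c for c in calories], [-p for p in protein]]
b_ub = [-2000, -50]              # at least 2000 kcal and 50 g protein

res = linprog(costs, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print(res.x, res.fun)            # cheapest servings and the total cost
```

Swapping the objective and the constraint - maximize nutrition subject to a budget cap - is the same machinery, which is the formulation the CDC guidelines seem to ignore.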

I have been reading the excellent book by John Adams titled Risk (link). This is a geographer's treatment of a subject that is a staple of mathematics, particularly probability math. A mathematical treatment creates objects called probability distributions, which are then taken as complete representations of risk. Adams challenges that construct, bringing a social scientist's sensitivity to the table. In particular, he points out how the mathematics of risk is undermined by measurement issues (i.e. data issues) and statistical issues. He is not invalidating math, just pointing out large cracks that are often ignored.

I will provide a more comprehensive review of the book eventually. I'm very excited by Chapter 5, titled "Measuring Risk", and specifically the example of "traffic black spots". This example is very instructive for anyone who is interested in the practical implications and interpretation of risk measures.

The post got long so I have split it into two parts. The second part will be posted on Monday, and it concerns a delicious bit of analysis related to traffic black spots.

***

Here are some key sentences related to the quality of the collected data:

The connection between potentially fatal events [accidents] and actual fatal events is rendered extremely tenuous by the phenomenon of risk compensation.... avoiding action is taken, with the result that there are insufficient fatal accidents to produce a pattern that can serve as a reliable guide to the effect of specific safety interventions. As a consequence, safety planners seek out other measures of risk, usually - in ascending order of numbers but in decreasing order of severity - they are injury, morbidity, property damage and near misses.

It is much easier to achieve "statistical significance" in studies of accident causation if one uses large numbers rather than small. Safety researchers therefore have an understandable preference for non-fatal accident or incident data over fatality data... in exercising this preference, they usually assume that fatalities and non-fatal incidents will be consistent proportions of total casualties.

[Evoking some data from London] The correlation between fatality rates and injury rates is very weak. Is the weak correlation real or simply a recording phenomenon? How many injuries equal one life?

Uncertainty in the data increases as the severity of the injury decreases. The fatality statistics are almost certainly the most accurate and reliable of the road accident statistics... the categorization and recording of injuries is generally not informed by any evidence from a medical examination.... [A British Medical Association report said that] only one in four casualties classified as seriously injured are, in fact, seriously injured and many of those classified as slightly injured are in fact seriously injured..... some 30 percent of traffic accident casualties seen in hospital are not reported to the police, and that at least 70 percent of cyclist casualties go unreported.

This last point is widely applicable in the data science/analytics world. We often have large amounts of unreliable data, and small amounts of more reliable data. What we hope to have - and don't - is large amounts of reliable data. A lot of bad analysis results from assuming we have large amounts of reliable data.

Daniel Engber at Slate reviews the latest attempt to kill the messengers - an article in the Boston Globe by a Harvard biologist. Sounds like the NYT Magazine article by Susan Dominus that I discussed here.

The common threads are (a) the unscientific use of selected anecdotes to paint a picture of "mobs", which an easy Web search will quickly disprove, as Engber did; (b) citing a few colorful adjectives as the entire proof of bad behavior, while conveniently ignoring similar language used to denigrate the reformers (again, easily found online) - cherry-picking, widely seen as unscientific; (c) using personal attacks to condemn others for personal attacks; (d) no response to the scientific substance being debated, just a focus on personalities.

From the start, the big problem with "power pose" is that its most important scientific claims cannot be replicated. Nothing has changed despite the many thousands of words used to "call off the revolutionaries."

One of my favorite statistics-related wisecracks is: the plural of anecdote is not data.

In today's world, the saying should really say: the plural of anecdote is not BIG DATA.

In class this week, we discussed a recent Letter to the Editor of the New England Journal of Medicine, featuring a short analysis of weight data coming from a digital scale that, you guessed it, makes users consent to being research subjects by accepting its Terms and Conditions. (link to NEJM paper, covered by the New York Times)

The "analysis" is succinctly summarized by this chart:

Their conclusion is that people gain weight around the major holidays.

How did the researchers come up with such a conclusion? They in essence took the data from the Withings scales, removed a lot of the data based on various criteria (explained in this Supplement), and plotted the average weight changes over time. Ok, ok, I hear the complaint that I'm oversimplifying. They also smoothed (and interpolated) the time series and "de-trended" the data by subtracting a "linear trend". The de-trending accomplished nothing, as evidenced by comparing the de-trended chart in the main article to the unadjusted chart in the Supplement.

Then, the researchers marked out several major holidays - New Year, Christmas, Thanksgiving (U.S.), Easter, and Golden Week (Japan) - and lo and behold, in each case the holidays coincided with a spike in weight gain, ranging from a high of about +0.8% (U.S.) to a low of +0.25% (Easter).

Each peak is an anecdote and the plural of these peaks is BIG DATA!

Why did I say that? Look for July 4th, another important holiday in the States. If this "analysis" is to be believed, July 4th is not a major holiday in the U.S. On average, people tend to lose weight (-0.1%) around Independence Day. There is also no weight change around Labor Day.

In a sense, this chart shows the power of data visualization to shape perception. Labeling those five holidays draws the reader's attention. Not labeling the other major holidays takes them out of the narrative. Part of having numbersense is having the ability and the confidence to make our own judgments about the data. Once one notices the glaring problems around July 4th and Labor Day, one can no longer believe the conclusion.

There is also "story time" operating here. The researchers only had data on weight changes. They did not have, nor did they seek, data on food intake. But the whole story is about festive holidays leading to "increased intake of favorite foods" which leads to weight gain. Story time is when you lull readers with a little bit of data, and when they are dozing off, you feed them a huge dose of narrative going much beyond the data.

The real problem here relates to the research process. Traditionally, you come up with a hypothesis, and design an experiment or study to verify it. Nowadays, you start with some found data, you look at the data, you notice some features in the data like the five peaks, and only then do you create your hypothesis - at which point there really is little left to confirm, since the hypothesis was suggested by the data. And yet, researchers will now run a t-test and report p-values (in this weight-change study, the p-values were < 0.005).

Even if it's acceptable to form your hypothesis after peeking at the data, the researchers should then have formulated a regression model with all of the major holidays represented; such a model would provide estimates of the direction and magnitude of each holiday's effect, and its statistical significance.
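A sketch of what such a model looks like, fit to synthetic data; the day-of-year indices, the week-long holiday windows, and the pure-noise weight series are stand-ins, not the study's data:

```python
import numpy as np

rng = np.random.default_rng(0)
days = 365
# Synthetic daily average % weight change (pure noise here).
weight_change = rng.normal(0.0, 0.3, days)

# Indicator ("dummy") variables for a window around each major holiday;
# the day-of-year indices are rough stand-ins.
holidays = {"New Year": 0, "Easter": 90, "July 4th": 184,
            "Labor Day": 245, "Thanksgiving": 327, "Christmas": 358}
X = np.zeros((days, len(holidays) + 1))
X[:, 0] = 1.0                                 # intercept
for j, day in enumerate(holidays.values(), start=1):
    X[max(0, day - 3):day + 4, j] = 1.0       # +/- 3 days around the holiday

# Least-squares fit: each coefficient estimates that holiday's effect,
# with sign and magnitude, relative to ordinary days.
beta, *_ = np.linalg.lstsq(X, weight_change, rcond=None)
print(dict(zip(["intercept", *holidays], np.round(beta, 3))))
```

With every major holiday in the model, July 4th and Labor Day would get coefficients too - and their signs would have had to be explained rather than ignored.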

PS. Some will grumble that the analysis is not "big data" since it does not contain gazillions of rows of data - far from it. However, almost all Big Data analyses follow the blueprint outlined above. Also, I do not define Big Data by its volume. Here is a primer to the OCCAM definition of Big Data. Under the OCCAM framework, the Withings scale data is observational, has no controls, is treated as "complete" by the researchers, and was collected primarily for non-research reasons.

PPS. Those p-values are hilariously tiny. The p-value is a measure of the signal-to-noise ratio in the data. The noise in this dataset is very high. In the Supplement, the researchers outlined an outlier removal procedure, in which they disclosed that the "allowable variation" is 3% daily plus an extra 0.1% for each following day between two observations. Recall the "signals" had sizes of less than 0.8%.
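The signal-to-noise point can be made with a short simulation: with enough readings, even a signal far smaller than the daily noise produces an enormous t-statistic, hence a vanishing p-value. The sample size below is an assumption for illustration; the 0.8% and 3% figures come from the discussion above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000          # number of readings (assumed; the study had very many)
signal = 0.008       # 0.8% average shift, the largest "peak" in the chart
noise_sd = 0.03      # 3% daily "allowable variation", per the Supplement

# Simulated readings: a small signal buried in much larger noise.
x = rng.normal(signal, noise_sd, n)

# One-sample t-statistic against a mean of zero.
t = x.mean() / (x.std(ddof=1) / np.sqrt(n))
print(round(t, 1))   # a huge t-statistic, so the p-value is near zero
```

Tiny p-values here say only that the sample is huge, not that the effect is meaningful - the signal is still a quarter of the size of a single day's noise.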

I am traveling so have to make this brief. I will likely come back to these stories in the future to give a longer version of these comments.

I want to react to two news items that came out in the past couple of days.

First, Ben Stiller said that prostate cancer screening (the infamous PSA test) "saved his life". (link) So he is out there singing the praises of the PSA test, which has been disavowed even by its inventor (link), although still routinely used by many physicians.

One can't dispute that the PSA test result caused Ben Stiller to know about his cancer and he is better today because of that discovery.

However, imagine the following scenario: I invent my own screening test. The test consists of flipping a coin: heads, you have cancer; tails, you don't. Among the people who came up heads, I can find one who truly has cancer. I saved his life because my test alerted him to this fact. Because I saved this person's life, my test must be really good. (If one anecdote is too few, I could find a handful of people whose lives I have saved.)

***

Second, the FBI tells reporters that the Minnesota mall attacker "withdrew from friends in months before attack." (link)

Imagine that you are trying to predict who will be the next disgruntled attacker. Based on the FBI statement, you want to round up everyone who "withdrew from friends." How many people would that include? How many of them will eventually be attackers?

Same holds for all the other findings, such as "he converted to Islam recently", and "he posted something hateful on Facebook".

***

It is precisely when we want something badly - like information that saves our lives or prevents terrorist attacks - that we become most susceptible to nonsense data analyses.

A GMO labeling law has arrived in the US, albeit one that has no teeth (link). For those who don't want to click on the link: the law was passed in haste to pre-empt a more stringent Vermont law, it defines GMO narrowly, businesses do not need to put word labels on packages (they can, for example, provide an 800-number), and violators will not be punished.

One of the arguments against GMO labeling is that it is unscientific because (some) scientists are 100% certain that GMO foods are safe. (e.g. this Boston Globe editorial)

Any good scientist knows that scientific "truths" are true only until they are proven otherwise. Science is a continuous process of making hypotheses and finding data to confirm or reject them. The Bayesian way of thinking is very useful here. Truth is a matter of probability - more confirmatory data increases the probability that a given hypothesis is true.
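That Bayesian updating can be sketched in a couple of lines; the starting probability and the likelihood ratios below are arbitrary illustrations:

```python
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Posterior probability after evidence with the given likelihood ratio."""
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)

p = 0.5  # start agnostic about the hypothesis "this food is safe"
for lr in (2.0, 2.0, 2.0):  # three pieces of mildly confirmatory data
    p = bayes_update(p, lr)
print(round(p, 3))          # 0.889: the probability climbs but never hits 100%
```

No finite amount of confirmatory data pushes the probability to exactly 1, which is why "100% certain that GMO foods are safe" is not a scientific statement.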

So why is GMO labeling good science?

In fact, I'd go so far as to say that there is no science without GMO labeling.

How is nutritional science done today? What is the research that tells us coffee is good, butter is good, salt is bad, etc.? Granted, this is a shaky field that has issued lots of false results. But the usual form of analysis goes like this: conduct a large survey of consumers and ask them about their diet (e.g. how much red meat do you eat each week?); obtain information about their health status, either through the same survey, a different survey, or direct measurements if they are part of a research study; then correlate the dietary data and the health data.

Now, imagine you want to study whether eating GMO foods affects your health, either positively or negatively. Your survey question will be something along the lines of "How much GMO food did you eat last week?"

Without GMO labeling, there is no way to conduct such research. This is why GMO labeling is good science. Not labeling GMO is bad science - actually it mandates no science.

Theranos (v): to spin stories that appeal to data while not presenting any data

To be Theranosed is to fall for scammers who tell stories appealing to data but do not present any actual data. This is worse than story time, in which the storyteller starts out with real data but veers off mid-stream into unsubstantiated froth, hoping you and I got carried away by the narrative flow.

Theranos (n): From 2003 to 2016, a company in Palo Alto, Calif., the epicenter of venture capital, founded by Elizabeth Holmes, a 19-year-old Stanford University dropout, which raised over $70 million to develop and market a "revolutionary" blood-testing technology said to require only a finger-prick of blood. The company grew its valuation to $9 billion without ever publishing any scientific data in a peer-reviewed medical journal. It turned out that the new technology was used in only 12 of the 200 tests on its menu, meaning that the business had been based on selling old technology at bargain-basement prices subsidized by venture-capital money. Further, it emerged that the new technology was not accurate, that it had been shelved since last year, and that in some cases when old technology was used, lab personnel improperly handled the machines - all of which eventually led to a blanket retraction of two full years' worth of test results. The company claimed that these results have been "corrected" in the last few weeks. It is unclear what "correction" means when the blood was drawn from patients up to two years ago. The company is still in business, and Walgreens, one of its most prominent partners, continues its commercial relations with it. For many years, the business and technology press issued countless glowing reviews of the company (see this epic list just covering 2013-2015). Until 2014, the company's board consisted entirely of politicians, former cabinet members and military leaders. All of these individuals have been Theranosed.

The Wall Street Journal has done an exemplary job following this case, and deserves a Pulitzer for this effort. The latest revelation relating to the full-scale retraction is here.