Andy Oram is an editor at O'Reilly Media, a highly respected book publisher and technology information provider. An employee of the company since 1992, Andy currently specializes in open source, software engineering, and health IT, but his editorial output has ranged from a legal guide covering intellectual property to a graphic novel about teenage hackers. His articles have appeared often on EMR & EHR and other blogs in the health IT space.
Andy also writes often for O'Reilly's Radar site (http://oreilly.com/) and other publications on policy issues related to the Internet and on trends affecting technical innovation and its effects on society. Print publications where his work has appeared include The Economist, Communications of the ACM, Copyright World, the Journal of Information Technology & Politics, Vanguardia Dossier, and Internet Law and Business. Conferences where he has presented talks include O'Reilly's Open Source Convention, FISL (Brazil), FOSDEM, and DebConf.

The previous part of this article described the benefits of big data analysis, along with some of the formal, inherent risks of using it. We’ll go even more into the problems of real-life use now.

More hidden bias

Jeffrey Skopek pointed out that correlations can perpetuate bias as much as they undermine it. Everything in data analysis is affected by bias, ranging from what we choose to examine and what data we collect to who participates, what tests we run, and how we interpret results.

The potential for seemingly objective data analysis to create (or at least perpetuate) discrimination on the basis of race and other criteria was highlighted recently by a Bloomberg article on Amazon Prime deliveries. Nobody thinks that any Amazon.com manager anywhere said, “Let’s not deliver Amazon Prime packages to black neighborhoods.” But that was the natural outcome of depending on data about purchases, incomes, or whatever other data was crunched by the company to produce decisions about deliveries. (Amazon.com quickly promised to eliminate the disparity.)

At the conference, Sarah Malanga went over the comparable disparities and harms that big data can cause in health care. Think of all the ways modern researchers interact with potential subjects over mobile devices, and how much data is collected from such devices for data analytics. Such data is used to recruit subjects, to design studies, to check compliance with treatment, and for epidemiology and the new Precision Medicine movement.

In all the same ways that the old, the young, the poor, the rural, ethnic minorities, and women can be left out of commerce, they can be left out of health data as well–with even worse impacts on their lives. Malanga reeled out some statistics:

20% of Americans don’t go on the Internet at all.

57% of African-Americans don’t have Internet connections at home.

70% of Americans over 65 don’t have a smart phone.

Those are just examples of ways that collecting data may miss important populations. Often, those populations are sicker than the people we reach with big data, so they need more help while receiving less.

The use of electronic health records, too, is still limited to certain populations in certain regions. Thus, some patients may take a lot of medications but not have “medication histories” available to research. Ameet Sarpatwari said that the exclusion of some populations from research makes post-approval research even more important; there we can find correlations that were missed during trials.

A crucial source of well-balanced health data is the All Payer Claims Databases that 18 states have set up to collect data across the board. But a glitch in employment law, highlighted by Carmel Shachar, releases self-funding employers from sending their health data to the databases. Fixing this will most likely take an act of Congress. Until that happens, researchers and public health officials will lack the comprehensive data they need to improve health outcomes, and the 12 states that have started their own APCD projects may abandon them.

Other rectifications cited by Malanga include an NIH requirement for studies funded by it to include women and minorities–a requirement Malanga would like other funders to adopt–and the FCC’s Lifeline program, which helps more low-income people get phone and Internet connections.


Eight years ago, a widely discussed issue of WIRED Magazine proclaimed cockily that current methods of scientific inquiry, dating back to Galileo, were becoming obsolete in the age of big data. Running controlled experiments on limited samples just has too many limitations and takes too long. Instead, we will take any data we have conveniently at hand–purchasing habits for consumers, cell phone records for everybody, Internet-of-Things data generated in the natural world–and run statistical methods over them to find correlations.

A word from our administration

A new White House report also warns that “it is a mistake to assume [that big data techniques] are objective simply because they are data-driven.” The report highlights the risks of inherent discrimination in the use of big data, including:

Incomplete and incorrect data (particularly common in credit rating scores)

The report recommends “bias mitigation” (page 10) and “algorithmic systems accountability” (page 23) to overcome some of these distortions, and refers to a larger FTC report that lays out the legal terrain.

Like the WIRED article mentioned earlier, this gives us some background for discussions of big data in health care.

Putting the promise of analytical research under the microscope

Conference speaker Tal Zarsky offered both generous praise and specific cautions regarding correlations. As the WIRED Magazine issue suggested, modern big data analysis can find new correlations between genetics, disease, cures, and side effects. The analysis can find them much more cheaply and quickly than randomized clinical trials. This can lead to more cures, and has the other salutary effect of opening a way for small, minimally funded start-up companies to enter health care. Jeffrey Senger even suggested that, if analytics such as those used by IBM Watson are good enough, doing diagnoses without them may constitute malpractice.

W. Nicholson Price, II focused on the danger of the FDA placing too many strict limits on the use of big data for developing drugs and other treatments. Instead of making data analysts back up everything with expensive, time-consuming clinical trials, he suggested that the FDA could set up models for the proper use of analytics and check that tools and practices meet requirements.

One of the exciting impacts of correlations is that they bypass our assumptions and can uncover associations we never would have expected. The poster child for this effect is the notorious beer-and-diapers connection found by one retailer. This story has many nuances that tend to get lost in the retelling, but perhaps the most important point to note is that a retailer can depend on a correlation without having to ascertain the cause. In health care, we feel much more comfortable knowing the cause behind a correlation. Price called this aspect of big data research “black box medicine.” Saying that something works, without knowing why, raises a whole list of ethical concerns.

A correlation between stomach pain and a disease can’t tell us whether the stomach pain led to the disease, the disease caused the stomach pain, or both are symptoms of a third underlying condition. Causation can make a big difference in health care. It can warn us away from a treatment that works only 90% of the time (we’d like to know who the other 10% of patients are before they get a treatment that fails them). It can help uncover side effects and other long-term effects–and perhaps valuable off-label uses as well.

Zarsky laid out several reasons why a correlation might be wrong.

It may reflect errors in the collected data. Good statisticians control for error through techniques such as discarding outliers, but if the original data contains enough bad apples, the barrel will go rotten.

Even if the correlation is accurate for the collected data, it may not be accurate in the larger population. The correlation could be a fluke, or the statistical sample could be unrepresentative of the larger world.
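The fluke scenario is easy to demonstrate. The sketch below (my own illustration, not anything presented at the conference) fishes through a thousand unrelated random variables and reliably turns up a strong correlation by pure chance:

```python
import numpy as np

# Generate 1,000 variables of pure noise, 50 samples each.
rng = np.random.default_rng(0)
n_samples, n_variables = 50, 1000
data = rng.standard_normal((n_variables, n_samples))

# Treat the first variable as "the outcome" and correlate every other
# variable against it, keeping the strongest match.
outcome = data[0]
correlations = [abs(np.corrcoef(outcome, row)[0, 1]) for row in data[1:]]
strongest = max(correlations)
print(f"Strongest chance correlation among {n_variables - 1} tries: {strongest:.2f}")
```

With only 50 samples per variable, one of the 999 candidates typically correlates with the “outcome” at a level that would look impressive in a study, even though every value is random noise.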

Zarsky suggests using correlations as a starting point for research, but backing them up by further randomized trials or by mathematical proofs that the correlation is correct.

Isaac Kohane described, from the clinical side, some of the pros and cons of using big data. For instance, data collection helps us see that choosing a gender for intersex patients right after birth produces a huge amount of misery, because the doctor guesses wrong half the time. However, he also cited times when data collection can be confusing for the reasons listed by Zarsky and others.

Senger pointed out that after drugs and medical devices are released into the field, data collected on patients can teach developers more about risks and benefits. But this also runs into the classic risks of big data. For instance, if a patient dies, did the drug or device contribute to death? Or did he just succumb to other causes?

We already have enough to make us puzzle over whether we can use big data at all–but there’s still more, as the next part of this article will describe.


Privacy solidarity

Genetics present new ethical challenges–not just in the opportunity to change genes, but even just when sequencing them. These risks affect not only the individual: other members of her family and ethnic group can face discrimination thanks to genetic weaknesses revealed. Isaac Kohane said that the average person has 40 genetic markers indicating susceptibility to some disease or other. Furthermore, we sometimes disagree on what we consider a diseased condition.

Big data, particularly with genomic input, can lead to group harms, so Brent Mittelstadt called for moving beyond an individual view of privacy. Groups also have privacy needs (a topic I explored back in 1998). An individual must consider the effect of releasing data not only on his own future, but on the future of family members, members of his racial group, and others. Similarly, Barbara Evans said we have to move from self-consciousness to social consciousness. But US and European laws consider privacy and data protection only on the basis of the individual.

The re-identification bogeyman

A good many references were made at the conference to the increased risk of re-identifying patients from supposedly de-identified data. Headlines are made when some researcher manages to uncover a person who thought himself anonymous (and who database curators thought was anonymous when they released their data sets). In a study conducted by a team that included speaker Catherine M. Hammack, experts admitted that there is eventually a near 100% probability of re-identifying each person’s health data. The culprit in all this is the burgeoning set of data collected from people as they purchase items and services, post seemingly benign news about themselves on social media, and otherwise participate in modern life.

I think the casual predictions of the end of anonymity we hear so often are unnecessarily alarmist. The field of anonymity has progressed a great deal since Latanya Sweeney famously re-identified a patient record for Governor William Weld of Massachusetts. Re-identifications carried out since then, by Sweeney and others, have taken advantage of data that was not anonymized (people just released it with an intuitive assumption that they could not be re-identified) or that was improperly anonymized, not using recommended methods.

Unfortunately, the “safe harbor” in HIPAA (designed precisely for medical sites lacking the skills to de-identify data properly) enshrines bad practices. Still, in a HIPAA challenge cited by Ameet Sarpatwari, only two of 15,000 individuals were re-identified. The mosaic effect remains more a theoretical weakness than an immediate threat.
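The Safe Harbor generalizations themselves are mechanical. Here is a minimal sketch (my own illustration, not a compliance tool) of two of them: truncating ZIP codes to their first three digits and collapsing ages over 89 into a single category:

```python
# Two generalizations in the spirit of the HIPAA Safe Harbor method; a real
# implementation would cover all eighteen identifier categories, and would
# zero out three-digit ZIP prefixes covering fewer than 20,000 people.
def generalize_zip(zip_code: str) -> str:
    """Keep only the first three digits of a five-digit ZIP code."""
    return zip_code[:3] + "00"

def generalize_age(age: int) -> str:
    """Collapse ages 90 and over into one '90+' category."""
    return "90+" if age >= 90 else str(age)

record = {"zip": "02139", "age": 93}
deidentified = {"zip": generalize_zip(record["zip"]),
                "age": generalize_age(record["age"])}
print(deidentified)  # {'zip': '02100', 'age': '90+'}
```

Rules like these are easy to apply by rote, which is exactly the complaint: the debate is over whether applying them mechanically, without statistical expertise, is enough.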

I may be biased, because I edited a book on anonymization, but I would offer two challenges to people who cavalierly dismiss anonymization as a useful protection. First, if we threw up our hands and gave up on anonymization, we couldn’t even carry out a census, which is mandated in the U.S. Constitution.

Second, anonymization is comparable to encryption. Computer speeds are increasing, and so is the sophistication of re-identification attacks. The first provides a near-guarantee that, eventually, our current encrypted conversations will be decrypted. The second, similarly, guarantees that anonymized data will eventually be re-identified. But we all still visit encrypted web sites and use encryption for communications. Why can’t we similarly use the best in anonymization?

A new article in the Journal of the American Medical Association exposes a gap between what doctors consider adequate consent and what’s meaningful for patients, blaming “professional indifference” and “organizational inertia” for the problem. In research, the “reasonable-patient standard” is even harder to define and achieve.

Patient consent doesn’t have to go away. But it’s getting harder and harder for patients to anticipate the uses of their data, or even to understand what data is being used to match and measure them. However, precisely because we don’t know how data will be used or how patients can tolerate it, I believe that incremental steps would be most useful in teasing out what will work for future research projects.


Could we benefit from more opportunities for consent?

Donna Gitter said that the Common Rule governing research might be updated to cover de-identified data as well as personally identifiable information. The impact of this on research, of course, would be incalculable. But it might lead to more participation in research, because 72% of patients say they would like to be asked for permission before their data is shared even in de-identified form. Many in the research community, such as conference speaker Liza Dawson, would rather give researchers the right to share de-identified data without consent, but put protections in place.

To link multiple data sets, according to speaker Barbara Evans, we need an iron-clad method of ensuring that the data for a single individual is accurately linked. This requirement butts up against the American reluctance to assign a single ID to a patient. The reluctance is well-founded, because tracking individuals throughout their lives can lead to all kinds of seamy abuses.
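One approach, sketched below with invented names (my own illustration, not something Evans proposed), is to link records through a one-way hash of normalized demographics rather than through a universal patient ID:

```python
import hashlib

def linkage_token(name: str, birth_date: str, salt: str = "shared-secret") -> str:
    """Derive a linkage token from normalized demographics plus a shared salt."""
    normalized = f"{name.strip().lower()}|{birth_date}|{salt}"
    return hashlib.sha256(normalized.encode()).hexdigest()

# Two sites compute the same token for the same person, so they can
# match records without ever exchanging names...
site_a = linkage_token("Jane Doe", "1970-01-01")
site_b = linkage_token("  JANE DOE ", "1970-01-01")
assert site_a == site_b

# ...but a single typo in either field silently breaks the match.
site_c = linkage_token("Jane Does", "1970-01-01")
assert site_c != site_a
```

Deterministic hashing avoids a national identifier, but the typo problem is why it falls short of “iron-clad” and why real linkage systems layer probabilistic matching on top.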

One solution would be to give each individual control over a repository where all of her data would go. That solution implies that the individual would also control each release of the data. A lot of data sets could easily vanish from the world of research, as individuals die and successors lose interest in their data. We must also remember that public health requires the collection of certain types of data even if consent is not given.

Another popular reform envisioned by health care technologists, mentioned by Evans, is a market for health information. This scenario is part of a larger movement known as Vendor Relationship Management, which I covered several years ago. There is no doubt that individuals generate thousands of dollars’ worth of information, in health care records and elsewhere. Speaker Margaret Foster Riley claimed that the data your grocer collects from your loyalty card is worth more than the money you spend there.

So researchers could offer incentives to share information instead of informed consent. Individuals would probably hire brokers to check that the requested uses conform to the individuals’ ethics, and that the price offered is fair.

Giving individuals control and haggling over data makes it harder, unfortunately, for researchers to assemble useful databases. First of all, modern statistical techniques (which fish for correlations) need huge data sets. Even more troubling, partial data sets are likely to be skewed demographically. Perhaps only people who need some extra cash will contribute their data. Or perhaps only highly educated people. Either way, someone gets left out.

These problems exist even today, because our clinical trials and insurance records are skewed by income, race, age, and gender. Theoretically, it could get even worse if we eliminate the waiver that lets researchers release de-identified data without patient consent. Disparities in data sets and research were heavily covered at the Petrie-Flom conference, as I discuss in a companion article.

Privacy, discrimination, and other legal regimes

Several speakers pointed out that informed consent loses much of its significance when multiple data sets can be combined. The mosaic effect adds another layer of uncertainty about what will happen to data and what people are consenting to when they release it.

Nicolas Terry pointed out that American law tends to address privacy on a sector-by-sector basis, having one law for health records, another for student records, and so forth. He seemed to indicate that the European data protection regime, which is comprehensive, would be more appropriate now that the boundary between health data and other forms of data is blurring. Sharona Hoffman said that employers and insurers can judge applicants’ health on the basis of such unexpected data sources as purchases at bicycle stores, voting records (healthy people have more energy to get involved in politics), and credit scores.

Mobile apps notoriously open new leaks in personal data. Mobile operating systems fastidiously divide up access rights and require apps to request these rights during installation, but most of us just click Accept for everything, including access the apps have no legitimate need for, such as our contacts and calendar. After all, there’s no way to deny an app one specific access right while still installing it.

And lots of these apps abuse their access to data. So we remain in a contradictory situation where certain types of data (such as data entered by doctors into records) are strongly protected, while other types that are at least as sensitive lack minimal protections. Although the app developers are free to collect and sell our information, they often promise to aggregate and de-identify it, putting them at the same level as traditional researchers. But no one requires the app developers to do so completely and accurately.

To make employers and insurers pause before seeking out personal information, Hoffman suggested requiring data brokers, and those who purchase their data, to publish the rules and techniques they employ to make use of the data. She pointed to the precedent of medical tests for employment and insurance coverage, where such disclosure is necessary. But I’m sure this proposal would be fought so heavily, by those who currently carry out their data spelunking under cover of darkness, that we’d never get it into law unless some overwhelming scandal prompted extreme action. Adrian Gropper called for regulations requiring transparency in every use of health data, and for the use of open source algorithms.

Several speakers pointed out that privacy laws, which tend to cover the distribution of data, can be supplemented by laws regarding the use of data, such as anti-discrimination and consumer protection laws. For instance, Hoffman suggested extending the Americans with Disabilities Act to cover people with heightened risk of suffering from a disability in the future. The Genetic Information Nondiscrimination Act (GINA) of 2008 offers a precedent. Universal health insurance coverage won’t solve the problem, Hoffman said, because businesses may still fear the lost work time and need for workplace accommodations that spring from health problems.

Many researchers are not sure whether their use of big data–such as “data exhaust” generated by people in everyday activities–would be permitted under the Common Rule. In a particularly wonky presentation (even for this conference) Laura Odwazny suggested that the Common Rule could permit the use of data exhaust because the risks it presents are no greater than “daily life risks,” which are the keystone for applying the Common Rule.

The final section of this article will look toward emerging risks that we are just beginning to understand.


Not only is informed consent a joke flippantly perpetrated on patients; I expect that it has inspired numerous other institutions to shield themselves from the legal consequences of misbehavior by offering similar click-through “terms of service.” We now have a society where powerful forces can wring from the rest of us the few rights we have with a click. So it’s great to see informed consent reconsidered from the ground up at the annual conference of the Petrie-Flom Center for Health Law Policy, Biotechnology, and Bioethics at Harvard Law School.

Petrie-Flom annual 2016 conference

By no means did the speakers and audience at this conference agree on what should be done to fix informed consent (only that it needs fixing). The question of informed consent opens up a rich dialog about the goals of medical research, the relationship between researchers and patients, and what doctors have a right to do. It also raises questions for developers and users of electronic health records, such as:

Is it ethical to save all available data on a person?

If consent practices get more complex, how are the person’s wishes represented in the record?

If preferences for the data released get more complex, can we segment and isolate different types of data?

Can we find and notify patients of research results that might affect them, if they choose to be notified?

Can we make patient matching and identification more robust?

Can we make anonymization more robust?

A few of these topics came up at the conference. The rest of this article summarizes the legal and ethical topics discussed there.

The end of an era: IRBs under attack

The annoying and opaque informed consent forms we all have to sign go back to the 1970s and the creation of Institutional Review Boards (IRBs). Before that lay the wild-west era of patient relationships documented in Rebecca Skloot’s famous book The Immortal Life of Henrietta Lacks.

IRBs were launched in a very different age, based on assumptions that are already fraying and will probably no longer hold at all a few years from now:

Assumption: Research and treatment are two different activities. Challenge: Now they are being combined in many institutions, and the ideal of a “learning health system” will make them inextricable.

Assumption: Each research project takes place within the walls of a single institution, governed by its IRB. Challenge: Modern research increasingly involves multiple institutions with different governance, as I have reported before.

Assumption: A research project is a time-limited activity, lasting generally only about a year. Challenge: Modern research can be longitudinal and combine data sets that go back decades.

Assumption: The purpose for which data is collected can be specified by the research project. Challenge: Big data generally runs off of data collected for other purposes, and often has unclear goals.

Assumption: Inclusion criteria for each project are narrow. Challenge: Big data ranges over widely different sets of people, often included arbitrarily in data sets.

Assumption: Rules are based on phenotypic data: diagnoses, behavior, etc. Challenge: Genetics introduces a whole new set of risks and requirements, including the “right not to know” if testing turns up an individual’s predisposition to disease.

Assumption: The risks of research are limited to the individuals who participate. Challenge: As we shall see, big data affects groups as well as individuals.

Assumption: Properly de-identified data has an acceptably low risk of being re-identified. Challenge: Privacy researchers are increasingly discovering new risks from combining multiple data sources, a trend called the “mosaic effect.” I will dissect the immediacy of this risk later in the article.

Now that we have a cornucopia of problems, let’s look at possible ways forward.

Chinese menu consent

In the Internet age, many hope, we can provide individuals with a wider range of ethical decisions than the binary, thumbs-up-thumbs-down choice thrust before them by an informed consent form.

What if you could let your specimens or test results be used only for cancer research, or stipulate that they not be used for stem cell research, or even ask for your contributions to be withdrawn from experiments that could lead to discrimination on the basis of race? The appeal of such fine-grained consent springs from our growing realization that (as in the Henrietta Lacks case) our specimens and data may travel far. What if a future government decides to genetically erase certain racial or gender traits? Eugenics is not a theoretical risk; it has been pursued before, and not just by Nazis.

As Catherine M. Hammack said, we cannot anticipate future uses for medical research–especially in the fast-evolving area of genetics, whose possibilities alternate between exciting and terrifying–so a lot of individuals would like to draw their own lines in the sand.

I don’t personally believe we could implement such personalized ethical statements. It’s a problem of ontology. Someone has to list all the potential restrictions individuals may want to impose–and the list has to be updated globally at all research sites when someone adds a new restriction. Then we need to explain the list and how to use it to patients signing up for research. Researchers must finally be trained in the ontology so they can gauge whether a particular use meets the requirements laid down by the patient, possibly decades earlier. This is not a technological problem and isn’t amenable to a technological solution.
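To make the objection concrete, here is a toy sketch (my own, not a proposed standard, with invented tag names) of what checking “Chinese menu” consent would look like. The code is trivial; the hard part, as argued above, is agreeing on and maintaining the vocabulary of restriction tags across every research site, possibly for decades:

```python
# The vocabulary of restrictions every site must share and keep in sync.
RESTRICTION_VOCABULARY = {"no_stem_cell", "cancer_only", "no_commercial"}

def use_permitted(study_tags: set[str], patient_restrictions: set[str]) -> bool:
    """Reject any study whose tags conflict with the patient's restrictions."""
    unknown = patient_restrictions - RESTRICTION_VOCABULARY
    if unknown:
        # A restriction coined after this site's vocabulary was frozen:
        # the only safe answer is to refuse the use.
        return False
    if "cancer_only" in patient_restrictions and "cancer" not in study_tags:
        return False
    if "no_stem_cell" in patient_restrictions and "stem_cell" in study_tags:
        return False
    return True

assert use_permitted({"cancer"}, {"cancer_only"})
assert not use_permitted({"stem_cell"}, {"no_stem_cell"})
assert not use_permitted({"cancer"}, {"no_germline_editing"})  # unknown tag
```

The last assertion is the crux: the moment any patient anywhere invents a restriction the vocabulary doesn’t cover, every site either blocks the research or silently violates the consent.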

More options for consent and control over data will appear in the next part of this article.