Moving to the open health-care graph

A network graph approach to modeling the health-care system.

To achieve the triple aim in healthcare (better, cheaper, and safer), we are going to need intensive monitoring and measurement of specific doctors, hospitals, labs and countless other clinical professionals and clinical organizations. We need specific data about specific doctors.

In 1979, a Federal judge in Florida sided with the AMA and blocked the release of these kinds of provider-specific data sets, on the grounds that they violated doctor privacy. Last Friday, a different Florida judge reversed the 1979 injunction, allowing provider-identified data to be released from CMS under FOIA requests. Even without this tremendous victory for the Wall Street Journal, there was already a shift away from aggregation studies in healthcare towards using Big Data methods on specific doctors to improve healthcare. This critical shift will allow us to determine which doctors are doing the best job, and which are doing the worst. We can target struggling doctors to help improve care, and we can also target the best doctors, so that we can learn new best practices in healthcare.

Evidence-based medicine must be targeted to handle specific clinical contexts. The only really open questions to decide are “how much data should we release” and “should this be done in secret or in the open.” I submit that the targeting should be done at the individual and team level, and that this must be an open process. We need to start tracking the performance and clinical decisions of specific doctors working with other specific doctors, in a way that allows for public scrutiny. We need to release uncomfortably personal data about specific physicians and evaluate that data in a fair manner, without sparking a witch-hunt. And whether you agree with this approach or not, it’s already underway. The overturning of this court case will only open the floodgates further.

Last year, I released DocGraph at Strata RX. DocGraph details how specific doctors and hospitals team together to deliver healthcare (data on referral patterns, etc.). At that conference we (Not Only Development) promised that this data set would go Open Source eventually. This month, we will be announcing that the DocGraph data set is available as a free download. But there are two other data sets that can now be used to make that graph data much richer.

“Graph” or “network” data, in this context, refers to a computing technique that represents nodes (in this case doctors, hospitals, etc.) connected by edges (in this case, representing which doctors and hospitals work together). So DocGraph explicitly states that Dr. Smith (a primary care doctor) is working with Dr. Jones (a cardiologist), and assigns a strength to that relationship.
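
As a minimal sketch of that idea, a weighted graph can be held in a plain dictionary of nodes and edges. This is not DocGraph's actual file format, which is not described here. The names echo the example above, and the hospital and all strengths are invented for illustration.

```python
from collections import defaultdict

# Hypothetical edges: (from_provider, to_provider, strength).
# The names and weights are made up; they are not DocGraph data.
edges = [
    ("Dr. Smith", "Dr. Jones", 42),
    ("Dr. Smith", "General Hospital", 17),
    ("Dr. Jones", "General Hospital", 8),
]

def build_graph(edge_list):
    """Return an adjacency map: node -> {neighbor: strength}."""
    graph = defaultdict(dict)
    for source, target, strength in edge_list:
        graph[source][target] = strength
    return graph

def strongest_partner(graph, node):
    """Return the neighbor with the highest edge strength, or None."""
    neighbors = graph.get(node, {})
    if not neighbors:
        return None
    return max(neighbors, key=neighbors.get)

graph = build_graph(edges)
print(strongest_partner(graph, "Dr. Smith"))
```

The point of the structure is that every question is asked about a named node and its named neighbors, rather than about an average over all providers.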

In order to take a network graph approach to modeling the healthcare system, you have to name names. We have had named, specific, and incriminating data on US hospitals for years now. This openness has been at the heart of a revolution in the field of patient safety. I believe that this openness has been a central motivator for the ongoing reduction of “never events”, perhaps even more important than corresponding payment reforms. In some cases, I even have data to back this assertion.

I am a devotee of new thinking about human motivation and I believe very strongly that most doctors, even the “bad” ones, want to be the best they can at what they do, largely independent of financial incentives. But doctors need uncomfortably personal data on how they, specifically, are doing in order to start doing it better.

This has been the month for the open release of uncomfortably personal data about the healthcare system.

First, HHS announced the release of data on master charge sets for hospitals. This is likely a direct response to the problems with master charge sets in the masterful Bitter Pill: Why Medical Bills Are Killing Us article by Steven Brill in Time Magazine. The data released by CMS is a complex data set about an even more complex medico-legal issue, but a useful oversimplification is this: it shows which hospitals have been the worst abusers of cash-paying, uninsured patients. You could write entire articles about the structure, value, and depth of this data set…and I plan to, given infinite time and resources.

But then, in the midst of this, Propublica released data on the prescribing patterns of almost every doctor in the US. This is detailed information about the preferences of almost every doctor who participates in the Medicare Part D prescription program, which is almost every doctor. Does your doctor prefer Oxycontin to Vicodin? Which antibiotic does your doctor use most frequently? These choices, taken together, can be called a “prescribing pattern.” Propublica is allowing the public to view those patterns for specific doctors.
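
To make the term concrete, here is a small sketch of how a prescribing pattern could be computed from per-prescription records. The record layout, the doctor names, and every row are hypothetical. This is not the format of the Propublica data, only an illustration of "choices, taken together."

```python
from collections import Counter

# Hypothetical prescription records: (doctor, drug). Not real data.
records = [
    ("Dr. Smith", "Vicodin"),
    ("Dr. Smith", "Vicodin"),
    ("Dr. Smith", "Oxycontin"),
    ("Dr. Smith", "Amoxicillin"),
    ("Dr. Jones", "Oxycontin"),
]

def prescribing_pattern(rows, doctor):
    """Return each drug's share of the given doctor's prescriptions."""
    counts = Counter(drug for doc, drug in rows if doc == doctor)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {drug: n / total for drug, n in counts.items()}

pattern = prescribing_pattern(records, "Dr. Smith")
print(pattern)
```

In this made-up sample, half of Dr. Smith's prescriptions are for Vicodin, which is the kind of per-doctor preference the text describes.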

These data sets are having a combinatorial impact for those of us who are interested in researching the healthcare system.

Taken together, these changes in data release policy represent two important shifts in the analysis of the healthcare system. We are moving from proprietary analysis to open analysis and from aggregate data to graph (network) data. These moves parallel past scientific process breakthroughs:

Proprietary science -> Open Science: illustrated by the move from Alchemy to Chemistry. Alchemists were famous for doing work in secret, hoping to learn and hoard the secrets for turning lead into gold or the secret to eternal life. Chemistry began when researchers gave up secrecy and started sharing important results openly.

Aggregate models -> Network models: illustrated by the move from chromosomal models of genetic inheritance to omic (genomic/proteomic/etc.) network models of genetic inheritance. Mendel could spot patterns in the colors of his peas because those genes operated on the chromosomal level. The chromosome acts as a natural phenotypical aggregator for a much more complex genetic process. But that aggregation limits what can be studied. The discovery of DNA allowed researchers to start analyzing inheritance using network models.

For years, there has been a proprietary market for data about how doctors behave, specifically around prescribing patterns. IMS, for instance, is a leading data vendor for this information. If you want to purchase prescribing or referral pattern data, IMS will happily provide it (hint: you can’t afford it). Despite the high barrier to access, IMS’s service has been very unpopular with doctors, and the AMA successfully lobbied for a mechanism that would allow for a doctor to opt-out of these prescribing databases. So there is a well-established data vendor community here, with some recent Big Data entrants.

This is not necessarily by choice: most doctor data is released reluctantly by data owners. They are concerned with ensuring that doctor data does not spill into the public domain. In order to run their businesses, IMS, Activate, and Kyruus have likely made contractual promises that require them to keep much of the data that they have access to private. In short, these companies are “behind the curtain” of healthcare informatics. They get substantial benefits by having access to this kind of private healthcare data and they must accept certain limitations in its use. My limited interactions with these companies have shown that they are as enthusiastic about open data in healthcare as I am.

When even proprietary players want to shift to more open accountable data models, it is fair to say that this shift is widely accepted. As a society, it is critical for us to move this graph healthcare data, as much as possible, into the open. This will allow data scientists from IMS, Activate, Kyruus and others to collaborate with journalists, academics and the open source developer community to make doctor and hospital performance into an open science. HHS has done its part by proactively releasing new critical data sets and by electing not to “fight” FOIA requests that seek even more data. This is substantive evidence that the mission of open data, inspired by President Obama and implemented by Federal CTO Todd Park, is a reality.

The second shift is away from aggregate models for healthcare researchers. While Propublica, Kyruus, and Activate have the Big Data chops to lead this shift, the rank and file healthcare researcher still routinely uses aggregate data as the method of choice for analyzing the healthcare system. This is true of academic researchers, healthcare industry administrators, and policy makers alike. Understanding statistics is really the first step towards being a well-rounded data scientist, but it is only the first step.

Traditional statistical approaches, like traditional economic approaches, are powerful because they make certain simplifying assumptions about the world. Like many generalizations, they are useful cognitive shortcuts until they are too frequently proven untrue. There is no “normal” prescribing pattern, for instance, against which a given provider can be judged.

Using averages across zip code, city, state or regional boundaries is a useful way to detect and describe the problems that we have in healthcare. But it is a terrible way to create feedback and control loops. An infectious disease clinic, for instance, will be unperturbed to learn that it has higher “rate of infection” scores than neighboring clinics. In fact, it is a magnet for infection cases, and should have a higher infection “case load” but, hopefully, a lower “infection rate”. Many scoring systems are unable to make these kinds of distinctions. Similarly, a center for cancer excellence would not be surprised to learn that it has shorter life span scores than other cancer treatment centers. Any “last resort” clinical center would show those effects, as it attracts the most difficult cases.
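
The case load versus rate distinction can be sketched with a few lines of arithmetic. The two clinics and all counts below are invented. The "naive score" is an assumption about how a blunt scoring system might work (counting every infected patient regardless of where the infection began), not a description of any real scoring method.

```python
# Invented numbers for two hypothetical clinics; not real data.
clinics = {
    "Infectious disease clinic": {
        "patients": 1000,
        "arrived_infected": 600,   # referred in because of an infection
        "acquired_in_clinic": 5,   # infections that started during care
    },
    "Neighboring clinic": {
        "patients": 1000,
        "arrived_infected": 50,
        "acquired_in_clinic": 20,
    },
}

def naive_infection_score(clinic):
    """Share of patients with any infection, regardless of origin."""
    infected = clinic["arrived_infected"] + clinic["acquired_in_clinic"]
    return infected / clinic["patients"]

def acquired_infection_rate(clinic):
    """Share of patients who picked up an infection during care."""
    return clinic["acquired_in_clinic"] / clinic["patients"]

for name, clinic in clinics.items():
    print(name, naive_infection_score(clinic), acquired_infection_rate(clinic))
```

With these made-up counts, the specialist clinic looks far worse on the naive score and better on the acquired rate, which is exactly the distinction an averaged score hides.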

It is difficult to use averages, scores, and other simplistic mathematical shortcuts to detect real problems in healthcare. We need a new norm where the average healthcare researcher’s initial tool of choice is Gephi rather than Excel and Neo4j rather than SQL. The aggregate approach has taken the healthcare system this far, but we need a deeper understanding of how the healthcare system works, and fails, if we want to achieve the triple aim. We need models that incorporate details about specific doctors and hospitals. We need to move from simplistic mathematical shortcuts to complex mathematical models, Big Data if you like that term.

We need to have both shifts at the same time. It is not enough to have the shift to open data, in aggregate, or the shift to network models trapped behind the “insiders’ curtain”. That has been happening for years, and it creates troubling power dynamics. When the public can see only averages, but the insiders get to see the graph of healthcare, we will enjoy only narrow optimizations and limited uses.

Currently, there is no simple way for data scientists at an EHR company, an insurance company, and a drug company to teach each other how to better leverage the healthcare graph. Each of these companies has a lens on the “true” graph created using only a slice of the relevant data. But the limitations of the data are really the least of the problems facing these researchers. For each data scientist, working in isolation, there is no general way to test hypotheses with outsiders. There is no way to “stand back and try science” because science is a community process.

We need to create a community of healthcare graph researchers and provide that community with the non-aggregate data it needs to create the algorithms that will dictate how medicine operates for the next century. This is not a project for any single company to take on; we are betting too much as a society to have that kind of pressure. No company or data scientist I know of even wants that kind of role. Instead, every company in the space is interested in leveraging and contributing open data, so that the hypotheses and methods developed “behind the curtain” can be validated in the open.

Before the release of these three data sets, data scientists were in the tremendously uncomfortable position of having to make critical business decisions while being only “probably” right. Given the ease with which “probably right” can turn into “completely wrong” with data, we should work hard to ensure that data scientists are not put in this position again.
