Andy Oram is an editor at O'Reilly Media, a highly respected book publisher and technology information provider. An employee of the company since 1992, Andy currently specializes in open source, software engineering, and health IT, but his editorial output has ranged from a legal guide covering intellectual property to a graphic novel about teenage hackers. His articles have appeared often on EMR & EHR and other blogs in the health IT space.
Andy also writes often for O'Reilly's Radar site (http://oreilly.com/) and other publications on policy issues related to the Internet and on trends affecting technical innovation and its effects on society. Print publications where his work has appeared include The Economist, Communications of the ACM, Copyright World, the Journal of Information Technology & Politics, Vanguardia Dossier, and Internet Law and Business. Conferences where he has presented talks include O'Reilly's Open Source Convention, FISL (Brazil), FOSDEM, and DebConf.

The previous section of this article introduced Apixio’s analytics for payers in the Medicare Advantage program. Now we’ll step through how Apixio extracts relevant diagnostic data.

The technology of PDF scraping
Providers usually submit SOAP notes to the Apixio web site in the form of PDFs. This came as a surprise to me, after hearing about the extravagant efforts that have gone into new CCDs and other formats such as the Blue Button project launched by the VA. Normally provided in an XML format, these documents claim to adhere to standards and offer a relatively gentle face to a computer program. In contrast, a PDF is one of the most challenging formats to parse: words and other characters are reduced to graphical symbols, while layout bears little relation to the human meaning of the data.

Structured documents such as CCDs contain only about 20% of what CMS requires, and are often formatted in idiosyncratic ways, so that even the best CCDs would be no more informative than a Word document or PDF. But the main barrier to getting information, according to Schneider, is that Medicare Advantage works through the payers, and providers can be reluctant to give payers direct access to their EHR data. This reluctance springs from a variety of reasons, including worries about security, the feeling of being deluged by requests from payers, and a belief that the providers’ IT infrastructure cannot handle the burden of data extraction. Their stance has nothing to do with protecting patient privacy, because HIPAA explicitly allows providers to share patient data for treatment, payment, and operations, and that is exactly what they are doing when they hand sensitive data to Apixio in PDF form. Thus, Apixio had to master OCR and text processing to serve that market.

Processing a PDF requires several steps, integrated within Apixio’s platform (the sketch following the list shows the whole flow):

Optical character recognition (OCR) to re-create the text from the page images in the PDF.

Further structuring to recognize, for instance, when the PDF contains a table that needs to be broken up horizontally into columns, or constructs such as the field name “Diagnosis” followed by the desired data.

Natural language processing to find the grammatical patterns in the text. This processing must understand medical terminology, common abbreviations such as CHF, and coding systems.

Analytics that pull out the data relevant to risk and present it in a usable format to a human coder.
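Apixio has not published its pipeline, so the following is only a toy sketch of the four stages, using the open source pdf2image and pytesseract packages as stand-ins for commercial OCR and NLP; the field names and abbreviation list are invented for illustration.

```python
import re

from pdf2image import convert_from_path  # renders each PDF page as an image
import pytesseract                       # wrapper around the Tesseract OCR engine

def extract_candidates(pdf_path):
    # Stage 1: OCR re-creates the text from the page images.
    pages = convert_from_path(pdf_path)
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)

    # Stage 2: light structuring, e.g. "Diagnosis: ..." field constructs.
    fields = re.findall(r"(Diagnosis|Assessment)\s*:\s*(.+)", text)

    # Stage 3: a crude stand-in for NLP, matching common abbreviations.
    vocabulary = {"CHF": "congestive heart failure",
                  "COPD": "chronic obstructive pulmonary disease"}
    mentions = [term for abbr, term in vocabulary.items()
                if re.search(rf"\b{abbr}\b", text)]

    # Stage 4: hand the candidate diagnoses to a human coder for review.
    return {"fields": fields, "mentions": mentions}
```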

Apixio can accept dozens of notes covering the patient’s history. It often turns up diagnoses that “fell through the cracks,” as Schneider puts it. The diagnostic information Apixio returns can be used by medical professionals to generate reports for Medicare, but it has other uses as well. Apixio tells providers when they are treating a patient for an illness that does not appear in their master database. Providers can use that information to deduce when patients are left out of key care programs that can help them. In this way, the information can improve patient care. One coder they followed was able to triple her rate of reviewing patient charts using Apixio’s service.

Caught between past and future
If the Apixio approach to culling risk factors appears round-about and overwrought, like bringing in a bulldozer to plant a rosebush, think back to the role of historical factors in health care. Given the ways doctors have been taught to record medical conditions, and the tools available to them, Apixio does its small part in promoting the progressive role of accountable care.

Hopefully, changes to the health care field will permit more direct ways to deliver accountable care in the future. Medical schools will convey the requirements of accountable care to their students and teach them how to record data that satisfies those requirements. Technologies will make it easier to record risk factors the first time around. Quality measures and the data needed by policy-makers will be clarified. And most of all, the advantages of collaboration will lead providers and payers to form business agreements or even merge, at which point the EHR data will be opened to the payer. The contortions providers currently go through in trying to achieve 21st-century quality remind us of where the field needs to go.

Many of us strain against the bonds of tradition in our workplace, harboring a secret dream that the industry could start afresh, streamlined and free of hampering traditions. But history weighs on nearly every field, including my own (publishing) and the one I cover in this blog (health care). Applying technology in such a field often involves the legerdemain of extracting new value from imperfect records and processes with deep roots.

Along these lines, when Apixio aimed machine learning and data analytics at health care, they unveiled a business model based on measuring risk more accurately so that Medicare Advantage payments to health care payers and providers reflect their patient populations more appropriately. Apixio’s tools permit improvements to patient care, as we shall see. But the core of the platform they offer involves uploading SOAP notes, usually in PDF form, and extracting diagnostic codes that coders may have missed or that may not be supportable. Machine learning techniques extract the diagnostic codes for each patient over the entire history provided.

Many questions jostled in my mind as I talked to Apixio CTO John Schneider. Why are these particular notes so important to the Centers for Medicare & Medicaid Services (CMS)? Why don’t doctors keep track of relevant diagnoses as they go along in an easy-to-retrieve manner that could be pipelined straight to Medicare? Can’t modern EHRs, after seven years of Meaningful Use, provide better formats than PDFs? I asked him these things.

A mini-seminar ensued on the evolution of health care and its documentation. A combination of policy changes and persistent cultural habits has tangled up the various sources of information over many years. In the following sections, I’ll look at each aspect of the documentation bouillabaisse.

Although many accountable care contracts–like those of the much-maligned 1970s Managed Care era–ignore differences between patients, more thoughtful programs recognize that accurate and fair payments require measurement of how much risk the health care provider is taking on–that is, how sick their patients are. Thus, providers benefit from scrupulously complete documentation (having learned that upcoding and sloppiness will no longer be tolerated and will lead to significant fines, according to Schneider). And this would seem to give the provider an incentive to capture every nuance of a patient’s condition in a clearly coded, structured way.

But this is not how doctors operate, according to Schneider. They rebel when presented with dozens of boxes to check off, as crude EHRs tend to present things. They stick to the free-text SOAP note (fields for subjective observations, objective observations, assessment, and plan) that has been taught for decades. It’s often up to post-processing tools to code exactly what’s wrong with the patient. Sometimes the SOAP notes don’t even distinguish the four parts in electronic form, but exist as free-flowing Word documents.
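To make that structure concrete, here is a minimal sketch that splits an unusually well-behaved free-text SOAP note into its four parts. The note and the regular expression are my own invention; real notes, as Schneider observes, rarely label their sections this cleanly.

```python
import re

# A hypothetical, unrealistically tidy SOAP note.
NOTE = """Subjective: Patient reports shortness of breath on exertion.
Objective: BP 150/95, bilateral ankle edema.
Assessment: Likely CHF exacerbation.
Plan: Start furosemide, follow up in two weeks."""

SECTIONS = ("Subjective", "Objective", "Assessment", "Plan")

def split_soap(note):
    # Capture each labeled section up to the next label or the end of the note.
    labels = "|".join(SECTIONS)
    pattern = rf"({labels})\s*:\s*(.*?)(?=\n(?:{labels})\s*:|\Z)"
    return {name: body.strip() for name, body in re.findall(pattern, note, re.S)}

print(split_soap(NOTE)["Assessment"])   # -> Likely CHF exacerbation.
```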

A number of key diagnoses come from doctors who have privileges at the hospital but come in only sporadically to do consultations, and who therefore neither understand the layout of the EHR nor attempt to use what little structure it provides. Another reason codes get missed or don’t easily surface is that doctors are overwhelmed: accurately recording diagnostic information in a structured way is a significant extra burden, an essentially clerical function loaded onto these highly skilled healthcare professionals. Thus, extracting diagnostic information often involves “reading between the lines,” as Schneider puts it.

For Medicare Advantage payments, CMS wants a precise delineation of properly coded diagnoses in order to discern the risk presented by each patient. This is where Apixio comes in: it mines the free-text SOAP notes for information that can enhance such coding. We’ll see what they do in the next section of this article.

The first part of this article summarized what Web developers have done to structure data, and started to look at the barriers presented by health care. This part presents more recommendations for making structured data work.

The Grand Scheme of Things
Once you start classifying things, it’s easy to become ensnared by grandiose pipe dreams and enter a free fall trying to design the perfect classification system. A good system is distinguished by knowing its limitations. That’s why microdata on the Web succeeded. In other areas, the field of ontology is littered with the carcasses of projects that reached too far. And health care ontologies always teeter on the edge of that danger.

Let’s take an everyday classification system as an example of the limitations of ontology. We all use genealogies. Imagine being able to sift information about a family quickly, navigating from father to son and along the trail of siblings. But even historical families, such as royal ones, introduce difficulties right away. For instance, children born out of wedlock should be shown differently from legitimate heirs. Modern families present even bigger headaches. How do you represent blended families where many parents take responsibilities of different types for the children, or people who provided sperm or eggs for artificial insemination?
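A small sketch shows the modeling problem. A naive genealogy assumes exactly one father and one mother per child; the cases above push you toward something like a role-labeled edge list instead. The names and roles below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParentLink:
    parent: str
    child: str
    role: str  # e.g. "biological", "step", "sperm donor", "legal guardian"

# One child, three parent figures: a rigid two-parent tree cannot hold this.
family = [
    ParentLink("Maria", "Sam", "biological"),
    ParentLink("Jo", "Sam", "step"),
    ParentLink("Donor 114", "Sam", "sperm donor"),
]

def parents_of(child, links, role=None):
    # With explicit roles, a query can ask for exactly the kind of parent it needs.
    return [l.parent for l in links
            if l.child == child and (role is None or l.role == role)]

print(parents_of("Sam", family))                     # all three parent figures
print(parents_of("Sam", family, role="biological"))  # ['Maria']
```

Even this looser model fails for some families, which is the point: a good classification system knows where it stops.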

Transgender people present another enormous challenge to ontologies and EHRs. They’re a test case for every kind of variation in humanity. Their needs and status vary from person to person, with no classification suiting everybody. These needs can change over time as people make transitions. And they may simultaneously need services defined for male and female, with the mix differing from one patient to the next.

Getting to the Point
As the very term “microdata” indicates, those who wish to expose semantic data on the Web can choose just a few items of information for that favored treatment. A movie theater may have text on its site extolling its concession stand, its seating, or its accommodations for the disabled, but these are not part of the microdata given to search engines.
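The selectivity is the point, and it can be sketched in a few lines. The theater below is hypothetical, and the structured fields follow the schema.org MovieTheater vocabulary in JSON-LD style; everything else on the page stays prose-only.

```python
import json

# Everything the theater's page says about itself...
page_content = {
    "name": "Bijou Cinema",
    "telephone": "+1-555-0123",
    "concessions": "Fresh popcorn and local craft sodas!",
    "seating": "Reclining seats in every auditorium.",
    "wheelchair_access": "All screens reachable by elevator.",
}

# ...versus the few fields chosen for structured exposure to search engines.
structured = {
    "@context": "https://schema.org",
    "@type": "MovieTheater",
    "name": page_content["name"],
    "telephone": page_content["telephone"],
}

print(json.dumps(structured, indent=2))
```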

A big problem in electronic health records is their insistence that certain things be filled out for every patient. Any item that is of interest for any class of patient must appear in the interface, a problem known in the data industry as a Cartesian explosion. Many observers counsel a “less is more” philosophy in response. It’s interesting that a recent article that complained of “bloated records” and suggested a “less is more” approach goes on to recommend the inclusion of scads of new data in the record, to cover behavioral and environmental information. Without mentioning the contradiction explicitly, the authors address it through the hope that better interfaces for entering and displaying information will ease the burden on the clinician.

The various problems with ontologies that I have explained throw doubt on whether EHRs can attain such simplicity. Patients are not restaurants. To really understand what’s important about a patient–whether to guide the clinician in efficient data entry or to display salient facts to her–we’ll need systems embodying artificial intelligence. Such systems always feature false positives and negatives. They also depend on continuous learning, which means they’re never perfect. I would not like to be the patient whose data gets lost or misclassified during the process of tuning the algorithms.

I do believe that some improvements in EHRs can promote the use of structured data. Doctors should be allowed to enter the data in the order and the manner they find intuitive, because that order and that manner reflect their holistic understanding of the patient. But suggestions can prompt them to save some of the data in structured format, without forcing them to break their trains of thought. Relevant data will be collected and irrelevant fields will not be shown or preserved at all.

The resulting data will be less messy than what we have in unstructured text currently, but still messy. So what? That is the nature of data. Analysts will make the best use of it they can. But structure should never get in the way of the information.

Most innovations in electronic health records, notably those tied to the Precision Medicine initiative that has recently raised so many expectations, operate by moving clinical information into structure of one type or another. This might be a classification system such as ICD, or a specific record such as “medications” or “lab results” with fixed units and lists of names to choose from. There’s no arguing against the benefits of structured data. But its costs are high as well. So we should avoid repeating old mistakes. Experiences drawn from the Web may have something to teach the health care field in respect to structured data.

What Works on the Web
The Web grew out of a structured data initiative. The dream of organizing information goes back decades, and was embodied in Standard Generalized Markup Language (SGML) years before Tim Berners-Lee stole its general syntax to create HTML and present information on the Web. SGML could let a firm mark in its documents that FR927 was a part number whereas SG1 was a building. Any tags that met the author’s fancy could be defined. This put semantics into documents. In other words, the meaning of text could be abstracted from the text and presented explicitly. Those semantics got stripped out of HTML. Although the semantic goals of SGML were re-introduced into HTML’s intended successor, XML, that language found only niche uses on the Web. Another structured data tool, JSON, was reserved for data storage and exchange, not text markup.

Since the Web got popular, people have been trying to reintroduce semantics into it. There was Dublin Core, then RDF, then microdata in places like schema.org–just to list a few. Two terms denoting structured data on the Web, the Semantic Web and Linked Data, have been enthusiastically taken up by the World Wide Web Consortium and Tim Berners-Lee himself.

But none of these structured data initiatives are widely known among the Web-browsing public, probably because they all take a lot of work to implement. Furthermore, they run into the bootstrapping problem faced by nearly all standards: if your web site uses semantics that aren’t recognized by the browser, they’re just dropped on the ground (or even worse, the browser mangles your web pages).

Even so, recent years have seen an important form of structured data take off. When you look up a movie or restaurant on a major search engine such as Google, Yahoo!, or Bing, you’ll see a summary of the information most people want to see: local showtimes for the movie, phone number and ratings for a restaurant, etc. This is highly useful (particularly on mobile devices) and can save you the trouble of visiting the web site from which the data comes. Google calls these summaries Rich Cards and Rich Snippets.

If my memory serves me right, the basis for these snippets didn’t come from standards committees involving years of negotiation among stakeholders. Google just decided what would be valuable to its users and laid out the standard. It got adopted because it was a win-win. The movie theaters and restaurants got their information right into the viewer’s face, and the search engine became instantly more valuable and more likely to be used again. The visitors doing the search obviously benefitted too. Everyone found it worth their time to implement the standards.

Interestingly, as structure moves into metadata, HTML itself is getting less semantic. The most recent standard, HTML5, did add a few modest tags such as header and footer. But many sites are replacing meaningful HTML markup, such as p for paragraph, with two ultra-generic tags: div for a division that is set off from other parts of the page, and span for a piece of text embedded within another. Formatting is expressed through CSS, a separate language.

Having reviewed a bit of Web history, let’s see what we can learn from it and apply to health care.

Make the Customer Happy
Win-win is the key to getting a standard adopted. If your clinician doesn’t see any benefit from the use of structured data, she will carp and bristle at any attempt to get her to enter it. One of the big reasons electronic health records are so notoriously hard to use is, “All those fields to fill out.” And while lists of medications or other structured data can help the doctor choose the right one, they can also help her enter serious errors–perhaps because she chose the one next to the one she meant to choose, or because the one she really wanted isn’t offered on the list.

Doctors’ resentment gets directed against every institution implicated in the structured data explosion: the ONC and CMS, who demand quality data and other fields of information for their own inscrutable purposes; the vendor who designs the clunky system; and the hospital or clinic that forces doctors to use it. But the Web experience suggests that doctors would fill out fields that would help them in their jobs. The use of structured data should be negotiated, not dictated, just like other innovations such as hand-washing protocols or checklists. Is it such a radical notion to put technology at the service of the people using it?

I know it’s frustrating to offer that perspective, because many great things come from collecting data that is used in analytics and can turn up unexpected insights. If we fill out all those fields, maybe we’ll find a new cure! But the promised benefit is too far off and too speculative to justify the hourly drag upon the doctor’s time.

We can fall back on the other hope for EHR improvement: an interface that makes data entry so easy that doctors don’t mind using structured fields. I have some caveats to offer about that dream, which will appear in the second part of this article.

Well, it’s not as if the rest of the universe is a pristine source of well-formed statistics. Every field has to deal with messy data. And somehow retailers, financial managers, and even political campaign staff manage to extract useful information from the data soup. This doesn’t mean that predictions are infallible–after all, when I check a news site about the Mideast conflicts, why does the publisher think I’m interested in celebs from ten years ago whose bodies look awful now? But there is still no doubt that messy data can transform industry.

I’m all for standards and for more reliable means of collecting and vetting patient data. But for the foreseeable future, health care institutions are going to have to deal with suboptimal data. And OCHIN is one of the companies that show how it can be done.

I recently had a chance to talk with OCHIN CEO Abby Sears and Vice President of Data Services and Integration Clayton Gillett, and to see a demo of OCHIN’s analytical tool, Acuere. Their basic offering is a no-nonsense interface that lets clinicians and administrators do predictions and hot-spotting.

Acuere is part of a trend in health care analytics that goes beyond clinical decision support and marshals large amounts of data to help with planning (see an example screen in Figure 1). For instance, a doctor can rank her patients by the number of alerts the system generates (a patient with diabetes whose glucose is getting out of control, or a smoker who hasn’t received counseling for smoking cessation). An administrator can rank a doctor against others in the practice. This summary just gives a flavor of the many services Acuere can perform; my real thrust in this article is to talk about how OCHIN obtains and processes its data. Sears and Gillett talked about the following challenges and how they’re dealing with them.

Figure 1. Acuere Report Card in Acuere
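Acuere’s internals are not public, but the alert-and-rank behavior just described is easy to sketch. The thresholds, field names, and rules below are invented for illustration and are not OCHIN’s actual logic.

```python
# Hypothetical patient panel; fields and values are invented.
patients = [
    {"name": "A", "a1c": 9.4, "smoker": True,  "cessation_counseled": False},
    {"name": "B", "a1c": 6.8, "smoker": False, "cessation_counseled": False},
    {"name": "C", "a1c": 8.1, "smoker": True,  "cessation_counseled": True},
]

# Rule-based alerts of the kind described in the text.
ALERT_RULES = [
    ("glucose out of control", lambda p: p["a1c"] > 8.0),
    ("smoker never counseled", lambda p: p["smoker"] and not p["cessation_counseled"]),
]

def alerts(patient):
    return [label for label, rule in ALERT_RULES if rule(patient)]

# Rank the panel so patients with the most open alerts surface first.
for p in sorted(patients, key=lambda p: len(alerts(p)), reverse=True):
    print(p["name"], alerts(p))
```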

Patient identification
Difficulties in identifying patients and matching their records have repeatedly surfaced as the biggest barrier to information exchange and use in the US health care system. A 2014 ONC report cites it as a major problem (on pages 13 and 20). An article I cited earlier also blames patient identification for many of the problems of health care analytics. But the American public and Congress have been hostile to unique identifiers for some time, so health care institutions just have to get by without them.

OCHIN handles patient matching as other institutions, such as Health Information Exchanges, do. They compare numerous fields of records–not just obvious identifiers such as name and social security number, but address, demographic information, and perhaps a dozen other things. Sears and Gillett said it’s also hard to know which patients to attribute to each health care provider.
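In outline, such matching is a weighted comparison across many fields with a cutoff score. The sketch below uses Python’s standard difflib for string similarity; the fields, weights, and threshold are invented, and production matchers are considerably more sophisticated.

```python
from difflib import SequenceMatcher

# Invented weights; a real matcher would tune these and use many more fields.
WEIGHTS = {"name": 0.4, "dob": 0.3, "address": 0.2, "sex": 0.1}

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b):
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())

a = {"name": "Jon Smith",  "dob": "1961-04-02", "address": "12 Elm St",     "sex": "M"}
b = {"name": "John Smith", "dob": "1961-04-02", "address": "12 Elm Street", "sex": "M"}

score = match_score(a, b)
print(round(score, 2), "match" if score > 0.85 else "no match")
```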

Data sources
The recent Precision Medicine initiative seeks to build “a national research cohort of one million or more U.S. participants.” But OCHIN already has a database on 7.6 million people and has signed more contracts to reach 10 million this fall. Certainly, there will be advantages to the Precision Medicine database. First, it will contain genetic information, which OCHIN’s data suppliers don’t have. Second, all the information on each person will be integrated, whereas OCHIN has to take de-identified records from many different suppliers and try to integrate them using the techniques described in the previous section, plus check for differences and errors in order to produce clean data.

Nevertheless, OCHIN’s data is impressive, and it took a lot of effort to accumulate it. They get not only medical data but information about the patient’s behavior and environment. Along with 200 different vital signs, they can map the patient’s location to elements of the neighborhood, such as income levels and whether healthy food is sold in local stores.
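Conceptually, that mapping is a join between the patient’s location and a table of neighborhood attributes. A minimal sketch, with invented census-style figures:

```python
# Invented neighborhood-level data keyed by ZIP code.
NEIGHBORHOOD = {
    "97217": {"median_income": 58_000, "healthy_food_retail": True},
    "97233": {"median_income": 41_000, "healthy_food_retail": False},
}

# E11.9 is the ICD-10 code for type 2 diabetes without complications.
patient = {"id": "p-102", "zip": "97233", "dx": ["E11.9"]}

env = NEIGHBORHOOD.get(patient["zip"], {})
if patient["dx"] and not env.get("healthy_food_retail", True):
    print("Flag: chronic condition in a neighborhood without healthy food retail")
```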

They get Medicare data from qualified entities who were granted access to it by CMS, Medicaid data from the states, patient data from commercial payers, and even data on the uninsured (a population that is luckily shrinking) from providers who treat them. Each institution exports data in a different way.

How do they harmonize the data from these different sources? Sears and Gillett said it takes a lot of manual translation. Data is divided into seven areas, such as medications and lab results. OCHIN uses standards whenever possible and participates in groups that set standards. There are still labs that don’t use LOINC codes to report results, as well as pharmacies and doctors who don’t use RxNorm for medications. Even ICD-10 changes yearly, as codes come and go.
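The resulting translation tables look roughly like this sketch: local lab codes mapped by hand to LOINC, with anything unmapped kicked out for human review. The source systems and local codes are invented; 4548-4 is the real LOINC code for hemoglobin A1c.

```python
# Hand-curated mapping from (source system, local code) to LOINC.
LOCAL_TO_LOINC = {
    ("lab_westside", "HGBA1C"): "4548-4",
    ("lab_eastside", "A1C%"):   "4548-4",
}

def harmonize(source, local_code, value, unit):
    loinc = LOCAL_TO_LOINC.get((source, local_code))
    if loinc is None:
        # In a real pipeline, an unmapped code would be queued for human review.
        raise KeyError(f"no LOINC mapping for {source}/{local_code}")
    return {"loinc": loinc, "value": value, "unit": unit}

print(harmonize("lab_westside", "HGBA1C", 7.2, "%"))
```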

Data handling
OCHIN isn’t like a public health agency that may be happy sharing data 18 months after it’s collected (as I was told at a conference). OCHIN wants physicians and their institutions to have the latest data on patients, so they carry out millions of transactions each day to keep their database updated as soon as data comes in. Their analytics run multiple times every day, to provide the fast results that users get from queries.

They are also exploring the popular “big data” forms of analytics that are sweeping other industries: machine learning, using feedback to improve algorithms, and so on. Currently, the guidance they offer clinicians is based on traditional clinical recommendations from randomized trials. But they are seeking to supplement those sources with insights from lightweight methods of data analysis.

So data can be useful in health care. Modern analytics should be available to every clinician. After all, OCHIN has made it work. And they don’t even serve up ads for chronic indigestion or 24-hour asthma relief.
