The Big (and Small) Data Conundrum

In a CMAJ article titled “Big data’s dirty secret” (Webster, June 6), the author highlights the barriers to leveraging big data in Canada and points to “technological limitations stemming from mismanagement by government e-health agencies and commercial turf battles” as an important cause of this problem. Webster also cites others who have had challenges with the data generated from EMRs currently used in Canada, including Patricia Sullivan-Taylor, manager of primary health care information (Canadian Institute for Health Information), and Alex Mair, director of health system use for Canada Health Infoway, the organization responsible for developing and implementing Canada’s health infostructure. According to Sullivan-Taylor, each EMR system produces different types of data, something that was not resolved before EMRs were installed in physician offices. Mair also acknowledges that progress in using “big data” has been slow and that the resultant inability to gain insights into clinical data is a key gap for clinicians.

While I agree with many of the concepts and comments in the article, I would like to add further clarifications. Mario Bojilov recently published a concise overview of big data and defines it as “data sets that, due to their size (volume), the speed they are created with (velocity), and the type of information they contain (variety), are pushing the existing infrastructure to its limits”. Big data, in healthcare, seems to have become synonymous with data analytics, “a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making” (Wikipedia); however, I believe it is misleading to broadly think of EMR systems in physician offices as generators of big data. While large networks of clinical users on a common EMR can certainly generate large amounts of data, many of the current limitations apply as much to “small data” as they do to the high velocity, volume, and variety of big data.

Simply put, there are many EMR systems that cannot effectively be used to generate insights from small collections of data, never mind big cumulative data sets. For example, EMR systems that record information in narrative format (not as discrete data) are unable to generate analytic reports for patient populations because they were never designed to function as analytical tools. EMRs that collect data in highly structured formats can output information that can be queried with analytical tools, whether those tools are built into the EMR or are third-party software that takes data sets generated by EMRs and provides analysis either as integrated functionality or outside of the clinical system. However, taking data from multiple EMRs and merging it together is not possible. Why, you might ask? Well, because each EMR system has been built using proprietary data structures without any coordinated national effort to define national data standards for EMRs. The reason that Interac bank machines work is that all the messages and data structures are the same, irrespective of which bank you may use. The same cannot be said for EMR data. One of the great technology successes of the UK’s defunct National Programme for IT was a project called GP2GP, through which a patient’s electronic health records can be transferred directly and securely between GP practices. This was made possible because of standardized data structures.
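To make the data-structure problem concrete, here is a minimal Python sketch. Everything in it is hypothetical: the two "vendor" export formats, the field names, the sample values, and the `BloodPressureReading` target structure are invented for illustration and do not describe any real EMR product or standard. It only shows why, without a shared structure, each system needs its own hand-written translation before records can sit in one data set.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BloodPressureReading:
    """Hypothetical shared structure that every source is mapped into."""
    patient_id: str
    measured_on: date
    systolic_mmhg: int
    diastolic_mmhg: int


def from_vendor_a(row: dict) -> BloodPressureReading:
    # Assumed format: vendor A stores BP as one "120/80" string and an ISO date.
    systolic, diastolic = row["BP"].split("/")
    return BloodPressureReading(
        patient_id=row["PatientID"],
        measured_on=date.fromisoformat(row["Date"]),
        systolic_mmhg=int(systolic),
        diastolic_mmhg=int(diastolic),
    )


def from_vendor_b(row: dict) -> BloodPressureReading:
    # Assumed format: vendor B stores separate fields and a day/month/year date.
    day, month, year = (int(part) for part in row["obs_date"].split("/"))
    return BloodPressureReading(
        patient_id=row["pt"],
        measured_on=date(year, month, day),
        systolic_mmhg=int(row["sys"]),
        diastolic_mmhg=int(row["dia"]),
    )


# Invented sample rows, one per hypothetical system.
vendor_a_rows = [{"PatientID": "A-001", "Date": "2013-06-06", "BP": "120/80"}]
vendor_b_rows = [{"pt": "B-417", "obs_date": "06/06/2013", "sys": "135", "dia": "85"}]

# Only after translation can the two sources be combined and queried together.
merged = [from_vendor_a(r) for r in vendor_a_rows] + [
    from_vendor_b(r) for r in vendor_b_rows
]

for reading in merged:
    print(reading)
```

With a nationally agreed structure, the two translation functions would be unnecessary; without one, the number of such adapters grows with every EMR product and every data element.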

As provincial EMR programs begin to wind down, starting with Alberta in 2014, and provinces transition to a support and optimization role for users of EMRs, the lack of data standards for EMRs is going to become a greater and greater problem. Short of completely redesigning existing EMRs using the same data structures and standards for all systems, it will be a nearly impossible task to retrofit the EMRs used by 70% of target physicians. We are stuck with what we currently have. The policy failures of the past have created the pickle we are now in: insufficient support for clinical leadership in the early stages of EMRs, and a lack of focus on developing and implementing standardized data structures and clinical messages for EMR systems, nationally and provincially.

Before we can fully utilize and gain insights from big data, we had better sort out small data in EMRs. This needs to begin with a focus on data quality, the management of patient populations at the practice level, and ensuring that the clinical messages passed between EMRs use nationally accepted standards so that data is interoperable and usable by other systems.

What do you think? Are there other priorities to consider in order to make data more usable by multiple EMRs? To add your thoughts, click on the “Comments” link below.

Comments

Your characterization of Big Data is accurate. We don't really have Big Data in Canadian Healthcare at the moment. I think Big Data in healthcare would look something like an EHR data set that included all data on all patients being generated in physician practices, hospitals, specialist offices, laboratory and diagnostic imaging facilities, genomic data, remote monitoring data and all the transactions that go with patient care, such as lab orders, prescriptions, dispenses and lab results. If all of this data from across the country were being updated in real-time, then we'd have Big Data.

In the Canadian context, we are very far from anything like that on any level of scale.

I will say that within the CPCSSN project (www.cpcssn.ca), we now have a national EMR database that is standardized. We have developed processes to extract data from 12 different EMRs from across the country and put them into a standard database. We have also developed algorithms to clean up that data for about 8 different diseases (we clean up the diagnoses, medications, vital signs, lab results, referrals, risk factors, etc).

We have been extracting and processing data on a quarterly basis for the last 3 years, continuously improving the quality of the data extraction and cleaning processes. We are now up to 380 physicians participating in CPCSSN and we have over 450,000 anonymized patient records in the database.

We recently presented our findings at the eHealth Conference and will be presenting at NAPCRG and Family Medicine Forum this year.

We are realizing how important standardization is and we're learning that we can do a lot of data cleaning behind the scenes and take the pressure off physicians in terms of forcing them to pick items off lists for data entry. Pick lists are time consuming to use and if not designed correctly can cause their own sets of errors, especially in fast paced environments. Forcing users to standardize their data is not a panacea for our data woes. It shifts the blame of dirty data onto people whose role is not data entry, but patient care. If we force standardization of data, we may inadvertently impact patient care. Benefits seldom come without harms.

I'm very curious about Karim's comment: "Forcing users to standardize their data is not a panacea for our data woes. It shifts the blame of dirty data onto people whose role is not data entry, but patient care. If we force standardization of data, we may inadvertently impact patient care. Benefits seldom come without harms."

The "shift the blame" part of this comment seems to imply that if a clinician thinks the right thought, but records the wrong thing, he or she should be blameless regarding the "dirty data" because their role is patient care, not data entry. I purposefully used the term clinician in my sentence, above. Is the assertion that such responsibility for accurate data applies to others (nurses, for example), but ought not to apply to physicians? I've known Karim a long time and don't believe that's what he's trying to say... but that impression can be taken from the comment.

Healthcare is a knowledge industry. I would say that we ARE now realizing how important standardization is; and we are seeing we DO have to force users to standardize their data. I can also say that I have been a colleague of both Alan and Karim, for many years now, at eHealth standards fora where the rest of the physician community was grossly under-represented. I'm sad to say, when it comes to both BIG data and SMALL data, we are reaping what we've sown... or perhaps not reaping what we didn't sow.

The old exhortation might need to be rewritten for the eHealth era: "physician, inform thyself."

Forcing people to do anything is the old way. Google doesn't ask me to standardize data entry before giving me access to good map functionality. I can put in an address pretty much any old way and it'll show me how to get there without asking me to 'standardize' my entry to fit their preconceived norms.

If they can do that for a service that is free, we should be able to do it for a health care system that collectively costs us $200 billion each year.

You obviously did not attend my session on the CPCSSN project at eHealth. If you had, you would know that we've moved the state of the art on data cleaning and you would know exactly what I'm talking about and wouldn't be misinterpreting what I wrote.

Our findings are that we can clean up data using sophisticated algorithms BETTER than humans can. Why would you ask a human to do something that a computer can do better? Isn't that the essence of health informatics: to harness the power of technology for the improvement of health?

One of my major disappointments in e-health is that the potential convergence of a number of different technologies that would reduce the data capture burden and promote data standardization has not been realized. Pick lists for coded data have always been a "programmer easy but end user hard" kind of solution. A combination of structured templates, voice recognition and controlled terminologies supported by a semantic infrastructure can theoretically produce more effective standardized data collection. However, it would take a great deal of collaboration among informatics professionals, clinician specialty leaders, and technical specialists in both middleware and user interface design to work out the bugs. This level of collaboration is too knowledge intensive for any single organization to undertake and has not received the kind of investment it would take to be successful. Maybe when the pressure on the current health system gets great enough, the willingness to invest and collaborate will emerge. In the meantime, the old saw about data analysis will hold true: garbage in, garbage out.