How much unstructured big data is there in the EMR? Unstructured data is data that doesn’t fit into neat columns on a spreadsheet, or into fields and look-up tables in a database; the narrative text in an HPI (history of present illness) is a typical example. It used to be that we sat down with a pen and the paper chart and wrote our progress notes in the office and in the clinic. Or we dictated the notes, which were then transcribed. But with the advent of the EMR, templates have crept in, as has the widespread and controversial practice of copying and pasting text from a previous encounter (see the recent NYT article).

This is interesting in a quirky way. As physicians, nurse practitioners, and other providers have become reluctant data entry clerks, they have adopted many shortcuts to leave time for taking care of patients: templates with stylized or constrained vocabularies, self-generated “smart phrases”, and patient-specific narratives that can be recalled and modified. The remainder of the note is populated with structured data already in the system (labs, test results, x-ray results). Because a patient's condition often does not change dramatically from one day to the next, the truly novel unstructured content in a note may be only a tiny fraction of its total bytes, and the difference between the current note and the previous one may carry as much information as the note itself. But when people get hurried or sloppy, information that is no longer current gets carried forward unchanged. So the key information extraction problem is to identify the true changes, separate them from the relatively static or outdated data that is carried along, and extract the novel information.
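
As a rough illustration of the "identify the true changes" step, here is a minimal sketch that compares two consecutive notes sentence by sentence using Python's standard `difflib` module. The function names (`split_sentences`, `novel_sentences`, `novelty_fraction`) and the naive sentence splitter are my own assumptions for illustration, not part of any EMR system, and real clinical text would need a much more careful tokenizer.

```python
import difflib
import re


def split_sentences(note: str) -> list[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]


def novel_sentences(previous_note: str, current_note: str) -> list[str]:
    """Return sentences of current_note that were inserted or altered
    relative to previous_note."""
    prev = split_sentences(previous_note)
    curr = split_sentences(current_note)
    # autojunk=False keeps difflib from discarding frequently repeated sentences.
    matcher = difflib.SequenceMatcher(a=prev, b=curr, autojunk=False)
    novel = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            novel.extend(curr[j1:j2])
    return novel


def novelty_fraction(previous_note: str, current_note: str) -> float:
    """Share of the current note's characters that sit in novel sentences."""
    novel_chars = sum(len(s) for s in novel_sentences(previous_note, current_note))
    return novel_chars / max(len(current_note), 1)


if __name__ == "__main__":
    yesterday = ("Patient admitted with pneumonia. Afebrile overnight. "
                 "Continue ceftriaxone.")
    today = ("Patient admitted with pneumonia. Febrile to 38.9 overnight. "
             "Continue ceftriaxone. Blood cultures sent.")
    for sentence in novel_sentences(yesterday, today):
        print(sentence)
```

A diff like this only surfaces text that changed. It cannot tell whether an unchanged sentence is still true or is stale text that was carried forward, so the second half of the problem would need something beyond text comparison, such as a check against the structured data.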

How is this relevant to big data analytics in medicine? If much of the content is captured by a stylized vocabulary and filled in with structured data already present in data tables, how much independent information will there be in a medical note? And if the data has dependencies because of this stylized nature and these controlled vocabularies, how does that affect data mining and statistical analytics? I am not sure if this type of problem has a formal technical term in machine learning, but if not, it is likely to get one soon!
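
One crude way to put a number on "how much independent information" is to ask how many extra compressed bytes a note costs once the previous note is already known. The sketch below uses Python's standard `zlib` module for this; the function names are hypothetical, and compressed size is only a rough stand-in for information content, not a formal measure.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 text (maximum compression level)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def incremental_information_ratio(previous_note: str, current_note: str) -> float:
    """Rough proxy (assumption): extra compressed bytes needed for current_note
    when previous_note is already known, relative to compressing current_note alone.
    Values near 0 suggest mostly copied text; values near 1 suggest mostly new text."""
    alone = compressed_size(current_note)
    given_previous = (compressed_size(previous_note + "\n" + current_note)
                      - compressed_size(previous_note))
    return given_previous / alone
```

A copied-forward note with one added sentence should score much lower than a note written from scratch. Note that zlib only looks back about 32 KB, so this proxy would understate the redundancy for very long notes.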