
At Zebra-med we set out on a long journey to create an automated radiology assistant – to help radiologists become more efficient and enable them to take care of many more patients. With 2 billion people joining the middle class worldwide and a growing global shortage of clinical experts, there is a sense of urgency to develop technology which can help bridge the gap between supply and demand of radiology services – technology such as the one I’ll discuss here.

The chest X-ray scan is by far the most commonly performed radiological examination for screening and diagnosis of many cardiac and pulmonary diseases. It is also one of the hardest to interpret, and is known for high disagreement rates even between experienced radiologists. At Zebra-med, we have access to many millions of X-ray scans, as well as their associated anonymized textual reports written by hospital radiologists.

Can data from these scans and reports be used to teach an algorithm to identify significant clinical findings?

Sentence is King

Most of our reports are written in free Hebrew text. Here is an example (translated to English):

This text must undergo extensive “cleaning”, which involves all the usual problems of free text plus some extra hurdles specific to Hebrew, such as right-to-left artifacts. Once this tiresome cleaning is done, a close examination of the text reveals that the basic unit of information is the sentence. Radiologists are trained to write self-contained sentences, and hence a single sentence, even out of context, can often indicate the presence of a finding (a “positive” sentence) or its absence (a “negative” sentence).

Furthermore, we noticed that many of the sentences occur multiple times (up to tens of thousands of times) in our reports. This was especially true for negative sentences (perhaps because of the heavy use of templates and copy-paste), but also for positive sentences. Could we use this to our advantage? To our surprise, a simulation showed that by understanding only 20,000 sentences we could fully cover 1.5M reports! This didn’t seem like a very large number of sentences, so we decided that instead of building a fancy NLP system, we could start by just tagging them all! And so, Operation Textray began.
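A coverage simulation of this kind can be sketched in a few lines. The version below is only an illustration, not the code we ran: it assumes each report has already been cleaned and split into normalized sentences, and it calls a report “covered” when every one of its sentences is among the K most frequent sentences in the corpus. The function name `coverage_curve` and the toy reports are made up for the example.

```python
from collections import Counter

def coverage_curve(reports, budgets):
    """For each budget K, return the fraction of reports whose sentences
    all fall within the K most frequent sentences of the corpus.

    `reports` is a list of reports, each a list of normalized sentences.
    """
    # Count in how many reports each distinct sentence appears.
    counts = Counter(s for report in reports for s in set(report))
    # Rank sentences from most to least frequent (ties broken alphabetically).
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    rank = {s: i for i, s in enumerate(ranked)}
    # A report becomes covered once K exceeds the rank of its rarest sentence.
    rarest = [max((rank[s] for s in report), default=-1) for report in reports]
    return {k: sum(r < k for r in rarest) / len(reports) for k in budgets}

# Toy corpus (hypothetical sentences, already translated and cleaned).
reports = [
    ["no acute findings", "heart size normal"],
    ["no acute findings", "heart size normal"],
    ["no acute findings", "small left pleural effusion"],
    ["cardiac pacer in place", "heart size normal"],
]
print(coverage_curve(reports, [1, 2, 3, 4]))
```

Plotting the returned fractions against K shows how quickly the curve saturates, which is what tells you whether manual tagging is feasible.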

Operation “Textray”: Mapping our Sentences To Findings

What does it mean to “understand” a sentence? We noticed that in our data almost all the positive sentences relate to a relatively small number of findings. We asked one of our expert radiologists to compose a list of those findings, by going over the top positive sentences, adding categories as needed. She got back to us with just over 60 categories.

Armed with this ontology, we launched our sentence tagging operation. Our expert radiologist trained two Hebrew-speaking medical students to map positive sentences into the finding categories, and in a few weeks they mapped the top 20,000 sentences. Some sentences also mentioned the location of the finding (‘left’, ‘right’, ‘RLL’, etc.) or its severity (‘small’, ‘light’, ‘severe’), and the students were asked to tag those properties as well. At the end of this operation we were able to put the reports aside, and treat each of our studies as “a bag of findings”. More importantly, we finally had a bird’s eye view of our data and the findings it actually contained.
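To make the “bag of findings” idea concrete, here is a minimal sketch of how a tagged sentence table could be applied to a report. The `Finding` record, the `SENTENCE_TAGS` table and its three entries are hypothetical stand-ins for the real mapping produced by the students; only the overall shape (sentence to zero or more findings, each with optional location and severity) follows the description above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Finding:
    category: str                    # one of the finding categories in the ontology
    location: Optional[str] = None   # e.g. 'left', 'right', 'RLL'
    severity: Optional[str] = None   # e.g. 'small', 'light', 'severe'

# Hypothetical slice of the sentence table; negative sentences map to no findings.
SENTENCE_TAGS = {
    "small left pleural effusion": [Finding("pleural effusion", "left", "small")],
    "cardiac pacer in place": [Finding("cardiac pacer")],
    "no pleural effusion": [],
}

def bag_of_findings(report_sentences):
    """Return the set of findings in a report, plus any sentences not yet tagged."""
    bag, unmapped = set(), []
    for sentence in report_sentences:
        if sentence in SENTENCE_TAGS:
            bag.update(SENTENCE_TAGS[sentence])
        else:
            unmapped.append(sentence)  # outside the tagged set, needs review
    return bag, unmapped

bag, unmapped = bag_of_findings(
    ["small left pleural effusion", "no pleural effusion", "unseen sentence"]
)
print(bag, unmapped)
```

Keeping the untagged sentences separate, rather than silently dropping them, makes it easy to see which reports are not yet fully covered.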

Over half of our X-ray studies are chest X-rays, and most of them include both a frontal view (“PA”) and a lateral view, so we decided to target these types of studies first. We ended up compiling a dataset with close to one million studies – 1.7M images, and 40 clinical findings. This dataset is almost 9 times larger than the largest published dataset (cxr14) to date, and probably the largest dataset anyone has ever trained on for this task!

Each segment of the circle circumference represents studies that have a particular finding, and the bands represent studies that have the two connected findings (for this analysis we show only studies with one or two findings).

Now the fun part

Using the labels extracted from the reports, we were able to build a neural network that receives as input the two X-ray views and outputs a list of 40 probabilities, each corresponding to the chance of the patient having a particular finding.

Frontal (PA) and lateral view images each go through a separate CNN. A fully-connected layer is applied to their concatenated feature vectors and emits the confidence for each finding. Training labels are extracted by analyzing the report sentences: negative (green) and positive (red) sentences are identified, findings in positive sentences receive a positive training label, and negative or unmentioned findings receive a negative label.
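The two steps in this caption, the labeling rule and the fusion of the two views, can be illustrated without a deep learning framework. The sketch below is not our network: the CNNs are left out entirely and replaced by plain feature vectors, the three-item `FINDINGS` list stands in for the 40 findings, and the per-finding sigmoid is an assumption about how the confidences are produced.

```python
import math

# Stand-in for the list of 40 findings.
FINDINGS = ["cardiomegaly", "pleural effusion", "cardiac pacer"]

def training_labels(positive_findings):
    """1 for findings tagged in a positive sentence; 0 for findings that are
    negated or simply not mentioned in the report."""
    return [1 if f in positive_findings else 0 for f in FINDINGS]

def fusion_head(frontal_feats, lateral_feats, weights, biases):
    """Concatenate the per-view feature vectors and apply one fully-connected
    layer, followed by a sigmoid per finding (assumed, multi-label setup)."""
    x = list(frontal_feats) + list(lateral_feats)
    confidences = []
    for w_row, b in zip(weights, biases):
        z = sum(w * xi for w, xi in zip(w_row, x)) + b
        confidences.append(1.0 / (1.0 + math.exp(-z)))
    return confidences

# One weight row per finding, one weight per concatenated feature.
weights = [[0.5, -0.2, 0.1, 0.3] for _ in FINDINGS]
biases = [0.0 for _ in FINDINGS]
print(training_labels({"pleural effusion"}))
print(fusion_head([1.0, 2.0], [0.5, -1.0], weights, biases))
```

In a real model the two feature vectors would come from the frontal and lateral CNNs, and the weights would be learned jointly with them.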

We report the model’s AUCs on a held-out validation set. In the figure below we compare our base model (which uses two views) to a version that uses only the frontal view (like most prior work on chest X-ray). It turns out the lateral view improves the performance for many findings!
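For readers less familiar with the metric: the AUC of a finding is the probability that a randomly chosen positive study gets a higher score than a randomly chosen negative one, with ties counted as half. The quadratic-time sketch below computes exactly that for one finding; the function name and the toy numbers are illustrative, and a real evaluation would normally use a library routine.

```python
def auc(labels, scores):
    """Area under the ROC curve for one finding, via pairwise comparison.

    `labels` are 0/1 ground-truth values, `scores` the model confidences.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Running this once per finding, for both the two-view and the frontal-only model, gives the pairs of numbers needed for a comparison like the one described above.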

Do you see the cluster of findings at the top right of the figure? Our model achieves near-perfect performance on them. These are mostly findings that indicate the presence of artificial objects (e.g. a cardiac pacer), which, not surprisingly, are easier to detect.

Once the network flags a finding as positive, we can probe the network and create heatmaps that show which parts of the image were the most indicative of the finding:

The coffee-break criterion

For 12 chosen findings we evaluated our model more rigorously, by comparing its performance to that of a team of 3 radiologists who independently tagged the same validation set (one set per finding). Even though we had heard it many times, we were surprised by the low level of agreement between the radiologists (ranging from 62% to 90%). In 8 cases we found that our radiologists are more likely to agree with our model than with each other! We invented “the coffee-break criterion”, which, if true, indicates that a radiologist going for a coffee break would rather have the model replace them than any of their peers. In 3 of those 8 cases (pulmonary edema, elevated diaphragm, abnormal aorta) this effect was statistically significant (i.e. the CI did not include 0).
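One way to turn the coffee-break criterion into a number is sketched below. This is an assumed formulation, not necessarily the exact statistic from our paper: the “gap” is the mean model-radiologist agreement minus the mean radiologist-radiologist agreement, and a percentile bootstrap over studies gives a confidence interval for it. All names and the toy labels are illustrative.

```python
import random
from itertools import combinations

def agreement(a, b):
    """Fraction of studies on which two raters give the same binary label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def coffee_break_gap(model, radiologists):
    """Mean model-radiologist agreement minus mean radiologist-radiologist agreement."""
    with_model = sum(agreement(model, r) for r in radiologists) / len(radiologists)
    pairs = list(combinations(radiologists, 2))
    with_peers = sum(agreement(a, b) for a, b in pairs) / len(pairs)
    return with_model - with_peers

def bootstrap_ci(model, radiologists, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the gap, resampling studies with replacement."""
    rng = random.Random(seed)
    n = len(model)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_model = [model[i] for i in idx]
        resampled_rads = [[r[i] for i in idx] for r in radiologists]
        gaps.append(coffee_break_gap(resampled_model, resampled_rads))
    gaps.sort()
    return gaps[int(alpha / 2 * n_boot)], gaps[int((1 - alpha / 2) * n_boot) - 1]

# Toy labels for one finding on four studies (model thresholded to 0/1).
model = [1, 1, 0, 0]
radiologists = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1]]
print(coffee_break_gap(model, radiologists))
print(bootstrap_ci(model, radiologists))
```

Under this formulation the criterion holds for a finding when the gap is positive, and the effect is significant when the whole interval lies above 0.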

Embrace the noise

Some critics of the cxr14 dataset, whose labels were also obtained from textual analysis of reports, claim that such labels are unreliable and cannot be used for training. We agree that report-based labels are noisy. But are they noisier than our own radiologists? It turns out that if we add the report label as “another radiologist”, it performs similarly to the average radiologist. Our conclusion is therefore that the report labels, provided they were extracted correctly, add valuable information, and can be treated as the opinion of a regular radiologist.

Some critics argue that it is impossible to train a model whose predictions are better than the quality of its labels. In other words, a model that is trained on noisy labels would produce noisy predictions. We reject this claim. As long as the positive set is saturated with true positives, and the negative set is saturated with true negatives, the signal can beat the noise. The literature shows that with a large enough dataset, the model performance can easily surpass the quality of its noisy labels in many domains. The recent CheXNet paper from Stanford and our Textray paper show it is true for the chest X-ray domain as well. Our lesson from this work is that a larger training set more than compensates for the label noise, and we call upon other researchers to embrace the noise in the reports.
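A toy simulation makes the point tangible. It has nothing to do with X-rays and is not one of our experiments: it assumes a one-dimensional feature, a true label given by its sign, and labels that are flipped at random 20% of the time. The “model” is just a threshold placed halfway between the means of the two noisy classes.

```python
import random

def simulate(n=5000, flip_rate=0.2, seed=0):
    """Train a threshold classifier on randomly flipped labels and compare
    the accuracy of the noisy labels with that of the model, both measured
    against the true labels."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    truth = [x > 0 for x in xs]
    # Flip each label independently with probability `flip_rate`.
    noisy = [(not t) if rng.random() < flip_rate else t for t in truth]
    # "Training": threshold halfway between the means of the two noisy classes.
    pos = [x for x, y in zip(xs, noisy) if y]
    neg = [x for x, y in zip(xs, noisy) if not y]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    preds = [x > threshold for x in xs]
    label_acc = sum(y == t for y, t in zip(noisy, truth)) / n
    model_acc = sum(p == t for p, t in zip(preds, truth)) / n
    return label_acc, model_acc

label_acc, model_acc = simulate()
print(f"noisy labels: {label_acc:.3f}, model: {model_acc:.3f}")
```

Because the flips are random, they pull both class means toward the middle by the same amount and leave the threshold near the true boundary, so the model ends up far more accurate than the labels it learned from. Real label noise is rarely this well behaved, which is why the size of the dataset matters.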

For us here at Zebra-med, this is just the beginning. We have far more reports and studies waiting to be analyzed, in English as well as Hebrew, and in other modalities too. Our data enables us to track patients over years of treatment. We can give better predictions using the scans from the patient history, and we can use scans and reports “from the future” to provide even better labels for the current scans. We are excited by the sea of possibilities.