IBM’s Watson versus cancer: Hype meets reality

Five years ago, IBM announced that its supercomputer Watson would revolutionize cancer treatment by using its artificial intelligence to digest and distill the thousands of oncology studies published every year, plus patient-level data and expert recommendations, into treatment recommendations. Last week, a report published by STAT News showed that, years later, IBM’s hubris and hype have crashed into reality.

Watson defeats two human champions. Unfortunately, it doesn’t seem to be doing as well for cancer.

For nearly as long as I can remember, I’ve been a fan of Jeopardy! Indeed, if I’m at home at 7:30 PM on a weeknight, Jeopardy! will usually be on the television. Given that, I remember what was basically a bit of stunt programming in 2011, when Jeopardy! producers had IBM’s artificial intelligence supercomputer Watson face off against two of the most successful champions in the history of the show, Ken Jennings and Brad Rutter. Watson won, leading Jennings to add to his Final Jeopardy answer, “I, for one, welcome our new computer overlords.”

Watson Oncology is a cognitive computing system designed to support the broader oncology community of physicians as they consider treatment options with their patients. Memorial Sloan Kettering clinicians and analysts are partnering with IBM to train Watson Oncology to interpret cancer patients’ clinical information and identify individualized, evidence-based treatment options that leverage our specialists’ decades of experience and research.

As Watson Oncology’s teacher, we are advancing our mission by creating a powerful resource that will help inform treatment decisions for those who may not have access to a specialty center like MSK. With Watson Oncology, we believe we can decrease the amount of time it takes for the latest research and evidence to influence clinical practice across the broader oncology community, help physicians synthesize available information, and improve patient care.

Not surprisingly, Watson’s entry into cancer care and interpretation of cancer genomics was, just like its appearance on Jeopardy!, highly hyped, with overwhelmingly positive press coverage and little in the way of skeptical examination of what, exactly, Watson could potentially do and whether it could actually improve patient outcomes. Overall, as Watson moved into the clinical realm, you’d be hard-pressed not to think that this was a momentous development that would change cancer care forever for the better. There were plenty of headlines like “IBM to team up with UNC, Duke hospitals to fight cancer with big data” and “The future of health care could be elementary with Watson.” The future looked bright.

Watson: Hype versus reality

In the story, STAT looked at Watson for Oncology’s use, marketing, and actual performance in hospitals around the world, interviewing dozens of doctors, IBM executives, and artificial intelligence experts and concluded that IBM released a product without having fully assessed or understood the challenges in deploying it and without having published any papers demonstrating that the technology works as advertised, noting that, as a result, “its flaws are getting exposed on the front lines of care by doctors and researchers who say that the system, while promising in some respects, remains undeveloped.” From my perspective, that’s an understatement. Indeed, STAT observes:

Perhaps the most stunning overreach is in the company’s claim that Watson for Oncology, through artificial intelligence, can sift through reams of data to generate new insights and identify, as an IBM sales rep put it, “even new approaches” to cancer care. STAT found that the system doesn’t create new knowledge and is artificially intelligent only in the most rudimentary sense of the term.

While Watson became a household name by winning the TV game show “Jeopardy!”, its programming is akin to a different game-playing machine: the Mechanical Turk, a chess-playing robot of the 1700s, which dazzled audiences but hid a secret — a human operator shielded inside.

In the case of Watson for Oncology, those human operators are a couple dozen physicians at a single, though highly respected, U.S. hospital: Memorial Sloan Kettering Cancer Center in New York. Doctors there are empowered to input their own recommendations into Watson, even when the evidence supporting those recommendations is thin.

Another way of saying this is that Watson isn’t really an artificial intelligence when it comes to cancer, but rather a very powerful computer that is very good at coming up with treatment plans based on the rules and recommendations that humans have taught it. An example from a hospital in Florida illustrates the point:

On a recent morning, the results for a 73-year-old lung cancer patient were underwhelming: Watson recommended a chemotherapy regimen the oncologists had already flagged.

“It’s fine,” Dr. Sujal Shah, a medical oncologist, said of Watson’s treatment suggestion while discussing the case with colleagues.

He said later that the background information Watson provided, including medical journal articles, was helpful, giving him more confidence that using a specific chemotherapy was a sound idea. But the system did not directly help him make that decision, nor did it tell him anything he didn’t already know.

But it’s more than that. You might have noted in the MSKCC blurb I quoted above that MSKCC is described as “Watson’s teacher.” That is very literally true. Indeed, the STAT story refers to Watson as “essentially Memorial Sloan Kettering in a portable box,” noting that its treatment recommendations are “based entirely on the training provided by doctors, who determine what information Watson needs to devise its guidance as well as what those recommendations should be.” This reliance on a single institution introduces an incredible bias. MSKCC is, of course, one of the premier cancer centers in the world, but it’s a tertiary care center. The patients seen there are not like the patients seen at most places or, to some extent, even at my cancer center. They differ in both racial mix and socioeconomic status. (MSKCC tends to attract more affluent patients.) Also, the usual differences between the patient mix in a tertiary care center and a typical hospital are more pronounced, because not only is MSKCC a tertiary care center, but it’s one of the premier cancer tertiary care centers in the world. There are more advanced and unusual cases, patients who have failed multiple lines of treatment and are looking for one last chance. The mix of patients, cancers, and other factors that doctors at MSKCC see might not be relevant to hospitals elsewhere in the world, or even in different parts of the US. As Pilar Ossorio, a professor of law and bioethics at University of Wisconsin Law School, points out in the article, from the cases used to train Watson, what Watson will learn is “race, gender, and class bias,” basically “baking those social stratifications in” and “making the biases even less apparent and even less easy for people to recognize.”

Bias is inevitable, particularly when it is only one institution’s physicians who are doing the teaching.

It’s also widely known in the oncology community that there is a “MSKCC way” of doing things that might not always agree with other centers. Yet IBM denies that reliance on a single institution to “teach” Watson injects bias, to the point where I literally laughed out loud (and was half tempted to insert an emoji indicating that) when I read a quote by Watson Health general manager Deborah DiSanzo, saying, “The bias is taken out by the sheer amount of data we have.” (She is referring to patient cases and millions of articles and studies fed into Watson.) I can’t help but also note that it isn’t just treatment guidelines that MSKCC is providing. It’s basically choosing all the medical literature whose results are fed into Watson to help craft its recommendations. As I read the STAT article, as a clinician and scientist myself, I couldn’t help but marvel that IBM appears blissfully unaware that this is a self-reinforcing system, in which one institution’s doctors would tend to recommend the very literature that supports the treatment recommendations that they prefer.

And, MSKCC being MSKCC (i.e., a bit arrogant), the doctors “training” Watson don’t see the bias as a problem:

Doctors at Memorial Sloan Kettering acknowledged their influence on Watson. “We are not at all hesitant about inserting our bias, because I think our bias is based on the next best thing to prospective randomized trials, which is having a vast amount of experience,” said Dr. Andrew Seidman, one of the hospital’s lead trainers of Watson. “So it’s a very unapologetic bias.”

I laughed out loud at that quote, too. Having a “vast amount of experience” without having clinical trials upon which to base treatments can just as easily lead to continuing treatments that don’t work or hanging on to beliefs that are never challenged by evidence. I’m not saying that having experience is a bad thing. Far from it! However, if that experience is not tempered by humility, bad things can happen. It’s the lack of humility that I perceive here that troubles me. There are awesome cancer doctors elsewhere in the world, too, you know:

In Denmark, oncologists at one hospital said they have dropped the project altogether after finding that local doctors agreed with Watson in only about 33 percent of cases.

“We had a discussion with [IBM] that they had a very limited view on the international literature, basically, putting too much stress on American studies, and too little stress on big, international, European, and other-part-of-the-world studies,” said Dr. Leif Jensen, who directs the center at Rigshospitalet in Copenhagen that contains the oncology department.

And:

Sometimes, the recommendations Watson gives diverge sharply from what doctors would say for reasons that have nothing to do with science, such as medical insurance. In a poster presented at the Global Breast Cancer Conference 2017 in South Korea, researchers reported that the treatment Watson most often recommended for breast cancer patients simply wasn’t covered by the national insurance system.

None of this is surprising, given that Watson is trained by American doctors at one very prestigious American cancer center.

Then there’s a rather basic but fundamental problem with Watson, and that’s getting patient data entered into it. Hospitals wishing to use Watson must either find a way to interface their electronic health records with Watson or hire people to manually enter patient data into the system. Indeed, IBM representatives admitted that teaching a machine to read medical records is “a lot harder than anyone thought.” (Actually, this rather reminds me of Donald Trump saying, “Who knew health care could be so complicated?” in response to the difficulty Republicans had coming up with a replacement for the Affordable Care Act.) The answer: Basically anyone who knows anything about it. Anyone who’s ever tried to wrestle health care information out of a medical record, electronic or paper, into a form in a database that can be used to do retrospective or prospective studies knows how hard it is. Heck, just from my five years of experience working on a statewide collaborative quality initiative (CQI) in breast cancer, I know how hard it is, and what we were doing in our CQI was nowhere near as complex as what IBM is trying to do with Watson. For instance, we were looking at only one cancer (breast) and a subset of one state (25 institutions in Michigan), and we were not trying to derive new knowledge, but rather to look at aspects of care where the science and recommendations are clear and we could compare what our member institutions were doing to the best existing evidence-based guidelines.

What can Watson actually do?

IBM represents Watson as being able to look for patterns and derive treatment recommendations that human doctors might otherwise not be able to come up with because of our human shortcomings in reading and assessing the voluminous medical literature, but what Watson can actually do is really rather modest. That’s not to say it’s not valuable and won’t get better with time, but the problem is that it doesn’t come anywhere near the hype. I mentioned that there haven’t been any peer-reviewed studies on Watson in the medical literature yet, but that doesn’t mean there are no data yet. At the American Society of Clinical Oncology (ASCO) meeting this year, there were three abstracts presented reporting the results of studies using Watson in cancer care:

The first study carried out at the Manipal Comprehensive Cancer Centre in Bangalore, India, looked at Watson’s concordance with a multi-disciplinary tumour board used for lung, colon and rectal cancer cases. The AI achieved a concordance rate of 96% for lung, 81% for colon and 93% for rectal cancer.

The second study compared Watson’s recommendations to those made by oncologists at Bumrungrad International Hospital in Bangkok, Thailand – this time across multiple cancer types. Its concordance rate was 83%.

The third concordance study compared Watson’s decisions for high-risk colon cancer to a tumour board from Gachon University Gil Medical Centre in Incheon, South Korea. Its concordance rate in terms of colon cancer decisions was 73%, however, it was only 43% in gastric cancer.

The company explained this was due to differences in treatment guidelines for the disease in South Korea, compared to where it was trained at Memorial Sloan Kettering.

This is mighty thin gruel after such grandiose claims for the technology. Sure, it’s a very good thing that Watson agrees with evidence-based guidelines a high percentage of the time. It’s not so great that its concordance with recommendations was so low for gastric cancer, but it is that lack of concordance that shows the weakness of a system so dominated by American oncologists and cancer surgeons. Treatment recommendations for gastric cancer in Asia differ so markedly from those in the US because of differences in prevalence (gastric cancer is much more common in Asia) and even biology.
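For readers who haven’t run into the term, a concordance rate is nothing fancier than the share of cases in which the system’s recommendation matched the humans’. Here is a minimal Python sketch of that arithmetic. The case data, the regimen labels, and the function name `concordance_by_cancer` are all made up for illustration, and I’m assuming the simplest possible definition (an exact match counts as concordant); the actual abstracts may well have defined concordance differently.

```python
from collections import defaultdict


def concordance_by_cancer(cases):
    """Return {cancer_type: fraction of cases where both recommendations match}.

    Each case is a (cancer_type, system_recommendation, board_recommendation)
    tuple. Exact equality counts as concordant (an assumption of this sketch).
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for cancer_type, system_rec, board_rec in cases:
        totals[cancer_type] += 1
        if system_rec == board_rec:
            matches[cancer_type] += 1
    return {cancer: matches[cancer] / totals[cancer] for cancer in totals}


# Hypothetical, made-up cases; not data from any of the studies quoted above.
cases = [
    ("colon", "regimen A", "regimen A"),
    ("colon", "regimen A", "regimen A"),
    ("colon", "regimen B", "regimen A"),
    ("colon", "regimen B", "regimen B"),
    ("gastric", "regimen C", "regimen D"),
    ("gastric", "regimen C", "regimen C"),
]

rates = concordance_by_cancer(cases)
for cancer, rate in rates.items():
    print(f"{cancer}: {rate:.0%} concordance")
```

Note what the number does and doesn’t say: it measures agreement with one group of doctors, not whether patients did any better.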

Of course, it’s important that Watson be able to replicate evidence-based treatment recommendations for common cancers, but you don’t need a computer to do that, much less an AI. What IBM hyped was Watson’s supposed ability to “think outside the box” (if you’ll excuse the term) and come up with recommendations that humans would not have thought of and that would result in better outcomes for cancer patients. Even these modest results are being hyped in the form of embarrassing headlines. For instance, ASCO, touting the results of the three studies presented at its annual meeting and other results, wrote “How Watson for Oncology Is Advancing Personalized Patient Care.” It read like a press release from IBM. Another article proclaimed that “IBM’s Watson is really good at creating cancer treatment plans.” That’s nice. So are nearly all oncologists, and it’s debatable whether Watson is even as good as a typical oncologist.

The M.D. Anderson experience

The M.D. Anderson Cancer Center was, along with MSKCC, one of the early adopters of Watson. Its experience with the project is another cautionary note that shows what can happen when not enough skepticism is applied to a project and how a project like Watson can turn into a massive boondoggle. This was revealed when the partnership between M.D. Anderson and IBM basically fell apart earlier this year:

According to a blistering audit by the University of Texas System, the cancer center grossly mismanaged its splashy program with IBM, which started back in 2012. The program aimed to teach Watson how to treat cancer patients and match them to clinical trials. Watson initially met goals and impressed center doctors, but the project hit the rocks as MD Anderson officials snubbed their own IT experts, mishandled about $62 million in funding, and failed to follow basic procedures for overseeing contracts and invoices, the audit concludes.

IBM pulled support for the project back in September of last year. Watson is currently prohibited from being used on patients there, and the fate of MD Anderson’s partnership with IBM is in question. MD Anderson is now seeking bids from other contractors who might take IBM’s place.

Usually, companies pay research centers to do research on their products; in this case, MD Anderson paid for the privilege, although it would have apparently also owned the product. This was a “very unusual business arrangement,” says Vinay Prasad, an oncologist at Oregon Health & Science University.

According to the audit report, Chin went around normal procedures to pay for the expensive undertaking. The report notes “a consistent pattern of PwC fees set just below MD Anderson’s Board approval threshold,” and its appendix seems to indicate this may have occurred with payments to IBM, too. She also didn’t get approval from the information technology department.

Hype and hubris in AI: Beyond IBM

It’s very clear that AI will play an increasingly large role in medicine. The massive amount of genomic data being applied to “personalized medicine,” or, as it’s now more commonly called, “precision medicine,” basically demands it because no human can sift through the terabytes and petabytes of genomic data without assistance to find patterns that can be exploited in treatment. I have no problem with any of that. What I do have a problem with is hype, and IBM is clearly guilty of massively hyping its Watson product before it was ready for prime time, apparently not recognizing just how difficult it would be to train Watson well enough for scientific reality to catch up with company hype.

One way to think about it is to consider how machine learning works, how AI is trained to recognize patterns, come to conclusions, and make recommendations. In other words, how can a machine go beyond human-curated data and recommendations? It’s incredibly difficult:

To understand what’s slowing the progress, you have to understand how machine-learning systems like Watson are trained. Watson “learns” by continually rejiggering its internal processing routines in order to produce the highest possible percentage of correct answers on some set of problems, such as which radiological images reveal cancer. The correct answers have to be already known, so that the system can be told when it gets something right and when it gets something wrong. The more training problems the system can chew through, the better its hit rate gets.

That’s relatively simple when it comes to training the system to identify malignancies in x-rays. But for potentially groundbreaking puzzles that go well beyond what humans already do, like detecting the relationships between gene variations and disease, Watson has a chicken-and-egg problem: how does it train on data that no experts have already sifted through and properly organized? “If you’re teaching a self-driving car, anyone can label a tree or a sign so the system can learn to recognize it,” says Thomas Fuchs, a computational pathologist at Memorial Sloan-Kettering, a cancer center in New York. “But in a specialized domain in medicine, you might need experts trained for decades to properly label the information you feed to the computer.”

That’s the bias introduced by relying on MSKCC physicians. It’s a bias that’s much worse than it needs to be because of how IBM relies on one institution and one relatively small group of physicians to train Watson, but, in fairness, it is an unavoidable bias at this stage in the development of an AI. The problem, as it all too often is, is arrogance. IBM appears to have vastly underestimated the challenge in moving beyond the training dataset (as it’s often called in studies like this), for which the answers are known to the system before it does its analysis, to the validation dataset (for which the answers are not given to the system in advance).
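To make the training-versus-validation distinction concrete, here is a toy Python sketch. It is emphatically not how Watson works; it is a one-parameter “classifier” that picks whatever cutoff gets the most right answers on a handful of labeled examples, then gets scored on examples it never saw. The scores, labels, and function names (`accuracy`, `train`) are all hypothetical and invented for illustration.

```python
def accuracy(threshold, examples):
    """Fraction of (score, label) pairs that the rule 'score >= threshold' gets right."""
    correct = sum((score >= threshold) == label for score, label in examples)
    return correct / len(examples)


def train(examples):
    """'Learn' by choosing the cutoff with the highest accuracy on labeled examples."""
    candidates = sorted(score for score, _ in examples)
    return max(candidates, key=lambda t: accuracy(t, examples))


# Made-up data: (image score, expert label for "malignant").
# The labels are the expert-supplied "correct answers" the system trains against.
training_set = [
    (0.10, False), (0.20, False), (0.30, False), (0.35, False),
    (0.40, True), (0.60, True), (0.80, True), (0.90, True),
]
# Cases held back from training; the system never sees these labels while learning.
held_out_set = [
    (0.15, False), (0.38, True), (0.45, False), (0.50, True), (0.70, True),
]

threshold = train(training_set)
train_acc = accuracy(threshold, training_set)
held_out_acc = accuracy(threshold, held_out_set)

print(f"learned cutoff: {threshold}")
print(f"accuracy on training data: {train_acc:.0%}")
print(f"accuracy on held-out data: {held_out_acc:.0%}")
```

In this contrived example the cutoff that looks perfect on the training cases stumbles on the held-out ones, which is the gap between “it reproduces what its teachers told it” and “it works on patients it hasn’t seen.”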

None of this is to say that AI won’t eventually make a major contribution to the treatment of cancer and other diseases. Rather, it’s just to say that we’re nowhere near there yet. Moreover, IBM is no longer the only player in this game, as has been noted:

Since Watson’s “Jeopardy!” demonstration in 2011, hundreds of companies have begun developing health care products using artificial intelligence. These include countless startups, but IBM also faces stiff competition from industry titans such as Amazon, Microsoft, Google, and the Optum division of UnitedHealth Group.

Google’s DeepMind, for example, recently displayed its own game-playing prowess, using its AlphaGo program to defeat a world champion in Go, a 3,000-year-old Chinese board game.

DeepMind is working with hospitals in London, where it is learning to detect eye disease and speed up the process of targeting treatments for head and neck cancers, although it has run into privacy concerns.

Meanwhile, Amazon has launched a health care lab, where it is exploring opportunities to mine data from electronic health records and potentially build a virtual doctor’s assistant.

A recent report by the financial firm Jefferies said IBM is quickly losing ground to competitors. “IBM appears outgunned in the war for AI talent and will likely see increasing competition,” the firm concluded.

But the “cognitive computing” technologies under the Watson umbrella aren’t as unique as they once were. “In the data-science community the sense is that whatever Watson can do, you can probably get as freeware somewhere, or possibly build yourself with your own knowledge,” Claudia Perlich told Gizmodo, a professor and data scientist who worked at IBM Watson Research Center from 2004 to 2010 (at the same time Watson was being built), before becoming the chief scientist at Dstillery, a data-driven marketing firm (a field that IBM is also involved with). She believes a good data-science expert can create Watson-like platforms “with notably less financial commitment.”

Nor is any of this to say that IBM is alone in its hubris. It’s not. This hubris is shared by many tech companies, particularly those working on computing and AI. For instance, last year Microsoft was roundly (and properly) mocked for its claim that it was going to “solve cancer” in a decade based on this idea:

The company is working at treating the disease like a computer virus, that invades and corrupts the body’s cells. Once it is able to do so, it will be able to monitor for them and even potentially reprogramme them to be healthy again, experts working for Microsoft have said.

The company has built a “biological computation” unit that says its ultimate aim is to make cells into living computers. As such, they could be programmed and reprogrammed to treat any diseases, such as cancer.

And:

“The field of biology and the field of computation might seem like chalk and cheese,” Chris Bishop, head of Microsoft Research’s Cambridge-based lab, told Fast Company. “But the complex processes that happen in cells have some similarity to those that happen in a standard desktop computer.”

As such, those complex processes can potentially be understood by a desktop computer, too. And those same computers could be used to understand how cells behave and to treat them.

Yes, cancer resembles computing in much the same way that counting on your fingers resembles a supercomputer. The hubris on display was unbelievable. My reaction was virtually identical to Derek Lowe’s, only more so. Indeed, he perfectly characterized the attitude of many in tech companies working on cancer as a “Gosh darn it fellows, do I have to do everything myself?” attitude. Yes, those of us who do cancer research and take care of cancer patients do tend to get a bit…testy…when someone waltzes onto the scene and proclaims to breathless headlines that he’s going to solve cancer in a decade because he has an insight that you stupid cancer biologists never thought of before: The cell is just a computer, and cancer is like a computer virus.

But I digress. I only mention Microsoft to demonstrate that IBM is not alone when it comes to tech companies and hubris about cancer. In any event, I made an analogy to Donald Trump earlier in this post. I was not surprised to find this article making a similar analogy:

“IBM Watson is the Donald Trump of the AI industry—outlandish claims that aren’t backed by credible data,” said Oren Etzioni, CEO of the Allen Institute for AI and former computer science professor. “Everyone—journalists included—know[s] that the emperor has no clothes, but most are reluctant to say so.”

Etzioni, who helps research and develop new AI that is similar to some Watson APIs, said he respects the technology and people who work at Watson, “But their marketing and PR has run amok—to everyone’s detriment.”

Former employees who worked on Watson Health agree and think the way that IBM overhypes Watson for Oncology is especially detrimental. One former IBM Watson Health researcher and UX designer told Gizmodo of a time they shadowed an oncologist at a cancer center that has partnered with IBM to train Watson for Oncology. The designer claims they spoke with patients who had heard of Watson and asked when it could be used to help them with their disease. “That was actually pretty heartbreaking for me as a designer because I had seen what Watson for Oncology really is and I was very painfully aware of its limitations,” the designer said. “It felt very bad and it felt like there was real hope that had been served by IBM marketing that could not be supported by the product I know.”

That’s part of the problem. Patients see the hype and believe it. They then want what IBM is offering, even if it is not ready for prime time. Watson Health general manager Deborah DiSanzo even said, “We’re seeing stories come in where patients are saying, ‘It gave me peace of mind,’” and concluded, “That makes us feel extraordinarily good that what we’re doing is going to make a difference for patients and their physicians.” Patient peace of mind is important, but not as important as actually producing a product that demonstrably improves patient outcomes.

Again, don’t get me wrong. AI is very likely to be quite important in years (more likely decades) to come in health care. Maybe one day it will lead to a real Tricorder, just like in the original Star Trek series. It’s just not there yet. I suspect that Watson will not be the last medical AI effort to fail to live up to its early grandiose claims.