POWERDOT Awarded $500,000, and Announcing Heritage Health Prize 2.0

Team POWERDOT joined forces last October after duking it out separately as former rivals and milestone prize winners. Team members include David Vogel, Chief Scientist of Voloridge Investment Management; Dr. Randy Axelrod, Executive VP of Providence Health & Services; Rie Johnson, a machine learning researcher; Willem Mestrom, a Business Intelligence specialist at Independer in the Netherlands; Edward de Grijs, an engineer and software developer, also from the Netherlands; Tong Zhang, a machine learning researcher; and Phil Brierley, an Analytics Consultant at Tiberius Data Mining in Australia. Vogel, Axelrod, Mestrom, and de Grijs accepted the prize winnings today on behalf of the group at Health Datapalooza.

Building on the efforts of the HHP, we are very excited to announce that the Heritage Provider Network (HPN) is launching a $3 million private "masters" competition, which Kaggle will also host. The competition will be open to the top eligible finishers from the first Heritage Health Prize.

The challenge will be the same as the first prize, predicting hospitalization of individuals, with one very substantial difference: there will be little, if any, data anonymization. For privacy reasons, the public competition used data that had been very heavily anonymized. For example, nearly all information about prescriptions was held back, and diagnostic information from lab results was summarized down to a few high-level indicators. Furthermore, information like age was categorized into a few bands; the exact age of patients was not provided. In fact, the anonymization process was so complex that the approach was detailed in a peer-reviewed academic journal.
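As a concrete illustration, the age banding described above amounts to a lossy transform like the following minimal Python sketch. The band edges here are hypothetical, not the ones HPN actually used:

```python
# Hypothetical age-banding transform of the kind described above:
# the exact age is discarded and only the band label is released.
# Band edges are illustrative only.
def band_age(age: int) -> str:
    """Map an exact age to a coarse band, discarding the exact value."""
    if age >= 80:
        return "80+"
    lo = (age // 10) * 10          # e.g. 34 -> 30
    return f"{lo}-{lo + 9}"        # e.g. "30-39"
```

Once every record carries only "30-39", no model can recover whether a patient was 30 or 39, which is exactly the kind of signal the masters competition would restore.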

Noted data scientist Pete Warden has explained that "you can't really anonymize your data," while also pointing out that "there's so much good that can be accomplished using open datasets, it would be a tragedy if we let this slip through our fingers ..." This new competition will be the first time that the impact of data anonymization on health outcomes can really be measured, and it will likely provide strong evidence that a more nuanced approach to open-data legislation could greatly improve health outcomes.

This will also be the first time that there has been an invitation-only Kaggle competition with such a large purse. It will be very exciting to see how the world's best data scientists respond to this great challenge.

I'm more than a little disappointed this is only open to the teams that originally participated. SOME of us looked at the data right at the beginning, saw that it was too dirty to be worth the time, and didn't continue. Many others (look at the early forum posts) did the same. Now that it's on to the real data, there's no private signup or any accessible form of registration. I understand the need for anonymity and some type of screening, but only letting your "B" team compete (i.e., the ones who wanted the prize money even after they knew the math wouldn't really work out) isn't fair. Or rather, it's just going to be another opportunity thrown away by Heritage Health.

To those of you who weren't involved in this early on, or are just tuning in: Heritage posted a large reward for a formula to predict hospital stays based on past hospital admissions. The data they provided was horrible. Five-year-old boys were listed as pregnant. When pressed, Kaggle responded that "pregnant" was the category assigned to the child when the file was first opened. The mother came in while pregnant, they opened a file for the child, and the child's category from that point on was a pregnancy, even six years later. Some of us bailed on the competition at this point.

As the competition wore on and the best teams were only marginally better than naively taking the global average and guessing, Heritage and Kaggle had a problem. They were contractually obligated to pay $3 million to whoever had the "least worst" formula. It was a waste of time and money; Heritage wouldn't admit their historical data was awfully kept, and Kaggle wasn't getting them any promising results. Everyone was in trouble.
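For readers who weren't following along, the "global average" baseline referred to above can be sketched as follows. This is an illustration only, scored with a root-mean-squared error on log(days + 1) in the spirit of the HHP metric; the exact competition scoring details are assumed here, not quoted:

```python
import math

def constant_baseline_rmsle(train_days, test_days):
    """Predict the same number (the training mean) for every patient,
    scored with RMSE on log(days + 1), an RMSLE-style error."""
    mean = sum(train_days) / len(train_days)
    sq_errs = [(math.log(mean + 1) - math.log(d + 1)) ** 2
               for d in test_days]
    return math.sqrt(sum(sq_errs) / len(sq_errs))
```

Because most patients have zero days in hospital, this trivial one-number predictor is hard to beat by much, which is the commenter's point.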

Now this "new" competition is an attempt to solve that problem. Heritage will release the "real" data (which will be just as dirty, except now the pregnant 5-year-old boy will have a name), thinking that somehow they'll get better results (they will, slightly, since names allow guesses at things like race and socioeconomic status). Kaggle can't scrap the entire competition, since paperwork was signed with Heritage and teams have been working on this for, literally, years.

And here we are.

Heritage, you won't get the formula you're looking for. You're right that non-anonymized data will be better, but it won't be worth $3 million to you. Kaggle, some of your best teams saw the competition for what it was early on. Either find a way to be more inclusive, or accept that your "contest" isn't fair, and your flagship "$3 Million Prize" competition is a failure.

Regards,

- Charles
(early participant and forum poster in the original competition)

TheWinch

Background: I started work nearly three decades ago in healthcare data analysis, as one of the early employees at one of the three organizations doing healthcare insurance data analysis in the USA: the CHAMP division of Mercer Inc. (CHAMP is now part of Thomson Reuters).

A HUGE percentage of our time was spent on data cleaning (so reading about pregnant 5-year-olds in this data is not surprising). For just one employer with 10,000 employees, we charged many tens of thousands of dollars (in 1988 dollars) to do the data cleanup. A proper data cleanup on a properly sized database (more on that below) could exceed the $500,000 that was awarded.

I just took a 5 minute glance at the data description for this contest.

It appears to be data captured only by a hospital (I assume the lab and drug info also came from the hospital). That is way too little data. You need the claims data from ALL providers, and you need it for at least 3 years in the past. You also need records of everyone who could have submitted claims, and whether they submitted claims to some outside provider. How do you even know, from this data set, whether someone was hospitalized by some non-Heritage provider during part of the year?

All of the above is standard practice by companies whose sole task is to do data analysis of US health insurance data.

And the size of those zip files... just 100 MB? I was doing work on a single PC in 1988 with data that small (because no PC in the world could handle more). Today, I suppose I would use my smartphone to process that. 😉 Back then we had to keep the raw data (what would be in those zip files) on shelves of tapes because of the disk-size limits on PCs. What possible reason is there for running a test on such small data?

When you look at the general population (not just those hospitalized in the past 1-2 years), you will find that hospitalization is a rare event (roughly 1 out of 20 people under the age of 50 are in the hospital at some point during a year, slightly higher for women of childbearing age). Repeat hospitalizations over consecutive years are mostly among the elderly, who are covered under Medicare and are at doctors' offices or filling prescriptions so frequently that it doesn't take 3,000 of the world's best data scientists to figure out hospitalization utilization with a normal data set.
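That "1 in 20" figure is the crux of the rare-event problem. A quick back-of-the-envelope check, with illustrative numbers only:

```python
def rare_event_summary(population: int, base_rate: float):
    """Expected hospitalized / not-hospitalized counts, plus the accuracy
    of the trivial rule "predict nobody is hospitalized"."""
    hospitalized = population * base_rate
    return hospitalized, population - hospitalized, 1.0 - base_rate
```

With the roughly 5% annual rate quoted above, the trivial "nobody is hospitalized" rule is already right about 95% of the time, so any model has to fight for the last few percent; that is exactly where dirty data bites hardest.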

This competition was stupid.

A 5-minute discussion with anyone working at a healthcare data analysis firm would have confirmed that. I'm sorry for the wasted effort, likely hundreds of thousands of man-hours, spent by the participants on this competition.

I don't believe the main problem with HHP was anonymization. Intuitively, it just seems hard to predict whether you'll be hospitalized next year, let alone for how many days. Predicting something like the total insurance amount claimed a few months down the road might be more feasible.