Pay for the data and store it in a blockchain-protected system.


Robert Chang, a Stanford ophthalmologist, normally stays busy prescribing drops and performing eye surgery. But a few years ago, he decided to jump on a hot new trend in his field: artificial intelligence. Doctors like Chang often rely on eye imaging to track the development of conditions like glaucoma. With enough scans, he reasoned, he might find patterns that could help him better interpret test results.

That is, if he could get his hands on enough data. Chang embarked on a journey familiar to many medical researchers looking to dabble in machine learning. He started with his own patients, but that wasn't nearly enough, since training AI algorithms can require thousands or even millions of data points. He applied for grants and appealed to collaborators at other universities. He went to donor registries, where people voluntarily bring their data for researchers to use. But pretty soon he hit a wall. The data he needed was tied up in complicated data-sharing rules. "I was basically begging for data," Chang says.

Chang thinks he might soon have a workaround to the data problem: patients. He's working with Dawn Song, a professor at the University of California, Berkeley, to create a secure way for patients to share their data with researchers. It relies on a cloud computing network from Oasis Labs, founded by Song, and is designed so that researchers never see the data, even when it's used to train AI. To encourage patients to participate, they'll get paid when their data is used.

That design has implications well beyond healthcare. In California, Governor Gavin Newsom recently proposed a so-called “data dividend” that would transfer wealth from the state’s tech firms to its residents, and US Senator Mark Warner (D-Va.) has introduced a bill that would require firms to put a price tag on each user’s personal data. The approach rests on a growing belief that the tech industry’s power is rooted in its vast stores of user data. These initiatives would upset that system by declaring that your data is yours and that companies should pay you to use it, whether it’s your genome or your Facebook ad clicks.

In practice, though, the idea of owning your data quickly starts looking a little... fuzzy. Unlike physical assets like your car or house, your data is shared willy-nilly around the Web, merged with other sources, and, increasingly, fed through a Russian doll of machine learning models. As the data transmutes form and changes hands, its value becomes anybody’s guess. Plus, the current way data is handled is bound to create conflicting incentives. The priorities I have for valuing my data (say, personal privacy) conflict directly with Facebook’s (fueling ad algorithms).

Song thinks that for data ownership to work, the whole system needs a rethink. Data needs to be controlled by users but still usable to others. “We can help users to maintain control of their data and at the same time to enable data to be utilized in a privacy preserving way for machine learning models,” she says. Health research, Song says, is a good way to start testing those ideas, in part because people are already often paid to participate in clinical studies.

Meet Kara

This month, Song and Chang are starting a trial of the system, which they call Kara, at Stanford. Kara uses a technique known as differential privacy, where the ingredients for training an AI system come together with limited visibility to all parties involved. Patients upload pictures of their medical data—say, an eye scan—and medical researchers like Chang submit the AI systems they need data to train. That’s all stored on Oasis’ blockchain-based platform, which encrypts and anonymizes the data. Because all the computations happen within that black box, the researchers never see the data they’re using. The technique also draws on Song’s prior research to help ensure that the software can’t be reverse-engineered after the fact to extract the data used to train it.
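The article doesn't spell out Kara's exact mechanism, but differentially private training is commonly implemented in the style of DP-SGD: clip each record's gradient contribution, then add calibrated noise before each model update, so no single patient's scan can be teased back out of the trained weights. Here is a minimal sketch in NumPy; the logistic-regression model, clipping bound, and noise multiplier are illustrative assumptions, not Kara's actual design:

```python
import numpy as np

def dp_sgd_step(w, X_batch, y_batch, lr=0.1, clip=1.0, noise_mult=1.1, rng=None):
    """One DP-SGD-style step for logistic regression: clip each example's
    gradient, then add Gaussian noise so no single record dominates."""
    rng = rng or np.random.default_rng()
    clipped = []
    for x, y in zip(X_batch, y_batch):
        p = 1.0 / (1.0 + np.exp(-x @ w))              # predicted probability
        g = (p - y) * x                                # per-example gradient
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip / max(norm, 1e-12)))  # L2 clip
    g_sum = np.sum(clipped, axis=0)
    # Noise scaled to the clipping bound is what makes the update
    # differentially private rather than merely noisy.
    g_sum = g_sum + rng.normal(0.0, noise_mult * clip, size=w.shape)
    return w - lr * g_sum / len(X_batch)
```

In a setup like this, the researcher only ever receives the noised updates or the final weights, never the scans themselves, which matches the article's "black box" framing.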

Chang thinks that privacy-conscious design could help deal with medicine’s data silos, which prevent data from being shared across institutions. Patients and their doctors might be more willing to upload their data knowing it won’t be visible to anyone else. It would also prevent researchers from selling your data to a pharmaceutical company.

Sounds nice in theory, but how do you incentivize people to actually snap pictures of their health records? When it comes to training machine learning systems, not all data is equal. That presents a challenge when it comes to paying people for it. To value the data, Song’s system uses an idea developed by Lloyd Shapley, the Nobel Prize-winning economist, in 1953. Imagine a dataset as a team of players who need to cooperate to arrive at a particular goal. What did each player contribute? It’s not just a matter of picking the MVP, explains James Zou, a professor of biomedical data science at Stanford who isn’t involved in the project. Other data points might act more like team players. Their contribution to overall success may be conditioned on who else is playing.
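Concretely, the Shapley value averages each data point's marginal contribution over every order in which the "team" could have been assembled. Here is a toy sketch; the three-scan value table is made up to mirror the team-player point (scan C is worthless on its own but valuable alongside scan A):

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over all orderings. Exponential in len(players), so toy-sized only."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            phi[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orders) for p, total in phi.items()}

# Hypothetical "model accuracy" for every subset of three scans A, B, C.
table = {frozenset(): 0.0,
         frozenset("A"): 0.1, frozenset("B"): 0.1, frozenset("C"): 0.0,
         frozenset("AB"): 0.2, frozenset("AC"): 0.6, frozenset("BC"): 0.1,
         frozenset("ABC"): 0.7}

print(shapley_values("ABC", table.get))
# C alone adds nothing, yet it outscores B purely for its synergy with A.
```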

In a medical study that uses machine learning, there are lots of reasons why your data might be worth more or less than mine, says Zou. Sometimes it’s the quality of the data—a poor quality eye scan might do a disease-detection algorithm more harm than good. Or perhaps your scan displays signs of a rare disease that’s relevant to a study. Other factors are more nebulous. If you want your algorithm to work well on a general population, for example, you’ll want an equally diverse mix of people in your research. So, the Shapley value for someone from a group often left out of clinical studies—say, women of color—might be relatively high in some cases. White men, who are often overrepresented in datasets, could be valued less.

Put it that way and things start to sound a little ethically hairy. It's not uncommon for people to be paid differently in clinical research, says Govind Persad, a bioethicist at the University of Denver, especially if a study depends on bringing in hard-to-recruit subjects. But he cautions that the incentives need to be designed carefully. Patients will need a sense of what they'll be paid so they don't get lowballed, and they'll need solid justifications, grounded in valid research aims, for how their data was valued.

What’s more challenging, Persad notes, is getting the data market to function as intended. That has been a problem for all sorts of blockchain companies promising user-controlled marketplaces—everything from selling your DNA sequence to “decentralized” forms of eBay. Medical researchers will have concerns about the quality of data and whether the right kinds are available. They’ll also have to navigate restrictions a user might put on how their data can be used. On the other side, patients will need to trust that Oasis’ technology and promised privacy guarantees work as advertised.

The clinical study, Song says, aims to start resolving some of those questions, with Chang’s patients testing the application first. As the marketplace expands, researchers might make calls for specific kinds of data, and Song envisions partnering with doctors or hospitals so that patients aren’t totally alone in figuring out what types of data to upload. Her team is also looking into ways of estimating the value of particular data before the AI systems are trained so that users know roughly how much they’ll make by giving researchers access.

Wider adoption of the data ownership idea is a ways off, Song admits. Currently, companies mostly get to choose how they store user data, and their business models mostly depend on holding it directly. Companies, including Apple, have embraced differential privacy as a way to privately gather data from your iPhone and enable features like Smart Replies without revealing individual personal data. But Facebook’s core ad business, of course, doesn’t work like that. Before any smart math tricks for valuing data are useful, regulators need to sort out rules for how data is stored and shared, says Zou. “There is a gap between the policy community and the technical community on what exactly it means to value data,” he says. “We’re trying to inject more rigor into these policy decisions.”
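Apple's variant is local differential privacy: the noise is added on the device before anything leaves it. The oldest primitive in that family is randomized response, sketched below; the epsilon value and survey framing are illustrative, not Apple's actual pipeline:

```python
import math, random

def randomized_response(truth: bool, epsilon: float) -> bool:
    """Report the true bit with probability e^eps / (e^eps + 1);
    otherwise report its flip. Any single answer is deniable."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return truth if random.random() < p_truth else not truth

def estimate_rate(reports, epsilon):
    """Invert the known flip probability to recover the population rate."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed + p - 1.0) / (2.0 * p - 1.0)

# 10,000 simulated users, 30% of whom have some sensitive attribute.
eps = 1.0
truths = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(t, eps) for t in truths]
print(round(estimate_rate(reports, eps), 3))  # close to 0.30 despite per-user noise
```

The aggregate statistic stays estimable even though no individual report can be trusted, which is what lets a company gather usage data without holding anyone's true answer.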


33 Reader Comments

Took a class on Blockchain at Berkeley that was co-taught by Song; she clearly had (has) her eye on the long game here and was (is) looking to fundamentally disrupt the status quo. The students doing research under her - Masters, PhD, postdoc - were all incredibly excited about what they were working on, and I heard repeatedly that it felt like the most important work they'd do in their lives. One described it as, "It feels like we're building the original internet, but this time around we have an appreciation for just how much it can/will scale and are engineering to fit those needs."

I honestly didn't really understand how blockchain applies here. The idea of keeping the data in a protected "black box" and having the researchers submit their models to *it*, rather than the other way around, seems sensible -- but do you actually need blockchain for that?

The other thing that jumped out at me: if the AI training is done within Kara, wouldn't that mean Kara would need truly massive amounts of computational power?

Quote:

I honestly didn't really understand how blockchain applies here. The idea of keeping the data in a protected "black box" and having the researchers submit their models to *it*, rather than the other way around, seems sensible -- but do you actually need blockchain for that?

The other thing that jumped out at me: if the AI training is done within Kara, wouldn't that mean Kara would need truly massive amounts of computational power?

Since the goal seems to be the decentralized use of anonymous data, blockchain might actually be useful. It would allow researchers to know which bit of anonymized data are from which patient. That can certainly be done in other ways, but this seems reasonable.

However, the fact that they're using Song's company rather than Song's academic lab suggests an even stronger motive for blockchain: Buzzwords for those VC dollars. They're all going to need to be very careful and transparent about how Song and her company benefit from this.

Which leads to the money question. One of the big reasons that paying people for their data is ethically dubious is that it introduces a lot of incentives into the equation that you might not want. Like people lying about their situation so that they can get paid. There's no doubt that academics need a firm kick in the goolies when it comes to data sharing (they tend to be beyond horrific about it), but I'm not sure that money is the right approach. At least not this way. Having the granting agencies make data sharing mandatory, and then making proof of that sharing part of the grant scoring cycle, would probably be a lot better.

<edit> as for the computational power, AWS or Google Cloud or Azure can provide the needed horsepower. This is almost certainly being done on one of those or a serious high-performance cluster.

I don't see an issue with paying patients or anonymizing data. But black-boxing the data is going to cause problems. Machine learning is a catch-all phrase for a ton of statistical techniques, and it is not magic. The different techniques have trade-offs. In statistics, one of the first things we teach is "know thy data". Model selection should be based on preliminary exploration of the data. With what they are proposing, it sounds like that is off the table. This looks like a "throw a convolutional neural network at everything" solution. While convolutional neural networks are powerful tools, they are not the best at everything. This is going to cause a lot of sub-optimal to just plain bad modeling and prediction. And they want to use this to help diagnose? No thank you.

Quote:

Since the goal seems to be the decentralized use of anonymous data, blockchain might actually be useful. It would allow researchers to know which bit of anonymized data are from which patient. That can certainly be done in other ways, but this seems reasonable.

The problem is that there's literally nothing specific to blockchains that makes this easier, yet blockchains are inherently expensive to run reliably.

Blockchains are basically useless for anything except currency. They're just a very expensive database with permissionless global consensus on current state. Those properties aren't useful here.

Quote:

The problem is that there's literally nothing specific to blockchains that makes this easier, yet blockchains are inherently expensive to run reliably.

Blockchains are basically useless for anything except currency. They're just a very expensive database with permissionless global consensus on current state. Those properties aren't useful here.

I'm not sure I agree. Blockchains for cryptocurrencies need to do a lot of verification that you don't need to do when using it as a research database. For this application, all you need here is some way to verify that the data are what was originally deposited and you can associate with the right patients.

In the research projects I've been involved with, when blockchain was discussed it made some level of sense, however the final answer was always "meh". But it was a viable alternative. Since we weren't fishing for VC cash, the more mainstream answers usually won out.

Quote:

The problem is that there's literally nothing specific to blockchains that makes this easier, yet blockchains are inherently expensive to run reliably.

Blockchains are basically useless for anything except currency. They're just a very expensive database with permissionless global consensus on current state. Those properties aren't useful here.

I'm not sure I agree. Blockchains for cryptocurrencies need to do a lot of verification that you don't need to do when using it as a research database. For this application, all you need here is some way to verify that the data are what was originally deposited and you can associate with the right patients.

In the research projects I've been involved with, when blockchain was discussed it made some level of sense, however the final answer was always "meh". But it was a viable alternative. Since we weren't fishing for VC cash, the more mainstream answers usually won out.

There are a lot of "interesting" ideas out there (other than cryptocurrencies) that could potentially make a good use case for blockchain technology. Things like tracking the provenance of materials through complex supply chains used in manufacturing ("who actually supplied the gasket on that particular fitting in my fuel pump," etc.), tracking ownership of tangible assets, etc. I am not aware of any that have actually been proven through practice yet.

Essentially, any time you need to track something across multiple parties that do not have an inherent trust of each other, you could make a case for it.

I guess... I just don't really see it here. I'm assuming that the researchers who need the data aren't directly accessing the blockchain, because why would they need to, unless they need access to the data itself, which the article explicitly states they won't have?

Maybe the blockchain tracks who submitted the data, who's been paid for what, etc. But in a closed environment, like this seems to be, that would be easier with a traditional database.

Quote:

I don't see an issue with paying patients or anonymizing data. But black-boxing the data is going to cause problems. Machine learning is a catch-all phrase for a ton of statistical techniques, and it is not magic. The different techniques have trade-offs. In statistics, one of the first things we teach is "know thy data". Model selection should be based on preliminary exploration of the data. With what they are proposing, it sounds like that is off the table. This looks like a "throw a convolutional neural network at everything" solution. While convolutional neural networks are powerful tools, they are not the best at everything. This is going to cause a lot of sub-optimal to just plain bad modeling and prediction. And they want to use this to help diagnose? No thank you.

While I mostly agree with your spirit, there is (and has been for a while) a push to stop everyone and their dog actually holding data (at least in the UK), and to use a centralised system to hold the data and allow users to query it. There isn't a lot of detail in the article, but this could be how this would work: you still get to do the data exploration, but you just can't see individual records (which is typically not a valid thing to need to do anyway). At the moment in the UK we have systems that do this, and there is talk about moving even more this way -- e.g. you can do your analysis in a Jupyter notebook on a server owned by the data owner. Makes the informatics side of things way easier.

Quote:

Since the goal seems to be the decentralized use of anonymous data, blockchain might actually be useful. It would allow researchers to know which bit of anonymized data are from which patient. That can certainly be done in other ways, but this seems reasonable....

I'd be super skeptical of any anonymization algorithm that allows "researchers to know which bit of anonymized data are from which patient". That sounds like the opposite of anonymization. Correlate enough apparently anonymous data sets from the same patient, you would likely start to develop a larger pattern that could be used for deanonymization.

For something so sensitive, I would want any such anonymization algorithm to be subject to the same kind of adversarial public challenges that encryption algorithms are, with monetary rewards for deanonymization, and with some kind of pseudo double-blind test data sets so normal patient data isn't at risk, before I would trust any such algorithm in public. I wouldn't just trust some researcher's claim that they can anonymize data without massive adversarial challenge. Such claims have failed before.

I'm not sure whose oversimplification this is - the author's, the author's sources', or the scientists involved. But it's such a glaring oversimplification that I'm left understanding less than I did before reading.

1) Patients aren't going to "snap pictures of their health records"...at least not if the system is actually going to work. Retinal scans, chest X-rays, cranial PET scans, MRIs, DNA sequencing, biopsy histologies, and on and on: these are all extremely high-resolution data, and most (but still nowhere near all) are electronic to begin with; there's nothing to "snap."

2) Data are meaningless without their accompanying metadata (in this case including things like the patient's medical history and demographics, plus - importantly - exactly what test was performed, by whom, and how). Emailing Oasis a retinal scan accompanied by the message "I has makuler degenerashuns" obviously isn't going to cut it. The diagnostic data must be linked to medical history and to patient demographics to have any value whatsoever.

So, clearly, we're talking about the data being submitted in their original electronic format (or else rendered into high-resolution digital format, in the case of things like photographic X-rays) by the folks who actually possess that data: a huge confederation of healthcare providers, diagnostic laboratories, and insurance companies. Everyone in that chain is, pretty reasonably, going to ask "what's in it for us?" before they commit to the substantial undertaking of preparing (including digitizing, in many cases) and submitting terabytes of data every month, plus doing all the additional record-keeping it will require (including making sure they have the patient's or his guardian's permission to do so), then keeping the submitted medical histories up-to-date going forward. Say, for instance, it ultimately turns out the patient didn't have macular degeneration but something else instead - that would need to be corrected, or else the submitted data would be literally worse than useless.

Finally, some single trusted party is going to need to take responsibility for authenticating the data and confirming that the patient has given his permission - because otherwise this proposed system will inevitably be used by hackers as a new and exciting means of monetizing stolen medical records.

There's no universally implemented format for this firehose of data, so somebody's going to have to standardize it. If Oasis is tackling that, God bless 'em; but I doubt they are.

All told, it's a really noble and important goal (it would hugely advance clinical research to have such a broad, deep, and secure central registry of high-quality medical data). But I'm pretty confident it's impossible today (at least in the U.S.); not technically impossible, but rather organizationally impossible.

Hence, I'm left really confused about what is actually being proposed here (aside from "something something blockchain! something medical research").

I think there's a lot of people who would be happy to both be paid and to advance medicine, even if their data is freely shared. Not everyone is embarrassed by their medical problems. Everyone has medical problems--what is there to be embarrassed about?

Quote:

The data he needed was tied up in complicated data-sharing rules.

Medicine went all-in on protecting people's privacy, and the consequence is worse health for everyone and tremendous power in the hands of the organizations that have managed to collect data.

The real problem is not lack of privacy; it is how data is used. We all hate advertising, and we hate political coercion. We don't want advertising and political coercion targeted against us with any piece of information, no matter how public or private it is. Let's ban specific uses of data. Even if such rules aren't easy to enforce, they would target the real problem and are better than promoting universal secrecy, which is simultaneously unrealistic and counterproductive.

It's a naive concept. 1) Without inspecting the data, you can't winnow it. 2) The patient doesn't have the information, and good luck getting it out of the silos that do. Given the bloated, expensive US medical system, they aren't going to give anyone access to this without serious $$$. I've had imaging places bullshit that HIPAA means they can't share ultrasound pictures with the (child) patient's guardians (parents).

Quote:

I think there's a lot of people who would be happy to both be paid and to advance medicine, even if their data is freely shared. Not everyone is embarrassed by their medical problems. Everyone has medical problems--what is there to be embarrassed about?

It's a bit unclear if the database is merely "blockchain-protected" or stored in an encrypted blockchain. If the latter, and especially if the goal is patient ownership, I need to ask: say I consent and participate, but later change my mind. The reasons don't matter, although time and again we read how easy it is to deanonymize user data. But I'm wondering about the practical aspects. Is it possible to delete private information from a distributed blockchain? Data permanence is one of the key features of cryptocurrency, but it doesn't seem like the ideal tool for storing confidential medical information.

Quote:

I honestly didn't really understand how blockchain applies here. The idea of keeping the data in a protected "black box" and having the researchers submit their models to *it*, rather than the other way around, seems sensible -- but do you actually need blockchain for that?

The other thing that jumped out at me: if the AI training is done within Kara, wouldn't that mean Kara would need truly massive amounts of computational power?

I'm thinking blockchain is needed for maintaining the silo while still being able to share data.

Part of the motivation for using blockchain may also be to create an unalterable record of analyses that have been run. I've heard that in a different context: you want to not only make sure that researchers don't attempt to identify individuals, but you have to make it commonly known that they're not attempting that. And the way to do it is to publish a record of every analysis that has been run on the data.
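That audit-log role doesn't strictly require a full blockchain; an append-only hash chain, where every entry commits to its predecessor, already makes silent tampering detectable (a blockchain mainly adds distributed consensus about which chain is authoritative). A minimal sketch, with hypothetical record fields:

```python
import hashlib, json, time

def append_entry(log, analysis):
    """Append an analysis record whose hash covers the previous entry,
    so rewriting history breaks every later hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"analysis": analysis, "time": time.time(), "prev": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(body)
    return log

def verify(log):
    """Recompute every hash; any edited or deleted entry is detected."""
    prev = "0" * 64
    for e in log:
        body = {k: v for k, v in e.items() if k != "hash"}
        if e["prev"] != prev or e["hash"] != hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest():
            return False
        prev = e["hash"]
    return True

log = []
append_entry(log, "train glaucoma classifier v1 on cohort 7")
append_entry(log, "evaluate model on held-out scans")
print(verify(log))  # True until someone rewrites or drops an entry
```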

Where I think this fails is that it requires people to take action that isn't straightforward. Even if you have a doctor who gives you a login to your electronic medical records, all these systems operate quite differently. There's no standard for exporting/downloading your data -- or even for the level of detail of the data that are released. It'd be a lot of work for patients to keep their files up-to-date, and most would probably forget to do so -- especially if keeping information updated would only net them a couple of bucks a year. Consenting to have the data transferred automatically would be the only way to make this workable.

But there's a bigger issue of whether there should be opt-in consent for this. In much of medical research, patients do not get to consent to participate and are not even aware they're part of experiments. The hospital you go to, for example, might be randomized into receiving different versions of an electronic medical system. Doctors may be randomized into getting different types of reminders for best practices or the ordering of drugs they see when prescribing may be different. On top of that, health outcomes are routinely studied to evaluate the effectiveness of care, all without patient consent. All of that is super useful in improving services and patient outcomes and just couldn't be done if every patient or doctor individually had to consent.

I think a similar approach to consent makes sense in the use of medical data for analytics. The value of a single x-ray or DNA sequence is virtually zero. But if we can combine data from millions of people, we can gain new insights into early detection, causes, and treatments for a wide range of diseases. This is such an obvious case of the benefits outweighing the "harm" -- in fact, it's not even clear there's any harm. Moreover, large university hospital systems already aggregate data from all their patients. The difference here is just that data would come from multiple hospital systems. If patients had to be paid for the data, this would only drive up the cost of this type of research and hence reduce how much of it gets done... that's not in the interest of anyone.

There are lots of reasons why you don't want any random person to have access to your medical history (see LesDawg's post for some of them). But there's absolutely no reason why you wouldn't want researchers to have access to the data. Ultimately, we benefit from learning best practices and getting better at predicting and treating medical conditions. In some cases, this may come at an economic loss to an institution (e.g. we learn that an expensive procedure is no better than a cheap alternative); we'd want to make sure this is not what's driving the research, which may happen more when you have to pay for the data.

The traditional solution to valuing data is a marketplace. Facebook would assign a value, you decide whether that value is sufficient, and if you agree you allow them to use the data and they pay you.

This is already better than what we have now (Facebook takes the data, you get nothing) but not optimal. Better would be, Facebook collects and safeguards the data but does not use it for advertising. Instead, Facebook acts as a platform that allows advertising companies to make you offers (preferably in a streamlined comparison chart you can look at when you choose to, not each of them spamming you), you decide which if any of those offers to accept. Each offer specifies how the data will be used, what data will be provided, and what the compensation will be. By separating the platform that collects the data from the companies that want to use it, you can have a single platform that respects privacy and creates the opportunity for competition and a working market. It's sort of like the way we create competition in utility markets (in the states that have done that).

Of course for this to work, you have to cross out "Facebook" in the paragraph above and insert some other company that actually has credibility.
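A sketch of what those offers might look like as data, with the user's own policy deciding acceptance; every class, field, and threshold here is hypothetical, since no such platform exists:

```python
from dataclasses import dataclass, field

@dataclass
class Offer:
    buyer: str
    data_requested: list[str]   # e.g. ["ad_clicks", "age_bracket"]
    permitted_use: str          # contractual purpose, e.g. "ad targeting"
    payment_cents: int

@dataclass
class UserWallet:
    accepted: list[Offer] = field(default_factory=list)

    def review(self, offer: Offer, max_fields: int, min_cents: int) -> bool:
        """The user, not the platform, decides; the platform only brokers."""
        ok = (len(offer.data_requested) <= max_fields
              and offer.payment_cents >= min_cents)
        if ok:
            self.accepted.append(offer)
        return ok

wallet = UserWallet()
offer = Offer("AdCo", ["ad_clicks"], "ad targeting", 250)
print(wallet.review(offer, max_fields=2, min_cents=100))  # True: terms acceptable
```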

Quote:

I'd be super skeptical of any anonymization algorithm that allows "researchers to know which bit of anonymized data are from which patient". That sounds like the opposite of anonymization. Correlate enough apparently anonymous data sets from the same patient, you would likely start to develop a larger pattern that could be used for deanonymization.

Sorry, bad wording on my part. At least in my experience, anonymization starts with replacing the patient's name with an anonymous identifier like a UUID. Dates get replaced with something like the number of days from a certain event (like enrollment in the study). Ages are kept, though for older patients you may just see ">80".

Quote:

For something so sensitive, I would want any such anonymization algorithm to be subject to the same kind of adversarial public challenges that encryption algorithms are, with monetary rewards for deanonymization, and with some kind of pseudo double-blind test data sets so normal patient data isn't at risk, before I would trust any such algorithm in public. I wouldn't just trust some researcher's claim that they can anonymize data without massive adversarial challenge. Such claims have failed before.

edit: would -> wouldn't

There really isn't a single "anonymization algorithm". It's more of a process. Like I described above, clinical and patient details like names and dates are replaced by identifiers. Even things like hospital or clinic names are replaced.

However, it is really critical that all of a patient's information be tied back to the anonymous patient identifier. If you don't do that, you can't correlate anything interesting you might find with the patient's treatment and outcome. Yes, that raises the risk of re-identification. The only thing I can say is: if you're that worried about your medical history being tied to you in the real world, don't volunteer your information.
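A minimal sketch of that process, with hypothetical fields: names become stable random identifiers (so one patient's records stay linkable), dates become day offsets from enrollment, and ages over 80 are bucketed:

```python
import uuid
from datetime import date

id_map: dict[str, str] = {}   # patient name -> stable pseudonym

def pseudonymize(record: dict, enrollment: date) -> dict:
    """Replace direct identifiers while keeping records linkable:
    the same patient always maps to the same UUID."""
    pid = id_map.setdefault(record["name"], uuid.uuid4().hex)
    return {
        "patient_id": pid,
        # dates become offsets, preserving visit timing without real dates
        "visit_day": (record["visit_date"] - enrollment).days,
        # very old ages are bucketed, a common re-identification guard
        "age": ">80" if record["age"] > 80 else record["age"],
        "diagnosis": record["diagnosis"],
    }

rec = {"name": "Jane Doe", "visit_date": date(2019, 8, 2),
       "age": 83, "diagnosis": "glaucoma"}
print(pseudonymize(rec, enrollment=date(2019, 6, 1)))
```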

Quote:

I think there's a lot of people who would be happy to both be paid and to advance medicine, even if their data is freely shared. Not everyone is embarrassed by their medical problems. Everyone has medical problems--what is there to be embarrassed about?

A prospective employer might refuse to hire you, based on your medical history.

A potential partner might ghost you after googling your infectious disease medical history.

If Republicans have their way, insurance companies might once again refuse you coverage based on your medical history.

A freaking anti-abortion nutcase might stalk or harass you based on your past abortions.

Pharmaceutical companies might 'personalize' their pricing based on how desperate a patient should be for their drug (e.g.: maybe you're allergic to the only competing drug? Ka-ching!).

The list goes on and on...leave it to the ghouls to always think up more and better ways to immorally monetize your medical history.

Maybe many of these same things will happen to your children because your medical records, plus their mother's, reveal that they must certainly carry a gene predisposing them to a disease.

On and on.

Most of the things you mention are already illegal, and successfully so. You're reinforcing my point: we need to make specific uses illegal, rather than have very strict laws on the very act of sharing medical data. Forbidding data sharing doesn't even prevent most of the abuses you mention, and you need to stop them in other ways anyway.

The baseline in this discussion is still that you got paid and consented to release your data. I just don’t see a need for an anonymizing blockchain protecting your privacy (while making it very hard to do data science) or panic in general.

Quote:

While I mostly agree with your spirit, there is (and has been for a while) a push to stop everyone and their dog actually holding data (at least in the UK), and to use a centralised system to hold the data and allow users to query it. There isn't a lot of detail in the article, but this could be how this would work: you still get to do the data exploration, but you just can't see individual records (which is typically not a valid thing to need to do anyway). At the moment in the UK we have systems that do this, and there is talk about moving even more this way -- e.g. you can do your analysis in a Jupyter notebook on a server owned by the data owner. Makes the informatics side of things way easier.

This works well with structured data, such as census data, where the data fields are well understood.

But patient data consists of lots of data, much of it poorly structured: there are doctors' comments that don't follow a common standard, diagnostic reports from widely different lab machines, etc. It requires a lot of exploration and experimentation to featurize such data.

So yes, once you have proper features you don't have to look at the records anymore -- but if you prematurely silo the data, then a lot of research simply can't be done.

I am unconvinced that the best way to combine access to my data with data privacy is black-boxing and metadata sales. Essentially, that is the popular approach now, and it is entirely unsatisfactory from a data-owner point of view.

The world may not be ready for it yet, but any solution to this needs to encompass retaining permanent ownership of data concerning yourself. It may be licensed, but it must be licensed fairly (and that won't happen until the worth of data is visible to the common citizen). It may be used, but its use must be transparent, so that trust can be verified when necessary.

Perhaps it could be initially licensed to brokers along the lines of retirement trusts, where you can pick from a list with clearly offered terms. Either way, data is too valuable to be left unused by market actors and too valuable to be free. But this proposal goes about it the wrong way.

If researchers cannot see the data, does that mean that the connection to the patient is invisible, or the data itself is? I imagine that researchers might WANT to look at say, an eye scan, to see why a neural net flagged it, but is THAT hidden? If not, what prevents people from just hijacking that data and leaking it to be used for free?

Honestly I do not understand blockchain very well at all, so forgive me if these are stupid questions

Quote:

If researchers cannot see the data, does that mean that the connection to the patient is invisible, or the data itself is? I imagine that researchers might WANT to look at say, an eye scan, to see why a neural net flagged it, but is THAT hidden? If not, what prevents people from just hijacking that data and leaking it to be used for free?

Honestly I do not understand blockchain very well at all, so forgive me if these are stupid questions

Blockchain is really irrelevant here. If you ignore that part, it will probably make more sense.

Machine learning typically requires a training set of data and a test set of data. The training set is used to "teach" the system what to look for. In this case you might have a set of eye scans each labeled with a diagnosis. You don't tell the system what features to look at, you just give it tens or hundreds of thousands of images and adjust the result based on the diagnosis. You generally won't know what exactly it's looking at, or why, and looking at any individual training image does not tell you very much -- if it's good training data. If the training data is not uniform, though, you might end up training the system to look at those irrelevant differences. E.g. you took all the images with the diagnosis on one piece of equipment and the images without the diagnosis on the other, and there are slight differences in the color, contrast, background, etc. Basically if the training data is not good enough, the machine learning won't be good either. So you have to either look at the original data or trust whoever you got it from. Here, it sounds to me like you would need to trust it.

Then you have some test data. You give the test data to the system and it spits out a diagnosis for each image. It's more likely you'll want to look at the images here. If the system misdiagnoses an image, you might look to try to figure out why. Maybe there is a bias issue, maybe the image actually does look similar to whatever syndrome you are looking for, maybe it was even misdiagnosed originally.
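The uniformity pitfall is easy to demonstrate with synthetic data: if the positive cases mostly came from one (brighter) scanner, a model will happily "diagnose" by brightness and then collapse once that shortcut disappears. A toy scikit-learn illustration; every number here is made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
disease = rng.integers(0, 2, n)

# Real signal: a weakly informative feature.
signal = disease + rng.normal(0, 2.0, n)
# Confound: positive cases were mostly scanned on the brighter machine.
brightness = disease * 1.0 + rng.normal(0, 0.2, n)

X = np.column_stack([signal, brightness])
model = LogisticRegression().fit(X, disease)
print(model.coef_)  # the brightness weight dominates: it learned the scanner

# At deployment, scanners are recalibrated; brightness no longer correlates.
bright_new = rng.normal(0.5, 0.2, n)
X_new = np.column_stack([signal, bright_new])
print((model.predict(X_new) == disease).mean())  # accuracy falls toward chance
```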

Sharing patient data is something that needs to happen for research but also for the industry to function more efficiently. For example, the pharmacy you get your prescription filled at should know about all your medications, not just the ones that you happened to fill at that pharmacy chain location. To get that data the pharmacist has to know to call all the different doctors that you have seen over the years and request that data be sent over to them because in most states in the US the physician technically owns that data, not the patient. Patients can request full copies of the data, but they might just end up with a stack of papers from all the different doctors and facilities that they have visited over the years, if they remember all of those doctors and facilities.

The MIT Media Lab has a proposal for a blockchain-based method of sharing this data that stays within the current system: the data remains distributed among the physicians and facilities that host it, but patients can grant access to the data via the blockchain. Researchers play the role of miners and are rewarded with data sets that pertain to their research. I, as a patient in the system, can say that my records pertaining to my condition may be shared anonymously with researchers.

Quote:

Blockchain is really irrelevant here. If you ignore that part, it will probably make more sense.

Machine learning typically requires a training set of data and a test set of data. The training set is used to "teach" the system what to look for. In this case you might have a set of eye scans each labeled with a diagnosis. You don't tell the system what features to look at, you just give it tens or hundreds of thousands of images and adjust the result based on the diagnosis. You generally won't know what exactly it's looking at, or why, and looking at any individual training image does not tell you very much -- if it's good training data. If the training data is not uniform, though, you might end up training the system to look at those irrelevant differences. E.g. you took all the images with the diagnosis on one piece of equipment and the images without the diagnosis on the other, and there are slight differences in the color, contrast, background, etc. Basically if the training data is not good enough, the machine learning won't be good either. So you have to either look at the original data or trust whoever you got it from. Here, it sounds to me like you would need to trust it.

Then you have some test data. You give the test data to the system and it spits out a diagnosis for each image. It's more likely you'll want to look at the images here. If the system misdiagnoses an image, you might look to try to figure out why. Maybe there is a bias issue, maybe the image actually does look similar to whatever syndrome you are looking for, maybe it was even misdiagnosed originally.

I do have a decent understanding of neural nets (though recurrent ones are still weird); it's really just the last bit that's confusing. I guess what you're getting at is that while the training data is "invisible," the test data can be individually converted to the format that the NN interprets, but be initially visible?