MUSINGS ON SCIENCE & SOCIETY

Can we predict a person’s face using DNA?

So, do you think that a person’s face can be reconstructed using just their DNA? After all, it has long been a cherished dream of many security agencies, the stuff of nightmares for common folks, and a regular among sci-fi themes. Now, in a paper published in the Proceedings of the National Academy of Sciences (PNAS), Craig Venter, the multi-millionaire genome entrepreneur, and colleagues from his San Diego-based Human Longevity, Inc. (HLI) claim that they have discovered this magic potion [1].

What do they claim?

Using a very costly camera setup to capture facial information, hundreds of rounds of high-coverage sequencing for 1000+ people, and some fancy machine learning algorithms, they tested SNPs for association with various facial features, such as the height of the cheekbones. They could correctly pick one individual out of ten drawn randomly from HLI’s database around 74% of the time. They then cautioned that public genomic databases could be used to work out an individual’s facial identity, and hence that privacy is at risk.

Now, this does sound very dangerous, and it quite rightly created a furore: a high chance of identifying an individual from genetic makeup alone would be a boon for catching criminals, but it would also put at risk all the public genetic databases, which routinely remove identifying information before public release. And right on cue came heavy-handed rebuttals from scores of geneticists and computer scientists well versed not only in the advanced machine learning algorithms used here but also in human genomics.

Controversy!

The main problem with Venter & Co’s magic is that they claim to have used genomic data (SNPs) to predict an individual’s facial features, along with traits such as age, based on their database. But the HLI database contains only around 1000 individuals, and if one knows just demographic variables like age, sex or race, it is quite easy to identify an individual out of 10 without any genomic information. To prove this point, Yaniv Erlich from the New York Genome Center undertook a small study using a simple re-identification procedure that relied on basic demographic information present in many databases: age, sex, and self-reported ethnicity. Unsurprisingly, Erlich achieved a 75% success rate, compared with the 74% that Venter obtained with much more sophistication and fanfare [2].
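To get a feel for why "one out of ten" is such a low bar, here is a toy simulation in Python. It is only a sketch under made-up assumptions: the cohort is synthetic, the age range, the five ethnicity labels and their weights are invented, and the function names are my own. It is not Erlich’s procedure and it will not reproduce his 75% figure; it only shows that demographics alone can beat the 10% chance level in a small pool.

```python
import random


def simulate_cohort(n, rng):
    """Build a synthetic cohort of (age, sex, ethnicity) tuples.

    All distributions here are invented for illustration only.
    """
    ethnicities = ["A", "B", "C", "D", "E"]   # hypothetical labels
    weights = [0.5, 0.2, 0.15, 0.1, 0.05]     # hypothetical skewed mix
    return [
        (rng.randint(18, 80),
         rng.choice(["F", "M"]),
         rng.choices(ethnicities, weights)[0])
        for _ in range(n)
    ]


def pick_success_rate(cohort, pool_size, trials, rng, age_tolerance=0):
    """Estimate how often a target is picked out of a random pool
    using only age (within a tolerance), sex and ethnicity."""
    expected_hits = 0.0
    for _ in range(trials):
        pool = rng.sample(cohort, pool_size)
        target = pool[0]  # the sample is in random order, so this is a random target
        # Everyone in the pool who is demographically indistinguishable
        # from the target (the target itself always matches).
        matches = [
            p for p in pool
            if p[1] == target[1]
            and p[2] == target[2]
            and abs(p[0] - target[0]) <= age_tolerance
        ]
        # Guessing uniformly among the matches succeeds with probability 1/len(matches).
        expected_hits += 1.0 / len(matches)
    return expected_hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    cohort = simulate_cohort(1000, rng)
    rate = pick_success_rate(cohort, pool_size=10, trials=2000, rng=rng, age_tolerance=5)
    print(f"Success rate from demographics alone: {rate:.2f} (chance level: 0.10)")
```

Widening `age_tolerance` or making the ethnicity mix more lopsided pushes the rate down, which is a rough stand-in for how noisy or coarse the demographic information is.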

Apart from this, Venter & Co’s claimed novelty lies in using trait predictions to identify people’s facial features. Unfortunately, a careful look at those features suggests that their re-identification power probably comes from inferring genomic ancestry and sex from the database rather than from trait-specific markers [2]. Thus, they do not predict the facial structure or the height of a specific person. Instead, they get something very close to the population average and then use those values to re-identify.
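The "population average" point can also be illustrated with a toy model. Again, this is a sketch built on invented assumptions, not the method from the paper: four hypothetical groups (think ancestry and sex combinations) each have their own mean for a single numeric trait, and the "predictor" knows only the target’s group mean, nothing individual. All names and numbers are made up.

```python
import random


def make_people(n, rng):
    """Synthetic people as (group, trait) pairs.

    Four hypothetical groups with well-separated trait means and
    unit individual-level noise; all values are invented.
    """
    group_means = {0: 0.0, 1: 3.0, 2: 6.0, 3: 9.0}
    people = []
    for _ in range(n):
        group = rng.randrange(4)
        trait = rng.gauss(group_means[group], 1.0)
        people.append((group, trait))
    return people, group_means


def top1_rate(people, group_means, pool_size, trials, rng):
    """How often does a group-average 'prediction' pick the right person?"""
    hits = 0
    for _ in range(trials):
        pool = rng.sample(people, pool_size)
        target_group, _ = pool[0]                # random target
        prediction = group_means[target_group]   # group average only, no individual signal
        # Choose the pool member whose observed trait is closest to the prediction.
        best = min(range(pool_size), key=lambda i: abs(pool[i][1] - prediction))
        hits += best == 0
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(1)
    people, means = make_people(2000, rng)
    mixed = top1_rate(people, means, pool_size=10, trials=5000, rng=rng)
    same_group = [p for p in people if p[0] == 0]
    homogeneous = top1_rate(same_group, means, pool_size=10, trials=5000, rng=rng)
    print(f"Mixed pool:        {mixed:.2f}")
    print(f"Single-group pool: {homogeneous:.2f} (chance level: 0.10)")
```

In a mixed pool the group average alone does much better than chance, because it rules out everyone from the other groups. Inside a single group it has nothing left to go on and falls back to roughly one in ten, which is what you would expect from a predictor that carries no person-specific information.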

Just have a look at the predicted-face images in their paper (S11) [1, 2].

What we can see is that the predicted faces are pretty much identical to each other. Hence, the much-vaunted Venter paper predicts a “genetic white male face” [2] rather than the actual face of a person.

Now the smelly stuff

Gymrek et al. [4] used a combination of a surname with age, state and so on to successfully re-identify a target individual. So, as Erlich [2] pointed out, if Venter et al.’s method works astronomically well, why not pick an individual from some normal public database, instead of the highly curated HLI one (which has been sequenced with great precision), and re-identify that person?

Now it’s quite reasonable to ask: if this paper contains such misleading information (especially in the way it’s promoted), how come it got published in a prestigious journal like PNAS? Here, the excellent coverage by Nature News comes into the picture [3]. The paper had earlier been flat-out rejected by the journal Science on the basis of negative reviewer comments. Venter then used the highly controversial PNAS submission policy under which a member of the National Academy of Sciences can recommend the reviewers for the paper to be published. Venter took this route and chose the reviewers. Two of them are information-privacy experts and the remaining reviewer is a bioethicist. So, can you spot the problem here? A paper that uses sophisticated machine learning algorithms and genomic databases was reviewed without a single genomics expert or statistician. That is the main stinker here!

Many would remember that such controversial submission routes at PNAS were the main reason why, a few years ago, we saw a highly unscientific and barmy paper get published, one which famously claimed that “worm-like creature must have had sex with a winged, insect-like creature, and the metamorphoses of butterflies resulted from these distinct lines of genes merging” [5]. So, is this a similar case? Are we yet again seeing a modified form of cronyism here? This point is worth pondering, I say!

So what now?

According to Nature News, Venter & Co will soon come back with a fanciful reply to the mathematically valid objections put forth by Erlich. Rather than wait for those magic tricks, I would rather carry on with vigilance. Science has become very powerful and has permeated almost every sphere of human existence imaginable. We as scientists should continue to check and re-check every piece of scientific work coming out, and should also have the moral tenacity to ring the bell on misinformation.

Especially now that public genomic databases in various biobanks have become an important resource for studying complex diseases and developing personalised medicine, we should exercise due caution towards papers like Venter’s which claim non-existent threats to genetic privacy.