DNA storage: The code that could save civilisation

Ed Yong

About the author

Ed is an award-winning science author. He writes the blog Not Exactly Rocket Science and his work has appeared in New Scientist, Nature, Scientific American, the Guardian, the Times, Wired UK, Discover and more. He tweets at @edyong209.


Neither Ewan Birney nor Nick Goldman can remember exactly how they came up with the idea of storing all the world’s knowledge in DNA. They know it happened in the bar of the Gastwerk Hotel in Hamburg, and that many beers were involved. They may or may not have scrawled their ideas on a napkin. “It must have involved a pen or pencil because I can’t think without holding one,” says Goldman. “It would’ve involved a lot of hands from me,” says Birney.

Their chat was fuelled by a simple realisation: scientists would soon start amassing more genetic information than they could afford to store. In the 1990s, this problem would have seemed laughable. Back then, it took a decade to sequence the human genome and geneticists could store their data on an Excel spreadsheet. Since then, the relentless improvement in sequencing machines has turned that trickle of genomic data into a full-on flood. This technology doubles in efficiency every six months, allowing you to sequence twice as much DNA for the same amount of money. However, it takes 18 months to get twice as much hard disk for your buck, so it is starting to cost more to store the results of experiments than to actually run them in the first place. “And at some point, not too far in the future, you would run out of either disk space or money,” says Goldman.
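The squeeze Goldman describes is plain compound growth. As a rough sketch, taking the two doubling times quoted above at face value (the function and constant names are illustrative, not anything the EBI uses), the gap between data produced and disk bought for the same money can be tabulated like this:

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """How many times a quantity multiplies after `months` of steady doubling."""
    return 2 ** (months / doubling_months)

SEQUENCING_DOUBLING = 6   # months, figure quoted in the article
DISK_DOUBLING = 18        # months, figure quoted in the article

for years in (3, 6, 9):
    months = 12 * years
    data = growth_factor(months, SEQUENCING_DOUBLING)
    disk = growth_factor(months, DISK_DOUBLING)
    print(f"after {years} years: {data:.0f}x the data, "
          f"{disk:.0f}x the disk per unit of money, gap {data / disk:.0f}x")
```

After three years the same budget yields 64 times the data but only 4 times the disk, a 16-fold shortfall that keeps widening.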

That would be a setback for a normal lab and an outright catastrophe for the place where Birney and Goldman work. Located on an isolated campus on the outskirts of Cambridge, UK, the European Bioinformatics Institute (EBI) stores genomic data from labs all over the world. At an internal conference in Hamburg, in April 2010, “you couldn’t move for someone saying the EBI will have to close down the DNA archive because it’s unsustainable”, says Birney.

After the conference, Goldman and Birney retreated to a local pub and started batting around possible solutions, beers in hand. They realised that the big problem was the cycle of obsolescence that all data-storing technologies go through. Old machines are junked in favour of new hardware (remember VCRs?) and any data stored on out-of-date media must be re-read and re-written onto the medium du jour, all at great expense. “We thought: Isn’t there some other nano-machine that would allow us to store digital data?” says Birney. Both of them start laughing—the answer was so obvious. “We said: Duh! It’s going to be DNA.”

Living things have been storing information in DNA since the dawn of life, including the instructions for building every human, animal, bacterium and plant. The molecule itself looks like a twisting ladder, whose rungs are made of four molecules called bases that pair up in specific ways—adenine (A) with thymine (T), and cytosine (C) with guanine (G). If you can create your own strands of DNA, with the ones and zeroes of binary data converted into these As, Gs, Cs, and Ts, you have a storage medium that will never go obsolete. Sequencing machines will continue to improve and will need to be replaced, but once information is stored in DNA, that’s that.

In terms of information density, DNA outclasses anything we’ve been able to invent. A single gram can contain as much data as 3 million CDs. All of the world’s data would fit in the back of a minivan.

And once encoded into DNA, information is a doddle to copy. To transfer the contents of one hard disk into another, you need to hook both of them up to a computer and wait for minutes or hours. To transfer the contents of a tube of DNA, you dissolve it in water, suck up some of the liquid into a pipette, and squeeze it into another tube. It takes seconds. “I could copy a petabyte like this,” says Birney, who mimes depressing his thumb.

‘I have a dream’

Scientists have encoded short messages in DNA for years using simple ciphers. When sequencing pioneer Craig Venter’s team implanted a bacterium with a fully synthesised genome in 2010, they added their names and several famous quotes into the fabricated DNA. The messages were coded using combinations of three bases to signify each character – for example, AGT stood for the letter B.

But encoding longer messages – say, a book or a video file – is far more difficult. DNA can only be synthesised and read as small fragments of around 200 base pairs or smaller, so larger chunks of information must be broken down before they are encoded. When those fragments are synthesised, you get a messy soup containing millions of copies of each one. So, every piece needs an identifier that reveals where it fits into the overall message – “I’m the first fragment” or “I’m the 765th”.
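The identifier idea can be shown without any chemistry at all. The sketch below is only an illustration: it cuts a line of text into numbered chunks, shuffles several copies of each to mimic the soup, and puts the message back together. The function names and the chunk length are made up for the example.

```python
import random

def split_with_index(message: str, chunk_len: int) -> list[tuple[int, str]]:
    """Cut the message into chunks and tag each with its position in the sequence."""
    return [(n, message[i:i + chunk_len])
            for n, i in enumerate(range(0, len(message), chunk_len))]

def reassemble(pool: list[tuple[int, str]]) -> str:
    """Rebuild the message from an unordered pool that may hold duplicate copies."""
    unique = dict(pool)  # identical copies collapse onto one index
    return "".join(unique[n] for n in sorted(unique))

message = "Shall I compare thee to a summer's day?"
pool = split_with_index(message, chunk_len=5) * 3  # three copies of each fragment
random.Random(0).shuffle(pool)                     # the "messy soup"
assert reassemble(pool) == message
```

In real DNA the index has to be written in bases too, so it uses up part of each short fragment.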

George Church, a geneticist at Harvard University, used this approach to encode a copy of his entire book – Regenesis – into DNA. The text, including 53,426 words, 11 illustrations and a JavaScript program, came in at 5.2 million bits of information. Church split these into almost 55,000 fragments and converted them into bases by using A and C to represent zeroes and G and T to represent ones, and his work was published in the journal Science in August 2012.
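A one-bit-per-base scheme of this kind is easy to sketch. The article does not say how Church's team chose between the two letters available for each bit, so the rule used here (take whichever letter differs from the previous one) is an assumption made for illustration, as are the function names.

```python
ZERO_BASES = "AC"  # either letter stands for a 0
ONE_BASES = "GT"   # either letter stands for a 1

def text_to_bits(text: str) -> str:
    """ASCII text as a string of 0s and 1s, eight per character."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))

def bits_to_bases(bits: str) -> str:
    out: list[str] = []
    for bit in bits:
        options = ZERO_BASES if bit == "0" else ONE_BASES
        # Assumed tie-break: avoid repeating the previous letter.
        base = options[0] if (not out or out[-1] != options[0]) else options[1]
        out.append(base)
    return "".join(out)

def bases_to_bits(bases: str) -> str:
    return "".join("0" if base in ZERO_BASES else "1" for base in bases)

dna = bits_to_bases(text_to_bits("Hi"))
assert bases_to_bits(dna) == text_to_bits("Hi")
```

Because two letters share each bit value, the decoder never needs to know which of the pair the encoder picked.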

Working on a similar principle, Birney and Goldman chose five files representing a range of formats and (mostly) material of great cultural value. A PDF of the classic 1953 paper in which James Watson and Francis Crick described DNA’s double helix was an obvious choice. The duo originally wanted Shakespeare’s complete works but they underestimated the size of the Bard’s output, so they settled for just the 154 sonnets in ASCII text. A 26-second MP3 clip of Martin Luther King’s “I have a dream” speech filled the audio slot after the duo ruled out Lady Gaga. A copy of the cipher used to encode the data was a practical choice, and a JPEG picture of the EBI was the lone concession to narcissism.

Birney and Goldman also devised a more complex cipher than Church. First, they converted binary data into base-three, replacing every byte – a string of 8 zeroes and ones – with a corresponding string of 5 or 6 zeroes, ones and twos. Next, they replaced these numbers with DNA letters, using a code where the meaning of each letter depends on the one before it. For example, A means 1 if it follows a G, but 0 if it follows a T and 2 if it follows a C.

Why so complicated? Because in this code, no letter ever appears twice in a row. Repetitive strings of bases – such as AAAAAAA – are the bane of both DNA synthesisers and sequencers. If you can avoid them, your error rate plummets.
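Here is a minimal sketch of such a rotating code. The lookup table is chosen to agree with the three examples given above (A after G, after T and after C); the remaining entries, the starting letter and the fixed six base-three digits per byte are assumptions made to keep the example short. Five digits cannot cover every byte, since 3^5 = 243 is fewer than the 256 possible values.

```python
# NEXT_BASE[previous][digit] -> letter to write; a letter never follows itself.
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}
TRITS_PER_BYTE = 6  # simplification: fixed width, since 3**6 = 729 >= 256

def byte_to_trits(value: int) -> list[int]:
    digits = []
    for _ in range(TRITS_PER_BYTE):
        value, digit = divmod(value, 3)
        digits.append(digit)
    return digits[::-1]  # most significant digit first

def encode(data: bytes, start: str = "A") -> str:
    prev, out = start, []
    for byte in data:
        for trit in byte_to_trits(byte):
            prev = NEXT_BASE[prev][trit]
            out.append(prev)
    return "".join(out)

def decode(dna: str, start: str = "A") -> bytes:
    prev, trits = start, []
    for base in dna:
        trits.append(NEXT_BASE[prev].index(base))
        prev = base
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)

dna = encode(b"I have a dream")
assert decode(dna) == b"I have a dream"
assert all(a != b for a, b in zip(dna, dna[1:]))  # no letter twice in a row
```

Each row of the table leaves out the previous letter, which is why a run such as AAAAAAA can never be produced, whatever the data.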

Still, there would be mistakes. “We had to go in saying we were going to make errors,” says Birney. “It’s a disaster to think your technology won’t have errors.” Church got 10 mistakes out of 5.2 million letters – hardly catastrophic, but Birney and Goldman wanted none. So, they built redundancies into their code. They broke the five files into more than 153,000 fragments, each 117 letters long. Each string overlaps with four others, so that every bit of information is repeated four times. If any fragment is synthesised wrongly or cannot be read, its contents can be pieced together from at least three others.
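The overlap trick can be sketched with a sliding window. The window and step sizes below are toy values chosen so that each letter away from the ends lands in four fragments; they are not the project's real dimensions, and the majority vote is one plausible way of combining the copies rather than the team's documented procedure.

```python
from collections import Counter

def overlapping_fragments(seq: str, window: int, step: int) -> list[tuple[int, str]]:
    """Slide a window along seq; each fragment remembers where it started."""
    return [(start, seq[start:start + window])
            for start in range(0, len(seq) - window + 1, step)]

def consensus(fragments: list[tuple[int, str]], length: int) -> str:
    """Rebuild the sequence by majority vote at every position."""
    votes = [Counter() for _ in range(length)]
    for start, fragment in fragments:
        for offset, base in enumerate(fragment):
            votes[start + offset][base] += 1
    return "".join(v.most_common(1)[0][0] if v else "?" for v in votes)

original = "ACGTCATGCT" * 4                      # 40 letters of toy data
fragments = overlapping_fragments(original, window=8, step=2)

lost = fragments[:5] + fragments[6:]             # one fragment never read
assert consensus(lost, len(original)) == original

damaged = list(fragments)
damaged[5] = (damaged[5][0], "N" * 8)            # one fragment synthesised wrongly
assert consensus(damaged, len(original)) == original
```

In this simplified version the first and last few letters are covered fewer than four times, so a real design needs extra care at the ends.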

To illustrate this system during talks, Goldman has made Lego versions of his DNA strings, using red, yellow, blue and green bricks to represent the four bases. When I clumsily drop one of the strings, it shatters on the floor, but we quickly use the neighbouring strands as templates for reassembling the broken one. We even deduce that a blue piece has rolled off under some furniture. The fail-safes work.

This side of crazy

In March 2012, Californian company Agilent Technologies created the DNA strands that Goldman and Birney had designed and shipped them back to the researchers. The sonnets, speech, paper, image and program all arrived as dry white dust specks at the bottom of several pinkie-sized tubes. Goldman, who hadn’t worked with lab experiments since he was 16, thought they were empty. “Nick said, ‘Agilent haven’t sent us anything! Are you going to have to write the email or am I?’,” says Birney.

When the team sequenced the DNA they received from Agilent, they reconstructed four of the files perfectly. But the Watson and Crick paper had two gaps of 25 letters each, where several consecutive fragments had mysteriously gone missing. After two days of staring, Goldman worked out that the missing pieces had matching ends that caused them to fold into a hairpin-shape – which the sequencer skips past without reading. Fortunately, the gaps were flanked by a recurring pattern, so he could enter the missing letters by hand. In the end, the team rebuilt all five files with 100% accuracy.
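A check for this kind of trouble could be run before anything is synthesised. The sketch below reads "matching ends" as meaning that one end of a fragment is the reverse complement of the other, which is what lets a single strand fold back and pair with itself; that reading, the function names and the stem length are assumptions, not details reported by the team.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """The sequence that would pair with seq, read in the opposite direction."""
    return seq.translate(COMPLEMENT)[::-1]

def has_self_complementary_ends(seq: str, stem: int) -> bool:
    """True if the first `stem` letters could pair with the last `stem` letters."""
    return len(seq) >= 2 * stem and seq[:stem] == reverse_complement(seq[-stem:])

risky = "GATTACA" + "TTTT" + reverse_complement("GATTACA")
safe = "GATTACA" + "TTTT" + "GATTACA"
assert has_self_complementary_ends(risky, stem=7)
assert not has_self_complementary_ends(safe, stem=7)
```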

“I describe this project as being on just this side of crazy,” says Birney. “It works but isn’t commercially feasible now.” The exorbitant cost of making DNA is the biggest hold-up. For the moment, you need $220 to read each megabyte of DNA data but $12,400 to write it in the first place; however, these costs are likely to fall 100-fold within the next decade. They are also one-off investments; once data is written as DNA, it never needs to be re-written into new-fangled formats. Birney and Goldman predict that soon, DNA will be the ideal medium for storing data that you want to keep for a long time but not regularly revisit, such as wedding videos or the archives of huge science projects like the Large Hadron Collider at Cern.
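To put rough numbers on that, the sketch below multiplies out the two per-megabyte prices quoted above. The function name and the `price_drop` parameter are illustrative, and the 100-fold drop is the article's forecast rather than a measured figure.

```python
WRITE_USD_PER_MB = 12_400  # synthesis, figure quoted in the article
READ_USD_PER_MB = 220      # sequencing, figure quoted in the article

def archive_cost(megabytes: float, reads: int = 1, price_drop: float = 1) -> float:
    """Cost in dollars of writing the data once and reading it back `reads` times."""
    return megabytes * (WRITE_USD_PER_MB + reads * READ_USD_PER_MB) / price_drop

print(archive_cost(1))                     # one megabyte at the quoted prices
print(archive_cost(1000, price_drop=100))  # a gigabyte after a 100-fold fall
```

At the quoted prices a single megabyte costs $12,620 to write and read once; even after a 100-fold fall, a gigabyte would still come to about $126,000.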

Or, perhaps, all of human knowledge? Besides being universal, dense and easily copied, DNA is also incredibly stable. A recent study showed that DNA has a half-life of 521 years – that’s how long it takes for half the chemical bonds in its double helix to break. This estimate was based on DNA recovered from the 8,000-year-old leg bones of giant extinct birds called moas. But that’s nothing – these bones were preserved at 13°C in New Zealand. Under gentler conditions, DNA’s shelf life can stretch to tens of thousands of years. “For perspective, that’s all of modern human evolution,” says Birney.
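A half-life translates into a simple decay curve. This sketch assumes steady exponential decay at the 521-year figure, which applies to the 13°C conditions of the moa bones and not to a cold, dry vault; the names are illustrative.

```python
HALF_LIFE_YEARS = 521  # figure quoted in the article, for bones kept at 13°C

def fraction_intact(years: float, half_life: float = HALF_LIFE_YEARS) -> float:
    """Share of bonds still unbroken after `years` of steady exponential decay."""
    return 0.5 ** (years / half_life)

for years in (521, 1_042, 8_000):
    print(f"{years:>5} years: {fraction_intact(years):.6f} of bonds intact")
```

Half the bonds survive one half-life and a quarter survive two, which is why storage conditions matter far more than the headline figure.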

At some point during their project, Goldman and Birney realised that such long-lasting information stored in DNA won’t just outlast new pieces of technology, but entire civilisations. They started thinking about a fanciful application: using DNA to apocalypse-proof human culture.

Imagine a future cataclysm that sends humanity back to the dark ages. Our population dwindles from billions to thousands. Electronic devices malfunction and digital information gets wiped. Languages die out, scientific knowledge is lost, and works of art are destroyed. But humanity bounces back. It takes 10,000 years but a new civilisation rises from the ashes and starts the process of re-discovery. The letters D, N and A would mean nothing to these descendants, but at some point, they would discover that a common molecule, shaped like a double helix, unites all living things. Using that molecule, we could provide them with an archive of all our scientific discoveries, our literature, and our immense treasure trove of cat videos.

Goldman and Birney have run through this thought experiment in surprising detail. For a start, they would build the archive somewhere cold, dry, dark and with “no immediate geological plans to relocate”. Think of the Svalbard Global Seed Vault on the Norwegian island of Spitsbergen. Built 130 metres (425 feet) above sea level and 120 metres (390 feet) into a mountainside, this unstaffed vault houses thousands of seeds to preserve the genetic diversity of food crops in case of a global crisis. It runs on coal power, but a DNA archive would not even need that. The elements would suffice.

Chamber of codes

Nature has already proved that this can work. Scientists have sequenced most of the woolly mammoth’s genome after extracting DNA from specimens that had spent up to 60,000 years in Siberian permafrost. The genome of the Denisovans – a group of extinct humans who lived in Asia – was sequenced using only the DNA from a 41,000-year-old finger bone found in a Siberian mountain cave.

Goldman and Birney’s hypothetical vault would have three rooms. First up: an introductory chamber, with explanatory illustrations etched on a durable metal like nickel or gold. We cannot assume that any modern languages or visual conventions will survive, but thankfully, the chemistry of DNA will remain unchanged. We could draw out the entire double helix, atom by atom, using concentric circles to represent the individual elements. To emphasise the point, add pictures of humans and other living things on the side. If the people who find the vault have already re-discovered DNA, this would all make sense. If they have not, then future-Watson and future-Crick just got a massive clue.

The chamber would contain a few test vials containing simple DNA messages for future cryptographers to decipher. Every step of the way, from DNA sequence to binary code, would be etched in full on the walls – a Dummies’ Guide to DNA code-breaking. “You’d step them through all of those things visually and symbolically,” says Birney.

Maybe we would write the first hundred numbers in binary and circle the primes. Maybe we would present a full periodic table, with arrows pointing to the right elements.

By now, the explorers of 12,013 have worked out how to read our DNA code, but aside from some images, they are mostly faced with gobbledygook written in dead languages. “You’ve reconstructed the bytes of the sonnets but you have no idea what English is,” says Goldman. This is where the second room comes in – it is a massive Rosetta Stone. The same piece of text – perhaps the United Nations’ Universal Declaration of Human Rights – would be translated into the most common languages and etched next to vials containing the same message in DNA. “The hope is that future civilisations can recognise one of these things, just like the Rosetta Stone,” says Birney. “We knew Greek, and from there we could go to hieroglyphics.”

Our intrepid descendants will need to get this far to decrypt the instructions for entering the third and final room, all of which are written in modern languages. It might take years or centuries. Birney imagines them “stepping through these dusty rooms and being confronted with all this ancient symbolism on some painstakingly engraved metal”. Eventually they gain access to the mother lode – the chamber that stores the complete knowledge from humanity’s first age, waiting to be decoded.

All of this is technically feasible, if not financially so. Birney says: “If someone turned up and challenged me and Nick to store all of human knowledge for 10,000 years...”