MP3 files written as DNA with storage density of 2.2 petabytes per gram

Researchers used a trinary code, reserving one of DNA's four bases to sidestep sequencing errors.

[Figure: The general approach to storing a binary file as DNA. Credit: Goldman et al., Nature]

It's easy to get excited about the idea of encoding information in single molecules, which seems to be the ultimate end of the miniaturization that has been driving the electronics industry. But it's also easy to forget that we've been beaten there—by a few billion years. The chemical information present in biomolecules was critical to the origin of life and probably dates back to whatever interesting chemical reactions preceded it.

It's only within the past few decades, however, that humans have learned to speak DNA. Even then, it took a while to develop the technology needed to synthesize and determine the sequence of large populations of molecules. But we're there now, and people have started experimenting with putting binary data in biological form. A new study has confirmed the flexibility of the approach by encoding everything from an MP3 to the decoding algorithm itself into fragments of DNA. The cost analysis done by the authors suggests that the technology may soon be suitable for decade-scale storage, provided current trends continue.

Trinary encoding

Computer data is in binary, while each location in a DNA molecule can hold any one of four bases (A, T, C, and G). Rather than exploiting all of that extra information capacity, however, the authors used some of it to avoid a technical problem. Stretches of a single type of base (say, TTTTT) are often not sequenced properly by current techniques—in fact, this was the biggest source of errors in the previous DNA data storage effort. So for this new encoding, the authors used one of the bases to break up long runs of any of the other three.

(To understand how this could work in practice, let's say that A, T, and C encode information, while G represents "more of the same." If you had a run of four A's, you could represent it as AAGA. And since the G doesn't encode anything in particular, TTGT can represent four T's in the same way. The authors' system was more complicated than this, but it did manage to keep extended single-base runs from happening.)
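
To make the idea concrete, here's a minimal sketch in Python of the toy scheme just described, with G standing for "more of the same." It's only an illustration of the general approach; the authors' actual encoding is more elaborate.

```python
# Toy run-breaking code: A, T, and C carry data, while G means "more of the
# same," so no run of identical bases in the encoded string exceeds two.
# Illustration only -- not the authors' actual scheme.

def break_runs(bases: str) -> str:
    """Emit G whenever a third identical base in a row would appear."""
    out = []
    for b in bases:                    # input uses only A, T, and C
        if len(out) >= 2 and out[-1] == b and out[-2] == b:
            out.append("G")            # "more of the same"
        else:
            out.append(b)
    return "".join(out)

def restore_runs(encoded: str) -> str:
    """Undo break_runs: each G stands for the base that precedes it."""
    out = []
    for b in encoded:
        out.append(out[-1] if b == "G" else b)
    return "".join(out)

assert break_runs("AAAA") == "AAGA"    # the example from the text
assert restore_runs("TTGT") == "TTTT"
```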

That leaves three bases to encode information, so the authors converted their information into trinary. In all, they encoded a large number of works: all 154 Shakespeare sonnets, a PDF of a scientific paper, a photograph of the lab some of them work in, and an MP3 of part of Martin Luther King's "I have a dream" speech. For good measure, they also threw in the algorithm they use for converting binary data into trinary.
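
For a sense of what "converting into trinary" means, here is one simple way to rewrite bytes as base-3 digits ("trits"). This is purely illustrative; the authors' actual binary-to-trinary mapping, which they also encoded into the DNA, is more involved.

```python
# Straightforward illustration of turning binary data into trinary digits.
# Not the authors' mapping -- just a simple base conversion.

def bytes_to_trits(data: bytes) -> list[int]:
    """Treat the bytes as one big integer and write it out in base 3."""
    n = int.from_bytes(data, "big")
    trits = []
    while n:
        n, r = divmod(n, 3)
        trits.append(r)
    return trits[::-1] or [0]

print(bytes_to_trits(b"Hi"))   # [2, 2, 1, 1, 0, 2, 1, 2, 0] for 0x4869
```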

Once in trinary, the results were encoded into the error-avoiding DNA code described above. The resulting sequence was then broken into chunks that were easy to synthesize. Each chunk came with parity information (for error correction), a short file ID, and data indicating the offset within the file (so, for example, that the sequence holds digits 500-600). To provide added redundancy, the 100-base-long DNA fragments were staggered by 25 bases so that consecutive fragments had a 75-base overlap. Thus, most sections of the file were carried by four different DNA molecules.
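
Here's a rough sketch of just the staggering, assuming fixed 100-base windows that start every 25 bases; the real fragments also carried the file ID, offset, and parity information described above, which this sketch omits.

```python
# Staggered fragmentation: 100-base windows starting every 25 bases, so
# neighboring fragments share a 75-base overlap and interior positions end
# up carried by four different molecules.

def fragment(sequence: str, length: int = 100, step: int = 25) -> list[str]:
    """Cut a long DNA sequence into overlapping fixed-length windows."""
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, step)]

seq = "ACTG" * 100                    # stand-in for a 400-base encoded file
chunks = fragment(seq)
print(len(chunks), len(chunks[0]))    # 13 fragments, each 100 bases long
# Position 200, for instance, falls in the windows starting at 125, 150,
# 175, and 200 -- four-fold coverage.
```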

And it all worked brilliantly—mostly. For most of the files, the authors' sequencing and analysis protocol could reconstruct an error-free version of the file without any intervention. One, however, ended up with two 25-base-long gaps, presumably resulting from a particular sequence that is very difficult to synthesize. Based on parity and other data, the authors were able to reconstruct the contents of the gaps, but knowing why things went wrong in the first place would be critical to judging how well suited this method is to long-term archiving of data.

Long-term storage

In general, though, the DNA was very robust. The authors simply dried it out before shipping it to a lab in Germany (with a layover in the UK), where it was decoded. Careful storage in a cold, dry location could keep it viable for much, much longer. The authors estimate their storage density was about 2.2 petabytes per gram, and that the sample included enough DNA to recover the data about ten additional times.

Which brings us to the authors' more general argument. Assuming the process is streamlined and automated, and the physical cataloging can be handled at minimal cost, is this ever likely to be a cost-effective way to store data? As a point of contrast, the authors considered a data set from the LHC. After a few years, access to data from these archives tends to be very limited, while the cost of maintaining them tends to involve sporadic migrations to upgraded magnetic tape technology.

With current, state-of-the-art DNA synthesis and sequencing, the economics start to make sense only if you're planning on storing the data for over 500 years (although it would remain cost-effective out to 5,000). But the authors also note that if the relevant technologies continue improving at their current rates, it will take only about a decade to reach the point where DNA-based storage starts making sense for archiving data for as little as 50 years.

Both this team and the previous one point out that, unlike other storage media, the physical form of the information won't change if DNA is used, even if we end up using different methods to synthesize or read out the molecule's contents. And life on Earth ensures that the tools we need to manipulate DNA will always be available to us. In other words, if we're ever not able to read the information in DNA, then we've got bigger problems than having lost some data.

Promoted Comments

What is driving the trend to deprecate the term "junk" is the amount of functional code with an unidentified function. Yes, there is a lot of code that has been disabled by "maintenance," but there is also a lot of that code that gets activated: sometimes by rare environmental triggers, sometimes as an unidentified portion of a known gene, and sometimes as an unidentified gene.

Some people clearly misinterpret the term "junk". Sidney Brenner remarked (he may not have been the first, but he said it at a seminar I attended) that the difference between "junk" and "garbage" is that both are useless, but junk is kept, while garbage is discarded. For humans with attics filled with junk, the junk is material that is kept because the space is there, and because it might be useful in the future. The genome is somewhat similar, although the genome sometimes keeps ticking time-bombs, while most humans tend to discard those.

The junk is the source material for, and the residue of, many evolutionary events. Most of the genome (probably at least 70%, and likely more than 90%) has no current function, and it is difficult to justify the claim that deleting it would necessarily be harmful. Some expressed genes are redundant and can be deleted with no detectable effects on the organism. Some of the non-functional DNA may be activated in the future, but that is not the same as claiming that it has a current role. The term junk is appropriate for material that may have a function in the future but exists now merely as additional sequence that the replication polymerases copy because they have no mechanism for avoiding doing so.

Is it worth editing the genome to remove the junk? (Humans currently lack the technology to do so for eukaryotic genomes, so the question remains hypothetical.) Some junk might be useful to remove. As an example, chromosome 6 has CYP21 and CYP21A sequences. CYP21A is non-functional, and crossing-over that joins part of CYP21 with CYP21A results in a relatively frequent, potentially fatal genetic abnormality known as 21-hydroxylase deficiency.

As I mentioned in the last thread, these folks need to talk to people in the EE and CS departments at their university. Data encoding was studied to death in the 50s-70s and there are existing encoding schemes that provide guaranteed transitions with far less overhead than their approach. Finding the optimal encoding method for DNA using information theory and empirical data on the types of transcription errors that occur would make for a very interesting interdisciplinary paper.

Do you have an 8-track player? If not, you'd have to build one from scratch to load my father's music library. Or a 5.25" floppy drive, or a particular type of magnetic tape reader, etc. The hardware we use to store information has changed with every new technology level we achieve. That makes it difficult to access information that you plan to store for decades, centuries, or millennia.

The last paragraph of the article addresses the fact that the storage medium (DNA) has been around for a really, really long time and will likely continue to exist for a long while. So the tools for writing, copying, and reading that information will still be germane centuries from now. So while the encoding process, the binary formats, and the MP3 format itself may be lost, the medium is not going to change any time soon. And if it does... well, as the author states, we've got bigger issues to worry about.

Dude... what if HUMANS are just the DNA-stored data of some other civilization...?

That explains the obesity epidemic! All those Supersize fries encode data at 2.2 PB/gram... so an extra 20 kilos on each data node, er, person, ....42.9 exabytes of data! Clearly the use of high fructose corn syrup and fatty foods is an alien plot to increase their storage efficiency.

I suppose. But if you unearth a cache of DNA stored music and have no DNA player how is that any better than finding the 8-tracks and having no 8-track player? The technology seems really cool, but that last paragraph just seems silly to me.

Now if we could get some genetic modifications to make our bodies natively read and play back DNA MP3's...

Excellent news. Does it need artificial cold storage, or would a cold cave suffice? If humanity suddenly disappeared, most of our modern knowledge would disintegrate after a couple of decades; only some of our knowledge could survive a couple of centuries, and after a thousand years no one would know we existed or what we knew.

This is ironic considering how much we can learn about old civilizations through clay tablets. Storing our knowledge in DNA form could be the modern equivalent of clay tablets: it could hold a lot of data for a very long time, as long as the temperature is right.

Ragashingo wrote:

I suppose. But if you unearth a cache of DNA stored music and have no DNA player how is that any better than finding the 8-tracks and having no 8-track player? The technology seems really cool, but that last paragraph just seems silly to me.

Now if we could get some genetic modifications to make our bodies natively read and play back DNA MP3's...

This actually brings to mind the movie Titan AE a little bit in terms of long-term data storage. Undoubtedly a cool concept. I suspect in our lifetime we'll see much more merging between biology and technology (yeah, yeah, I know they've got a name for that already).

Excellent news, except for the energy requirements of long-term storage. If humanity suddenly disappeared, most of our modern knowledge would disintegrate after a couple of decades; only some of our knowledge could survive a couple of centuries, and after a thousand years no one would know we existed or what we knew.

This is ironic considering how much we can learn about old civilizations through clay tablets. Storing our knowledge in DNA form could be the modern equivalent of clay tablets: it could hold a lot of data for a very long time, as long as there is energy helping preserve the DNA.

For super-long-term storage I am a fan of microlithography of textual information on extremely corrosion-resistant metals. Use several different languages and make the print large enough that it's obvious there's something there - see for example: http://rosettaproject.org/

I suppose. But if you unearth a cache of DNA stored music and have no DNA player how is that any better than finding the 8-tracks and having no 8-track player? The technology seems really cool, but that last paragraph just seems silly to me.

Now if we could get some genetic modifications to make our bodies natively read and play back DNA MP3's...

P.S. Thanks for the good reply.

I would assume that once we get really good at genetic manipulation, we'll be able to create a "self-extracting archive." Just add water (maybe some sugar) and it'll "create" the 8-track player for you along with the tapes...

What the last sentence is saying is that, assuming humanity doesn't fall into decay for some reason, we will always be able to read DNA, even if the tools for it change. Imagine Dr. Crusher running a tricorder over someone to scan their DNA to check for genetic diseases or whatever. That same tool would be able to read the DNA sequence of an iTunes library that was encoded using Dr. McCoy's tricorder.

The point of the last sentence (or thereabouts) is that if we don't have ANY tricorders lying around, we have bigger problems, like descending into a technological dark age.

With 2.2 PB / gram, do you really care about whether or not the encoding scheme is optimal at this stage? I would think the main goal is to guarantee the data can be extracted reliably.

We already store things in base-10, and for us English speakers, base-26. 8-bit ASCII? 16-bit Unicode? You get the idea. What's relevant is the overhead of encoding/decoding and validation/error checking.

I wonder how much of this is a real possibility and how much is just hair-growth tonic. DNA is very delicate. Also, what about reading and writing speeds? What about the reliability of this kind of method? I don't think this is just a matter of capacity. This might have some niche application where capacity is the highest priority.

Actually, DNA is not as delicate as you might think, especially not with all the repair mechanisms nature came up with. They've sequenced DNA from mammoths frozen for thousands of years... try that with a CD.