DNA, the Ultimate in Data Memory

While the role of DNA as a biological memory is well established, exploring its potential as a data memory is relatively new. Although DNA data memory has not quite yet reached the stage where a blob of DNA can have some wires attached to it to write and read its data content, good progress has been made.

In their latest work, published in Science, Yaniv Erlich and Dina Zielinski of Columbia University and the New York Genome Center mixed some clever biochemistry with leading-edge communications data-encoding techniques and added a dash of processing power. The result, under the heading of “DNA Fountain,” is a demonstration of the ability to use DNA to store a complete 1.4-MByte operating system, a movie, and other files, for a total of more than 2 MBytes.

This is possible because, at the same time, they have brought a new level of efficiency and reliability to the technique. If DNA data memory must have an acronym to fit into the SRAM, DRAM, NVRAM memory spectrum, then biologic archival read-rarely memory (BARRM) might be one choice.

As illustrated in Figure 1, within the DNA helix each cross-linking nucleotide (nt) contains one of the four nucleobases (bases). The ability to selectively place them in order along a DNA helix backbone offers the possibility of a binary data memory storing two bits per base or nucleotide (i.e. 00, 01, 10, and 11). The bonds between the bases linking the DNA spiral backbones are characterised by either two (A-T) or three (G-C) hydrogen bonds.
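As a concrete, purely illustrative sketch of the two-bits-per-base idea, the snippet below maps bit pairs onto bases and back again. The particular assignment of 00–11 to A, C, G, T is an arbitrary choice for illustration, not the mapping used in the paper.

```python
# Illustrative sketch of the 2-bits-per-nucleotide idea (not the authors' code).
# Each pair of bits is mapped to one of the four bases and back again.

BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def bits_to_dna(bits: str) -> str:
    """Convert an even-length bit string, e.g. '0111', into a DNA string."""
    assert len(bits) % 2 == 0, "bit string must contain an even number of bits"
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bits(dna: str) -> str:
    """Convert a DNA string back into the original bit string."""
    return "".join(BASE_TO_BITS[base] for base in dna)

assert dna_to_bits(bits_to_dna("0111001011")) == "0111001011"
```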

It is suggested that DNA would offer an eye-catching data density of 215 petabytes per gram of DNA, orders of magnitude higher than previous reports.

At its core, the DNA data memory methodology relies on a technique used in data communications in which, instead of repeating the transmission when an erroneous piece of a data stream is received, enough extra data is transmitted to allow the correct data to be recovered by statistical analysis. The technique is based on what are called “fountain” codes.

Fountain codes allow data (such as a file) to be divided into an effectively unlimited number of encoded pieces, in a form that allows the original file to be reassembled from any subset of those pieces, provided the subset is a little larger than the original file.

In data communications, and now for memory, a “fountain” of suitably encoded data is fired at a receiver, which is able to reassemble the file by catching enough “droplets” (the pieces of encoded data). It is immaterial which pieces are received or missed. The water analogy of “fountains,” “droplets,” and “buckets” is now part of the language of these techniques: a bucket full of droplets gives you enough information to extract the original data.
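To give a rough feel for how a fountain works, here is a minimal Luby-transform-style sketch of my own: each droplet is the XOR of a pseudo-random subset of file segments, chosen from a seed carried with the droplet, and the decoder “peels” droplets of effective degree one until the file is recovered. This is illustration only; the actual DNA Fountain encoder uses a carefully tuned degree distribution and the additional screening described below.

```python
# Minimal fountain-code (Luby-transform-style) sketch, for illustration only.
import random

def make_droplet(segments, seed):
    """XOR a pseudo-random subset of segments; the subset is derived from `seed`."""
    rng = random.Random(seed)
    degree = rng.randint(1, len(segments))            # toy degree distribution
    indices = rng.sample(range(len(segments)), degree)
    payload = bytes(len(segments[0]))
    for i in indices:
        payload = bytes(a ^ b for a, b in zip(payload, segments[i]))
    return seed, payload                              # a droplet = (seed, payload)

def decode(droplets, num_segments):
    """Peel droplets of effective degree one until every segment is recovered."""
    recovered = [None] * num_segments
    pending = []
    for seed, payload in droplets:
        rng = random.Random(seed)                     # replay the encoder's choices
        degree = rng.randint(1, num_segments)
        indices = set(rng.sample(range(num_segments), degree))
        pending.append([indices, payload])
    progress = True
    while progress:
        progress = False
        for entry in pending:
            indices, payload = entry
            for i in [j for j in indices if recovered[j] is not None]:
                payload = bytes(a ^ b for a, b in zip(payload, recovered[i]))
                indices.discard(i)                    # XOR out segments already known
            entry[1] = payload
            if len(indices) == 1:                     # degree one: segment solved
                i = next(iter(indices))
                if recovered[i] is None:
                    recovered[i] = payload
                    progress = True
    return recovered

# Toy usage: split a message into 4-byte segments, then catch droplets until
# the whole message comes back; which particular droplets arrive is immaterial.
message = b"FOUNTAIN CODES TURN FILES RAIN! "         # 32 bytes -> 8 segments
segments = [message[i:i + 4] for i in range(0, len(message), 4)]
droplets, seed = [], 0
while True:
    droplets.append(make_droplet(segments, seed))
    seed += 1
    decoded = decode(droplets, len(segments))
    if all(seg is not None for seg in decoded):
        break
print(b"".join(decoded), "recovered from", len(droplets), "droplets")
```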

Fountain codes are only part of the method of changing a binary data stream into a form suitable for translation into strands of DNA. This latest work adds a new twist that accommodates the special stability needs of a potential DNA data memory: emphasising the most desirable sequences and screening out undesirable features such as too high a GC content and long runs of the same base, the latter called homopolymer runs (TTTT...).
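That screening step can be pictured as a simple filter over candidate oligos. The sketch below is mine, with illustrative thresholds (a 45-55% GC window and homopolymer runs of at most three bases) rather than the paper's exact parameters.

```python
import re

def acceptable_oligo(seq: str, gc_low=0.45, gc_high=0.55, max_run=3) -> bool:
    """Return True if a candidate oligo passes the stability screen.

    Thresholds are illustrative: GC fraction within [gc_low, gc_high] and no
    homopolymer run (e.g. 'TTTT') longer than max_run bases.
    """
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    if not gc_low <= gc <= gc_high:
        return False
    # A run of max_run + 1 identical bases anywhere disqualifies the oligo.
    if re.search(r"(.)\1{" + str(max_run) + r",}", seq):
        return False
    return True

print(acceptable_oligo("ACGTGCATGCCATAGCGTAC"))   # balanced GC, short runs -> True
print(acceptable_oligo("ACGTTTTTGCATGCAGGCAT"))   # contains 'TTTTT' -> False
```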

The target of the memory “write” process is to turn the original data stream into a series of DNA oligonucleotides, or “oligos” for short. These can be sent to a company specializing in the manufacture of DNA to order, which returns a small ampoule of the data-encoded DNA.
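Putting the pieces together, one write step might look like the sketch below: prepend the droplet's seed (so a reader can later recreate which segments were XOR-ed together), then convert the bytes to bases two bits at a time. The 4-byte seed field and the layout here are my assumptions for illustration, not the DNA Fountain format.

```python
# Illustrative "write" step: turn one droplet (seed + payload bytes) into an
# oligo string. The 32-bit seed width and layout are assumptions, not the
# exact DNA Fountain format.
BASES = "ACGT"   # index 0-3 encodes two bits per base

def bytes_to_dna(data: bytes) -> str:
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):                 # four 2-bit chunks per byte
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)

def droplet_to_oligo(seed: int, payload: bytes) -> str:
    header = seed.to_bytes(4, "big")               # seed lets the reader rebuild
    return bytes_to_dna(header + payload)          # the segment choice on decode

print(droplet_to_oligo(7, b"DATA"))                # 4 header + 4 payload bytes -> 32 nt
```

Any oligo that fails the stability screen can simply be discarded, with the encoder moving on to the next seed; because fountain droplets are cheap and interchangeable, throwing some away costs almost nothing.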

Resistance? Early days yet. Retention? Keep it close to your body, its perfect environment. It is easy to replicate; nature and evolution have taken care of that, so in theory retention is almost unlimited. As for who wants to store all the data ever generated in the world in a motor vehicle, that niche market must be left to others for the moment.

I thought you would be more interested in the ability of the biochemist to produce and resolve 10 nm diameter clusters while the semiconductor industry struggles (see my final figure). There might be some lessons to be learned there for those struggling with sub-10 nm lithography.

Jim asked: “If you just create random DNA sequences willy-nilly, isn't there a possibility that you will inadvertently create some DNA that, if accidentally released into the environment, might embody itself as some horrible disease? I'm an electrical engineer. I may be missing some fundamental biological fact that renders this concern ridiculous.”

Jim, I raised your question with Yaniv Erlich, PhD, Assistant Professor of Computer Science at Columbia University and Core Member of the New York Genome Center, who replied as follows:

"Great question.

Pathogenic DNA sequences are highly complex. Even tiny viruses such as HIV need a sequence of 10,000 nucleotides (nt), and they also require additional machinery (e.g. a capsid) to execute their pathogenic function, as well as a very specific environment (a sufficient level of viral load, contact with the right tissue, etc.).

DNA storage is based on short DNA fragments that are only 200 nt long, much shorter than the smallest viruses. This alone will effectively eliminate pathogenic reactions. Moreover, the probability of getting a viral genome by random chance is roughly 1 in 4^10,000, which is extremely small even after writing a lot of data.

Finally, for DNA to encode a harmful organism, it needs to be translated into protein sequences. Proteins are the building blocks of molecular machines, from toxins to viral capsids. For such translation to occur, the DNA needs to have a specific combination of letters at the beginning of the segment and a specific combination of letters at the end of the segment. As another layer of protection, we successfully tested (but did not include in the paper) a step that screens the DNA droplets and only accepts droplets with DNA sequences that cannot be translated into long proteins.

Hope it helps!"

So, Jim, it looks as though we are pretty safe from any sort of DNA-data Black Death plague.
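As a rough picture of that extra screen (my own sketch, not the authors' code), one can scan a candidate sequence for long open reading frames, i.e. stretches that begin at an ATG start codon and run in-frame to a stop codon (TAA, TAG, or TGA), and reject any sequence whose longest such frame exceeds a threshold.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq: str) -> int:
    """Length in codons of the longest open reading frame (one strand only, for brevity)."""
    best = 0
    for start in (m.start() for m in re.finditer("ATG", seq)):
        codons = 0
        for i in range(start, len(seq) - 2, 3):
            codons += 1
            if seq[i:i + 3] in STOP_CODONS:
                break
        best = max(best, codons)
    return best

def translation_safe(seq: str, max_codons: int = 20) -> bool:
    """Accept only sequences whose longest ORF is shorter than max_codons
    (an illustrative threshold, not the value used by the authors)."""
    return longest_orf(seq) < max_codons

print(translation_safe("ATGAAACCCGGGTTTTAA"))   # 6-codon ORF -> True
```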
