Description

Every one of us is creating, storing and transmitting lots of data every
day. Text messages, emails and cloud storage are only a few examples.
Since 2002 the amount of digitally stored data has exceeded the
amount of data stored on analog media. By now less than 6 % of the
world’s data is still analog [1].

It is not surprising that data breaches orchestrated by hackers
are on the rise as well. Financial and legal records, military and
government documents, these are examples of important information
that must be preserved for a long time, but could cause great
damage in the wrong hands. Moreover, the Dutch National Cyber
Security Center revealed that it is us citizens who are most likely to be
attacked by cybercrime. We have become a civilization dependent on
information, and this information must be stored somewhere. As a
result, we are faced with two problems: where do we store all of
our data, and how do we keep it safe?

In addition to offering little security, digital and analog data storage
require large amounts of resources such as storage space and
electricity. Data is stored in huge data centers, as can be seen in Figure 1. In 2015, 416.2 TWh were used for the storage of digital
data [3], costing 41 billion USD. This is higher than the annual
power consumption of the entire UK [2], and is responsible for
approximately 2% of global greenhouse emissions, rivaling the
airline industry [4]. In 2015 about 2,500,000 TB of new data were
produced per day [1]. While the densest storage medium in use has a
capacity of 10 GB/mm³, DNA can store data at a density of up to
10⁹ GB/mm³ [5]. Furthermore, the demand for silicon, which is
required for flash memory, is expected to exceed silicon supply by
2040 [6].
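
To put these two densities side by side, the following minimal sketch estimates the physical volume needed to hold one day of new data at each density. It assumes decimal units (1 TB = 1,000 GB) and reads the DNA figure as 10⁹ GB/mm³; the function and constant names are illustrative only.

```python
# Back-of-the-envelope volume comparison using the figures quoted in the text.
# Assumptions: decimal units (1 TB = 1000 GB); DNA density taken as 1e9 GB/mm^3.

DAILY_DATA_TB = 2_500_000      # new data produced per day in 2015 [1]
CONVENTIONAL_GB_PER_MM3 = 10   # densest storage medium in use
DNA_GB_PER_MM3 = 1e9           # upper bound quoted for DNA [5]


def volume_mm3(data_tb: float, density_gb_per_mm3: float) -> float:
    """Return the volume in mm^3 needed to store data_tb terabytes."""
    return data_tb * 1000 / density_gb_per_mm3


conventional = volume_mm3(DAILY_DATA_TB, CONVENTIONAL_GB_PER_MM3)
dna = volume_mm3(DAILY_DATA_TB, DNA_GB_PER_MM3)

# 1 m^3 = 1e9 mm^3
print(f"Conventional medium: {conventional / 1e9:.2f} m^3")
print(f"DNA:                 {dna:.2f} mm^3")
```

Under these assumptions, one day of data would occupy roughly a quarter of a cubic metre on the conventional medium and about 2.5 mm³ in DNA.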

Figure 1. A data center. (Source: BalticServers)

The iGEM team Groningen 2016 thinks it is about time to develop
a safe and novel data transmission system, which scientists in
particular urgently need. As our team consists of
biologists as well as computer scientists, we developed a
multidisciplinary approach: safe encryption of digital data,
conversion into a DNA sequence, and integration into a bacterial
genome. This approach provides a system with multiple digital and
biological safety layers. DNA is a practically inexhaustible resource, and inside a
spore it is protected from environmental influences.

Storage of data in DNA was proposed as early as the 1960s,
but has only recently become a hot topic [7]. This is in part due
to the ever-growing demand for data storage, as well as
advancements in DNA synthesis and sequencing technologies. Our goal
is to create a system for long-term data storage and data transfer
which cannot be hacked by digital means. Digital methods of
encrypting information and converting it into binary code are well
established, and data storage in DNA has already been demonstrated.
Our project combines these two approaches by first converting
information into binary code, encrypting it, and then storing it
safely in DNA. Additional measures based on molecular biology will
prevent unauthorized access, ensuring the safety of the stored
information.

Our system will be useful for the kind of information that
should be stored and transferred in a very secure manner, but does
not have to be accessed within seconds. It will be possible to
obtain the message in about 24-48 hours; however, this timeframe is
likely to be reduced as new sequencing technologies are developed.

DNA is a far more stable data storage medium than
magnetic and optical media, remaining intact for at least 700,000
years at -4 °C [8]. Even in harsh environments, DNA has a half-life
of over 500 years [9]. In contrast, current storage technology is rated to
last only up to 30 years [10]. Given the stability and compactness
of DNA, our system could also be adapted to serve as a time capsule
for human knowledge. Additionally, DNA storage will soon become a
cheaper alternative for data storage as DNA synthesis and
sequencing costs drop. It is estimated to become a cost-effective
method for long-term data storage within approximately ten years
[11].

Figure 2. DNA. (Source: Nogas1975)

DNA data storage is an apocalypse-proof technology because DNA
will be relevant to future civilizations. As long as intelligent
DNA-based life exists, there will be compelling reasons to study
and manipulate DNA.

Our system is especially safe as it cannot be hacked by computer
scientists and it requires the specific knowledge of the recipient
to retrieve the stored data.

As our project is built from BioBricks, it is easily adjustable to individual wishes. Message and key are easy to implement and exchange, and the biological safety layers can be customized. In keeping with the iGEM values, we worked with BioBricks that we designed ourselves as well as with BioBricks from previous teams.
Read more about how we worked on the
characterization of the B. subtilis integration vector BBa_K823023.

History of data storage

The concept of storing data in DNA molecules was conceived and
published in the 1960s by the Soviet physicist Mikhail Samoilovich
Neiman [12]. He came up with the idea that digital data can be
stored in the base sequence of DNA. Because DNA consists of four
different nucleotides, each position can encode two bits, so the
information density can be up to two times higher than in our
familiar binary storage systems. Since
the transition from analog to digital storage devices, the storage
half-life of our information has dropped considerably. Besides
this, current efforts to guarantee the longevity of digital data storage
are scarce [13]. Optical and magnetic storage devices are not
reliable for long-term data storage. When DNA is encapsulated
within silica spheres and stored at -18 °C, it is possible to
recover data more than 1 million years later. Even without
a freezer, data stored in central Europe will be safe for up to 2,000
years [14]. Storing information in DNA thus offers both a higher
density and a much longer lifetime.

The first message was not actually stored in DNA
until 1988, when Davis managed to do so [15]. In 2010,
scientists were able to encode 7920 bits in synthetic DNA [16]. Up
to this milestone it was difficult to write and read long, error-free
DNA sequences. In 2012, Church et al. developed a new strategy to
encode arbitrary digital information with an encoding scheme that
makes use of improved DNA synthesis and sequencing technologies. In this
research they were able to encode and decode a message containing a
little over 50,000 words and a few images [17]. To learn more about data storage, we visited the archives of the city of Groningen (Figure 3).

Figure 3. A visit to the archives of Groningen.

Encryption

The original message is encrypted into a new message (the
ciphertext) by using the Rijndael algorithm [18], which was
developed in 1998 by two Belgian cryptographers, Joan Daemen and
Vincent Rijmen. In November 2001 this algorithm was selected by the
U.S. National Institute of Standards and Technology (NIST) as the
new Advanced Encryption Standard (AES) [19]. Since then, it has
been adopted by the U.S. government to secure highly classified
data and has been used worldwide. The process of encryption is represented in figure 4.

After encryption, the message will be converted into a binary
message, by making use of the American Standard Code for
Information Interchange (ASCII). ASCII encodes characters into
integers, which can be represented as sets of binary digits.

The encrypted binary message is translated into a sequence of
the nucleotides ACTG by using the following translation scheme: the
binary pair 00 will be represented as A, 10 as T, 01 as C and 11 as
G (see table 1). Subsequently, the obtained string of nucleotides
is integrated into the DNA of Bacillus subtilis, which will serve
as the carrier organism for our secret message. For example, table
2 shows the translation of the plaintext “Hello world” into a
sequence of nucleotides (encryption is not applied in this
example).
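
The translation described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual software: it assumes each ASCII character is padded to 8 bits, and the function names are hypothetical. As in the example from the text, no encryption is applied.

```python
# Translation scheme from the text: 00 -> A, 10 -> T, 01 -> C, 11 -> G.
BITS_TO_NUCLEOTIDE = {"00": "A", "10": "T", "01": "C", "11": "G"}


def text_to_bits(text: str) -> str:
    """Encode text as ASCII and return a string of 0s and 1s.

    Assumption: every character is written as 8 bits (leading zero added).
    """
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def bits_to_nucleotides(bits: str) -> str:
    """Translate consecutive bit pairs into nucleotides."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string must have an even length")
    return "".join(BITS_TO_NUCLEOTIDE[bits[i:i + 2]] for i in range(0, len(bits), 2))


message = "Hello world"
bits = text_to_bits(message)
sequence = bits_to_nucleotides(bits)
print(bits)
print(sequence)
```

With this scheme every character becomes four nucleotides; the letter "H" (binary 01001000), for instance, is written as CATA.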

The same strategy is applied to the encryption key, which is
integrated into the DNA of a separate Bacillus subtilis strain in
the same manner. In order to retrieve the original message, the
message needs to be decrypted by using the same key that was used
for encryption.
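
Before decryption can take place, the sequenced nucleotides of both the message and the key have to be translated back into binary. The sketch below shows only this reverse translation, under the assumption of four nucleotides per byte; the AES decryption step itself is not shown and would require a cryptography library. The function name and the example read are hypothetical.

```python
# Inverse of the translation scheme: A -> 00, T -> 10, C -> 01, G -> 11.
NUCLEOTIDE_TO_BITS = {"A": "00", "T": "10", "C": "01", "G": "11"}


def nucleotides_to_bytes(sequence: str) -> bytes:
    """Translate a nucleotide sequence back into the bytes it encodes."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4 (4 nucleotides per byte)")
    bits = "".join(NUCLEOTIDE_TO_BITS[base] for base in sequence.upper())
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


# Hypothetical sequencing read; in practice it would come from the
# message strain or the key strain.
recovered = nucleotides_to_bytes("CATACTCC")
print(recovered.decode("ascii"))
```

The same function would serve for the key and for the ciphertext, since both are stored with the same translation scheme.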

Table 2. The message “Hello world” is
translated into a sequence of nucleotides ACTG.

Figure 4. The text “Hello world!” is encrypted by
using a key. Subsequently, both the key and the
encrypted binary message are translated into
nucleotide-sequences.

DNA Synthesis

DNA synthesis is the natural or artificial creation of
deoxyribonucleic acid (DNA) molecules. In the cell, each of the two
strands of the DNA molecule acts as a template for the synthesis of
a complementary strand. Based on a similar principle, the polymerase
chain reaction (PCR) is used for DNA synthesis in vitro.
Furthermore, advances in science have made it possible to
synthesize novel DNA sequences artificially
[20].

DNA and DNA replication mechanisms appeared late in the early history
of life: DNA traces its origin to RNA/protein [21]. To
remain within the scope of our project, however, we focus on the natural process of
DNA replication, which proceeds in three enzymatically catalyzed and
coordinated steps: initiation, elongation and termination.

The first step in DNA replication involves the unzipping of the
double helix structure of the DNA. This is carried out by an enzyme
called helicase, which breaks the hydrogen bonds between the
complementary base pairs of DNA (A with T, C with G). This leads
to the separation of two single strands of DNA, creating a ‘Y’
shaped replication fork. The two separated strands serve as
templates for making new strands of DNA subsequently.

One of the strands is oriented in the 3’ to 5’ direction
(towards the replication fork), whereas the other strand is
oriented in the 5’ to 3’ direction (away from the replication
fork). Due to this difference in their orientation, the two
strands replicate differently.

A short piece of RNA called a primer (produced by an
enzyme called primase) binds to the end of the leading strand
(3’ to 5’). The primer acts as the starting point for DNA
synthesis. Thereafter, DNA polymerase binds to the leading
strand and adds new complementary nucleotides along the
template DNA in the 5’ to 3’ direction.

The replication process for the lagging strand (5’ to 3’)
involves multiple RNA primers binding at random points along
the template DNA. This leads to the formation of short chunks
of DNA in the 5’ to 3’ direction, called Okazaki fragments.

Once the base pairs are formed (A with T, C with G), another
enzyme called exonuclease dissociates the primer from the DNA
strand. The gaps are filled by complementary nucleotides and the
new strand of DNA is proofread by DNA polymerase. Finally, an
enzyme called DNA ligase seals the sequence of DNA into two
continuous double strands, following which the new DNA
automatically winds up into a double helix.
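
The base-pairing rule that drives replication (A with T, C with G) can be illustrated with a short sketch. It deliberately ignores the enzymes, primers and Okazaki fragments described above and only shows how the sequence of a new strand follows from its template; the function names are illustrative.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def synthesize_complement(template_3_to_5: str) -> str:
    """Return the new strand (written 5'->3') built along a template read 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5.upper())


def reverse_complement(strand_5_to_3: str) -> str:
    """Return the complementary strand of a 5'->3' sequence, also written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand_5_to_3.upper()))


print(synthesize_complement("TACG"))
print(reverse_complement("ATGC"))
```

Taking the reverse complement twice returns the original sequence, which reflects the fact that either strand of the double helix is sufficient to reconstruct the other.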

Kary Mullis invented the PCR technique (Figure 5) in 1985 [22]. His work
revolutionized the process of making millions of copies of a scarce
sample of DNA, and earned him the Nobel Prize in Chemistry in 1993.
The procedure follows the basic principle of DNA replication in
vivo. A small amount of the DNA containing the gene of
interest is aliquoted into a tube containing nucleotides, primers
(a pair of short synthesized DNA segments that match segments on
each side of the desired gene), DNA polymerase and a buffer
that allows optimal activity of the polymerase. Thereafter,
the tube containing the mix is subjected to repeated cycles of
heating and cooling, which leads to amplification of the gene of
interest.

The first step in a regular cycling event involves heating
the mix to 94–98 °C for 20–30 seconds. It disrupts the
hydrogen bonds between the complementary base pairs, yielding
single strands of DNA.

This is followed by lowering of temperature to 50–65 °C for
20–40 seconds, allowing the primers to form hydrogen bonds and
bind to the single-stranded DNA template. The polymerase binds
to the primer-template hybrid and begins DNA formation.

Following the binding of primers, the temperature is
increased to about 72 °C, at which the DNA polymerase synthesizes
a new DNA strand complementary to the template strand by
adding complementary dNTPs in the 5' to 3'
direction.

The processes of denaturation, annealing and elongation
together constitute one cycle. Multiple cycles are required to amplify
DNA.
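
Why multiple cycles matter can be seen from a simple estimate: if every template is duplicated in each cycle, the number of copies doubles per cycle. The sketch below assumes this idealized behavior; the `efficiency` parameter is an illustrative addition for less-than-perfect doubling and is not taken from the text.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Estimate the number of template copies after a number of PCR cycles.

    efficiency is the fraction of templates duplicated per cycle
    (1.0 = perfect doubling, an idealized assumption).
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1 + efficiency) ** cycles


for n in (10, 20, 30):
    print(f"{n} cycles: {pcr_copies(1, n):.3g} copies")
```

With perfect doubling, a single template yields 2^10 = 1,024 copies after 10 cycles and about a billion after 30, which is how PCR produces workable amounts of DNA from a scarce sample.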

For over 60 years, the synthetic production of new DNA sequences
has helped researchers understand and engineer biology. Gene
synthesis is also accelerating research in well-established
research fields by providing critical advantages over more laborious
traditional molecular cloning techniques. De novo DNA synthesis
involves the chemical synthesis of relatively short but specific
fragments of nucleic acids. Chemical oligonucleotide synthesis does
not have the limitation of unidirectional nucleotide addition (5’
to 3’), as compared to the naturally occurring DNA synthesis and
PCR. To obtain the desired oligonucleotide, loose nucleotides are
sequentially added to the growing oligonucleotide chain in the
required sequence. Typically, synthetic oligonucleotides are
single-stranded DNA or RNA molecules.

The synthesis starts with a non-nucleosidic linker being
attached to a solid support material. The oligonucleotide sequence
remains covalently bound to the support material over the entire
course of the chain assembly via its 3'-terminal hydroxy group. The
chain assembly is then continued until the completion, after which
the release of the oligonucleotides occurs by the hydrolytic
cleavage of a P-O bond that attaches the 3’-O of the 3’-terminal
nucleotide residue to the universal linker.

Synthetic genes offer several advantages over cloned native DNA.
The private companies that synthesize DNA subject these sequences to
stringent quality checks to ensure 100% sequence verification.
Moreover, artificial DNA synthesis
gives researchers the flexibility to change enzyme
specificities and activities to suit the needs of their
experiments. The synthesis of specific sequences also allows the
insertion of localization signals to target specific
proteins or nucleic acids in vivo.

Decoding – costs and fidelity

The cost of sequencing DNA has plummeted in the last two
decades. For instance, in the early 2000s it took 13 years and 3
billion US dollars to sequence the entire human genome. With
current technologies, we have approached the 1,000 dollar mark. This development can be seen in Figure 6. In
fact, since 2007 you can have your whole DNA sequenced for less
than that! However, there are still issues with the fidelity and
accuracy of the readings obtained that would prevent them from
being used for our system [27].

Figure 6. Price of sequencing one million base pairs. Prices have dropped
sharply since 2007, when second-generation techniques were introduced to the
market. Source: National Human Genome Research Institute

Existing laboratory-level DNA sequencing technologies typically
allow for a reading error of ~1%. Thanks to optimization and
fine-tuning, traditional reads using Sanger biochemistry now offer
accuracies of up to 99.999% over a read length of 1,000 bp. With
modern second-generation (cyclic) sequencing, higher reading
lengths have been achieved, but with a decrease in accuracy [26].

Among its bioencryption layers, CryptoGErM prevents unauthorized
parties from reading the message by including a high ratio of decoy
spores that contain a useless sequence. In addition, if the right
growing conditions are not supplied, our system prevents germination
and replication of the spores that do contain the message. So, what
ratio of decoy to message spores is needed to prevent brute-force sequencing and
message retrieval by a third party? (See Figure 7.)
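
One way to frame this question is purely statistical: even with error-free reads, an attacker has to sample enough reads to encounter the message at all. The sketch below covers only this sampling aspect. It assumes, as a strong simplification, that each read is drawn independently and is equally likely to come from any spore, and it ignores sequencing errors, which are discussed in the following paragraphs. All names are illustrative.

```python
import math


def message_fraction(decoys_per_message_spore: int) -> float:
    """Fraction of spores carrying the message at a 1:N message:decoy ratio."""
    if decoys_per_message_spore < 1:
        raise ValueError("at least one decoy per message spore is required")
    return 1 / (1 + decoys_per_message_spore)


def prob_at_least_one_message_read(decoys_per_message_spore: int, reads: int) -> float:
    """Probability that at least one of `reads` random reads comes from a message spore."""
    f = message_fraction(decoys_per_message_spore)
    return 1 - (1 - f) ** reads


def reads_needed(decoys_per_message_spore: int, target_probability: float) -> int:
    """Smallest number of reads giving at least target_probability of one message read."""
    if not 0 < target_probability < 1:
        raise ValueError("target_probability must be strictly between 0 and 1")
    f = message_fraction(decoys_per_message_spore)
    return math.ceil(math.log(1 - target_probability) / math.log(1 - f))


for decoys in (150, 10_000):
    print(f"1:{decoys} ratio -> {reads_needed(decoys, 0.95)} reads for a 95% chance")
```

In this simplified model the number of reads required grows roughly in proportion to the number of decoys, so the decoy ratio alone raises the cost of brute-force sequencing but does not make it impossible.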

Figure 7. A) One of the bioencryption layers is a
high proportion of decoy spores relative to those that contain the
intended useful sequence. If the right conditions are not met
(i.e. adding a specific antibiotic X to the system), the spores that
contain our message will die and be outgrown by the decoys. B)
Current sequencing techniques will not be able to distinguish
the hidden message from noise.

Fox et al. (2014) report that there is a 50% chance of
accurately distinguishing a true subclonal variant from a
sequencing artifact in an excess of 100 wild-type DNA sequences
using standard Q30-filtered reads (error rate: 10⁻²) [23]. In
agreement with that result, in an experiment carried out to detect
genomic variations in marine pests, Pochon (2013) was able to
detect one variant out of 150 wild-type sequences [24].

Reading DNA has become increasingly accurate. Schmitt et al.
(2012) developed a method called Duplex Sequencing that uses both
strands of DNA to obtain a more precise consensus sequence, yielding
a theoretical error rate of 3×10⁻¹⁰ [25]. At two bits per nucleotide, that means that we could
transmit a message with a length on the order of a gigabyte while
expecting only about one erroneous base! On the other hand, it also means that such
methods allow more precise measurements in mixtures of decoy and message spores, and a
1:150 spore:decoy ratio might be insufficient in the future. In
fact, using Duplex Sequencing the authors were able to identify one mutant
sequence per 10,000 wild-type molecules.
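
The effect of the read error rate on message fidelity can be estimated with simple arithmetic: the expected number of misread bases is the number of nucleotides multiplied by the per-base error rate. The sketch below assumes two bits per nucleotide (as in our translation scheme), decimal gigabytes, and independent errors; the names are illustrative.

```python
BITS_PER_NUCLEOTIDE = 2  # one nucleotide per bit pair, as in the translation scheme


def expected_errors(message_bytes: int, error_rate_per_base: float) -> float:
    """Expected number of misread nucleotides for a message of the given size."""
    nucleotides = message_bytes * 8 / BITS_PER_NUCLEOTIDE
    return nucleotides * error_rate_per_base


ONE_GB = 10**9  # decimal gigabyte (assumption)

for label, rate in [("~1% read error", 1e-2), ("Duplex Sequencing", 3e-10)]:
    print(f"{label:>18}: {expected_errors(ONE_GB, rate):.3g} expected errors per GB")
```

Under these assumptions a gigabyte corresponds to 4×10⁹ nucleotides, so a 1% error rate would corrupt tens of millions of bases, whereas the theoretical Duplex Sequencing error rate leaves about one expected error.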

Integration in B. subtilis

The chassis B. subtilis 168

The key and message sequences are stored in Bacillus subtilis (B.
subtilis), a gram-positive, catalase-positive bacterium usually
found in soil and gastrointestinal tracts. This rod-shaped
bacterium is about 4-10 µm in length and has a diameter of about
0.25–1.0 µm [28]. The cell is heavily flagellated, allowing the
microbe to move quickly in liquid medium. It is one of the most
well-studied gram-positive microbes, and is a widely
adopted model organism for studying bacterial cell differentiation and
sporulation. These cells are amenable to a wide array of genetic
manipulations, and the ease of transformation has allowed B.
subtilis to be used for selective protein expression to suit our
requirements. Moreover, the integration of DNA sequences into its
genome is well established and offers an alternative to using
plasmids. The process of integration is demonstrated in Figure 8.

Figure 8. Integration of key and message sequence into the genomic DNA of Bacillus subtilis.

Storing and sending

Spores

B. subtilis forms endospores when environmental factors do not
favour survival or reproduction [29]. These endospores are highly
resistant and durable structures. They have a central cytoplasmic
core where the DNA and ribosomes are protected by an impermeable
and rigid coat. Spores, when released into the environment, can
survive extreme heat and freezing, lack of water, high pressure,
exposure to many toxic chemicals and certain types of radiation. Compared
to vegetative cells, endospores are enclosed in a thicker cell
wall along with additional layers that allow them to survive long periods
in a dehydrated, metabolically inactive state.

Sporulation

The process of endospore formation within a B. subtilis cell can take
several hours to complete and is called sporulation or
sporogenesis. Sporulation can be induced manually in the lab by
limiting the availability of a key nutrient, such as the carbon or
nitrogen source.

Sporulation begins when a small portion of cytoplasm, along with
a newly replicated bacterial chromosome, is isolated by an
ingrowth of the plasma membrane called the spore septum. The spore
gets a double-layered membrane that surrounds the chromosome
and cytoplasm. This structure is entirely enclosed within the
original cell, and is called the forespore.

Peptidoglycan layers are laid down between the two membrane
layers. A thick spore coat, which is responsible for the resistance
of endospores, is formed around the outer membrane. The last
stage of sporulation is the degradation of the original cell and
the release of the endospore [30].

Germination

An endospore returns to its vegetative state by a process called
germination. This process is triggered by physical or chemical
damage to the endospore coat. The enzymes of the endospore then
break down the extra layers surrounding it. Water
enters and metabolism resumes [30].

Natural Competence

The natural competence of B. subtilis is one of its less used
advantages. Competent B. subtilis cells can actively take up DNA fragments
from their environment. The DNA taken up can change the
genotype through homologous recombination, a process also known as natural
transformation. In order to integrate DNA from the
medium, the cells synthesize a specific DNA-binding and uptake
system, as seen in Figure 9. The figure shows several of the proteins
that form the translocation complex (A, NucA; C, ComC; E,
ComE; F, ComF; G, ComG; CW, cell wall; CM, cell membrane; CYT,
cytoplasm). This system has no sequence specificity for DNA; therefore, B.
subtilis can take up and integrate plasmid DNA, phage DNA or chromosomal
DNA [31].

Figure 9. Natural competence of B. subtilis

Storing of spores

Bacterial spores are tough, non-reproductive structures produced
by bacteria. They are highly resistant to aging, radiation, heat
and chemical damage. Endospores formed by Bacillus subtilis could
be found viable after millions of years [32]. These properties make
them the ideal storage medium for data in DNA.

During freeze-drying, water is removed from a substance to
increase its storage life and to make shipping easier. The most
effective method for long-term storage, and thus for shipping, of
Bacillus subtilis spores appears to be freeze-drying [33].
Freeze-drying is commonly used for long-term storage of bacteria
[34] and of B. subtilis spores. Beyond the fact that spores are
highly resistant to a range of harsh conditions, Fairhead et al.
showed that spores of B. subtilis are very resistant to several
cycles of freeze-drying [35]. The sending process is demonstrated in Figure 10.

[21] Andras, P. and Andras, C. 2005. The origins of life – the “protein interaction world” hypothesis: protein interactions were the first form of self-reproducing life and nucleic acids evolved later as memory molecules. Medical Hypotheses. 64, 4 (2005), 678–688.