I am a hobby programmer with a background in biology and have developed an encryption program based on DNA. I tried to make it hard to crack, but it's essentially a substitution cipher and uses the default Java random number generator, so my guess is that it could be cracked relatively easily. But how do I find out how good my encryption is? Can I post an encrypted message here and see if someone can crack it?

Again, I am not a professional cryptographer or programmer, I'm a grad student who does too much outside the lab like attempting to write encryption programs, so if there is already a question about this, I wouldn't know because I don't understand any of the terms I'm seeing in the similar questions.

The comments probably aren't good enough to understand what I'm doing. First, you should know a little about how DNA works. There are 4 bases: A, G, C, and T. DNA codes for proteins, which are made from 20 amino acids. Since $4^1 = 4$ and $4^2 = 16$ are both smaller than 20, we need $4^3 = 64$ possible combinations of 3 bases. This 3-base unit is called a codon. Since 64 is larger than 20, most amino acids are coded for by more than 1 codon, and 3 codons are stop codons, simply marking where the protein ends.

But 20 symbols isn't enough to encrypt a message. I figured 64 symbols would be OK: that gives me all the letters (uppercase only), all the numbers, and most of the punctuation. I also wanted each symbol to be represented by more than 1 codon, so instead of a 3-base codon, I used a 4-base codon, which gives $4^4 = 256$ possible combinations. So I assigned each symbol 4 random 4-base codons.
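The codon assignment described above can be sketched roughly as follows. This is only an illustration, not the original Java program: the 64-symbol alphabet, the name `make_codebook`, and the use of Python's `secrets.SystemRandom` (instead of a default, non-cryptographic generator) are all assumptions.

```python
import itertools
import secrets
import string

BASES = "ACGT"
# Assumed alphabet: uppercase letters, digits, space and punctuation,
# truncated to exactly 64 symbols. The post does not list its real alphabet.
SYMBOLS = (string.ascii_uppercase + string.digits + " " + string.punctuation)[:64]


def make_codebook(rng=None):
    """Give each of the 64 symbols four distinct random 4-base codons."""
    rng = rng or secrets.SystemRandom()
    # All 4**4 = 256 possible 4-base codons.
    codons = ["".join(p) for p in itertools.product(BASES, repeat=4)]
    rng.shuffle(codons)
    # 64 symbols * 4 codons each uses every codon exactly once.
    return {sym: codons[4 * i:4 * i + 4] for i, sym in enumerate(SYMBOLS)}


codebook = make_codebook()
print(codebook["A"])  # four random codons, different on every run
```

The codebook is the only secret in this picture, which is why the answers below treat the scheme as a (homophonic) substitution cipher.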

Another concept from DNA is the reading frame. A DNA double strand has 6 possible ways to translate protein, 3 forward and 3 reverse, depending on whether you start on the first, second, or third base on either end. To mess up the reading frames in my encrypted messages, I insert a random number of random bases between codons. Also, each codon has a 50% chance of being swapped for its complement, so A becomes T, G becomes C, and so on.
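A minimal sketch of that scrambling step, covering the sender's side only. The post does not say how many junk bases are inserted or how the receiver tells junk from codons, so `max_junk` and the function names here are hypothetical, and "complement" is taken to mean base-wise substitution without reversing the order.

```python
import secrets

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(codon):
    """Base-wise complement: A<->T, C<->G (order is not reversed)."""
    return codon.translate(COMPLEMENT)


def scramble(codons, max_junk=3, rng=None):
    """Complement each codon with probability 1/2, then append 0..max_junk junk bases."""
    rng = rng or secrets.SystemRandom()
    pieces = []
    for codon in codons:
        if rng.random() < 0.5:
            codon = complement(codon)
        junk_len = rng.randint(0, max_junk)  # inclusive on both ends
        junk = "".join(rng.choice("ACGT") for _ in range(junk_len))
        pieces.append(codon + junk)
    return "".join(pieces)


print(scramble(["ACGT", "GGCC", "TTAA"]))  # output differs on every run
```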

This means that in order to successfully decrypt a message, you need to find all 4 codons for each symbol, sort out the junk, and determine which codons have been complemented. To further complicate things, you could encrypt the ciphertext with a Caesar cipher or other simple encryption to make it look like you have more than 4 characters and disguise the DNA. Or you could go with a hide-in-plain-sight approach and post the message on any number of publicly available DNA databases.

Don't post an encrypted block here as it would render the question off-topic. Instead edit this post to include the algorithm (not the code, if possible). DNA has low entropy if I'm not mistaken and Java's standard random gen. is not cryptographically secure (although that can be remedied by using another generator). What are you using as key material?
– rath, Nov 9 '13 at 7:34

The absolute minimum is a specification using typical notation and a reference implementation in C. You should also consider an attacker who knows or even chooses both message and ciphertext. If such an attacker can break your scheme then it's very weak by modern standards.
– CodesInChaos, Nov 9 '13 at 9:46

I too am a complete amateur, but I have been playing with several classical ciphers. I would probably not be the best choice to test your algorithm as I am not good at breaking them, but I would be fascinated to know how you used DNA as a basis for a cipher. If you would care to post your algorithm I would be happy to look at it, for what it may be worth. Edit: My apologies. I meant to post this as a comment, not an answer...
– Daniel, Nov 10 '13 at 8:14

3 Answers

Based on your sample code, I do not consider the scheme secure enough for implementation. Additionally, you will run into a few problems if you actually try to implement this to generate DNA strands with encrypted messages in them (à la some kind of futuristic sci-fi thriller).

As the other answer suggests, it would be best to think about using the DNA sequence as a data storage layer. That would allow storage of encrypted or plaintext data of any type. As technology progresses, the cost to generate custom DNA strands will only drop; in 20 years this may be a commonplace method.

Problem 1: Redundancy
If you generate a DNA strand, inject it into a whatever, transport it across a few continents over a period of time, and then try to read it... it will most likely be interpreted as something else. You will need a large amount of redundancy to make sure that degradation of the strand and transcription errors do not destroy the encrypted data. Some species carry dozens or even hundreds of copies of the same sequence so that it will survive intact. At a minimum, you will need a very robust encoding method for the input data that takes into account the time and environment the strand is exposed to.

Problem 2: The Other Side
There are 2 sides to a DNA strand. A on one side is T on the other. If you encode using all 4 bases, you will (possibly) run into a problem when reading the strand. The simplest solution is to use the base pair as a binary value instead of a codon, with AT and CG as 0 and 1. This simplifies the nucleotide encoding algorithm (now there is none!) and allows reading from either side (not either end) of the strand without first determining which side is the correct one (by some termination sequence, perhaps).
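The base-pair-as-bit idea can be illustrated with a short sketch. The function names are hypothetical, and picking randomly between the two bases of a pair is an assumption made here; the answer only says that AT stands for 0 and CG for 1.

```python
import secrets

COMPLEMENT = str.maketrans("ACGT", "TGCA")
# A and T belong to the same base pair (bit 0); C and G to the other (bit 1).
BIT_OF_BASE = {"A": 0, "T": 0, "C": 1, "G": 1}


def encode_bits(bits, rng=None):
    """Write each bit as one base: A or T for 0, C or G for 1."""
    rng = rng or secrets.SystemRandom()
    return "".join(rng.choice("AT" if bit == 0 else "CG") for bit in bits)


def decode_bits(strand):
    """Read one bit per base; the result is the same on either side of the helix."""
    return [BIT_OF_BASE[base] for base in strand]


bits = [1, 0, 1, 1, 0, 0, 1, 0]
strand = encode_bits(bits)
other_side = strand.translate(COMPLEMENT)  # the complementary strand
print(decode_bits(strand) == decode_bits(other_side) == bits)
```

The price is density: each base now carries 1 bit instead of 2.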

Problem 3: The Other End
As you mentioned, reading from the wrong end may be a problem, for which there are 2 solutions. The first is to encode the sequence, then reverse it and append the reversed copy. This makes the strand the same in both directions. The other solution is to use some kind of termination sequence to determine which end to read the strand from.
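The first solution (reverse and append) can be sketched in a few lines. This models "the other end" as plain string reversal, as the answer describes it, and the function names are hypothetical.

```python
def make_end_symmetric(strand):
    """Append the reversed sequence so the result reads the same from both ends."""
    return strand + strand[::-1]


def read_payload(symmetric):
    """Recover the original sequence: it is the first half, whichever end you start from."""
    return symmetric[:len(symmetric) // 2]


s = make_end_symmetric("ACCGT")
print(s)                      # ACCGTTGCCA
print(read_payload(s[::-1]))  # ACCGT
```

The cost is a doubling of the strand length, on top of any other redundancy.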

Problem 4: Bad Code
You don't want to accidentally code for botulinum toxin or something. This is probably not an issue unless you are actually generating DNA strands that may wind up being exposed to living organisms. It can be solved by using an encoding that uses only codons that generate the same amino acid. A long sequence of arginine will not spontaneously generate botox. Arginine has 6 codons that will create it, giving multiple options for encoding. The simple solution to problem 2 may not be very friendly with the solution to this problem, although CGC and CGG both encode arginine, and on the flip side GCG and GCC both encode alanine, so you could encode binary data on either side of the strand and have it generate a long string of the same amino acid!

Using a codon to encode only 1 bit of data, and requiring several codons for genetic redundancy and several bits for data redundancy, will add up VERY quickly. I can see 5 duplicate codons, of which 3 must match, encoding a single bit (15 base pairs per bit), then an (8,4) Hamming code, which takes 16 bits to encode a byte (16 × 15 = 240 base pairs per byte!!), and some termination codes on each end to make sure it is read in the correct way (another 100 or so base pairs per end). Disadvantage: lots of DNA. Advantages: highly redundant storage that is easily read and won't accidentally create a bioweapon.
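The "5 duplicate codons requiring 3 to match" idea might look like the sketch below. Assigning CGC to 0 and CGG to 1 is an assumption (the answer only notes that both encode arginine), the function names are hypothetical, and the sketch handles substituted bases only, not insertions or deletions that would shift the reading frame.

```python
from collections import Counter

# Assumed assignment: both codons encode arginine, per the answer above.
CODON_OF_BIT = {0: "CGC", 1: "CGG"}
BIT_OF_CODON = {codon: bit for bit, codon in CODON_OF_BIT.items()}
COPIES = 5    # duplicate codons written per bit
REQUIRED = 3  # copies that must agree when reading


def encode_bit(bit):
    """Write one bit as 5 identical codons (15 bases)."""
    return CODON_OF_BIT[bit] * COPIES


def decode_bit(bases):
    """Return the bit backed by at least 3 valid codons, or None if unreadable."""
    codons = [bases[i:i + 3] for i in range(0, len(bases), 3)]
    votes = Counter(BIT_OF_CODON[c] for c in codons if c in BIT_OF_CODON)
    if not votes:
        return None
    bit, count = votes.most_common(1)[0]
    return bit if count >= REQUIRED else None


print(len(encode_bit(1)))                   # 15
print(decode_bit("CGGCGGCGG" + "CAGCGC"))   # 1 (three good copies survive)
```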

As for the actual cryptography part, the best method would be to compress your input data, then encrypt it with something like AES-CTR (if no authentication is required), and use the resulting ciphertext to generate the DNA code. The cryptographic method you are using appears to be a simple substitution at first glance, and figuring it out from a known plaintext and knowledge of the algorithm is not difficult.

DNA digital data storage

I also wanted each symbol to be represented by more than 1 codon, so instead of a 3 base codon, I used a 4 base codon, which gives 256 possible combinations. So I assigned each symbol 4 random 4 base codons.

With this kind of encoding, each 4-base codon can take 256 values, so you end up representing one byte per codon, that is, one byte per four bases.

Most current cryptographic algorithms work on groups of bits, commonly requiring the group size to be a multiple of 8 bits, 64 bits, or maybe 128 bits.
At 2 bits per base, such groups can be represented with 4 bases, 32 bases, and so on.

DNA digital data storage refers to some recent experiments in storing digital information on DNA. The concept of storing data on DNA is not new. DNA has one important advantage: its data storage density is some orders of magnitude better than that of current commercially available spinning-disk and solid-state storage. For this reason, it is an interesting research target. The main reason it is not currently used commercially as a storage device is that writing and reading DNA is prohibitively expensive.

Consider:

When thinking of DNA as a binary data storage medium (2 bits per base), it becomes immediately obvious that you can use any current, well-tested cryptographic algorithm, such as AES or RSA, to process the information before storing it on DNA.

Classical cryptographic mechanisms commonly do not offer security strong enough against current attackers.

Result: I would recommend repartitioning the problem: consider DNA as the storage layer and cryptography as a separate layer, and apply the best solutions available at each layer.
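To make the storage layer concrete, here is a minimal sketch of a 2-bits-per-base mapping between bytes and DNA. The assignment A=00, C=01, G=10, T=11 and the function names are arbitrary choices for illustration. The input bytes are assumed to be ciphertext already produced by a vetted implementation of a standard cipher such as AES; no encryption happens in this snippet.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data):
    """Encode each byte as 4 bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def dna_to_bytes(strand):
    """Inverse of bytes_to_dna; the strand length must be a multiple of 4."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4 bases")
    out = bytearray()
    for i in range(0, len(strand), 4):
        value = 0
        for base in strand[i:i + 4]:
            value = (value << 2) | BASES.index(base)
        out.append(value)
    return bytes(out)


ciphertext = bytes([0x1B, 0x00, 0xFF])  # stand-in for real ciphertext bytes
print(bytes_to_dna(ciphertext))         # ACGTAAAATTTT
```

Because the mapping is a plain, keyless encoding, all of the security comes from the cryptographic layer above it.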

This was never about compression or storage; I realized pretty quickly that even short messages would become very long sequences of DNA. My method isn't suited for anything other than text. And you're right about reading and writing DNA being expensive. If I could just have 5000 bases of DNA printed on a machine, I could actually get something done in lab instead of spending all my time building DNA.
– user137, Nov 11 '13 at 16:15

The point being made by all the answers currently written is that you should consider your scheme in two parts:

Cryptography Layer

DNA encoding layer

You are using DNA as a storage mechanism (about which I'm sure you know more than me), and as noted in the other answers there are some issues and research papers. So, taking into account that it may well be that writing arbitrary DNA is not possible (eg as Richie's answer points out you can't make a virus), let's assume you've managed to create a sufficiently large list of DNA strings that you can confidently read/write.

My contribution to answering your question is to put forward the suggestion of using Format Transforming Encryption. Only recently developed, the idea is to efficiently combine [authenticated] encryption with an encoding engine that maps arbitrary binary strings onto elements of some language (using regular expressions - read the paper, it's interesting).

So, to summarise: There are solutions for encrypting information into DNA, but I suspect that's not really what you were interested in. Unfortunately, your crypto scheme has some pretty serious issues once we assume the attacker is in a similar situation to the legitimate user (ie he has access to everything apart from the key).$^{[1]}$ Clearly, you might decide that in the case of reading DNA this is not accurate - that actually your adversary would not be able to read the DNA you'd encoded due to lack of equipment, but in that case you might as well just use a direct encoding.

[1] Reading again, I'm not even clear that there is any cryptography in your suggestion other than adding random data every now and then (which would be just as hard for the legitimate user as an attacker).