Even more remarkable is that non-overlapping triples of the letters A, C, G, and T, which are often referred to as codons (especially when the triples are part of a gene), may spell out (reading in the 5' to 3' direction) the order of the amino acids that form the long linear molecules known as proteins. There are 20 amino acids which serve as the building blocks for proteins. Since there are 64 possible triples of the 4 letters A, C, G, and T (repeated letters within a triple are allowed) but only 20 amino acids, different triples can represent the same amino acid. There are also 3 triples (TAA, TAG, TGA), known as stop codons, that do not represent (a common phrase being "do not code for") an amino acid but instead signal the termination of the protein production process. One particular codon (ATG), which does represent one of the amino acids, also indicates the start of the production of a protein. This system for coding proteins has come to be known as the genetic code. Before the genetic code was fully worked out, through the efforts of Marshall Nirenberg, Heinrich Matthaei, Har Gobind Khorana, and others, various mathematical ideas were invoked to suggest what kind of code might be involved. A diagram of the RNA version of the genetic code, in which U takes the place of T, appears below.
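
The counting in the paragraph above can be checked with a few lines of Python. This is only an illustrative sketch: the names ALL_CODONS, CODING_CODONS, and split_into_codons are invented here and do not come from any library. It lists all 64 triples, sets aside the 3 stop codons, and shows how a DNA string is cut into non-overlapping codons. Because 61 coding triples remain for only 20 amino acids, some amino acids must be represented by more than one triple.

```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

# All ordered triples with repeated letters allowed: 4 * 4 * 4 = 64.
ALL_CODONS = ["".join(triple) for triple in product(BASES, repeat=3)]

# Triples that code for an amino acid: the 64 triples minus the 3 stop codons.
CODING_CODONS = [codon for codon in ALL_CODONS if codon not in STOP_CODONS]


def split_into_codons(dna):
    """Cut a DNA string into non-overlapping triples, reading left to right.

    Any 1 or 2 leftover letters at the end are dropped (an assumption made
    for this sketch; they cannot form a complete codon).
    """
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


if __name__ == "__main__":
    print(len(ALL_CODONS), len(CODING_CODONS))
    print(split_into_codons("ATGGCCTAAG"))
```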

One approach to finding a gene is to look for a stretch of DNA which starts with ATG (the start codon) and ends with one of the stop (or termination) codons, with the letters in between read as a whole number of non-overlapping triples. Such a stretch of DNA is referred to as an open reading frame (ORF). However, not every ORF corresponds to a stretch of DNA which will initiate the production of a protein, and so locating an ORF is not equivalent to finding a gene. A major problem facing biologists is how to locate those stretches of DNA that are genes within the large amounts of DNA found in a chromosome.
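
The search for ORFs described above can be sketched in Python. This is a simplified illustration under stated assumptions, not a gene finder. It scans only the given strand (the complementary strand is ignored), tries each of the three possible reading frames, and reports a stretch from the first ATG in a frame to the next stop codon in the same frame. The function name find_orfs and the parameter min_codons are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) index pairs of ORFs on the given strand.

    start is the index of the A in ATG; end is one past the last letter of
    the stop codon, so dna[start:end] is the ORF including both codons.
    min_codons counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the ATG currently open in this frame
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)


if __name__ == "__main__":
    sequence = "CCATGAAATGATT"
    for start, end in find_orfs(sequence):
        print(start, end, sequence[start:end])
```

As the text cautions, every pair returned is only a candidate: a real gene search has to bring in further evidence beyond the presence of a start and a stop codon.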

The way that DNA is involved in the production of proteins is not direct. The mechanism involves another helical molecule called RNA (ribonucleic acid), of which there are a variety of types. (Among the types of RNA are messenger RNA, or mRNA; ribosomal RNA, or rRNA; and transfer RNA, or tRNA.) When a protein is to be produced, the two strands of the DNA separate and a copy of the gene is transcribed into a strand of RNA (molecular biologists refer to this process as transcription). RNA, like DNA, uses an alphabet of 4 nucleotides, but its letters are A, C, G, and U (for uracil, a pyrimidine like thymine, which takes the place of T). The protein is then produced from this RNA in a somewhat complex series of steps.
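
At the level of letters, transcription can be sketched in Python as follows. The function names are hypothetical and the model is deliberately crude, ignoring all of the chemistry. The first function assumes the gene is given as the strand that reads like the RNA (with T where the RNA has U). The second assumes the input is the complementary strand, written in the 3' to 5' direction, and pairs each letter with its partner (A with U, C with G, G with C, T with A).

```python
# Pairing used when RNA is built against a DNA strand: A-U, C-G, G-C, T-A.
_DNA_TO_RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")


def transcribe(coding_strand):
    """Return the RNA version of a DNA strand by writing U in place of T."""
    return coding_strand.upper().replace("T", "U")


def transcribe_template(template_3_to_5):
    """Return the RNA (5' to 3') built against a template strand.

    Assumption for this sketch: the template is supplied in the 3' to 5'
    direction, so no reversal is needed, only letter-by-letter pairing.
    """
    return template_3_to_5.upper().translate(_DNA_TO_RNA_COMPLEMENT)


if __name__ == "__main__":
    print(transcribe("ATGGCCTAA"))
    print(transcribe_template("TACCGGATT"))
```

Both calls in the usage example describe the same toy gene from its two strands, so they produce the same RNA string.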

A very rough schematic of how genes create proteins. This schematic is often referred to as the central dogma.

In this process, sections of the RNA copied from the original DNA, known as introns, which are not involved in the production of a protein, are snipped out, and the remaining sections, known as exons, are joined into a single string. What is left is sometimes referred to as a coding sequence, and it contains the codons (in the letters A, C, G, and U) which specify the protein to be built by the process usually referred to as translation. The actual manufacture of proteins involves structures called ribosomes. In cells that have a nucleus, the production of proteins does not go on within the nucleus but in the cytoplasm surrounding it. One can express the codons involved in the manufacture of proteins either in DNA terms (using the letters A, C, G, and T) or in RNA terms (using the letters A, C, G, and U).
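
The removal of introns and the reading of codons can be illustrated with a short Python sketch. It uses the standard genetic code in RNA letters, written with the usual one-letter abbreviations for the amino acids and "*" for a stop codon. The function names splice and translate, the toy sequence, and the exon positions are all invented for this example; in a real cell the exon boundaries are found by the splicing machinery, not handed over as a list.

```python
from itertools import product

# Standard genetic code, codons ordered with U, C, A, G in each position.
RNA_BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(RNA_BASES, repeat=3), AMINO_ACIDS)
}


def splice(pre_mrna, exons):
    """Join the exon stretches, given as (start, end) pairs with end excluded."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Read codons from the first AUG until a stop codon or the end of the string."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: end of the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Toy transcript: exon "AUGGCC", intron "GUCCAG", exon "UGGUAA".
    pre_mrna = "AUGGCC" + "GUCCAG" + "UGGUAA"
    coding_sequence = splice(pre_mrna, [(0, 6), (12, 18)])
    print(coding_sequence, translate(coding_sequence))
```

The table is built from a 64-letter string rather than typed out codon by codon, which keeps the sketch short while still covering every triple.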

A gene in the DNA includes sections (introns) that are cut out before the RNA, through a complex process, directs the manufacture of a protein (polypeptide chain).