Bioinformatics

Bioinformatics is, effectively, an attempt to pull useful information out of what looks, to the untrained observer, like several gigabytes of random junk. The Human Genome Project and others like it have produced sequence data in huge quantities. Sadly, though, a very long string of four letters is not the easiest thing to interpret. One of the most productive pieces of information that bioinformatics has obtained from these sequences is the location of regions that look like they might be genes. Genes tend to have fairly predictable structures, being preceded by a higher than average number of adjacent Cs and Gs, followed within a few kilobases by a codon for methionine (ATG) that functions as a start signal. Writing software that can predict these with any great degree of accuracy has proven somewhat more difficult than originally anticipated. One of the major problems is a growing awareness that all sorts of other factors, such as the way in which the DNA is folded, also influence things.
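
As a rough illustration of the GC-rich signal described above, the following Python sketch slides a fixed-size window along a sequence and reports the windows whose G+C fraction reaches a threshold. The function name, window size and threshold are assumptions chosen for the example, not values taken from any real gene finder.

```python
def gc_rich_windows(seq, window=10, threshold=0.7):
    """Return (start, gc_fraction) for every window whose G+C fraction
    is at least `threshold`. All parameters are illustrative."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        gc = (chunk.count("G") + chunk.count("C")) / window
        if gc >= threshold:
            hits.append((start, gc))
    return hits


if __name__ == "__main__":
    example = "ATATATATAT" + "GCGCGCGCGC" + "ATATATATAT"
    for start, gc in gc_rich_windows(example, window=10, threshold=0.9):
        print(start, gc)
```

A real predictor would combine a signal like this with many others; on its own, a GC-rich window is only a hint that a gene may lie nearby.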

Effectively, it's all information theory. Bioinformaticians have been given a stack of data that is known to contain a large amount of information, and they're trying to get it out. For the next few years, at least, a lot of this is going to be guesswork and be based on a lot of assumptions. Even so, it's a field that has already produced lots of useful stuff and is likely to produce more. A full understanding of how the genome actually works is likely to have to wait until the entire biochemistry of a cell can be simulated.

The simplest tasks in bioinformatics concern the creation and maintenance of databases of biological information. Nucleic acid sequences (and the protein sequences derived from them) make up the majority of such databases. While the storage and organization of millions of nucleotides is far from trivial, designing a database and developing an interface whereby researchers can both access existing information and submit new entries is only the beginning.

The process of evolution has produced DNA sequences that encode proteins with very specific functions. It is possible to predict the three-dimensional structure of a protein using algorithms that have been derived from our knowledge of physics and chemistry and, most importantly, from the analysis of other proteins with similar amino acid sequences.

While most biological databases contain nucleotide and protein sequence information, there are also databases which include taxonomic information such as the structural and biochemical characteristics of organisms. The power and ease of using sequence information has, however, made it the method of choice in modern analysis.

In the last three decades, contributions from the fields of biology and chemistry have facilitated an increase in the speed of sequencing genes and proteins. The advent of cloning technology allowed foreign DNA sequences to be easily introduced into bacteria. In this way, rapid mass production of particular DNA sequences, a necessary prelude to sequence determination, became possible. Oligonucleotide synthesis provided researchers with the ability to construct short fragments of DNA with sequences of their own choosing. These oligonucleotides could then be used in probing vast libraries of DNA to extract genes containing that sequence. Alternatively, these DNA fragments could also be used in polymerase chain reactions to amplify existing DNA sequences or to modify these sequences. With these techniques in place, progress in biological research increased exponentially.

For researchers to benefit from all this information, however, two additional things were required:

ready access to the collected pool of sequence information, and

a way to extract from this pool only those sequences of interest to a given researcher.

Simply collecting, by hand, all necessary sequence information of interest to a given project from published journal articles quickly became a formidable task. After collection, the organization and analysis of this data still remained. It could take weeks to months for a researcher to search sequences by hand in order to find related genes or proteins.

Computer technology has provided the obvious solution to this problem. Not only can computers be used to store and organize sequence information into databases, but they can also be used to analyze sequence data rapidly.

The evolution of computing power and storage capacity has, so far, been able to outpace the increase in sequence information being created. Theoretical scientists have derived new and sophisticated algorithms which allow sequences to be readily compared using probability theories. These comparisons become the basis for determining gene function, developing phylogenetic relationships and simulating protein models. The physical linking of a vast array of computers in the 1970s provided a few biologists with ready access to the expanding pool of sequence information. The Internet has since evolved and expanded so that nearly everyone has access to this information and the tools necessary to analyze it.
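
To give a feel for what comparing two sequences by algorithm involves, here is a minimal Python sketch of a global alignment score in the style of Needleman-Wunsch. It is a simpler, non-probabilistic stand-in for the statistical methods mentioned above, and the scoring values (match, mismatch, gap) are assumptions picked for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b, computed by
    dynamic programming with two rolling rows. Scores are illustrative."""
    # Row 0: aligning the empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # prefix of `a` against the empty prefix of `b`
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

A higher score means the two sequences can be lined up with fewer mismatches and gaps. Production tools add substitution matrices and significance statistics on top of this basic idea.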

Searching for Genes

The collecting, organizing and indexing of sequence information into a database, a challenging task in itself, provides the scientist with a wealth of information, albeit of limited use. The power of a database comes not from the collection of information, but from its analysis. A sequence of DNA does not necessarily constitute a gene. It may constitute only a fragment of a gene or, alternatively, it may contain several genes.

Genetic elements share common sequences, and it is this fact that allows mathematical algorithms to be applied to the analysis of sequence data. A computer program for finding genes will contain at least the following elements.

Elements of a Gene-seeking Computer Program

Algorithms for pattern recognition: Probability formulae are used to determine if two sequences are statistically similar.

Data Tables: These tables contain information on consensus sequences for various genetic elements. More information enables a better analysis.

Taxonomic Differences: Consensus sequences vary between different taxonomic classes of organisms. Inclusion of these differences in an analysis speeds processing and minimizes error.

Analysis rules: These programming instructions define how algorithms are applied. They define the degree of similarity accepted and whether entire sequences and/or fragments thereof will be considered in the analysis. A good program design enables users to adjust these variables.
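
The four elements above can be sketched together in a few lines of Python. This is a toy under stated assumptions, not a real gene finder: the data table holds only two well-known promoter consensus sequences, the "pattern recognition" is a plain fraction of matching positions rather than a probability formula, and the names CONSENSUS, identity and find_elements are invented for the example.

```python
# Data table, split by taxonomic group (illustrative, far from complete).
CONSENSUS = {
    "bacteria": {"-10 box": "TATAAT"},
    "eukaryote": {"TATA box": "TATAAA"},
}


def identity(a, b):
    """Pattern recognition stand-in: fraction of positions that agree
    between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def find_elements(seq, taxon, min_identity=1.0):
    """Analysis rule: report (name, position, score) wherever a window
    matches a consensus for `taxon` with at least `min_identity`.
    The threshold is user-adjustable."""
    seq = seq.upper()
    hits = []
    for name, pattern in CONSENSUS[taxon].items():
        n = len(pattern)
        for pos in range(len(seq) - n + 1):
            score = identity(seq[pos:pos + n], pattern)
            if score >= min_identity:
                hits.append((name, pos, score))
    return hits


if __name__ == "__main__":
    print(find_elements("GGCTATAATGC", "bacteria"))
    print(find_elements("GGCTATAATGC", "eukaryote", min_identity=0.8))
```

Restricting the search to one taxon's table is what keeps the scan short and avoids matching elements that the organism does not use.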

Step One: Location of Transcription Start/Stop

A proper analysis to locate a genetic locus will usually have already pinpointed at least the approximate sites of the transcriptional start and stop. Approximate sites are usually sufficient for determining protein structure; it is the start and stop codons for translation that must be determined with accuracy.

Step Two: Location of Translation Start/Stop

The first codon in a messenger RNA sequence is almost always AUG. While this reduces the number of candidate codons, the reading frame of the sequence must also be taken into consideration.

There are six possible reading frames for a given DNA sequence, three on each strand, and all must be considered unless further information is available. Since genes are usually transcribed away from their promoters, locating the promoter definitively can reduce the number of possible frames to three. The sequences surrounding translation start codons are not strongly conserved between species.
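
The six-frame search can be sketched in Python as follows. The code works on DNA, so the start codon appears as ATG rather than the AUG of the messenger RNA. The function names are assumptions made for the example, and the simple rule used here (start at the first ATG, stop at the next in-frame stop codon) is a simplification of what real programs do.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Yield (strand, offset, codons) for the three frames on each strand.
    Offsets on the '-' strand are relative to the reverse complement."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            yield strand, offset, codons


def open_reading_frames(seq):
    """Return (strand, offset, orf) for each ATG...stop run in any frame."""
    orfs = []
    for strand, offset, codons in six_frames(seq):
        current = None  # codons collected since the last unmatched ATG
        for codon in codons:
            if current is None and codon == "ATG":
                current = [codon]
            elif current is not None:
                current.append(codon)
                if codon in STOP_CODONS:
                    orfs.append((strand, offset, "".join(current)))
                    current = None
    return orfs


if __name__ == "__main__":
    print(open_reading_frames("CCATGAAATAGCC"))
```

If the promoter location is known, the loop over strands can be restricted to the strand that is actually transcribed, which halves the work.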

Step Three: Location of Intron/Exon Splice Sites

Intron/exon splice sites can be predicted on the basis of their common features. Most introns begin with the nucleotides GT and end with the nucleotides AG. Near the downstream end of each intron there is a branch sequence that is involved in the splicing event. There is a moderate consensus around this branch site.
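
The GT...AG rule can be turned into a very crude candidate search, sketched below in Python. The function name and the length limits are assumptions for the example (real introns are usually far longer than the toy limits here), and the branch site is ignored entirely, so this will report many more candidates than a real predictor would accept.

```python
def candidate_introns(seq, min_len=4, max_len=50):
    """Return (start, end) pairs such that seq[start:end] begins with GT
    and ends with AG. Length limits are illustrative toy values."""
    seq = seq.upper()
    # Positions where a GT dinucleotide starts (possible intron start).
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    # Positions just past an AG dinucleotide (possible intron end).
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors
            if min_len <= a - d <= max_len]


if __name__ == "__main__":
    example = "AAGTCCCCAGTT"
    for start, end in candidate_introns(example):
        print(start, end, example[start:end])
```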

Step Four: Prediction of 3-D Structure

With the completed primary amino acid sequence in hand, the challenge of modelling the three-dimensional structure of the protein awaits. This process uses a wide range of data and CPU-intensive computer analysis. Most often, one is only able to obtain a rough model of the protein, and several conformations of the protein may exist that are equally probable. The best analyses will utilize data from all the following sources.

Pattern Comparison: Alignment to known homologues whose conformation is better established.

X-ray Diffraction Data: Ideal when some data is available on the protein of interest. However, diffraction data from homologous proteins is also very valuable.

Physical Forces/Energy States: Biophysical data and analyses of an amino acid sequence can be used to predict how it will fold in space.

All of this information is used to determine the most probable bond angles and locations in space of the atoms of the protein. Graphical programs can then use this data to depict a three-dimensional model of the protein on the two-dimensional computer screen.
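
Two of the small calculations involved can be sketched in Python: finding the bond angle formed by three atoms from their coordinates, and flattening a three-dimensional point onto a two-dimensional screen. The function names, the coordinates in the example and the simple perspective formula are assumptions for illustration; real modelling and graphics packages are far more elaborate.

```python
import math


def bond_angle(a, b, c):
    """Angle in degrees at atom b formed by atoms a-b-c,
    each given as an (x, y, z) coordinate."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


def project(point, viewer_distance=10.0):
    """Simple perspective projection of (x, y, z) onto the plane z = 0,
    with the viewer `viewer_distance` units in front of that plane."""
    x, y, z = point
    scale = viewer_distance / (viewer_distance + z)
    return (x * scale, y * scale)


if __name__ == "__main__":
    print(bond_angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
    print(project((2.0, 4.0, 10.0)))
```

Atoms further from the viewer (larger z) are drawn closer to the centre of the screen, which is what gives the flat picture its sense of depth.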

From the BioTech Project at http://biotech.icmb.utexas.edu/. Written largely by Stephen Vigo. Used with permission. For further information see the BioTech homenode.