Bioinformatics and Transcription

Creating a map of genetic characteristics isn't simply a matter of figuring out which gene causes what condition. Very few of the states that we consider genetic characteristics are the product of a single gene; most are created by a complex configuration of genes at various levels. This chapter outlines the problems and complications that these multi-level configurations create for mapping the human genome.

INTRODUCTION

Although, in principle, the genome contains all the information one would need to understand a complex metabolic pathway or
disease, the information is encoded in a combination of physical and logical constructs that make interpretation very difficult.
As you saw in Chapter 3, genes can occur in any of six reading frames (three on each strand, read in opposite directions) and, more often than not, they contain intervening
sequences that are subsequently spliced out of the final transcript. Genes are also embedded in multilayer three-dimensional
structures. The primary unit of structure, the nucleosome, is composed of chromosomal DNA coiled around a histone protein
complex. The location of individual genetic elements within this structure significantly impacts both transcription and replication.
The positional effects are subtle because they are related to the topology of a helix coiled around a cylinder. Residues exposed
to the interior face of the nucleosome are not accessible to enzymes involved in transcription or replication. Conversely,
residues exposed on the surface of the structure and residues contained in the segments that connect nucleosomes are fully
accessible. Enzymatic digestion experiments have revealed a level of variability in the structure, and it is now known that
protein-coding regions can be hidden or made available for transcription in a time-dependent fashion that relates to the cell
cycle. Furthermore, epigenetic factors such as methylated DNA and translocated genes can affect gene expression across multiple
generations; as a result, genetically identical cells often exhibit different phenotypes. The combined effect of these genome-level
variations makes it difficult to predict the expression of a particular gene and even more difficult to predict the sequence
of a final spliced transcript.
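
The six reading frames mentioned above can be made concrete with a short sketch. It is only an illustration: the function names and the example sequence are hypothetical and not taken from this book, and the code assumes an uppercase DNA string containing only A, C, G, and T. It enumerates the codons in the three forward-strand frames and in the three frames of the reverse complement.

```python
# Base-pairing rules used to build the reverse complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Split a DNA sequence into codons for all six reading frames.

    Keys are '+1', '+2', '+3' (forward strand) and '-1', '-2', '-3'
    (reverse-complement strand). Trailing partial codons are dropped.
    """
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Step through the strand three bases at a time from this offset.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames


if __name__ == "__main__":
    # Hypothetical 11-base sequence chosen only for illustration.
    for name, codons in six_frames("ATGGCCTAAGT").items():
        print(name, " ".join(codons))
```

Even in this toy sequence, a start codon (ATG) and a stop codon (TAA) appear only in frame +1, which suggests why a gene finder has to examine every frame on both strands. Introns, which this sketch ignores, complicate the picture further.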

Unfortunately, very few physical states or disease conditions are monogenic in nature. Even if a physical state were controlled
by a single gene coding for a single protein, the upregulation of that gene would perturb the broader system, and many coding
regions would be affected. Biology is a systems problem, nonlinear in nature, and the expression of a single gene has very
little meaning outside the context of the web of interactions that describes a metabolic state. Each stage in the gene-expression
pipeline provides important information about the factors that ultimately determine a phenotype:

Base sequence information can be used to identify conserved sequences, polymorphisms, promoters, splice sites, and other relevant
features that are critical to a complete understanding of the function of any given gene.

Information about the up- and downregulation of closely related messages, mRNA interference, life expectancy, and copy count
of individual messages can help build a transcriptional view of a specific metabolic state.

Despite much analysis, it is not yet possible to predict the three-dimensional folded structure of a protein from its gene
sequence. Furthermore, the final protein is often a substrate for any of a number of post-translational modifications: acetylation,
methylation, carboxylation, glycosylation, and so on. The enzymes that catalyze these reactions recognize structural domains that
are difficult to infer using exclusively genomic data.

Intermediary metabolism is the result of millions of protein:protein interactions. These interactions are context sensitive
in the sense that a given protein can exhibit different characteristics and serve completely different functions in different
environments. The complex networks that describe these interactions are routinely referred to as systems biology.

The genome-centric view of molecular biology is slowly being replaced by a more comprehensive systems view. One of the most
important elements of this approach is a comprehensive understanding of the transcriptional state of all genes involved in
a specific metabolic profile. This picture is complicated by the fact that many species of RNA that will be identified as
playing an important role in the profile are never translated into protein. As previously discussed in Chapter 3, many messages are degraded by the RNA silencing machinery within the cell and others are prevented from engaging in protein translation. These control mechanisms
can cause a high copy-count message, one that is highly abundant within the cell, to be translated into a very small number
of protein molecules. Regulatory messages (miRNA, siRNA) are relatively straightforward to spot because they are reproducibly
short and lack sequences that are normally associated with ribosomal binding. However, these small regulatory messages are
spliced from longer transcripts that certainly have the potential to cause confusion.

Any technique used to study the transcripts within a cell must be capable of spanning the range from single-digit copy counts
to very large numbers, often in the thousands. Accuracy is important because at the single-digit level, small changes in the
number of copies of certain messages can have significant effects on metabolism and disease.
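
A small worked calculation may help show why accuracy matters most at the low end of this range. The copy counts below are hypothetical, chosen only for illustration, and the log2 fold change is used here as one common way to express a relative change, not as a method prescribed by this chapter.

```python
import math


def log2_fold_change(before, after):
    """Return the log2 ratio of two positive copy counts."""
    return math.log2(after / before)


# Hypothetical counts: the same absolute change of two copies
# in a rare message and in an abundant one.
rare = log2_fold_change(2, 4)
abundant = log2_fold_change(2000, 2002)

print(f"rare message:     {rare:.4f}")
print(f"abundant message: {abundant:.4f}")
```

Two extra copies double the rare message (a log2 fold change of 1) but change the abundant message by about a tenth of a percent. A counting error of one or two copies is therefore negligible for the abundant message and large for the rare one.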

This chapter specifically focuses on transcription, the process of creating a messenger RNA template from a gene sequence. The process has particular significance in the context of this book because it represents
the first information-transfer step between gene and protein. In one sense, the step from gene to transcript represents a
step up in complexity because many different transcripts can be created from a single coding region. Conversely, each transcript
may be viewed as a simplification because much extraneous information has been removed from the raw gene sequence. This simplification
is particularly apparent in situations where splicing operations create a message that is substantially different from the
original chromosomal sequence. The structural link between mRNA transcript and protein is much more direct than the link between
gene and protein. Furthermore, the appearance of a message in the cytoplasm is a clear indication that a particular gene has
become involved in a cell's metabolism. This direct link between the appearance of a transcript and metabolic changes within
the cell has given rise to an emerging discipline known as transcriptional profiling.

This discussion begins with a review of the different types of transcripts and their roles in gene expression. The goal is
to lay a foundation for the remainder of the chapter, which focuses on various techniques for identifying and counting individual
species of mRNA. Transcriptional profiling depends on these techniques in addition to a portfolio of algorithms for data analysis.
Over the past few years, the size of a typical expression-profiling experiment has grown to include thousands of transcripts.
The result has been a corresponding increase in the level of sophistication of the statistical methods used to analyze the
results. These methods are the focus of much of this discussion.