If there were a mantra for molecular biologists, it might be this: Structure and function are related, two sides of a coin. To know how a protein does its job inside a living cell, look at the structural details, how it's put together. The twists and turns, helices and sheets of a protein's 3D shape define what biomolecules it can interact with and how.

In the 1990s, through a variety of genome projects, scientists are mapping and sequencing the genes of many organisms, and the resulting data corresponds to the root of protein structure: the linear sequence of amino acids, like beads on a chain, that precedes and determines 3D shape. As this data streams into the marketplace of knowledge, a few scientists, like Hugh Nicholas and John Hempel, are using it to open up new territory in understanding how amino-acid sequence interrelates with 3D structure.

Hempel, a University of Pittsburgh biologist, has for 20 years focused on a family of enzymes called aldehyde dehydrogenase (ALDH). In the early 80s, he worked out the first two ALDH sequences to be solved and, in collaboration with Ron Lindahl at the University of South Dakota, has continued this work. In the past few years, scientists worldwide have expanded work in this area, and sequences have now been determined for over 200 related ALDH enzymes in a wide range of plants and animals.

In the late 80s, Hempel began collaborating with Nicholas, a Pittsburgh Supercomputing Center scientist who specializes in sequence analysis, the process of analyzing relationships among nucleic acids (DNA and RNA) or proteins through comparison of their sequence data. In 1993, they worked on a group of 16 ALDH sequences representing the diversity of the ALDH family sequenced at that time.

In 1997, Hempel collaborated with a University of Georgia research group led by B.C. Wang in work that solved the first ALDH 3D structure. This made it possible for Nicholas and Hempel to embark on an ambitious project investigating the interplay between 3D structure and sequence data. Beginning in September 1997, Hempel's student, John Perozich, gathered 145 full-length ALDH sequences, the complete pool of sequenced ALDHs at the time. Using the sequence-analysis facility at PSC to align them, the researchers produced one of the largest multiple-sequence alignments achieved to date.

Nicholas and Hempel then applied techniques they developed to identify recurring sequence elements and analyze them in relation to function. They identified 10 sequence motifs, amino-acid patterns, that recur with a high degree of regularity in the 145 ALDH sequences. Their analysis of these motifs offers fresh insight into how sequence influences 3D structure. Their research, furthermore, offers an approach with potential wide application in other protein families.

An Extended Family of Enzymes

ALDHs have been found in nearly every form of living thing. Their primary role in humans and other mammals is protecting the body from toxic compounds called aldehydes. Early interest (in the 70s) focused on an ALDH in the liver that helps metabolize an aldehyde (acetaldehyde) that comes from alcohol, changing it to acetic acid, which the body burns for energy. A drug called Antabuse, sometimes used in treating alcoholism, deactivates the relevant ALDH, making you sick if you drink.

Further research has found a number of closely related but different ALDH species, over 10 now identified in humans, with various functions. A number of these are the subject of public-health research. One of them, also a liver ALDH, is genetically inactive in about half of all Asians, causing severe alcohol intolerance. In 1996 researchers showed that defects in another ALDH cause a genetic disease called Sjögren-Larsson syndrome, which involves mental retardation, scaly skin and shortened life.

ALDHs also affect cancer treatment. A number of chemotherapy drugs work through conversion in the body to an aldehyde that attacks cancer cells. These therapies lose potency over time because the relevant ALDH increases in concentration, deactivating the aldehyde more quickly. Better structure-function knowledge will make it possible to develop specific ALDH-inhibitor drugs to regulate this kind of chemotherapy.

Sequence and Evolution

As one part of their project, Hempel and Nicholas classified the 145 sequences, each roughly 500 amino-acids in length, into sub-families. These groupings, explains Nicholas, are based on evolutionary adaptation that's consistent with having a common-ancestor ALDH. In one form of adaptation, the gene that codes for a protein reappears with little change when one organism evolves into another (from one bacterium to another, for instance).

In another kind of adaptation, however, a gene duplicates within an organism. One copy of the gene can then diverge slightly in structure and take on a modified function, such as to react with a different form of aldehyde. Nicholas and Hempel tracked this adaptation by sequence analysis. "We think this happened at least 13 times in the history of ALDHs," says Nicholas. Through statistical measures of sequence similarity, Hempel and Nicholas grouped the 145 ALDH sequences into 13 distinct sub-families. "Sequences within a sub-family are more similar to each other than to the sequences in other sub-families."
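The grouping idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the article does not say which statistical measures the researchers used, so simple percent identity and a fixed threshold stand in for them here, and the sequences, names and the 70 percent cutoff are invented for the example.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent of aligned, non-gap columns at which two sequences match."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    same = sum(1 for x, y in pairs if x == y)
    return 100.0 * same / len(pairs)


def group_by_identity(seqs, threshold):
    """Single-linkage grouping: two sequences land in the same group when
    their identity is at least `threshold` (hypothetical cutoff)."""
    parent = {name: name for name in seqs}

    def find(name):
        # Follow parent links to the group representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if percent_identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Toy, made-up aligned sequences (not real ALDH data).
toy = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYIAKQK",
    "seqC": "MKSAYLAKQR",
    "seqD": "GLSDGEWQLV",
    "seqE": "GLSDGEWQLI",
}
groups = group_by_identity(toy, threshold=70.0)
print(groups)
```

With these toy sequences, the first three fall into one group and the last two into another, which mirrors the statement that sequences within a sub-family are more similar to each other than to sequences outside it. The same pairwise differences are the kind of quantity a phylogenetic tree turns into branch distances.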

The researchers also generated a "phylogenetic tree," which charts evolutionary relationships among the sub-families. Each branch represents a point of divergence, where a gene duplicates and evolves to a new function. Distance between branches corresponds to evolutionary distance as measured by how much the sequences differ.

This representation shows the structure of an ALDH from rats. Colors show the conserved sequence motifs identified by Hempel and Nicholas. The spheres represent very highly conserved residues within each motif.

Another product of Nicholas and Hempel's analysis is identification of conserved residues. (Amino-acids in a protein chain are called residues, and a conserved residue stays the same across sequences.) They found four 100 percent conserved residues, meaning the same amino-acid at the same position in all 145 sequences. This is reduced from 23 in their 1993 alignment of 16 ALDHs. Twelve other residues are 95 percent conserved.
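To make "percent conserved" concrete, here is a minimal Python sketch that scores each column of a multiple-sequence alignment. The four-sequence alignment and the function names are hypothetical, and this is not the researchers' software; it only shows the counting involved.

```python
from collections import Counter


def column_conservation(alignment):
    """For each alignment column, return (most common residue,
    fraction of sequences that carry it)."""
    n = len(alignment)
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / n))
    return result


def conserved_positions(alignment, min_fraction):
    """1-based column numbers where one non-gap residue reaches min_fraction."""
    return [
        i
        for i, (residue, fraction) in enumerate(column_conservation(alignment), start=1)
        if residue != "-" and fraction >= min_fraction
    ]


# Toy, made-up alignment (not real ALDH data).
toy_alignment = [
    "GCEAK",
    "GCDAK",
    "GCEAR",
    "GCEGK",
]
invariant = conserved_positions(toy_alignment, 1.0)   # 100 percent conserved
mostly = conserved_positions(toy_alignment, 0.75)     # at least 75 percent
print(invariant, mostly)
```

Because every added sequence is another chance for a column to vary, the set of invariant columns can only stay the same or shrink as an alignment grows. That fits the drop from 23 invariant residues among 16 sequences to four among 145.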

The logic of evolution holds that conserved residues should be important in structure and function. Nicholas and Hempel's analysis of ALDH supports this view. The four invariant residues participate in binding with other molecules involved in ALDH's catalytic function. Most of the other highly conserved residues, they found, are part of motifs (highly conserved short sequences) that cluster around the enzyme's active site, the part of the 3D structure where it binds with other molecules to carry out its function.

The researchers also extended this kind of analysis to the 13 sub-families, using computational tools to search out what residues are conserved within a particular sub-family and discriminate that group from other ALDHs. This is a particular interest of Nicholas, who sees it as offering the potential to develop drugs that interact with only a particular ALDH, rather than the entire family: "If with chemotherapy you could give the patient an inhibitor for the ALDH that inactivates the chemotherapy, you could get by with a lower dose. You wouldn't want to inhibit basic metabolism, which inhibiting a broad spectrum of ALDHs would do. This work is a first step in that direction."
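One simple way to picture the search for residues that discriminate a sub-family is sketched below in Python. It assumes a strict, hypothetical criterion (a column is invariant inside the sub-family and its residue never appears at that column elsewhere); the article does not describe the computational tools actually used, and the sequences are invented.

```python
def discriminating_positions(family, others):
    """Return (1-based column, residue) pairs for columns that are invariant
    within `family` and whose residue never occurs at that column in `others`."""
    hits = []
    columns = zip(zip(*family), zip(*others))
    for i, (family_column, other_column) in enumerate(columns, start=1):
        if len(set(family_column)) != 1:
            continue  # the column varies inside the sub-family
        residue = family_column[0]
        if residue != "-" and residue not in other_column:
            hits.append((i, residue))
    return hits


# Toy, made-up aligned sequences (not real ALDH data).
family = ["GCEWK", "GCEWR"]
others = ["GCDAK", "GCDGK", "GCQAR"]
hits = discriminating_positions(family, others)
print(hits)
```

In this toy case columns 3 and 4 mark the sub-family, while columns 1 and 2 are conserved across everything and so say nothing about which sub-family a sequence belongs to.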

Interplay of Sequence and 3D Structure

A major finding from the analysis is identification of 10 sequence motifs that are themselves highly conserved. "These are stretches of sequence, from five amino-acids up to 14 or 15," says Nicholas. "They're spread uniformly along the entire ALDH sequence, but when you look at the 3D structure they fold back together and come into contact with each other."
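As a rough illustration of what a conserved stretch looks like computationally, the Python sketch below scans per-column conservation scores for runs of consecutive highly conserved columns. The scores, the 0.9 cutoff and the minimum length of five are assumptions chosen for the example (the length echoes the quote above); the researchers' own motif-finding techniques are not described in enough detail here to reproduce.

```python
def conserved_runs(fractions, min_fraction, min_length):
    """Return (start, end) 1-based inclusive column ranges in which every
    column's conservation is >= min_fraction and the run spans at least
    min_length columns."""
    runs = []
    start = None
    for i, fraction in enumerate(fractions, start=1):
        if fraction >= min_fraction:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    # Close a run that reaches the end of the alignment.
    if start is not None and len(fractions) - start + 1 >= min_length:
        runs.append((start, len(fractions)))
    return runs


# Made-up conservation scores for 14 alignment columns.
fractions = [0.4, 0.9, 1.0, 0.95, 0.9, 1.0, 0.3,
             0.9, 0.5, 1.0, 0.9, 0.9, 1.0, 0.95]
runs = conserved_runs(fractions, min_fraction=0.9, min_length=5)
print(runs)
```

Runs found this way are separated along the linear sequence; whether they end up in contact with each other can only be seen by mapping them onto the 3D structure.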

This interplay between sequence and 3D structure, uncovered by Nicholas and Hempel's analysis, provides a new way of looking at proteins, say the researchers. "You can gain a great deal of insight," says Hempel, "looking at conserved residues and how they relate to 3D structure." With respect to Sjögren-Larsson syndrome, detailed analysis of several mutant ALDH sequences associated with SLS suggests hypotheses for precisely what happens to cause the syndrome, which Nicholas and Hempel are investigating with computational simulations. "It gives a synergy to the whole process of understanding how this works."

A particularly interesting finding, observes Nicholas, arises from seeing that the conserved motifs contain a small, highly conserved, water-avoiding (hydrophobic) amino-acid not directly involved in enzyme function. "Each of the motifs," he says, "has not only a functional definition but also a structural definition." These small, hydrophobic amino-acids, he notes, are generally involved with turns or tight-packing with other proteins.

"This represents information about the interplay between sequence and structure that's generalizable to other proteins," says Nicholas. "It indicates that perhaps the details of how a protein folds into its 3D shape are determined by these smaller amino-acids that form at critical turns or packing junctions, where several strands of the protein come together to carry out function."

These insights from sequence shed new light on traditional ways of thinking about protein structure, which tend to focus on 3D patterns (known as secondary structure) such as helices and sheets. "Our results tell us," says Nicholas, "that it's the residues at the ends of these structures that are conserved. Perhaps you can think of a protein not as a bunch of helices, but as defined by the ends of the helices, where it turns. These may be two equally informative, inversely related ways of thinking, and the genomics data suggests we should look at the turns more than we have."

As a goal for this kind of work, the researchers hope to be able to build a sophisticated description based on conserved residues that identifies sequence elements that can predict 3D structure, not only in ALDH but in other protein families. "This provides a model for other systems," says Hempel. "We're developing tools needed to rigorously analyze and determine highly conserved regions and to correlate them with structure. To look at long segments with high degrees of similarity, we need capability like the Pittsburgh Supercomputing Center gives us."