The Bioinformatics Core of the Joint Center for Structural Genomics

DNA sequences are much the same for all individuals within a species, while differences are more pronounced from one species to another. The human DNA sequence is very much like that of the chimpanzee (estimated to be 98 percent alike) or the mouse (estimated to be 95 percent alike), and mammalian DNA sequences are more like one another than they are like those of, for example, plants. The DNA sequence determines corresponding sequences of genes, the active stretches of the sequence that encode the amino-acid sequences of proteins. "While all living things share a great deal of DNA, it is the little differences that distinguish the proteins of one species or individual from those of another," said John Wooley, UCSD Assistant Vice Chancellor for Research. "Now that scientists have deciphered the complete genomes of several dozen species, it's time to go after the structure and function of all the proteins." The project is called the Joint Center for Structural Genomics (JCSG), and it began in October 2000 with a five-year, $24 million grant from the National Institutes of Health (NIH).
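The "percent alike" figures quoted above come from genome-scale comparisons, but the underlying idea, counting matching positions between aligned sequences, can be sketched in a few lines. This is an illustration only; the sequences and the simple position-by-position comparison are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of matching positions in two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)

# One mismatch in ten aligned bases gives 90 percent identity.
print(percent_identity("GATTACAGGT", "GATTCCAGGT"))  # 90.0
```

Real genome comparisons must first produce the alignment itself, which is the hard part; this sketch assumes the alignment is given.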

Figure 1. Information Flow in the JCSG

High-throughput structural genomics requires automatic methods to generate, analyze, and validate experimental data. Data gathered at each stage will be part of a centralized bioinformatic system.

JCSG comprises three interactive core projects (see diagram, Figure 1). One of them, Bioinformatics, is centered at SDSC. "Our part of the JCSG is dominated by issues of data access. We need to keep track of our progress in the context of all available current research, so we can rationally improve our drive toward determining the structure of large numbers of proteins," said Bioinformatics core leader Adam Godzik of UCSD. He has developed new methods for identifying protein motifs (characteristic bits of structure). He works with Mark Miller of SDSC, a crystallographer and structural biologist who is the project coordinator.

"We're also working closely with other SDSC and UCSD scientists: Lynn Ten Eyck, director of the Computational Center for Macromolecular Structure; Shankar Subramaniam, the developer of the Biology WorkBench; Susan S. Taylor, whose group solved the first protein kinase structure; Philip Bourne (who directs the SDSC portion of the Protein Data Bank) and Ilya Shindyalov, structural bioinformatics leaders at SDSC; and Michael Gribskov, developer of the Molecular Information Agent and other innovative data mining applications," Miller said.

SECRETS OF HIGH-THROUGHPUT GENOMICS

Figure 2. Ribosome

Ribbon representation of a ribosome. These complex, multisubunit assemblies of RNA and protein are the sites where new proteins are synthesized within the cell.

"The JCSG is taking a systems engineering approach to determining protein structure and function," Ten Eyck explained. "There are five major steps in high-throughput protein structure determination. We begin with genomic sequences of interest, for example, the genome of the multicellular worm Caenorhabditis elegans, and select target protein sequences from those encoded by the genes."

"What we're looking for are proteins that have, say, cellular signaling functions, but which differ significantly from similar proteins whose structure has already been solved," Miller continued. "The bioinformatics approach allows us to sort through genomic sequences and target proteins on the basis of criteria like these, automatically." The worm genome is simpler than the human, yet the fact that the organism is a metazoan (multicellular) means there will be proteins having the functions JCSG scientists want to study. Such studies will quickly be extended to similar proteins found in more complicated organisms, including mammals.

"The targeting procedures can be applied to any source of sequences," Ten Eyck noted, "with appropriate filtering--and it is the filtering methods that we have been developing at SDSC." Soluble proteins thought to be involved in intra- or extracellular signaling processes will be first to be solved. The group will then target a number of transmembrane proteins that facilitate signaling across cell boundaries.

EXPRESSION, PURIFICATION, CRYSTALLIZATION

When target proteins have been selected, the next steps are expression, purification, and crystallization of the proteins, carried out in the Crystallomics core project. Ray Stevens, Peter Schultz, and colleagues at TSRI and the Genomics Institute of the Novartis Research Foundation (GNF) have been developing breakthrough robotic technologies for expressing and obtaining large amounts of purified protein. The expression systems include Escherichia coli bacteria and yeasts, into which genomic sequences are inserted that code for the proteins of interest.

"Crystallization has been a hit-or-miss business for far too long," Miller said. A mass of folklore has grown up around various procedures that work (but not always) to produce good crystals. By keeping track of all attempts to crystallize the proteins of interest, he noted, JCSG's Bioinformatics core will have a record, "both positive and negative," of what works and what doesn't work. The GNF automated system can make crystals with as little as 2 nanoliters of protein, so many trials can be made under varying conditions.

Successful crystals will be sent to the Stanford Synchrotron Radiation Laboratory (SSRL), where the Structure Determination core scientists begin by X-raying the crystals using the high-power beamline. Also to be automated is a procedure to make crystals with heavier atoms inserted at appropriate spots (e.g., substituting selenium for sulfur atoms in methionine residues) to enable scientists to determine "phasing" of the crystals by multiwavelength anomalous diffraction (MAD) and other means. "We'll be keeping data on the full set of X-ray reflections for each crystal studied," Miller said, "and again, this information will be a guide to future procedures with more complex proteins." (While individual investigators are encouraged by PDB to submit their raw crystallographic data, it is not required.)

From the X-ray diffraction data, JCSG scientists will build electron-density maps that, combined with the original sequence information, will enable complete models to be made of each protein's three-dimensional crystal structure. "All of these procedures, which used to take many months, if not years, are to be automated and routinized," Miller said.

"We will also be using new methods developed by PDB scientists to keep track of the various classes of 'folds' to be found in the structural data," he said. There appear to be tens of thousands of variations in folding motifs. The Combinatorial Extension algorithm developed by Ilya Shindyalov and Philip Bourne has been used for the classification of all folds found in the structures in the PDB, and the same methods can be extended to encompass new folds. For proteins that act together in a complex, like ribosomes (see Figure2), the orientation and connection of the subunits is yet another classification problem.

AN ENORMOUS TASK

The JCSG is one of seven pilot projects initiated last year by the NIH National Institute of General Medical Sciences, in hopes of determining the structures of thousands of proteins. "An engineering approach here is going to drive down the cost of obtaining protein structure solutions," Ten Eyck noted, "and thus the Bioinformatics core of the JCSG is truly the key to success." JCSG's objective is to solve some 50 structures in the first year, accelerating to 1000 solutions by the fifth year, for a total of some 2000 completely new protein solutions.

"The fifth-year figure works out to about three new structures every day," Miller said, "and similar goals have been postulated by the other six structural genomics consortia. We're talking about a tremendous amount of data, comparable to the total depositions in PDB from all investigators for the past year alone. Storing and mining it will be a challenge for us-and for the PDB also, a very ambitious task.

"Moreover," he continued, "high-throughput structural genomics requires automatic methods to validate structures against experimental data." The diversity of experimental data requires a deep understanding of the refinement and validation process. "We expect to improve greatly in both data acquisition and analysis," Miller said, "and we will be depending on strong links to be forged with, for example, the PDB, other large database efforts, and the new Alliance for Cellular Signaling, led by Alfred Gilman at the University of Texas, for which SDSC is also developing new bioinformatic techniques."

JCSG is currently on the lookout for experienced bioinformaticists and database programmers to help the project get up to speed. Interested persons should contact Miller via e-mail: mmiller@sdsc.edu. -MM