7.
Preface

We are in the middle of a genome period marked by the full sequencing of complete genomes. Last year (2001) will be identified in the history of biology by the publication of the first draft of the complete sequence of the human genome. Much work still lies ahead to achieve the goal of fully finishing many of these eukaryotic and prokaryotic genomes that, as published, still contain gaps.

At first glance, genomics has not produced a strong conceptual change in biology. The fundamental problems remain: understanding the origin of life, the complex organization of a cell, the pathways of differentiation, aging, and the molecular and cellular bases for the capabilities of the brain. What has happened is an explosion of molecular information; genomic sequences will be followed in the near future by exhaustive catalogs of protein interactions and protein function (as proteomics takes the lead). This wealth of information can be analyzed, visualized, and manipulated only with the help of computers. This basic contribution of computers was initially not recognized by biologists. Certainly, by the time of the beginning of GenBank, in the 1980s, the experimentalist could imagine an institute where computational biology was merely technical support for databases and access to GenBank, and maybe a classic Boehringer metabolic chart hung on the wall (initiated in the 1960s by G. Michal). The influence of genomes is such that today what François Jacob conceived as the Mouse Institute would do much better having on staff experimentalists, computer scientists, statisticians, mathematicians, and computational biologists. We have reached a point where biology articles are published with contributions from researchers who recently were, for instance, computer scientists working in logic programming.

8.
This is no small change if we remember the place of theoretical and mathematical biology as an activity that could be fascinating, but to a large extent was done in isolation, having little influence on mainstream experimental molecular biology. Today, the student, postdoctoral fellow, or even young professor who is knowledgeable both in biology and in computer science has much broader opportunities. Genomics may really be opening the door to a more profound conceptual change in the way we study living systems in the laboratory.

With a foot in sequence analysis, this book is centered on current computational approaches to metabolism and gene regulation. This is an area of computational biology that welcomes new methods, ideas, and approaches with the goal of generating a better understanding of the complex networks of metabolic and regulatory capabilities of the cell.

Classical concepts have to be redefined or clarified to address the study of the genetics of populations and of the biochemical interactions and regulatory networks organizing a living system. Given the constant and pervading importance of comparative genomics, these concepts must be precise when comparing genes, proteins, and systems across different species. The first chapter, by Jeremy Ahouse, is an exercise in thinking about the concept of homology (the common origin of similarities) in order to use it adequately when considering homologous networks of gene regulation between species.

Currently, DNA sequence data is the most abundant material with which to begin a project in computational biology. Raw sequences from genomes have to be analyzed and annotated, in ways that improve continuously as the databases expand and sharper methods are used. The second chapter, by Rolf Apweiler and colleagues, describes an integrated system for this task. Databases centering on specific signals, motifs, or structures have exploded in number in the last years.
The databases describe those pieces of macromolecules whose function we know, and therefore are essential for algorithmic analyses. The third chapter, by the team of Ralf Hofestädt, shows a system capable of integrating data from different databases, and its subsequent use in the integration and modeling of metabolic pathways using a rule-based system.

Once the computational and basic annotations are in place, we can move from sequences to networks of gene regulation and cell differentiation.

9.
The second part of the book begins with chapter 4, by Gary Stormo, who describes the foundations of weight matrices and their biophysical interpretation in protein-DNA interactions. In a way, this method and its variants are for regulatory motifs what the Smith-Waterman algorithm was for coding sequence comparisons. Defining the best matrix is based on the problem of defining the best multiple alignment, given the constraints of no gaps, symmetry, and other properties describing most protein-DNA binding sites in upstream regions. Abigail McGuire and George Church, in chapter 6, show how the integration of gene regulation has to be supported by experimental studies of transcriptome analyses combined with computational motif searches. Chapter 5, by Julio Collado-Vides and colleagues, is devoted to computational studies of gene regulation in E. coli in which different pieces are put together, making it feasible to think of a global computational study of a complete network of transcription initiation in a cell. A pair of chapters illustrate the complexity of these issues when studying eukaryotes, as seen in the signal transduction modeling by Nikolay Kolchanov and colleagues (chapter 7), and in the Boolean network methodology and its plausible application to modeling the network of factors involved in the biology of asthma by Sui Huang (chapter 8).

In chapter 9 Edward Marcotte presents a relatively novel approach that uses phylogenetic profiles to provide a quantitative definition of function in genomics. This is a powerful method that does not require homology among genes to identify groups of genes involved in the same function. Metabolic flux analysis as well as the comparison of pathways in different genomes is illustrated in chapter 11, by Steffen Schmidt and Thomas Dandekar. The book ends with a chapter by Masaru Tomita that describes a more ambitious modeling that integrates metabolism, regulation, translation, and membrane transport.
A comprehensive in silico complete cell model is still in its infancy, but Tomita points to what lies ahead. Still more important is evaluating the predictive capability of all these computational modeling and simulation projects.

This book does not attempt to provide a complete account of this expanding and exciting area of research. Many other databases, algorithms, and mathematical approaches are enriching postgenomic computational research.

In 1995 and 1998 we participated in the organization of two Dagstuhl seminars centered on modeling and

10.
simulation of metabolism and gene regulation. This book is the outgrowth of a summer school following the Dagstuhl seminars that we organized in Magdeburg in the summer of 1999. We acknowledge the sponsorship of the Volkswagen Foundation for these activities. We also acknowledge Alberto Santos-Zavaleta and César Bonavides-Martínez for their help in editing the book. Last but not least, we are both grateful to our families for their support during the compilation of this book.

13.
1 Are the Eyes Homologous?
Jeremy C. Ahouse

Since the 1990s research in developmental genetics has followed the approach of borrowing pathways described in one context and testing to see if the members of a pathway or genetic regulatory circuit can be found in a new context. This approach has raised questions of how the concept of homology should be used when comparing genetic regulatory circuits. One particularly cautious response has been to claim that gene expression patterns are informative for the understanding of morphological evolution only when coupled with a detailed understanding of comparative anatomy and embryology. This reflects the concern that recruitment can lead to a situation where orthologous genes are expressed in novel contexts during development, thus suggesting that these similarities in gene expression patterns were not derived from a common ancestor with the structure of interest. Defining homology as a property of structures, genetic networks, or genes, rather than viewing homology as a particular way to explain observed similarities, is confusing. Specifying the similarities first and then entertaining hypotheses to explain them (including appealing to common ancestry, i.e., homology) allows us to dispense with tortured discussions of levels of biological organization at which the concept of homology may be applied.

Other chapters in this book address specific questions of gene regulation and metabolism without explicit mention of the connection between networks and the phenotype. One of the challenges, computationally, in understanding gene regulation is finding, capturing, and leveraging the information in better-studied networks. It is standard practice to apply conclusions from well-studied proteins to similar, but less well-understood, proteins. This is done when annotating for

14.
function and even when trying to predict structure (see the cautions in chapter 2 in this volume). This practice of borrowing annotations and setting expectations relies on tacit assumptions about the transitive nature of these attributes once homology has been established. It is my goal in this essay to clarify what hypotheses of homology actually are in the context of borrowing network and gene regulatory information from one (well-described) regulatory circuit to another (less well-understood). To make the case for homology of regulatory circuits, and for using what is known in one context and applying it to another, we will have to examine homology and the emergence of phenotype from regulatory circuits.

This is the current challenge in computational biology. As genomes are sequenced, there comes the realization that interpreting the genome sequence is not straightforward. Coding regions are interspersed with noncoding regions, and an individual locus may give rise to multiple gene products. This has stimulated experimental approaches to identify the full spectrum of messenger RNAs (the transcriptome) and their corresponding protein products (the proteome) (RIKEN, 2001). If we now ask about the many modifications of proteins, and the numerous interactions and the detailed biophysics of protein-protein, protein-DNA, protein-RNA, and protein-lipid interactions (see chapter 9 in this volume), we quickly see why sequence-based computational biology hits a snag. Part of the enthusiasm for moving to descriptions at the network level is the hope (or intuition) that there will be regularities that allow us to offer useful descriptions without losing the emergent biological narrative in a fog of biophysical details. In addition, the increasing availability of transcription profiles and the need to interpret them has encouraged researchers to use known regulatory networks to establish expectations against which profiling experiments can be statistically compared.
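One way to picture "establishing expectations" from a known network is an overrepresentation test: if a regulator's known targets were unrelated to a profiling result, how often would the two gene lists overlap this much by chance? The sketch below uses a hypergeometric tail probability, which is a common choice but not a method the essay itself names. The function name and every count are hypothetical, chosen only for illustration.

```python
from math import comb

def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` genes are sampled without replacement
    from `population` genes, of which `successes` are known targets."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total

# Hypothetical counts, not taken from any real experiment.
genome_size = 4000     # genes measured in the profiling experiment
known_targets = 40     # genes a known network assigns to one regulator
changed_genes = 100    # genes whose expression changed
overlap = 8            # changed genes that are also known targets

# Expectation set by the known network if the two lists were independent.
expected_overlap = changed_genes * known_targets / genome_size
p_value = hypergeom_tail(genome_size, known_targets, changed_genes, overlap)

print(f"expected overlap by chance: {expected_overlap:.1f}")
print(f"P(overlap >= {overlap}) = {p_value:.2e}")
```

A small tail probability says only that the overlap is unlikely under independence; whether the shared regulation reflects common ancestry is a separate hypothesis.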
I will offer an operational definition of homology, watch it at work in a current example of gene regulation (eye development), and endorse hypotheses of gene regulatory homology that push experimental work and set expectations for establishing statistical significance.

HOMOLOGY

Since evolution was championed in the mid-1800s, it has been possible to define homologies as similarities due to shared ancestry (Lankester,

15.
1870; Donoghue, 1992; Patterson, 1987; Patterson, 1988). To understand the use of this concept when thinking about developmental regulatory circuits or pathways, it is worth reflecting on the use of the term "homology." There is general agreement that attributions of homology are shorthand for the claim that particular similarities are best explained by common ancestry (Abouheif et al., 1997; Bolker and Raff, 1997; de Beer, 1971; Hall, 1995; Roth, 1984; Roth, 1988; Wagner, 1989a; Wagner, 1989b). There is still some confusion that flows from conflating "homology as an explanation for similarity" (as hypothesis) with treating homology as if it were a (discernible) property of individual things.

As more and more developmental pathway information becomes available, comparative work becomes of particular interest. I will try to provide the framework within which concepts of homology can be based in these cases. My goal is to reciprocally illuminate the comparison of regulatory pathways and those explanations that rest on homology. I will use examples from spatiotemporal gene expression patterns in developmental biology because these are the best studied. But I think much of the argument carries easily to gene regulatory circuits or metabolic pathways (see Burian, 1997 for tensions between developmental and genetic descriptions).

Here is an example. The eyespots on the wings of butterflies in the genera Precis and Bicyclus look very similar. In both genera, eyespot foci are established in the larval stage. However, at the pupal stage things look quite different. The pattern of engrailed expression correlates with the development of eyespot rings. Engrailed is a transcription factor that is also involved in establishing body segments by activating the secreted protein hedgehog. In Precis, engrailed expression extends out to the second ring by 24 hours after pupation and then collapses to the center of the ring by 48–72 hours.
In Bicyclus, it is expressed at the third ring but not in the second. Whereas both butterflies may use the same mechanism to place eyespots, the ways in which they specify the developing rings of the eyespot appear to be different, though the adult pattern appears similar again (Keys et al., 1999). Given the profligate reuse of transcription factors in development, we have a real challenge in applying notions of homology and in borrowing annotations from one situation to the next.

Reactions to complicated (i.e., actual) examples include the claim that homology at one level does not require homology at another, or that

16.
homology means nothing more than shared expression patterns of important regulatory genes during development, or that any assignment of homology must specify a level in order to be meaningful. Although homology may apply to (developmental) mechanisms per se ("process homology"), rather than to their structural end products, there is tension in the possibility that homology at one level of organization may not imply homology at another. For example, nonhomologous wings are said to have evolved from homologous forelimbs. Pterosaurs, bats, and birds share the underlying pattern of homologous forelimb bones of their tetrapod ancestor, but their wings have evolved independently. The problem is that because there is no clear way to assign levels unambiguously, one may conclude, unnecessarily, that gene expression patterns should not be used as a primary criterion of homology.

In addition to rejecting hypotheses of homology using gene expression patterns because they may disagree with each other at varying levels of organization, some critics cite specific errors that have come from using expression patterns (Abouheif et al., 1997; Bolker and Raff, 1997). These include the failure to distinguish between orthology and paralogy, the confusion of analogy (convergence) and homology (noting that gene-swapping experiments do not resolve this question), and the failure to notice that orthologous genes can be recruited and expressed in structures whose similarities may not be due to common ancestry. So, for example, the distal-less gene (the transcription factor that is the first genetic signal for limb formation to occur in the developing zygote) may be homologous in different animals, but its cis regulation may be convergent in different lineages, so that finding distal-less expression in different outgrowths does not, by itself, warrant the claim that the resultant limbs are homologous.
These concerns all seem reasonable, and might chill our enthusiasm for recognizing and borrowing knowledge gleaned from developmental regulatory circuits in different contexts. Must any hypothesis of morphological homology based on gene expression include, at a minimum, a robust phylogeny, a reconstructed evolutionary history of the gene, extensive taxonomic sampling, and a detailed understanding of comparative anatomy and embryology? Or are these requirements unnecessarily cumbersome? To untangle these issues I will return to a definition of homology.

17.
HOMOLOGY: A DEFINITION

The use of the term "homology" implies that a given similarity is a result of common ancestry. This definition has a critical requirement: similarity comes first. There are many cases in which the similarity is cryptic, but this should not fool us into thinking that we are explaining something other than the similarity.

There are some instructive examples of structures that are not at first glance similar, but are more obviously so once the hypothesis of common ancestry is considered seriously, as in studies of insect wing evolution (Kukalova-Peck, 1983) and wing venation patterns (Kukalova-Peck, 1985). But we generally begin with the perception of similarity and then explain the similarities by appealing to a short list of possibilities. Biologists usually consider similarity to be the result of shared ancestry (homology), chance, convergence (homoplasy), or parallelism (including repeated co-optation of the same regulatory genes), or an intricate mix of these. Explanations that posit horizontal transfer are still appealing to homology to explain similarity, even though they relax the requirement for an unbroken shared lineage.

We should not appeal to homology to explain dissimilarity. And, importantly, it is not at all clear what the claim that dissimilar objects are "nonhomologous" would mean. Homology as I have defined it is coherent only when we begin with similarity. Nonhomologous similarity does make sense, however. Claiming that similarity is not due to shared ancestry sends us to the other possibilities (convergence, chance, and biomechanical constraint).

There are other uses of "homology" that we will set aside. There is the unfortunate use of the word to refer to the degree of DNA sequence identity or similarity (e.g., 30% homology). This use does not make particular claims about the origin or process that gives rise to the similarity.
Then there is the interesting phenomenon of serial homology, as in the forelimbs and hind limbs of quadrupeds, the repeated segments of a millipede, or the petals of a flower. A similar situation arises in developmental genetic terms when, for example, the expression of apterous in dorsal cells and engrailed in posterior cells in both wing and haltere discs has been taken as evidence that these two appendages are built on a "homologous groundplan" (Akam, 1998). Serial homology

18.
does not imply the existence of a common ancestor with just one segment, limb, or other structure; rather, it gives us insight into how a structure develops. Sometimes paralogy is assumed to be "serial homology" at the level of genes. However, paralogy of open reading frames does imply a common ancestor with just one copy.

HOMOLOGY AS HYPOTHESIS

As biologists, when we give ourselves the task to explain similarity, we have a limited list of options:

1. Mistaken perception: the similarity is solely in the eye of the beholder (flightlessness, an outgrowth, the coelom)
2. Shared ancestor had the anatomical structure, gene, regulatory network, behavior, temporal and spatial protein distribution, or other component (homology or horizontal transfer, developmental constraints)
3. Convergence, parallelism (adaptation)
4. Chance (drift, contingency, historical constraints)
5. Physical principles (biomechanics).

These options are not mutually exclusive. The claim that the perception of similarity itself is illusory is an epistemological question (and not unique to biologists), so I will put it aside. Physical constraints have been in vogue as an explanation of similarity periodically since the work of D'Arcy Thompson. Contemporary practitioners who focus on biomechanics (e.g., Mimi Koehl and Steven Vogel) are part of this tradition, as are the recent wave of neostructuralists (Webster and Goodwin, 1996; Depew and Weber, 1996). The clearest examples of this kind of similarity are in chemistry (ice crystals look similar due to the physical processes involved, not to any shared ancestry among individual water molecules). Physical and chemical constraints do not play a large part in most biologists' explanations, so explanations involve appeals to the other three.
Much of the discussion of homology as structural, or dependent on the relative position of surrounding parts or on the percent of identical bases or amino acids, comes down to questions of the relative merits of attributing overall similarity to common ancestors, not arguments about the definition of homology.

19.
The job of explaining similarities is one of partitioning credit. Take two gene sequences that can be aligned. There will be certain positions where the residues are shared (i.e., the same). As we move along the alignment, we can imagine that some of the shared residues reflect a shared ancestor, whereas others have mutated since the common ancestor and have secondarily returned to the same residue thanks to either drift (there are only four bases possible) or convergence (the protein works better if a particular residue is coded for at a particular position). Clearly the observation of the similarity depends strongly on the alignment (already an important hypothesis that privileges the idea that shared residues are due to homology). It should be clear that understanding what percent of the identities are due to homology, chance, and convergence may be difficult, but it is at least formally possible. Many biologists take identical residues to indicate common ancestry in combination with stabilizing selection.

Sequence comparison allows us to partition credit, at least in principle. Doing the same thing when we are discussing morphology or gene regulatory circuits is more difficult. This is both because it is much harder to atomize the trait unambiguously and because the explanations are deeply intertwined. This difficulty does not have to block inquiry.

Focusing on convergence is the traditional way to gain insight into the selectionist forces at work. Lineages are assumed to be independent trials in a natural experiment, so convergence suggests similar selection pressures (Losos et al., 1998). Alternatively, attention to the underlying homologies offers insight into possible origins of, relationships among, and constraints on the evolution of forms in the taxa under consideration (see Amundson, 1998 for a discussion of the structuralist tradition).
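The bookkeeping behind partitioning credit in an alignment can be sketched in a few lines. The example below only counts shared positions and compares them with a chance baseline of one in four, which assumes equal and independent base frequencies; it cannot say which individual identities are due to homology, chance, or convergence. The two sequences and all names are hypothetical.

```python
def shared_positions(seq_a, seq_b):
    """Indices where two pre-aligned sequences carry the same residue.
    Gap characters ('-') never count as shared."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return [
        i for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a == b and a != "-"
    ]

# Hypothetical aligned DNA fragments (the alignment is itself a hypothesis).
aligned_a = "ATGCCGTA-GCT"
aligned_b = "ATGACGTATGCA"

shared = shared_positions(aligned_a, aligned_b)

# Only columns without a gap can be compared residue to residue.
compared = sum(
    1 for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"
)

observed_identity = len(shared) / compared
chance_identity = 1 / 4  # assumption: four equally likely, independent bases
expected_by_chance = chance_identity * compared

print(f"shared positions: {len(shared)} of {compared}")
print(f"observed identity: {observed_identity:.2f}")
print(f"identities expected by chance alone: {expected_by_chance:.2f}")
```

The gap between the observed count and the chance baseline is the portion left to be credited to common ancestry or convergence, and separating those two requires evidence beyond the alignment itself.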
Devotion to chance events has been used to good effect in both understanding the distribution and abundance of lineages and in inferring times of divergence by using background mutation rates of DNA sequences. The importance of contingent events in the history of life is well described by Gould's review of the Burgess shale fossils and his discussion of which lineages got to participate in the Cambrian explosion (Gould, 1990). These three accounts are not mutually exclusive; rather, they are the strands from which evolutionary explanations are braided.

Can gene circuits and spatial and temporal expression patterns be perceived as similar? Certainly. Are they candidates for hypotheses of

20.
homology? I would say, absolutely yes! The question of diagnosis is open and difficult, but the appeals to homology, chance, and convergence as parts of an explanation are not especially problematic for developmental genetics (see also Gilbert et al., 1996; Gilbert and Bolker, 2001). Due to changes in developmental timing, it is often a real challenge to identify the equivalent developmental stages across lineages. Correlating equivalent developmental stages in different organisms is much like testing multiple alignment hypotheses in sequence-based comparison, though the criteria for identity are less obvious. However, if we are comparing which regulatory elements are upstream or downstream in a circuit, we can anchor our particular questions to the circuit under consideration, even before we have full resolution of the stage problem.

Can regulatory genes be homologous if the structures they produce are not? Again, I would answer this with an enthusiastic yes. I suspect that what is usually meant by "not homologous" is that the structures produced are not similar (or that the parts of the structures we are trying to explain are not the similarities). I find it less likely, but formally possible, that someone could convince us that the similarities of the structures are best explained by an appeal to convergence or chance or physical constraint even if the regulatory genes' similarities were best explained by their sharing a common ancestor (i.e., they are homologous).

Are tissues homologous if similarity is cryptic and apparent only at the level of genes? We are constantly increasing the number of ways that we can probe and understand a tissue. As should be clear by now, I would prefer to reserve assertions of homology for the actual similarities (the noncryptic gene similarities).

THE EVOLUTION OF THE EYE

The evolution of the eye stood for years as a paradigmatic example of independent evolutionary paths fulfilling the same need.
Vertebrates and mollusks have single-lens eyes (though the photoreceptive cells under the lens have opposite orientation), whereas insects have compound eyes. These differences had been taken to imply that the eye evolved (independently) numerous times. We now know that the large morphological differences share a common developmental pathway of elements for optic morphogenesis. The evidence for commonality in these developmental pathways comes from looking at similar proteins

21.
in mammals and flies (the Pax proteins) (Gehring, 1999). A particular protein, called eyeless for its mutant phenotype in fruit flies, was shown to produce eye structures on wings and legs of flies when ectopically expressed in those locations. It seems reasonable to conclude that it must be near the top of the developmental hierarchy for eye development. A mutation in a similar protein in mammals (Pax6, the eyeless homologue, based on sequence and motif similarities) results in abnormal formations of the eye. The mouse protein, when expressed in unusual locations in the fly, also results in production of ectopic fly eyes. Whether Pax6 recruits native eyeless, which then auto-upregulates more eyeless, or does the job itself is not known. But in either case, these two proteins have very similar functions. This finding also suggests that either (a) the common ancestor of flies and mice also had working eyes whose development used this protein (i.e., the common ancestor of Pax6 and eyeless) or (b) whatever this protein was doing in the common ancestor, it facilitated the evolution of eyes in other lineages (a Pax6-like protein is found in squid and octopus, too).

So are the eyes homologous? If we begin with similarities, we can avoid a fruitless argument. The differences between compound fly eyes and single-lens vertebrate eyes cannot support a hypothesis of homology because they are differences. This allows us to focus on the similarities: bilateral symmetry, positioning on the head, the expression patterns of regulatory genes, and the pathway itself (eyeless, twin of eyeless, sine oculis, eyes absent, dachshund . . .). All of these similarities do seem to be homologous; or, more carefully, we would credit those similarities to shared ancestry.
It is relevant to point out that work on the regulation of chick muscle development has shown that homologues of genes involved in mouse eye development (Dach2, Eya2 and Six1) are involved in vertebrate somite (muscle) development (Heanue et al., 1999). Again by focusing on the similarities, in this case the regulatory feedback loops, we might appeal to homology while simultaneously avoiding the question of whether eyes are homologous to the segmentally organized mesodermal structures that are the embryonic precursors of skeletal muscle.

Do we need a new word for homologous gene circuits (e.g., true homology, deep homology, homoiology), or should we talk about homology at different levels? I have been arguing that attribution of similarity to historical relatedness is an appeal to homology, whenever it is made. The additional adjectives ("true" or "deep") do not add much.

22.
Contingency, homology, selection (functional convergence), and physical constraints are constitutive parts of any explanation for a trait, whether it is a gene sequence, a gene expression pattern, or an adult tissue.

METHOD

While similarity surely results from a mix of explanations, a methodological preference for homology can still be defended. Looking for and highlighting homology when discussing developmental regulation serves us by generating hypotheses that inspire tests in ways that contingency and convergence do not. This does not mean that the hypothesis of homology will be supported by those tests, but we know what to do next in the laboratory.

I would like to contrast the kinds of hypotheses that are generated when we focus on differences attributed to selection rather than on similarities attributed to homology. C. J. Lowe and G. A. Wray studied several homeobox genes and concluded that they were recruited into new roles: "Each of these cases [orthodenticle, distal-less, engrailed expression in brittle stars, sea urchins, and sea stars] represents recruitment (co-option) of a homeobox gene to a new developmental role. . . . Role recruitment implies that the downstream targets are different from those in other phyla." This assessment (that if the genes were recruited into new roles, their downstream targets would be different) presents a significant experimental challenge. Where to go next? What if, instead, Lowe and Wray had asserted that the upstream and downstream factors were what had been found previously in other organisms? They would then have known which genes (and expression patterns) to hunt for. This suggests that it may be methodologically useful to hypothesize homologies, especially when looking at pathways and developmental circuits, since previously characterized networks provide a list of candidates that might be involved in the new situation.
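The methodological point, that a characterized network hands us a list of candidates for the new situation, can be sketched as a simple mapping exercise. The gene names below appear in this essay, but the wiring of the circuit and the pairing of each fly gene with a vertebrate homologue are simplifying assumptions made for illustration, as is the function name; none of it is offered as an established result.

```python
# Assumed, simplified wiring of a well-described circuit: regulator -> targets.
known_circuit = {
    "twin of eyeless": ["eyeless"],
    "eyeless": ["sine oculis", "eyes absent"],
    "sine oculis": ["dachshund"],
    "eyes absent": ["dachshund"],
}

# Assumed homologue assignments in the less well-understood organism.
homologue_of = {
    "eyeless": "Pax6",
    "sine oculis": "Six1",
    "eyes absent": "Eya2",
    "dachshund": "Dach2",
}

def candidate_interactions(circuit, mapping):
    """Translate known regulator -> target pairs into candidate pairs to test.
    Genes without an assigned homologue are reported rather than guessed."""
    candidates = []
    unmapped = set()
    for regulator, targets in circuit.items():
        for target in targets:
            if regulator in mapping and target in mapping:
                candidates.append((mapping[regulator], mapping[target]))
            else:
                unmapped.update(
                    g for g in (regulator, target) if g not in mapping
                )
    return candidates, sorted(unmapped)

candidates, unmapped = candidate_interactions(known_circuit, homologue_of)
for regulator, target in candidates:
    print(f"test: does {regulator} regulate {target}?")
print("no homologue assigned:", unmapped)
```

Each printed pair is a hypothesis of homology to be tested at the bench, not a conclusion; the unmapped list marks where the borrowed annotation runs out.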
Most evolutionists recognize that explaining every feature of an organism as an adaptation can become mere storytelling. This is why nonhomologous similarities are of special interest (i.e., distinct clades that share the feature of interest). With multiple clades, if we have ruled out homology, chance, and physical constraint, we can then look to commonalities in the respective environments to suggest that there may have been similar selection regimes.

23.
Dispensing with the comparative step can result in an uncritical adaptationism that explains (by an appeal to natural selection) the existence of a trait that is unique or novel in our lineage of interest. Without multiple lineages for comparison (focusing just on the autapomorphy) we are free to assert that the population faced whatever challenges could select for the structures under consideration. These selectionist accounts are too difficult to challenge and can be produced at will.

Flying, for example, has arisen numerous times from flightless ancestors. Should every structure that makes flight possible be treated as a complete novelty in each lineage? Because of the possibilities of finding developmental and structural homologies, there are certain parts of the explanation of flight in these lineages that will be better examined by restricting our inquiry to the three vertebrate clades that had flight (pterosaurs, birds, and bats) as distinct from the flying insects. It should be clear that comparative work is critical, and fortunately the sequencing projects and advances in transcript and protein profiling make comparative work ever easier. And the information that can be gleaned from comparative work (borrowing annotations and candidates justified by hypotheses of homology) should motivate ever more comparative studies.

From a methodological standpoint, then, identifying homologies has salutary effects. First, it demands an actual comparison. Second, in comparing across clades we can easily generate hypotheses. If our trait of interest stands in particular relations to other features in one organism (a given regulatory gene, for example), we can hypothesize that it will also do so in another. We still may not find the targets, but hypotheses of homology can tell us what to test initially.
As we move from the initial wave of genome sequencing to the wonderfully more complicated problems of understanding what proteins do, how they interact, and how they are regulated, we will need principled ways to interpret profiling information, generate network hypotheses, and annotate myriad functions. In that project, homology plays a useful role both in giving a methodological starting point for generating candidate interactions and in reminding us that inference from similarity is difficult. The use of comparative developmental genetics to generate hypotheses of homology should be embraced. Expression patterns and regulatory networks are legitimate foci for hypotheses of homology, because they help us understand the origin and evolution of structure. Finally, attributions of homology should be sought, solely on methodological grounds, because they offer us specific testable hypotheses.

ACKNOWLEDGMENTS

I would like to acknowledge pivotal conversations with Georg Halder, John True, and Jen Grenier during my postdoctoral work with Sean Carroll in the Laboratory of Molecular Biology, Howard Hughes Medical Institute, Madison, Wisconsin, and very useful comments from Kevin Padian at the Museum of Paleontology, UC Berkeley, and Scott Gilbert at Swarthmore College.

NOTES

1. The paralogy and orthology distinction was introduced to distinguish two kinds of homology in proteins (Fitch, 1970). Paralogy is meant to cover those situations when a gene duplication allows related proteins to evolve independently within the same lineage. Orthologues are found in different individuals, and paralogues can be found in the same individual (reviewed in Patterson, 1987).

2. “The importance of the science of Homology rests in its giving us the key-note of the possible amount of difference in plan within any group; it allows us to class under proper heads the most diversified organs; it shows us gradations which would otherwise have been overlooked, and thus aids us in our classification; it explains many monstrosities; it leads to the detection of obscure and hidden parts, or mere vestiges of parts, and shows us the meaning of rudiments. Besides these practical uses, to the naturalist who believes in the gradual modification of organic beings, the science of Homology clears away the mist from such terms as the scheme of nature, ideal types, archetypal patterns or ideas, &c.; for these terms come to express real facts. The naturalist, thus guided, sees that all homological parts or organs, however much diversified, are modifications of one and the same ancestral organ; in tracing existing gradations he gains a clue in tracing, as far as that is possible, the probable course of modification during a long line of generations. He may feel assured that, whether he follows embryological development, or searches for the merest rudiments, or traces gradations between the most different beings, he is pursuing the same object by different routes, and is tending towards the knowledge of the actual progenitor of the group, as it once grew and lived. Thus the subject of Homology gains largely in interest” (Charles Darwin, On the Various Contrivances by Which British and Foreign Orchids Are Fertilised by Insects, 2nd ed. [London: John Murray, 1877], pp. 233–234).

3. This insistence on a pluralistic account (including homology, selection, and chance) is not meant to defend claims of percent homologue. A particular similarity either is or is not homologous. The use of “homology” with respect to gene sequences to indicate percent similarity should be avoided. I am only making the uncontroversial claim that any comparison of particular traits in toto will require an appeal to homology, convergence, and chance.

2 Automation of Protein Sequence Characterization and Its Application in Whole Proteome Analysis

Rolf Apweiler, Margaret Biswas, Wolfgang Fleischmann, Evgenia V. Kriventseva, and Nicola Mulder

The first complete genome sequence of an organism, the five-kilobase sequence of the bacterial virus phi-X174, was achieved by Fred Sanger and coworkers in Cambridge (Sanger et al., 1978). Only more recently, however, has the technology developed to a stage where the sequencing of the complete genome of a living organism can be contemplated as a practical and routine possibility. A major breakthrough was the sequencing of the first complete eukaryote chromosome, chromosome III of Saccharomyces cerevisiae, in 1992 by a European Union-funded consortium (Oliver et al., 1992). In 1995 the TIGR group published the first complete sequence of a bacterial genome, that of Haemophilus influenzae (Fleischmann et al., 1995).

Since those dramatic events the complete sequences of more than 40 bacterial genomes have been published and at least 70 more are known to be nearing completion. Five eukaryotic genomes have been sequenced: those of Saccharomyces cerevisiae (Goffeau et al., 1997), the nematode Caenorhabditis elegans (The C. elegans Consortium, 1998), the fruit fly Drosophila melanogaster (Adams et al., 2000), the plant Arabidopsis thaliana (The Arabidopsis Initiative, 2000), and the alga Guillardia theta (Douglas and Penny, 1999). The sequences of other model eukaryotes are nearing completion. Large-scale sequencing of the genome of the laboratory mouse is well under way in the United States, Japan, and Europe. The sequences of several important protozoan parasites are close to being finished. In addition, the complete genomes of many mitochondria and plastids have been determined. The “Holy Grail” of large-scale sequencing is, however, the determination of the sequence of the human genome, estimated at around 3 billion base pairs. The completion of the “first draft” of this sequence was announced on 26 June 2000 by an international consortium of public laboratories.

Various proteomics and large-scale functional characterization projects in Europe, Japan, and the United States complement the large-scale nucleotide sequencing efforts. These projects have all produced large amounts of sequence data lacking experimental determination of the biological function. To cope with such large data volumes and to provide meaningful information, new approaches to characterize and annotate the biological data in a faster and more effective way are required. One promising but still error-prone approach is automatic functional analysis, which is generated with limited human interaction.

AUTOMATIC ANNOTATION

The Pitfalls of Automatic Functional Analysis

Several solutions for automatic functional characterization of unknown proteins are based on high-level sequence similarity searches against known proteins. Other methods collect the results of different prediction tools in a simple (http://pedant.gsf.de/; Frishman and Mewes, 1997) or a more elaborate (http://jura.ebi.ac.uk:8765/ext-genequiz/; Tamames et al., 1998; Hoersch et al., 2000) manner.

However, some of the currently used approaches have several drawbacks, including the following:

• Since many proteins are multifunctional, the assignment of a single function, which is still common in genome projects, results in the loss of information and outright errors.
• Since the best hit in pairwise sequence similarity searches is frequently a hypothetical protein, a poorly annotated protein, or simply a protein that has a different function, the propagation of wrong annotation is widespread.
• There is no coverage of position-specific annotation, such as active sites.
• The annotation is not constantly updated, and thus is quickly outdated.

It is also important to emphasize that a single sentence describing some predicted properties of an unknown protein should not be regarded as annotation. It may be regarded as an attempt to characterize a protein, but not as an attempt to annotate the protein. Annotation means the addition to a protein sequence of as much reliable and up-to-date information as possible describing properties such as function(s) of the protein, domains and sites, catalytic activity, cofactors, regulation, induction, subcellular location, quaternary structure, diseases associated with deficiencies in the protein, the tissue specificity of a protein, developmental stages in which the protein is expressed, pathways and processes in which the protein may be involved, similarities to other proteins, and so on.

The Annotation Concept of SWISS-PROT and TrEMBL

The SWISS-PROT protein sequence database (Bairoch and Apweiler, 2000) strives to provide extensive annotation as defined above. The increased data flow from genome projects to the protein sequence databases, however, challenges this time- and labor-intensive method of database annotation. Maintaining the high quality of annotation in SWISS-PROT requires the careful and detailed annotation of every entry with information retrieved from the scientific literature and from rigorous sequence analysis. This is the rate-limiting step in the production of SWISS-PROT. It is of paramount importance to maintain the high editorial standards of SWISS-PROT because the exploitation of the sequence avalanche is heavily dependent on reliable data sources as the basis for automatic large-scale functional characterization and annotation by comparative analysis. This, then, sets a limit on how much the SWISS-PROT annotation procedures can be accelerated. Recognizing that it is also vital to make new sequences available as quickly as possible, in 1996 the European Bioinformatics Institute (EBI) introduced TrEMBL (Translation of EMBL nucleotide sequence database).
TrEMBL consists of computer-annotated entries derived from the translation of all coding sequences (CDS) in the EMBL database, except for CDS already included in SWISS-PROT.

To enhance the annotation of uncharacterized protein sequences in TrEMBL, the SWISS-PROT/TrEMBL group at the EBI developed a novel method for automatic and reliable functional annotation (Fleischmann et al., 1999). This method selects proteins in the SWISS-PROT protein sequence database that belong to the same group of proteins as a given unannotated protein, extracts the annotation shared by all functionally characterized proteins of this group, and assigns this common annotation to the unannotated protein.

Automatic Annotation of TrEMBL

To implement this methodology for the automated large-scale functional annotation of proteins, three major components are required. First, a reference database must serve as the source of annotation. SWISS-PROT makes an excellent reference database due to its highly reliable, well-annotated, and standardized information. Second, a highly reliable, diagnostic protein family signature database must provide the means to assign proteins to groups. Initially, PROSITE (Hofmann et al., 1999) was used, and in future, InterPro, described below, will be used. The third component needed for the implementation of the automated large-scale functional annotation methodology is a database (RuleBase) that stores and manages the annotation rules, their sources, and their usage.

The Reference Database

The basis for the automatic annotation of TrEMBL is the functional information in the SWISS-PROT protein sequence database. Many other annotation approaches try to predict functions by comparative analysis with SWISS-PROT and other protein databases like TrEMBL and Genpept. There are three main reasons for using only SWISS-PROT annotation in automatic approaches.

First, SWISS-PROT is a comprehensive protein sequence database. This may seem surprising, since as of October 2000 SWISS-PROT contains only 88,000 proteins. Although these sequences represent, taking redundancy into account, less than one-third of all known protein sequences, SWISS-PROT contains around 60% of all proteins found in comprehensive protein sequence databases (like SWISS-PROT + TrEMBL [SPTR] or protein entries in Entrez) with annotation of at least basic experimentally derived functional characterization.
This percentage was estimated from the number of papers (70,000) cited in SWISS-PROT records compared with the number of papers in all SPTR or Entrez protein entries (110,000) together. The calculation was based on the assumption that the proportion of papers reporting sequencing to papers reporting characterization is the same in SWISS-PROT records as in TrEMBL records or in non-SWISS-PROT Entrez protein records. However, an inspection of citations from SWISS-PROT compared with citations from TrEMBL shows that SWISS-PROT citations contain a higher proportion of papers reporting biochemical characterization than TrEMBL citations do. This observation, together with the sequence redundancy in TrEMBL and the non-SWISS-PROT records of Entrez proteins, indicates that SWISS-PROT probably contains more than 60% of all annotated proteins with at least basic biochemical characterization. Even more striking is the fact that more than 80% of all functional annotation found in the comprehensive protein sequence database records (such as SPTR or protein entries in Entrez) is SWISS-PROT annotation. SWISS-PROT annotation is, for the most part, stored in the CC (Comment), FT (Feature Table), KW (Keyword), and DE (Description) lines. As of August 2000, there are more than 410,000 CC lines, 460,000 FT lines, and 110,000 DE lines in SWISS-PROT. This information in SWISS-PROT is abstracted from more than 70,000 literature citations reporting sequencing and/or characterization.

Another important reason is the standardization of annotation in SWISS-PROT. This unique feature of SWISS-PROT allows the extraction of the “common annotation” described above. Using the standardized SWISS-PROT annotation leads eventually to the standardized annotation of TrEMBL.

The last and perhaps most important reason is the fact that SWISS-PROT distinguishes experimentally determined functions from those determined computationally.

InterPro

InterPro (Apweiler et al., 2001) is an integrated resource for protein families, domains, and functional sites, developed as an integrative layer on top of the PROSITE, PRINTS (Attwood et al., 2000), Pfam (Bateman et al., 2000), and ProDom (Corpet et al., 2000) databases. The different approaches integrated in InterPro (hidden Markov models [HMMs], profiles, fingerprints, regular expressions, etc.) have different strengths and weaknesses.
The combination of the strengths of the different signature recognition methods, coupled with a statistical and biological significance test, overcomes drawbacks of the individual methods. InterPro reliably classifies proteins into families and recognizes the domain structure of multidomain proteins. The use of InterPro should facilitate increased coverage of target sequences with enhanced reliability (reduction of false positives and false negatives). InterPro can currently classify around 60% of all known protein sequences.

RuleBase

RuleBase stores the common annotation extracted from a group of SWISS-PROT entries. The common annotation is linked to the conditions and to the set of proteins from which the annotation was derived. The concept of a rule is used so that every rule has one or more conditions and one or more actions associated with it. If the conditions hold for a target TrEMBL entry, then all the actions are applied to that entry (Fleischmann et al., 1999).

Implementation

The actual flow of information during automatic annotation can be divided into five steps.

1. Use InterPro and additional a priori knowledge to extract the information necessary to assign proteins to groups (conditions) and store the conditions in RuleBase.
2. Group the proteins in SWISS-PROT by the stored conditions.
3. Extract from SWISS-PROT the common annotation shared by all functionally characterized proteins from each group. Store this common annotation together with its conditions in RuleBase. Every rule consists of conditions and the annotation common to all proteins of the group characterized by these conditions.
4. Group the unannotated, target TrEMBL entries by the conditions stored in RuleBase.
5. Add the common annotation to the unannotated TrEMBL entries. The predicted annotation will be flagged with evidence tags, which will allow users to recognize the predicted nature of the annotation as well as the original source of the inferred annotation.

Because the reliability of the conditions is crucial to the reliability of the methodology, measures are taken to minimize false-positive automatic annotation. The InterPro database that is used to extract conditions and to assign proteins to groups integrates different computational techniques for the recognition of signatures that are diagnostic for different protein families or domains.
In addition, every rule ensures that the taxonomic classification of the unannotated protein sequences lies within the known taxonomic range of the experimentally characterized proteins.

This automatic annotation approach should overcome some limitations of some existing automatic annotation methods in the following ways:

• By using only the annotation from a reliable reference database for the predictions, the propagation of wrong annotation, one of the core problems in functional annotation, is drastically reduced (Bork and Koonin, 1998).
• By using the “common annotation” of multiple entries, the implemented methodology will produce a significantly lower number of overpredictions than methods based on the best hit of a sequence similarity search.
• Using the “common annotation” from a reliable reference database with standardized annotation and nomenclature ensures the standardized annotation of uncharacterized, target proteins by avoiding the use of wrong nomenclature and of different descriptions for the same biological fact.
• Since the method takes all potential annotation available in the reference database into account, a much higher level of annotation, including position-specific annotation such as active sites, is possible.
• The “common annotation” approach can be used not only with protein families but also with conditions aiming at a higher level in the protein family hierarchy. Only the annotation common to all members of this (for instance) superfamily will be copied over.
• Our methodology is independent of the multidomain organization of proteins. If a certain condition aims at a single domain that occurs with various other domains, it can be expected that only the annotation referring to this single domain will be found in all relevant characterized proteins. On the other hand, if the single domain always occurs with another domain, the information for the other domain will be picked up as well.
• Evidence tags will allow the automatic update of the predicted annotation if the underlying conditions or the “common annotation” in RuleBase changes.

WHOLE PROTEOME ANALYSIS

A Four-Layer Approach to Whole Proteome Analysis

It is no longer ludicrous to envisage collecting vast amounts of genomic data, although it remains a massive task.
The challenge is in developing the tools and methods required to analyze the data. In the sections above, we described how the SWISS-PROT group at the EBI combines manual annotation and sequence analysis of SWISS-PROT entries with rule-based automatic annotation of TrEMBL entries to provide a comprehensive, reliable, and up-to-date protein sequence database. With existing methodology we are able to improve the annotation of approximately 25% of the incoming data. Exploiting this approach to the full will enable us to annotate approximately 40–50% of the new and existing sequence data in a reasonable way within a few years. However, tools developed by our group and by others make possible the preliminary classification and characterization of many more sequences. Capitalizing on these achievements, we developed a new four-layer strategy for protein analysis:

1. Automatic protein classification
2. Automatic protein characterization
3. Rule-based automatic annotation
4. SWISS-PROT-style manual annotation.

From level 1 to level 4 there is an increase in the manual intervention required and a decrease in both the computational power needed and the number of protein sequences affected. The rule-based automatic annotation of TrEMBL entries and the SWISS-PROT-style manual annotation (levels 3 and 4) were described above. In the following sections we will describe automatic protein classification and characterization, and their application to provide statistical and comparative analysis, as well as structural and other information, for complete proteome sequences.

Whole Proteome Analysis at EBI

The EBI proteome analysis initiative aims to provide comprehensive, easily accessible information as quickly as possible to the user community. Proteome analysis data have been produced for all the completely sequenced organisms spanning archaea, bacteria, and eukaryotes. Complete proteome sets for each organism have been assembled from the SPTR (SWISS-PROT + TrEMBL + TrEMBLnew) database (Apweiler, 2000) to be wholly nonredundant at the sequence level.
These proteome data have been used in the analysis, and are easily accessible and downloadable from the proteome analysis pages (http://www.ebi.ac.uk/proteome/).

Automatic Protein Classification

For the automatic classification of proteins, InterPro (Apweiler et al., 2001), CluSTr, HSSP (Sander and Schneider, 1991), TMHMM (Sonnhammer et al., 1998), and SignalP (Nielsen et al., 1999) are used. SignalP is used for the prediction of signal peptides and their cleavage sites in eukaryotes and prokaryotes in order to classify secreted proteins and transmembrane proteins with signal sequences. TMHMM predicts transmembrane helices in proteins and is used for the identification and classification of transmembrane proteins. A list of nonredundant proteins from the reference proteome with HSSP (homology-derived secondary structure of proteins) links has been generated from current releases of SWISS-PROT and TrEMBL. These proteins, together with those having a corresponding PDB (Berman et al., 2000) entry, represent the proteins with structural classification.

The resources with the highest information content are InterPro and CluSTr. InterPro (http://www.ebi.ac.uk/interpro/) classifies 50–70% of all proteins in a proteome into distinct families. In addition, InterPro provides insights into the domain composition of the classified proteins. The proteome analysis pages (http://www.ebi.ac.uk/proteome/) make available InterPro-based statistical analysis that includes the following, among other information:

• General statistics: lists all InterPro entries with matches to the reference proteome. The matches per genome and the number of proteins matched for each InterPro entry are displayed.
• Top 30 entries: lists the 30 InterPro entries with the highest number of protein matches for the reference proteome.
• 15 most common domains: lists the InterPro entries with the largest number of Pfam and profile matches (defined as domains) for the reference proteome. The matches per genome and the number of proteins matched for each InterPro entry are shown.
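The statistics listed above amount to simple counting over signature hits. The following is a minimal sketch of that counting, not the EBI implementation: the input format (pairs of protein accession and InterPro identifier), the function names, and the toy accessions are all assumptions made for illustration.

```python
from collections import defaultdict


def interpro_statistics(matches):
    """Summarise signature hits per InterPro entry.

    `matches` is an iterable of (protein_accession, interpro_id) pairs,
    one pair per signature hit (a hypothetical input format).
    Returns {interpro_id: (total_hits, distinct_proteins)}.
    """
    hit_counts = defaultdict(int)
    proteins = defaultdict(set)
    for accession, entry in matches:
        hit_counts[entry] += 1
        proteins[entry].add(accession)
    return {entry: (hit_counts[entry], len(proteins[entry])) for entry in hit_counts}


def top_entries(stats, n=30):
    """InterPro entries ranked by number of proteins matched (ties broken by ID)."""
    return sorted(stats, key=lambda entry: (-stats[entry][1], entry))[:n]


def coverage(matches, proteome_size):
    """Fraction of the proteome with at least one InterPro match."""
    classified = {accession for accession, _ in matches}
    return len(classified) / proteome_size


# Toy data: the accessions are made up, and the InterPro identifiers
# serve only as labels here.
example_matches = [
    ("P0001", "IPR000175"),
    ("P0001", "IPR000175"),  # two hits of the same entry in one protein
    ("P0002", "IPR000175"),
    ("P0002", "IPR001991"),
    ("P0003", "IPR001991"),
    ("P0004", "IPR002668"),
]
example_stats = interpro_statistics(example_matches)
```

Keeping total hits and distinct proteins as separate numbers mirrors the distinction the list draws between matches per genome and proteins matched; `coverage` corresponds to the share of a proteome that InterPro classifies.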
CluSTr

There are several databases that focus on the analysis of complete protein sequences. The COG database (Clusters of Orthologous Groups of proteins) is a phylogenetic classification of proteins encoded in 21 complete genomes of bacteria, archaea, and eukaryotes (http://www.ncbi.nlm.nih.gov/COG; Tatusov, 2000). ProtoMap offers a hierarchical classification of proteins in the SWISS-PROT and TrEMBL databases (http://www.protomap.cs.huji.ac.il/; Yona et al., 2000) based on analysis of all pairwise similarities among the protein sequences. The searching algorithm SYSTERS (SYSTEmatic Re-Searching) applies an iterative method for database searching to cluster sequences from a number of databases that store protein sequences (http://www.dkfz-heidelberg.de/tbi/services/cluster/systersform; Krause et al., 2000).

CluSTr (http://www.ebi.ac.uk/clustr/), the database of clusters of SWISS-PROT and TrEMBL proteins developed at EBI, will be discussed in some detail in this chapter. It offers an automatic classification of SWISS-PROT + TrEMBL (SPTR) proteins into groups of related proteins. The clustering is based on analysis of all pairwise comparisons between protein sequences. Analysis has been carried out for different levels of protein similarity, yielding a hierarchical organization of clusters.

Methodology

The clustering approach is based on two steps. First, a similarity matrix of “all-against-all” protein sequences is built. The similarity matrix is computed using the Smith-Waterman algorithm (Smith and Waterman, 1981). A Monte Carlo simulation, resulting in a Z-score (Comet et al., 1999), is used to estimate the statistical significance of similarity between potentially related proteins. Initially, a Smith-Waterman score between sequences A and B is calculated. If this score is higher than a certain threshold, sequence A is compared with N shuffled sequences of B ($B^*$). Sequences $B^*$ have the same length and amino acid composition as the initial sequence B. The Z-score is calculated as

$$Z(A, B) = \frac{SW(A, B) - M}{\sigma},$$

where $SW(A, B)$ is the initial Smith-Waterman score, $M$ is the average Smith-Waterman score between sequence A and sequences $B^*$, and $\sigma$ is the standard deviation of those scores. Sequence B is then compared with N shuffled sequences of A ($A^*$) and $Z(B, A)$ is calculated. The final Z-score is

$$Z\text{-score} = \min(Z(A, B),\, Z(B, A)).$$
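The Monte Carlo procedure can be made concrete with a short sketch. This is an illustration under stated assumptions, not the LASSAP or CluSTr code: the scoring scheme (match +2, mismatch -1, linear gap -1) is a toy stand-in for a real substitution matrix, the initial score threshold is skipped, and the function names and the handling of a zero standard deviation are choices made for this example.

```python
import random
import statistics


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score under a toy scoring scheme (assumption)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def one_sided_z(a, b, n_shuffles, rng):
    """Z(A, B): score of A against B relative to A against shuffled copies of B."""
    observed = smith_waterman_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same length and composition as B
        shuffled_scores.append(smith_waterman_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)
    if sigma == 0:
        # Degenerate case, handled by a convention chosen for this sketch.
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / sigma


def z_score(a, b, n_shuffles=100, seed=0):
    """Final Z-score: the smaller of Z(A, B) and Z(B, A)."""
    rng = random.Random(seed)
    return min(one_sided_z(a, b, n_shuffles, rng), one_sided_z(b, a, n_shuffles, rng))
```

Because each shuffle preserves length and composition, the resulting value reflects only the two sequences being compared; taking the minimum of the two directions is the conservative choice the text describes.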

The Z-score depends only on the compared sequences, not on the size and composition of the sequence database. By storing all the scores of unchanged sequences and calculating only “new-against-new” and “new-against-unchanged,” the CluSTr database can be updated incrementally, avoiding time-consuming recalculations.

Second, clusters are built using a single linkage algorithm (Sneath and Sokal, 1973) for different levels of protein similarity. There are two major complications in automatic clustering procedures: different protein families have different levels of sequence similarity, and clusters of proteins with different domains get pulled together by multidomain proteins. One of the approaches to tackling these problems uses hierarchical clustering that works with clusters at different levels of sequence similarity. The LASSAP package (Glemet and Codani, 1997) has been used to calculate similarities and to build clusters.

Data Structure

Clusters for mammalian proteins, plant proteins, and the three complete eukaryotic genomes (Caenorhabditis elegans, Saccharomyces cerevisiae, and Drosophila melanogaster) have been built. The CluSTr data are stored in a relational database that comfortably handles large amounts of data and facilitates comprehensive data updates. Multiple users have direct access to the database via Java servlets.

The main building blocks of the schema underlying CluSTr are Proteins, Groups, Similarities, and Clusters. The Proteins table describes SPTR entries, Groups describes protein sets for which clusters were built and the history of comparison runs, Similarities contains the pairwise scores between proteins, and Clusters represents the information about and relationships between different clusters.

Keeping the data up-to-date has been another big challenge in the design and implementation of the CluSTr database.
The aim is to up- date the CluSTr data incrementally, in a synchronized manner with the weekly updates of SPTR. There are additional Oracle tables to facilitate this. The Protein_New table gets populated with new protein data. New, changed, and deleted proteins are checked for, using SPTR ac- cession numbers and the circular redundancy check sum (CRC64). An algorithm to compute the CRC64 is described in the ISO-3309 standard (ISO-3309, 1993). While, in theory, two different sequences could have the same CRC64 value, the likelihood that this would happen is quite29 Automation of Protein Sequence Characterization

42.
low. A list of new and changed proteins is created, and the similarities of this new set against itself and against the unchanged proteins are calculated.

User Interface

The CluSTr database is available for querying and browsing at http://www.ebi.ac.uk/clustr. A query can be made using SPTR accession numbers, SWISS-PROT ID entry names, sequence annotation, key words, and taxonomic information. The result of the query is a graphical presentation of the corresponding clusters at different levels of protein similarity. For example, the results for a text query of “human sodium transport” proteins are shown in figure 2.1. On the right of the table are the accession numbers of proteins with the words “human” and “sodium transport” in their annotation, and on the left is the cluster structure that these proteins form at different Z-levels. The larger groups of clusters, of size 16, 9, and 5, correspond to the Sodium:neurotransmitter symporter family (IPR000175), the Sodium:dicarboxylate symporter family (IPR001991), and the Na+ dependent nucleoside transporter (IPR002668), respectively. The next group of proteins is not well described. At the bottom of the table are the sodium bile acid symporter family (IPR002657) and sodium-dependent phosphate transport proteins.

Figure 2.1 Searching the CluSTr database. Results for a query of “human sodium transport” proteins. The table contains accession numbers of proteins with the words “human” and “sodium transport” in their annotation and the corresponding clusters at different Z-levels.

30 Rolf Apweiler et al.

A cluster of interest can be further investigated by clicking on its ID number. For each cluster, the list of proteins, their descriptions, and the domain composition are provided (figure 2.2). Links to the Sequence Retrieval System (SRS) (Etzold et al., 1996) allow users to download the list of proteins from a cluster. The domain composition is defined using InterPro. Links to the InterPro graphical view allow users to see at a glance whether proteins from a particular cluster share common domains or functional sites (figure 2.3). For each cluster, a list of secondary structure cross-references from the Homology-derived Secondary Structure of Proteins (HSSP) database (Sander and Schneider, 1991) is generated dynamically. The database also provides links to the Protein Data Bank (PDB) (Berman et al., 2000).

It has already been mentioned that the clusters are built for specific taxonomic groups. For each of the organisms that have been studied, the following information is displayed:

• General statistics: the number of clusters with two or more proteins, the total number of proteins in these clusters, the number of singletons, and the number of distinct families at different levels of protein similarity.
• List of singletons: proteins that form clusters of size 1 at the lowest studied protein similarity level.
• 30 biggest clusters: the 30 biggest protein clusters and their InterPro-based functional classification.
• Clusters without InterPro links: clusters of size 5 or more that have no matching InterPro families, domains, or functional sites.
• Clusters without HSSP links: clusters of size 5 or more for which there are no HSSP matches.

31 Automation of Protein Sequence Characterization
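The per-organism summaries listed above can be illustrated with a small sketch. It assumes a hypothetical in-memory mapping from protein accession to cluster ID at a single similarity level, plus a set of cluster IDs known to have InterPro matches; the function name, the input format, and the toy data are illustrative assumptions and do not describe how CluSTr itself is implemented.

```python
from collections import Counter


def cluster_statistics(protein_to_cluster, clusters_with_interpro, min_size=5, top_n=30):
    """Summarize clusters at one similarity level.

    protein_to_cluster: dict of accession -> cluster ID (hypothetical input format).
    clusters_with_interpro: set of cluster IDs with at least one InterPro match.
    """
    # Number of proteins in each cluster.
    sizes = Counter(protein_to_cluster.values())
    # Clusters that contain two or more proteins.
    multi = {cid: n for cid, n in sizes.items() if n >= 2}
    # Clusters ordered from largest to smallest (ties broken by cluster ID).
    ranked = sorted(sizes.items(), key=lambda item: (-item[1], item[0]))
    return {
        "clusters_with_two_or_more": len(multi),
        "proteins_in_those_clusters": sum(multi.values()),
        "singletons": sorted(
            acc for acc, cid in protein_to_cluster.items() if sizes[cid] == 1
        ),
        "biggest_clusters": [cid for cid, _ in ranked[:top_n]],
        "large_clusters_without_interpro": sorted(
            cid
            for cid, n in sizes.items()
            if n >= min_size and cid not in clusters_with_interpro
        ),
    }


# Toy data (made up): one cluster of five, one of two, and one singleton.
toy_assignment = {
    "A1": "c1", "A2": "c1", "A3": "c1", "A4": "c1", "A5": "c1",
    "B1": "c2", "B2": "c2",
    "C1": "c3",
}
toy_stats = cluster_statistics(toy_assignment, clusters_with_interpro={"c2"})
```

The same pattern would apply to the HSSP list by passing the set of clusters that have HSSP matches instead of the InterPro set.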

Figure 2.2 A cluster of the human sodium:neurotransmitter symporter proteins. The presentation contains general information, a list of proteins, their descriptions, and an InterPro-based domain description of the cluster. At the bottom of the page are links to the SRS-generated list of clustered proteins, the InterPro graphical representation, and the HSSP and PDB databases.

Figure 2.3 Part of the InterPro graphical view for the cluster of the human sodium:neurotransmitter symporter proteins (ID 53435) from figure 2.2.

AUTOMATIC PROTEIN CHARACTERIZATION

The Gene Ontology (GO) (Ashburner et al., 2000), created by FlyBase (The FlyBase Consortium, 1999), the Saccharomyces Genome Database (SGD) (Ball et al., 2000), and the Mouse Genome Database (MGD) (Blake et al., 2000), is gaining acceptance as a universal controlled vocabulary for annotating genes and gene products. GO terms are currently being assigned to proteins in SWISS-PROT and TrEMBL, and to InterPro domains and families. Before the GO mapping began, each InterPro entry was assigned a functional classification in the form of a three-letter code, with the categories based on top-level GO terms. Using this basic classification, SWISS-PROT key words, and manual inspection of the annotation of protein families, specific GO terms of all levels were mapped to each InterPro entry.

This mapping is in progress, and of the three organizing principles of GO (details can be found at http://www.geneontology.org/), biological process and molecular function have been taken up first. If the number of proteins with a known or reliably predicted subcellular location becomes significant, the cellular component of GO will also be included in the classification.

There are cases where InterPro entries describe nonspecific protein domains or families that cannot be assigned a specific GO term, and there are also cases where GO terms do not yet exist (e.g., when the domain or family is specific to prokaryotes). These InterPro entries, however, still contain the three-letter functional classification code, which may be more general and includes the categories Unknown Function and Multifunctional Proteins.

Using the classification data and selecting only the top-level terms in the GO hierarchy, a table has been created for each completed proteome that lists the GO terms and the number of proteins mapped to each term. These tables can be found through links from the proteome analysis pages for each organism. For Drosophila melanogaster, for example, the page is located at http://www.ebi.ac.uk/proteome/DROME/go/function.html. A functional classification of the proteins within each proteome set has thus been generated to show the percentage of proteins involved in, for example, metabolism, transcription, and so on. This is represented graphically for three eukaryotes in figure 2.4.

The functional classification and mapping to GO of InterPro families and domains is a simple method for determining whole proteome composition and provides a basis for comparative analysis. It also provides a framework for the mapping to GO of all proteins in SWISS-PROT and TrEMBL that have matches in InterPro, and for any new or previously uncharacterized protein sequences searched for in InterPro. In addition, the CluSTr database has links to InterPro, and from there to the corresponding functional classification codes and GO terms, thus making it possible to identify protein functions within clusters.
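The per-proteome table of top-level GO terms and protein counts described above amounts to a simple tally. The following is a minimal sketch under stated assumptions: the input is a hypothetical dictionary from protein accession to the set of top-level categories it was mapped to, and the function name and toy data are illustrative only, not the actual pipeline used for the proteome analysis pages.

```python
from collections import Counter


def go_term_table(protein_terms):
    """Count proteins per top-level term and express each count as a percentage.

    protein_terms: dict of accession -> set of top-level category names
    (hypothetical input; an empty set means no term could be assigned).
    Returns a dict of term -> (protein count, percentage of the proteome).
    """
    total = len(protein_terms)
    if total == 0:
        return {}
    # Each protein contributes once to every term it is mapped to.
    counts = Counter(term for terms in protein_terms.values() for term in terms)
    return {term: (n, round(100.0 * n / total, 1)) for term, n in counts.items()}


# Toy proteome of four proteins (made-up accessions and assignments).
toy_proteome = {
    "P1": {"metabolism"},
    "P2": {"metabolism", "transcription"},
    "P3": {"transcription"},
    "P4": set(),
}
toy_table = go_term_table(toy_proteome)
```

Because a protein may map to more than one term, the percentages in such a table need not sum to 100, and unassigned proteins lower every percentage without appearing in any row.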
COMPARATIVE PROTEOME ANALYSIS

Some of what are likely to be the most frequently requested comparisons are available from the index page for each of the reference proteomes. Comparisons of the proteomes are based on InterPro statistics and are precomputed. The proteomes for which such comparisons are currently available are the archaea Pyrococcus abyssi and Pyrococcus horikoshii; various groups of bacteria: Bacillus subtilis and Escherichia coli; Chlamydia pneumoniae, Chlamydia trachomatis, and Chlamydia muridarum; the two Helicobacter pylori strains (26695 and J99); and Mycoplasma genitalium and Mycoplasma pneumoniae; and the three complete eukaryotic proteomes Caenorhabditis elegans, Drosophila melanogaster, and Saccharomyces cerevisiae. The incomplete proteome of Homo sapiens has been compared with the three complete eukaryotic proteomes, and the resulting data are available from the index page for Homo sapiens (http://www.ebi.ac.uk/proteome/HUMAN/).

Figure 2.4 Relative representation of different protein functions in the three complete eukaryotic proteomes based on the InterPro classification system. GPCRs = the G-protein coupled receptors.

Interactive InterPro-based comparisons can be made using the InterPro proteome comparisons program to select the proteomes of the organisms to be compared and the type of comparative analysis to be carried out (http://www.ebi.ac.uk/proteome/comparisons.html). Comparisons that can be made include general statistics, top 30 entries, top 200 entries, 10 biggest protein families, and 15 most common domains. An additional feature is the option to compute a list of shared InterPro entries that are common to all the selected proteomes (similar in concept to the overlapping region of a Venn diagram).
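The shared-entries option is, conceptually, a set intersection over the InterPro entries matched in each selected proteome. A minimal sketch follows; the function name, the dictionary layout, and the proteome keys are assumptions made for illustration, and the toy data are a tiny subset built from InterPro entries named in this chapter rather than real comparison output.

```python
def shared_interpro_entries(proteome_entries, selected):
    """Return the InterPro entries found in every selected proteome.

    proteome_entries: dict of proteome name -> iterable of InterPro accessions
    (hypothetical input format).
    selected: list of proteome names to compare.
    """
    sets = [set(proteome_entries[name]) for name in selected]
    if not sets:
        return []
    # The overlapping region of the Venn diagram is the intersection of all sets.
    return sorted(set.intersection(*sets))


# Toy subset: the protein kinase entry occurs in all three proteomes, while
# the other entries are each reported for a single species only.
toy_entries = {
    "DROME": {"IPR000719", "IPR000618"},
    "CAEEL": {"IPR000719", "IPR000168"},
    "YEAST": {"IPR000719", "IPR001138"},
}
toy_shared = shared_interpro_entries(toy_entries, ["DROME", "CAEEL", "YEAST"])
```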

SOME OBSERVATIONS FROM THE COMPARATIVE PROTEOME ANALYSIS OF CAENORHABDITIS ELEGANS, DROSOPHILA MELANOGASTER, AND SACCHAROMYCES CEREVISIAE

A comparative analysis of the three complete eukaryotic proteomes was the first application of some of the resources described here. The InterPro analysis plus manual data inspection enabled the assignment of just over 50% of the proteins of the proteomes of Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae (Rubin et al., 2000). The Proteome Analysis Database (http://www.ebi.ac.uk/proteome/) now contains the comparative analysis of all complete proteomes. The analysis is carried out using complete nonredundant proteome sets that comprise records taken from the SWISS-PROT, TrEMBL, and TrEMBLnew databases (Apweiler, 2000) corresponding to the complete proteome. The proteome sets are wholly nonredundant at the sequence level (http://www.ebi.ac.uk/proteome/CPhelp.html).

The average protein length and size range of full-length proteins (excluding fragments) for each of the three eukaryotic proteomes are presented in table 2.1. The average length of the proteins is similar in all three organisms, and higher than the average length of bacterial proteins (unpublished observation). The smallest proteins in Saccharomyces cerevisiae are the 60S ribosomal protein L41 (P05746) and the leader peptide CPA1 (P08521), while the largest is a hypothetical 560 kDa protein (Q12019). The largest protein from Caenorhabditis elegans is a 1368.6 kDa uncharacterized protein (Q09165) that contains several domains and motifs, including fibronectin type III repeats, a Von Willebrand factor type A domain, and EGF-like domains.
The smallest Drosophila melanogaster protein (Q9VRD2) is predicted to be just 8 amino acids long and is known as the CG11666 protein.

Table 2.1 A comparison of the average protein length and size range of full-length proteins for each of the three eukaryotic proteomes

                            Number of amino acid residues
Proteome                    Average protein length    Size range
Saccharomyces cerevisiae    476.69 ± 375.13           25 to 4910
Caenorhabditis elegans      434.76 ± 384.38           20 to 13,055
Drosophila melanogaster     486.91 ± 451.66           8 to 7182

For each of the proteomes, SignalP (Nielsen et al., 1999), a signal peptide prediction program, was run to find all proteins that are localized in the membrane or secreted. The transmembrane proteins were identified using the transmembrane prediction program TMHMM version 1.0 (Sonnhammer et al., 1998), and the secreted soluble proteins were classified based on the prediction of signal peptides, adjusted to remove those that are membrane proteins. The percentage of the proteome found to be secreted proteins was 21.6% for Caenorhabditis elegans, 20.1% for Drosophila melanogaster, and 12.7% for Saccharomyces cerevisiae; the percentage of the proteome predicted to be transmembrane proteins was 25.6% for Caenorhabditis elegans, 16% for Drosophila melanogaster, and 17.1% for Saccharomyces cerevisiae. Caenorhabditis elegans and Drosophila melanogaster have a similar representation of secreted proteins, while Saccharomyces cerevisiae has a significantly lower proportion of these proteins, a finding that may be explained by the fact that Saccharomyces cerevisiae is unicellular. Surprisingly, Caenorhabditis elegans has a markedly higher proportion of transmembrane proteins, roughly one and a half times that of the other two eukaryotes.

The comparative analysis is carried out using sequence similarity searches against the InterPro database. Data for each of the three complete eukaryotic proteomes are available along with the data for the other complete proteomes. These data, updated weekly, are available at http://www.ebi.ac.uk/proteome/. The InterPro database currently enables the characterization of 7293 of the 13,613 Drosophila melanogaster proteins (53.6%), 9041 of the 16,606 Caenorhabditis elegans proteins (54.4%), and 3231 of the 6174 Saccharomyces cerevisiae proteins (52.3%) as belonging to a certain protein family or as possessing a certain domain or functional site.
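The secreted and transmembrane percentages reported in this section follow from combining two sets of predictions: proteins with a predicted signal peptide and proteins with predicted transmembrane segments. The sketch below shows that set logic only. It assumes the predictions have already been reduced to sets of accessions; the function name, input format, and toy data are hypothetical, and no SignalP or TMHMM output parsing is shown.

```python
def classify_proteome(all_proteins, signal_peptide_hits, transmembrane_hits):
    """Return the percentage of secreted and transmembrane proteins.

    all_proteins: iterable of every accession in the proteome set.
    signal_peptide_hits: accessions predicted to carry a signal peptide.
    transmembrane_hits: accessions predicted to be transmembrane proteins.
    (All three inputs are hypothetical, pre-parsed accession collections.)
    """
    proteome = set(all_proteins)
    if not proteome:
        return {"secreted": 0.0, "transmembrane": 0.0}
    transmembrane = set(transmembrane_hits) & proteome
    # Secreted soluble proteins: signal peptide predicted, membrane proteins removed.
    secreted = (set(signal_peptide_hits) & proteome) - transmembrane

    def percent(subset):
        return round(100.0 * len(subset) / len(proteome), 1)

    return {"secreted": percent(secreted), "transmembrane": percent(transmembrane)}


# Toy proteome of ten proteins (made-up accessions p0..p9).
toy_proteins = [f"p{i}" for i in range(10)]
toy_result = classify_proteome(
    toy_proteins,
    signal_peptide_hits={"p0", "p1", "p2"},
    transmembrane_hits={"p2", "p3"},
)
```

In the toy data, p2 has both a signal peptide and transmembrane segments, so it is counted as transmembrane and excluded from the secreted class.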
In total, 1673 of the 3208 InterPro entries were found in the three eukaryotic proteomes: 1423 in Drosophila melanogaster, 1291 in Caenorhabditis elegans, and 1073 in Saccharomyces cerevisiae, of which 823 were common to all three species.

Protein kinases, which belong to a very extensive family of proteins that share a conserved catalytic core (figure 2.5) common to both serine/threonine and tyrosine protein kinases, are highly represented in the proteomes of all three organisms, accounting for around 2% of the proteome. The C2H2-type zinc finger domain is also abundant in the proteins of all three eukaryotes, making up about 1% of the proteomes of Caenorhabditis elegans and Saccharomyces cerevisiae, and accounting for about 2.5% of the proteome of Drosophila melanogaster. The high abundance of these protein types in all three eukaryotes would indicate that these proteins are systematically conserved, are likely to have orthologues across species, and are likely to be involved in a shared core biology.

Figure 2.5 InterPro graphical view of representative sequences from the extensive protein kinase family described by InterPro entry IPR000719.

Several of the most abundant families or domains show striking differences in abundance across the three eukaryotic proteomes. The WD repeat is present in a large family of eukaryotic proteins that are implicated in a wide variety of crucial functions (Smith et al., 1999), and the RNA-binding motif, RNP-1, is found in a variety of eukaryotic RNA binding proteins. Proteins of both these types are comparatively underrepresented in Caenorhabditis elegans. On the other hand, proteins that belong to the rhodopsin-like G-protein-coupled receptor (GPCR) family are unknown in Saccharomyces cerevisiae. In fact, only two families appear on all three top 10 lists, which together comprise a total of 26 families across the three organisms (table 2.2).

A number of protein types that are apparently unique to a particular species may well define the species. Striking examples are the insect cuticle protein (IPR000618), present only in Drosophila melanogaster; the probable olfactory, nematode 7-helix G-protein coupled receptor (IPR000168), present only in Caenorhabditis elegans (figure 2.6); and the fungal transcription regulatory protein (IPR001138) and the yeast transposon, Ty (IPR001042), in the Saccharomyces cerevisiae proteome. The nematode 7-helix G-protein coupled receptor is the most abundant protein family in the proteome of Caenorhabditis elegans, accounting for 3.3% of the proteome.
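The overlap between the three top 10 lists can be checked with a short tally. The sketch below uses placeholder family names, because the actual contents of table 2.2 are not reproduced here; only the counts (three lists of 10, two families on all three lists, 26 distinct families) come from the text, and the function name is an illustrative assumption.

```python
from collections import Counter


def top_list_overlap(top_lists):
    """Summarize how families are shared between several top-N lists.

    top_lists: dict of organism name -> list of family identifiers.
    """
    # For each family, count the number of lists it appears on.
    membership = Counter(
        family for families in top_lists.values() for family in set(families)
    )
    n_lists = len(top_lists)
    return {
        "distinct": len(membership),
        "on_all_lists": sorted(f for f, k in membership.items() if k == n_lists),
        "on_some_but_not_all": sorted(
            f for f, k in membership.items() if 1 < k < n_lists
        ),
    }


# Placeholder data: two families shared by all three lists, plus eight
# families unique to each organism (names are invented for illustration).
shared = ["shared_1", "shared_2"]
toy_top10 = {
    org: shared + [f"{org}_only_{i}" for i in range(8)]
    for org in ("fly", "worm", "yeast")
}
toy_overlap = top_list_overlap(toy_top10)
```

Read literally, the reported figures leave no room for families shared by exactly two lists: three lists of 10 give 30 entries, the two families common to all three account for 6 of them, and the remaining 24 entries must all be distinct to reach 26 families in total.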
Together with the rhodopsin-like G-protein-coupled receptor (GPCR) there are a further substantial number of the top ten InterPro families of