Figures

Abstract

An increasing number of evidences show that genes are not distributed randomly across eukaryotic chromosomes, but rather in functional neighborhoods. Nevertheless, the driving force that originated and maintains such neighborhoods is still a matter of controversy. We present the first detailed multispecies cartography of genome regions enriched in genes with related functions and study the evolutionary implications of such clustering. Our results indicate that the chromosomes of higher eukaryotic genomes contain up to 12% of genes arranged in functional neighborhoods, with a high level of gene co-expression, which are consistently distributed in phylogenies. Unexpectedly, neighborhoods with homologous functions are formed by different (non-orthologous) genes in different species. Actually, instead of being conserved, functional neighborhoods present a higher degree of synteny breaks than the genome average. This scenario is compatible with the existence of selective pressures optimizing the coordinated transcription of blocks of functionally related genes. If these neighborhoods were broken by chromosomal rearrangements, selection would favor further rearrangements reconstructing other neighborhoods of similar function. The picture arising from this study is a dynamic genomic landscape with a high level of functional organization.

Author Summary

We describe here the most extensive functional cartography of the genomes of multiple species carried out to date. Our study shows, for the first time, how neighborhoods of functionally related genes arise and how they are maintained through evolution following a pattern that is fully consistent with the evolutionary trees of the analyzed species. Contrary to what would be expected, such neighborhoods are not composed of the same genes in different species but rather by genes unrelated, annotated, however, with the same function. Our analysis also reveals that such neighborhoods are dynamically rebuilt in a way that, while the particular genes often change, it is the function of the genes present in the neighborhood, as the ultimate target of selection, that is preserved.

Funding: This work is supported by grants BIO2008-04212, BFU2009-13409-C02-02, and BFU2006-15413-C02-01 from the Spanish Ministry of Science and Innovation, and PROMETEO/2010/001 from the GVA-FEDER. The National Institute of Bioinformatics (www.inab.org) and The CIBER de Enfermedades Raras are initiatives of the ISCIII. This work is also partially supported by a grant (RD06/0020/1019) from Red Tematica de Investigacion Cooperativa en Cancer (RTICC), Instituto de Salud Carlos III (ISCIII), and the Spanish Ministry of Science and Innovation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Gene activity, in terms of both intensity [1] and coexpression [2]–[5], does not occur randomly across eukaryotic chromosomes, but in many cases it clusters in certain genomic regions. Nevertheless, the driving force that originated and maintains co-expression neighborhoods is still a matter of controversy [3], [4], [6]–[10]. Several hypotheses have been put forward in order to explain the co-expression of neighboring genes which include the selection for co-regulation of genes with similar functional roles [9], [11], the reduction of gene expression noise in co-localized (but not necessarily functionally related) genes [6], [12] or the formation of clusters of paralogous genes with related functions and expression patterns by tandem duplication [2], [4], [13]. Co-regulation seems to be behind a significant part of the observed coexpression [14],[15] and other features, such as protein interactions seem also be correlated to coexpression [16]–[18]. The emerging portrait from different studies suggests that coexpression in clusters of genes might have both a functional and a neutral (non-functional) component [19].

In order to understand the real extent of this phenomenon we have produced a detailed functional cartography of the genomes of eight eukaryotic model species: Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus, Gallus gallus, Danio rerio, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana. A sliding window (see Materials and Methods section) was moved along all chromosomes and the enrichment in Gene Ontology [20] (GO) functional terms within each window was analyzed [21].

Results

Functional neighborhoods in eukaryotic genomes

One of the most remarkable results of our analysis is the rich functional landscape that it unveils. When the distribution of the functional annotations of the genes is analyzed by a sliding window (see Materials and Methods) it becomes apparent that genomes are formed by a large amount of functional neighborhoods. These range in the well annotated species from a 3% (Arabidopsis thaliana) to a 12% (Mus musculus) of the genes (Table 1). For example, in Homo sapiens, chromosomes 11 and 19 show a high number of genes in functional clusters (17.6% and 25.6%, respectively), which agrees with previous observations about the special properties of these chromosomes [22]. In mammals, sex chromosomes present a significant deviation in percentages when compared to autosomes (e.g. below 2% in X chromosomes or an extreme value of 35% in the Y chromosome of Mus musculus). See also Figure S1 that depicts the distribution of functional neighborhoods across the chromosomes of the species studied and Figure S2 with more details on the functions found in the neighborhoods. Table S3 list the genes contained in the functional neighborhoods found. Differences between human and chimpanzee in the mean gene density and percentage of genes in functional neighborhoods seem to be greater than it might be expected from their phylogenetic proximity. However, the considerable amount of chromosomal rearrangements between the genomes of humans and chimpanzees, most of which happened in the chimpanzee lineage [23] and were caused by repetitive elements [24] and lineage-specific segmental duplications [25], can provide an explanation for the observed differences. These differences strongly suggest the existence of selective pressures acting differentially on the respective functional neighborhoods. Our results are also in agreement with indirect evidences from inbred strains of mice based on linkage disequilibrium, which indicate that a quarter or more of the mammalian genome could consist of chromosome regions containing clusters of functionally related genes [26].

Functional neighborhoods are conserved across the phylogeny

Some of the functional categories found in functional neighborhoods are unique, and, as previously suggested [9], probably account for lineage or species-specific characteristics. Nevertheless, many GO term clusters were consistently shared by different species. When the GO terms found within the functional clusters are mapped over the eukaryotic phylogeny the distribution across species of the vast majority of them is fully compatible with the tree topology. Table 2 shows how different functional modules are distributed across species, arranged according their relative positions in the phylogeny. Figure 1 shows the most parsimonious phylogenetic positions of the functions consistently found in neighborhoods. Most trends are clear, despite some discrepancies in G. gallus or D. rerio, probably due to the preliminary stage of the annotations in these organisms. Thus, for example, Response to biotic stimulus, Response to stress and Localization seem to define functional neighborhoods common to all the eukaryotes. Actually, clustering of stress-related genes was described to occur during evolution of the S. cerevisiae genome [27]. Other terms, such as Organismal physiological process, Regulation of physiological process, Regulation of cellular process and Sensory perception, are characteristic to all animals. In plants (at least in its unique representative, A. thaliana) we found different terms, such as Cell growth, Viral infectious cycle, Regulation of gene expression epigenetics, as apomorphisms. Shared by all vertebrates are GO terms such as Coagulation, Response to external stimulus, Response to abiotic stimulus, Cell adhesion, Organ development and Sex differentiation (with the exception of chicken, as already mentioned). Invertebrates share clusters with the GO term Embryonic development. Finally, mammals share functional neighborhoods with GO terms such as Reproductive physiological process, Physiological interaction between organism and Behaviour, most of them making reference to more complex, social or interactive behaviors displayed by these animals. Human and chimpanzee are the closest species and share almost all the GO terms in functional neighborhoods.

Figure 1. Distribution of functions present in functional neighborhoods along the phylogeny.

The point at which a function makes up a functional neighborhood has been deduced from the species sharing functional clusters with this particular GO term. Boxes in yellow contain GO terms unique to taxa, boxes in blue contain GO terms common to clades and boxes in pink contain GO terms lost in these lineages. In the figure, terms labeled with P were not found in ape, with G: were not found in chicken, with R were not found in rat and with F were not found in fish.

Relationship between gene coexpression and common annotations in functional neighborhoods

Our results support a strong causal relationship between local coexpression and local co-functionality. For each GO term, the correlations among gene expression profiles of genes located within the functional neighborhood were compared to the corresponding correlations among the rest of genes belonging to the same GO term located outside the neighborhood (see Materials and Methods section). Table 1 shows that there is a significantly higher degree of coexpression in genes belonging to a given functional class when they are packed within a functional neighborhood than when they are elsewhere in the genome. This result, along with the lack of a significant relative enrichment of tandem duplications (see below), points out to coexpression as the most plausible driving force for the existence of functional neighborhoods.

Functional neighborhoods do not mainly result from duplication

If functional neighborhoods are originated as a simple result of tandem duplications of genes, different copies may or may not acquire different functions, but, in any case, they might share GO terms. A simple analysis of the number of paralogous contained in the regions shows that the percentages range from 14% (human) to 28% (rat), which corresponds almost exactly to the percentage of paralogous in the corresponding genomes. The number of paralogous among the GO genes is higher (around the 40%) which, again, corresponds to the percentage of paralogous within the GO categories. Given that functional annotations could be transferred by similarity [20], an artifactual accumulation of identical GO terms could be observed in this scenario. To discard this possibility we performed two different analyses. Firstly, we examined whether functional neighborhoods are enriched with segmental duplications. For every species for which appropriate information is available (see Materials and Methods section), a binomial test was used to determine if the number of segmental duplications within functional clusters is larger than what be expected according to their size. Our results allow rejecting the hypothesis that recent segmental duplications are contributing to the formation of functional clusters (Table 3: p-values always n.s.). Still, the possibility remains that the clusters we observe are the result of ancient duplication events that have diverged too much to be detected as such. In that scenario, different paralogous copies of a gene may still be similar at the protein level and form functionally related gene families. Our second analysis accounts for the effect of ancient duplications by examining the distributions of the average number of BLASTP [28] hits within the regions containing functional clusters and comparing them to the genome-wide background (that is the rest of equivalent chromosomal regions). When compared, both distributions of BLASTP hits were indistinguishable for all the studied organisms (See Figure S3 and Table S1). Thus, our general multispecies analysis demonstrates that the emergence of new genes by tandem duplications cannot be the general explanation behind the origin of functional neighborhoods.

Functional neighborhoods shared by clades are not composed of ortholog genes

Surprisingly, the genes found in functional neighborhoods shared by different organisms are not necessarily orthologous (see Table 2). That is, when two species share functional neighborhoods, the genes forming these clusters may be different in each species. One might expect that if such functional neighborhoods emerged in a particular period of the evolution and apparently were maintained since then (given that they are shared by all the descendant species), these clusters were essentially composed by ortholog genes. Nevertheless, this is not the case.

Table 2 shows three columns for each organism. The first one is the number of genes annotated with the GO functional categories shared by the different relevant clades (mammalians, vertebrates and animals), the second one is the proportion of such genes that were found in functional clusters where this particular GO category was significantly over-represented and the third column is the proportion of such genes with a human ortholog. The two most important observations that can be made from Table 2 are: i) the proportion of genes in functional neighborhoods in each functional category tend to be approximately constant across taxa (with a few exceptions). For example, a large proportion of genes belonging to sensory perception cluster in neighborhoods (over 20%, except in the case of Drosophila) across the species in Table 2 while behavior genes keep their proportions approximately between 10 and 20%. The results found in Gallus gallus and Danio rerio are less conclusive probably because of the preliminary of the functional annotation. And ii) the genes found in the shared functional neighborhoods in different organisms do not have a relationship of orthology. That is, the proportion of ortholog genes with respect to their human counterparts is significantly lower than expected from an evolutionary event in which groups of functionally related genes gathered in the genome and were subsequently maintained along evolution. The presence in the functional clusters of mammals of a significantly high number of repetitive elements (SINE), which are known to be involved in rearrangement processes [29], [30], suggest that such regions may be undergoing a continuous process of rearrangement and selection is ultimately favoring the presence of genes belonging to the functional categories required by the organisms. In fact we observed a significant enrichment in SINEs in the functional regions of human (p<0.0001), mouse (p = 0.0057) and rat (p = 0.0002). From this point of view, a number of functional categories would require to have a minimum number of genes clustered together in the genome for optimal transcriptional activity, but not necessarily the same set of ortholog genes. Our findings actually suggest that it is the fraction of genes of a given function, and not the particular genes, which is relevant from the point of view of the transcriptional efficiency. This is in agreement with previous suggestions of other authors about the existence of a functional component reflected in the physical proximity of the genes that would be favoring their simultaneous co-expression [3], [4], [7]. This observation is also compatible with a dynamic scenario in which function, understood as a system of genes at work, rather than a particular static set of orthologous genes, is the target of natural selection [31].

Functional neighborhoods shared between species are significantly enriched with breaks of synteny

The comparative study of synteny conservation can throw some light on a scenario in which phylogenetically consistent functional neighborhoods composed by non orthologous genes occur. Synteny data are available for a number of species, but since the highest quality information has been obtained for the human-chimpanzee synteny relations [23], [32], [33], we have focused in functional neighborhoods shared between these two species. Humans and chimpanzees are separated by 10 major chromosomal rearrangements [34] and many small ones that imply many breaks of synteny between the two species. We observed that, as an average, functional neighborhoods shared between these species are significantly enriched with such breaks of synteny (Table 4). This is another surprising result: not only functional clusters are not particularly conserved, but they seem to be highly reorganized. These clusters appear to be enriched with rearrangement breakpoints relative to the rest of the genome (for example, using the synteny information from the Newman et al. Dataset [23] this means ∼0.15 Breakpoints/Mb in neighborhoods vs. ∼0.09 Bkp/Mb in the rest of the genome, Chi-square test, p-value = 4.2×10−6, see Table 4). This renders further support to the idea that there are strong selective pressures that maintain a minim number of genes with certain functions within clusters and is consistent with the observation reported above of clusters shared between species that, in spite of having the same functions, do not share the same ortholog genes.

Moreover, when functional neighborhoods are classified according to the percentage of orthologous genes they contain, highly orthologous neighborhoods present significantly less synteny breaks than low-orthology neighborhoods (∼0.1 vs. ∼0.2 Bkps/Mb, p-value = 0.000231 in the Newman's Dataset [23]) and a synteny conservation that is similar, or even stronger, than the genome average. The situation is the opposite in low orthology neighborhoods (see Table 5 and Table 3). In both cases functional neighborhoods present an internal level of coexpression that is significantly higher than the level observed in genes belonging to the same functional categories when dispersed across the genome (high orthology p-value = 1.197×10−21 low orthology p-value = 1.696×10−17; see also Table 1), defining in this way functional, coexpression neighborhoods. The fact that the degree of coexpression is lower in low orthology neighborhoods than in the case of high orthology ones would be compatible with a dynamic scenario of continuous reconstitution of low orthology domains where the expression process was not fully optimized yet. Without entering in the detail on where the conservation of the neighborhoods came from, the observation that genes with altered neighborhood are more likely to undergo expression divergence than genes with conserved neighborhood was already made [35]. This scenario has also some similitude to the one proposed by Poyatos and Hurst for yeast [16], in which selection for high levels of co-expression would correlate with high levels of recombination rates, which, in turn mean high levels of chromosomal rearrangement and increase the probability of breakage of the co-expressed cluster. It has also been observed that co-expression between adjacent genes is positively correlated with the probability that those genes would be apart in the genome of a different species [36]. The fact that, in our case, highly orthologous clusters present the highest co-expression levels and lowest rearrangement rates suggests, however, a different cycle: cluster would contain many rearrangement breakpoints because natural selection would favor the reconstruction of clusters via chromosomal reorganization. In addition, this is consistent with the fact that the rearrangement breakpoints tend to reduce and not to increase recombination while segregating in a population [37].

Discussion

Results presented here demonstrate that a large fraction of the genome is arranged in neighborhoods of functionally related genes that are not the result of tandem duplications but of reorganization. Coexpression has systematically observed to occur within functionally related genes defining the functional neighborhoods. The fact that functions shared across species analyzed is compatible with the evolutionary pattern of speciation constitutes strong evidence in favor of the existence of a selective force that produced and maintained the observed functional neighborhoods, even if different sets of genes make them up in different species. Moreover, in an apparent paradox, functional neighborhoods, which in one hand are conserved across evolution, appear to be enriched with rearrangement breakpoints when compared to the rest of the genome. Both observations suggest that selection is operating at the level of functional neighborhoods, no matter their particular genic composition. In this scenario, when a functional neighborhood is broken by a chromosomal rearrangement, selection would favor new rearrangements that tended to reconstitute a neighborhood with the same function, although the gene composition may differ from the ancestral one.

Actually, the number of functional neighborhoods found constitutes, most probably, an underestimation of its real number because of two facts: i) the testing scheme used is conservative and ii) this study considers only neighborhoods collinear in the chromosomes but no spatial neighborhoods formed by the tridimensional conformation of the nucleus. The real spatial conformation of the nucleus is still unknown but new data are continuously arising [38], [39] and the relationship of physical proximity with gene expression [40] and their possible functional implications [41] are becoming increasingly clear. As new information is available this extreme will be studied in more detail.

Materials and Methods

Data

The genomes of Homo sapiens, Mus musculus, Pan troglodytes, Rattus norvegicus, Gallus gallus, Danio rerio, Drosophila melanogaster, Caenorhabditis elegans, were taken from Ensembl [42] and the genome of Arabidopsis thaliana was obtained from AtEnsembl, (release 29, http://atensembl.arabidopsis.info). All the microarrays were chosen to represent conditions as normal and as non-pathological as possible. The following datasets, taken from the ArrayExpress database (http://www.ebi.ac.uk/microarray-as/ae/), were used: Human: E-AFMX-5; Mouse: E-AFMX-4; Fish: E-TABM-33; Fly E-MEXP-127, E-MEXP-152, E-MEXP-202, E-MEXP-493, E-MEXP-88; Worm:E-SMDB-1398, E-SMDB-3540, E-SMDB-3539, E-SMDB-3592 and Plant: E-TABM-17. No comparable data were found for chicken and, thus, it was excluded from the analysis of expression data. More information on the data used and the results obtained can be found at: http://bioinfo.cipf.es/publications/additionaldata/functional_clusters.

Synteny analysis

The analysis of density of breakpoints in windows of functional enrichments was performed using the breaks of synteny between Humans and Chimpanzees from Newman et al [43] and Feuk et al [32]. For the later, only the set of rearrangements >25Kb was used. The first dataset (Newman's dataset), was built blasting Fosmid pair-end sequences into the human genome, so it does not depend on the quality of the chimpanzee assembly. The second dataset (Feuk's dataset), was constructed comparing the order of genes between assemblies of the two species, and thus, it is likely to be affected by the lower quality of the assembly of the chimpanzee genome available at the time of the publication of the paper by Feuk et al [32].

Sliding windows approach

All the chromosomes of the studied species were scanned by means of a sliding window. In order to be compliant with previous studies [1] window size was adjusted in each species to contain, on average, approximately 50 genes (see Table S2). The windows are moved along all chromosomes in steps of half a window. A conventional method of functional enrichment implemented in the FatiGO program [21], which is part of the Babelomics (http://www.babelomics.org) suite for functional analysis [44], [45], was used to study the significant over-representation of GO terms in each window. Briefly, the method builds a 2×2 contingency table for each functional term checked for each window and applies an exact Fisher's test. The p-values obtained for all the windows were FDR-corrected [46] taking into account all the tests conducted in all the organisms. Figure S4 shows a schema of the procedure followed to detect functional neighborhoods.

Testing for duplication events

Available data on segmental duplications were downloaded from the Eichler Lab databases (http://eichlerlab.gs.washington.edu/database.html). To avoid coordinate translation biases, only species for which the segmental duplications and the gene annotation assemblies were concordant were used. Thus, the segmental duplication analysis involves only human, chimpanzee, chicken and mouse. The proportion of segmental duplications contained in windows containing functional neighborhoods was compared to the proportion of segmental duplications in windows without clusters (i.e. in the rest of the genome) after removing from analysis ambiguously located segmental duplications. A binomial test was used to determine whether the number of segmental duplications inside windows with functional clusters was larger than expected under the null hypothesis of random distribution of segmental duplications with respect to functional neighborhoods.

A further way to infer the number of recent and ancient duplication events in a window is using the number of BLASTP [28] hits that any of the genes contained in it produces when searched against a all the genes within the region. A region including only single-copy genes not belonging to a gene family and no ancient or new tandem duplications will theoretically produce only one hit per gene (the gene against itself). A region consisting of a group of genes amplified N times will produce N BLASTP hits per gene. Situations in between these extremes will produce more than one BLASTP hit for some genes. For each window we constructed a BLASTP database with the corresponding proteins. Then, all the proteins corresponding to the genes in the window were blasted (using BLASTP) against the corresponding database, and the total number of BLASTP hits with a percentage of similarity over a threshold of T% was normalized and recorded. Values of 98% and 95% were used as thresholds. Again, for each organism, windows containing functional neighborhoods were compared to a background consisting of the rest of windows without significant functional neighborhoods inside.

Coexpression analysis

For each GO term the pairwise Pearson correlations among genes located within the functional neighborhood is compared to the corresponding correlations among the rest of genes not located in the neighborhood, by means of a Kolmogorov-Smirnov test.