Abstract

BACKGROUND:

The neuronal synapse is a fundamental functional unit in the central nervous system of animals. Because synaptic function is evolutionarily conserved, we reasoned that functional sequences of genes and related genomic elements known to play important roles in neurotransmitter release would also be conserved.

RESULTS:

Evolutionary rate analysis revealed that presynaptic proteins evolve slowly, although some members of large gene families exhibit accelerated evolutionary rates relative to other family members. Comparative sequence analysis of 46 megabases spanning 150 presynaptic genes identified more than 26,000 elements that are highly conserved in eight vertebrate species, as well as a small subset of sequences (6%) that are shared among unrelated presynaptic genes. Analysis of large gene families revealed that upstream and intronic regions of closely related family members are extremely divergent. We also identified 504 exceptionally long conserved elements (≥360 base pairs, ≥80% pair-wise identity between human and other mammals) in intergenic and intronic regions of presynaptic genes. Many of these elements form a highly stable stem-loop RNA structure and consequently are candidates for novel regulatory elements, whereas some conserved noncoding elements are shown to correlate with specific gene expression profiles. The SynapseDB online database integrates these findings and other functional genomic resources for synaptic genes.

CONCLUSION:

Highly conserved elements in non-protein-coding regions of 150 presynaptic genes represent sequences that may be involved in the transcriptional or post-transcriptional regulation of these genes. Furthermore, comparative sequence analysis will facilitate selection of genes and noncoding sequences for future functional studies and for studies of sequence variation in neurodevelopmental and psychiatric disorders.

Evolutionary analysis of proteins involved in synaptic transmission. (a) The empirical cumulative distribution of protein evolutionary rate, measured by dN/dS, was calculated for human-mouse orthologs. Data for 139 human-mouse orthologs of mainly presynaptic genes are shown in red, whereas a comprehensive survey of more than 15,000 human-mouse ortholog pairs is shown in black. (b) The distribution of dN/dS calculated for human-mouse orthologs was grouped by gene family. All family members are shown in red and extreme members outside the whiskers are labeled. Black boxes showing the 25% quantile, the median, and the 75% quantile are superimposed, and whiskers extend to the most extreme data point that is no more than the interquartile range in both directions from the median in the box. (c) The distribution of dN/dS calculated for human-rat orthologs was grouped by gene family. (d) The distribution of dN/dS calculated for mouse-rat orthologs was grouped by gene family. dN, nonsynonymous rate of change; dS, synonymous rate of change.
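The empirical cumulative distribution in panel (a) can be illustrated with a minimal sketch. The function name `ecdf` and the dN/dS values below are hypothetical and chosen only for illustration; they are not data from this study.

```python
from bisect import bisect_right

def ecdf(values):
    """Return F, where F(x) is the fraction of observations <= x."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one observation is required")

    def F(x):
        # bisect_right counts how many sorted observations are <= x
        return bisect_right(ordered, x) / n

    return F

# Hypothetical dN/dS values for a handful of ortholog pairs (not real data)
synaptic_rates = [0.02, 0.05, 0.05, 0.11]
F_synaptic = ecdf(synaptic_rates)
fraction_at_or_below = F_synaptic(0.05)
```

Plotting such a function for the synaptic set alongside the same function for the genome-wide set would give the two curves described for panel (a); a curve shifted toward lower dN/dS indicates slower protein evolution.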

SYT protein trees with superimposed expression profiles. (a) The SYT1-SYT2-SYT5 clade of the SYT protein tree is shown for human and mouse orthologs with the expression profile for human genes superimposed. (b) Two closely related paralogs of the SYT family (SYT4 and SYT11) are shown with superimposed expression profiles.

Comparative analysis of presynaptic genes. (a) Gene names from SynapseDB were used to query RefSeq and ENSEMBL transcript annotations, which were then clustered into gene models defined as groups of overlapping transcripts in the same orientation. The region around the synaptic gene model was extended up to the next annotated upstream and downstream gene models to define gROIs. MCEs were selected and characterized based on their relative genic position into exon-associated and non-exon-associated elements. Exon-associated elements were further subdivided into those that are completely exonic, those that are partially exonic and span exon-intron boundaries, and those associated with UTRs, whereas non-exon-associated elements were divided into those that are intergenic and those that are intronic. (b) Individual bases were annotated as CDS, UTR sequence (UTR), intronic (intron), or intergenic (inter) based on gene model annotations. The coverage of MCEs (the proportion of most conserved bases in a gROI) across different annotations is shown. (c) The composition of MCEs (the proportion of MCEs with a given annotation) across CDS, UTR, intronic, and intergenic annotations is shown. CDS, coding sequence; gROI, genomic region of interest; MCE, most conserved element; UTR, untranslated region.
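The gene model definition in panel (a), groups of overlapping transcripts in the same orientation, can be sketched as interval clustering per chromosome and strand. This is a minimal illustration, not the authors' pipeline: the `Transcript` type, the `cluster_gene_models` function, the half-open coordinates, and the use of whole transcript spans (rather than exons) to decide overlap are all assumptions.

```python
from collections import defaultdict
from typing import NamedTuple

class Transcript(NamedTuple):
    name: str
    chrom: str
    strand: str   # "+" or "-"
    start: int    # 0-based, inclusive (assumed convention)
    end: int      # exclusive

def cluster_gene_models(transcripts):
    """Group transcripts whose spans overlap on the same chromosome and strand."""
    by_key = defaultdict(list)
    for t in transcripts:
        by_key[(t.chrom, t.strand)].append(t)

    models = []
    for key in sorted(by_key):
        current, current_end = [], None
        for t in sorted(by_key[key], key=lambda t: (t.start, t.end)):
            if current and t.start < current_end:
                # Overlaps the growing model: extend it
                current.append(t)
                current_end = max(current_end, t.end)
            else:
                # No overlap: close the previous model and start a new one
                if current:
                    models.append(current)
                current, current_end = [t], t.end
        if current:
            models.append(current)
    return models

# Hypothetical transcripts for illustration only
example = [
    Transcript("A", "chr1", "+", 100, 200),
    Transcript("B", "chr1", "+", 150, 300),
    Transcript("C", "chr1", "-", 150, 250),
    Transcript("D", "chr1", "+", 400, 500),
]
model_names = [[t.name for t in m] for m in cluster_gene_models(example)]
```

In this example, A and B merge into one gene model, D forms its own, and C stays separate because it lies on the opposite strand even though its coordinates overlap A and B.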

Duplicated most conserved elements. (a) A schematic illustration of three classes of dMCEs in a hypothetical two-exon gene is shown. The blue rectangles represent exons of three different two-exon genes, and the red arrows represent the relationship between pairs of duplicated MCEs relative to their gROIs. GeneA1 and GeneA2 are paralogs in the same gene family, whereas GeneB represents an unrelated gene. The figure shows a local dMCE pair in the same gROI upstream from GeneA1, an intronic pair of dMCE elements between the paralogous gROIs of GeneA1 and GeneA2, and an intergenic pair of dMCE elements downstream of the unrelated genes GeneA2 and GeneB. (b) An example of a dMCE pair between the unrelated genes CAST1 (chromosome 3) and SNAP25 (chromosome 20) is shown. The pair involves an element in the first intron of CAST1(.789) and an element in the last intron of SNAP25(.157). Orthologous species shown in the alignments include chimpanzee (Pan troglodytes [pt]), dog (Canis familiaris [cf]), mouse (Mus musculus [mm]), rat (Rattus norvegicus [rn]), chicken (Gallus gallus [gg]), and zebrafish (Danio rerio [dr]). Both elements are conserved in mammals, and the SNAP25 element exhibits conservation in chicken and zebrafish. Both genes related to these elements exhibit increased expression in brain tissues and reduced expression in immune tissues and cell types. Both genes also show increased expression in hippocampus and throughout the cortex, although they differ in cerebellum expression, as shown by in situ expression patterns courtesy of the Allen Brain Atlas [19]. dMCE, duplicated most conserved element; gROI, genomic region of interest.

Analysis of coexpressed sets of genes across human tissues and cell lines. The figure shows five clusters of genes with distinct expression profiles from the Genomics Institute of the Novartis Research Foundation SymAtlas [17]: (a) transcripts with widespread and low-level expression in most tissues/cell types; (b) transcripts expressed in brain and immune tissues and cell types but under-expressed in other tissues; (c) transcripts with enriched expression in brain tissues and low levels of expression in other tissues; (d) transcripts or splice forms enriched in hematopoietic derived immune cell types; and (e) transcripts or splice forms under-expressed in immune tissues and cell types. The tables to the right of each expression cluster show the five most enriched TFBSs found in that cluster, and list the TFBS name, the observed number of hits of that TFBS in intergenic and intronic MCEs, the fold increase over that expected by chance, and the significance of enrichment in the cluster. Available PWM logos for all significantly enriched TFBSs (P < 0.05) are also displayed. MCE, most conserved element; PWM, positional weight matrix; TFBS, transcription factor binding site.
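The table columns described above (observed hits, fold increase over chance, significance) can be illustrated with a short sketch. The legend does not state which statistical test was used, so the one-sided binomial tail below is an assumption made purely for illustration, as are the function names and the example numbers.

```python
from math import comb

def fold_enrichment(observed, expected):
    """Ratio of observed TFBS hits to the number expected by chance."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); an assumed stand-in for the significance test."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical numbers: 12 hits among 200 scanned MCEs, background hit rate 2%
n_elements, background_rate, observed_hits = 200, 0.02, 12
expected_hits = n_elements * background_rate
fold = fold_enrichment(observed_hits, expected_hits)
p_value = binomial_upper_tail(observed_hits, n_elements, background_rate)
enriched = p_value < 0.05  # same threshold as quoted in the legend
```

A TFBS would be reported as significantly enriched in a cluster when its P value falls below the 0.05 threshold quoted in the legend; a hypergeometric or permutation test would be an equally reasonable choice under different background assumptions.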

Conservation of large most conserved elements across species. (a) The red data points show conservation in the LMCEs (defined as MCEs ≥360 base pairs) plotted as pair-wise identities across all species in the underlying seven-way vertebrate whole-genome alignments with human. The blue lines indicate the mean and standard errors of the mean for each species relative to human. Orthologous species include chimpanzee (Pan troglodytes [pt]), dog (Canis familiaris [cf]), mouse (Mus musculus [mm]), rat (Rattus norvegicus [rn]), chicken (Gallus gallus [gg]), zebrafish (Danio rerio [dr]), and pufferfish (Fugu rubripes [fr]). (b) Species are plotted against the proportion of total LMCE length showing homologous sequence devoid of insertions or deletions in the underlying whole-genome multiple alignment. LMCE, large most conserved element.
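The two quantities plotted in this figure, pair-wise identity with human and the proportion of ungapped aligned sequence, can be sketched from a pair of aligned rows. The exact denominators used in the study are not stated here, so the definitions below (identity over human bases, ungapped fraction over alignment columns) are assumptions, as are the function names and toy sequences. The `is_lmce` thresholds follow the ≥360 base pair and ≥80% identity criteria given in the abstract.

```python
def pairwise_identity(human, other):
    """Fraction of human bases matched by the aligned base in the other species."""
    if len(human) != len(other):
        raise ValueError("aligned rows must have equal length")
    columns = [(h, o) for h, o in zip(human.upper(), other.upper()) if h != "-"]
    if not columns:
        return 0.0
    return sum(1 for h, o in columns if h == o) / len(columns)

def ungapped_fraction(human, other):
    """Fraction of alignment columns with a base (no gap) in both rows."""
    if len(human) != len(other):
        raise ValueError("aligned rows must have equal length")
    if not human:
        return 0.0
    return sum(1 for h, o in zip(human, other) if h != "-" and o != "-") / len(human)

def is_lmce(length_bp, mammal_identities, min_length=360, min_identity=0.80):
    """Apply the length and identity thresholds to one candidate element."""
    return length_bp >= min_length and all(
        identity >= min_identity for identity in mammal_identities
    )

# Toy aligned rows for illustration only ("-" marks an alignment gap)
identity = pairwise_identity("ACGT-A", "ACCTTA")
ungapped = ungapped_fraction("ACGT-A", "AC-TTA")
```

Requiring every mammalian comparison to pass the identity threshold is one reading of "between human and other mammals"; a mean-based criterion would be an alternative.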

Evidence for transcription and RNA stability in LMCEs. A complete representation of the positions and analysis of MCEs within all 46 megabases analyzed is available via custom tracks in the UCSC Genome Browser [26] through supplemental data. (a) A view depicting a transcribed LMCE upstream from EXOC4(.84), identified by both DoTS and tiling array data, is shown. The LMCE is shown by the red track toward the top and is highly conserved to zebrafish and pufferfish, whereas DoTS transcripts from clustered mRNA and EST sequences are shown as the next brown track below, and evidence for significant transcription by tiling array data is shown by blue bars on the next track down. (b) Expression patterns of LMCEs across tissues were compared with patterns obtained for RT-PCR products generated by priming at exons upstream and downstream of the LMCE. The PCR products were visualized by gel electrophoresis to show similar patterns of expression to the nearby genes. Shown are elements upstream from CAST(.694), downstream from RAB3C(.306) in the neighboring PDE4D gene, and in an internal intron of NBEA(.708). (c) A view depicting an LMCE with significant stable RNA secondary structure spanning alternatively spliced exons of SNAP25(.159) is shown along with the optimal minimum free energy RNA structure. Intronic portions of this LMCE are also highly conserved to zebrafish and pufferfish. DoTS, Database of Transcribed Sequences; EST, expressed sequence tag; LMCE, large most conserved element; RT-PCR, reverse transcription polymerase chain reaction.