Ensembl Glossary

Glossary

Accession number

A unique identifier given to a sequence when it is submitted to one of the DNA repositories (GenBank, EMBL, DDBJ).

agp file (A Golden Path)

A file provided to Ensembl that describes how the longer sequences in the genome assembly were assembled from shorter sequences. For example, an AGP file can describe how a chromosome is assembled from a collection of scaffolds or a collection of contigs. For an AGP file that describes how a scaffold is assembled from a collection of contigs, each contig will be listed on a separate line in the AGP file and the line will include information about where the contig lies within the scaffold and the orientation of the contig.
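
As a rough sketch, a line of the kind described above could be read as follows. The column order follows the public AGP specification (tab-separated, with component type in column 5 and gap lines marked N or U). The names AgpComponent and parse_agp_line, and the scaffold and contig names in the usage note, are hypothetical and not part of any Ensembl code.

```python
from typing import NamedTuple, Optional


class AgpComponent(NamedTuple):
    obj: str           # assembled object, e.g. a scaffold
    obj_start: int     # 1-based start of the component within the object
    obj_end: int       # 1-based inclusive end within the object
    component_id: str  # e.g. a contig identifier
    comp_start: int    # start within the component itself
    comp_end: int      # end within the component itself
    orientation: str   # "+" or "-" (other values exist in the specification)


def parse_agp_line(line: str) -> Optional[AgpComponent]:
    """Parse one AGP line; return None for blank, comment and gap lines."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if cols[4] in ("N", "U"):  # gap lines describe no component
        return None
    return AgpComponent(cols[0], int(cols[1]), int(cols[2]),
                        cols[5], int(cols[6]), int(cols[7]), cols[8])
```

For instance, parse_agp_line("scaffold_1\t1\t5000\t1\tW\tcontig_A\t1\t5000\t+") would report that the made-up contig_A covers positions 1 to 5000 of scaffold_1 in forward orientation.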

Algorithm

A sequence of computational tasks or actions that carry out a specific function.

Alignment

A comparison between two or more sequences by matching identical and/or similar residues and assigning a score to the match.

Allele

An allele is an alternative form of a nucleotide sequence, a gene or a locus
in the genome. The term was originally used to describe variation
among protein coding genes, but it also refers to variation among non-coding
genes or DNA sequences.

Alternate sequence

Genomic sequence that differs from the genomic DNA on the primary assembly. Alternate sequences come in two types: allelic sequence (haplotypes and novel patches) and fix patches. Novel patches represent new allelic loci, but they are not necessarily haplotypes. Fix patches mark places where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence. Haplotypes, novel patches and fix patches are all determined by the GRC, not by Ensembl. When using the API, the primary assembly is referred to as reference sequence and alternate sequence is referred to as non-reference sequence.

Alu

A dispersed, intermediately repetitive DNA sequence found in the human genome in about one million copies. The sequence is about 300 bp long and is commonly found in introns, 3' untranslated regions of genes, and intergenic genomic regions. The name Alu comes from the recognition site for the AluI endonuclease that cleaves it. The Alu universal primer sequence is as follows: 5'-GTG GAT CAC CTG AGG TCA GGA GTT TC-3' (26-mer).

ambiguity code

The standard ambiguity codes for nucleotides are provided by IUPAC (International Union of Pure and Applied Chemistry) and indicate the possible nucleotides that can occur at a given position. The symbols are valid for both DNA and RNA and are shown below:

A = adenine

C = cytosine

G = guanine

T = thymine

R = G A (purine)

Y = T C (pyrimidine)

K = G T (keto)

M = A C (amino)

S = G C (strong bonds)

W = A T (weak bonds)

B = G T C (all but A)

D = G A T (all but C)

H = A C T (all but G)

V = G C A (all but T)

N = A G C T (any)
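
The table above can be put to work as a simple lookup. The following is a minimal sketch, assuming DNA input only; the names IUPAC and matches are illustrative and do not come from any Ensembl library.

```python
# Each ambiguity code maps to the set of bases it permits (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}


def matches(pattern: str, sequence: str) -> bool:
    """True if every base of sequence is allowed by the code at that position."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC[code]
               for code, base in zip(pattern.upper(), sequence.upper()))
```

With this sketch, the pattern RYN accepts ATG (A is a purine, T is a pyrimidine, N is any base) but rejects CTG, because C is not a purine.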

Ambiguous ORF

Ambiguous Open Reading Frame. A non-coding transcript believed to be protein coding, with more than one possible ORF.

Antisense

Non-coding transcript believed to be an antisense product used in the regulation of the gene to which it belongs.

API (Application Programming Interface)

A series of routines that applications can use to make the operating system request and carry out lower-level services.

Artifact (in the context of a transcript)

Error in the sequence in a public database (for example UniProtKB, NCBI RefSeq). Annotation is by the VEGA/Havana project.

Assembly

The genomic assembly refers to the complete genome of an organism, after fragmentation for sequencing experiments and then reassembly. Ensembl imports genomic assemblies from sources listed on the home page of each species.

When the genome of a species is to be sequenced, the chromosomes from many cells are broken at random positions into small fragments, which are sequenced, and reassembled into long sequences (contigs). Alternatively, clones representing genomic regions are sequenced and strung together to form the genomic assembly.

ATV (A Tree Viewer)

An application (Java tool) for the visualisation of phylogenetic trees. It also allows data to be edited and exported. See Zmasek et al.

BAC (Bacterial Artificial Chromosome)

A vector used to clone DNA fragments (100 to 300-kb insert size; average, 150 kb) from another species so that it can be replicated in bacteria.

Base pairs (number of base pairs in the genome)

The base pairs length on pages such as the whole genome display (next to the golden path length) is based on the assembled end position of the last seq_region in each chromosome (from the AGP file), or if there is a terminal gap it is set to the assembled end location of that terminal gap.

BLOSUM 62

A matrix that defines scores for amino acid substitutions, reflecting the similarity of physicochemical properties and observed substitution frequencies. The BLOSUM 62 matrix is tailored using sequences sharing no more than 62% identity (sequences that are evolutionarily closer were represented by a single sequence in the alignment, to avoid bias from closely related family members). (Henikoff and Henikoff, Proc Natl Acad Sci U S A 89:10915-10919; 1992).

Canonical transcript

For human, the canonical transcript for a gene is set according to the following hierarchy:
1. Longest CCDS translation with no stop codons.
2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons.
3. If no (2), choose the longest translation with no stop codons.
4. If no translation, choose the longest non-protein-coding transcript.
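
The hierarchy above can be read as a series of fallbacks. The sketch below is one possible reading and not the Ensembl implementation: the Transcript fields and the pick_canonical function are hypothetical, and it assumes that a gene whose translations all contain stop codons falls through to step 4.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    name: str
    length: int                    # transcript length in bp
    translation_length: int = 0    # 0 means the transcript has no translation
    has_stop_codons: bool = False  # stop codons inside the translation
    is_ccds: bool = False
    is_merged: bool = False        # Ensembl/Havana merged transcript


def pick_canonical(transcripts):
    """Return the canonical transcript, or None if the list is empty."""
    clean = [t for t in transcripts
             if t.translation_length > 0 and not t.has_stop_codons]
    tiers = [
        [t for t in clean if t.is_ccds],    # step 1
        [t for t in clean if t.is_merged],  # step 2
        clean,                              # step 3
    ]
    for tier in tiers:
        if tier:
            return max(tier, key=lambda t: t.translation_length)
    # Step 4: no usable translation, so take the longest non-coding transcript.
    noncoding = [t for t in transcripts if t.translation_length == 0]
    return max(noncoding, key=lambda t: t.length) if noncoding else None
```

Under these assumptions a shorter CCDS translation is preferred over a longer merged one, because the tiers are checked in order before length is compared.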

CCDS

A coding sequence in the Consensus Coding Sequence Set is consistently annotated between Ensembl, Vega, UCSC and NCBI. The long term goal is to support convergence towards a standard set of gene annotations on the human genome.

cDNA (Complementary DNA)

DNA obtained by reverse transcription of a mRNA template. In bioinformatics jargon, cDNA is thought of as a DNA version of the mRNA sequence. Generally, cDNAs are denoted in coding or 'sense' orientation.

CDS (Coding sequence)

The portion of a gene or an mRNA that codes for a protein. Introns are not coding sequences, nor are the 5' or 3' UTR. The coding sequence in a cDNA or mature mRNA includes everything from the start codon through to the stop codon, inclusive.

Centimorgan (cM)

A unit of genetic distance, determined by how frequently two genes on the same chromosome are inherited together. One centimorgan equals 1% recombinant offspring. In humans, 1 cM is about 1 x 10^6 bp.

Chr:bp

The chromosome location and coordinates in base pairs.

CIGAR (Compact Idiosyncratic Gapped Alignment Report)

Defines the sequence of matches/mismatches and deletions (or gaps) in an alignment. For example, the cigar line 2MD3M2D2M means that the alignment contains 2 matches/mismatches, 1 deletion (the number 1 is omitted to save space), 3 matches/mismatches, 2 deletions and 2 matches/mismatches.
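
As a rough illustration, the cigar line 2MD3M2D2M can be applied to an ungapped sequence to recover the gapped, aligned form. The function name apply_cigar and the seven-base sequence used in the usage note are assumptions chosen for this sketch; only the M and D operations described in this entry are handled.

```python
import re


def apply_cigar(sequence: str, cigar: str) -> str:
    """Insert '-' gap characters into sequence as directed by a cigar line."""
    aligned = []
    pos = 0
    for count, op in re.findall(r"(\d*)([MD])", cigar):
        n = int(count) if count else 1  # an omitted number means 1
        if op == "M":                   # match/mismatch: copy bases through
            aligned.append(sequence[pos:pos + n])
            pos += n
        else:                           # "D": deletion, shown as gaps
            aligned.append("-" * n)
    if pos != len(sequence):
        raise ValueError("cigar line does not cover the whole sequence")
    return "".join(aligned)
```

For the illustrative sequence AACGCTT, apply_cigar("AACGCTT", "2MD3M2D2M") returns AA-CGC--TT: two bases, one gap, three bases, two gaps, two bases.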

Clinical significance

Clinical significance is reported to dbSNP by the submitter. Variants from OMIM may have the value probable-pathogenic. Other assignations are unknown, untested, non-pathogenic, probable-non-pathogenic, drug-response, histocompatibility, and other.

Clone

A segment of DNA that has been inserted into a vector molecule, such as a plasmid, and then replicated to form many identical copies.

codon

Three base pairs in either DNA or RNA that code for an amino acid (or stop translation).

Contig

A contig is a contiguous stretch of DNA sequence without gaps that has been assembled solely based on direct sequencing information.

Short sequences (reads) from a fragmented genome are compared against one another, and overlapping reads are merged to produce one long sequence. This merging process is iterative: overlapping reads are added to the merged sequence whenever possible and so the merged sequence becomes even longer. When no further reads overlap the long merged sequence, then this sequence - called a contig - has reached its maximum length.

Contig can be used in other contexts: A contig can be the sequence corresponding to only one clone. A contig map shows the regions of a chromosome where contiguous DNA segments overlap.

Cosmid

DNA from a bacterial virus spliced with a small fragment of a genome (up to 50 kb) to be amplified and sequenced.

Coverage

Refers to the number of overlapping sequences used to build a region of the assembly. High coverage indicates a good amount of sequence information while low coverage reflects a low amount of sequence information.

cytogenetic map

A banding pattern on a chromosome resulting from staining and examination by microscopy. Cytogenetic abnormalities such as deletions or inverted nucleotide sequences may be detected by examining and comparing banding patterns.

DAS (Distributed Annotation System)

A protocol for requesting and returning annotation data for genomic regions. See the
BioDAS site for more information.

DDBJ (DNA Data Bank of Japan)

DDBJ is the sole DNA data bank in Japan, which is officially certified to collect DNA sequences from researchers and to issue the internationally recognized accession number to data submitters. Data is exchanged with EMBL/EBI and GenBank/NCBI on a daily basis, and the three data banks share virtually the same data at any given time.

Disrupted domain (in the context of a transcript)

Coding region omitted due to a splice variation. Annotation is by the VEGA/Havana project.

Domain

A region of special biological interest within a single protein sequence. However, a domain may also be defined as a region within the three-dimensional structure of a protein that may encompass regions of several distinct protein sequences that accomplishes a specific function. A domain class is a group of domains that share a common set of well-defined properties or characteristics.

Dotter

Ensembl DotterView is based on the program Dotter, a dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. The Dotter tool provides a visual display of the sequence alignment it represents. The dotplot displays a detailed comparison of two sequences. Every residue in one sequence is compared to every residue in the other sequence. The first sequence runs along the x-axis and the second sequence along the y-axis. In regions where the two sequences are similar to each other, a row of high scores will run diagonally across the dot matrix. If you're comparing a sequence against itself to find internal repeats, you'll notice that the main diagonal scores maximally, since it's the 100% perfect self-match. To make the score matrix more intelligible, the pairwise scores are averaged over a sliding window that runs diagonally. The averaged score matrix forms a three-dimensional landscape, with the two sequences in two dimensions and the height of the peaks in the third. This landscape is projected onto two dimensions with the aid of grayscales: higher peaks are indicated by darker grays. Dotter was written by Erik L.L. Sonnhammer and Richard Durbin (Gene 167: GC1-10; 1995).
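
The idea of averaging pairwise scores over a diagonal sliding window can be sketched in a few lines. This is a simplified illustration and not Dotter's own algorithm: it scores plain identity (1 or 0) rather than using a substitution matrix, and the function name dot_matrix and the default window size are assumptions.

```python
def dot_matrix(seq_x: str, seq_y: str, window: int = 3):
    """Return matrix[y][x]: mean identity over a diagonal window at (x, y)."""
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    matrix = []
    for y in range(rows):
        row = []
        for x in range(cols):
            # Walk down the diagonal starting at (x, y) and count identities.
            hits = sum(seq_x[x + k] == seq_y[y + k] for k in range(window))
            row.append(hits / window)
        matrix.append(row)
    return matrix
```

Comparing ACGTACGT against itself gives 1.0 along the main diagonal, and a second run of high scores offset by four positions that reveals the internal repeat. Mapping each value to a shade of gray would give the two-dimensional projection described above.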

DUST

A standalone application that looks for low complexity sequences.

DWGA (Derived from Whole Genome Alignments)

Human versus Chimpanzee exception: The human versus chimpanzee orthologue predictions were obtained in a completely different manner. Since the current chimpanzee genome sequence assembly is the result of low-coverage sequencing, the assembled sequence is of too poor quality to generate a gene set on the classical Ensembl gene build pipeline. The chimpanzee gene set produced by Ensembl has rather been generated by "projecting" human genes to the chimpanzee genome through whole genome BLASTz alignments between both species and filtering for orthologue sequence alignments. The result of this procedure is de facto the human - chimpanzee orthologue set that has been Derived from Whole Genome Alignments (DWGA). See the Prediction Method section on a relevant Ensembl Gene Report page.

EMBL (European Molecular Biology Laboratory)

Europe's primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects and patent applications.

ENCODE (ENCyclopedia Of DNA Elements)

The ENCODE project uses defined regions of the Human genome to test and evaluate different methods and technologies for finding various functional elements in Human DNA. The two main criteria for manually selected regions were presence of well-studied genes or other known sequence elements, and existence of a substantial amount of comparative sequence data. A total of 14.82Mb of sequence was manually selected using this approach, consisting of 14 targets that range in size from 500kb to 2Mb.

Ensembl genes

Set of Ensembl gene predictions based on experimental evidence from protein sequences and/or near-full-length cDNA available from public sequence databases. "Ensembl known genes" are predicted on the basis of species-specific database entries from manually curated UniProt/Swiss-Prot, partially manually curated RefSeq and UniProt/TrEMBL databases. Predictions of "Ensembl novel genes" are based on other experimental evidence such as protein and cDNA sequence information from related species.
Golden genes are the result of a merge between a Havana transcript (manually curated) and an Ensembl gene prediction from the annotation pipeline. See "havana transcript".

Eponine

Eponine is a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence, with good specificity and excellent positional accuracy. Eponine models consist of a set of DNA weight matrices recognizing specific sequence motifs. Each of these is associated with a position distribution relative to the TSS.

EST (Expressed Sequence Tags)

Coarse sequence reads from flanking vector regions into the inserts of cDNA libraries. ESTs act as physical markers for cloning and full length sequencing of the cDNAs of expressed genes. Typically identified by purifying mRNAs, converting to cDNAs, and then sequencing a portion of the cDNAs. Usually short, single reads from a tissue or stage in development.

EST genes

Set of Ensembl gene predictions solely based on EST evidence. The process of EST gene prediction uses a combination of Exonerate, BLAST and Est2Genome to map ESTs onto the genomic sequence. Redundant ESTs are merged, before GenomeWise is used to assign 5' and 3' UTRs to the longest found ORF. See Eyras et al. for a more complete explanation of the EST gene prediction process.

Exon

The part of the genomic sequence that remains in the transcript (mRNA) after introns have been spliced out.

Exonerate

A fast gapped DNA-DNA alignment algorithm. It can be used for aligning various types of sequences such as genomic DNA, cDNAs/ESTs, and proteins.

Feature

Any annotation on a specific location in the genomic sequence.

Fgenes

FGENES, also known as Find Genes, is a Human gene predictor that is based on pattern recognition of different types of exons, promoters and poly A signals. It is built based on linear discriminant functions of internal, 5'-coding, and 3'-coding exon recognition. It is designed to find the optimal combination of these components and to construct a set of gene models along a given sequence.

Flanking sequence

Sequence 5' or 3' to a DNA or RNA sequence of interest (for example gene, transcript, SNP or repeat).

Frameshift intron

Frameshift introns are 1, 2, 4 or 5 base pairs long. They are introduced by the Ensembl genebuild in order to fit the cDNA sequence to the genome.

Frequency

A measure of how prevalent an allele or genotype is in a population. In Ensembl, it is displayed ranging from 0 (zero) to 1 (one).
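
A worked example may help here. The sketch below counts alleles across a small set of diploid genotypes and reports each allele's frequency on the 0 to 1 scale; the "A|G" genotype notation and the function name allele_frequencies are assumptions made for illustration.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """genotypes: iterable of strings such as "A|G", one per individual."""
    counts = Counter(allele for g in genotypes for allele in g.split("|"))
    total = sum(counts.values())  # total number of allele copies observed
    return {allele: n / total for allele, n in counts.items()}
```

For the four made-up genotypes A|A, A|G, A|A and A|G there are eight allele copies, six of them A, so the frequencies are 0.75 for A and 0.25 for G.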

GENCODE

The aim of GENCODE as a sub-project of the ENCODE scale-up project is to annotate all evidence-based gene features in the entire human genome at a high accuracy.
The GENCODE gene set is equivalent to the Ensembl/HAVANA merged gene set displayed on our website.

GeneWise

GeneWise is a sequence analysis tool for comparing proteins or profile HMMs to DNA sequences, allowing for introns and frameshifts. It is part of the Wise2 package, which was written by Ewan Birney. More information about the package can be obtained at: www.ebi.ac.uk/Wise2/

Genotype

Specific alleles present in an individual's genome, or the genetic makeup of one organism.

GO (Gene Ontology)

An organized hierarchy of terms produced by the Gene Ontology Consortium, used to describe biological processes, cellular components, and molecular functions. The three ontologies are as follows. Molecular Function Ontology: tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity. Biological Process Ontology: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions. Cellular Component Ontology: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex. A gene may be indexed under many GO terms, depending on the GO classification system. A gene product has one or more molecular functions and is used in one or more biological processes; it might be associated with one or more cellular components. For instance, cytochrome c can be described by the molecular function term electron transporter activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane.

Golden path length

The golden path is the length of the reference assembly. It consists of the sum of all top-level sequences in the seq_region table, omitting any redundant regions such as haplotypes and PARs (pseudoautosomal regions).

Haplotype

Known variations to the primary assembly, due to variability in the human genome sequence (e.g. the highly variable MHC locus containing haplotypes HSCHR6_MHC_COX, HSCHR6_MHC_SSTO, HSCHR6_MHC_APD, HSCHR6_MHC_DBB, HSCHR6_MHC_MANN, HSCHR6_MHC_MCF, and HSCHR6_MHC_QBL). In Region in Detail, the haplotype regions are coloured with a red background.

Haplotypes

A set of genes or markers on one chromosome that are inherited together. Often refers to SNPs that are closely linked (i.e. they have a high linkage disequilibrium (LD) value and are inherited together). In Region in Detail, the haplotype regions are coloured with a red background.

Havana transcript

A transcript resulting from manual curation of genome annotation for vertebrate species. The Havana team is a subset of Vega (See "Vega genes".)

HGVS names

Nomenclature for a given variant according to the Human Genome
Variation Society (HGVS). A guide to HGVS names can be found on their website.

High-coverage genome

Refers to the number of overlapping sequences used to build the genomic assembly. High coverage, such as human and mouse genomes, indicates a good amount of sequence information. This is also referred to as deep-coverage. Low coverage reflects a low amount of sequence information.

homologues

Specific sequences that are descended from the same common sequence in an ancestor. See orthologues or paralogues.

Identity

A measure of how similar two sequences are, specifically, what percent of amino acids are the same in type and position between the two sequences.
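
As a minimal sketch, percent identity can be computed from two already-aligned sequences of equal length. The choice of denominator varies between tools; this example divides by the full alignment length, which is an assumption, as is the function name percent_identity.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A gap ('-') never counts as an identical position.
    same = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)
```

For the made-up aligned pair MKT-LV and MRTALV, four of the six columns are identical, giving roughly 66.7%.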

Indel

A mutation or polymorphism in which one or more base pairs have been inserted into or removed from a genomic sequence.

InterPro

InterPro is an integrated resource for protein families, domains and sites, combining information from several different protein signature databases. InterPro IDs are linked to the summary of information about that domain or family. InterPro is managed by EBI. A number of databases (SwissProt, TrEMBL, PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR SuperFamilies and SUPERFAMILY) with different approaches to biological information are used to derive protein signatures. ProteinView, GeneView and DomainView provide links to the relevant InterPro entries.

Intron

The part of the genomic sequence that is transcribed and then spliced out of the transcript (mRNA). Noncoding.

Jalview

Jalview is a multiple alignment editor used by the EBI ClustalW server and the Pfam protein domain database. It is also available as a general purpose alignment editor.

Known gene

A known gene is an Ensembl gene for which at least one known transcript has been annotated.

Known transcript

A known Ensembl transcript matches to a sequence for the same species in a public, scientific database such as UniProtKB or NCBI RefSeq.

LD (Linkage Disequilibrium)

A measure of how often two SNPs or specific sequences are inherited together.

Length (aa)

The number of amino acids in, for example, a protein.

Length (bp)

The number of base pairs in, for example, a transcript.

LincRNA

Large intergenic non-coding RNAs, usually associated with open chromatin signatures such as histone modification sites.

Linkage

A measure of how often features (genes, specific sequences) on a chromosome are inherited together.

Low-complexity region

A region in the sequence with a biased composition (i.e. repeated sequences or residues.)

Low-coverage genome

Refers to the number of overlapping sequences used to build the genomic assembly. High coverage, such as human and mouse genomes, indicates a good amount of sequence information. This is also referred to as deep-coverage. Low coverage, such as the lesser-known mammals, reflects a low amount of sequence information. 2X genomes are low coverage.

Marker

A short sequence whose placement on the genome is known.

MBRH (Multiple Best Reciprocal Hit)

When, due to gene duplications, there are multiple 'best' hits with identical score, E-value, % identity and % positivity, one is unable to pick a unique orthologue for a gene. This results in more complex graphs of 'best' relationships. This often occurs when different genes have identical translations, which could be due to a duplication event, an assembly error, or chance. On average 3% of the genes have an identical translation to some other gene either within its genome or in another genome.

MBRH / DUP 1.# - MBRH set where in one genome there is only one gene, but the other genome has multiple genes, all on the same chromosome and within 1.5 megabases of each other. This could be due to recent gene duplication events where sequences have not diverged, or a mis-assembly of the genome sequence leading to artificial, apparent gene duplications (e.g. MBRH / DUP 1.2 or MBRH / DUP 1.4).

MBRH / SYN - This is a more complex MBRH set where there are multiple genes in each genome split across multiple chromosomes. The one(s) labeled MBRH/SYN satisfies both the MBRH criteria and the RHS search criteria.

MBRH / COMPLEX - This is a more complex MBRH set where there are multiple genes in each genome split across multiple chromosomes. This MBRH pair does not satisfy the RHS criteria.

MGI (Mouse Genome Informatics)

Houses a database that provides integrated access to data on the genetics, genomics, and biology of mouse (Mus musculus).

Microsatellite

A region in the genomic sequence containing short tandem repeats.

MiRNA

MicroRNA is single-stranded RNA, typically 21-23 nucleotides long, that is thought to be involved in gene regulation (especially inhibition of protein expression).

Mutation

A modification (insertion, deletion, or alteration) in the genomic or amino acid sequence.

ncRNA (non-coding RNA)

Short non-coding RNAs such as tRNA, rRNA, scRNA, snRNA, snoRNA and miRNA are annotated by the Ensembl project (see article). Long intergenic ncRNAs have only been annotated for human and mouse.

Non coding

Transcript does not result in a protein product.

Nonsense mediated decay

Transcript is thought to undergo nonsense mediated decay, a process which detects nonsense mutations and prevents the expression of truncated or erroneous proteins.

Novel gene

A novel gene is an Ensembl gene for which only one or more novel transcripts have been annotated.

Novel transcript

A novel Ensembl transcript does not match to a sequence for the same species in a public, scientific database such as UniProtKB or NCBI RefSeq.

OMIM (Online Mendelian Inheritance in Man)

A genetic knowledge database that focuses on the relationship between phenotype and genotype. Mendelian Inheritance in Man (MIM) was first published in 1966, and OMIM is updated daily. Ensembl links to OMIM entries in the gene tab (under External references), and in the variation tab (under Phenotype Data).

ORF (Open Reading Frame)

A DNA sequence that possesses a start codon and a large window of sequence with no stop codon that could potentially code for a protein.
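
A simple forward-strand scan illustrates the definition. This is a sketch under stated assumptions: it reports only ORFs that begin with ATG and end at an in-frame stop codon, it ignores the reverse strand, and the min_codons threshold stands in for the "large window" mentioned above. The name find_orfs is illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end) half-open coordinates of forward-strand ORFs."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                # Count the codons before the stop, including the ATG.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```

On the made-up sequence CCATGAAATTTTAGCC this finds one ORF, ATGAAATTTTAG, in the third reading frame. A realistic search would use a much larger min_codons value.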

Orthologue

Orthologues are genes derived from a common ancestor through vertical descent (or speciation) and can be thought of as the direct evolutionary counterpart. In contrast, paralogues are genes within the same genome that have evolved by duplication.

PAR (Pseudoautosomal region)

Small regions of sequence identity located at the tips of the short and the long arms of the X and Y chromosomes where recombination and genetic exchange take place. Genes within the pseudoautosomal region are not sex linked.
The Genome Reference Consortium defines two PARs for the human genome assembly. The first pseudoautosomal region, PAR1, is located at the tip of the short arm and consists entirely of N's. The second pseudoautosomal region, PAR2, is located at the tip of the long arm.
In the Ensembl human database, DNA for the complete X chromosome is stored and annotated. Only the two unique regions of the Y chromosome are stored and annotated. We are able to represent the complete Y chromosome by filling the 'gaps' with the two PAR regions from the X chromosome. This is done on-the-fly using our assembly_exceptions table.
Please note that when using the API, SliceAdaptor by default will fetch only the unique regions of the genome. This means that the PARs on chromosome X will be fetched but only the unique regions on Y will be fetched. To fetch the full length of the Y chromosome using the SliceAdaptor, set the 4th argument to '1' as shown:
my $slices = $slice_adaptor->fetch_all( 'toplevel', 'GRCh37', 0, 1 );

Paralogues

Sequences (homologues) that have evolved by duplication.

Patch

Alternate sequences relative to the genomic DNA on the primary assembly. Novel patches represent new allelic loci, but they are not necessarily haplotypes. Fix patches mark places where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence. Currently available for human, they are provided by the GRC. By default, our browser displays the unchanged primary assembly (e.g. GRCh37 chromosomes). In order to apply a novel (red) or fix (green) patch to a chromosome, click on the "Assembly Exception" track from the Region In Detail window.

PDB (Protein Data Bank)

A repository for 3-D biological macromolecular structure data. PDB archives protein structures deduced from crystallography and nuclear magnetic resonance (NMR) experiments. The Protein Data Bank (PDB) is operated by Rutgers, The State University of New Jersey; the San Diego Supercomputer Center at the University of California, San Diego; and the Center for Advanced Research in Biotechnology of the National Institute of Standards and Technology, three members of the Research Collaboratory for Structural Bioinformatics (RCSB). The RCSB PDB is supported by funds from the National Science Foundation, the Department of Energy, and the National Institutes of Health.

Pfam

Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. Pfam can be used to view the domain organization of proteins, to view multiple alignments, protein domain architectures, protein structures, and species distributions.

Pmatch

Pmatch is a fast, exact matching program for aligning protein sequences with either protein or DNA sequence.

Polymorphic pseudogene

Pseudogene owing to a polymorphism in the reference genome, translated in other individuals/haplotypes/strains.

PolyPhen

A tool which predicts the variation effect on protein function based on
physical and comparative considerations. See the PolyPhen website for more information.

Pre-release site

Initial annotations of upcoming Ensembl genomes, usually without gene predictions or validation, are regularly made available on the pre-release site, pre.ensembl.org.

Prints

The PRINTS protein fingerprint database is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SwissProt/TrEMBL composite. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbors.

Processed pseudogene

Noncoding pseudogene produced by integration of a reverse transcribed mRNA into the genome.

Processed transcript

Noncoding transcript that does not contain an open reading frame (ORF).

Projected gene (or known by_projection)

A projected Ensembl gene has only one or more novel transcripts annotated, and has a known gene from human or mouse as an orthologue. The gene symbol and description are projected from the human or mouse orthologue.

Prosite

PROSITE is a database of protein families and domains run by the Expert Protein Analysis System (ExPASy) proteomics server of the Swiss Institute of Bioinformatics (SIB). It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

Protein coding

A protein coding transcript is a spliced mRNA that leads to a protein product.

Protein ID

Ensembl protein IDs are unique for differing translations.

Pseudogene

A noncoding sequence similar to that of an active protein-coding gene.

QTL (Quantitative Trait Locus)

Genetic loci where allelic variation is associated with variation in a quantitative trait (e.g. blood pressure). The presence of QTL is inferred from genetic mapping. Total variation is partitioned into components linked to a number of discrete, mapped chromosome markers described by statistical association to quantitative variation in a particular phenotypic trait that is thought to be controlled by the cumulative action of alleles at multiple loci.

Query %id

Query %id indicates the percentage of the query sequence matching the target sequence.

Reference SNP (Reference Single Nucleotide Polymorphism)

A SNP assigned to eliminate redundancy in the NCBI dbSNP database. All SNPs submitted at the position of a reference SNP are given the reference SNP identifier (a number preceded by 'rs').

Retrotransposed

A noncoding pseudogene produced by integration of a reverse transcribed mRNA into the genome.

RH map (Radiation Hybrid map)

Technique for identifying landmarks (STS) every 100 kb in the human genome. The landmarks are ordered according to the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analysing a panel of human-hamster hybrid cell lines.

Seg

Seg divides sequences into contrasting segments of low complexity and high complexity. Low-complexity segments defined by the algorithm represent "simple sequences" or "compositionally-biased regions". Segment lengths and the number of segments per sequence are determined automatically by the algorithm.
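
The idea of a low-complexity segment can be illustrated with a much simpler measure than the one Seg itself uses. The sketch below is not the Seg algorithm: it assumes a fixed sliding window and a fixed Shannon entropy cut-off (both hypothetical parameters chosen for illustration), whereas Seg determines segment boundaries automatically.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq: str, window: int = 12, threshold: float = 2.2) -> list[int]:
    """Start positions of windows whose entropy is at or below the threshold.

    The window size and threshold are illustrative assumptions, not Seg defaults.
    """
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) <= threshold
    ]
```

A run of a single residue has an entropy of 0 bits, while a window in which every residue is different has the maximum entropy for its length, so compositionally biased regions stand out as runs of low-scoring windows.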

SGD (Saccharomyces Genome Database)

Canonical database for the molecular biology and genetics of Saccharomyces cerevisiae.

Shotgun method

(also whole genome shotgun) Semi-automated sequencing method that involves randomly cloning pieces of the genome (size selected, usually 2, 10, 50 and 150 kb), with no prior knowledge of their location. The clones are then sequenced from both ends. The two ends of the same clone are referred to as mate pairs. The distance between two "mate pairs" can be inferred if the library size is known and has a narrow window of deviation. This approach can be contrasted with "directed" strategies, in which pieces of DNA from known chromosomal locations are sequenced.

Shotgun sequencing

A method in which small, random DNA sequences are generated that overlap. The fragments are sequenced and the full, connected sequence determined through the overlaps.
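
How overlaps connect fragments can be sketched with a toy greedy merger. This is a minimal illustration under strong assumptions (error-free reads, a single strand, no repeats, no read contained inside another); the function names and the `min_len` parameter are hypothetical, and real assemblers are far more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the longest such overlap is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
```

For example, the three reads `ATGGCGT`, `GCGTACG` and `TACGTTA` overlap by four bases each and are merged into the single sequence `ATGGCGTACGTTA`.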

SIFT

A tool which predicts the variation effect on protein function based on
sequence homology and the physico-chemical similarity between the alternate
amino acids. See the SIFT website for more information.

SignalP

The SignalP application predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms:
Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes.
The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction
based on a combination of several artificial neural networks.
Signal peptides indicate a protein that will be secreted.
Prediction of signal peptides is quite accurate; however, care must be exercised and these regions should be verified by other means.
(Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von Heijne.
Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.
Protein Engineering 10, 1-6 (1997).)

Similarity

A measure of how well one sequence matches another, calculated by an alignment program from the identical and conserved residues.

SNAP

(Synonymous/Non-synonymous Analysis Program) A program which calculates synonymous and non-synonymous substitution rates from a set of codon-aligned nucleotide sequences, using the method of Nei and Gojobori and incorporating a statistic developed by Ota and Nei.

SNAP is also the name of an ab initio gene prediction program developed by Ian Korf that models protein coding sequences in genomic DNA by means of hidden Markov models.

SnoRNA

Small nucleolar RNA, involved in modifications of other RNAs.

SnoRNA pseudogene

A pseudogene derived from a small nucleolar RNA (snoRNA) gene. SnoRNAs are involved in modifications of other RNAs.

SNP (Single Nucleotide Polymorphism)

SNPs are common variations that occur in DNA with a 0.1% frequency. Ensembl displays SNPs obtained from dbSNP (the SNP repository maintained by NCBI), The Human Genic Bi-Allelic Sequences Database (HGVBase) and The SNP Consortium Ltd. (TSC).

SSAHA (Sequence Search and Alignment by Hashing Algorithm)

A search designed to detect exact matches, or nearly exact matches, in DNA or protein databases. The SSAHA search has been optimized for alignments of high percentage identity and displays as results the most significant matches for ungapped alignments between sequences. Each exact match in an SSAHA alignment is analogous to finding a high-scoring segment pair in BLAST. A number of consecutive matches on a contig may represent features of a gene such as exons or 5' and 3' untranslated regions, depending on the nature of the query sequence.
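
The general idea of finding exact matches through a hash table of short words can be sketched as follows. This is a simplified illustration, not the SSAHA implementation: the function names, the word length `k` and the choice of indexing non-overlapping words of the subject are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(subject: str, k: int) -> dict[str, list[int]]:
    """Hash every non-overlapping k-mer of the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(subject) - k + 1, k):
        index[subject[pos:pos + k]].append(pos)
    return index


def find_hits(query: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) for every overlapping query k-mer in the index."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, spos))
    return hits


def hits_by_diagonal(hits: list[tuple[int, int]]) -> dict[int, list[tuple[int, int]]]:
    """Group hits by offset (subject_pos - query_pos).

    Several hits sharing an offset are consistent with one ungapped match.
    """
    diagonals = defaultdict(list)
    for qpos, spos in hits:
        diagonals[spos - qpos].append((qpos, spos))
    return dict(diagonals)
```

With the subject `ACGTACGGTTCA`, `k = 4` and the query `ACGGTTCA`, both word hits fall on the same offset of 4, which is what a run of consecutive exact matches looks like in this representation.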

STS markers

STS markers are short sequences of genomic DNA that can be uniquely amplified by the polymerase chain reaction (PCR) using a pair of primers. Because each is unique, STSs are often used in linkage and radiation hybrid mapping techniques. STSs serve as landmarks on the physical map of the human genome.

Supercontig

Supercontigs or scaffolds are sets of ordered, oriented contigs. They are longer sequences than contigs, but shorter than full chromosomes.

supercontigs

Assemblies consist of sequence contigs combined into scaffolds, also known as supercontigs. Supercontigs are combined and ordered according to their orientation and linking information provided by mated sequences from the ends of genomic sub-clones. For some species, supercontigs are combined into ultracontigs, in which neighboring supercontigs are organized into their proper order and orientation using linking information provided by the physical map of BAC clones independently assembled using restriction fragment patterns and the FPC program.

Synteny

The term synteny was originally defined to mean that two gene loci share the same chromosome. In a genomic context we refer to syntenic regions if both sequence and gene order are conserved between two (closely related) species.

tandem repeats

Multiple copies of the same base sequence on a chromosome; used as markers in physical mapping.
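
As a small illustration, adjacent copies of a short unit can be located with a regular expression back-reference. This is only a sketch for exact repeats of a fixed unit length; the function name and parameters are hypothetical, and dedicated repeat-finding programs also handle imperfect copies.

```python
import re


def find_tandem_repeats(seq: str, unit_len: int, min_copies: int = 3) -> list[tuple[int, str, int]]:
    """Return (start, unit, copies) for each run of exact adjacent copies of a unit."""
    # ([ACGT]{n}) captures one unit; \1{m,} requires at least m further copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // unit_len)
        for m in pattern.finditer(seq)
    ]
```

For the sequence `TTGCACACACAGG`, a unit length of 2 and at least three copies, this reports four copies of `CA` starting at position 3 (0-based).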

Target %id

Target %id indicates the percentage of the target sequence matching the query sequence.
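
The difference between Query %id and Target %id is only the denominator. The sketch below assumes that "matching" means identical aligned residues and that each percentage is taken over the full length of the respective sequence; both are assumptions made for illustration, and the function name is hypothetical.

```python
def percent_ids(aligned_query: str, aligned_target: str,
                query_len: int, target_len: int) -> tuple[float, float]:
    """Return (query %id, target %id) for a pairwise alignment.

    The aligned strings must have equal length and use '-' for gaps.
    query_len and target_len are the full, ungapped sequence lengths.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1 for q, t in zip(aligned_query, aligned_target) if q == t and q != "-"
    )
    return 100.0 * identical / query_len, 100.0 * identical / target_len
```

With 6 identical positions, a query of length 8 and a target of length 12, the query %id is 75.0 and the target %id is 50.0: the same alignment covers a larger share of the shorter sequence.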

TEC

To Be Experimentally Confirmed. Non-spliced EST clusters with polyA features.

Toplevel

The largest continuous sequence for an organism. The official technical definition of toplevel sequences is 'sequence regions in the genome assembly that are not a component of another sequence region'. For example, when a genome is assembled into chromosomes, toplevel sequences will be chromosomes and unplaced scaffolds. If a genome has only been assembled into scaffolds, then toplevel sequences are scaffolds and unplaced contigs.
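
The definition above amounts to a set difference: all sequence regions minus those that appear as a component of another region. A minimal sketch, using a hypothetical `components` mapping rather than any Ensembl data structure:

```python
def toplevel_regions(regions: set[str], components: dict[str, list[str]]) -> set[str]:
    """Return the regions that are not a component of any other region.

    components maps a sequence region to the regions it is assembled from.
    """
    used_as_component = {part for parts in components.values() for part in parts}
    return regions - used_as_component


# Illustrative assembly: one chromosome built from two scaffolds,
# plus one scaffold that has not been placed on a chromosome.
regions = {"chr1", "scaffold_1", "scaffold_2", "scaffold_3",
           "contig_1", "contig_2", "contig_3"}
components = {
    "chr1": ["scaffold_1", "scaffold_2"],
    "scaffold_1": ["contig_1"],
    "scaffold_2": ["contig_2"],
    "scaffold_3": ["contig_3"],
}
toplevel = toplevel_regions(regions, components)
```

Here `toplevel` contains `chr1` and the unplaced `scaffold_3`, matching the chromosome-level example in the definition.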

Transcript

Nucleotide sequence resulting from the transcription of the genomic DNA to
mRNA. One gene can have different transcripts or splice variants resulting
from the alternative splicing of different exons in genes.

TSC (The SNP Consortium)

A non-profit foundation set up to make SNP-related information available to the public without intellectual property restrictions.

Unigene

UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.

UniProt/Swiss-Prot

UniProt (Universal Protein Resource) is the world's most comprehensive catalogue of information on proteins. UniProt/Swiss-Prot is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. Swiss-Prot is maintained collaboratively by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI).

UniProt/TrEMBL

SPTrEMBL is a subset of TrEMBL (Translated EMBL database) containing the computer-annotated protein translations of all coding sequences (CDS) present in the EMBL nucleotide sequence database that are not yet incorporated into the UniProt/Swiss-Prot database.

UniSTS

UniSTS is an NCBI resource for non-redundant Sequence Tagged Site (STS) markers. For each marker, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to dbSNP, RHdb, GDB, MGD, etc. The marker report also lists GenBank and RefSeq records that contain the primer sequences determined by ePCR.

UTR (Untranslated Region)

The 5' UTR is the portion of an mRNA from the 5' end to the position of the first codon used in translation. The 3' UTR is the portion of an mRNA from the position of the last codon that is used in translation to the 3' end.
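
In terms of coordinates, the two UTRs are simply what remains of the mRNA on either side of the translated region. A minimal sketch, assuming 0-based coordinates with an exclusive end; the function name is hypothetical, and whether the stop codon is counted as part of the translated region is left to the caller.

```python
def split_utrs(mrna: str, cds_start: int, cds_end: int) -> tuple[str, str, str]:
    """Split an mRNA into (5' UTR, translated region, 3' UTR).

    cds_start: 0-based index of the first base of the first translated codon.
    cds_end:   0-based index one past the last base of the last translated codon.
    """
    if not 0 <= cds_start <= cds_end <= len(mrna):
        raise ValueError("coding coordinates fall outside the mRNA")
    if (cds_end - cds_start) % 3 != 0:
        raise ValueError("translated region is not a whole number of codons")
    return mrna[:cds_start], mrna[cds_start:cds_end], mrna[cds_end:]


# Illustrative sequence: 6 bases of 5' UTR, 3 codons, 4 bases of 3' UTR.
five_utr, cds, three_utr = split_utrs("GGCACCATGGCTTAACTTG", 6, 15)
```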

Validation status

A measure of confidence that a variant is a true polymorphism. It includes
1000 Genomes, HapMap and other validation statuses from dbSNP such as frequency and cluster. See a detailed description on the dbSNP website.

Vega genes

Vega genes from the Vertebrate Genome Annotation (VEGA) database include manual annotation of specific Human, Mouse, and Zebrafish clones. Annotation is performed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases and ab initio gene prediction applications (genscan, Fgenes). Comparative analysis using vertebrate datasets is used to aid novel gene discovery. The data gathered in these steps are then used to manually annotate the clone, adding gene structures, descriptions and poly-A features. The annotation is based on supporting evidence only.

YAC (Yeast Artificial Chromosome)

Originating from a bacterial plasmid, a YAC contains a yeast centromeric region (CEN), a yeast origin of DNA replication, a cluster of unique restriction sites, a selectable marker and a telomere region at the end of each arm. YACs are capable of cloning extremely large segments of DNA (over 1 megabase long) into a host cell, where the DNA is propagated along with the other chromosomes of the yeast cell.