Microarray Probe Mapping

Ensembl annotates expression microarrays against the genome sequence when the manufacturer
discloses the probe sequences for a given microarray. The mapping process is a two-step procedure.

Step One: Genome/Transcript Sequence Alignment

In the first step, individual probes (oligonucleotides) are mapped to both the
genome sequence and the cDNA sequence. Transcript alignments are performed to capture probes which span
introns. All alignments are stored with reference to the genome sequence, i.e. transcript alignments are
reconstituted as gapped alignments against the genome. Alignments are stored as ProbeFeatures using the
extended CIGAR format as defined by the SAMtools group.
Alignments are performed using the Ensembl analysis pipeline, implementing the Exonerate sequence
comparison and alignment tool (Slater et al., 2005). By default, a 1 bp mismatch is permitted
between the probe and the genome sequence assembly. Probes that match at 100 or more
locations (e.g. suspected Alu repeats) are discarded and not stored in the database.
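
As an illustration of how a transcript alignment can be reconstituted as a gapped genomic alignment, the following minimal Python sketch projects an ungapped probe match in cDNA coordinates onto the genome and writes the result as an extended CIGAR string, using M for aligned blocks and N for skipped introns. It is not the pipeline's own code. The function name, the 1-based inclusive coordinates and the restriction to forward-strand transcripts are all assumptions made for the example.

```python
from typing import List, Tuple


def project_to_genome(
    exons: List[Tuple[int, int]], t_start: int, t_end: int
) -> Tuple[int, str]:
    """Project an ungapped cDNA match onto the genome (hypothetical helper).

    exons   : genomic (start, end) pairs, 1-based inclusive, sorted in
              ascending order, forward strand only (assumption).
    t_start : first cDNA base covered by the probe (1-based).
    t_end   : last cDNA base covered by the probe (1-based, inclusive).
    Returns the genomic start and an extended CIGAR string.
    """
    if t_start < 1 or t_end < t_start:
        raise ValueError("invalid cDNA coordinates")

    blocks = []   # genomic (start, end) of each exonic piece of the match
    consumed = 0  # cDNA bases covered by the exons seen so far
    for g_start, g_end in exons:
        exon_len = g_end - g_start + 1
        c_start, c_end = consumed + 1, consumed + exon_len
        lo, hi = max(t_start, c_start), min(t_end, c_end)
        if lo <= hi:  # the match overlaps this exon
            blocks.append((g_start + lo - c_start, g_start + hi - c_start))
        consumed += exon_len

    if t_end > consumed:
        raise ValueError("match extends past the end of the transcript")

    parts = []
    prev_end = None
    for b_start, b_end in blocks:
        if prev_end is not None:
            parts.append(f"{b_start - prev_end - 1}N")  # intron gap
        parts.append(f"{b_end - b_start + 1}M")          # aligned block
        prev_end = b_end
    return blocks[0][0], "".join(parts)


# A 10-base probe spanning the junction between two exons.
start, cigar = project_to_genome([(101, 110), (201, 220)], 6, 15)
print(start, cigar)
```

A probe lying entirely within one exon yields a single M block, so the same representation covers both genomic and transcript alignments.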

Step Two: Ensembl Transcript Annotation

In the second step, we aim to associate microarray probes or probe sets with Ensembl transcript
predictions (ENST...) using the ProbeFeatures generated from step one. For arrays with probe sets
(e.g. Affymetrix®) it is normally required that more than 50% of the probes in a probe set
hit a given transcript sequence. Probe set sizes are determined dynamically on a per probe set basis,
rather than taking the array-wide value documented by the manufacturer. Arrays which do not contain
probe sets as part of their design have transcript annotations assigned directly to individual probes.
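
The probe set rule can be sketched in a few lines of Python. This is a simplified illustration rather than the pipeline implementation: the function name, the dictionary inputs and the identifiers in the usage example are hypothetical. The size of each probe set is counted from the probes actually present, mirroring the dynamic sizing described above, and a transcript is kept only if more than 50% of those probes hit it.

```python
from collections import defaultdict
from typing import Dict, List, Set


def annotate_probe_sets(
    probe_to_set: Dict[str, str],
    probe_hits: Dict[str, Set[str]],
    min_fraction: float = 0.5,
) -> Dict[str, List[str]]:
    """Assign transcripts to probe sets (hypothetical helper).

    probe_to_set : probe ID -> probe set ID.
    probe_hits   : probe ID -> set of transcript IDs the probe matches.
    min_fraction : a transcript is kept when strictly more than this
                   fraction of the probes in the set hit it.
    """
    set_sizes = defaultdict(int)                       # probes per probe set
    hit_counts = defaultdict(lambda: defaultdict(int))  # set -> transcript -> hits
    for probe, pset in probe_to_set.items():
        set_sizes[pset] += 1
        for transcript in probe_hits.get(probe, ()):
            hit_counts[pset][transcript] += 1

    return {
        pset: sorted(
            tx for tx, n in hit_counts[pset].items() if n > min_fraction * size
        )
        for pset, size in set_sizes.items()
    }


# Hypothetical identifiers, for illustration only.
probe_to_set = {"p1": "PS1", "p2": "PS1", "p3": "PS1", "p4": "PS1"}
probe_hits = {
    "p1": {"ENST_A", "ENST_B"},
    "p2": {"ENST_A", "ENST_B"},
    "p3": {"ENST_A"},
}
print(annotate_probe_sets(probe_to_set, probe_hits))
```

Here ENST_A is hit by three of the four probes and is kept, while ENST_B is hit by exactly half and so does not pass the "more than 50%" threshold.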

A ProbeFeature is matched to a transcript if it overlaps an exon or UTR region with at most 1 bp mismatch.
To account for conservative UTR estimation, transcript cDNA sequences are extended by the length of the UTR.
Where annotated UTRs are absent, a default UTR length is used, calculated separately for 5' and 3' UTRs
as the greater of the mean and the median of all annotated UTR lengths for a given species.
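
The default UTR length calculation can be written as a short Python sketch. The function name and the example lengths are made up for illustration. The text does not say whether the pipeline rounds the result, so none is applied here.

```python
from statistics import mean, median
from typing import Sequence


def default_utr_length(annotated_lengths: Sequence[int]) -> float:
    """Return the greater of the mean and the median UTR length.

    annotated_lengths : lengths of all annotated UTRs of one type
                        (5' or 3') for a species. Hypothetical helper.
    """
    if not annotated_lengths:
        raise ValueError("no annotated UTRs to derive a default from")
    return max(mean(annotated_lengths), median(annotated_lengths))


# Made-up 3' UTR lengths: a long outlier pulls the mean above the median.
three_prime_default = default_utr_length([100, 200, 900])
print(three_prime_default)
```

Taking the greater of the two statistics keeps the default generous whichever way the length distribution is skewed, which suits the stated aim of compensating for conservative UTR estimates.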

Data Access

In the Ensembl browser, individual probe alignments from step one can be displayed in the 'Region in detail' view.
Probes that match to a transcript can be seen in the 'Oligo probes' view, accessible via the transcript page.

The probe mappings and transcript annotations now reside in the functional genomics (funcgen) database. As such,
programmatic access requires the use of the ensembl-functgenomics API.
POD documentation is available here:

Probe- and ProbeSet-level transcript annotations are now stored in the funcgen databases, along with information on individual ProbeFeatures. Objects which fail the mapping criteria are stored as UnmappedObjects. An example script for access to these data is available here:

The transcript annotations generated from the Ensembl array mapping pipeline are also available in BioMart. These data are currently incorporated into the main Ensembl genes mart; see the 'Microarray' section in the 'Attributes' panel.

Running The Pipeline

Fancy running your own custom array through the Ensembl array mapping pipeline? Further documentation about the efg array mapping environment can be found here: