Exon Numbering—Not As Easy As 1, 2, 3...

Genes With Multiple Transcripts Cause Confusion

Different transcripts for the same gene may differ from one another not just by the addition or deletion of exons, but also by the presence of alternative, internal splice sites within an exon. Researchers sometimes ask us why the exon location data we provide on our PrimeTime® qPCR Assay ordering page is different from what they see when using other tools, such as the gene database at NCBI (www.ncbi.nlm.nih.gov/gene). Until recently, NCBI numbered exons within each transcript individually, and there was no consistent gene-based numbering system for identifying exons across these transcripts. Thus, transcript variants could each be annotated with the same number of exons, but due to alternative splicing or different transcriptional start and stop sites, the size and location of these exons would not be the same in all cases.

Simplifying Exon Numbering

To improve upon this situation, IDT has been working towards generating a consensus exon-numbering system that will be meaningful across these inconsistently annotated transcripts. This approach provides naming consistency for the purposes of identifying the appropriate exons and the IDT PrimeTime qPCR Assays used to detect them.

In the example shown in Figure 1, six splice variants of human HMGA1 have been described. NCBI numbers exons sequentially for each individual RefSeq entry (Figure 1A), so exon 3 of NM_145899.2 is equivalent to exon 2 of NM_145901.3. When designing PrimeTime qPCR Assays targeting these sequences, IDT design tools consolidate the exon data for the different variants into a single numbering system, as shown in Figure 1B. This is the numbering system displayed on the Results page when identifying an amplicon region in the IDT PrimeTime qPCR Assay Library. The exon numbering scheme used by NCBI (based on specific transcripts) is still retained under the RefSeq # tab for each assay ID.
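The idea of consolidating per-transcript exon numbers into one gene-level numbering can be illustrated with a minimal Python sketch. It is not IDT's design tool, whose algorithm is not described here. It assumes each exon is given as an inclusive (start, end) genomic coordinate pair on the same strand, and that exons from different transcripts that overlap belong to the same consolidated exon. The transcript names `tx_A` and `tx_B`, the coordinates, and both function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive genomic coordinates (assumed)


def consolidate_exons(transcripts: Dict[str, List[Exon]]) -> List[Exon]:
    """Merge overlapping exons from all transcripts into gene-level regions."""
    intervals = sorted(e for exons in transcripts.values() for e in exons)
    merged: List[Exon] = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def consolidated_numbers(
    transcripts: Dict[str, List[Exon]], merged: List[Exon]
) -> Dict[str, Dict[int, int]]:
    """Map each transcript's own exon number to the consolidated number."""
    table: Dict[str, Dict[int, int]] = {}
    for name, exons in transcripts.items():
        mapping: Dict[int, int] = {}
        for own_number, (start, end) in enumerate(sorted(exons), start=1):
            for gene_number, (m_start, m_end) in enumerate(merged, start=1):
                if m_start <= start and end <= m_end:
                    mapping[own_number] = gene_number
                    break
        table[name] = mapping
    return table


# Made-up coordinates: tx_B lacks the second exon present in tx_A.
transcripts = {
    "tx_A": [(100, 200), (300, 400), (500, 600), (700, 800)],
    "tx_B": [(100, 200), (500, 600), (700, 800)],
}
merged = consolidate_exons(transcripts)
numbering = consolidated_numbers(transcripts, merged)
print(numbering)
```

In this toy data, exon 3 of `tx_A` and exon 2 of `tx_B` map to the same consolidated exon, which mirrors the kind of mismatch between per-transcript numbers described above.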

Figure 1. Exon Structure of Splice Variants for Human HMGA1. The exon positions of HMGA1 transcripts NM_145899.2 and NM_145905.2 are used to demonstrate the numbering systems used by NCBI. RefSeq ID numbers (in blue in left and right columns) indicate independent transcripts, green boxes show exons, green lines denote introns. Numbers indicate the different exons, with (A) those in blue demonstrating how exons would be labeled by NCBI, and (B) those in black showing the consolidated exon numbering system used by IDT.

The results of a search for the human HMGA1 gene are shown in Figure 2. Assay ID# Hs.PT.58.38699366 spans exons 4–6, as per the consolidated numbering system used by IDT and shown on the Results page of the assay selection tool, in the Exon Location column of Figure 2 (blue). Referring to Figure 1, these are the exons nearest the 3’ end of the gene. Note that in the Transcript Locations pop-up window, there are exon naming differences across the transcripts.

Figure 2. Exon Numbering for a Single PrimeTime® qPCR Assay. An example of the PrimeTime qPCR Assay Library Results page on the IDT website shows the results of a search for the human high mobility group AT-hook 1 gene, HMGA1. The specific genomic exon location for a qPCR assay for the HMGA1 gene is shown. The qPCR assay location based on the consolidated exon numbering system used by IDT is indicated in blue. Location of the qPCR assay in specific NCBI transcript entries is shown in green. The red labeling shows an example where transcript variants did not include exon data.

When Consolidation Cannot be Performed

There are occasions when it is not possible to give a single, uniform consolidated exon location. In these instances, IDT reverts to giving transcript exon numbers (following the NCBI systems) and marks them with a superscripted “1”. The superscripted “1” means that the researcher needs to review the assay to confirm that it is recognizing the desired sequence/location because there are conflicting exon numbers for that gene within the different NCBI databases.

Figure 3 shows an example of an assay with this type of notation (Hs.PT.58.4968362). In this case, the assay will amplify from two splice variants, NM_002131.3 and NM_145899.3. The forward primer binds in the first exon and the reverse primer binds in the third exon of each transcript. Figure 1B shows that although these transcripts have identical first exons, they have slightly different third exons, which in our consolidated numbering system are numbered 3a and 3b.
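The 3a/3b style of labeling can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual IDT procedure: the function name `label_variants`, the input layout, and the coordinates are hypothetical, and assigning letters in order of genomic coordinates is an arbitrary choice made for the example.

```python
from string import ascii_lowercase
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive genomic coordinates (assumed)


def label_variants(regions: List[List[Exon]]) -> Dict[Exon, str]:
    """Label each distinct exon form within each consolidated exon region.

    `regions` holds, in genomic order, the exon forms observed across
    transcripts for each consolidated exon. A region with one distinct form
    gets a plain number; a region with several gets lettered sub-labels.
    """
    labels: Dict[Exon, str] = {}
    for number, forms in enumerate(regions, start=1):
        unique = sorted(set(forms))
        if len(unique) == 1:
            labels[unique[0]] = str(number)
        else:
            # Letter order follows coordinate order (an arbitrary assumption).
            for letter, form in zip(ascii_lowercase, unique):
                labels[form] = f"{number}{letter}"
    return labels


# Made-up data: two transcripts share exons 1 and 2 but use different
# splice sites in the third exon.
regions = [
    [(100, 200), (100, 200)],
    [(300, 400)],
    [(500, 600), (500, 630)],
]
labels = label_variants(regions)
print(labels)
```

An assay whose primer lands in a region with lettered sub-labels is the kind of case where a single consolidated exon location may not apply to every transcript.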

Figure 3. An Assay Where Exon Consolidation is Not Possible. The specific genomic exon location for a qPCR assay for human HMGA1 gene is shown. The exon location, highlighted in the blue oval, has a superscripted “1”, indicating exon consolidation was not possible for this assay. The exon location is copied from the transcript location highlighted in red.

Missing Data

Occasionally, when we download the RefSeq transcript data, we find there is no description of exon locations within the record. For assays that will amplify these transcripts, we designate the exon numbers 0-0 to indicate the lack of data. In addition, because we cannot verify exon boundaries in these transcripts, we will designate all assays associated with such transcripts as ".g" (i.e., not genomically protected).
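The missing-data rule can be expressed as a short Python sketch. It assumes a hypothetical input in which each transcript maps either to a (first exon, last exon) pair for the assay or to `None` when the RefSeq record carries no exon description. The function name `annotate_assay` and the transcript names are made up for illustration, and the returned flag only covers the missing-data reason for a ".g" designation described above.

```python
from typing import Dict, Optional, Tuple


def annotate_assay(
    transcript_exons: Dict[str, Optional[Tuple[int, int]]]
) -> Tuple[Dict[str, str], bool]:
    """Return per-transcript exon locations and a missing-data flag.

    A transcript without exon data is reported as "0-0". The flag is True
    when at least one associated transcript lacks exon data, in which case
    exon boundaries cannot be verified for the assay.
    """
    locations = {
        name: "0-0" if span is None else f"{span[0]}-{span[1]}"
        for name, span in transcript_exons.items()
    }
    g_due_to_missing_data = any(
        span is None for span in transcript_exons.values()
    )
    return locations, g_due_to_missing_data


# Hypothetical assay amplifying two transcripts, one without exon data.
locations, flagged = annotate_assay({"tx_A": (4, 6), "tx_C": None})
print(locations, flagged)
```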

Coming Changes

NCBI is currently generating a consolidated exon numbering system for each gene in the human genome. At the time of writing (July 2013), it had covered ~20% of the genes. Details of NCBI exon numbering may be found in the GenBank file for RefSeq Genes (those with a GenBank accession number starting with “NG_”). IDT will likely adopt the NCBI system once it is completed.