How does VectorBase annotate repeats and mask sequences?

Submitted by admin on Mon, 2011-08-29 17:33

Answer:

It is standard practice to annotate repetitive regions of a genome as repeat features. The repeats can be interesting in their own right, but annotating them is also a prerequisite for subsequent analyses, such as gene prediction or whole genome alignment. VectorBase uses the same methodology for almost all species; exceptions are noted below.

TEfam libraries are only available for some of our mosquito species, and the Repbase library has relatively few arthropod repeats for non-reference species (i.e. anything other than Aedes aegypti, Anopheles gambiae, and Culex quinquefasciatus), so most of our annotation of complex repeats is derived from the de novo, species-specific libraries.

Repeat annotation: exceptions

Occasionally, species are provided to VectorBase with repeat features already annotated; in these cases we only annotate with Dust, TRF, and the Repbase library, and do not generate de novo libraries. This currently applies to Aedes albopictus and Musca domestica.

The de novo, species-specific libraries are sometimes augmented with additional repeat libraries from other species, from sources such as GenBank or FlyBase. This currently applies to Aedes aegypti, Anopheles gambiae, Glossina morsitans, and Ixodes scapularis.

Genome sequences available for download are in "softmasked" format. This means that regions annotated with repeat features are written in lowercase characters, while all other sequence is in uppercase. Contrast this with the strict definition of "masking" (often called "hardmasking"), in which repetitive regions are converted to 'N' characters.

The very simple example below demonstrates the difference between the two types of masking. Note that unmasked or hardmasked sequence can be derived from softmasked sequence.
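
As an illustration, the two conversions can be sketched in a few lines of Python. This is a minimal sketch and not VectorBase's own tooling: the function names unmask and hardmask and the short sequence are made up for the example, and it assumes that every lowercase character in the input marks a repeat-annotated base.

```python
import re


def unmask(softmasked: str) -> str:
    """Return unmasked sequence: repeat regions are restored to uppercase."""
    return softmasked.upper()


def hardmask(softmasked: str) -> str:
    """Return hardmasked sequence: every lowercase (repeat) base becomes 'N'."""
    return re.sub(r"[a-z]", "N", softmasked)


# Hypothetical softmasked sequence; the lowercase run stands for an annotated repeat.
softmasked = "ACGTTGCAacgtacgtacgtGGCATTCA"

print(softmasked)            # softmasked: ACGTTGCAacgtacgtacgtGGCATTCA
print(unmask(softmasked))    # unmasked:   ACGTTGCAACGTACGTACGTGGCATTCA
print(hardmask(softmasked))  # hardmasked: ACGTTGCANNNNNNNNNNNNGGCATTCA
```

The reverse conversions are not possible: unmasked sequence carries no record of where the repeats were, and hardmasked sequence has lost the underlying bases, which is why softmasked sequence is the most flexible format to distribute.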

This project has been funded in whole or in part with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN272201400029C.