Abstract

Background

Long terminal repeat (LTR) retrotransposons make up a large fraction of the typical
mammalian genome. They comprise about 8% of the human genome and approximately 10%
of the mouse genome. On account of their abundance, LTR retrotransposons are believed
to hold major significance for genome structure and function. Recent advances in genome
sequencing of a variety of model organisms have provided an unprecedented opportunity
to better evaluate the diversity of LTR retrotransposons resident in eukaryotic genomes.

Results

Using a new data-mining program, LTR_STRUC, in conjunction with conventional techniques,
we have mined the GenBank mouse (Mus musculus) database and the more complete Ensembl mouse dataset for LTR retrotransposons. We
report here that the M. musculus genome contains at least 21 separate families of LTR retrotransposons; 13 of these
families are described here for the first time.

Conclusions

All families of mouse LTR retrotransposons are members of the gypsy-like superfamily of retroviral-like elements. Several different families of unrelated
non-autonomous elements were identified, suggesting that the evolution of non-autonomy
may be a common event. High sequence similarity between several LTR retrotransposons
identified in this study and those found in distantly-related species suggests that
horizontal transfer has been a significant factor in the evolution of mouse LTR retrotransposons.

Background

Retrotransposons are mobile genetic elements that make up a large fraction of most
eukaryotic genomes. All retrotransposons are distinguished by a life cycle involving
an RNA intermediate. The RNA genome of a retroelement is copied into a double-stranded
DNA molecule by reverse transcriptase, which is subsequently integrated into the host's
genome. Retrotransposons fall into two main categories: those with long terminal repeats
(LTRs), such as retroviruses and LTR retrotransposons, and those that lack such repeats,
for example, long interspersed nuclear elements (LINEs).

Retrotransposons are particularly abundant in plants, where they are often a principal
component of nuclear DNA. In corn, 50-80%, and in wheat fully 90%, of the genome is
made up of retrotransposons [1,2]. This percentage is generally lower in animals than in plants but it can still be
significant. For example, about 8% of the human genome is now known to be composed
of LTR retrotransposons [3]. In the mouse genome this figure has been estimated at 10% [4].

This article presents the results of a recent survey (December 2002) of the GenBank
mouse (M. musculus) database (GBMD) and the 2.9 Gbp Ensembl [5] mouse dataset (EMD) for the presence of LTR retrotransposons. We have employed a
new search program, LTR_STRUC (LTR retrotransposon structure program), as the initial
data-mining tool in our survey [6]. Identified elements were subjected to sequence analyses to identify open reading
frames (ORFs) encoding reverse transcriptase (RT) and other retroviral proteins. LTR_STRUC
finds only full-length elements, that is, ones having two LTRs and a pair of target
site duplications (TSDs). We therefore augmented our search approach by conducting
BLAST searches using reverse transcriptase queries. These queries are of two types:
previously known RTs in the public database from mouse and other mammals, and RTs
obtained from our initial scan of the EMD with LTR_STRUC. Subsequent RT sequence alignments
were carried out, followed by construction of phylogenetic trees.

An LTR retrotransposon 'family' is defined as a group of elements with RTs at least
90% similar at the amino acid level [7]. Experience has shown that when two elements have RTs that are 90% similar, their
LTRs are typically about 60% similar. Thus, non-autonomous elements, lacking an RT
ORF, are assigned to the same family if their LTRs are at least 60% similar. Many
LTR retrotransposons replicate non-autonomously. Four different families of murine
LTR retrotransposons have non-autonomous members (MalR elements, ETn elements, VL30 elements and a new type identified in this study, related to IAP elements). These non-autonomous elements are discussed below. Non-autonomous elements
can reach a high copy number even though they lack an RT ORF [4,8-11].
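
The family criterion described above can be written as a simple decision rule. The following is a minimal sketch only, not the procedure used in this study. It assumes that the two sequences being compared have already been aligned to equal length (with '-' marking gaps), it uses percent identity as a stand-in for the similarity measure, and the names `percent_identity` and `same_family` are hypothetical.

```python
from typing import Optional

RT_THRESHOLD = 90.0   # percent, amino acid level (family definition in the text)
LTR_THRESHOLD = 60.0  # percent, used for elements lacking an RT ORF


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # gap columns are ignored (an assumption of this sketch)
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def same_family(ltr_a: str, ltr_b: str,
                rt_a: Optional[str] = None,
                rt_b: Optional[str] = None) -> bool:
    """Apply the RT criterion when both RTs are available, else the LTR criterion."""
    if rt_a is not None and rt_b is not None:
        return percent_identity(rt_a, rt_b) >= RT_THRESHOLD
    return percent_identity(ltr_a, ltr_b) >= LTR_THRESHOLD
```

Under these assumptions, a non-autonomous element is compared with a candidate family through its LTRs alone, whereas two autonomous elements are compared through their RTs.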

Currently there is no standard mouse retrotransposon nomenclature. In our system of
classification for mouse, LTR retrotransposons are specified by the acronym Mmr (M. musculus retrotransposon). Distinct families are indicated by number (for example, Mmr1, Mmr2, Mmr3). We have chosen to adopt the Mmr nomenclature in this study because it is consistent with the systematic logic ('Mm'
indicates the genus and species of the host organism; 'r' indicates retrotransposon)
used in previous articles [8,12]. In each case where we use the Mmr acronym in this article to refer to a previously named family, we also include any
pre-existing name for the family.

Results and discussion

RTs from elements identified in our survey fall into numerous distinct families. All
autonomous LTR retrotransposons identified were gypsy-like elements (Classes I, II, and III). Autonomous retroviral-like elements in the
mouse genome usually have an overall length of between 6,000 and 9,000 bp. Results
of our study indicate that the TSDs of mouse LTR retrotransposons are four to six
base pairs long and that within each of the three major classes of these elements
a single TSD length is characteristic (see below). With the exception of a few mutated
copies, mouse LTR retrotransposons seem to have the same canonical dinucleotides terminating
the LTRs as are typically found in other species (TG/CA). The LTRs of murine retroviral-like
elements are generally 300-600 bp long, with the exception of mouse mammary tumor
virus (MMTV) where the LTRs are some 1,300 bp in length. Our survey shows that at
least 21 distinct LTR retrotransposon families exist in the mouse genome, 13 of which
have not been described previously.
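
The structural hallmarks summarized above (TG/CA dinucleotides terminating each LTR, and TSDs of four to six base pairs) lend themselves to a simple check on a candidate element. The sketch below is illustrative only and is not the algorithm implemented in LTR_STRUC. The function names and the way the flanking sequences are passed in are assumptions made for this example.

```python
def has_canonical_termini(ltr: str) -> bool:
    """True if the LTR begins with TG and ends with CA."""
    s = ltr.upper()
    return len(s) >= 4 and s.startswith("TG") and s.endswith("CA")


def tsd_length(upstream_flank: str, downstream_flank: str,
               min_len: int = 4, max_len: int = 6) -> int:
    """Length of the target site duplication flanking an element.

    upstream_flank ends immediately before the 5' LTR; downstream_flank
    starts immediately after the 3' LTR. Returns the longest length in
    min_len..max_len for which the two flanks carry the same bases,
    or 0 if no such direct repeat is found.
    """
    up = upstream_flank.upper()
    down = downstream_flank.upper()
    for n in range(max_len, min_len - 1, -1):
        if len(up) >= n and len(down) >= n and up[-n:] == down[:n]:
            return n
    return 0
```

For instance, `tsd_length("GGATCGTAC", "CGTACTTAA")` returns 5 because the five bases CGTAC occur on both sides of the hypothetical element.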

LTR retrotransposon families of the murine genome

Overview

To date, LTR retrotransposon diversity has been rigorously classified into families
for only a few organisms (for example, Oryza sativa [8], Drosophila melanogaster [7] and Caenorhabditis elegans [12]). This article represents a first attempt to establish a similar uniform classification
and nomenclature for the domestic mouse. Previous studies have classified murine retrotransposons
into broad categories only, which ignore the standard definition of 'family' (see
above). For example, the term 'intracisternal type A particle' (IAP) has been used
to refer to elements that belong to several distinct LTR-retrotransposon phylogenetic
groups. The autonomous elements identified in our survey of the GBMD and EMD fall
into 20 families on the basis of degree of RT divergence (greater than 10% divergence denotes
a separate family). In addition, we have classified MalR elements, which are non-autonomous, into a twenty-first family that is closely related
to MuERV-L elements, because these two types of transposons have similar LTRs. MusD and ETn elements form a second pair of related autonomous and non-autonomous elements; MmERV and VL30 elements constitute a third. These three paired families are discussed in more detail
below.

Our analysis supports previous categorization [4] of mouse LTR retrotransposons into three distinct classes (Figure 1): Class I, containing elements related to retroviral leukemia viruses in mouse (MuLV) and other species (for example, gibbon: GALV and cat: FeLV); Class II which contains the IAP elements, mouse mammary tumor virus (MMTV) and the MusD2/ETn family; and Class III which comprises the MalR and MuERV-L elements. In using these names for the three main categories of murine LTR retrotransposons
we follow the usage of the Mouse Genome Sequencing Consortium [4], but the reader is cautioned that the same terminology has been used to designate
RNA-based transposons (Class I) and DNA-based transposons (Class II). Here, however,
all three classes are RNA-based LTR retrotransposons.

Figure 1. Unrooted RT-based neighbor-joining tree for all three classes of murine retrotransposons.
RT sequences from host species other than mouse are included for comparison.

Class I (families 1-4)

Members of this class make up 0.68% of the mouse genome (copy number about 34,000)
[4]. They have 4-bp TSDs and are related to murine leukemia virus (MuLV; AF033811), a C-type retrovirus that occurs only in mice and is a major cause of cancer
in that genus. Class I, to which MuLV belongs, contains at least three other families: Mmr1_MmERV, Mmr3_MuRRS, and Mmr4. In this article, MuLV is referred to as family Mmr2_MuLV. Class I endogenous retroviruses are more closely related to elements in other species
than to mouse retroelements belonging to Classes II or III. RTs from endogenous retroviruses
in pig (PK15; AF038601) and koala (KoRV; AAF15098), as well as from leukemia viruses in gibbon (GALV; AAA466810) and cat (FeLV; L06140), group with this class; their RTs are all about 80% similar at the amino acid
level to those of murine Class I elements. One member of Class I is found in two different
mouse species, M. musculus and M. dunni, and has previously been referred to as either MmERV (in M. musculus) or MDEV (in M. dunni) [13]; here it is referred to as Mmr1_MmERV. The identity of this family in these two species is demonstrated by the presence
of an element (AAC31805) in the M. dunni (Indian pigmy mouse) genome, which is 96% similar (at the amino acid level) to members
of Mmr1_MmERV resident in M. musculus (Figure 2). This finding is consistent either with a recent common origin of these two mouse
species or with a horizontal transfer of this retrovirus. This virus may be infectious
since an envelope protein sequence is present in the GenBank database (AAC31806) for
the M. dunni retrovirus and has also been detected in copies of this family during our own survey
of M. musculus. Mmr4 is a previously unrecognized Class I family, with members about 80% similar to those
of Mmr2_MuLV. Family Mmr3_MuRRS includes the so-called murine retroviral related sequences (MuRRS). A known human endogenous retrovirus type C oncoviral sequence (AAA73090) is approximately
56% similar at the amino acid level to members of Class I. BLAST searches with RT
queries from Class I indicate that at least some elements in the human genome are
even more similar (>65%) to Class I elements in mouse (for example, HSAP-2; Figure 2 and Table 1).

Figure 2. RT-based neighbor-joining tree for Class I murine retrotransposons. The distances
(uncorrected 'p') appear next to each of the branches. RT sequences from host species
other than mouse are included for comparison. The outgroup is the Class II element
GH-G18 (from golden hamster, Mesocricetus auratus; see Table 3 and Figure 3).

Class II (families 5-19)

Class II retroviral-like elements make up 3.14% of the mouse genome (copy number approximately
127,000) [4]. This class contains 15 of the 21 murine LTR families. Its members have 6-bp TSDs
and are related to MMTV (NC_001503), an oncogenic B-type retrovirus that causes breast
cancer in mice. Our survey has revealed only three full-length copies of the MMTV family
(Mmr11_MMTV) in the mouse genome. MMTV contains an ORF coding for envelope protein (BAA03768). Mmr11_MMTV RTs are 75% similar to those of a separate endogenous mouse family, Mmr16. For the most part, Mmr16 seems to be represented in the mouse genome by fragmentary elements, but the full-length
element Mmr16-1 described in Table 2 has a full complement of retroviral genes, including an envelope ORF, as is the case
with MMTV.

Another family in Class II, Mmr19_MusD, has been previously described under the name MusD. Mager and Freeman [9], who discovered this family, showed that the non-autonomous mouse ETn retroelements (early transposons) are deletion derivatives of Mmr19_MusD. ETn elements are so closely related to MusD elements that we have assigned them to the same family. Most ETn copies
are around 5,500 bp long, while MusD copies are usually around 7,400 bp in
length. ETn elements (Y17107; AB033509), first reported by Brulet et al. [14], are a moderately repetitive family of murine retrotransposons that lack most of
the usual retroviral ORFs. Our survey with LTR_STRUC suggests that full-length copies
of ETn elements are about half as common again as full-length MusD elements. Family Mmr12 is about 80% similar to Mmr19_MusD. Both of these families are 70% similar to Mason-Pfizer Monkey Virus (MPMV; NC_001550).
The RTs of MusD elements have an unusual active site sequence: FTDDVLM ('T' is not canonical for an
active site) [14]. Class II contains an additional clade (see Figure 3), comprising at least eight additional families (Mmr6, Mmr7, Mmr9, Mmr10_IAP, Mmr14, Mmr15, Mmr17, and Mmr18), with no two families differing from each other by more than 70%. The major constituents
of this clade are the IAP retrotransposons, the second most abundant family in the
mouse genome, here referred to as family Mmr10_IAP. They lack complete env genes [15] and thus are considered non-infective. Murine elements identified in GenBank as IAP
(for example, GNPSIP and GNMSIA) are restricted to family Mmr10_IAP. Nevertheless, members of any of the eight families listed above have been described
as IAP by various authors. In addition, a family of retroelements in golden hamster
(GH-G18; Figure 3) has been described as 'IAP' but does not actually belong to the Mmr10_IAP family (its RT ORFs differ from those of Mmr10_IAP by about 18% at the amino acid level). Thus, in mice, the term IAP might best be restricted
to Mmr10_IAP. Numerous IAP elements share a common, 1,800-bp deletion that includes the upstream
end of the RT. Yet these elements were, and perhaps still are, capable of transposing
as evidenced by the fact that copies with the same deletion were found on many different
chromosomes. Even shorter, internally-deleted elements, with two LTRs and ostensibly
capable of transposition, can be assigned to Mmr10_IAP on the basis of LTR similarity (down to about 2,700 bp in overall length).

Figure 3. RT-based neighbor-joining tree for Class II murine retrotransposons. The distances
(uncorrected 'p') appear next to each of the branches. RT sequences from host species
other than mouse are included for comparison. The outgroup is the Class I element
MDEV (from house/rice field mouse, M. dunni; see Table 3 and Figure 2).

Class III (families 20 and 21)

Members of this class make up 5.40% of the mouse genome (copy number about 442,500)
[4]. They have 5-bp TSDs. Class III has two constituents: murine ERV-L elements, which have an estimated copy number of 37,000 [4]; and the non-autonomous MalRs (mammalian apparent LTR retrotransposons), which are the most common retroviral
element in the mouse genome, making up 4.8% of the mouse genome [4]. MuERV-L elements are closely related to human endogenous retrovirus L (HERV-L). In BLAST searches we have identified a human element (HSAP-1; Table 1 and Figure 4) that is 85% similar at the amino acid level to MuERV-L RTs. Because alignments show that their LTRs are 51% similar, we conclude that murine
MalRs and MuERV-L elements share a recent common ancestor. However, as they are not quite sufficiently
similar to be members of the same family, we have assigned these families the names
Mmr20_MuERV-L and Mmr21_MaLR.

Figure 4. RT-based neighbor-joining tree for Class III murine retrotransposons. Distances (uncorrected
'p') appear next to each of the branches. RT sequences from host species other than
mouse are included for comparison. The outgroup is the Class II element GH-G18 (from golden hamster, Mesocricetus auratus; see Table 3 and Figure 3).

Like MalRs in other species, murine MalRs are all internally deleted. The internal region contains only non-coding repetitive
DNA. Nevertheless, they have typical LTRs, a primer binding site and a polypurine tract.
Members of Mmr21_MaLR are of two types: MT MalRs - the most common type of LTR retrotransposon in the mouse genome (mean length approximately
1,980 bp); and ORR1 MalRs (mean length approximately 2,460 bp). Our survey suggests that in the mouse genome,
MT MalRs are about ten times as common as their longer relatives, the ORR1 MalRs. Non-truncated copies of Mmr20_MuERV-L elements have an overall length of about 6,400 bp.

Length variation in murine LTR retrotransposons

Although all copies of family Mmr10_IAP found by LTR_STRUC have two LTRs and recognizable TSDs (as required by the search
algorithm employed by the program), the individual members of this abundant family
vary widely in overall length (2,700-7,200 bp) due to the presence of internal deletions
of varying length. On the other hand, the two abundant types of non-autonomous Class
III elements (MT and ORR1 MalRs) exhibit a markedly different pattern of variation from that of Mmr10_IAP elements. Lengths of ORR1 MalRs peak sharply at 2,300 bp and those of MT MalRs at 1,980 bp, with very few elements in either case differing from these peak frequencies
by more than 100 bp (<1%). Moreover, Mmr10_IAP elements of all lengths, from the shortest to the longest, are predominantly represented by copies with
a high level of LTR-LTR identity (>99%), a finding consistent with recent transposition.
The ability of internally truncated Mmr10_IAPs to complete their replication cycle is consistent with the fact that a number of
Mmr10_IAP copies bearing the same 1,800-bp deletion (affecting the polyprotein ORF) were found
in our survey on a variety of different mouse chromosomes. A similar dispersed distribution
of lengths was observed in two other families, Mmr19_MusD and Mmr1_MmERV. Comparison of a VL30 element (AF486451) with our data revealed a high degree of
LTR-LTR similarity (>90%) to elements in family Mmr1_MmERV; VL30 elements are therefore assigned to that family (VL30s are non-autonomous and cannot be compared
with other elements on the basis of RT similarity).

Interspecific considerations

Certain families of mouse LTR retrotransposons are more closely related to elements
present in other species than to other classes of mouse elements. For example, murine
Class I elements are more similar to viruses in gibbon, pig, cat, and koala, than
to murine retrotransposons of Classes II or III (Figure 2). Among Class II murine endogenous retroviruses (Figure 3), family Mmr10_IAP is more closely related to the golden hamster element GH-G18 than it is to any other family of murine retroviral elements. Similarly, the amino
acid sequences (RT ORFs) of members of Mmr20_MuERV-L (mouse Class III elements, Figure 4) differ from those of a human element (HSAP-1, Table 3) by only 15%, but differ from those of any non-Class III element by more than 60%.
Such findings suggest that horizontal transfer may have been a source of new mouse
LTR retrotransposon families over evolutionary time.

Our purpose in using LTR_STRUC to begin our survey of the mouse genome was to obtain
a broadly representative sample of murine retrotransposons. Since the algorithm it
employs is not dependent upon sequence homology, as in standard search methods such
as BLAST, the initial results of our survey presumably were not biased toward a particular
set of queries. Also, since the current version of LTR_STRUC now categorizes the elements
it locates and assigns a new name to any element that differs sufficiently from any
found earlier in the search, the chances of overlooking low-copy families have been
reduced. The thoroughness of our BLAST search can only have been augmented by using
LTR_STRUC because, in the BLAST phase of our survey, the queries used were a combination
of those element types already recognized, prior to our investigation, with those
found by LTR_STRUC. We believe this approach is the reason we were able to identify
the 13 previously unreported families listed above.

Materials and methods

Using a new data-mining program, LTR_STRUC [6], we have mined the Ensembl mouse (M. musculus) dataset [5] for LTR retrotransposons. We have used elements found in this initial search, as
well as murine LTR retrotransposons identified by previous workers, to conduct BLAST
searches of the GenBank mouse database.

Automated characterization of LTR retrotransposons

The methods used in our survey of the mouse genome are essentially the same as those
used in our earlier study of the rice genome and are described elsewhere [8]. Briefly, we began our survey by using a new computer program, LTR_STRUC, which identifies
new LTR retrotransposons based on the presence of characteristic retroelement features
[6]. Additional elements were identified by BLAST searches using the RTs, both of elements
located by LTR_STRUC and of ones recognized in earlier studies.

Datasets scanned

Initial scans with LTR_STRUC were conducted on a dataset consisting of the 2.9 Gbp
of M. musculus sequence data available in the Ensembl database at the time of the initial scan (December
2002). The dataset (EMD) was obtained from the Ensembl website [5]. In an effort to identify additional elements not picked up in the initial survey
with LTR_STRUC, we have used representative sequences from each retrotransposon family
identified in this study as queries to conduct BLAST searches against the GenBank
mouse database (GBMD). Thus, the results reported here constitute a reasonably unbiased
survey of LTR-retrotransposon diversity in mouse. RT sequences were identified according
to previously described criteria [16,17].

Multiple sequence alignments and phylogenetic analyses

The RT domains of the various Mmr elements were aligned, as described elsewhere [8], with previously reported RT sequences (Table 3). In the case of elements lacking an RT sequence because of fragmentation or internal
truncation, the LTR sequences were used to assign them to the proper family.
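
The branch distances reported in Figures 2-4 are uncorrected 'p' distances, that is, the proportion of compared sites at which two aligned sequences differ. As a hedged illustration of that quantity (not the software used for the trees in this study), the sketch below computes pairwise p-distances from a set of pre-aligned RT sequences. The function names, the handling of gap columns, and the toy sequences in the usage note are all assumptions.

```python
from itertools import combinations
from typing import Dict, Tuple


def p_distance(a: str, b: str) -> float:
    """Uncorrected p-distance: differing sites / compared sites (gap columns skipped)."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # ignoring gap columns is an assumption of this sketch
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return diffs / compared


def distance_matrix(aligned: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Pairwise p-distances for every pair of named, aligned sequences."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(sorted(aligned), 2)
    }
```

With the toy input `{"a": "MKLV", "b": "MKLI"}`, `distance_matrix` gives a distance of 0.25 for the pair `("a", "b")`. A matrix of this kind is the usual input to a neighbor-joining procedure, and under the family definition used here an RT p-distance above 0.10 would place two elements in separate families.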