Retro(trans)posons are genetic elements that can amplify themselves in eukaryotic genomes via an RNA intermediate, which requires their transcription and reverse transcription. Retroposons are divided into three classes: LTR elements, LINEs, and SINEs. The elements that encode reverse transcriptase (RT), an enzyme providing for the reverse transcription and integration of the DNA copy into the genome, are called autonomous transposons. Nonautonomous retroposons rely on the RTs of autonomous transposons. LTR transposons and LINEs can be autonomous or nonautonomous; and their genomic copies are transcribed by the cellular RNA polymerase II.

Short interspersed elements (SINEs) are defined as relatively short (< 700 bp) nonautonomous retroposons transcribed by the cellular RNA polymerase III (pol III) from an internal promoter, while their reverse transcription depends on the RT of partner LINEs. Eukaryotic genomes can harbor hundreds of thousands (sometimes more) of SINE copies; copies originating from a common ancestral SINE can differ from each other by single-nucleotide alterations as well as by longer internal deletions or duplications (SINEs with such duplications are called quasidimeric). Some of them can become founders of new SINE subfamilies.

SINEs consist of two or more modules; typically, head, body, and tail. The 5'-terminal head originates from the cellular RNAs synthesized by pol III: tRNA, 7SL RNA, or 5S rRNA. The body is either of unknown origin or descends from a partner LINE. SINEs with a LINE-derived region mimic LINE RNA in the reverse transcription (such SINEs belong to the stringent group). The body can also contain a domain shared by distant SINE families (CORE and similar domains). The 3'-terminal tail is a sequence of variable length consisting of simple (often degenerate) repeats. In addition, two SINEs can combine into a dimeric SINE, thus giving rise to a new SINE family. SINEs consisting of the head and tail only are called simple, while dimeric, trimeric, etc. SINEs are called complex SINEs.

present in at least 100 copies per genome (except certain genomes where repetitive elements are not abundant, e.g., Arabidopsis thaliana)

with at least 60% identity with a tRNA species, 5S rRNA, or 7SL RNA in at least 60-nt overlap (unless the element transcription by pol III was confirmed experimentally). The identification of pol III promoters (e.g., boxes A and B) can serve only as an indication (but not a proof) that the sequence belongs to SINEs.
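The two criteria above can be expressed as a simple check. The following is a minimal sketch only; the function name and its arguments are hypothetical and are not part of SINEBase.

```python
def meets_sine_criteria(copy_number, identity_pct, overlap_nt,
                        pol3_confirmed=False, repeat_poor_genome=False):
    """Return True if an element passes both criteria listed above.

    Hypothetical helper: the thresholds (100 copies, 60% identity,
    60-nt overlap) come from the text; everything else is illustrative.
    """
    # Criterion 1: at least 100 copies, unless the genome is poor in repeats.
    copies_ok = repeat_poor_genome or copy_number >= 100
    # Criterion 2: similarity to tRNA, 5S rRNA, or 7SL RNA, unless pol III
    # transcription was confirmed experimentally.
    rna_ok = pol3_confirmed or (identity_pct >= 60 and overlap_nt >= 60)
    return copies_ok and rna_ok
```

For example, `meets_sine_criteria(5000, 72.0, 68)` passes, while an element with 40 copies fails unless it comes from a repeat-poor genome.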

SINEs should be distinguished from RNA pseudogenes: the pseudogenes are generated by the reverse transcription of the cellular RNAs (e.g., 5S rRNA) rather than of SINE RNAs transcribed from their genomic copies. In practical terms, most SINEs have extra (body) sequences, while simple SINEs have characteristic substitutions/indels shared with their source gene but not with the cellular RNA gene. In addition, SINEs significantly outnumber RNA pseudogenes.

The notion of ‘SINE family’ is widely used but not clearly defined. We consider a SINE family to be a set of SINEs

of a common origin and

consisting of the same modules in the same order (except the tail, which can vary even in the same species).

Thus, similar SINEs with different LINE-derived regions belong to different families. Long insertions are considered as modules. At the same time, internal deletions or duplications within modules do not give rise to a new family, whereas a combination of complete or almost complete SINEs (complex SINEs) is considered a new family (thus, pB1 and quasidimeric B1 are subfamilies of the same family, while dimeric Alu represents a distinct family). Finally, there are a few SINEs with quite similar structure but of independent origin (certain simple SINEs), which are considered different families.
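The family definition above amounts to comparing the origin and the ordered list of modules while ignoring the tail. A minimal sketch, assuming each SINE is described by a hypothetical origin label, a tuple of module names, and a tail unit (none of these names come from SINEBase):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SineDescription:
    origin: str      # hypothetical label of the ancestral SINE
    modules: tuple   # ordered modules without the tail, e.g. ("tRNA", "CORE", "L2")
    tail: str        # tail repeat unit; not used for family assignment


def same_family(a, b):
    """Same origin and same modules in the same order; the tail is ignored."""
    return a.origin == b.origin and a.modules == b.modules


x = SineDescription("anc1", ("tRNA", "CORE", "L2"), "A")
y = SineDescription("anc1", ("tRNA", "CORE", "L2"), "AT")   # differs in tail only
z = SineDescription("anc1", ("tRNA", "CORE", "RTE"), "A")   # different LINE-derived region
```

Here `same_family(x, y)` holds because only the tails differ, whereas `same_family(x, z)` does not because the LINE-derived modules differ.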

tRNA shows human tRNA genes with >70% (black), >75% (green), or >80% (red) identity with SINE family consensus in at least 60-nt overlap; in complex SINEs, similar tRNAs for each tRNA-derived monomer are separated by dotted lines.

LINE indicates the putative partner LINE; and the LINE clade is specified in square brackets (the partners were identified by the similarity with the 3’-terminal regions of SINEs and LINEs except mammalian L1, which was identified by the A-rich tail).

Tail is the repeat unit of the tail (sometimes degenerate; in particular, ‘A’ and ‘AT’ correspond to A- and AT-rich sequences).

Make sure that the genomic element analyzed is repetitive and nontandem. Try to evaluate the number of copies per genome if long genomic sequences are available. This may be less important when the sequence analyzed belongs to a species where presumably all SINEs have been described (e.g., mouse).

Define the boundaries of the element. Usually, these boundaries are clearly seen on SINE multiple alignments where similarity ends. Another way to define the limits of an individual SINE sequence is to find (degenerate) short direct repeats (commonly 8-16 nt) generated in the course of SINE reverse transcription/integration. The SINE sequence should lie between these repeats called target site duplications (TSDs). TSDs can be identified using our TSDSearch tool. Exclude the flanking sequence from further analysis. Truncate very long tails; 10-20 nt is enough. Whenever possible, use consensus rather than individual sequences.

If the element is longer than 1 kb, it is not a SINE. You can try to identify it by searching for similarities with other transposons.

Run SINESearch against the SINEBase using ~90% of the element length as the Min overlap length. If the search was successful and, in addition, (i) there are no long gaps in the alignment and (ii) the lengths of the query element and the hit consensus sequence in the SINEBase are similar, the genomic element analyzed can be assigned to the SINE family found. If the search was not successful, try slightly decreasing the Min overlap length. If this does not help, proceed to module analysis.

If the studied element consists of an RNA-derived region and a tail only, it may be an RNA pseudogene. Simple SINEs can be identified by characteristic substitutions/indels shared with their source SINE copy but not with the cellular RNA gene. In addition, SINEs significantly outnumber RNA pseudogenes.
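The length check and the starting overlap value from the steps above can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, and clamping to the 30-1000 nt range simply mirrors the allowed range of the overlap field.

```python
def starting_min_overlap(element_length, percent=90):
    """Suggest a starting minimum overlap (nt) for a SINEBase search.

    Hypothetical helper: rejects elements longer than 1 kb (not a SINE)
    and returns ~90% of the element length, clamped to 30-1000 nt.
    """
    if element_length > 1000:
        raise ValueError("element longer than 1 kb: not a SINE")
    overlap = element_length * percent // 100   # integer arithmetic, rounds down
    return max(30, min(1000, overlap))
```

For a 250-nt element this gives 225 nt; if the search fails, a slightly lower value can be tried by passing a smaller `percent`.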

Module analysis aims to identify the individual modules of a putative SINE.

Run SINESearch against the RNABase with 60% identity and 60 nt overlap. An absence of hits strongly indicates that the element analyzed is not a SINE.

Exclude the whole RNA-derived region and run SINESearch with the remaining sequence against RNABase (complex SINEs contain two or more RNA-derived regions), COREBase, and LINEBase in an attempt to identify known SINE modules. A search against SINEBase can also give a clue to the nature of a module. Adjust the search parameters to suit the query sequence and bank; try decreasing the values if the search was negative.

Exclude identified module(s) and repeat the previous step.

Note that SINEs of the same family have the same modules in the same order; at the same time, they can have relatively small deletions or internal duplications. The tail length and even the tail sequence are not markers of SINE families.

SINESearch is a FASTA-based search tool that uses simple parameters to select sequences of interest instead of FASTA's internal statistical significance test. This obviates two limitations of FASTA (as well as BLAST etc.) in the case of relatively short and degenerate similarities between nucleotide sequences of SINEs:

a bias toward short (almost) perfect matches, while the goal is to find full-length and significant similarities, and

missing significant hits when the bank includes many sequences similar to the query.

SINESearch is simple to use and fast. Specify the search parameters, bank to search, and query sequence, and press the ' Submit Query ' button to start the search. If an error message appears, press the ' Back to Previous Page ' button, enter the correct data (the ' Reset All ' button can be used to reset all fields), and press the ' Submit Query ' button again. The results are sorted by the best fit coefficient (reflecting correspondence between the total lengths of the sequences and the overlap length; note that it does not directly depend on the sequence identity). However, the results can be re-sorted by other parameters (sequence name, identity, or overlap) by clicking on the corresponding column headers. In the case of the SINEBase bank, the output contains links to the SINE Table, where you can find details about the SINE families found. If you are not satisfied with the results, try adjusting the parameters or redefining the query sequence limits.

SINESearch input fields:

Sequence identity. Allowed range: 40-100%. Generally, it is not recommended to decrease this value below 65%. The default value is 65% but it automatically changes to 60% for the RNA banks.

Min overlap length. Allowed range: 30-1000 nt. Use 90% of the query sequence length as the starting point; 60 nt is recommended for the RNABase bank; use common sense: some modules can be as short as ~30 nt, while others are longer. The default value is 70 nt but it automatically decreases to 60 and 40 nt for the RNA and CORE banks, respectively.

SINEBank is our bank of consensus sequences of SINE families. The consensus sequences specify the source and some other significant information (such as the previous name) in square brackets, which is followed by the distribution range (as in the SINETable).

Sequence Entry. The query sequence can be entered manually (typed or pasted) or uploaded from a local file. The sequence must be in FASTA format. Only IUPAC nucleotide symbols are allowed. The maximum sequence length is 2 kb. Multiple query sequences are not supported. Either enter or upload a query sequence; when a sequence file is uploaded, the manual sequence entry field is disabled and vice versa (use the ' Clear ' buttons to reset the file upload box or the manual sequence entry fields, while ' Reset All ' resets all fields). The query sequence can be shortened from both ends using the Offset parameters (note that the numbering of the full-length sequence is preserved).
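The input constraints listed above can be checked before submitting a query. The sketch below is not the SINESearch implementation; the bank labels, function names, and the exact IUPAC symbol set (gap symbols excluded) are assumptions made for illustration.

```python
# IUPAC nucleotide symbols (assumption: gap symbols are not accepted).
IUPAC_NT = set("ACGTURYSWKMBDHVN")

# (identity %, min overlap nt) defaults as stated in the text;
# the bank keys are hypothetical labels.
DEFAULTS = {"SINE": (65, 70), "LINE": (65, 70), "RNA": (60, 60), "CORE": (65, 40)}


def check_parameters(identity, min_overlap):
    """Return a list of problems with the two numeric search parameters."""
    problems = []
    if not 40 <= identity <= 100:
        problems.append("sequence identity must be 40-100%")
    if not 30 <= min_overlap <= 1000:
        problems.append("min overlap must be 30-1000 nt")
    return problems


def parse_single_fasta(text, max_len=2000):
    """Parse one FASTA record and enforce the query constraints."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("query must be in FASTA format")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("multiple query sequences are not supported")
    name = lines[0][1:].strip()
    seq = "".join(lines[1:]).upper()
    if not seq:
        raise ValueError("empty sequence")
    bad = set(seq) - IUPAC_NT
    if bad:
        raise ValueError("non-IUPAC symbols: " + "".join(sorted(bad)))
    if len(seq) > max_len:
        raise ValueError("sequence longer than 2 kb")
    return name, seq
```

For instance, `DEFAULTS["RNA"]` yields the 60% identity and 60-nt overlap recommended for RNABase searches, and `check_parameters(65, 70)` returns an empty list.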

TSDSearch is a tool to search for relatively short target site duplications of genomic DNA that commonly frame retrotransposons including SINEs. Found repeats are shown as arrows below the sequence ruler and as sequences with coordinates. TSDs are sorted by a compromise between total length and length of matches, so that the 'best' TSDs are listed first.

Technically, TSDSearch is implemented in JavaScript, which means that the calculation is performed by your web browser/computer. JavaScript must therefore be enabled in your browser. Execution time varies significantly between browsers; overall, more recent browsers execute TSDSearch faster. Chrome showed the best performance among the popular browsers tested.

Search Area. A typical task in SINE analysis is to identify TSDs framing SINE sequence(s). In this context, the region where TSDs are searched should include neither the SINE sequence proper nor areas too distant from it. Blind analysis of the whole region not only substantially increases the computation time but also complicates data interpretation. Setting the 5' and 3' offset values as well as the lengths of regions to analyze (ranges) makes it possible to focus on the desired areas. Bear in mind that the tails can vary substantially in length, so it is good practice to increase the 3' range relative to the 5' range. On the other hand, the 5' range has the greatest impact on calculation time, so do not increase this value unless necessary.

TSD parameters. The search algorithm considers a TSD as three blocks of nucleotides (subrepeats) identical between the 3' and 5' TSDs (i.e., subrepeat 1 in the 5' TSD is identical to subrepeat 1 in the 3' TSD etc.). Subrepeats can be separated by variable spacers. For instance, 5' TSD: (ACCT)a(GGG)(TAC) and 3' TSD: (ACCT)(GGG)ac(TAC); subrepeats are shown in parentheses and spacers are in lowercase. Subrepeats cannot be shorter than the subrepeat min lengths specified, while spacers cannot be longer than the max length of spacers. Min length of total match (without gaps) and max total mismatch length allow fine tuning of TSD length and similarity, respectively. Finally, the number of displayed TSDs for a query sequence is limited by the max number of TSDs (specify '1' to show the best TSD only).
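The three-block model can be illustrated with the example above. The sketch below does not reproduce the TSDSearch search algorithm; it only verifies that two candidate TSDs fit a given set of subrepeats with bounded spacers. The function names are hypothetical.

```python
import re


def tsd_pattern(subrepeats, max_spacer):
    """Build a regex: the subrepeats in order, separated by 0..max_spacer nt."""
    spacer = "[ACGT]{0,%d}" % max_spacer
    return re.compile(spacer.join(re.escape(s) for s in subrepeats), re.IGNORECASE)


def fits_model(tsd5, tsd3, subrepeats, max_spacer):
    """True if both TSDs consist of the same subrepeats with short spacers."""
    pattern = tsd_pattern(subrepeats, max_spacer)
    return bool(pattern.fullmatch(tsd5)) and bool(pattern.fullmatch(tsd3))


def total_match_length(subrepeats):
    """Length of the total match without gaps (sum of subrepeat lengths)."""
    return sum(len(s) for s in subrepeats)


blocks = ["ACCT", "GGG", "TAC"]
ok = fits_model("ACCTaGGGTAC", "ACCTGGGacTAC", blocks, max_spacer=2)
too_strict = fits_model("ACCTaGGGTAC", "ACCTGGGacTAC", blocks, max_spacer=1)
```

With spacers of up to 2 nt the pair from the example fits the model (`ok` is `True`, total match length 10); limiting spacers to 1 nt rejects it because of the 2-nt spacer "ac" in the 3' TSD.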

Sequence Entry. The query sequence can be entered manually (typed or pasted) or uploaded from a local file (if you use a recent browser that supports HTML5 file operations). The sequence must be in FASTA format. Only A, C, G, and T nucleotide symbols are allowed in the search area. Note that U, N, and X are not allowed. Gaps ('~', '–', & ' ') are allowed and ignored. Multiple query sequences can be analyzed, but all sequences must have unique names. Naturally, all sequences should be longer than the sum of the left and right offsets and ranges. Either enter or upload a query sequence; when a sequence file is uploaded, the manual sequence entry field is cleared and vice versa (use the ' Clear ' buttons to reset the file upload box or the manual sequence entry fields, while ' Reset All ' resets all fields).

All parameters and sequences are checked prior to TSD search. Analysis will not start if any parameter or sequence does not conform to the requirements. In this case, balloon message(s) appear near the field to be corrected.
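The sequence checks described above can be sketched as follows. This is an illustration rather than the TSDSearch code: the function name is hypothetical, both the hyphen and the en dash are treated as gap symbols by assumption, and for simplicity the A/C/G/T rule is applied to the whole sequence rather than to the search area only.

```python
# Gap symbols that are allowed and ignored (assumption: '-' and en dash both count).
GAP_TABLE = str.maketrans("", "", "~-\u2013 ")


def validate_tsd_queries(records, offset5, range5, offset3, range3):
    """Check (name, sequence) pairs and return {name: gap-free sequence}."""
    required = offset5 + range5 + offset3 + range3
    cleaned = {}
    for name, seq in records:
        if name in cleaned:
            raise ValueError("duplicate sequence name: " + name)
        seq = seq.upper().translate(GAP_TABLE)   # gaps are ignored
        bad = set(seq) - set("ACGT")
        if bad:
            raise ValueError(name + ": symbols not allowed: " + "".join(sorted(bad)))
        if len(seq) <= required:
            raise ValueError(name + ": shorter than the offsets and ranges require")
        cleaned[name] = seq
    return cleaned
```

A call such as `validate_tsd_queries([("s1", "ACGT-ACGT~ACGT ACGT")], 1, 2, 1, 2)` returns the gap-free sequence, while a duplicated name or an `N` in the sequence raises an error.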

We encourage the submission of new data on SINE families. Please make sure that your SINE complies with the requirements and provide all necessary information, which includes the submitter's data (name, affiliation, etc.), SINE data, and publication (if any). SINE data includes the SINE family name, consensus sequence, taxonomic distribution, copy number, tail repeat unit, and comments.

Please avoid 'SINE' in the SINE family name; we recommend the first letters of the taxon limiting its distribution (e.g., Gli-1 for a SINE found in dormice (Gliridae) rather than GliSINE1, SINE2_DOR, etc.).

Only IUPAC nucleotide symbols are allowed in the consensus and tail sequences. The consensus sequence should contain from 60 to 1000 symbols.
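The naming and consensus rules above lend themselves to a quick pre-submission check. A minimal sketch with a hypothetical function name; the IUPAC symbol set used here (without gap symbols) is an assumption.

```python
IUPAC_NT = set("ACGTURYSWKMBDHVN")


def check_submission(family_name, consensus):
    """Return a list of problems with a proposed family name and consensus."""
    problems = []
    if "sine" in family_name.lower():
        problems.append("avoid 'SINE' in the family name")
    if not 60 <= len(consensus) <= 1000:
        problems.append("consensus must contain 60 to 1000 symbols")
    if set(consensus.upper()) - IUPAC_NT:
        problems.append("only IUPAC nucleotide symbols are allowed")
    return problems
```

For example, `check_submission("Gli-1", "ACGT" * 20)` reports no problems, whereas a name such as `GliSINE1` combined with a 40-nt consensus is flagged twice.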

Please evaluate the taxonomic range and copy number of the family. Even rough estimates are better than nothing. Reports based on a single sequence are unacceptable and will not be considered.

You may send any supplemental data (e.g., a multiple alignment or a PDF of the publication) as an attachment file. Please provide a description of the attachment in the comments field. Do not send files larger than 5 MB or executables (.exe, .com etc.; potentially dangerous files will cause the submission to be treated as spam).

As you proceed to the next field (as well as when you press the ' Validate & Send Data ' button), error prompts may appear (e.g., 'This field is required' or 'Invalid email address'). You won't be able to send data without fulfilling all the requirements. Please contact us if you find some requirements irrelevant (e.g., you have discovered a 55-nt SINE). If there are no (more) errors, pressing the ' Validate & Send Data ' button sends your data to the SINEBase. You will promptly receive an automatic confirmation e-mail. Please allow some time for us to review your submission.

Place the mouse cursor over elements of interest (e.g., references in SINETable or abbreviations or nucleotides in consensus sequences) for additional information.

SINETable contents can be filtered using boxes below certain fields. E.g., enter 7SL in the Structure box to view only SINEs with a 7SL RNA-derived region. The filters are case-sensitive and support Perl-like regular expressions. For instance, ^7SL in the Structure box will show all SINEs with a 7SL RNA-derived region at the 5' end; tRNA.+LINE will show SINEs with a tRNA-derived region and a downstream LINE-derived one (more examples can be found here). Empty the box to remove filtering.

Similarly, SINETable contents can also be filtered by a high-rank taxon (or taxa) in the Taxon box; both Latin and common names can be used (e.g., Aves or birds ).
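The effect of the Structure-box patterns mentioned above can be reproduced with any regular-expression engine. The sketch below uses Python's `re` module, whose syntax agrees with Perl-like expressions for these simple patterns; the structure strings are invented for illustration and are not taken from SINETable.

```python
import re

# Hypothetical Structure-column values.
structures = ["7SL", "tRNA CORE LINE", "tRNA 7SL", "5S LINE", "tRNA"]


def filter_rows(rows, pattern):
    """Keep rows that match the pattern anywhere (case-sensitive search)."""
    rx = re.compile(pattern)
    return [row for row in rows if rx.search(row)]


with_7sl = filter_rows(structures, "7SL")            # any 7SL-derived region
head_7sl = filter_rows(structures, "^7SL")           # 7SL at the 5' end only
trna_line = filter_rows(structures, "tRNA.+LINE")    # tRNA followed by LINE
```

Because matching is case-sensitive, a pattern such as `trna` selects nothing from this list.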

Click on a column header in SINETable to sort by that column; a second click reverses the order.

SINEs can be selected manually by ticking the checkbox left of the family name in the SINETable and clicking the SINE header to show selected items first.

SINETable filtering and sorting can be combined; e.g., you can view all bony fish SINEs with an L2-derived region sorted by taxon. All filtering/sorting can be reset by reloading the page (F5 in many web browsers). The number (and proportion) of selected elements is immediately shown below the SINE header.

Clicking a SINE family name or a reference in the Table will redirect you to the consensus sequence or the reference, respectively. Clicking items in the Features column will redirect you to their descriptions.

Rainbow animation in the top frame can be stopped by clicking the animated text.