RNA-seq Bioinformatics

Trinotate

Now we have a set of transcript sequences and have identified a subset of them that appear to be biologically interesting, in that they’re differentially expressed between our two conditions. But we don’t yet know what they are or what biological functions they might represent. We can explore their potential functions by functionally annotating them using our Trinotate software and analysis protocol. To learn more about Trinotate, you can visit the Trinotate website.

Again, let’s make sure that we’re back in our primary working directory called ‘trinity_workspace’:

pwd
/home/ubuntu/workspace/trinity_workspace

If you’re not in the above directory, then relocate yourself to it.

Now, create a Trinotate/ directory and relocate to it. We’ll use this as our Trinotate computation workspace.
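A minimal sketch of this step, assuming you are starting from the trinity_workspace directory shown above:

```shell
# Create the Trinotate/ workspace (no error if it already exists) and move into it.
mkdir -p Trinotate
cd Trinotate
# Confirm where we ended up.
pwd
```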

Below, we’re going to run a number of different tools to capture information about our transcript sequences.

Identification of likely protein-coding regions in transcripts

TransDecoder is a tool we built to identify likely coding regions within transcript sequences. It identifies long open reading frames (ORFs) within transcripts and scores them according to their sequence composition. Those ORFs that encode sequences with compositional properties (codon frequencies) consistent with coding transcripts are reported.

Running TransDecoder is a two-step process. First run the TransDecoder step that identifies all long ORFs.

There are a few items to take notice of in the peptide (.pep) file that TransDecoder produces. Each header line includes the protein identifier, composed of the original transcript identifier along with ‘|m.(number)’. The ‘type’ attribute indicates whether the protein is ‘complete’, containing a start and a stop codon; ‘5prime_partial’, meaning it’s missing a start codon and presumably part of the N-terminus; ‘3prime_partial’, meaning it’s missing the stop codon and presumably part of the C-terminus; or ‘internal’, meaning it’s both 5prime-partial and 3prime-partial. You’ll also see a (+) or (-) indicating which strand the coding region is found on, along with the coordinates of the ORF in that transcript sequence.
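To get a feel for these header fields, here is a small sketch that tallies ORF types from the header lines. The sample records are made up for illustration (real TransDecoder headers may carry additional fields), and the scratch file stands in for your own .pep file.

```shell
# Hypothetical sample records, written to a scratch file for illustration only.
sample_pep=$(mktemp)
cat > "$sample_pep" <<'EOF'
>TRINITY_DN1_c0_g1_i1|m.1 type:complete len:120 (+) TRINITY_DN1_c0_g1_i1:10-369(+)
MSTK*
>TRINITY_DN2_c0_g1_i1|m.2 type:5prime_partial len:88 (-) TRINITY_DN2_c0_g1_i1:1-264(-)
KLV*
>TRINITY_DN3_c0_g1_i1|m.3 type:internal len:60 (+) TRINITY_DN3_c0_g1_i1:1-180(+)
AAA
>TRINITY_DN4_c0_g1_i1|m.4 type:complete len:200 (+) TRINITY_DN4_c0_g1_i1:5-604(+)
MKK*
EOF

# Keep only header lines, pull out the type attribute, and count each value.
orf_type_counts=$(grep '^>' "$sample_pep" \
  | grep -o 'type:[A-Za-z0-9_]*' \
  | sort | uniq -c \
  | awk '{print $1 "\t" $2}' \
  | awk -F'\t' '{print $2 "\t" $1}')
printf '%s\n' "$orf_type_counts"

rm -f "$sample_pep"
```

Replacing the scratch file with the path to your own .pep file gives a quick summary of how many predicted ORFs are complete versus partial.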

This .pep file will be used for various sequence homology and other bioinformatics analyses below.

Sequence homology searches

Earlier, we ran blastx against our mini SWISSPROT database to identify likely full-length transcripts. Let’s run blastx again to capture likely homolog information, and we’ll lower our E-value threshold to 1e-5 to be less stringent than earlier.
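If the search results are written in BLAST tabular format (-outfmt 6), the E-value sits in column 11, so a cutoff such as 1e-5 can also be applied or tightened after the fact. The sketch below assumes that format and uses three invented hits (tx1, protA, and so on are placeholders, not real results).

```shell
# Three made-up hits in 12-column BLAST tabular layout; column 11 is the E-value.
blast_hits=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  tx1 protA 98.5 200 3 0 1 600 1 200 2e-120 410 \
  tx2 protB 35.0 80 52 2 10 249 5 84 3e-8 48.1 \
  tx3 protC 28.0 50 36 1 1 150 20 69 0.002 33.5 > "$blast_hits"

# Report the query IDs whose hit meets the 1e-5 E-value cutoff.
# Adding 0 forces awk to treat the field as a number.
passing=$(awk -F'\t' '$11 + 0 <= 1e-5 {print $1}' "$blast_hits")
printf '%s\n' "$passing"

rm -f "$blast_hits"
```

Here tx3 is dropped because its E-value of 0.002 is above the cutoff.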

Preparing and Generating a Trinotate Annotation Report

Generating a Trinotate annotation report involves first loading all of our bioinformatics computational results into a Trinotate SQLite database. The Trinotate software provides a boilerplate SQLite database called ‘Trinotate.sqlite’ that comes pre-populated with a lot of generic data about SWISSPROT records and Pfam domains (and is a pretty large file consuming several hundred MB). Below, we’ll populate this database with all of our bioinformatics computes and our expression data.

Preparing Trinotate (loading the database)

As a sanity check, be sure you’re currently located in your ‘Trinotate/’ working directory.

pwd
/home/ubuntu/workspace/trinity_workspace/Trinotate

Copy the provided Trinotate.sqlite boilerplate database into your Trinotate working directory like so:

The annotation report can be very large. It’s often useful to load it into a spreadsheet tool such as MS Excel. If you have a transcript identifier of interest, you can always just ‘grep’ to pull out the annotation for that transcript from this report. We’ll use TrinotateWeb to interactively explore these data in a web browser below.
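As a sketch of the grep approach, the example below builds a tiny stand-in report and pulls out one transcript. The three columns and their values are assumptions for illustration only; a real Trinotate report has many more columns, and you would point grep at your own report file.

```shell
# A tiny stand-in for a tab-delimited annotation report (hypothetical content).
report=$(mktemp)
printf '%s\t%s\t%s\n' \
  '#gene_id' 'transcript_id' 'sprot_Top_BLASTX_hit' \
  'TRINITY_DN1_c0_g1' 'TRINITY_DN1_c0_g1_i1' 'HYPOTHETICAL_HIT_A' \
  'TRINITY_DN1_c0_g1' 'TRINITY_DN1_c0_g1_i10' '.' > "$report"

# -F treats the ID as a literal string; -w requires a whole-word match,
# so isoform _i1 does not also pull in _i10.
transcript_id='TRINITY_DN1_c0_g1_i1'
annotation=$(grep -wF -- "$transcript_id" "$report")
printf '%s\n' "$annotation"

rm -f "$report"
```

Without -w, a search for an ID ending in _i1 would also return _i10, _i11, and so on, which is easy to miss in a large report.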

Let’s use the annotation attributes for the transcripts here as ‘names’ for the transcripts in the Trinotate database. This will be useful later when using the TrinotateWeb framework.

Nothing exciting to see in running the above command, but know that it’s helpful for later on.

Interactively Explore Expression and Annotations in TrinotateWeb

Earlier, we generated large sets of tab-delimited files containing lots of data - annotations for transcripts, matrices of expression values, lists of differentially expressed transcripts, etc. We also generated a number of plots in PDF format. These are all useful, but they’re not interactive and it’s often difficult and cumbersome to extract information of interest during a study. We’re developing TrinotateWeb as a web-based interactive system to solve some of these challenges. TrinotateWeb provides heatmaps and various plots of expression data, and includes search functions to quickly access information of interest. Below, we will populate some of the additional information that we need into our Trinotate database, and then run TrinotateWeb and start exploring our data in a web browser.

Populate the expression data into the Trinotate database

Once again, verify that you’re currently in the Trinotate/ working directory:

pwd
/home/ubuntu/workspace/trinity_workspace/Trinotate

Now, load in the transcript expression data stored in the matrices we built earlier:

Note, in the above gene-loading commands, the term ‘component’ is used. ‘Component’ is just another word for ‘gene’ in the realm of Trinity.
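To make the gene versus transcript distinction concrete, here is a small sketch. It assumes the current Trinity naming convention, in which a transcript identifier is the gene identifier followed by an isoform suffix (_i plus a number); the identifier itself is just an example.

```shell
# Example transcript (isoform) identifier; the value is illustrative.
transcript_id='TRINITY_DN1000_c115_g5_i1'

# Strip the trailing isoform suffix to recover the gene identifier.
gene_id=$(printf '%s\n' "$transcript_id" | sed 's/_i[0-9][0-9]*$//')

# Print a gene-to-transcript pair, tab-delimited.
printf '%s\t%s\n' "$gene_id" "$transcript_id"
```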

At this point, the Trinotate database should be fully populated and ready to be used by TrinotateWeb.

Launch and Surf TrinotateWeb

TrinotateWeb is web-based software and runs locally on the same hardware where we’ve been running all our computes (as opposed to the typical websites that you visit regularly, such as Facebook). Launch the mini webserver that drives the TrinotateWeb software like so:

Epilogue

If you’ve gotten this far, hurray!!! Congratulations!!! You’ve now experienced the full tour of Trinity and TrinotateWeb. Visit our web documentation at http://trinityrnaseq.github.io, and join our Google group to become part of the ever-growing Trinity user community.