Pittsburgh Supercomputing Center

GeneDoc

A Full Featured Multiple Sequence Alignment Editor, Analyser and Shading Utility for Windows

GeneDoc provides tools for visualizing, editing, and analyzing multiple sequence alignments of protein and nucleic acid sequences. GeneDoc embeds these tools in an explicitly evolutionary context. This context is most directly expressed as the ability to divide the sequences into groups that reflect the division of superfamilies of genes (and proteins) into distinct families. GeneDoc can analyze and visualize these groups either separately or together, and groups can also be contrasted with one another. GeneDoc's analysis capabilities include statistical tools that allow users to evaluate explicit biological or evolutionary hypotheses expressed in terms of specific groupings of sequences (Nicholas and Graves, 1983; Nicholas and McClain, 1995). The visualization tools are tightly integrated with the analysis tools and present the analysis results in a form that is easy to comprehend and to use in presentations. GeneDoc provides an evolutionary context for alignment editing by evaluating changes to the alignment in terms of explicit evolutionary models. GeneDoc's analysis functions help users discover which sequence residues are important in the structural and functional roles carried out by biological macromolecules.

GeneDoc runs on Windows and is also distributed as a Visual Studio project. See the Downloads section below for links and licensing.

Key Features

Data Analysis and Visualization

Score assisted alignment

Drag-and-drop manual alignment

Paginated Printouts

Windows Based

Highly Configurable

Exported Figures

Phylogenetic Tree support

Additional Documentation

Video Tutorial

Downloads

GeneDoc is available as a Windows executable (EXE) file and also as a Visual Studio project. GeneDoc is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 2 of the License or (at your option) any later version.

Trinity Usage on Blacklight

The information on this page was taken directly from the Trinity-use-on-Blacklight wiki page formerly hosted on Wikispaces. It was created by Brian Couger of Oklahoma State University, and we thank him for his time, his expertise, and his permission to use it.

Trinity Background

Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-Seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-Seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so:

Inchworm assembles the RNA-Seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.

Chrysalis groups the Inchworm contigs into clusters and constructs a complete de Bruijn graph for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.

Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.

Blacklight Background

The Blacklight resource is hosted by the Pittsburgh Supercomputing Center (www.psc.edu).

Blacklight is an SGI Altix UV 1000 supercomputer designed for memory-limited scientific applications in fields as diverse as biology, chemistry, cosmology, machine learning, and economics. Funded by the National Science Foundation (NSF), Blacklight carries out this mission with partitions offering as much as 16 Terabytes of coherent shared memory.

Blacklight's unique architecture allows computational jobs that require a large amount of memory, such as de novo transcriptome and genome assemblies, to be completed. The very large amount of addressable RAM allows for very high read density assemblies, many of which would be beyond the computational scope of most other HPC systems.

Obtaining an account

Blacklight is part of the XSEDE program (https://www.xsede.org/), the successor to the TeraGrid. XSEDE is a National Science Foundation-funded collection of HPC resources, services, and expertise that allows users to use national HPC infrastructure resources remotely. Instructions for obtaining a user account can be found here: https://www.xsede.org/web/guest/allocations. The requirement is that you or a member of your group be a current researcher in the United States of America, or that you have a research partner who is currently working in the United States.

Logging on to Blacklight

There are several options for logging on to Blacklight once you have established an XSEDE user account.

The host name to use is: blacklight.psc.teragrid.org. Upon log-in you will be prompted to enter a user name and password. Use the XSEDE username and password given to you when you received your XSEDE account.

Running Jobs on Blacklight

A highly detailed explanation of executing computational jobs on Blacklight, along with advanced usage, can be found here: http://www.psc.edu/index.php/computing-resources/blacklight. This section gives a brief overview of the basics of using Blacklight and a step-by-step how-to guide on using Trinity on Blacklight.

Blacklight OS Structure

Blacklight runs a customized Linux kernel and uses a PBS/Torque-style system for scheduling and managing jobs. Users who have experience with either should be in familiar territory. A helpful resource for Linux-related questions is http://www.linuxquestions.org/questions/.

Blacklight Queue Structure

There are two basic queues on Blacklight: the debug queue and the batch queue.

The debug queue has a limit of 30 minutes of wall time and a maximum of 16 cores, which is good for ensuring that your command line execution arguments are correct. The debug queue is NOT to be used for production runs.

The batch queue is broken into subqueues based on the number of cores and the wall time requested. You submit jobs to the batch queue and they are automatically slotted into the appropriate subqueue based on the resources requested.

Jobs that ask for 256 or fewer cores can ask for a maximum wall-time of 96 hours.

Jobs that ask for more than 256 cores, to a maximum of 1440 cores, can ask for a maximum wall-time of 48 hours.

Jobs requesting more than 1440 cores are sent to a separate queue where they receive special handling.

What if I need more time?

For assemblies that would take longer than the wall time allowed by the queues, or for any other problems, please contact PSC support.

If you start a job and then realize that you need more time, you can still contact PSC support, and PSC can extend the time of your running job.

Memory Allocation

The amount of memory that is allocated to your job is determined by the number of cores requested. The 16 cores on each blade share 128 Gbytes of RAM. The table below shows the amount of RAM you have access to based on the number of cores that you request. Because there are 16 cores on a blade, and blades cannot be shared among jobs, you must request cores in multiples of 16.

Cores    Memory (Gbytes)
16       128
64       512
256      2048
512      4096
1024     8192
1424     13952

Charges

On Blacklight, Service Unit charges (SUs) are based on the number of cores a job uses. One core-hour is one SU. Because jobs do not share blades, and there are 16 cores on a blade, a one hour job that uses one blade will be charged 16 SUs.

Job Submission

Jobs are executed on Blacklight using the Portable Batch System (PBS/Torque). Users submit jobs to a scheduler, which determines when each job is executed based on a number of factors, including the resources required for the job, the number of jobs the user currently has in the queue, the job's specified wall-time, and how many jobs are currently running. For the quickest turnaround, jobs should request only the amount of resources needed.

You must create a job script and submit it to run a job. The following template script is an example of running Trinity; each #COMMENT line explains the line of the script that follows it.

#!/bin/csh
#COMMENT ncpus must be a multiple of 16; the formula for total RAM allocated is ncpus/16*128 = X GB
#PBS -l ncpus=32
#COMMENT The duration of time requested for the job, in this case 95 hours
#PBS -l walltime=95:00:00
#COMMENT Combines stdout and stderr in one file
#PBS -j oe
#COMMENT Specifies the queue. Change this to 'debug' to access the debug queue (limit of ncpus=16 and walltime=00:30:00)
#PBS -q batch
#COMMENT Emails you when the job begins, aborts, or ends
#PBS -m abe -M <your email address>
#COMMENT Echo each command as it executes
set echo
#COMMENT Needed to load the module command
source /usr/share/modules/init/csh
#COMMENT Set stacksize to unlimited
limit stacksize unlimited
#COMMENT Move to your $SCRATCH directory; this is where your read files should be located
cd $SCRATCH
#COMMENT Load the most recent version of Trinity
#COMMENT Run 'module avail trinity' on the Blacklight command line to find the name of the latest Trinity module
#COMMENT (unless you need to continue a run started with a different version -- don't switch versions in the middle of an assembly!)
module load trinity
#COMMENT Load the latest versions of supporting modules required by Trinity
module load bowtie
module load samtools
#COMMENT Run the Trinity command
Trinity --seqType fq --JM 100G --left reads.left.fq --right reads.right.fq --SS_lib_type RF --CPU 16 > trinity_output.log

MAKE SURE TO REDIRECT TRINITY OUTPUT TO A LOG FILE AS SHOWN ABOVE (> trinity_output.log) OR YOUR JOB WILL LIKELY GET KILLED!!!

If the output goes through the batch system, the job will be killed if the output exceeds 20 MB (which it usually does with Trinity).

Once you have copied the above template script and made the appropriate changes, you can create a job submission file and submit the job to the queue. You can use any text editor you are familiar with to do this.
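For example, if the script is saved as trinity_job.pbs (a placeholder file name), you can submit it and check its status with:

qsub trinity_job.pbs
qstat -u <your user name>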

If you are using GSI-SSHTerm to transfer files, go to Tools > SFTP Session after logging in. In the Address box, type the full path to the $SCRATCH directory where you want to store the files.

Using Trinity

The batch script above requests 32 CPUs (cores) with 256 GB of RAM for 95 hours. This should be enough to run most small to medium Trinity jobs. If your job is small, you may consider using 16 CPUs (cores), which allocates 128 GB of RAM for your job, but be warned that only a limited number of 16-core jobs are allowed to run on the system at once, so turnaround may actually be slower than for 32-core jobs. You can check whether your 16-core job is held up by other 16-core jobs by running qstat -s <pbsjobnumber>.

Only 16-core jobs are limited in this fashion. Other jobs will run based on available cores and the number of jobs ahead of yours in the queue.

If your job is large, consider altering the parameters as necessary to accommodate the data. If you believe that you need more wall-time, remember that Butterfly can be run separately from Inchworm and Chrysalis (recommended for large data-sets on Blacklight); see the sketch below.
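A sketch of this two-stage approach, using the --no_run_chrysalis option described later in this document and assuming that rerunning the identical Trinity command resumes from completed intermediate files (verify this for your Trinity version):

#COMMENT First job: run Inchworm only
Trinity --seqType fq --JM 100G --left reads.left.fq --right reads.right.fq --SS_lib_type RF --CPU 16 --no_run_chrysalis > trinity_stage1.log
#COMMENT Second job: rerunning the same command resumes the assembly and runs Chrysalis and Butterfly
Trinity --seqType fq --JM 100G --left reads.left.fq --right reads.right.fq --SS_lib_type RF --CPU 16 > trinity_stage2.log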

Using Interactive Access

Interactive access on Blacklight is possible; however, it should only be used for short debugging jobs. This command will request an interactive session with 16 cores (allocating 128 GB of RAM) and 30 minutes of wall-time:

qsub -I -l ncpus=16 -l walltime=00:30:00 -q debug

This job uses the debug queue, which has a limit of 16 cores for 30 minutes. Larger jobs must be run with a batch script as in the Job Submission section above.

If your job is killed

If your job stops with a memory allocation error (the exact numerical values in the message may vary), you did not ask for enough memory. Request more memory (by requesting more cores) and resubmit the job. See the Memory Allocation section of this document for details.

If you have a gigantic job that will exceed the standard queue's limits for wall-time or RAM, please contact PSC support to request help.

Module Command

PSC has installed the module software on Blacklight. You can load Trinity and all its dependencies with the module command and execute it anywhere as if it were contained in your path. To see what versions of Trinity are currently installed, type:
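module avail trinity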

Before running Trinity, set stacksize to unlimited:
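limit stacksize unlimited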

Move to your $SCRATCH space

Your scratch directory is where all assembly files should be uploaded and where all large outputs should be kept on Blacklight (your $HOME space has a 5 GB quota). To move to your scratch space, type:

cd $SCRATCH

If you need the location of this directory to transfer files with either WinSCP or GSI-SSHTerm, type pwd. This will print the directory, which should be:

/brashear/<your Blacklight User Name>

Just remember to back up any data on $SCRATCH, either to $HOME (if it is on the order of megabytes) or to the archival system (if it is GBs or larger).

Execute Trinity Specific Commands

The following are examples of Trinity options and commands that can be used. A full list of options is available on Trinity's main site: http://trinityrnaseq.sourceforge.net/. We highly recommend that you read this list to see the correct usage for these options.

Note: you must substitute your specific values for <Variable> in these examples. Do not include the '< >' symbols in your command. Be sure you are in your $SCRATCH directory and that your input files are located there also.

--jaccard_clip

Requires the bowtie module to be loaded; only recommended if you are assembling a transcriptome from a gene-dense genome, such as a fungal genome. If you have paired-end reads, Trinity uses Bowtie to check that consistent pairing is maintained; this is not recommended for large genomes. Ensure that your read names are properly labeled by ending with "/1" and "/2".

--kmer_method <meryl|jellyfish|inchworm> (required)

These are the different methods that can be used for kmer creation with Inchworm. More documentation can be found on the Trinity and meryl websites. For large to very large assemblies these parameters can be adjusted for improved performance, at a trade-off in the amount of RAM used.

--cpu <int>

Number of CPUs; this should be equal to the number of CPUs (cores) requested for the job.

--bflyCPU <int>

Number of CPUs to use for Butterfly; this should be equal to the number of CPUs (cores) requested for the job.

--bflyHeapSpaceInit <string>

The amount of RAM each Butterfly thread will initially use; the product of this value and the thread count cannot exceed the amount of RAM allocated for the job. An example of an acceptable value is 3G, for 3 GB of initial Java heap space.

--bflyHeapSpaceMax <string>

The maximum amount of heap space Butterfly will attempt to use if the initial amount is insufficient. Consider increasing it if a Butterfly job does not complete and exits with a heap-space error.

--no_run_chrysalis

Run only Inchworm; this can be useful when dealing with very large jobs that require a large amount of wall time.
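For illustration, a command combining several of these options might look like the following; the read file names are placeholders, and option availability varies across Trinity versions, so check the Trinity documentation:

Trinity --seqType fq --kmer_method jellyfish --JM 100G --left reads.left.fq --right reads.right.fq --CPU 32 --bflyCPU 32 --bflyHeapSpaceInit 1G --bflyHeapSpaceMax 4G > trinity_output.log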

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

There are many search programs in the blast suite, depending on the type of analysis to be done:

blastn compares a nucleotide query against a nucleotide database.

blastp compares a protein query against a protein database.

blastx compares a translated nucleotide query against a protein database.

tblastn compares a protein query against a translated nucleotide database.

tblastx compares a translated nucleotide query against a translated nucleotide database.

To run blast using your own fasta-formatted sequence collection as a database, make sure that the database is converted to blast format prior to running the blast command. The program within the blast suite that does blast database formatting from a fasta sequence collection is called "makeblastdb". For example:

makeblastdb -in uniprot_sprot.fasta -dbtype prot

After the database is formatted, run the desired blast program. For example:
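For instance, assuming your assembled transcripts are in a file named Trinity.fasta (a placeholder), a blastx search against the database formatted above could look like:

blastx -query Trinity.fasta -db uniprot_sprot.fasta -num_threads 16 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6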


Running Trinotate

Make Trinotate programs available for use

The Trinotate process relies on a number of underlying programs and sequence databases. To make all of these programs available, use the following module command:

module load trinotate

This will load a number of modules, including trinotate_db, ncbi-blast, signalp, tmhmm, hmmer, and RNAmmer.

General Usage

The general Trinotate process is as follows. Starting with the transcripts:

Run blastx with the transcripts against uniprot-swissprot

Run RNAmmer with the transcripts

Generate conceptual protein translations of the transcripts

Run blastp with the conceptual protein translations against uniprot-swissprot

Run hmmsearch with the conceptual protein translations against PFAM

Run signalp with the conceptual protein translations

Run tmhmm with the conceptual protein translations

When the above runs are done, load the results into a pre-populated SQLite database that contains annotation information (including GO terms) linked to the uniprot-swissprot and PFAM identifiers.

Query the database and generate a report that can be viewed in spreadsheet software.

In general, the blastx and blastp steps will take the most time. The one-PBS-job example below shows these steps as concrete commands.

PBS Examples

Here are two example jobs for Trinotate. The first runs most of the Trinotate process in one PBS job. The second runs the Trinotate processes as individual PBS jobs, which is the recommended method if you have a large number of transcripts (> 10,000).

One PBS job

Below is a simple example that runs most of the Trinotate process in one PBS job. If you have many transcripts (> 10,000), we do not recommend this method; instead, see the example for running the Trinotate processes as individual PBS jobs.
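A minimal sketch of such a script, modeled on the Trinity template above; the resource requests and the file and database names (Trinity.fasta, uniprot_sprot.fasta, Pfam-A.hmm) are placeholders, and exact program invocations vary with the installed versions, so consult each tool's documentation:

#!/bin/csh
#COMMENT 16 cores allocates 128 GB of RAM
#PBS -l ncpus=16
#PBS -l walltime=48:00:00
#PBS -j oe
#PBS -q batch
set echo
source /usr/share/modules/init/csh
limit stacksize unlimited
cd $SCRATCH
#COMMENT Load Trinotate and its supporting modules (trinotate_db, ncbi-blast, signalp, tmhmm, hmmer, RNAmmer)
module load trinotate
#COMMENT blastx: transcripts against uniprot-swissprot
blastx -query Trinity.fasta -db uniprot_sprot.fasta -num_threads 16 -max_target_seqs 1 -outfmt 6 > swissprot.blastx.outfmt6
#COMMENT Generate conceptual protein translations of the transcripts
TransDecoder -t Trinity.fasta
#COMMENT blastp: conceptual translations against uniprot-swissprot
blastp -query Trinity.fasta.transdecoder.pep -db uniprot_sprot.fasta -num_threads 16 -max_target_seqs 1 -outfmt 6 > swissprot.blastp.outfmt6
#COMMENT hmmsearch: conceptual translations against PFAM
hmmsearch --cpu 16 --domtblout pfam.domtblout Pfam-A.hmm Trinity.fasta.transdecoder.pep
#COMMENT Signal peptide and transmembrane region predictions
signalp -f short -n signalp.out Trinity.fasta.transdecoder.pep
tmhmm --short < Trinity.fasta.transdecoder.pep > tmhmm.out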