Introduction

The goal of this document is to provide a set of illustrative cookbook recipes for wet lab biologists. As there is more than one way of performing a given task, it does point to different tools, but it does not even try to become a yet another directory of links.

You are strongly encouraged to post your comments/ideas on the discussion page of this tutorial.

Before you start

use text files/plain e-mail whenever possible

give meaningful names to your files (i.e. humP53_XP12345.fas)

do not use spaces or Unix special characters in file names (!?"'%&^~*$|/\{}[]()<>:)

create separate folders/directories for each project with meaningful names

do not waste paper for printing out your blast search results or sequences

if primary data needs to be archived burn it on CD using (if possible) non vendor-specific data format

Operating systems/General software

Once you move from worksation to bioinformatics application server (almost always Linux/Unix) then as long as you have a Xwindows connection (preferably ssh tunneled) operating system of client workstation plays a minor (if any) role.

DNA sequence processing

Sequence formats

There are multiple sequence formats commonly used. These can be divided into text-only and binary formats.

Text-only

Unaligned sequences

Raw sequence ( example) The text file contains nothing but sequence. The information that's inside has to be encoded in the name of the file. Some web servers still require it.

FASTA (description, example) The most commonly used format (by programs/servers). It can contain either single or multiple sequences, each having a name line, also known as the header, starting with ">" in the first column. Keep in mind that various programs/servers introduce extra restrictions how the FastA header should look like. The "safe" standard:

Alphanumeric characters only [a-z, A-Z, 0-9], avoid spaces!

Keep the names short [i.e. 30 characters are significant for clustalw, only 8 for older versions of t_coffee]

In multiple FastA format keep the names unique

Whenever you plan to analyze your sequence using web servers or command line programs, keep it as FASTA. It is easy to cut the raw sequence out of it, and if your file name/sequence names contain accession numbers to find the GenBank entry.

GenBank format of choice used by GenBank to display extra information about the sequence. It is fairly complex, but easily readable by humans. Keep your sequences as GenBank when you need extra annotations provided by this format.

GCG GCG package own format. If you need to use GCG, you will have to convert your sequences into it

XML eXtended Markeup Language. Standard language slowly taking over the Net. XML files are computer-readable (= information is fairly easy to extract from them) and possible to read by humans. Format of choice when complex sequence information need to be described and computer accessible.

Processing ABI files and sequence assembly

Before you even start sequencing, plan the way the sequences will be processed. The most important is to establish a proper naming scheme compatible with various programs you (or others) may use downstream for sequence processing. The basic rules for file names are:

Even small sized projects become cumbersome if file names are not unique (you can not assembly files having the same name!).
Once these requirements are satisfied, file name alone will give you enough information about:

Creating sequence out of ABI chromatogram file is done by a basecalling program. One can go along with ABI-provided basecaller or preferably use:

phred[2] giving more accurate calls for less accurate part of the sequence (like at the end of the run, say 600bp and more) . Phred also gives a probability/quality values for each of the bases allowing more accurate assembly.

Genome based

Genome annotation using ESTs assembly

Synthetic genes

Many genes due to an extreme codon bias can not be expressed in model organisms, such as E.coli. To fix this problem one can back-translate protein sequence into nucleotides, using codons preferably used by E.coli. Moreover one can construct sucgh gene from a set of overlapping oligos, introducing restriction sites or other features at will. Programs performing such tasks:

E-CAIHTML "The E-CAI server provides a direct threshold value for discerning whether the differences in CAI are statistically significant or whether they are merely artifacts that arise from internal biases in the G+C composition and/or amino acid composition of the query sequences."

RNA

The Vienna RNA package has solutions for many classical RNA problems. It well worth taking some time to familiarize oneself with the algorithms in this package before checking out the alternative methods.

For the best accuracy (and evolutionary information) comparative approaches, in particular the alignment folding methods, have proven rather successful. Typically a researcher will align sequences using tools such as ClustalW, MAFFT and ProAlign and then fold the alignment using tools such as RNAalifold or Pfold.

RNA homology search

For the Wikiomics:RNA homology search problem (including secondary structure) a number of algorithms exist. E.g. Infernal and RaveNnA. A number of tools specializing in a specific RNA family are also available eg. tRNA, miRNA etc.

RNA gene-finding

Tools for Wikiomics:RNA gene-finding usually use as input sequence alignmments which are then scanned for potential conserved secondary structure signals. Useful examples of these are QRNA, RNAz and EvoFold.

Multiple sequences

There is a trade off between accuracy and speed. For speed select MUSCLE. Accuracy of algorithms differs, with the best results on average being achieved using PRALINE and t-coffee/M-coffee. For poteins with multiple/repeated domains you may try graph based algorithms: poa or aba.

T-Coffee has a limit of 50 sequences. The new version is called M-Coffee, it is a meta-aligner using output of several other algorithms (ClustalW, Poa, Muscle, ProbCons, MAFFT, Dialign-T, PCMA). To run it (bash):

Non-coding DNA sequences

Similarity plots (dot-plots)

Create dot-plots (DNA-DNA, protein-protein and DNA-protein) between one or more sequences. Ideal for quick estimation of sequence similarity, gene organization (genomic DNA vs cDNA), domain finding for small/medium size sequences. One can also zoom an area of interest.
Memory and CPU is the limit but one can compare 200kbp genomic DNA vs 6kbp cDNA in ca 2min on 2GHz Pentium 4.

Database searches (sequence similarity)

There is a tradeoff between finding all possible matches (like with Smith-Waterman) and speed. For most applications Blast (and Psi-Blast) is a first choice. For massive searches, where speed is essential (i.e. mapping all cDNAs to a mammalian genome) use blat

Regular expressions

Searches for short motifs, like [MC]-A-x-E(2)-{IE} (M or C followed by A followed by any followed by EE followed by anything but I or E)

Phylogeny

for more than 12 sequences tha space of all possible tree is to large to search one-by-one -> heuristic methods are used

Methods

maximum parsimony MP

sensitive to the order of sequences.

distance methods

Neighbour joining NJ

UPGMA (Unweighted Pair Group Method with Arithmetic mean) assumes rate of mutation constant along branches of a tree. Less acurate for sequence data than NJ.

maximum likehood (computationaly intense, but more accurate)

To check validy of a tree:

bootstrap

After creating a tree from MSA bootstraping program selects number of columns from MSA and calculates a tree. Thsi process is repeated 100-1000 times and resulting trees are compared with previously created one. Branches present in more than 70% of bootstrap samples are believed to have a good support.
These estimates are conservative

Mesquite collection of modules for performing a wide range of analyses from various branches of evolutionary biology (e.g., phylogenetics, molecular evolution, population genetics, geometric morphometrics)

fill in the form selecting the proper splice site matrix (so far: vertebrates, Drosophila, C. elegans, or plants)

you get a table with genomic coordinates/mRNA coordinates, exon length, % of identity, number ofmismatches, gaps, precence or abcence of splice donor and acceptor sites

Very good at finding exons in the same species (i.e human cDNA/human genomic sequence), more problematic with cross-species comparisons (i.e. human cDNA, bovine genomic sequence -> most of the exons OK, with few bad predictions).

Promoter prediction

Predicting promoters based on a single sequence is rather unaccurate. Transcription factor (TF) recognition sequences are fairly short, therefore any sequence 1kbp in lenght will contain hundreds of such motifs, only a handful of which will be a true positives. There are two major methods used for finding relevant sites

comparison of promoter regions of genes with similar expression profiles in the same species

We can look for either known TFs recognition sequences (quality of which depends on database used) or perform motif finding de novo, looking simply for any streach of DNA shared between species/co-expressed genes.

before you start

make sure that you got a real 5' end exon of the gene (a lot of longer genes in the databases may be missing the true 5'end). You verify quality of your gene of interest by:

comparing with orthologues ( Ensembl or Homologene) If close orhologues are longer on the protein level, look for the presumptive missing exon(s).

blasting masked genomic sequence, from 5' end exon to the start/end of the flanking gene

Interpretation of this one may be tricky, on the simplest level look for spliced matches with ORFs in the same orientation as your gene without clear stop codons.

find position of repeats

After checking the 5' end of the gene, always look for repetitive sequences using RepeatMasker. It does not mean that say Alu/LINE1 can not be a part of the promoter. Just that you may concentrate on parts which are a bit more specific for the given gene.

It is important to use few complementary programs as well as allow predictions of several motifs per program. There is threshold when it comes to number of sequences used, above which there is no improvement in sensitivity.

Be aware that simply listing all recognizable TF binding sites in a given stretch of DNA is almost of no value. Only 0.1% of the putative sites listed that way will be functional (Applied bioinformatics for the identification of regulatory elements WW Wasserman, A Sandelin - Nature Reviews Genetics, 2004)

Reviews & tutorials

For an extensive table of known programs used for finding motifs see [55]

Transcription start finding

There are various clases of such programs starting with completa ab initio promoter finding using genomic sequence only, through combining it with ab initio gene fining, finally taking into account ESTs data and/or gene annotations. Most of the programs were evaluated on human or mammalian sequences and may not be suitable for other groups of organisms. Masking repeats using RepeatMasker improves performance of McPromoter(strongly) and FirstEF (mildly), whereas has no effect on Eponine and DragonGSF.

Graphical plots of protein 2D structure

' 'One-step' data input procedure, including easy querying of the public databases and incorporation of output from various popular prediction tools available from the Web.'
allows creating graphs with transmembrane regions, signal peptides, disulfide-bridhes, glicosylated residues etc.

Metabolic networks

Protein-protein interactions

Large scale methods such as yeast-two-hybrids (Y2H), immuno-co-precipiatation (IP) produce a large number of false positives. Only about 30-50% of interactions seem to be relevant. While newer improvements of these procedures tend to be more accurate, existing databases struggle to separate good quality data.