Why Next Generation Sequencing

exome capture: DNA fragments are preselected using microarrays/beads with attached exonic sequences, so only fragments hybridising to them are retained

RNASeq: sequencing of RNA; specific protocols may target just the 5'- or 3'-ends of transcripts, or miRNAs via small-size selection. RNASeq is the application in which stranded libraries are used, i.e. to differentiate between sense and antisense transcripts

there are cases where extra-chromosomal DNA contributes a non-trivial portion of the total DNA (plant cells often contain 40-70 chloroplasts, with 40-70 genome copies per organelle; the chloroplast genome itself is in the 150kb range). Select tissues/stages with fewer multi-copy DNAs

Technologies to consider

Simple bacterial genomes can be sequenced using just Illumina data (still preferably paired-end, 2x100bp or more).

If you sequence anything in the 30Mb range (fungi/kinetoplastids/etc.), the genome will often contain repeats exceeding the read length, and short reads alone are unlikely to give you a satisfactory assembly (we tried it with 2x100bp at >700x coverage => shattered genome, unusable for annotation).

The likely solution is to use much longer reads (even 454 at ~400bp can help, but longer sequences are much better).

PacBio is the current leader, giving you the bulk of your reads in the 3-4kb range, good enough for smaller genomes.

Moleculo from Illumina (availability may vary in your region) is able to produce fragments of up to ~10kb using a modification of the Illumina technology.

The longest Illumina reads to date come from the lower-throughput MiSeq machine:

protocol V3, 2 × 300 bp, ~55 hrs, 13.2-15 Gb

It is possible to greatly improve scaffolding (determining which contig comes after which, and in which orientation) using lower-coverage 2kb, 5kb or 10kb insert libraries, or by sequencing fosmid ends.

one can try to reduce the complexity of the assembly by subselecting parts of the genome

use flow sorted chromosomes as input DNA

sequence pools of fosmid clones, ideally derived from different parts of the genome

Bioinformatics of genome assembly

The results of Assemblathon 2, an attempt to evaluate the performance of a number of assemblers and teams, were rather unsatisfactory. There was no clear winner performing best on all 3 species used (a fish, a parrot and a snake), and the programs seem to make different trade-offs: N50 versus accuracy of the assembly. You cannot have both at this time.

This reflects the rather poor quality of multiple sequenced genomes, from the 30Mbp Leishmania to vertebrates. And it is often made worse by over-enthusiastic annotation (the "hypothetical unlikely" genes, retrotransposons, long ORFs in ribosomal DNA).

The idea is to perform multiple assemblies using different k-mer lengths, programs and error-correction tools, and then evaluate what makes most sense. Keep in mind that simply getting a longer assembly with a longer N50 may come at the cost of removing some coding parts, as was the case with the "improvement" of the highly polymorphic sea squirt Ciona intestinalis genome.

A good check from the biology point of view is determining to what level a set of extremely conserved genes is present in the different assemblies. This was pioneered by Ian Korf with the program CEGMA, which uses ca. 400 genes supposedly conserved in all eukaryotes. Probably a better approach would be to create lineage-specific sets of conserved genes (i.e. genes having clear orthologues in all sequenced fungi).

There is an expansion of the tasks performed by "assemblers": they often do their own read filtering, sometimes error correction, and scaffolding. Since they may expect intact FASTQ files as input, there is no single clear path for what you should do with your sequences prior to assembly.

An interesting approach is to normalize the k-mer frequencies present in your reads prior to assembly, as proposed by Titus Brown with his khmer package: http://khmer.readthedocs.org/en/v1.1/

The existing assemblers often fail in regions with either very high or very low coverage; making the k-mer frequencies more uniform seems to improve some assemblies.

NGS file formats overview

There are multiple file formats used at various stages of NGS data processing. We can divide them into two basic types:

text based (FASTA, FASTQ, SAM, GTF/GFF, BED, VCF, WIG)

binary (BAM, BCF, SFF(454 sequencer data))

In principle, we can view and manipulate text-based formats without special tools, but we will need such tools to access and view the binary formats. To make things a bit more complicated, the text-based formats are often compressed to save space, making them de facto binary, but still easy to read by eye using standard Unix tools. Also, even though one can read the values in a few columns and tens of rows by eye, we still need dedicated programs to make sense of millions of rows and, e.g., encoded columns.

On top of these data/results files, some programs require a companion file (often called an index) for faster access. See e.g.:

FASTA (.fa & .fai)

BAM and BAI formats (suffixes .bam & .bai),

VCF (.vcf & .vcf.idx).

The important thing for all files containing positions on chromosomes (mappings, features) is their coordinate numbering. To make things complicated, some formats start counting at 1, others at 0.

1-based formats: GFF/GTF, SAM/BAM, WIG

0-based formats: BED, CN (copy number data, not covered)

Programs usually deduce the correct numbering scheme automatically, but whenever you are converting from one format to another, watch out for this "feature".
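
To see the off-by-one in action: converting a BED interval (0-based, end-exclusive) to GFF-style coordinates (1-based, end-inclusive) means adding 1 to the start and leaving the end alone. A minimal sketch on a made-up interval:

```shell
# BED: chrom, 0-based start, exclusive end  ->  GFF-style: 1-based, inclusive
printf 'chr1\t99\t200\n' |
awk 'BEGIN{OFS="\t"} {print $1, $2 + 1, $3}'
# the interval covering bases 100..200 is 99-200 in BED, 100-200 in GFF
```

Going the other way (GFF to BED) you subtract 1 from the start instead.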

Fasta (just few tricks)

While mature NGS programs will most likely handle complex FASTA sequence names correctly, you may sometimes have problems with various scripts. The fix is to keep just the first, hopefully unique, single-word sequence name:
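
One hedged way to do this with standard tools (awk splits on whitespace, so $1 is the first word of the header; the demo sequence is made up):

```shell
# Truncate FASTA headers to their first word; sequence lines pass through.
printf '>scaf_1 length=1500 cov=33.2\nACTGACTG\n' |
awk '/^>/ {print $1; next} {print}'
# >scaf_1
# ACTGACTG
```

In practice you would run the awk command on your real FASTA file and redirect to a new file.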

FASTQ

Format and quality encoding checks

Already in the 90s, when all sequencing was done using the Sanger method, a big breakthrough in genome assembly came when the individual bases in the reads (ACTG) were assigned quality values. In short, some parts of sequences had multiple bases with a lower probability of being called right, so it makes sense that matches between high-quality bases are given a higher score, be it during assembly or mapping, than e.g. read ends with multiple doubtful/unreliable calls. This concept was carried over to Next Generation Sequencing: while we can hardly read the individual bases from flowgrams by eye, the Illumina/454/etc. software can still calculate base qualities. The FASTQ format (files usually have the suffix .fq or .fastq) nowadays contains 4 lines per sequence:

sequence name (should be unique in the file)

sequence string itself with ACTG and N

extra line starting with a "+" sign (in the past it repeated the sequence name)

string of quality values (one character per base), where each character is translated into a number by the downstream programs
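
To see that translation at work, the quality characters can be decoded by hand: in the Sanger encoding, quality = ASCII code - 33. A sketch using a lookup-table trick (awk has no ord() function), on a made-up quality string:

```shell
# Decode a Sanger/offset-33 quality string: Q = ASCII(char) - 33
printf 'II5+#\n' |
awk '{
  for (j = 33; j <= 126; j++) tbl = tbl sprintf("%c", j)   # printable ASCII
  for (i = 1; i <= length($0); i++)
    printf "%d%s", index(tbl, substr($0, i, 1)) - 1, (i < length($0) ? " " : "\n")
}'
# II5+# -> 40 40 20 10 2
```

So "I" is Q40 (very reliable) while "#" is Q2 (essentially a guess).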

CAVEAT: when planning to do SNP calling with GATK, do not modify the part of the name describing the location on the sequencing lane:

HWUSI-EAS1696_0025_FC:3:1:2892:17869

This part is used for detecting optical duplicates.

Unfortunately, Solexa/Illumina did not follow the same quality encoding as the people doing Sanger sequencing, so there are a few iterations of the standard, with quality encodings using different character ranges.
For the inquisitive:
http://en.wikipedia.org/wiki/FASTQ_format#Quality

What we need to remember is that we must know which quality encoding our data uses, because this information is required by the mappers; getting it wrong will make our mappings either impossible (some mappers may quit when encountering a wrong quality value) or, at best, unreliable.

There are two main quality encodings: Sanger and old Illumina/Solexa.
Two other terms, offset 33 and offset 64, are also used for describing quality encodings:

offset 33 == Sanger / Illumina 1.9

offset 64 == Illumina 1.3+ to Illumina 1.7

Therefore, if we do not have direct information from the sequencing facility about which version of the Illumina software was used, we can still find out by investigating the FASTQ files themselves. Instead of going by eye, we use the program FastQC. For the best results/full report we need to use the whole FASTQ file as input, but for quick-and-dirty quality encoding recognition, 100K reads are enough:
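
The heuristic FastQC applies can be sketched in awk as well: characters below ";" (ASCII 59) never occur in offset-64 data, so finding one proves offset 33. A minimal sketch (the tiny demo.fq stands in for your real file; in practice you would pipe `head -400000 real_reads.fq` into the same awk program):

```shell
# demo input: one read with Sanger-encoded qualities
printf '@r1\nACTG\n+\nII5+\n' > demo.fq
head -400000 demo.fq | awk '
  NR % 4 == 0 {                       # every 4th line is the quality string
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (min == "" || c < min) min = c
    }
  }
  END {
    # characters below ";" (ASCII 59) never occur in offset-64 data
    if (min < ";") print "offset 33 (Sanger / Illumina 1.8+)"
    else           print "offset 64 (old Illumina) likely"
  }'
```

Note this is a one-way test: a file whose minimum character is high might still be offset 33 data that simply contains no low-quality bases, which is why inspecting a decent number of reads matters.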

Types of data

read length

from 35bp in some old Illumina reads to 250+bp on MiSeq. The current sweet spot is between 70 and 150bp.

single vs paired

Either just one side of the insert is sequenced, or sequencing is done from both ends. Single-end reads are cheaper and faster to produce, but paired reads allow for more accurate mapping and for detection of large insertions/deletions in the genome.

Most of the time, forward and reverse reads face each other end-to-end (the standard paired-end orientation); large-insert mate-pair libraries instead yield outward-facing read pairs.

insert length

With the standard protocol, the inserts are anywhere between 200-500bp. Sometimes, especially for de novo sequencing, insert sizes can be smaller (160-180bp), so that 100bp reads overlap in the middle. This can improve the genome assembly (e.g. the Allpaths-LG assembler requires such reads). Also, with some mappers (LAST), the longer merged reads used to give better mappings (covering regions not unique enough for shorter reads) than 2x single-end mapping; with paired-end mapping the gains are modest.

For improving the assembly, or for detecting larger genome rearrangements, there are other libraries with various insert sizes, such as 2.5-3kb, 5kb or more. Sequencing yields from such libraries are often lower than from the conventional ones.

stranded vs unstranded (RNASeq only)

We can obtain reads from just a given strand using special Illumina wet-lab kits. This is of great value for subsequent gene calling, since we can distinguish between overlapping genes on opposite strands.

quality checking (FastQC)

It is always a good idea to check the quality of the sequencing data prior to mapping. We can analyze average quality, over-represented sequences, number of Ns along the read and many other parameters. The program to use is FastQC, and it can be run in command line or GUI mode.

trimming & filtering

Depending on the application, we can try to improve the quality of our data set by removing bad-quality reads, clipping the last few problematic bases, or searching for sequencing artifacts, such as Illumina adapters.
All this makes much sense for de novo sequencing, where genome assemblies can be improved by data clean-up. It has a low priority for mapping, especially when we have high coverage: bad-quality reads will simply be discarded by the mapper.

You can read more about quality trimming for genome assembly in the two blog posts by Nick Loman:

Tagdust (for simple unpaired reads)

Tagdust is a program for removing Illumina adapter sequences from the reads containing them. Reads containing 6-8 bases not from the genome will be impossible to map with typical mappers, which often allow just 2 mismatches. Tagdust works in unpaired mode, so when using paired reads we have to "mix and match" the two outputs to keep the mates together for paired mapping.
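
The "mix and match" step can be sketched with awk: keep only those reads in the second file whose mate survived filtering in the first file, then repeat with the files swapped. This sketch assumes unwrapped 4-line records and mate names identical up to the first whitespace (older /1 and /2 name suffixes would need stripping first); the demo files stand in for the two Tagdust outputs:

```shell
# demo inputs standing in for the two Tagdust outputs
printf '@A\nAAAA\n+\nIIII\n@C\nCCCC\n+\nIIII\n' > clean_1.fq
printf '@A\nTTTT\n+\nIIII\n@B\nGGGG\n+\nIIII\n@C\nACGT\n+\nIIII\n' > clean_2.fq
# pass 1 (NR==FNR): remember surviving names from file 1;
# pass 2: print file-2 records whose name was seen in file 1
awk 'NR == FNR { if (FNR % 4 == 1) keep[$1] = 1; next }
     FNR % 4 == 1 { p = ($1 in keep) }
     p' clean_1.fq clean_2.fq
# keeps the @A and @C records, drops @B (its mate did not survive)
```

Running it once per direction yields two name-synchronised files suitable for a paired mapping run.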

Error correction

For some applications, like de novo genome assembly, one can correct the sequencing errors in the reads by comparing them with other reads of almost identical sequence. One of the programs which performs this, and is relatively easy to install and run, is Coral.

Coral

It requires a machine with a lot of RAM for correcting individual Illumina files (we ran it with 96GB of RAM).

#Illumina reads
./coral -fq input.fq -o output.fq -illumina
#454 reads
./coral -fq input.454.fq -o output.454.fq -454

Correcting 454 reads with Illumina reads: Coral cannot use more than one input file, so one has to combine the Illumina & 454 reads into a single FASTQ file, noting the number of Illumina reads used. To prepare such a file and run the correction:

cat 10Millions_illumina_reads.fq > input_4_coral.fq
cat some_number_of_454_reads.fq >> input_4_coral.fq
## run coral with the command:
coral -454 -fq input_4_coral.fq -o input_4_coral.corrected.fq -i 10000000 -j 10000000
#this corrects just the 454 reads, not the Illumina ones
#in real life you have to count the number of reads in your Illumina FASTQ file,
#e.g. (assuming you do not have wrapped sequence/quality lines, i.e. 1 read = 4 lines):
wc -l illumina_reads.fq | awk '{print $1/4}'
#if in doubt, use fastqc to get the numbers

Public FASTQ data: Short Read Archive vs ENA

While we will often have our data sequenced in house or provided by collaborators, we can also reuse sequences made public by others. Nobody does everything imaginable with their data, so it is quite likely we can do something new and useful with already published data, even if only treating it as a control for our pipeline. Also, doing exactly the same thing, say assembling genes from RNASeq data, but with newer versions of the software and/or more data, will likely improve on the results of previous studies.
There are two main places to get such data sets:

* go there
* put Leishmania major
* click on the checkbox SRA Experiments
* click on Display button
* we get 58 public experiments, 7 of which are RNA
* click on RNA (7) on the left
* note
"Whole Genome Sequencing of Leishmania major strain Friedlin"
Accession: SRX203187
CAVEAT: not everybody submits the right description of their experiments. When in doubt, download and map.

You can also get multiple SRA files in one step, without 20 clicks, from NCBI with configured ASPERA, by writing a shell script (one download command per file), preferably generated with a (Python/Perl/Ruby) script from a list of files to get:
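
A hedged sketch of the script-generation step, targeting ENA's FTP instead of ASPERA. The directory pattern (vol1/fastq/first-6-characters/run-accession/) is an assumption that holds for 9-character run accessions; longer accessions get an extra subdirectory, and single-end runs lack the _1/_2 split. File names here are made up:

```shell
# list of run accessions, one per line (SRR203187 is from the example above)
printf 'SRR203187\n' > runs.txt
# assumed ENA layout: vol1/fastq/<first 6 chars>/<run>/<run>_[12].fastq.gz
while read -r run; do
  dir="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${run%???}/${run}"
  echo "wget ${dir}/${run}_1.fastq.gz ${dir}/${run}_2.fastq.gz"
done < runs.txt > fetch.sh
cat fetch.sh
```

Running `sh fetch.sh` then downloads the files one pair at a time.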

Which one to use? ENA may be easier, as you get gzipped FASTQ files directly. But the NCBI tools may have a better interface at times, so you can search for an interesting data set at NCBI, then store the experiment names and download the fastq.gz files from ENA.

SAM and BAM file formats

The SAM file format serves to store the results of mapping reads to a genome. It starts with a header describing the format version, the sorting order (SO) of the reads, and the genomic sequences to which the reads were mapped. Coordinates are 1-based, and the format is line-oriented (one read alignment per line).

In short, it is a complex format where each line holds detailed information about a mapped read: its mapping quality, mapped position(s), strand, etc. The exact description (with BAM and examples) takes 15 pages: http://samtools.sourceforge.net/SAMv1.pdf
Use BAMs instead of SAM files (speed of access, size, compatibility with tools). There are multiple tools to process and extract information from SAM and its compressed form, BAM (samtools being the everyday workhorse).
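
To get a feel for the column layout, here is a made-up alignment line pulled apart with awk. The first columns are QNAME, FLAG, RNAME, POS, MAPQ, CIGAR; the FLAG is a bit field (e.g. 0x1 = read is paired, 0x10 = read maps to the reverse strand):

```shell
# one hypothetical SAM alignment line, tab-separated
printf 'read1\t163\tchr1\t100\t60\t8M\t=\t300\t208\tACGTACGT\tIIIIIIII\n' |
awk -F'\t' '{
  printf "name=%s ref=%s pos=%s mapq=%s cigar=%s\n", $1, $3, $4, $5, $6
  # test individual FLAG bits via integer division
  printf "paired=%d reverse=%d\n", $2 % 2, int($2 / 16) % 2
}'
# name=read1 ref=chr1 pos=100 mapq=60 cigar=8M
# paired=1 reverse=0
```

Real work should of course go through samtools rather than awk, but decoding one line by hand is a good way to learn the format.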

GTF/GFF

The most commonly used sequence annotation format, existing in a few flavors.

GTF Fields:

seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.
source - name of the program that generated this feature, or the data source (database or project name)
feature - feature type name, e.g. Gene, Variation, Similarity
start - Start position of the feature, with sequence numbering starting at 1.
end - End position of the feature, with sequence numbering starting at 1.
score - A floating point value.
strand - defined as + (forward) or - (reverse).
frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.

IGV-centered CAVEAT:
GFF files downloaded from various sources often contain more annotation types than a single IGV track can handle. Typical offenders are chromosome/scaffold lines, which can make the gene annotations invisible.
Since the authors are quite inventive in naming their features, and keywords like "chromosome" may also appear in column 9, you can:

As you can see, we can try to remove just the 34+1 lines (chromosome + random_sequence) and check that we get 908081-35 entries when running the same command on the result. Just remember that we also removed the header, which we may want to keep.
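
A hedged sketch of such a clean-up, matching on column 3 only, so a word like "chromosome" hiding in the column-9 attributes does not remove a real gene; this variant also keeps the "#" header lines. The two-feature input is made up:

```shell
# made-up GFF: one chromosome line, one gene with "chromosome" in column 9
printf '##gff-version 3\n' >  demo.gff
printf 'chr1\tsrc\tchromosome\t1\t1000\t.\t+\t.\tID=chr1\n' >> demo.gff
printf 'chr1\tsrc\tgene\t10\t200\t.\t+\t.\tID=g1;Note=chromosome arm\n' >> demo.gff
# keep headers and every feature that is not a chromosome/scaffold line
awk -F'\t' '/^#/ || ($3 != "chromosome" && $3 != "random_sequence")' demo.gff
# sanity check: count entries (excluding headers) before and after
grep -vc '^#' demo.gff
```

The gene line survives despite containing the word "chromosome", because only the feature-type column is tested.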

BED

1: chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).

2: chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.

3: chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

7: thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position.

8: thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).

9: itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line.

10: blockCount - The number of blocks (exons) in the BED line.

11: blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.

12: blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
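
To illustrate how the block fields fit together: each block's absolute start is chromStart + blockStart, and its end is that value plus the blockSize. A sketch on a made-up two-exon BED line (columns as numbered above):

```shell
# cols: chrom start end name score strand thickStart thickEnd rgb count sizes starts
printf 'chr3\t1000\t1800\ttx1\t0\t+\t1000\t1800\t0\t2\t100,200,\t0,600,\n' |
awk -F'\t' '{
  split($11, sizes, ","); split($12, starts, ",")
  for (i = 1; i <= $10; i++)       # one 0-based interval per block (exon)
    printf "%s\t%d\t%d\n", $1, $2 + starts[i], $2 + starts[i] + sizes[i]
}'
# chr3 1000-1100 and chr3 1600-1800; the last block ends exactly at chromEnd
```

Note the trailing commas in blockSizes/blockStarts are legal and harmless here, since the loop is bounded by blockCount.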

VCF

Stands for Variant Call Format: a text format for storing information about SNPs, insertions and deletions.
It is rather complex; see the detailed description:
http://www.1000genomes.org/node/101

Obtaining variants listed in such form is a multistep procedure, but well standardised. See GATK section below.