There are many sources of errors that can influence the quality of your sequencing run [ROBASKY2014].
In this quality control section we will use our command-line skills
to investigate the quality of sequencing data and to clean it [KIRCHNER2014].

Note

You will encounter some To-do sections at times. Write the solutions and answers into a text file.

The data we are using is “almost” raw data as it came from the machine. It has already been post-processed in two ways. First, all sequences that were identified as belonging to the PhiX genome have been removed; this requires some skills we will learn in later sections. Second, Illumina adapters have been removed. Both steps are explained below, but we are not going to perform them ourselves.

PhiX is a nontailed bacteriophage with a single-stranded DNA genome of 5386 nucleotides.
PhiX is used as a quality and calibration control for sequencing runs.
PhiX is often added at a low, known concentration, either spiked into the same lane as the sample or run in a separate lane.
As the concentration of the genome is known, one can calibrate the instruments.
Thus, PhiX genomic sequences need to be removed before processing your data further, as they constitute a deliberate contamination [MUKHERJEE2015].
The steps involve mapping all reads to the “known” PhiX genome, and removing all of those sequence reads from the data.

However, your sequencing provider might not have used PhiX. Read the protocol carefully to find out, or simply perform this step in any case.

Attention

We are not going to do this step here, as it has already been done. Please see the Read mapping section on how to map reads against a reference genome.

The process of sequencing DNA via Illumina technology requires the addition of some adapters to the sequences.
These get sequenced as well and need to be removed as they are artificial and do not belong to the species we try to sequence.

Attention

How to do this is explained here; however, we are not going to do it, as our sequences have already been adapter-trimmed.

First, we need to know the adapter sequences that were used during the sequencing of our samples.
Normally, you should ask your sequencing provider, who should be providing this information to you.
Illumina itself provides a document that describes the adapters used for their different technologies.
The FastQC tool, which we will be using later on, also provides a collection of contaminants and adapters.

Second, we need a tool that takes a list of adapters, scans each sequence read, and removes the adapters.
Install fastq-mcf, a tool from the ea-utils suite that is able to do this.

# install
conda install ea-utils

The tool is then run together with an adapter/contaminants list in FASTA format (here denoted as adapters.fa).
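As a rough sketch of such a call, assuming paired-end reads in two files: the read file names (infile_R1.fastq, infile_R2.fastq), the output names, and the single adapter entry written to adapters.fa are placeholders chosen for illustration, not the adapters used for this dataset. fastq-mcf takes the adapter list first, then the read files, with one -o per input file.

```shell
# Hypothetical file names; replace them with your own reads.
adapters=adapters.fa
r1=infile_R1.fastq
r2=infile_R2.fastq

# A one-entry adapter list in FASTA format (example content only;
# ask your sequencing provider which adapters were actually used).
printf '>illumina_universal_adapter\nAGATCGGAAGAGC\n' > "$adapters"

# Run fastq-mcf only if it is installed and the input files exist.
if command -v fastq-mcf >/dev/null 2>&1 && [ -f "$r1" ] && [ -f "$r2" ]; then
    # adapter list first, then the reads; one -o per input file
    fastq-mcf -o cleaned.R1.fq.gz -o cleaned.R2.fq.gz "$adapters" "$r1" "$r2"
else
    echo "fastq-mcf or input files not found; skipping"
fi
```

The guard around the call is only there so the snippet can be pasted safely; in a real run you would call fastq-mcf directly.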

To assess the sequence read quality of the Illumina run we make use of a program called SolexaQA++ [COX2010].
SolexaQA++ was originally developed to work with Solexa data (Solexa has since been bought by Illumina), but it has long supported Illumina data.
It produces nice graphics that intuitively show the quality of the sequences. It is also able to dynamically trim the bad-quality ends off the reads.

From the webpage:

“SolexaQA calculates sequence quality statistics and creates visual
representations of data quality for second-generation sequencing
data. Originally developed for the Illumina system (historically
known as “Solexa”), SolexaQA now also supports Ion Torrent and 454
data.”

curl -O http://compbio.massey.ac.nz/data/203341/SolexaQA.tar.gz
# uncompress the archive
tar -xvzf SolexaQA.tar.gz
# make the file executable
chmod a+x SolexaQA/Linux_x64/SolexaQA++
# copy program to root folder
cp ./SolexaQA/Linux_x64/SolexaQA++ .
# run the program
./SolexaQA++

Should the dynamic trimming not work with SolexaQA++, you can alternatively use Sickle.

conda activate ngs
conda install sickle-trim

Now we are going to run the program on our paired-end data:

# create a new directory
mkdir trimmed
# sickle parameters:
sickle --help
# as we are dealing with paired-end data you will be using "sickle pe"
sickle pe --help
# run sickle like so:
sickle pe -g -t sanger -f data/ancestor-R1.fastq.gz -r data/ancestor-R2.fastq.gz -o trimmed/ancestor-R1.trimmed.fastq.gz -p trimmed/ancestor-R2.trimmed.fastq.gz
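The -t sanger option tells Sickle that the quality characters use the Sanger (Phred+33) encoding, in which the ASCII code of each character minus 33 gives the Phred score Q, and the estimated error probability of the base is 10^(-Q/10). The following small sketch decodes a single quality character by hand; the character 'I' is just an example value.

```shell
# Decode one FASTQ quality character under the Sanger (Phred+33) encoding.
qchar='I'                          # example quality character
ascii=$(printf '%d' "'$qchar")     # ASCII code of the character (73 for 'I')
phred=$((ascii - 33))              # Phred score: 73 - 33 = 40
echo "Phred score: $phred"

# Convert the Phred score into an estimated error probability.
awk -v q="$phred" 'BEGIN { printf "error probability: %g\n", 10^(-q/10) }'
```

Quality trimmers such as Sickle use these per-base scores to decide where the low-quality end of a read begins.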

Hint

Should you be unable to run Sickle or SolexaQA++ at all to trim the data, you can download the trimmed dataset here. Unarchive and uncompress the files with tar -xvzf trimmed.tar.gz.
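Whichever way you obtain the trimmed files, a quick sanity check is to confirm that both files of a pair contain the same number of reads. The sketch below assumes standard four-line FASTQ records and uses a tiny made-up pair in a hypothetical demo directory so that it runs without the course data; the helper name count_reads is our own.

```shell
# Count the reads in a gzip-compressed FASTQ file (4 lines per read).
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Tiny demo pair so the snippet runs without the course data.
mkdir -p demo
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n' | gzip > demo/R1.fastq.gz
printf '@r1\nTTGA\n+\nIIII\n@r2\nCCAT\n+\nIIII\n' | gzip > demo/R2.fastq.gz

n1=$(count_reads demo/R1.fastq.gz)
n2=$(count_reads demo/R2.fastq.gz)
if [ "$n1" -eq "$n2" ]; then
    echo "pairs in sync: $n1 reads each"
else
    echo "read counts differ: $n1 vs $n2"
fi
```

To check the real data, call count_reads on trimmed/ancestor-R1.trimmed.fastq.gz and trimmed/ancestor-R2.trimmed.fastq.gz instead of the demo files.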

FastQC is a very simple program to run that provides information similar to that of SolexaQA++, as well as some additional checks.

From the webpage:

“FastQC aims to provide a simple way to do some quality control
checks on raw sequence data coming from high throughput sequencing
pipelines. It provides a modular set of analyses which you can use
to give a quick impression of whether your data has any problems of
which you should be aware before doing any further analysis.”

The basic command looks like:

fastqc -o RESULT-DIR INPUT-FILE.[txt/fa/fq] ...

-o RESULT-DIR is the directory where the result files will be written

INPUT-FILE.[txt/fa/fq] is the sequence file to analyze; more than one file can be given.
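As an illustration, a run on the two trimmed files from the Sickle step could look like the sketch below. The output directory name trimmed-fastqc is an assumption; FastQC does not create the output directory itself, so it is made first. The guard only keeps the snippet from failing if FastQC or the files are missing.

```shell
# Assumed directory and file names, following the Sickle trimming step.
outdir=trimmed-fastqc
r1=trimmed/ancestor-R1.trimmed.fastq.gz
r2=trimmed/ancestor-R2.trimmed.fastq.gz

# FastQC expects the output directory to exist already.
mkdir -p "$outdir"

# Run FastQC only if it is installed and the input files exist.
if command -v fastqc >/dev/null 2>&1 && [ -f "$r1" ] && [ -f "$r2" ]; then
    fastqc -o "$outdir" "$r1" "$r2"
else
    echo "fastqc or input files not found; skipping"
fi
```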

Hint

The result will be an HTML page per input file that can be opened in a web browser.

Hint

The authors of FastQC made some nice help pages explaining each of the
plots and results you expect to see here.