While the accuracy and precision of deep sequencing data are
significantly better than those obtained by the earlier generation of
hybridization-based high throughput technologies, the digital nature
of deep sequencing output often leads to unwarranted confidence in
its reliability.

Next-generation sequencing platforms have their own share of quality
issues, and there can be significant lab-to-lab, batch-to-batch and
even within-chip/slide variation.

The NGSQC pipeline provides a set of novel quality control measures
for quickly detecting a wide variety of quality issues in deep
sequencing data derived from two dimensional surfaces, regardless of
the assay technology used. It also enables researchers to determine
whether the sequences underlying their most interesting biological
discoveries are affected by sequencing quality issues. NGSQC can help
to ensure that biological conclusions, in particular those based on
relatively rare sequences, are not artifacts of low quality
sequencing.

The following is a list of example graphic outputs of our pipeline and
their explanations:

Full Sample View:
Provides a sample-level overview of several QC measures, including
the distribution of base/color codes, genomic or other target hit
counts under different mismatch counts and target hit levels (unique
or multiple), sequencing read density and quality score, based on the
corresponding average values for each tile/panel used by a sample.
The results are presented in the same spatial layout as the deep
sequencing assay to facilitate quick identification of
trends/patterns of quality issues in the whole sample assay.

The Full Sample View includes the following graphs:

Full sample base/color code bias graphs: The heatmap color is
determined by the percentage of the specific base/color code (A, C,
T, G, 0, 1, 2, 3, or N for an unreadable base) in each tile of the
corresponding graph.

Full sample quality score graph (qual_mean): The heatmap is created
using the average quality score of all bases/color codes from a
tile/panel.

Full sample read count graph (read_count): The total sequence read
count in each tile/panel is used to generate this heatmap.
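
As a rough illustration of the per-tile values behind these heatmaps,
the following Python sketch computes the read count, mean quality
score and base/color code percentages for each tile. The input layout
(tile id, sequence, list of quality scores) and the function name
summarize_tiles are assumptions made for this example; they are not
taken from the NGSQC code.

```python
from collections import Counter, defaultdict


def summarize_tiles(reads):
    """Summarize reads per tile.

    reads: iterable of (tile_id, sequence, quality_scores) tuples.
    This record layout is hypothetical and used only for illustration.
    """
    read_count = Counter()
    qual_sum = defaultdict(float)
    qual_n = Counter()
    code_counts = defaultdict(Counter)

    for tile, seq, quals in reads:
        read_count[tile] += 1
        qual_sum[tile] += sum(quals)
        qual_n[tile] += len(quals)
        code_counts[tile].update(seq)  # counts A, C, G, T, N or 0-3

    summary = {}
    for tile in read_count:
        n_codes = sum(code_counts[tile].values())
        summary[tile] = {
            "read_count": read_count[tile],
            # average over all bases/color codes in the tile
            "qual_mean": qual_sum[tile] / qual_n[tile] if qual_n[tile] else 0.0,
            "code_percent": {
                code: 100.0 * n / n_codes
                for code, n in code_counts[tile].items()
            },
        }
    return summary
```

Each value in the returned dictionary corresponds to one heatmap cell
when the tiles are drawn in their physical layout.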

All Tiles/Panels Summary View:
Presents the above quality measures pooled from all tiles/panels and
arranged by individual x-y location on the two dimensional tile/panel
surface. This set of QC graphs is designed for detecting QC problems
that are repeated in every tile/panel, such as optical setup issues.

The All Tiles/Panels Summary View includes the following graphs:

All tiles/panels base/color code bias graphs: The heatmap color is
determined by the percentage of the specific base/color code (A, C,
T, G, 0, 1, 2, 3, or N for an unreadable base) at each x-y coordinate
from all tiles/panels of the corresponding sample.

All tiles/panels genome hit graphs: The heatmap color is based on the
number of genome hits of the specific type (multiple or unique hits
with 0, 1 or 2 mismatches) at each x-y coordinate from all
tiles/panels. The genomehit_overall graph includes both multiple and
unique hits with <=2 mismatches in the same graph. The default
multiple hit limit is <=10 hits on the target genome.

All tiles/panels quality score graph (qual_mean): The heatmap is
created using the average quality score of all bases/color codes at
each x-y coordinate from all tiles/panels in a sample.

All tiles/panels read count graph (read_count): The total sequence
read count at each x-y coordinate from all tiles/panels is used to
generate this heatmap.
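
The pooling across tiles/panels described above can be sketched as
follows. Reads from every tile are assigned to a cell by their x-y
coordinates alone, so a defect that recurs at the same position in
every tile accumulates in one cell. The input layout, the bin_size
parameter and the function name pool_by_position are assumptions made
for this example, not NGSQC internals.

```python
from collections import defaultdict


def pool_by_position(reads, bin_size=100):
    """Pool reads from all tiles by their x-y position within a tile.

    reads: iterable of (tile_id, x, y, quality_scores) tuples
           (hypothetical layout; x and y are integer coordinates).
    bin_size: edge length of one heatmap cell, in coordinate units
              (illustrative default).
    """
    read_count = defaultdict(int)
    qual_sum = defaultdict(float)
    qual_n = defaultdict(int)

    for _tile, x, y, quals in reads:
        # The tile id is deliberately ignored: only the position counts.
        cell = (x // bin_size, y // bin_size)
        read_count[cell] += 1
        qual_sum[cell] += sum(quals)
        qual_n[cell] += len(quals)

    return {
        cell: {
            "read_count": n,
            "qual_mean": qual_sum[cell] / qual_n[cell] if qual_n[cell] else 0.0,
        }
        for cell, n in read_count.items()
    }
```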

Individual tile/panel QC:
Individual tile QC maps can be used for identifying quality issues
in individual tiles. To facilitate quick identification of
problematic tiles/panels, we rank tiles/panels by the unevenness of
two measures, the read count and the genomic hit count, across the
x-y coordinates of each tile/panel. Currently we use a simple fixed
grid for detecting unevenness.
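
One way a fixed-grid unevenness ranking could work is sketched below.
The text does not specify the score NGSQC uses, so this example
substitutes the coefficient of variation of the per-cell counts; the
grid size and the function names are likewise assumptions.

```python
from statistics import mean, pstdev


def unevenness(points, width, height, grid=4):
    """Score how unevenly points are spread over a grid x grid layout.

    points: iterable of (x, y) positions of reads (or hits) on one tile.
    Returns the coefficient of variation of the per-cell counts:
    0.0 for a perfectly even tile, larger values for patchier tiles.
    This score is an illustrative choice, not NGSQC's own measure.
    """
    counts = [[0] * grid for _ in range(grid)]
    for x, y in points:
        col = min(int(x * grid / width), grid - 1)
        row = min(int(y * grid / height), grid - 1)
        counts[row][col] += 1
    flat = [c for row in counts for c in row]
    avg = mean(flat)
    return pstdev(flat) / avg if avg else 0.0


def rank_tiles(tiles, width, height, grid=4):
    """Return (tile_id, score) pairs, most uneven tile first.

    tiles: dict mapping tile_id to a list of (x, y) positions.
    """
    scores = {
        tile: unevenness(points, width, height, grid)
        for tile, points in tiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The same function can be applied once to read positions and once to
genomic hit positions to obtain the two rankings.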

Cycle-based QC plot:
Shows the average of the quality measures from all tiles/panels, as
well as from rows and columns of tiles/panels, plotted against the
base/color position in the sequence reads. The cycle-based plots for
all tiles/panels are designed to provide an overview of cycle-related
quality variations for all sequence reads in the sample. The plots
for individual columns and rows are for detecting outlier tile/panel
columns/rows. These graphs help users identify not only sequencing
cycle-specific issues but also spatial issues related to tile/panel
rows and columns.

Cycle-based base/color bias plots: For detecting base biases in the
sequencing process.
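
The per-cycle values behind both kinds of plot can be sketched as
below: for each position in the reads, the mean quality score and the
percentage of each base/color code. The input layout and the function
name cycle_profile are assumptions made for this example. Restricting
the input to reads from one tile/panel row or column would give the
corresponding row or column plot.

```python
from collections import Counter


def cycle_profile(reads):
    """Compute per-cycle mean quality and base/color code percentages.

    reads: iterable of (sequence, quality_scores) pairs of equal length
           (hypothetical layout). Reads may differ in length.
    Returns one dict per cycle, in cycle order.
    """
    qual_sum = []
    code_counts = []
    for seq, quals in reads:
        for i, (code, q) in enumerate(zip(seq, quals)):
            if i == len(qual_sum):  # first read reaching this cycle
                qual_sum.append(0.0)
                code_counts.append(Counter())
            qual_sum[i] += q
            code_counts[i][code] += 1

    profile = []
    for total_q, counts in zip(qual_sum, code_counts):
        n = sum(counts.values())
        profile.append({
            "qual_mean": total_q / n,
            "code_percent": {c: 100.0 * k / n for c, k in counts.items()},
        })
    return profile
```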

Target hit plot:
These graphs present sequence alignment results across the target
genome or transcriptome sequences. The x-axis is the target location
scaled to the display. The y-axis is the sequence count at each
target location on the positive strand (positive values) and on the
negative strand (negative values). They are useful for identifying
uneven distribution of sequences on the targets, and can help to
identify sequence structural differences between the sample and the
reference genome/transcriptome.
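
A minimal sketch of the data behind such a plot: hit positions are
scaled into a fixed number of display bins, with positive-strand hits
counted upwards and negative-strand hits counted downwards. The input
layout, the n_bins parameter and the function name strand_profile are
assumptions made for this example.

```python
def strand_profile(hits, target_length, n_bins=10):
    """Bin alignment hits along a target, separately for each strand.

    hits: iterable of (position, strand) with 0-based integer positions
          and strand given as "+" or "-" (hypothetical layout).
    Returns (plus, minus): two lists of length n_bins. Values in plus
    are >= 0 and values in minus are <= 0, matching the plot's y-axis.
    """
    plus = [0] * n_bins
    minus = [0] * n_bins
    for position, strand in hits:
        # Scale the target coordinate to a display bin.
        b = min(position * n_bins // target_length, n_bins - 1)
        if strand == "+":
            plus[b] += 1
        else:
            minus[b] -= 1
    return plus, minus
```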

QC for user defined sequence lists (Link 1 and Link 2):
If a user analyzes lists of sequences related to specific biological
conclusions, the resulting QC data will be listed under the above
categories in the output. The side-by-side presentation of the user
defined sequences with the corresponding QC graphs from all
sequences enables users to quickly detect whether sequences related
to a specific biological conclusion are from low quality regions of
sequencing.

Library Overview:
The paired-end/mate pair library overview graph presents, in a bar
chart, the percentage of good pairs (correct orientation on the same
chromosome), unpaired reads, chimeric pairs from different
chromosomes, chimeric pairs with wrong orientation on the same
chromosome, and chimeric pairs whose distance is less than or greater
than the user defined library fragment range.
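
The categories in the bar chart can be illustrated with the following
sketch. It assumes that "correct orientation" means the two ends align
to opposite strands, which depends on the library type and may not
match the rule NGSQC applies; the record layout, category labels and
function names are also assumptions.

```python
from collections import Counter


def classify_pair(end1, end2, min_dist, max_dist):
    """Assign one read pair to a library overview category.

    end1, end2: (chromosome, start, strand) tuples, or None if that end
                did not align (hypothetical layout).
    min_dist, max_dist: user defined library fragment range.
    """
    if end1 is None or end2 is None:
        return "unpaired"
    chrom1, start1, strand1 = end1
    chrom2, start2, strand2 = end2
    if chrom1 != chrom2:
        return "chimeric_different_chromosome"
    if strand1 == strand2:  # assumption: opposite strands are correct
        return "chimeric_wrong_orientation"
    distance = abs(start1 - start2)
    if distance < min_dist:
        return "chimeric_below_range"
    if distance > max_dist:
        return "chimeric_above_range"
    return "good"


def library_overview(pairs, min_dist, max_dist):
    """Return the percentage of pairs falling into each category."""
    counts = Counter(
        classify_pair(a, b, min_dist, max_dist) for a, b in pairs
    )
    total = sum(counts.values())
    return {k: 100.0 * n / total for k, n in counts.items()} if total else {}
```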

Distance Distribution:
The pair distance distribution plot can be used to judge whether the
matched paired-end/mate pair reads exhibit the correct distance
distribution. The distance between the two ends of each good pair is
calculated as the starting position of the first end minus the
starting position of the second end, so the distance values can be
negative. We also plot pairs hitting different strands of the target
separately. As a result, the pair distance distribution plot can also
be used to detect strand bias.
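
The signed distance described above can be sketched as follows. The
histograms are kept separately per strand; keying them by the strand
of the first end is an assumption of this example, as are the input
layout and the function name pair_distances.

```python
from collections import Counter


def pair_distances(pairs):
    """Build signed pair-distance histograms, one per strand.

    pairs: iterable of (first_start, second_start, first_strand) for
           good pairs, with strand given as "+" or "-"
           (hypothetical layout).
    Returns a dict mapping strand to a Counter of distance -> count.
    """
    histograms = {"+": Counter(), "-": Counter()}
    for first_start, second_start, first_strand in pairs:
        # First end minus second end, so the value may be negative.
        distance = first_start - second_start
        histograms[first_strand][distance] += 1
    return histograms
```

Comparing the total counts and the shapes of the two histograms gives
a quick indication of strand bias.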

The following software is required to run the NGSQC pipeline:
1. gnuplot. Most Linux distributions provide it in their repositories; it only needs to be installed.
2. bowtie. Available from http://bowtie-bio.sourceforge.net/index.shtml
3. sed and awk. Most Linux distributions include them in the default installation.
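
To check by hand whether these tools are on the PATH, a small Python
sketch such as the following can be used. It is independent of the
pipeline's own 'make check' target and only tests that each
executable can be found; the function name missing_tools is an
assumption of this example.

```python
import shutil

# Executable names taken from the requirements list above.
REQUIRED_TOOLS = ["gnuplot", "bowtie", "sed", "awk"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

Calling missing_tools() returns an empty list when everything is
installed, and otherwise the names that still need to be installed.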

A simple usage example:
1. Download 'Pipeline with Sample data' and extract it.
2. Go to the folder 'ngsqc_<VERSION>'.
3. Run 'make check' to check whether all required software is installed.
4. Run 'make'.