Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery
and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Category
Metagenomics

Overview

PathSeq is a suite of tools for detecting microbial organisms in deep sequenced biological samples. It is capable of
(1) quantifying microbial abundances in metagenomic samples containing a mixture of organisms, (2) detecting extremely
low-abundance (<0.001%) organisms, and (3) identifying unknown sequences that may belong to novel organisms. The pipeline is based
on a previously published tool of the same name (Kostic et al. 2011), which has been
used in a wide range of studies to investigate novel associations between pathogens and human disease.

The pipeline consists of three phases: (1) removing reads that are low quality, low complexity, or match a given
host (e.g. human) reference, (2) aligning the remaining reads to a microorganism reference, and (3) determining
the taxonomic classification of each read and estimating microbe abundances. These steps can be performed individually using
PathSeqFilterSpark, PathSeqBwaSpark, and PathSeqScoreSpark. To simplify using the pipeline, this tool combines the
three steps into one. Further details can be found in the documentation for the individual tools.

The filtering phase ensures that only high fidelity, non-host reads are classified, thus reducing computational costs
and false positives. Note that while generally applicable to any type of biological sample (e.g. saliva, stool), PathSeq
is particularly efficient for samples containing a high percentage of host reads (e.g. blood, tissue, CSF). PathSeq
is able to detect evidence of low-abundance organisms and scales to use comprehensive genomic database references
(e.g. > 100 Gbp). Lastly, because PathSeq works by identifying both host and known microbial sequences, it can also
be used to discover novel pathogens by reducing the sample to sequences of unknown origin, which may be followed
by de novo assembly.

Because sequence alignment is computationally burdensome, PathSeq is integrated with Apache Spark,
enabling parallelization of all steps in the pipeline on multi-core workstations and cluster environments. This
overcomes the high computational cost and permits rapid turnaround times (minutes to hours) in deep sequenced samples.

Reference files

Before running the PathSeq pipeline, the host and microbe references must be built. Prebuilt references
for a standard microbial set are available in the
GATK Resource Bundle.

To build custom references, users must provide FASTA files of the host and pathogen sequences. Tools are included to
generate the necessary files: the host k-mer database (PathSeqBuildKmers), BWA-MEM index image files of the host and
pathogen references (BwaMemIndexImageCreator), and a taxonomic tree of the pathogen reference (PathSeqBuildReferenceTaxonomy).

Output

Taxonomic scores table

Annotated BAM aligned to the microbe reference

Filter metrics file (optional)

Score metrics file (optional)

Usage example

This tool can be run without explicitly specifying Spark options. That is to say, the given example command
without Spark options will run locally. See
Tutorial#10060 for an example
of how to set up and run a Spark tool on a cloud Spark cluster.

Note that the host and microbe BWA images must be copied to the same paths on every worker node. The microbe FASTA,
host k-mer file, and taxonomy file may also be copied to a single path on every worker node or to HDFS.

References

PathSeqPipelineSpark specific arguments

The command-line arguments that are specific to this tool are described in the list below.

maximum number of bytes to read from a file into each partition of reads. Setting this higher will result in fewer partitions. Note that this will not be equal to the size of the partition in memory. Defaults to 0, which uses the default split size (determined by the Hadoop input format, typically the size of one HDFS block).

One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite).
This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the
command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals
(e.g. -XL myFile.intervals).

Minimum seed length for the host BWA alignment.
Controls the sensitivity of BWA alignment to the host reference. Shorter seed lengths will enhance detection of
host reads during the subtraction phase but will also increase run time.

Estimated reads per partition after quality, kmer, and BWA filtering
This is a parameter for fine-tuning memory performance. Lower values may result in less memory usage but possibly
at the expense of greater computation time.

Host alignment identity score threshold, in bp
Controls the stringency of read filtering based on alignment to the host reference. The identity score is defined
as the number of matching bases less the number of deletions in the alignment.

Identity margin, as a fraction of the best hit (between 0 and 1).
For reads having multiple alignments, the best hit is always counted as long as it is above the identity score
threshold. Any additional hit is counted if its identity score is within this fraction of the best hit's score.

For example, consider a read that aligns to two different sequences, one with identity score 0.90 and the other with
0.85. If the minimum identity score is 0.7, the best hit (with score 0.90) is counted. In addition, if the identity margin is 10%,
then any additional alignments at or above 0.90 * (1 - 0.10) = 0.81 would also be counted. Therefore in this example the second
alignment with score 0.85 would be counted.
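
The worked example above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the tool's implementation: the function name counted_hits and its parameters are hypothetical, and it assumes that additional hits must also pass the minimum identity threshold.

```python
def counted_hits(scores, min_identity, identity_margin):
    """Return the identity scores that would be counted for one read.

    scores          -- identity scores of the read's alignments (0 to 1)
    min_identity    -- minimum identity score threshold (0 to 1)
    identity_margin -- margin as a fraction of the best hit (0 to 1)
    """
    if not scores:
        return []
    best = max(scores)
    # The best hit is counted only if it passes the identity threshold.
    if best < min_identity:
        return []
    # Additional hits must fall within the margin of the best hit.
    cutoff = best * (1.0 - identity_margin)
    # Assumption: hits below the minimum identity are never counted.
    return [s for s in scores if s >= cutoff and s >= min_identity]


# Scores 0.90 and 0.85, minimum identity 0.7, margin 10% (cutoff 0.81).
print(counted_hits([0.90, 0.85], 0.7, 0.10))  # [0.9, 0.85]
```

Lowering the margin narrows the window: with a margin of 2%, the cutoff becomes 0.882 and only the best hit would be counted.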

Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a
padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.

Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not
actually overlap) into a single continuous interval. However you can change this behavior if you want them to be
treated as separate intervals instead.

The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:

Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a
padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
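
The padding arithmetic for both -L and -XL intervals can be illustrated with a short Python sketch. The helper pad_interval is hypothetical and handles only the explicit 'contig:start' and 'contig:start-end' forms; it clamps the start at position 1 but, as a simplifying assumption, does not clamp the end to the contig length, since that would require the sequence dictionary.

```python
def pad_interval(interval, padding):
    """Pad a samtools-style interval such as '1:100' or '1:100-200'."""
    contig, _, span = interval.rpartition(":")
    if not contig:
        raise ValueError(f"expected 'contig:start[-end]', got {interval!r}")
    start_text, _, end_text = span.partition("-")
    start = int(start_text)
    # A single position is treated as an interval of length one.
    end = int(end_text) if end_text else start
    # Coordinates are 1-based, so the padded start cannot go below 1.
    padded_start = max(1, start - padding)
    return f"{contig}:{padded_start}-{end + padding}"


print(pad_interval("1:100", 20))  # 1:80-120
```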

Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can
change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to
perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule
INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will
always be merged using UNION).
Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.

The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:

UNION

Take the union of all intervals

INTERSECTION

Take the intersection of intervals (the subset that overlaps all intervals specified)
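
As a rough illustration of how the two rules and the -XL subtraction interact, the following Python sketch expands intervals into sets of positions. This is a conceptual model only: the names positions and resolve are hypothetical, and expanding every position is practical only for small examples.

```python
def positions(intervals):
    """Expand (contig, start, end) intervals (1-based, inclusive) into positions."""
    return {(contig, pos)
            for contig, start, end in intervals
            for pos in range(start, end + 1)}


def resolve(include_sets, exclude_sets, rule="UNION"):
    """Combine -L interval sets with the given rule, then subtract -XL sets."""
    included = [positions(s) for s in include_sets]
    if rule == "UNION":
        kept = set().union(*included)
    elif rule == "INTERSECTION":
        kept = set.intersection(*included) if included else set()
    else:
        raise ValueError(f"unknown rule: {rule}")
    # -XL sets are always merged with UNION and subtracted from the -L set.
    excluded = set().union(*(positions(s) for s in exclude_sets))
    return kept - excluded


exomes = [("1", 100, 110), ("2", 50, 60)]   # stand-in for exomes.intervals
chr1 = [("1", 1, 1000)]                     # stand-in for -L 1
on_chr1 = resolve([exomes, chr1], [], rule="INTERSECTION")
print(sorted(pos for _, pos in on_chr1))    # positions 100..110 on contig 1
```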

Max allowable number of masked bases per read
This is the threshold for filtering reads based on the number of 'N' values present in the sequence. Note that
the low-complexity DUST filter and quality filter mask using 'N' bases. Therefore, this parameter is the threshold
for the sum of:

Minimum length of reads after quality trimming
Reads are trimmed based on base call quality and low-complexity content. Decreasing the value will enhance pathogen
detection (higher sensitivity) but also result in undesired false positives and ambiguous microbe alignments
(lower specificity).

Alignment identity score threshold, as a fraction of the read length (between 0 and 1).
This parameter controls the stringency of the microbe alignment. The identity score is defined as the
number of matching bases minus the number of deletions. Alignments with a score below this threshold will be ignored.
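
The threshold check described above amounts to a simple comparison, sketched here in Python. The function name and parameters are hypothetical; in practice the match and deletion counts would be derived from each alignment record.

```python
def passes_min_score_identity(matches, deletions, read_length, min_score_identity):
    """Return True if an alignment meets the identity threshold.

    The identity score is the number of matching bases minus the number of
    deletions; the threshold is a fraction of the read length.
    """
    identity_score = matches - deletions
    return identity_score >= min_score_identity * read_length


# A 100 bp read with 93 matches and 2 deletions has an identity score of 91,
# which passes a threshold of 0.90 (90 bp).
print(passes_min_score_identity(93, 2, 100, 0.90))  # True
```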

Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically.

Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --read-validation-stringency argument is an enumerated type (ValidationStringency), which can have one of the following values:

Number of reads per partition for output. Use this to control the number of sharded BAMs (not --num-reducers).
Because numReducers is based on the input size, it causes too many partitions to be produced when the output size is much smaller.

Estimated reads per Spark partition for scoring
This parameter is for fine-tuning memory performance. Lower values may result in less memory usage but possibly
at the expense of greater computation time.

Skip pre-BWA repartition. Set to true for inputs with a high proportion of microbial reads that are not host coordinate-sorted.

Advanced optimization option that should be used only in the case of inputs with a high proportion of microbial
reads that are not host-aligned/coordinate-sorted.

In the filter tool, the input reads are initially divided up into smaller partitions (default size is usually
the size of one HDFS block, or ~64MB) that Spark works on in parallel. In samples with a low proportion of microbial
reads (e.g. < 1%), the steps leading up to the host BWA alignment will whittle these partitions down to a small
fraction of their original size. At that point, the distribution of reads across the partitions may be unbalanced.

For example, say the input is 256MB and Spark splits this into 4 even partitions. It is possible that, after
running through the quality filters and host kmer search, there are 5% remaining in partition #1, 8% in partition #2,
2% in partition #3, and 20% in partition #4. Thus there is an imbalance of work across the partitions. To
correct this, a "repartitioning" is invoked that distributes the reads evenly. Note this is especially important
for host-aligned, coordinate-sorted inputs, in which unmapped reads would be concentrated in the last partitions.

If, however, the proportion of microbial reads is higher, say 30%, then the partitions are generally more
balanced (except in the aforementioned coordinate-sorted case). In this case, the time spent doing
the repartitioning is usually greater than the time saved by rebalancing, and this option should be invoked.
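
To make the 256MB example concrete, the Python sketch below computes the data left in each partition and a simple imbalance ratio (largest partition divided by the mean). The function names are hypothetical, and the ratio is only a rough proxy for wasted time, under the assumption that a stage finishes when its largest partition does.

```python
def remaining_mb(input_mb, remaining_fractions):
    """Data left in each partition after filtering, for an evenly split input."""
    per_partition = input_mb / len(remaining_fractions)
    return [per_partition * f for f in remaining_fractions]


def imbalance(sizes):
    """Ratio of the largest partition to the mean partition size."""
    return max(sizes) / (sum(sizes) / len(sizes))


def rebalance(sizes):
    """Idealized repartitioning: spread the same data evenly."""
    mean = sum(sizes) / len(sizes)
    return [mean] * len(sizes)


# 256MB in 4 partitions, with 5%, 8%, 2%, and 20% of reads remaining.
sizes = remaining_mb(256, [0.05, 0.08, 0.02, 0.20])
print([round(s, 2) for s in sizes])            # [3.2, 5.12, 1.28, 12.8]
print(round(imbalance(sizes), 2))              # 2.29
print(round(imbalance(rebalance(sizes)), 2))   # 1.0
```

In this example the largest partition holds more than twice the average amount of data, which is the situation the repartitioning step is meant to correct.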