This uses a standard template (GATK best practice variant calling)
to automate creation of a full configuration for all samples. See
Automated sample configuration for more details on running the
script, and manually edit the base template or final output
file to incorporate project specific configuration. The example
pipelines provide a good starting point and the
Sample information documentation has full details on
available options.
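The templating step is driven by a CSV file of sample metadata plus your input files, invoked as something like `bcbio_nextgen.py -w template gatk-variant project1.csv sample1.bam sample2_1.fq sample2_2.fq` (project and file names here are placeholders). A minimal sketch of the CSV, assuming two samples in one batch:

```
samplename,description,batch
sample1.bam,Sample1,b1
sample2,Sample2,b1
```

The first column matches an input file name, and the remaining columns carry per-sample settings that end up in the generated configuration.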

Run analysis, distributed across 8 local cores:

bcbio_nextgen.py bcbio_sample.yaml -n 8

Read the Configuration documentation for full details on
adjusting both the sample and system configuration files to match
your experiment and computational setup.
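As a sketch of what a sample configuration looks like (the sample names, genome build, and algorithm choices below are illustrative, not a recommendation):

```yaml
details:
  - files: [sample1_R1.fastq.gz, sample1_R2.fastq.gz]
    description: Sample1
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      variantcaller: gatk-haplotype
upload:
  dir: ../final
```

The algorithm section is where most experiment-specific adjustments happen; the Sample information documentation lists the available keys.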

This project structure keeps the input configuration in the config directory, the
outputs of the pipeline in the final directory, and the actual processing in the
work directory. Run the bcbio_nextgen.py script from inside the work
directory to keep all intermediates there. The final directory, relative to
the parent directory of the work directory, is the default location
specified in the example configuration files and gets created during
processing. The final directory has all of the finished outputs and you can
remove the work intermediates to cleanup disk space after confirming the
results. All of these locations are configurable and this project structure is
only a recommendation.
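A sketch of the recommended project structure described above:

```
project/
├── config/     # bcbio_sample.yaml and other input configuration
├── work/       # run bcbio_nextgen.py from here; intermediates accumulate here
│   └── log/    # logging files
└── final/      # finished outputs, created during processing
```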

There are three log files in the log directory within your working folder:

bcbio-nextgen.log High level logging information about the analysis.
This provides an overview of major processing steps and useful
checkpoints for assessing run times.

bcbio-nextgen-debug.log Detailed information about processes
including stdout/stderr from third party software and error traces
for failures. Look here to identify the status of running pipelines
or to debug errors. It labels each line with the hostname of the
machine it ran on to ease debugging in distributed cluster
environments.

bcbio-nextgen-commands.log Full command lines for all third
party software tools run.
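Because bcbio-nextgen-commands.log stores one complete command line per tool invocation, standard text tools can pull out exact commands for debugging or re-running. A hypothetical illustration with a mocked-up log file (the log contents below are invented for the example):

```shell
# Create a mock commands log; real runs write this file automatically
# under work/log/. The two command lines here are made-up examples.
mkdir -p log
printf '%s\n' \
  'bwa mem -t 8 genome.fa sample_1.fq sample_2.fq' \
  'samtools sort -o sample.bam sample.sam' \
  > log/bcbio-nextgen-commands.log

# Recover the exact aligner invocation from the log.
grep 'bwa mem' log/bcbio-nextgen-commands.log
```

The same approach works against the debug log, where each line also carries the hostname of the machine it ran on.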

This is a large whole genome analysis meant to test both pipeline scaling
and validation across the entire genome. It can take multiple days to run
depending on available cores. It requires 300GB for the input files and 1.3TB
for the work directory. Smaller examples below exercise the pipeline with
less disk and computational requirements.

We also have a more extensive evaluation that includes two additional variant
callers, Platypus and samtools, and three different methods of calling variants:
single sample, pooled, and incremental joint calling. This uses the same input
data as above but a different input configuration file:

This example calls variants on NA12878 exomes from EdgeBio’s
clinical sequencing pipeline, and compares them against reference
materials from NIST’s Genome in a Bottle initiative. This supplies
a full regression pipeline to ensure consistency of calling between
releases and updates of third party software. The pipeline performs
alignment with bwa mem and variant calling with FreeBayes, GATK
UnifiedGenotyper and GATK HaplotypeCaller. Finally, it integrates all three
variant calling approaches into a combined ensemble callset.

This is a large full exome example with multiple variant callers, so it
can take more than 24 hours on machines using multiple cores.

First get the input configuration file, fastq reads, reference materials and analysis regions:

This example calls variants using multiple approaches in a paired tumor/normal
cancer sample from the ICGC-TCGA DREAM challenge. It uses synthetic dataset 3 which has multiple
subclones, enabling detection of lower frequency variants. Since the dataset is
freely available and has a truth set, this allows us to do a full evaluation of
variant callers.

The configuration and data retrieval script include downloads for exome-only
and whole genome analyses. Exome is enabled by default, but you can use the larger whole genome
evaluation by uncommenting the relevant parts of the configuration and retrieval
script.

This example simulates somatic cancer calling using a mixture of two Genome in a
Bottle samples, NA12878 as the “tumor” mixed with NA24385 as the background.
The Hartwig Medical Foundation
and Utrecht Medical Center generated this
“tumor/normal” pair by physical mixing of samples prior to sequencing. The GiaB
FTP directory has more details on the design and truth sets.
The sample has variants at 15% and 30%, providing the ability to look at lower
frequency mutations.

This example aligns and creates count files for use with downstream analyses
using a subset of the SEQC data from the FDA’s Sequencing Quality Control project.

Get the setup script and run it; this will download six samples from
the SEQC project, three from the HBRR panel and three from the UHRR
panel. This will require about 100GB of disk space for these input
files. It will also set up a configuration file for the run, using
the templating system:

This will run a full-scale RNAseq experiment using Tophat2 as the
aligner and will take a long time to finish on a single machine. At
the end it will output counts, Cufflinks quantitation and a set of QC
results about each lane. If you have a cluster you can parallelize it
to speed it up considerably.

A nice-looking standalone report of the bcbio-nextgen run can be generated using
bcbio.rnaseq. Check that repository for details.

Validate variant calling on human genome build 38, using two different builds
(with and without alternative alleles) and three different validation datasets
(Genome in a Bottle prepared with two methods and Illumina platinum genomes).
To run:

The test suite exercises the scripts driving the analysis, so it is a
good starting point to ensure correct installation. Tests use the
pytest framework. The tests are available in the bcbio source code:

$ git clone https://github.com/chapmanb/bcbio-nextgen.git

There is a small wrapper script that finds py.test and the other dependencies
pre-installed with bcbio; you can use it to run tests:

Optionally, you can run pytest directly from the bcbio install to tweak more
options. It will be in /path/to/bcbio/anaconda/bin/py.test. Pass
-s to py.test to see the stdout log, and -v to make py.test output more
verbose. The tests are marked with labels which you can use to run a
specific subset of the tests using the -m argument:

$ py.test -m rnaseq

To run unit tests:

$ py.test tests/unit

To run integration pipeline tests:

$ py.test tests/integration

To run tests which use bcbio_vm:

$ py.test tests/bcbio_vm

To see the test coverage, add the --cov=bcbio argument to py.test.

By default the test suite will use your installed system configuration
for running tests, substituting the test genome information instead of
using full genomes. If you need a specific testing environment, copy
tests/data/automated/post_process-sample.yaml to
tests/data/automated/post_process.yaml to provide a test-only
configuration.