Bioconductor Newsletter

Welcome to the January 2015 newsletter. This is a quarterly look behind the
scenes at recent developments and exploratory work going on in the core group
and Bioconductor community. This issue covers Docker containers, work on
coordinate mapping, changes to the algorithm underlying overlap operations and
an overview of csaw, the most recent package contribution from the Smyth
group. We want this newsletter to be valuable to you so please share comments
and feedback.

Infrastructure Developments

Docker

Anyone who has attended the annual BioC conference or participated in a
class at the Hutchinson Center is familiar with the Bioconductor Amazon
Machine Images (AMI). Dan configures these images with the current
version of R and all necessary package dependencies and non-R software.
Providing a pre-configured environment to participants has eliminated
many “day of” problems such as internet overload due to concurrent downloads,
installation of the wrong package version and the inability to access common
sample data used in the course.

Recently Dan has been looking into the Docker software as another approach to
providing pre-configured environments. The Docker containers operate using
process level isolation and are therefore more lightweight than a virtual machine.
Additionally the containers have access to the full physical machine.

These containers are isolated, reproducible environments useful for development
or production. In contrast to the AMI, a Docker container could be used on a
desktop or laptop instead of the cloud or other remote hardware. We envision
them being useful for

reproducibility: providing identical, reproducible environments

convenience: deploying on any (Linux, Mac, or Windows) machine, or in
the cloud via Amazon’s Elastic Container Service (ECS)

time saver; instead of waiting for packages to compile on Linux, you can
download a container in which packages are already installed

This work is exploratory and not yet supported. There will be an announcement
on the mailing list when containers are ready for use. Those interested
in following the development can visit the
bioc_docker GitHub repository.

NCList

One of Hervé’s recent projects was replacing the interval tree-based
algorithm used in finding and counting overlaps. The change was motivated by
decreased performance observed when comparing many different chromosomes or many
overlapping ranges. He decided on an approach based on the Nested Containment
List algorithm by Alexander V. Alekseyenko and Christopher J. Lee.

In BioC 3.1 (current devel) overlaps operations on GRanges and/or GRangesList
objects are approximately 3x to 10x faster than in BioC 3.0 for a medium size
data set (e.g. 25 million reads). Memory usage is also reduced by ~ 25% or more.
Numbers will vary depending on the size of the data; the larger the data the
greater the improvement.

The user-visible change is the ‘algorithm’ argument added to most overlap-based
operations. This allows a choice between the new (algorithm=”nclist”) and the
old (algorithm=”intervaltree”).

At the time of this writing, there are 3 situations where “nclist” and
“intervaltree” produce different output:

With ‘nclist’, zero-width ranges are interpreted as insertion points and are
considered to overlap with ranges that contain them.

When using ‘select=”arbitrary”’, the ‘nclist’ algorithm will generally
not select the same hits as the old algorithm.

When using ‘intervaltree’, the ‘maxgap’ argument has a special
meaning if ‘type’ is “start”, “end”, or “within”. This is not yet
the case with the new ‘nclist’.

For a complete description of changes and future activity please see this
post
on the bioc-devel mailing list.

Coordinate Mapping

Translating (mapping) coordinates between genome, transcript and protein space
is a common task in bioinformatics. Over the past quarter a group of us have
been working to expand and harmonize mapping capabilities in the infrastructure.

Functions mapping between genome and the transcriptome will be added to
GenomicFeatures and methods for mapping via a CIGAR alignment will be added to
GenomicAlignments. The Pbase package has a mapping
vignette
that demonstrates the steps involved when mapping from protein to genomic
position. Once the API for the transcriptome and alignment mapping methods is
stable, similar functions can be added to Pbase to round out the suite of
mappers.

Others involved in the project are Michael Lawrence, Hervé Pagès,
Laurent Gatto, Robert Castelo and Martin Morgan. Discussions and progress can be
followed at the
biocCoordinateMapping
Google Group.

A related mapping task is migrating data from one assembly to another
either via alignment tool or by converting assembly coordinates. The
implementation of the UCSC LiftOver tool in rtracklayer
(thanks Michael) and the availability of the UCSC chain files in
AnnotationHub (thanks Sonali) make this a straightforward operation.

Another example of using liftOver() is in the
Changing genomic coordinate systems workflow. SNPS from the NHGRI GWAS catalogue are mapped from hg38
to hg19 resulting in a few lost loci.

Overview of the csaw package

Gordon Smyth is a professor at Walter and Eliza Hall Institute of Medical
Research in Victoria and has been active member in the Bioconductor project
since inception. Many know Gordon from his detailed responses on the
support site where he provides
statistical guidance and thoughtful discussion to the novice and advanced
user alike. The Smyth group has authored a number of
Bioconductor packages including cornerstone contributions such as limma and
edgeR. limma is consistently in the top 10 and edgeR in the
top 25 of the
package download stats.

A trademark of a Smyth group package is a well-written vignette with detailed
scientific and statistical background. These documents are an excellent starting
point for anyone new to microarray or RNA-seq analysis. The most recent
contribution from the group is the csaw package (ChIP-seq analysis with
windows) by Aaron Lun. I asked Aaron and Gordon if they would answer a few
questions about the package.

Aaron Lun and Gordon Smyth on the csaw package:

Q: What is differential binding (DB) in ChIP-seq and how does this
differ from differential expression (DE) in RNA-seq?

As most readers will know, ChIP-seq sequences genomic DNA that is bound to a
target protein whereas RNA-seq sequences RNA transcripts. From a scientific
point of view, RNA-seq measures gene expression whereas ChIP-seq is used to
examine the regulation of gene expression. ChIP-seq is often used for example
to find the binding sites of a transcription factor (TF) or to examine the
positioning of an epigenetic histone mark across the genome. Many people
analyze ChIP-seq data by calling peaks in each individual ChIP-seq library.
These peaks are then taken to identify where the TF binds or where epigenetic
marking is active. In the csaw package, however, we envisage ChIP-seq
experiments with multiple experimental conditions and with multiple biological
replicates within each condition, in other words with a structure much like a
typical gene expression experiment. We focus on testing for DB in a quantitative
way between the experimental conditions rather than on making present/absent
calls. ChIP-seq and RNA-seq experiments both produce short sequence reads that
can be aligned to a reference genome. In principle, the two types of experiments
can be analyzed in broadly similar ways. In each case, we can choose a set of
genomic locations of interest, count the number of reads mapping to those
regions, and then test for differential coverage between the experimental
conditions relative to biological variability. The key difference between
ChIP-seq and RNA-seq is how the genomic locations are chosen and combined.

Q: What scientific question does a DB ChIP-seq analysis help answer?

A conventional ChIP-seq analysis produces a list of peaks in each library, with
the implication that the protein is bound to DNA within the peak regions and not
bound elsewhere. We take an alternative view that binding coverage is
quantitative, not just present or absent. DB analyses identify a list of
genomic regions where the strength of binding changes between biological
conditions. We believe that the regions where the binding is subject to change
are the most likely to be biologically important, even if they are not
necessarily the regions with highest coverage. DB regions are necessarily linked
with the phenotype or treatment of interest being studied in the experiment. We
can therefore start investigating mechanisms through which DB can lead to
observed differences in biology. One of our favorite analyses is to relate DB to
differential expression for the same set of genes. These insights are harder to
obtain with the conventional analyses, as the identified regions are only
associated with each condition in isolation.

Q: What was the motivation for creating the package? Has your group recently
started doing DB ChIP-seq or have you been doing it for some time?

Our group has been doing DB analyses for a while. However, these analyses have
mostly been gene-orientated. We would count reads into pre-defined intervals
like the promoters or gene bodies and then test for whether these intervals were
DB using the edgeR package. This type of analysis can be done very similarly for
ChIP-seq as for RNA-seq. The csaw package is our first attempt at performing de
novo DB analyses where the regions of interest are not known in advance. We
wanted to be able to identify novel DNA elements like new enhancers or promoters
and we wanted to avoid potential headaches regarding the regions of interest,
for example misspecifying the promoter region of a gene. We were motivated to
take a new approach to de novo DB analysis by our observation that some commonly
used approaches do not control the error rates correctly. A simple method is to
call peaks in each condition separately, then simply compare the regions to
identify unique peaks in each condition. This ad hoc approach does not provide
any statistical error rate control and tends to overstate the differences
between the conditions. A more sophisticated version of this approach is to
conduct statistical tests between conditions for each of the regions called as
peaks in either condition. Again, this over-estimates the differences between
the conditions. Our approach gives similar flexibility to peak calling but with
rigorous error rate control. The idea is to slide windows across the genome,
count reads into those windows, and then use those counts to test for
significant differences between conditions. Adjacent regions are then merged and
the p-values combined in a statistically rigorous way. This provides the same
level of statistical rigor as for our previous gene-orientated analysis but
without the need to specify regions of interest beforehand. Using small windows
also provides excellent spatial resolution. There are a number of other Bioc
packages that can do DB analyses, diffBind and DBChIP for example, but these
require a set of peaks identified with external software like MACS. The
motivation for the csaw package is to avoid having to specific the genomic
regions externally . We wanted to generate and discover the DB completely de
novo as an integral part of the analysis. csaw is the first Bioc package to take
a windowing approach.

Q: How are the statistical methods from edgeR leveraged or extended in
csaw? Please describe from a statistical perspective, why methods used in
RNA-seq are appropriate for DB ChIP-seq?

Once a set of genomic regions has been chosen, a DB analysis is very similar to
an RNA-seq experiment. So it was logical to leverage the statistical methods
provided by edgeR. edgeR accounts for biological variability between replicates,
which is critical for DB analyses where the counts for each region are typically
over-dispersed. Failure to account for such variability will lead to spurious DB
calls. The generalized linear model functionality of edgeR provides a rich
framework with which to analyze complex experiments with multiple experimental
factors or covariate and batch effects. We chose edgeR over limma and voom
because the ChIP-seq counts for each window can be quite small. The reads can be
sparsely distributed across the whole genome. limma and voom are very effective
for moderate to large counts, but edgeR is able to more accurately model the
distribution of very small integer counts. csaw uses the quasi-likelihood
functionality of edgeR because of its rigorous error rate control and because it
gives access to the adaptive empirical Bayes functionality of limma within the
edgeR negative binomial framework. We made many improvements to edgeR as part
of the csaw project. Our sliding window approach to ChIP-seq generates a much
larger number of regions than is typical for an RNA-seq DE analysis, so it was
important to be as computationally efficient as possible. One of us (AL)
converted many of the edgeR functions into C++ for speed and memory efficiency.
There are a number of statistical extensions specific to csaw. These include an
implementation of Simes’ method for combining the p-values of adjacent windows
within a region; a non-linear normalization procedure adapted to low counts; and
a method to calculate the average abundance of a region scaled to its width.

Project Statistics

Publications

A number of core and community members have been working on a “perspective”
article that provides a project overview for potential users and developers.
Focus is on reproducibility, interoperability and data access. The collective
work, “Orchestrating high-throughput genomic analysis with Bioconductor”, is
scheduled to appear in Nature Methods in early 2015.

Other recent project-level (vs package-level) publications are “Scalable
Genomics with R and Bioconductor” by Michael Lawrence and Martin Morgan
which reviews strategies for processing, summarizing and visualizing large
genomic data. “Programming tools: Adventures with R” by Sylvia Tippman
highlights R / Bioconductor as the analysis tool of choice in genomics,
oceanography and ecology applications. Links are available on
the publications page.

It can be challenging to get an accurate count of articles that cite
the Bioconductor project or the individual software packages. The
Publications page of the web site lists results of a full text
search for the single term bioconductor for each of
PubMed,
PubMedCentraland
Google Scholar.
The number of hits range from 591 (PubMed) to 27,000 (Google Scholar).

Taking a package-level approach, a PubMed query for titles or IDs found in
CITATION files returned 22838 references. For packages with no CITATION file
a PubMed search of the package title returned 1281 references.

Website traffic

The following tables compare traffic from the fourth quarter of 2014 with
the fourth quarter of 2013 (October 1 - December 28).

In the fourth quarter we saw an increase in total sessions, new sessions and
the number of total users.

Website traffic Q4 2014 vs Q4 2013

Sessions

23.28% (311,731 vs 252,873)

% New Sessions

0.62% (36.01% vs 35.79%)

Users

24.32% (133,839 vs 107,655)

The greatest increase (percent change) in total sessions was seen in China
followed by Spain then the United States and Italy.

Total Sessions by Location Q4 2014 vs Q4 2013

United States

24.95% (101011 vs 80840)

China

28.43% (27593 vs 21485)

United Kingdom

11.19% (22627 vs 20350)

Germany

20.75% (21138 vs 17506)

France

20.47% (10446 vs 8671)

Canada

21.96% (9808 vs 8042)

Japan

19.21% (9406 vs 7890)

Spain

27.30% (8444 vs 6633)

India

4.15% (8048 vs 7727)

Italy

24.68% (7494 vs 6010)

Statistics were generated with Google Analytics.

Package downloads and new submissions

There was a 9% increase in the number of (distinct IP) downloads of software
packages in the fourth quarter of 2014 (106212 total) as compared to the fourth
quarter of 2013 (97462 total). See the website for a full summary of package
download stats.

A total of 63 software packages were added in the fourth quarter of 2014
bringing the count to 954 in devel (Bioconductor 3.1) and 934 in release
(Bioconductor 3.0).

Resources and Upcoming Events

Videos and Webinars

On average, 20-30 software packages are submitted to Bioconductor each
quarter resulting in 100+ new packages each year. New package guidelines are
posted on the
website but
developers often have additional questions or special case situations. We
thought a more interactive exchange would help avoid common errors and
encourage questions before a package was submitted for review.

In mid December Marc and Sonail hosted a Google Hangout to share some key
development tips and solicit questions from a wider audience. Topics covered
were package organization, documentation and code reuse. Questions were posted
via a YouTube chat window and answered after Marc finished the slide
presentation. If you are considering submitting a package you may want to
check out the Webinar posted on the Bioconductorvideo page.

Conferences and Courses

The events page is updated regularly
with new courses and conference announcements. In January 2015, EMBL is hosting
both the European Developer’s conference and an Advanced R Programming course.