Bioconductor Newsletter

The Bioconductor newsletter is a quarterly review of core infrastructure
developments, community projects and future directions. We cover topics of
general interest as well as those with the greatest impact on the software.
This quarter the Bioconductor core team relocated from Seattle, Washington to
Buffalo, New York and welcomed two new team members. Two new forums for
workflows and analysis pipelines were introduced: the recently launched
Bioconductor F1000Research Channel and the new section for
R / Bioconductor Workflows in the BMC Source Code for Biology and Medicine
journal. New features and functions are summarized for the infrastructure
packages as well as a handful of contributed packages that have been
especially active over the past devel cycle.

Bioconductor relocates to Roswell Park Cancer Institute

In September the Bioconductor core team relocated from Seattle, Washington to
Buffalo, New York. The new home institution is
Roswell Park Cancer Institute
(RPCI) which is part of the Buffalo Niagara Medical Complex. The Complex covers
about 120 acres in downtown Buffalo and is a consortium of hospitals, health care facilities and
educational institutions. The campus is growing rapidly with current employee
numbers at 12000 estimated to reach 17000 by 2017. RPCI has close ties with
SUNY University at Buffalo.

The change in physical location also brought some staffing changes. Marc
Carlson, Sonail Arora and Paul Shannon left the project at the end of August.
All were instrumental in the design and implementation of AnnotationHub
(and annotations in general), tools for biological network analysis,
educational material and many other areas. A big thanks to them for their many
contributions.

In Buffalo we are pleased to welcome new team members Jim Java and Brian Long.
Jim formerly worked as a biostatistician for the Gynecologic Oncology
Group at RPCI where he analyzed clinical trial data. He also worked as a
software engineer for several companies on projects ranging from point-of-sale
software to embedded systems. He has a Ph.D. in computer science (focused on
natural language processing), and MA degrees in biostatistics and English
literature. His research interests include scientific and statistical
computing, natural language processing (especially authorship identification),
linguistics, climate science, gun violence, ethics, and literature.

Brian’s work in software development has focused on relational database & NoSQL
storage, UI/UX design, security and scale. Most recently, he’s been working
with a team to develop a Platform-as-a-Service (similar to Google App Engine).
The platform offering allows application developers to deploy components to
on-demand virtual infrastructure and supports a large number of programming
languages via Apache Thrift. Most of the work was not done as open source but
the code has been released on GitHub. Feel free
to check it out!

Reproducible Research

F1000Research channel launched

F1000Research is an open science publishing
platform with the following goals:

Fast publication:
All scientific research is published with a few days making new results
immediately visible. Review is conducted after publication.

Open peer review:
Open peer review removes the potential bias that can be present with anonymous
pre-publication review.

Publication of all findings:
All results are published including null/negative results, small findings,
case reports, data notes and observation articles.

Data provided:
All articles are accompanied by the data used to generate the results
enabling reproducible research.

The motivation for a Bioconductor F1000Research channel was a forum for
task-oriented workflows that cover a current, genome-scale analysis problem
from start to finish. In contrast to package vignettes, these workflows combine
resources from several different packages, demonstrating integrative analysis
and modeling techniques. The end goal is to make it easier for Bioconductor
users to navigate the software offerings by adapting these workflows to
quickly arrive at a solution for their specific problem.

This effort has been led by Wolfgang Huber, Vince Carey, Sean Davis, Kasper
Daniel Hansen and Martin Morgan.

Levi Waldron is an Assistant Professor of Biostatistics at the CUNY School of
Public Health in New York City. He serves on the Bioconductor technical
advisory board and is currently heads up the group developing new
infrastructure for the analysis of
multi-assay genomic experiments.
Levi recently became a section editor for the
BMC Source Code for Biology and Medicine Journal
and earlier this month posted an
announcement calling for
R / Bioconducor workflows submissions to the new section. The journal is
open access and aims to publish source code for distribution and use
to advance biological and medical research. Manuscripts are considered on all
aspects of workflow for information systems, decision support systems, client
user networks, database management, and data mining.

The workflows should address a problem of general interest and be easily
adaptable by other researchers. Implementation can be as a workflow on the
Bioconductor web site, an
Amazon or
docker image or other
cross-platform supported approach. More details are available in the
author instructions.

Infrastructure Updates

Build machines to cloud

Over the past weeks Dan has been busy transferring package builds, the
single package builder and the web site to the cloud. The move provides more
flexible login access as well as machine configuration. Everything should look
the same from the outside so user interaction with these resources will
not change.

This has been a huge effort and is almost complete. Thanks Dan!

HTS core package stack

Bioconductor encourages software reuse and aims for a flexible, integrated
set of infrastructure packages. One consequence of interrelationship is
that a change in a low-level package can affect packages downstream.

Throughout a devel cycle new classes are added and code is reorganized which may
cause the dependency hierarchy to change. We do our best to identify and fix
packages affected by these changes. Modified packages and their dependencies
are committed to svn/git and should propagate together through the
nightly builds and become available via biocLite() the following day.

Hervé recently added a graphic to S4Vectors/inst/doc/ that depicts the
High Throughput Sequencing (HTS) core package stack. Knowledge of this
hierarchy is useful when developing new S4 classes and methods. It also
benefits leading-edge developers working directly from svn (vs biocLite()). A package
installed from svn may have updated dependencies that have not yet propagated
through the build system and must be installed (in order) by hand.

The package stack file will be updated as the core packages change. This is the
current snapshot.

Reads data from a GFF file into a data.frame or DataFrame object. The
function auto-detects the GFF version but has a ‘version’ argument if
needed. All columns are loaded by default or individual columns can be
specified in the ‘columns’ argument. Additional flexibility is provided
by rtracklayer::readGFF().

(contributed by Hervé Pages)

coercion between GRanges object and character vector

A GRanges object can be coerced to a character vector and back. See
?GRanges in the GenomicRanges package for details.

The venerable ExpressionSet can be coerced to and from the more
modern SummarizedExperiment. Coercion using as() supports
mapping identifiers from common probe- or gene-based annotation
labels to genomic ranges;
makeSummarizedExperimentFromExpressionSet() allows
user-specified identifier conversions.

(contributed by Jim Hester)

Activity in contributed packages

As complement to the section on new features and functions in the
infrastructure packages we want to highlight significant changes made in
contributed packages. Not all packages keep current NEWS files so it can be
tricky to determine which packages have added new features. One way of gauging
active development is the number of svn/git commits over a period of time.

A few authors said the commits were due to maintenance and the package had
not changed much since the last release. Other packages did change
significantly and the authors have summarized the changes below. Comments have
been lightly edited for length.

Add pipeline for drug discovery and drug repositioning based on the cogena
workflow. Candidate drugs can be predicted based on the gene expression of
disease-related data, or other similar drugs can be identified based on the
gene expression of drug-related data. Moreover, the drug mode of action can
be disclosed by the associated pathway analysis.

systemPipeR provides infrastructure for building and running
automated analysis workflows for a wide range of next generation sequence
(NGS) applications.

NGS WORKFLOWS

added new end-to-end workflows for Ribo-Seq and polyRibo-Seq, ChIP-Seq and
VAR-Seq

added the data package ‘systemPipeRdata’ to generate systemPipeR workflow
environments with a single command (genWorkenvir) containing all parameter
files and sample data required to quickly test and run workflows

about 20 new functions have been added to the package. Some examples are:

Read pre-processor function with support for SE and PE reads

Parallelization option of detailed FASTQ quality reports

Read distribution plots across all features available in a
genome annotation (see ?featuretypeCounts)

Visualization of coverage trends along transcripts summarized
for any number of transcripts (see ?featureCoverage)

Differential expression/binding analysis includes now DESeq2 as
well as edgeR

adaption of R Markdown for main vignette

WORKFLOW FRAMEWORK

simplified design of complex analysis workflows

improvements to workflow automation and parallelization on single
machines and computer clusters

Developers

Why vectorize?

When looking for ways to optimize R code one of the most common suggestions
is to ‘vectorize’. A ‘vectorized’ function in R is one that has a
loop-like construct written in a compiled language with a light R wrapper.
There are a couple of qualities that give these functions their performance
advantage.

Data are passed to a ‘vectorized’ function as a vector instead of individual
elements. Vectors in R are ‘typed’ meaning all elements must be of the same
data type. Passing a whole vector to the complied code reduces the amount of
work R has to do to interpret data type. Second, the loop computation is done
in a complied language such as C, C++ or FORTRAN.

Vectorized functions call .C, .Call, .Primitive or .Internal in the source
code. One example is base::which()

An concrete example of how these functions can improve performance is
this
post on the support site
where Hervé helped the author of the
ChIPseeker package
identify a bottleneck in ChIPseeker:::getFirstHitIndex(). The
solution was to replace a call to sapply() (i.e., R-level
iteration) with two vectorized functions, duplicated() and
which(). This elegant one-liner reduced the algorithm from quadratic
in time to linear in time. Nice!

Hervé took a classic approach to improving the code:
simplifying the original support site example (thanks!) to one that
illustrated the problem in a timely fashion; stepping through the code
a line at a time and noticing that one function call took inordinately
long; understanding the small section of code that caused problems;
and identifying a vectorized alternative.

We may not all see such drastic improvements but take home is that
these functions are worth knowing about and implementing when
possible.

The ellipsis

In R, the ellipsis (…) is used to pass a variable number of arguments to
a function. A common question from developers writing their own S4
generics is when (and why) to include the ellipsis in the function signature.
Hervé answered a recent
post on
Bioc-devel
about this topic. Main ideas are summarized here but for the full details
you’ll want to read the post.

Defining generics and methods instead of ordinary functions for the getters and
setters of an S4 object is generally considered good practice. When introducing
a new generic, the recommendation is to keep the signature as “generic” as
possible so methods can add arguments that are specific to the objects they
deal with.

The purpose of signature="object" is to limit dispatch to object
only. Defining sensible default values (e.g., method=”auto” or
verbose=FALSE) is also important to encourage consistent,
user-friendly behavior across methods.

Going the direction of more restrictive, an example where the ellipsis
was intentionally omitted is the organism() generic recently added
to
BiocGenerics. This
is a straightforward getter and we want to enforce this and encourage
methods to stick to that very simple contract. If someone comes up
with a use-case where extra arguments are needed for their method then
the ellipsis can be added to the generic.

While ... is often appropriate in the signature of generics, methods
are often best written to include only the arguments actually used. In
this way, invalid arguments are not silently ignored:

The key when designing generics is to identify or anticipate the
greatest common factor across methods and formalize that at the level
of the generic. When in doubt, better to underestimate than
overestimate. For methods, use the most restrictive signature
possible.

Project Statistics

Website traffic

The following compares the number of sessions and new users from the third
quarter of 2015 (July 1 - September 25) with the third quarter of 2015. Sessions
are broken down by new and returning visitors. New visitors correspond to the
total new users.

Package downloads and new submissions

The number of unique IP downloads of software packages for July, August and
September of 2015 were 35349, 33600 and 29953 respectively. For the same time
period in 2014, numbers were 36362, 36118 and 47918. Numbers must be
compared by month (vs sum) because some IPs are the same between months.
See the web site for a full summary of download
stats.

As of September 25, a total of 54 software packages have been added in the
third quarter of 2015 bringing counts to 1078 in devel (Bioconductor 3.2)
and 1024 in release (Bioconductor 3.1).