A new analysis of cyclone data and computer climate modeling, led by Lamont's Adam Sobel, Suzana Camargo, Allison Wing and Chia-Ying Lee, indicates that global warming is likely to intensify the destructive power of tropical storms.

A group of scientists studying a broad range of Arctic systems — from sea ice to permafrost to the Greenland ice sheet — gathered in D.C. to lay out just how extreme a year 2016 has been so far for the northern cap of the planet. “I see the situation as a train going downhill,” said Lamont's Marco Tedesco. “And the feedback mechanisms in the Arctic [are] the slope of your hill. And it gets harder and harder to stop it.”

The risk of rapid coastal subsidence to infrastructure and economies is global and is most acute in large river deltas, which are home to about 500 million people. An international community of researchers is calling attention to the need for better measurements and modeling and linking the science with its socioeconomic implications, Lamont's Michael Steckler and colleagues write.

I have had the 2014 paper “Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissable” by McMurdie and Holmes sitting on my desk for a while now. Yesterday I finally got around to reading it and was immediately a little skeptical as a result of the hyperbole with which they criticized the common practice of subsampling* libraries of 16S rRNA gene reads during microbial community analysis. The logic of that practice proceeds like this:

To develop 16S rRNA gene data (or any other marker gene data) describing a microbial community we generally collect an environmental sample, extract the DNA, amplify it, and sequence the amplified material. The extract might contain the DNA from 1 billion microbial cells present in the environment. Only a tiny fraction (<< 1 %) of these DNA molecules actually get sequenced; a “good” sequence run might contain only tens of thousands of sequences per sample. After quality control and normalizing for multiple copies of the 16S rRNA gene it will contain far fewer. Thus the final dataset contains a very small random sample of the original population of DNA molecules.

Most sequence-based microbial ecology studies involve some kind of comparison between samples or experimental treatments. It makes no sense to compare the abundance of taxa between two datasets of different sizes, as the dissimilarity between the datasets will appear much greater than it actually is. One solution is to normalize by dividing the abundance of each taxa by the total reads in the dataset to get their relative abundance. In theory this works great, but has the disadvantage that it does not take into account that a larger dataset has sampled the original population of DNA molecules deeper. Thus more rare taxa might be represented. A common practice is to reduce the amount of information present in the larger dataset by subsampling to the size of the smaller dataset. This attempts to approximate the case where both datasets undertake the same level of random sampling.

McMurdie and Holmes argue that this approach is indefensible for two common types of analysis; identifying differences in community structure between multiple samples or treatments, and identifying differences in abundance for specific taxa between samples and treatments. I think the authors do make a reasonable argument for the latter analysis; however, at worst the use of subsampling and/or normalization simply reduces the sensitivity of the analysis. I suspect that dissimilarity calculations between treatments or samples using realistic datasets are much less sensitive to reasonable subsampling than the authors suggest. I confess that I am (clearly) very far from being a statistician and there is a lot in McMurdie and Holmes, 2014 that I’m still trying to digest. I hope that our colleagues in statistics and applied mathematics continue optimizing these (and other) methods so that microbial ecology can improve as a quantitative science. There’s no need however, to get everyone spun up without a good reason. To try and understand if there is a good reason, at least with respect to my data and analysis goals, I undertook some further exploration. I would strongly welcome any suggestions, observations, and criticisms to this post!

My read of McMurdi and Homes is that the authors object to subsampling because it disregards data that is present that could be used by more sophisticated (i.e. parametric) methods to estimate the true abundance of taxa. This is true; data discarded from larger datasets does have information that can be used to estimate the true abundance of taxa among samples. The question is how much of a difference does it really make? McMurdi and Holmes advocate using methods adopted from transcriptome analysis. These methods are necessary for transcriptomes because 1) the primary objective of the study is not usually to see if one transcriptome is different from another, but which genes are differentially expressed and 2) I posit that the abundance distribution of transcript data is different than the taxa abundance distribution. An example of this can be seen in the plots below.

Taxon abundance, taken from a sample that I’m currently working with.

Transcript abundance by PFAM, taken from a sample (selected at random) from the MMETSP.

In both the cases the data is log distributed, with a few very abundant taxa or PFAMs and many rare ones. What constitutes “abundant” in the transcript dataset however, is very different than “abundant” in the community structure dataset. The transcript dataset is roughly half the size (n = 9,000 vs. n = 21,000), nonetheless the most abundant transcript has an abundance of 209. The most abundant OTU has an abundance of 6,589, and there are several very abundant OTUs. Intuitively this suggests to me that the structure of the taxon dataset is much more reproducible via subsampling than the structure of the transcript dataset, as the most abundant OTUs have a high probability of being sampled. The longer tail of the transcript data contributes to this as well, though of course this tail is controlled to a large extent by the classification scheme used (here PFAMs).

To get an idea of how reproducible the underlying structure was for the community structure and transcript data I repeatedly subsampled both (with replacement) to 3000 and 1300 observations, respectively. For the community structure data this is about the lower end of the sample size I would use in an actual analysis – amplicon libraries this small are probably technically biased and should be excluded (McMurdi and Holmes avoid this point at several spots in the paper). The results of subsampling are shown in these heatmaps, where each row is a new subsample. For brevity only columns summing to an abundance > 100 are shown.

In these figures the warmer colors indicate a higher number of observations for that OTU/PFAM. The transcript data is a lot less reproducible than the community structure data; the Bray-Curtis dissimilarity across all iterations maxes out at 0.11 for community structure and 0.56 for the transcripts. The extreme case would be if the data were normally distributed (i.e. few abundant and few rare observations, many intermediate observations). Here’s what subsampling does to normally distributed data (n = 21,000, mean = 1000, sd = 200):

If you have normally distributed data don’t subsample!

For the rest of us it seems that for the test dataset used here community structure is at least somewhat reproducible via subsampling. There are differences between iterations however, what does this mean in the context of the larger study?

The sample was drawn from a multiyear time series from a coastal marine site. The next most similar sample (in terms of community composition and ecology) was, not surprisingly, a sample taken one week later. By treating this sample in an identical fashion, then combining the two datasets, it was possible to evaluate how easy it is to tell the two samples apart after subsampling. In this heatmap the row colors red and black indicate iterations belonging to the two samples:

Clustering of repeated subsamplings from two similar samples. Sample identity is given by the red or black color along the y-axis.

As this heatmap shows, for these two samples there is perfect fidelity. Presumably with very similar samples this would start to break down, determining how similar samples need to be before they cannot be distinguished at a given level of subsampling would be a useful exercise. The authors attempt to do this in Simlation A/Figure 5 in the paper, but it isn’t clear to me why their results are so poor – particularly given very different sample types and a more sophisticated clustering method than I’ve applied here.

As a solution – necessary for transcript data, normally distributed data, or for analyses of differential abundance, probably less essential for comparisons of community structure – the authors propose a mixture model approach that takes in account variance across replicates to estimate “real” OTU abundance. Three R packages that can do this are mentioned; edgeR, DESeq, and metagenomeSeq. The problem with these methods – as I understand them – is that they require experimental replicates. According to the DESeq authors, technical replicates should be summed, and samples should be assigned to treatment pools (e.g. control, treatment 1, treatment 2…). Variance is calculated within each pool and this is used to to model differences between pools. This is great for a factor-based analysis, as is common in transcriptome analysis or human microbial ecology studies. If you want to find a rare, potentially disease-causing strain differently present between a healthy control group and a symptomatic experimental group for example, this is a great way to go about it.

There are many environmental studies for which these techniques are not useful however, as it may be impractical to collect experimental replicates. For example it is both undesirable and impractical to conduct triplicate or even duplicate sampling in studies focused on high-resolution spatial or temporal sampling. Sequencing might be cheap now, but time and reagents are not. Some of the methods I’m working with are designed to aggregate samples into higher-level groups – at which point these methods could be applied by treating within-group samples as “replicates” – but this is only useful if we are interested in testing differential abundance between groups (and doesn’t solve the problem of needing to get samples grouped in the first place).

These methods can be used to explore differential abundance in non-replicated samples, however, they are grossly underpowered when used without replication. Here’s an analysis of differential abundance between the sample in the first heatmap above and its least similar companion from the same (ongoing) study using DESeq. You can follow along with any standard abundance table where the rows are samples and the columns are variables.

library(DESeq)
## DESeq wants to oriented opposite how community abundance
## data is normally presented (e.g. to vegan)
data.dsq <- t(data)
## DESeq requires a set of conditions which are factors. Ideally
## this would be control and treatment groups, or experimental pools
## or some such, but we don't have that here. So the conditions are
## unique column names (which happen to be dates).
conditions <- factor(as.character(colnames(data.dsq)))
## As a result of 16S rRNA gene copy number normalization abundance
## data is floating point numbers, convert to integers.
data.dsq <- ceiling(data.dsq, 0)
## Now start working with DESeq.
data.ct <- newCountDataSet(as.data.frame(data.dsq), conditions = conditions)
data.size <- estimateSizeFactors(data.ct)
## This method and sharing mode is required for unreplicated samples.
data.disp <- estimateDispersions(data.size, method = 'blind', sharingMode="fit-only")
## And now we can execute a test of differential abundance for the
## two samples used in the above example.
test <- nbinomTest(data.disp, '2014-01-01', '2014-01-06')
test <- na.omit(test)
## Plot the results.
plot(test$baseMeanA, test$baseMeanB,
#log = 'xy',
pch = 19,
ylim = c(0, max(test$baseMeanB)),
xlim = c(0, max(test$baseMeanA)),
xlab = '2014-01-01',
ylab = '2009-12-14')
abline(0, 1)
## If we had significant differences we could then plot them like
this:
points(test$baseMeanA[which(test$pval < 0.05)],
test$baseMeanB[which(test$pval < 0.05)],
pch = 19,
col = 'red')

As we would expect there are quite a few taxa present in high abundance in one sample and not the other, however, none of the associated p-values are anywhere near significant. I’m tempted to try to use subsampling to create replicates, which would allow an estimate of variance across subsamples and access to greater statistical power. This is clearly not as good as real biological replication, but we have to work within the constraints of our experiments, time, and funding…

*You might notice that I’ve deliberately avoided using the terms “microbiome” and “rarefying” here. In one of his comics Randall Munroe asserted that the number of made up words in a book is inversely proportional to the quality of the book, similarly I strongly suspect that the validity of a sub-discipline is inversely proportional to the number of jargonistic words that community makes up to describe its work. As a member of said community, what’s wrong with subsampling and microbial community??

Lamont's Colin Stark visited the Glacier Bay landslide and said closer inspection revealed two big discoveries: the slide was still active days later, and the original landslide was so powerful it pushed rock and dirt up the sides of the valley almost 300 feet.

It's the second summer for the Biking While Breathing project which looks at the impact of air pollution on exercise in New York City. This year, researchers are considering going cheap. Cites Steve Chillrud's work.

The paprica pipeline was designed to infer the genomic content and genomic characteristics of a set of 16S rRNA gene reads. To enable this the paprica database organizes this information by phylogeny for many of the completed genomes in Genbank. In addition to metabolic inference this provides an opportunity to explore how genome content and genomic characteristics are organized phylogenetically. The following is a brief analysis of some genomic features using the paprica database and R. If you aren’t familiar with the paprica database this exercise will also familiarize you with some of its content and its organization.

The paprica pipeline and database can be obtained from Github here. In this post I’ll be using the database associated with version 0.3.1. The necessary files from the bacteria database (one could also conduct this analysis on the much smaller archaeal database) can be read into R as such:

## Read in the pathways associated with the terminal nodes on the reference tree
path <- read.csv('paprica/ref_genome_database/bacteria/terminal_paths.csv', row.names = 1)
path[is.na(path)] <- 0
## Read in the data associated with all completed genomes in Genbank
data <- read.csv('paprica/ref_genome_database/bacteria/genome_data.final.csv', row.names = 1)
## During database creation genomes with duplicate 16S rRNA genes were removed,
## so limit to those that were retained
data <- data[row.names(data) %in% row.names(path),]
## "path" is ordered by clade, meaning it is in top to bottom order of the reference tree,
## however, "data" is not, so order it here
data <- data[order(data$clade),]

One fun thing to do at this point is to look at the distribution of metabolic pathways across the database. To develop a sensible view it is best to cluster the pathways according to which genomes they are found in.

The distribution of metabolic pathways across all 3,036 genomes in the v0.3.1 paprica database.

There are a couple of interesting things to note in this plot. First, we can see the expected distribution of core pathways present in nearly all genomes, and the interesting clusters of pathways that are unique to a specific lineage. For clarity row names have been omitted from the above plot, but from within R you can pull out the taxa or pathways that interest you easily enough. Second, there are some genomes that have very few pathways. There are a couple of possible reasons for this that can be explored with a little more effort. One possibility is that these are poorly annotated genomes, or at least the annotation didn’t associate many or any coding sequences with either EC numbers or GO terms – the two pieces of information Pathway-Tools uses to predict pathways during database construction. Another possibility is that these genomes belong to obligate symbionts (either parasites or beneficial symbionts). Obligate symbionts often have highly streamlined genomes and few complete pathways. We can compare the number of pathways in each genome to other genome characteristics for additional clues.

A reasonable assumption is that the number of pathways in each genome should scale with the size of the genome. Large genomes with few predicted pathways might indicate places where the annotation isn’t compatible with the pathway prediction methods.

The number of metabolic pathways predicted as a function of genome size for the genomes in the paprica database.

That looks pretty good. For the most part more metabolic pathways were predicted for larger genomes, however, there are some exceptions. The red point gives the location of Pelagibacter ubique HTCC1062. This marine bacterium is optimized for life under energy-limited conditions. Among its adaptations are a highly efficient and streamlined genome. In fact it has the smallest genome of any known free-living bacterium. All the points below it on the x-axis are obligate symbionts; these are dependent on their host for some of their metabolic needs. There are a few larger genomes that have very few (or even no) pathways predicted. These are the genomes with bad, or at least incompatible annotations (or particularly peculiar biochemistry).

The other genome parameters in paprica are the number of coding sequences identified (nCDS), the number of genetic elements (nge), the number of 16S rRNA gene copies (n16S), GC content (GC), and phi; a measure of genomic plasticity. We can make another plot to show the distribution of these parameters with respect to phylogeny.

Squinting at this plot it looks like GC content and phi are potentially negatively correlated, which could be quite interesting. These two parameters can be plotted to get a better view:

plot(data.select$phi ~ data.select$GC,
xlab = 'GC',
ylab = 'phi')

The phi parameter of genomic plasticity as a function of GC content.

Okay, not so much… but I think the pattern here is pretty interesting. Above a GC content of 50 % there appears to be no relationship, but these parameters do seem correlated for low GC genomes. This can be evaluated with linear models for genomes above and below 50 % GC.

Genomic plasticity (phi) as a function of GC content for all bacterial genomes in the paprica database.

As expected there is no correlation between genomic plasticity and GC content for the high GC genomes (R2 = 0) and a highly significant correlation for the low GC genomes (albeit with weak predictive power; R2 = 0.106, p = 0). So what’s going on here? Low GC content is associated with particular microbial lineages but also with certain ecologies. The free-living low-energy specialist P. ubique HTCC1062 has a low GC content genome for example, as do many obligate symbionts regardless of their taxonomy (I don’t recall if it is known why this is). Both groups are associated with a high degree of genomic modification, including genome streamlining and horizontal gene transfer.

Seismic recordings registered a massive landslide in Alaska's Glacier Bay National Park, and scientists are studying how the region's geology and environmental change are elevating the risk of mountain landslides. Cites work by Colin Stark and Göran Ekström.

More than 100 million tons of rock slid down a mountainside in Southeast Alaska on Tuesday morning, sending debris miles across a glacier below and a cloud of dust into the air. Lamont's Colin Stark and colleagues analyzed the landslide through its seismic waves.

Slowdowns of the Atlantic meridional overturning circulation have long been suspected as a cause of the climate swings during the last ice age, but never definitively shown, until now. The new study “is the best demonstration that this indeed happened,” says Lamont's Jerry McManus.

The burning sensation in the southwestern United States was diagnosed by climate scientists more than a year ago, the Washington Post writes. The Post cites research by Lamont-Doherty scientist Park William into connections between the California drought and climate change.

California's overworked firefighters are being forced to take on another task — clearing dead and dying trees. John Upton talks with Lamont's Park Williams about the role of drought and rising temperatures.