December Mothur Workshop, part 3

Here are my notes from day 3 of the Mothur Workshop, which is taught by Pat Schloss (pdschloss at gmail.com) at a hotel conveniently located near the Detroit airport. Of all the bioinformatics workshops I have attended, this group of students was my favorite so far. We had a lot of fun having dinner together at some of the nearby restaurants. Contact Pat if you are interested in attending the next one in February. It’s really helpful and you get a free Mothur sticker!

—Day 3 of the Mothur Workshop—

At the beginning of your study, think about what your question is, what your hypothesis is

On Wednesday we started with fasta files from the sequencer

Yesterday we ended up with shared files with OTUs and counts and a consensus taxonomy file

We talked about OTUs and how to define them. Using a 3% threshold generally gives you multiple OTUs per genus rather than multiple genera per OTU

The SOP is exactly what the Schloss lab uses, though they may play around with the reference database used

Pat recommends updating the alignment databases when you start using Mothur at home

Common questions addressed by these analyses

Diversity (also called alpha diversity)

– combines evenness and richness

– functional

– variation

– number of things –> richness

– biodiversity

there are a number of papers in the microbiome literature that confuse richness with diversity

–> ignores the “names”

diversity can be the same in different habitats but contain very different organisms
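To make the richness versus diversity distinction concrete, here is a minimal Python sketch. It is not Mothur output; the sample names and counts are made up for illustration, and Shannon's index and Pielou's evenness are just one common choice of metrics.

```python
import math

def richness(counts):
    """Number of OTUs observed at least once in a sample."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon diversity index (natural log); combines richness and evenness."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def evenness(counts):
    """Pielou's evenness: Shannon diversity divided by ln(richness)."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0

# Hypothetical OTU counts (one position per OTU), invented for this example.
gut_even = [20, 20, 20, 20, 0]    # four OTUs, evenly distributed
gut_skewed = [77, 1, 1, 1, 0]     # same four OTUs, one dominates
soil_even = [0, 20, 20, 20, 20]   # four OTUs, but a different set

for name, counts in [("gut_even", gut_even),
                     ("gut_skewed", gut_skewed),
                     ("soil_even", soil_even)]:
    print(name, richness(counts), round(shannon(counts), 3),
          round(evenness(counts), 3))
```

The two gut samples have the same richness but different diversity, while gut_even and soil_even have the same diversity even though they do not contain the same OTUs. That is the sense in which diversity ignores the “names”.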

Ordination introduces a lot of distortions into the data; for example, think of a map of the US

Microbial ecology data is prone to a lot of distortions

– Eigenvalue / Vector-based

data fitting procedure, try to fit lines through your data

got a big cloud of points, try to draw a line through the cloud and say that this line accounts for most of the variation in the data

draw another line that accounts for the next largest amount of variation in the data, don’t have to be perpendicular but are orthogonal

linear combinations of data

Say PC1-3 account for ~30% of the variation in the data

That sucks but maybe we can see signal of how the samples are separating

Can look at contributions of different OTUs in ordination space

Principal components (PCA): don’t use this. It uses R2 as a way of calculating the similarity of samples and treats double zeros as similar. That doesn’t work when an OTU is absent from both samples; the shared zeros make the samples look more similar than they are

Principal coordinates (PCoA) works with the OTUs that are present
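The double-zero problem can be illustrated with a small Python sketch. This is my own illustration, not something from the workshop: the two samples are invented, a Pearson correlation stands in for the correlation-based similarity behind PCA, and Bray-Curtis stands in for a distance you might feed to PCoA.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two samples' OTU counts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity; OTUs absent from both samples add nothing."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))

# Two hypothetical samples that share no OTUs at all.
sample_a = [10, 0, 5, 0]
sample_b = [0, 10, 0, 5]

# Pad both samples with 20 OTUs that are absent from both (double zeros).
padded_a = sample_a + [0] * 20
padded_b = sample_b + [0] * 20

print("correlation, original:", round(pearson_r(sample_a, sample_b), 3))
print("correlation, padded:  ", round(pearson_r(padded_a, padded_b), 3))
print("Bray-Curtis, original:", bray_curtis(sample_a, sample_b))
print("Bray-Curtis, padded:  ", bray_curtis(padded_a, padded_b))
```

Adding OTUs that neither sample contains pulls the correlation upward, so the samples look less different than they are, while the Bray-Curtis value does not change.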

NMDS

– non-metric multidimensional scaling

– you provide the number of dimensions that you want

say 2 dimensions

– will array your points in 2D so that the distances between the points reflect the distances in the matrix. It has a metric called stress, which describes how well the ordination reflects the distances from your beta diversity metric. It is a random process that moves the points around until they best reflect the input matrix

stress: ideally less than 0.1, but often it’s 0.2. You should report it; not sure what else you can do

the position of the points can shift when you run it multiple times but the relationships stay the same

A con is that people outside the field always ask what the axes mean

Can take Bray-Curtis distances, generate an ordination, generate a new distance matrix from the ordination, then calculate the R2 between input and output to see how well the ordination reflects the input matrix. The R2 is always higher for NMDS than for PCoA
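Here is a rough Python sketch of that input-versus-output check. Everything in it is an assumption for illustration: the counts are invented, and the 2D coordinates are hand-made stand-ins for what an ordination would produce, not real NMDS or PCoA output.

```python
import math
from itertools import combinations

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two samples' OTU counts."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))

def r_squared(xs, ys):
    """Squared Pearson correlation between two lists of distances."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov * cov / (var_x * var_y)

# Hypothetical OTU counts for four samples.
counts = {
    "s1": [10, 5, 0, 0],
    "s2": [8, 6, 1, 0],
    "s3": [0, 1, 9, 7],
    "s4": [0, 0, 10, 8],
}

# Hypothetical 2D ordination coordinates for the same samples.
coords = {
    "s1": (-0.40, 0.05),
    "s2": (-0.35, -0.05),
    "s3": (0.35, 0.05),
    "s4": (0.40, -0.05),
}

pairs = list(combinations(sorted(counts), 2))
input_dist = [bray_curtis(counts[a], counts[b]) for a, b in pairs]
output_dist = [math.dist(coords[a], coords[b]) for a, b in pairs]  # Python 3.8+

fit = r_squared(input_dist, output_dist)
print("R2 between input and ordination distances:", round(fit, 3))
```

With real data you would replace `coords` with the axes file from your ordination and compare the value between methods.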

Biplot

corr.axes

When visualizing the data, remember that at best you are looking at a distorted 2D view of it

get.communitytype

Idea of the Enterotype

blood types for your gut

Could be a soil type, leaf type

Community types across the body

A way to reduce the complexity of the data

two methods that are commonly used:

PAM: partitioning around medoids

distance based

DMM: Dirichlet multinomial mixture models

uses count data

The Schloss paper on body community types in Nature found that DMM is better than PAM
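To show what "distance based" means for PAM, here is a minimal Python sketch of the swap step of k-medoids on a precomputed distance matrix. It is a simplified illustration, not how Mothur implements it: it starts from the first k samples instead of the usual build step, and the distance matrix is made up. DMM is not sketched here because it fits a probabilistic model to the counts rather than working from distances.

```python
def total_cost(dist, medoids):
    """Sum of each sample's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, k):
    """Greedy medoid swapping on a square distance matrix (simplified PAM)."""
    n = len(dist)
    medoids = list(range(k))  # naive start; real PAM picks these more carefully
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                # Try replacing one medoid with a non-medoid sample.
                trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    # Assign every sample to its closest medoid.
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return sorted(medoids), labels

# Hypothetical Bray-Curtis-like distances among five samples:
# samples 0-2 are close to each other, as are samples 3-4.
dist = [
    [0.00, 0.10, 0.20, 0.90, 0.95],
    [0.10, 0.00, 0.15, 0.85, 0.90],
    [0.20, 0.15, 0.00, 0.80, 0.90],
    [0.90, 0.85, 0.80, 0.00, 0.10],
    [0.95, 0.90, 0.90, 0.10, 0.00],
]

medoids, labels = pam(dist, k=2)
print("medoids:", medoids)
print("community type per sample:", labels)
```

Each sample ends up labeled with the index of its medoid, which plays the role of the "community type" for that sample.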