According to a recent estimate from the National Institutes of Health (NIH), 90 percent of cells in the human body are bacterial, fungal, or otherwise non-human. The human gut alone contains on average 40,000 bacterial species, 9 million unique bacterial genes, and 100 trillion microbial cells.

In addition, plant-microbe interactions in the rhizosphere are determinants of plant health, productivity, and soil fertility. Mycorrhizal fungi and nitrogen-fixing bacteria are responsible for an average of 5–20 percent, and up to 80 percent, of all nitrogen, and up to 75 percent of phosphorus, acquired by plants annually.

Metagenomics

Studies of microbial communities have become increasingly common in recent years. Metagenomics is a collection of techniques for studying microorganisms by directly extracting and cloning DNA from an assemblage of microorganisms. Using these methods, sequence profiles of microbial communities from various sources can be obtained and analyzed to uncover the community structure. In many situations, the nature of these microbial communities matters a great deal. One notable example is fecal microbial transplantation, which radically resets gut community structure and cures relapsing Clostridium difficile infection in up to 90 percent of cases (van Nood et al., 2013).

In this blog post, I’m going to show an example of analyzing a microbial data set using JMP. The Brewery Digesters data set comes from a paper by Werner et al. (2011). I will investigate the bacterial community structure across nine full-scale bioreactor facilities treating brewery wastewater, which was also one of the goals of the original paper.

Data set overview

This data set is publicly available, and the data processing details are described in Werner et al. (2011). Briefly, the authors collected 95 biomass samples over one year from nine methanogenic granular-sludge bioreactor facilities and performed barcoded 454 pyrosequencing of bacterial 16S rRNA gene amplicons from those samples. They used QIIME to carry out barcode trimming, denoising, Operational Taxonomic Unit (OTU) clustering, taxonomy assignment (Greengenes), and OTU table construction. In total, they retained 4962 OTUs for subsequent analyses.

Analysis steps

The analysis steps are as follows:

1. Read Biom file

OTU (Operational Taxonomic Unit) tables are usually stored in Biom (Biological Observation Matrix) format. The observations in this case are OTUs, and the matrix contains counts corresponding to the number of times each OTU is observed in each sample. With the Biom Importer add-in, JMP can read Biom files and convert them into OTU and sample tables for subsequent analysis. You can download the Biom Importer add-in from the JMP User Community.
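To make the structure of a Biom file concrete, here is a minimal Python sketch that turns a BIOM 1.0 (JSON) document into a dense OTU-by-sample count table. It is not how the Biom Importer add-in works internally; it only illustrates the format. It assumes the JSON flavor of the format (newer BIOM 2.x files are HDF5 and are not handled), and the OTU and sample IDs in the toy input are made up.

```python
import json

def biom_to_table(biom_text):
    """Parse a BIOM 1.0 JSON string into (otu_ids, sample_ids, dense counts)."""
    biom = json.loads(biom_text)
    otu_ids = [row["id"] for row in biom["rows"]]          # observations = OTUs
    sample_ids = [col["id"] for col in biom["columns"]]    # columns = samples
    n_rows, n_cols = biom["shape"]
    if biom["matrix_type"] == "dense":
        counts = [list(row) for row in biom["data"]]
    else:
        # Sparse layout: every entry is [row_index, column_index, value];
        # anything not listed is an implicit zero.
        counts = [[0] * n_cols for _ in range(n_rows)]
        for r, c, value in biom["data"]:
            counts[r][c] = value
    return otu_ids, sample_ids, counts

# Toy input with hypothetical IDs (3 OTUs x 2 samples).
toy_biom = json.dumps({
    "format": "Biological Observation Matrix 1.0.0",
    "type": "OTU table",
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [3, 2],
    "rows": [{"id": "OTU_1", "metadata": None},
             {"id": "OTU_2", "metadata": None},
             {"id": "OTU_3", "metadata": None}],
    "columns": [{"id": "E1_a", "metadata": None},
                {"id": "U2_a", "metadata": None}],
    "data": [[0, 0, 5], [1, 1, 12], [2, 0, 1], [2, 1, 7]],
})

otu_ids, sample_ids, counts = biom_to_table(toy_biom)
for otu, row in zip(otu_ids, counts):
    print(otu, dict(zip(sample_ids, row)))
```

Each row of `counts` is one OTU and each column one sample, which is the same layout as the OTU table described above.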

2. Select features

The first thing we want to do is reduce the dimensionality of the data by dropping redundant or irrelevant OTUs. I used the JMP “Predictor Screening” platform, which is based on bootstrap forest partitioning, to select 144 of the 4962 OTUs on the basis of their contribution to predicting bioreactor location.
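The idea behind this kind of screening can be sketched in a few lines of Python. The version below is a deliberate simplification and not JMP's algorithm: instead of growing full trees, it draws bootstrap samples and, for each feature, records the best Gini impurity reduction that a single threshold split achieves. Features are then ranked by their share of the total reduction. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split_gain(values, labels):
    """Largest impurity reduction from one threshold split on one feature."""
    parent, n, best = gini(labels), len(labels), 0.0
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        best = max(best, gain)
    return best

def screen_predictors(X, y, n_boot=50, seed=0):
    """Share of total split gain per feature, summed over bootstrap samples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    totals = [0.0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        boot_y = [y[i] for i in idx]
        for j in range(p):
            totals[j] += best_split_gain([X[i][j] for i in idx], boot_y)
    grand = sum(totals) or 1.0
    return [t / grand for t in totals]

# Toy data: rows are samples, columns are OTU counts (hypothetical values).
# OTU 0 tracks location, OTU 1 is noise, OTU 2 is constant.
X = [[50, 3, 1], [47, 9, 1], [52, 4, 1], [2, 8, 1], [0, 3, 1], [3, 9, 1]]
y = ["E1", "E1", "E1", "U2", "U2", "U2"]

contribution = screen_predictors(X, y)
ranked = sorted(range(len(contribution)), key=lambda j: -contribution[j])
print("OTU columns ranked by contribution:", ranked)
```

Keeping the top-ranked columns and dropping the rest is the analogue of selecting 144 OTUs out of 4962.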

3. Check the taxonomy of 144 top OTUs

We are interested in which taxonomic groups these most predictive OTUs belong to. This is an easy task in JMP: go to Graph > Graph Builder, drag the relevant variables to the graph panel, and select the type of plot you want to make. I plotted the frequency of each phylum among the 144 most predictive OTUs. The plot shows that the taxonomic divisions are dominated by Firmicutes, Proteobacteria, Bacteroidetes, Chloroflexi, and Spirochaetes, as well as unknown species.
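The tally behind such a frequency plot is simple to reproduce. The Python sketch below counts OTUs per phylum, assuming the taxonomy is stored as rank-prefixed lineage strings (for example `p__Firmicutes`). That layout is an assumption about the data, and the OTU IDs and lineages shown are invented.

```python
from collections import Counter

def phylum_of(taxonomy):
    """Return the phylum from a rank-prefixed lineage string, or 'Unknown'."""
    for rank in taxonomy.split(";"):
        rank = rank.strip()
        if rank.startswith("p__") and len(rank) > 3:
            return rank[3:]
    return "Unknown"

# Hypothetical lineages for a handful of selected OTUs.
top_otus = {
    "OTU_1": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "OTU_2": "k__Bacteria; p__Proteobacteria; c__Deltaproteobacteria",
    "OTU_3": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "OTU_4": "k__Bacteria; p__",
    "OTU_5": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia",
}

phylum_counts = Counter(phylum_of(lineage) for lineage in top_otus.values())
for phylum, n_otus in phylum_counts.most_common():
    print(f"{phylum:16s} {n_otus}")
```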

4. Identify taxonomic divisions by bioreactor location

The next question is how the microbial species break down at each location. As shown in the following graph created in Graph Builder, the distribution of taxonomic divisions differs notably across bioreactors.
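The same breakdown can be computed by hand as a group-by followed by a normalization within each location. The Python sketch below does that for a few made-up records; the location codes follow the naming used in this post, but the counts are invented.

```python
from collections import defaultdict

def composition_by_location(records):
    """records: (location, phylum, count) tuples -> {location: {phylum: share}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for location, phylum, count in records:
        totals[location][phylum] += count
    shares = {}
    for location, by_phylum in totals.items():
        location_total = sum(by_phylum.values())
        if location_total == 0:
            continue  # nothing observed at this location, so no proportions
        shares[location] = {p: c / location_total for p, c in by_phylum.items()}
    return shares

# Hypothetical read counts summed per location and phylum.
records = [
    ("E1", "Firmicutes", 20), ("E1", "Firmicutes", 10), ("E1", "Proteobacteria", 10),
    ("U2", "Firmicutes", 5), ("U2", "Bacteroidetes", 15),
]

shares = composition_by_location(records)
for location in sorted(shares):
    print(location, {p: round(s, 2) for p, s in shares[location].items()})
```

Working with shares rather than raw counts keeps locations with more reads from dominating the comparison.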

5. Perform Principal Components Analysis

The next task is to evaluate how well these 144 OTUs separate the bioreactor locations. In other words, we are trying to cluster the locations using the 144 OTUs. I first tried Principal Components Analysis (PCA) for this task, since it can extract information from multidimensional data. JMP provides a handy implementation of PCA under Analyze > Multivariate Methods > Principal Components.
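For readers who want to see what PCA computes, here is a small dependency-free Python sketch. It centers the counts, builds the covariance matrix, and extracts the leading components by power iteration with deflation. This is one textbook way to do it, chosen for brevity, and is not a description of JMP's implementation, which offers its own options (such as the choice of matrix to decompose). The toy counts are invented.

```python
import math
import random

def center(X):
    """Subtract each column mean."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def covariance(Xc):
    """Sample covariance matrix of already-centered data."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]

def mat_vec(C, v):
    return [sum(c * x for c, x in zip(row, v)) for row in C]

def top_eigenpair(C, iters=500, seed=0):
    """Dominant eigenvalue and unit eigenvector via power iteration."""
    rng = random.Random(seed)
    v = [rng.random() for _ in C]
    for _ in range(iters):
        w = mat_vec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    length = math.sqrt(sum(x * x for x in v))
    v = [x / length for x in v]
    return sum(a * b for a, b in zip(v, mat_vec(C, v))), v

def pca(X, n_components=2):
    """Return (scores, variances) for the first n_components components."""
    Xc = center(X)
    C = covariance(Xc)
    components, variances = [], []
    for _ in range(n_components):
        lam, v = top_eigenpair(C)
        components.append(v)
        variances.append(lam)
        # Deflation: remove the component just found before looking for the next.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(len(v))]
             for i in range(len(v))]
    scores = [[sum(a * b for a, b in zip(row, v)) for v in components] for row in Xc]
    return scores, variances

# Toy table: 4 samples x 3 OTUs; the first two samples mimic one location,
# the last two another (hypothetical counts).
X = [[10, 1, 5], [11, 2, 5], [1, 10, 5], [2, 11, 5]]
scores, variances = pca(X)
for sample, (pc1, pc2) in enumerate(scores):
    print(f"sample {sample}: PC1={pc1:.2f} PC2={pc2:.2f}")
```

In this toy table the two groups of samples land on opposite sides of PC1, which is the kind of separation a PC1 versus PC2 plot is meant to reveal.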

The plot of principal components (PC) 1 and 2 shows that most of the locations, except E2 and U3, are well separated by their microbial communities. We also notice that several samples are misclassified. For example, a sample from E4 was classified as E1, and a sample from U2 was classified as U4. The biplot shows the differences in taxonomic structure, at the class level, between clusters. Notably, certain taxonomic classes, such as Deltaproteobacteria (25 times), Clostridia (26 times), and Bacteroidia (16 times), are major contributors to the clustering.

Scatterplot:

Biplot:

6. Conduct Hierarchical Clustering

The second clustering method I tried was Hierarchical Clustering. Again, JMP has a very nice implementation of this method, found under Analyze > Clustering > Hierarchical Cluster. I clustered the locations based on their microbial communities. Both the two-way dendrogram (locations vs. species) and the constellation plot show that Hierarchical Clustering successfully groups the samples into nine locations. As we observed with PCA, several samples are misclassified, which is consistent with the prediction accuracy (96.4%) reported in the original paper.
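As a rough illustration of how agglomerative clustering builds its groups, the Python sketch below starts with every sample in its own cluster and repeatedly merges the two closest clusters. It uses Euclidean distance with average linkage only because that is short to write; JMP offers several linkage methods, and this sketch does not claim to match the settings used above. The sample values are invented.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(points, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(points))]   # start with singletons

    def distance(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy samples described by two OTU counts each (hypothetical values).
points = [[10, 1], [11, 2], [1, 10], [2, 11], [50, 50]]
clusters = average_linkage(points, n_clusters=3)
print(clusters)
```

Recording the order and distance of each merge, rather than stopping at a fixed number of clusters, is what produces a dendrogram.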

Hierarchical Clustering result:

Conclusions

In this blog post, I showed an example of analyzing metagenomics data, specifically, OTU tables, using JMP. JMP is a tool with powerful visualization and statistical modeling functionalities. There is much more to say about metagenomics data analysis with JMP. For instance, under Analyze > Predictive Modeling, there are multiple sophisticated predictive models, such as Neural Network and Naive Bayes, that can be applied to your data.

If you are interested in learning JMP, I would suggest starting with Basic Analysis, which gives an excellent introduction to data analysis with JMP. (You can also try JMP for free.) For more complex predictive modeling and cross-validation routines that have been optimized for biological data, check out the capabilities in JMP Genomics as well.

Given the frequent use of data visualization, clustering, and statistical modeling, metagenomics analysis is a perfect task for JMP. The add-in, Biom Importer, which can directly read Biom files, can further facilitate the analysis pipeline. Please leave your thoughts and questions in the comment area.

Hi MJ, thank you so much for this post! It provided a good overview, and I purchased JMP as a result of your guideline. Because I'm a SAS lover, I'd like to be able to replicate in SAS what is done here in JMP. I'm interested in your use of the “Predictor Screening” platform, which is a bootstrap forest partitioning method. Would you mind pointing me to the specific procedure and options to use to reproduce that step?

Hi @MBV, I'm glad that you found this blog helpful. You can find the Predictor Screening platform at Analyze > Screening > Predictor Screening. You first need to load your data into JMP, organize your outcome and features as columns, and then select the outcome as your Y, Response and the features as your X variables.

In terms of replicating it in SAS, I think it depends on which SAS product you use. For example, if you have SAS/STAT, the HPSPLIT procedure might give you something similar.

Thanks for your quick response MJ! I have been able to recreate the steps in JMP with no problem. I want to replicate predictor screening in SAS, and since I have base SAS/STAT I will try HPSPLIT following your advice and let you know how it goes. Take care