We decided that the SLiME package, as it was packaged for the publication of the paper, should no longer be distributed. This notebook replaces it, replicating the analysis from the paper with more up-to-date tools and (hopefully soon) expanding on its conclusions.
I hope this can be the starting point for others trying to follow the same approach and improve upon it.

The data come from two rounds of 16S sequencing of previously collected stool samples. Here we will use the OTU tables directly; they were created with the RDP classifier and subsequently normalized (details in the paper's methods).

Sequencing was performed at the Broad Institute. The first round of sequencing was dubbed CHIMP (Children's Hospital IBD Pediatric), while the second round -- performed at the request of an anonymous peer reviewer -- was termed 'blind validation'. Its purpose was to further validate the algorithm trained on the CHIMP dataset, as the reviewer did not think a "leave 20% out" approach on CHIMP was sufficient to demonstrate robust prediction. In the last figure of the paper, these were used as the training and test set respectively.

It is useful to join the two data sets here.

In [5]:

# get the CHIMP training data
X_chimp = pd.read_csv('data/chimp/chimp.Qsorted.rdpout.xtab.norm', delimiter="\t", index_col=0)
y_chimp = pd.read_csv('data/chimp/sampledata.training.chimp.csv', index_col=0)
# just make sure the labels are the same
X_chimp.sort_index(inplace=True)
y_chimp.sort_index(inplace=True)
assert (X_chimp.index == y_chimp.index).all()

In [6]:

# do the same for the blind validation test data
X_blind = pd.read_csv('data/chimp/blind.sorted.rdpout.xtab.norm', delimiter="\t", index_col=0)
y_blind = pd.read_csv('data/chimp/sampledata.validation.blind.csv', index_col=0)
X_blind.sort_index(inplace=True)
y_blind.sort_index(inplace=True)
assert (X_blind.index == y_blind.index).all()

In [7]:

# concatenate using pandas
X = pd.concat([X_chimp, X_blind], keys=['chimp', 'blind'])
X.head()

Out[7]:

            cls_Actinobacteria  cls_Alphaproteobacteria  cls_Bacilli  cls_Bacteroidia  cls_Betaproteobacteria
chimp 003A            0.000000                 0.000549     0.014827         0.002197                0.000275
      004A            0.000000                 0.000000     0.002486         0.754195                0.000000
      005A            0.006521                 0.000000     0.026084         0.000000                0.000000
      008A            0.000315                 0.000000     0.000210         0.837024                0.035995
      009A            0.001291                 0.000000     0.001550         0.823864                0.024277

            cls_Clostridia  cls_Cyanobacteria  cls_Deltaproteobacteria  cls_Epsilonproteobacteria  cls_Erysipelotrichi  ...
chimp 003A        0.230917                  0                 0.000000                          0             0.000000  ...
      004A        0.230889                  0                 0.000932                          0             0.001865  ...
      005A        0.908706                  0                 0.000000                          0             0.054451  ...
      008A        0.112499                  0                 0.000000                          0             0.000840  ...
      009A        0.118027                  0                 0.000000                          0             0.000000  ...

            phylum_Euryarchaeota  phylum_Firmicutes  phylum_Fusobacteria  phylum_Lentisphaerae  phylum_NA
chimp 003A                     0           0.245744             0.023064                     0   0.000000
      004A                     0           0.238036             0.000000                     0   0.000000
      005A                     0           0.990545             0.000000                     0   0.000326
      008A                     0           0.113758             0.000000                     0   0.000000
      009A                     0           0.119576             0.000000                     0   0.000000

            phylum_Proteobacteria  phylum_Spirochaetes  phylum_Synergistetes  phylum_TM7  phylum_Verrucomicrobia
chimp 003A               0.728995                    0                   NaN         NaN                       0
      004A               0.007147                    0                   NaN         NaN                       0
      005A               0.002608                    0                   NaN         NaN                       0
      008A               0.048903                    0                   NaN         NaN                       0
      009A               0.055269                    0                   NaN         NaN                       0

5 rows × 284 columns

In [8]:

X.fillna(value=0, inplace=True)  # replace NAs with zeroes

In [9]:

y_dx = pd.concat([y_chimp.dx, y_blind.dx], keys=['chimp', 'blind'])
y_dx  # btw, what a joy it is to use pandas over R/dplyr for this. So intuitive and fast.

# convert the training and testing labels to numerical values
le = LabelEncoder()
le.fit(y_dx)
y = le.transform(y_dx)
# just for reference, the columns of the binarized label read respectively:
le.inverse_transform([0, 1, 2])
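For intuition, here is how LabelEncoder behaves on a toy label vector (the class names CD/NM/UC come from the paper's three diagnoses; the sample values are made up):

```python
from sklearn.preprocessing import LabelEncoder

labels = ['CD', 'NM', 'UC', 'CD']   # toy diagnosis labels
le = LabelEncoder().fit(labels)
y = le.transform(labels)            # classes are sorted alphabetically
print(list(y))                      # → [0, 1, 2, 0]
print(list(le.inverse_transform([0, 1, 2])))  # → ['CD', 'NM', 'UC']
```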

We will go straight to using RandomForest with 10-fold cross-validation. Many other models were tried, but RandomForest was consistently the most resistant to overfitting. First, let's get an idea of what it looks like when we try to classify all the labels at the same time.
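A minimal sketch of that setup, on synthetic stand-in data (the hyperparameters and dataset here are illustrative assumptions, not the paper's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.rand(90, 10)           # stand-in for the abundance table
y_demo = np.repeat([0, 1, 2], 30)   # three balanced classes, like CD/NM/UC

clf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv)  # one accuracy per fold
print(scores.mean())
```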


The probabilities of each class are now in a numpy array where each row corresponds to a sample and each column to the label in question (CD, NM or UC). Let's take a peek at the first 10:
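As a hedged illustration of how such a probability array is obtained with predict_proba (data and split are synthetic stand-ins, not the notebook's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X_demo = rng.rand(90, 10)
y_demo = np.repeat([0, 1, 2], 30)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
y_score = clf.predict_proba(X_te)  # shape: (n_test_samples, 3), one column per class
print(y_score[:10])                # each row sums to 1
```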

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = roc_auc_score(y_test[:, i], y_score[:, i], average="micro")
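Note that the `y_test[:, i]` indexing assumes the test labels were one-hot binarized. A self-contained sketch of the same per-class ROC computation on synthetic data, using label_binarize for that step:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

rng = np.random.RandomState(2)
X_demo = rng.rand(120, 10)
y_demo = np.repeat([0, 1, 2], 40)
n_classes = 3

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)
y_test = label_binarize(y_te, classes=[0, 1, 2])  # shape: (n_test, 3), 0/1 columns
y_score = RandomForestClassifier(random_state=2).fit(X_tr, y_tr).predict_proba(X_te)

fpr, tpr, roc_auc = dict(), dict(), dict()
for i in range(n_classes):
    # one binary ROC per class, treating class i as the positive label
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = roc_auc_score(y_test[:, i], y_score[:, i])
```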

The original paper links to the available dataset. As outlined in the paper, the SLiME pipeline runs RDP on the fasta files and outputs a matrix of normalized abundance of each family, class, genus, etc.

I am going to load that table directly; it has had the IBD classification label appended to it.
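A minimal sketch of splitting such a combined table back into features and label. The column name 'dx' is an assumption carried over from the sample-data files above, and the abundance values here are made up:

```python
import pandas as pd

# toy stand-in for an abundance table with the label appended as 'dx'
df = pd.DataFrame({
    'cls_Clostridia': [0.23, 0.91],
    'phylum_Firmicutes': [0.25, 0.99],
    'dx': ['CD', 'NM'],
}, index=['003A', '005A'])

y = df['dx']                 # classification labels
X = df.drop(columns='dx')    # abundance features only
print(X.shape, list(y))      # → (2, 2) ['CD', 'NM']
```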