When reading about deep learning, I found the word2vec paper awesome. It's a relatively simple concept: represent a word, through its context, as a vector. Yet I was amazed that the mathematical distances between these vectors actually turned out to preserve meaning.

So, great. We can all agree it's impressive that computers can learn how France is to Paris as Italy is to Rome. But how useful is it if we give it a brief shot on medical genetics data?

I decided to make use of the NCBI OMIM database as a text corpus to build the word2vec model.

OMIM: Online Mendelian Inheritance In Man

OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available, authored and edited by the Institute of Genetic Medicine at the Johns Hopkins University School of Medicine.

Wanting to be productive as quickly as possible, I decided to work with deeplearning4j, as I have been familiar with Java for the last 10 years. And I am pretty fond of Spring Boot these days, so I could easily share the outcome of this experiment as a service in the future.

I first got up to speed with deeplearning4j through the tutorials on their home page, specifically the one about word2vec.

Ok, so I downloaded the whole OMIM database, a 178MB text file, which I made available as a Java ClassPathResource and fed into a SentenceIterator from deeplearning4j.

Next, just following the instructions from deeplearning4j, I decided to use the default TokenizerFactory, and we're already good to give it a first shot with a minimal configuration. (I'm running this during my daily train ride on my three-year-old MacBook Pro.)
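For reference, a minimal setup looks roughly like the deeplearning4j word2vec tutorial code. The file path and hyperparameter values below are illustrative assumptions, not my exact settings:

```java
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class OmimWord2Vec {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to the downloaded OMIM text dump.
        SentenceIterator iter = new BasicLineIterator("data/omim.txt");

        // Default tokenizer with a simple preprocessor (lowercasing, stripping punctuation).
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        tokenizer.setTokenPreProcessor(new CommonPreprocessor());

        // Minimal word2vec configuration, close to the dl4j tutorial defaults.
        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)   // ignore very rare tokens
                .layerSize(100)        // dimensionality of the word vectors
                .windowSize(5)         // context window around each token
                .seed(42)
                .iterate(iter)
                .tokenizerFactory(tokenizer)
                .build();
        vec.fit();

        // Query the trained model for a term's nearest neighbours.
        System.out.println(vec.wordsNearest("brca", 5));
    }
}
```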

The brca gene has well-known mutations driving breast and ovarian cancer. As expected, but still very clever of the word2vec model.

telomere is associated with systolic and diastolic; I do not fully grasp these associations. The telomerase association is clear, though.

angiogenesis is associated with adhesion, migration, invasion, healing, and tnf. Again, a very strong showing by the model.

Conclusion I
It seems like the word2vec model, though far from optimally trained on my MacBook, did learn quite a few good associations from the corpus.

Let's do some negative testing with nonsense words.

Term         word2vec similar terms
kenny        harano, moo-penn, hamel, stevenson, male
university   pp, press, ed.), (pub.)
the          /
why          /

So my first name is associated with some author names, and university is associated with press and other publishing terms. As expected, 'the' and 'why', which occur more or less at random throughout the text, don't return any associations. Great, it's pretty good.

Finally, the word2vec examples are known for their analogies: France is to Paris what Italy is to X. Word2vec can fill in Rome here by crunching Wikipedia. So can we try to find analogous terms for genotype-phenotype associations?

Positive Terms   Negative Terms   word2vec analogous terms
+brca +breast    -alk             nonpolyposis, carcinoma, nonsmall, squamous, colorectal

Once again, very impressive. Adding the breast vector and subtracting the alk vector yields a vector near 'colorectal'. Indeed, in a cancer setting, brca is to breast what alk is to colorectal.
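The analogy query itself is just vector arithmetic: sum the positive vectors, subtract the negative ones, and return the nearest remaining word by cosine similarity. Here is a minimal sketch in plain Java, using tiny hand-made 3-d vectors; the words and values are invented purely for illustration, since real embeddings are learned from the corpus and have hundreds of dimensions:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnalogyDemo {
    // Cosine similarity between two vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Sum the positive vectors, subtract the negative ones, then return
    // the closest word by cosine similarity, excluding the query terms.
    static String wordsNearest(Map<String, double[]> emb,
                               List<String> positive, List<String> negative) {
        int dim = emb.values().iterator().next().length;
        double[] q = new double[dim];
        for (String w : positive)
            for (int i = 0; i < dim; i++) q[i] += emb.get(w)[i];
        for (String w : negative)
            for (int i = 0; i < dim; i++) q[i] -= emb.get(w)[i];
        String best = null;
        double bestSim = -2;
        for (Map.Entry<String, double[]> e : emb.entrySet()) {
            if (positive.contains(e.getKey()) || negative.contains(e.getKey())) continue;
            double sim = cosine(q, e.getValue());
            if (sim > bestSim) { bestSim = sim; best = e.getKey(); }
        }
        return best;
    }

    // Toy 3-d embeddings, hand-picked for illustration only.
    static Map<String, double[]> toyEmbeddings() {
        Map<String, double[]> emb = new HashMap<>();
        emb.put("brca",       new double[]{0.9,  0.1, 0.0});
        emb.put("breast",     new double[]{0.8,  0.2, 0.1});
        emb.put("alk",        new double[]{0.1,  0.9, 0.0});
        emb.put("colorectal", new double[]{0.8, -0.6, 0.1});
        emb.put("ovarian",    new double[]{0.7,  0.3, 0.2});
        emb.put("lung",       new double[]{0.0,  0.2, 0.9});
        return emb;
    }

    public static void main(String[] args) {
        Map<String, double[]> emb = toyEmbeddings();
        // +brca +breast -alk  ->  nearest remaining word
        System.out.println(wordsNearest(emb, List.of("brca", "breast"), List.of("alk")));
    }
}
```

With these toy vectors the query lands nearest 'colorectal', mirroring the result from the trained model above.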

I have been wondering what all the noise about deep learning is about. It's still neural networks, right? I have not had much experience with NNs because they're supposed to be hard to get right due to parameter tuning, which is a downer if you're used to good all-round performers like random forests. Still, I decided to set out on a series of blog posts using h2o (R) and deeplearning4j (Java) on biotech datasets.

We’ll be working with the BreastCancer dataset from the mlbench package.
From the package description:

The objective is to identify each of a number of benign or malignant
classes. Samples arrive periodically as Dr. Wolberg reports his
clinical cases. The database therefore reflects this chronological
grouping of the data. This grouping information appears immediately
below, having been removed from the data itself. Each variable except
for the first was converted into 11 primitive numerical attributes
with values ranging from 0 through 10. There are 16 missing attribute
values. See cited below for more details.

The test set error rate is 2.31% without a lot of effort, so there is not much room for improvement. (Maybe this is not the best dataset.) One of the nice things about random forests is that they're easy to understand by looking at the variable importance plot.

varImpPlot(rf)

It demonstrates, as expected, that cell size and shape are the most predictive features for the breast cancer RF classifier. We can also inspect per-feature decision surfaces, plotted below, where the malignant weight increases with higher values of cell size.

So how good does it get using h2o deep learning without much fine-tuning?

Before this analysis, I had already set up the h2o R package. Instructions for running h2o are nicely summarized here, so I can now simply fire up a local instance for testing with the following command.

Now build a model with default parameters. With h2o, you have to specify predictor and response variables by column index or by column name; here, we use column names. (Have a look at the magrittr R package if you're confused by the '%>%' operator.)

So while cell shape determined a lot of the RF, it is of minor importance in the DL model. And while mitoses did not contribute at all to the RF, it carries a lot of weight in the DL model. In fact, all of the features seem to be used by the DL model, so is it making better use of the available information?

The goal of this blog is to bring more focus to those little projects one does on the side: learning a new package, a couple of random thoughts, or a strong opinion about a particular story. We'll try not to focus too much on the technology, but put our minds into the data, week by week.