You may have already read many times that the job of a Data
Scientist is to skim through huge amounts of data searching for
correlations between variables of interest, and that one of
their worst enemies (besides "correlation doesn't imply causation") is
spurious correlation. But what really is correlation? Are there several
types of correlation? Some "good", some "bad"? What about their
estimation? This talk will be a very visual presentation around the
notions of correlation and dependence. I will first illustrate how the
standard linear correlation is estimated (the Pearson coefficient), then
a more robust alternative: the Spearman coefficient. Building on a
geometric understanding of their nature, I will present a generalization
that can help Data Scientists explore, interpret, and measure the
dependence (not necessarily linear or comonotonic) between the variables
of a given dataset. Financial time series (stocks, credit default
swaps, FX rates) and features from the UCI datasets are considered as
use cases.
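The contrast between the two coefficients can be sketched in a few lines. This is an illustrative example, not material from the talk: the data is synthetic, and the monotonic transform `np.exp` is chosen only to show that Spearman (which works on ranks) captures a comonotonic relationship that Pearson (which measures linear association) understates.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# y is a strictly monotonic but non-linear function of x:
# the variables are comonotonic, not linearly related.
y = np.exp(x)

# Pearson measures only the linear part of the association...
r_pearson, _ = pearsonr(x, y)
# ...while Spearman correlates the ranks, so any strictly
# monotonic relationship yields a coefficient of 1.
r_spearman, _ = spearmanr(x, y)

print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Because `np.exp` preserves ordering, the ranks of `x` and `y` coincide and the Spearman coefficient is exactly 1, while the Pearson coefficient is noticeably lower.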

Dataiku recently worked on a recommender system for an e-business
vacation retailer, based on the products users had previously visited. We
created a meta-model on top of a classical recommender system that
generated a 7% increase in revenue during the A/B test. For this
type of business, the content of the product image is paramount, so the
obvious next step was to add image information to the recommender. The
key takeaway is this: you don't need a deep learning expert to solve
the tagging problem. Because labeled datasets and corresponding
pre-trained neural networks are available on the Internet, you can use
"transfer learning" and map your problem to an existing one. The
post-processing step consists of grouping labels to obtain features
associated with more global visual themes. For instance, "theme beach" =
coast + ocean + sandbar. We use these features to recommend personalized
products to customers, or to address marketing questions such as: what
kind of image should I propose for this product?
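The label-grouping step can be sketched as follows. This is a minimal illustration, not Dataiku's actual pipeline: the "theme beach" grouping comes from the abstract, while `THEMES`, `theme_features`, the mountain theme, and the label scores are hypothetical stand-ins for the outputs of a pre-trained image classifier.

```python
# Hypothetical mapping from visual themes to classifier labels.
# Only the beach grouping is taken from the abstract.
THEMES = {
    "theme_beach": ["coast", "ocean", "sandbar"],
    "theme_mountain": ["alp", "cliff", "valley"],  # illustrative
}

def theme_features(label_scores):
    """Aggregate per-label scores into per-theme features by summing.

    label_scores: dict mapping a label name to the score a pre-trained
    classifier assigned to the image; missing labels count as 0.
    """
    return {
        theme: sum(label_scores.get(label, 0.0) for label in labels)
        for theme, labels in THEMES.items()
    }

# Example scores for one product image (made up for illustration).
scores = {"coast": 0.4, "ocean": 0.3, "sandbar": 0.1, "alp": 0.05}
features = theme_features(scores)
print(features)
```

Summing is the simplest aggregation choice; taking the maximum score per theme would be an equally plausible alternative depending on how the downstream recommender consumes the features.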