Data Mining for Causal Inference

As an increasing share of daily activity, from what we purchase to who we talk to, shifts to online platforms, it is natural to ask how those platforms affect our behavior. Take, for instance, online recommendation systems: how much activity do recommendations actually cause over and above what would have happened in their absence? Without randomized experiments, which may be costly or infeasible, estimating the impact of such systems is non-trivial. In this talk, I will argue that careful data mining can help answer such causal questions more generally than traditional observational approaches. In the first example, I will show how data mining can augment a popular technique, instrumental variables, by searching for large and sudden shocks in time series data. Applying this method to system logs for Amazon’s “People who bought this also bought” recommendations, we are able to analyze over 4,000 unique products that experience such shocks. This leads to a more accurate estimate of the impact of the recommender system: at least 75% of recommendation click-throughs would likely have occurred even without the recommendations, which calls into question popular industry estimates based on observed click-through rates. In the second example, I will present a general data-driven identification strategy for finding natural experiments in time series data, inspired by the shock-based approach above. This method, too, reveals a similar overestimate of the impact of recommendation systems.
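
As a rough illustration of what searching for large and sudden shocks in a time series might look like, the sketch below flags days on which a product's visits jump well above its recent baseline. It is not the method used in the talk: the function name `find_shocks`, the 7-day window, the 3x threshold, and the visit counts are all hypothetical choices made for this example.

```python
from statistics import median


def find_shocks(visits, window=7, ratio=3.0):
    """Return indices t where visits[t] is at least `ratio` times the
    median of the preceding `window` observations.

    Hypothetical helper: the window length and threshold are
    illustrative assumptions, not values taken from the talk.
    """
    shocks = []
    for t in range(window, len(visits)):
        # Baseline is the median of the trailing window, which is
        # robust to an earlier spike sitting inside that window.
        baseline = median(visits[t - window:t])
        if baseline > 0 and visits[t] >= ratio * baseline:
            shocks.append(t)
    return shocks


# Made-up daily visit counts for one product, with a spike on day 8.
daily_visits = [100, 104, 98, 101, 97, 103, 99, 102, 520, 110, 100]
print(find_shocks(daily_visits))
```

A trailing median is used here instead of a mean so that a single spike does not inflate the baseline for the days that follow it. Any real analysis would also need to check that the shock is unrelated to demand for the recommended products, which this sketch does not attempt.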