While empirical evaluation has long been a focus in building Web-based services such as recommender and advertising systems, traditional evaluation methodologies face new challenges in reflecting these systems’ actual online performance. This workshop aims to connect academic researchers and industrial practitioners who are working on, or interested in, online and offline evaluation of Web-based services. The goal is to provide a forum so that

The workshop will be a half-day event consisting of invited talks and contributed talks, posters, and demos. All topics related to evaluation for Web applications are welcome. We will aim for a balance between academic and industrial contributions.

Topics of the workshop include, but are not limited to, the following:

* Classic offline evaluation methodologies of systems, especially those based on standard metrics such as RMSE and NDCG.

* Online controlled experiments, such as A/B testing.

* Online interleaving experiments.

* Online adaptive sequential experiments.

* Offline evaluation of direct metrics such as revenue and click-through rate gains.

* Causal inference and counterfactual analysis based on log data.

* Practical applications and lessons related to evaluation on the Web.
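As a concrete illustration of the standard offline metrics named in the topics above, the following is a minimal sketch of RMSE and NDCG in plain Python. The function names, the linear gain, and the log2 rank discount are assumptions chosen for illustration; other NDCG variants (for example, exponential gain) are also in common use.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    # rank is 0-based, so the top item is divided by log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """NDCG@k for relevance grades listed in the order the system ranked them.

    k=None evaluates the full list.
    """
    ranked = list(relevances)[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    # Convention assumed here: a list with no relevant items scores 0.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Both functions score a system against logged ratings or relevance judgments only, which is why their agreement with online performance is one of the questions raised for this workshop.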

Examples of open questions we would like to discuss include, but are not limited to:

* What are the best ways to measure the performance of a machine learning algorithm offline? When are traditional machine learning criteria such as precision, recall, and AUC good enough to reflect the actual quality of the system?

* What are more efficient ways of running online experiments than vanilla randomized controlled experiments? Can adaptive sequential experimentation techniques (such as Yelp’s MOE) be helpful? Can variants such as interleaving be useful for applications other than information retrieval?

* What effective offline evaluation techniques do we have for measuring conversions/revenue gain, and what are their limitations?

* How can we measure the confidence in the estimates, either in the online or offline cases? Should we use t-tests or Bayesian statistics?
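To make the last question concrete, here is a minimal sketch that contrasts a frequentist and a Bayesian view of the same A/B click-through comparison. It is an illustration under stated assumptions, not a recommended procedure: the frequentist side uses a two-proportion z-test (a large-sample stand-in for the t-test mentioned above, chosen so that only the standard library is needed), the Bayesian side assumes independent uniform Beta(1, 1) priors, and all function names and counts are hypothetical.

```python
import math
import random
from statistics import NormalDist


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference in click-through rate (B minus A)."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if std_err == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def prob_b_beats_a(clicks_a, views_a, clicks_b, views_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + clicks, 1 + non-clicks).
        sample_a = rng.betavariate(1 + clicks_a, 1 + views_a - clicks_a)
        sample_b = rng.betavariate(1 + clicks_b, 1 + views_b - clicks_b)
        wins += sample_b > sample_a
    return wins / draws


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
posterior_prob = prob_b_beats_a(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, P(B > A) = {posterior_prob:.4f}")
```

The two outputs answer different questions: the p-value describes how surprising the observed difference would be if the two arms had the same rate, while the posterior probability states directly how likely it is that B is better, given the assumed priors.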