Exploring Bayesian Bandits - an Online Tool

I have been reading a bit recently about so-called Bayesian Bandits, as described by Ted Dunning. The problem is to pick a strategy for playing N slot machines some number of times, when the probability of winning on each machine is unknown. It has a number of interesting applications in display advertising, news article recommendation, and click-through-rate prediction (Agrawal and Goyal, 2012). As Dunning notes, any solution must effectively handle the explore/exploit trade-off. The implementation in this case - using Thompson sampling - is straightforward, and it naturally allows for quick updating as more data/information becomes available.
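The tool's source isn't reproduced here, but the core idea is small enough to sketch. In Thompson sampling, each arm keeps a Beta(wins + 1, losses + 1) posterior over its payoff probability; on every play we draw one sample from each posterior and pull the arm with the largest draw. A minimal illustration (all names and the example probabilities are my own, not the tool's):

```javascript
// Sample Beta(a, b) for integer a, b >= 1 using the fact that the
// a-th smallest of (a + b - 1) i.i.d. uniforms is Beta(a, b) distributed.
// (Simple and correct, though O(n log n) per draw - fine for a sketch.)
function sampleBeta(a, b) {
  const u = Array.from({ length: a + b - 1 }, () => Math.random());
  u.sort((x, y) => x - y);
  return u[a - 1];
}

// One Thompson-sampling step: draw from each arm's posterior, pick the argmax.
function thompsonStep(arms) {
  let best = 0, bestDraw = -1;
  arms.forEach((arm, i) => {
    const draw = sampleBeta(arm.wins + 1, arm.losses + 1);
    if (draw > bestDraw) { bestDraw = draw; best = i; }
  });
  return best;
}

// Simulate two bandits with hidden win probabilities (hypothetical values);
// trueProbs is known only to the simulator, never to the algorithm.
const trueProbs = [0.12, 0.45];
const arms = trueProbs.map(() => ({ wins: 0, losses: 0 }));

for (let t = 0; t < 2000; t++) {
  const i = thompsonStep(arms);
  if (Math.random() < trueProbs[i]) arms[i].wins++;
  else arms[i].losses++;
}
```

After a couple of thousand pulls, the vast majority of plays end up going to the better arm, while the worse arm still gets the occasional "exploration" pull whenever its posterior happens to produce a large draw.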

You can watch as the algorithm "decides" on which bandit to pick next, and you can see how it reacts when you change the probabilities mid-stream (either by entering new values or dragging your mouse left/right in the input box).

Each Hit/Miss cell flashes and then fades from green/red as the bandit corresponding to that cell is tried, and either pays off or not. This serves to convey how the algorithm is either "exploring" or "exploiting".

In playing with this, a few other things become apparent.

As noted by Dunning, the algorithm can very quickly find the best bandit when the probabilities for each are disparate - it's impressive.

After many pulls, once the algorithm has homed in on the best bandit, it can be fairly slow to react to changes in the probabilities. A possible solution might be some mechanism that forces it to "forget" behavior past a certain (potentially rolling) point in time (a la Davidson-Pilon's suggested learning-rate approach). The same applies to bandits it may have ruled out after many pulls of poor performance, but which might actually have improved since they were last "given a chance". Graepel, Candela, Borchert, and Herbrich (2010) at Microsoft - using a (much) more sophisticated scheme involving Gaussians for click-through-rate prediction in sponsored search, apparently based to some degree on the TrueSkill ranking system used to match players for Xbox Live games - use a correction term that makes the posterior distributions converge back to the prior given zero data and infinite time; i.e., poor performance is slowly forgotten, and there will be better and better chances to "get back in the game" and try again.
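One way the learning-rate idea can be realized (this is my own sketch, not code from the tool or from Davidson-Pilon) is to discount every arm's counts by a factor just under 1 on each update. Old evidence then decays geometrically, the effective sample size is bounded by roughly 1 / (1 - rate), and the posterior stays loose enough to track drifting probabilities:

```javascript
// "Forgetting" update for Bernoulli bandits: discount all counts,
// then record the new observation for the pulled arm.
// The 0.99 default rate is an arbitrary illustrative choice; it caps
// the effective sample size near 1 / (1 - 0.99) = 100 observations.
function updateWithForgetting(arms, pulled, won, rate = 0.99) {
  arms.forEach((arm, i) => {
    arm.wins *= rate;
    arm.losses *= rate;
    if (i === pulled) {
      if (won) arm.wins += 1;
      else arm.losses += 1;
    }
  });
}
```

Note that with discounting the counts are no longer integers, so the Beta sampler must accept real-valued parameters (e.g. one built on gamma draws) rather than relying on integer-only tricks.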

Feel free to let me know if you get a chance to play with this as well, and/or if you find issues with the implementation here.

Notes:

Ted Dunning pointed out an issue with the sampling for the "pull arm for a bandit" that was fixed July 8, 2013. He also noted some weirdness with the "random" sampling - the posteriors seemed to hover too far above the actual probability - and pointed me to Sean McCullough's JavaScript implementation of Matsumoto and Nishimura's Mersenne Twister to replace JavaScript's Math.random(); this change was implemented July 9, 2013.

On July 10, 2013, steps were taken to deal with apparent underflow issues in the beta function calculation for the displayed charts (everything is now calculated in log space). This allows the simulation to run for a longer time: the maximum number of pulls is now 150,000 instead of 1,500, which, since things are updated 10 times a second, corresponds to a simulation of a little more than four hours of clock time. Though I'm not sure longer runs would be needed for this tool in this context, the pull limit will likely be removed shortly, assuming no further issues show up at the larger values.
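To see why log space helps: the Beta(a, b) normalizing constant B(a, b) = Γ(a)Γ(b)/Γ(a+b) underflows to zero in floating point once the counts get large, while log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b) stays a modest number. A sketch of the idea (again my own, not the tool's code; JavaScript's Math has no lgamma, so one standard option is the Lanczos approximation):

```javascript
// Log-gamma via the Lanczos approximation (g = 7, 9 coefficients).
const LANCZOS = [
  0.99999999999980993, 676.5203681218851, -1259.1392167224028,
  771.32342877765313, -176.61502916214059, 12.507343278686905,
  -0.13857109526572012, 9.9843695780195716e-6, 1.5056327351493116e-7
];

function logGamma(x) {
  if (x < 0.5) {
    // Reflection formula for small arguments.
    return Math.log(Math.PI / Math.sin(Math.PI * x)) - logGamma(1 - x);
  }
  x -= 1;
  let a = LANCZOS[0];
  const t = x + 7.5;
  for (let i = 1; i < 9; i++) a += LANCZOS[i] / (x + i);
  return 0.5 * Math.log(2 * Math.PI) + (x + 0.5) * Math.log(t) - t + Math.log(a);
}

// Beta(a, b) log-density at p, computed entirely in log space:
// log pdf = (a-1) log p + (b-1) log(1-p) - log B(a, b).
function logBetaPdf(p, a, b) {
  const logB = logGamma(a) + logGamma(b) - logGamma(a + b);
  return (a - 1) * Math.log(p) + (b - 1) * Math.log(1 - p) - logB;
}
```

The chart can then exponentiate logBetaPdf only at the end (or plot the log directly), so intermediate values stay finite even after 150,000 pulls.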

