Why run A/B tests? Because we have some idea that something might be better.

The promise of multi-armed bandit (MAB) algorithms is that we can do something more like this. Ideally, we want to be able to take advantage of what we learn as we go.

So there’s this dilemma: do we exploit the option we think is best based on what we’ve seen, or do we explore the other options to find out more about them?

MABs introduce this concept of regret. It’s a measure of how often you had to try an objectively worse option in order to figure out which one is objectively best.
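One common way to put a number on regret is to add up, over every visit, how much expected reward was given up by not showing the best option. The sketch below assumes we somehow know the true conversion rates, which is only the case in a simulation; the function name and the example rates are made up for illustration.

```python
def cumulative_regret(true_rates, choices):
    """Total expected reward given up by not always picking the best arm.

    true_rates: true conversion rate of each arm (known only in a simulation).
    choices: the arm index that was shown on each visit.
    """
    best = max(true_rates)
    return sum(best - true_rates[arm] for arm in choices)


# Hypothetical example: arm 1 is truly better (10% vs 5% conversion).
true_rates = [0.05, 0.10]
choices = [0, 1, 0, 1, 1, 1]  # arm 0 was shown twice
regret = cumulative_regret(true_rates, choices)
```

Here the worse arm was shown twice, and each of those visits cost 0.05 in expected conversions, so the regret comes out at roughly 0.1. An algorithm that always showed arm 1 would have a regret of 0.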

Let’s look at a few classic MAB algorithms.

Epsilon-greedy works by randomly alternating between exploration and exploitation. The name comes from the parameter epsilon, which determines how much of the time is spent exploring rather than exploiting.
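A minimal sketch of epsilon-greedy, assuming the usual convention that epsilon is the probability of exploring. The function names, the running-average update, and the simulated visitors are illustrative assumptions, not code from the talk.

```python
import random


def epsilon_greedy_choice(values, epsilon, rng=random):
    """Pick an arm: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Explore: pick any arm uniformly at random.
        return rng.randrange(len(values))
    # Exploit: pick the arm with the best observed average so far.
    return max(range(len(values)), key=lambda i: values[i])


def update(counts, values, arm, reward):
    """Fold one observed reward into the running average for that arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]


def simulate(true_rates, epsilon, steps, seed=0):
    """Run epsilon-greedy against simulated visitors (hypothetical rates)."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    values = [0.0] * len(true_rates)
    for _ in range(steps):
        arm = epsilon_greedy_choice(values, epsilon, rng)
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        update(counts, values, arm, reward)
    return counts, values
```

With epsilon set to 0.1, about 10% of visitors see a randomly chosen variation and the other 90% see whichever variation currently looks best. Setting epsilon to 1.0 gives you a plain A/B test.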

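So one of the weaknesses of epsilon-greedy is that it doesn’t take into account how much better or worse the variations are relative to each other. When it explores, it treats every variation the same, no matter how badly it has performed.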

Softmax attempts to address this by exploring options in proportion to how good they appear to be.

Let’s say we’ve got two options and one is twice as good as the other.

So we could do a straight proportionality, but instead softmax does this trick with exponentials so you can have rewards of arbitrary size and still get back probabilities between 0 and 1. The exponential kind of squishes everything into a known range. You can even have negative rewards.

Softmax also has this concept of “temperature”. A bigger temperature means more “energy”, so more randomness, closer to a 50/50 A/B test. A lower temperature, closer to 0, will pick the best option more often, in proportion to how good it looks. As the temperature approaches 0, you get 100% exploitation.
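The exponential trick and the temperature can be sketched together as follows. This is the standard softmax (Boltzmann) formula, where each option gets weight exp(value / temperature); the function names are assumptions for illustration, and the temperature has to be strictly greater than 0 because the formula divides by it.

```python
import math
import random


def softmax_probabilities(values, temperature):
    """Turn estimated rewards into selection probabilities that sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be greater than 0")
    # Subtracting the max avoids overflow and does not change the result.
    top = max(values)
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_choice(values, temperature, rng=random):
    """Sample one arm index according to the softmax probabilities."""
    probs = softmax_probabilities(values, temperature)
    return rng.choices(range(len(values)), weights=probs)[0]


# Two options, one twice as good as the other (hypothetical rates).
values = [0.1, 0.2]
hot = softmax_probabilities(values, 100.0)   # high temperature: near 50/50
cold = softmax_probabilities(values, 0.01)   # low temperature: near-greedy
```

With a temperature of 100 the two options are picked almost equally often, while at 0.01 the better option is picked nearly every time. Negative rewards work too, since the exponential of any number is positive.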

So one of the weaknesses of softmax is that it doesn’t take into account how much you know about the different options.

The idea is to keep track of how much you know about each option, which gives you a measure of how confident you are about each one.

There’s a whole family of upper confidence bound (UCB) algorithms, but this is one called UCB1.

So we take the observed conversion rate and add the confidence bound as a kind of bonus.
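As a rough sketch, here is UCB1 in its usual textbook form: each option's score is its observed average plus a bonus of sqrt(2 * ln(total plays) / plays of that option), and any option that has never been tried goes first. The function name and the example numbers are assumptions for illustration.

```python
import math


def ucb1_choice(counts, values):
    """Pick the arm with the highest observed average plus confidence bonus.

    counts: how many times each arm has been shown.
    values: observed average reward of each arm.
    """
    # Try every arm once first, so the bonus never divides by zero.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    scores = [
        values[i] + math.sqrt(2 * math.log(total) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(scores)), key=lambda i: scores[i])


# Arm 1 looks slightly worse but has barely been tried, so its bonus is large.
choice = ucb1_choice(counts=[100, 1], values=[0.5, 0.4])
```

The bonus shrinks as an option gets shown more often, so a rarely tried option keeps getting another look until we are confident it really is worse. Notice there is no randomness and no parameter to tune.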

So the main gotcha of UCB1 is that the payoff has to be between 0 and 1.

There are a bunch of variations of UCB algorithms. There are also contextual bandit algorithms, which can take into account information about visitors. Exp3 algorithms are useful.

Thanks to Lars. Also, thanks to John Myles White. This talk is based on a presentation of his.