Daily Archives: August 2, 2012

In some Olympic sports, such as diving, the athlete receives scores based on several trials. In diving, each trial is a separate dive, and a diver’s total score is the sum of the scores earned across trials. What is the best way to maximize the chance of getting a medal?

Assumptions:

1. Divers can select their dives and their order on the fly (clearly not true, but it makes the problem more interesting).
2. A diver’s performance on each dive is independent of his/her performance on other dives.
3. From experience, divers know the distribution of points for each of their dives.

Model 1: Divers know the “threshold” for medaling ahead of time (dubious!)

Let’s solve this problem using a Markov decision process, where each stage corresponds to a dive (stages 1–5 for women or 1–6 for men).

Let

V_t(S(t)) = value of being in state S(t), where S(t) = the total number of points accumulated before dive t.

and let

M = point threshold for medaling.

The rewards are R_t(S(t), P) = 1 if the diver moves from state S(t) < M to S(t) + P >= M (i.e., the P points from the dive at stage t move the diver into medal contention by pushing the total past threshold M). All other rewards are zero. The diver wants to maximize the probability of medaling, which is equivalent to maximizing the total value V_1(0), the expected value of the 0–1 indicator of whether the diver medals.

Let the set of dives in each category be captured by D_1, D_2, …, D_5 for women or D_1, D_2, …, D_6 for men. For each d in D_t, there is a known probability distribution for the points P.

Now the Bellman equations are

V_t(S(t)) = max_{d in D_t} E{ R_t(S(t), P) + V_{t+1}(S(t) + P) }

Here, the expectation is taken over the points distribution P for dive d in D_t. The probability of medaling is V_1(0), the value before dive 1 starting with 0 points. The boundary conditions are V_{T+1}(S(T+1)) = 0 for all values of S(T+1), since no rewards accrue after the final dive and the medal reward is collected at most once.

The optimal policy indicates what dive should be chosen in each trial (MDP stage) based on the total number of points that have been accumulated thus far. If a diver is successful with tough dives early on, the diver can choose easier dives later on (and vice versa).
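As a concrete toy sketch, the backward induction for Model 1 can be coded directly. The dive names, point distributions, number of stages, and threshold below are all made up for illustration; a real implementation would use a diver’s actual dive list and empirical point distributions.

```python
from functools import lru_cache

# Hypothetical point distributions, dive name -> {points: probability}.
# Two stages with two dive choices each, purely for illustration.
DIVES = [
    {"safe": {7: 1.0}, "risky": {10: 0.5, 3: 0.5}},  # stage 1 choices
    {"safe": {7: 1.0}, "risky": {10: 0.5, 3: 0.5}},  # stage 2 choices
]

def medal_probability(dives, M):
    """V_1(0): the maximum probability of reaching threshold M."""
    T = len(dives)

    @lru_cache(maxsize=None)
    def V(t, s):
        if t == T:            # boundary condition: V_{T+1}(.) = 0
            return 0.0
        best = 0.0
        for dist in dives[t].values():
            # E[ R_t(s, P) + V_{t+1}(s + P) ] over the points distribution of d
            val = sum(p * ((1.0 if s < M <= s + pts else 0.0) + V(t + 1, s + pts))
                      for pts, p in dist.items())
            best = max(best, val)
        return best

    return V(0, 0)  # value before dive 1, starting with 0 points
```

The memoized recursion is exactly the Bellman equation evaluated backward from the boundary condition; reading off the maximizing dive at each (t, s) recovers the optimal policy.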

Model 2: Divers do not know the “threshold” for medaling ahead of time so they maximize the total number of points

The model dynamics here are identical to those of the model above except for the rewards. The random rewards are R_t(S(t), P) = P, where P is the (random) number of points the diver earns from dive d at stage t. The rest is the same, yielding Bellman equations of

V_t(S(t)) = max_{d in D_t} E{ P + V_{t+1}(S(t) + P) }

They look the same as above, but the rewards are different. The expected number of points is captured by V_1(0), and the boundary conditions are V_{T+1}(S(T+1)) = 0 for all values of S(T+1), as before.

Here, the diver doesn’t really need to solve an MDP. He or she can simply select the dive that yields the most points (on average) in each category, since the choice does not depend on the number of points accumulated thus far. Let EP(d, t) denote the expected points from dive d at stage t. Then the policy depends only on the stage t rather than on the full state variable S(t), yielding Bellman equations of

V_t = max_{d in D_t} ( EP(d, t) + V_{t+1} ).

The expected number of points is captured by V_1, and the boundary condition is V_{T+1} = 0. Note that we lose the expectation here. The optimal solution isn’t rocket science: select the dive with the largest EP(d, t) at each stage t.
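A minimal sketch of this greedy rule, again with made-up dive names and distributions:

```python
def expected_points(dist):
    """Mean of a discrete points distribution {points: probability}."""
    return sum(pts * p for pts, p in dist.items())

def greedy_plan(dives):
    """Pick the dive with the largest EP(d, t) in each category."""
    plan, total = [], 0.0
    for stage in dives:  # one category per stage t
        name = max(stage, key=lambda d: expected_points(stage[d]))
        plan.append(name)
        total += expected_points(stage[name])
    return plan, total   # V_1 = sum of the stagewise maxima

# Hypothetical dives: "safe" always scores 7; "risky" scores 10 or 3
# with equal probability (so its mean is only 6.5).
DIVES = [
    {"safe": {7: 1.0}, "risky": {10: 0.5, 3: 0.5}},
    {"safe": {7: 1.0}, "risky": {10: 0.5, 3: 0.5}},
]
```

With these numbers the greedy plan picks "safe" at both stages, for an expected total of 14 points.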

Why these models are different

At face value, it may not be obvious why these models could yield different solutions: both are used to identify dives that yield many points. The second model maximizes the expected number of points, whereas the first model maximizes the probability that the point total clears a threshold (which can be thought of as pushing as much of the distribution’s tail as possible past the fixed threshold M). The first model could lead to “riskier” dive strategies for a diver who is not a favorite to win: the diver has a chance of being on the podium but could also go down in flames. For the gold medal favorite, the first model might instead lead to a conservative strategy that weeds out any dive with a chance of a disastrous result. The second model leads to more conservative dive selections that yield the most points on average but that, for an underdog, would almost certainly not lead to a medal.
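A toy numeric illustration of the gap, using the same hypothetical dives as above: a "safe" dive that always scores 7 and a "risky" dive that scores 10 or 3 with equal probability. Over two dives against a high threshold, the safe plan wins on expected points but has zero medal probability, while the risky plan gives up one expected point for a 25% medal chance.

```python
from itertools import product

def plan_stats(dists, M):
    """Expected total and P(total >= M) for a fixed sequence of dives."""
    exp_total = sum(sum(pts * p for pts, p in d.items()) for d in dists)
    p_medal = 0.0
    for outcome in product(*[list(d.items()) for d in dists]):
        prob, total = 1.0, 0
        for pts, p in outcome:
            prob *= p
            total += pts
        if total >= M:
            p_medal += prob
    return exp_total, p_medal

safe = {7: 1.0}
risky = {10: 0.5, 3: 0.5}

print(plan_stats([safe, safe], M=19))    # expected total 14, medal probability 0
print(plan_stats([risky, risky], M=19))  # expected total 13, medal probability 0.25
```

Model 2 prefers the safe plan (14 > 13 expected points); Model 1 with threshold M = 19 prefers the risky plan, since it is the only one with any chance of medaling.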

Lift the first assumption: divers must make their selections ahead of time

If we lift the first assumption, the choice of dives cannot depend on the point total accumulated thus far: the diver must identify the best set of dives in advance, regardless of what happens with the earlier dives.

The answer to the second model is still obvious: since the optimal policy never changes dives on the fly even when it has that choice, committing to the dive list in advance costs nothing.

The first model, however, changes: for each fixed set of dives, one can examine the distribution of the total points (the convolution of the individual dives’ independent point distributions) and choose the set that places the most probability mass at or above the threshold M.
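One way to sketch this: enumerate each fixed choice of one dive per category, convolve the independent point distributions into the distribution of the total, and keep the plan that maximizes P(total >= M). The dives and threshold are again hypothetical.

```python
from itertools import product

def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent discrete point totals."""
    out = {}
    for a, pa in dist_a.items():
        for b, pb in dist_b.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out

def best_fixed_plan(dives, M):
    """Best dive-per-category plan chosen entirely in advance."""
    best_plan, best_prob = None, -1.0
    for plan in product(*[list(stage) for stage in dives]):
        total = {0: 1.0}
        for t, name in enumerate(plan):
            total = convolve(total, dives[t][name])
        p_medal = sum(p for pts, p in total.items() if pts >= M)
        if p_medal > best_prob:
            best_plan, best_prob = plan, p_medal
    return best_plan, best_prob

DIVES = [
    {"safe": {7: 1.0}, "risky": {10: 0.5, 3: 0.5}},
    {"safe": {7: 1.0}, "risky": {10: 0.5, 3: 0.5}},
]
```

Brute-force enumeration is fine for a handful of categories with a few dives each; the convolution is what makes the independence assumption do the work.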