Saturday, June 06, 2009

I have been trying to more thoroughly understand concepts related to reinforcement learning; specifically temporal difference learning. I reread portions of the book by Barto and Sutton. One of the examples given in the book as a toy problem is that of the n-armed bandit. An interesting variant of this problem occurred to me – What if the rewards are not dispensed immediately, but rather after some fixed (or perhaps random) delay?

Imagine walking into a casino and being given one hour to play a large group of slot machines. However, unlike regular slot machines, these ones will produce a fixed amount of coins per pull. The trick is to figure out which slot machine gives you the most reward, and then exploit it. Sounds simple, right? There are two caveats: First, the machines will only give you the coins after some unknown delay. Second, when you receive the coins, you will not know the identity of the machine that produced the reward.

I wrote a fairly simple program in java to do this experiment by making use of eligibility traces, and found that it worked far better than I had expected was possible. Given a large number of possible actions (say n = 30), the system was able to properly assign credit to actions such that for all actions in A, if the real reward R(An) < R(Am) then the expectation of reward E(An) < E(Am) as well - so the actions would be ranked correctly in terms of value. This ranking could be learned perfectly, even when the delay between action and reward was 100 time-steps or more. Granted, a large number of observations are required (in the millions), but it is still impresses me that such a thing can be done at all.

In order to understand how and why this worked, I attempted to characterize the result mathematically, and was able to do so in a way that very closely matched the empirical results:

This rather simple equation appears to nearly perfectly describe the observed results in the limit of a great number of observations.

The main technique used is treating the “volume” of eligibility as the sum of an infinite geometric series, which is where the 1 / (1 – r) terms come from. Basically, we are assuming a randomized action policy, and therefore, an average over all possible rewards weighted by their relative likelihoods. These rewards will be (incorrectly) applied to our “causal” action, weighted by the eligibility. The one difference is the time-step at which the “actual” reward is emitted; this will be weighted by the eligibility at that delay, which is represented by the rd terms.