Specifically, for Deep Deterministic Policy Gradient (DDPG), it is recommended to use a replay buffer to speed up learning.

What if the reward is only given at a terminal state? Or what if most experiences give no immediate reward, and the reward at the terminal state is what really matters?

Because in that case, if a randomly sampled minibatch does not include the terminal state, it would most likely contain many transitions with zero immediate reward. Would the DDPG algorithm with a replay buffer still work in such a situation?

1 Answer

Because in that case, if a randomly sampled minibatch does not include the terminal state, it would most likely contain many transitions with zero immediate reward. Would the DDPG algorithm with a replay buffer still work in such a situation?

Yes, it will work just fine. The point of the replay buffer is not to find non-zero rewards; it stores observed state transitions, and these are critical to resolving the credit assignment problem. Every time the algorithm processes a transition with a zero reward, it also processes the link between state, action and next state. This link between states is critical to picking correct actions when rewards are sparse.
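To make that concrete, here is a minimal replay buffer sketch in Python (the class name and capacity are illustrative, not from the question). Whole transitions are stored, including the zero-reward ones, because the state-to-state links are what the critic learns from:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal illustrative replay buffer (not the asker's code)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        # Every transition is stored, regardless of whether the reward is zero.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch; most sampled transitions may well have reward == 0.
        return random.sample(self.buffer, batch_size)
```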

If all the experience so far contains only zero rewards, then of course there is nothing useful to learn yet (the agent may as well take random actions as far as it knows). Environments with very sparse rewards are also harder to learn in general.

However, there is no problem in general if a minibatch contains only samples with zero reward. The agent can still learn just fine in that case, because the learning targets are not based only on the reward from the single step, but also on predictions of future reward.
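As a sketch of why that works, here is how a DDPG critic target is typically built from a sampled minibatch (PyTorch-style; `target_actor`, `target_critic` and the tensor shapes are assumptions for illustration, not code from the answer). Even when every entry of `rewards` is zero, the bootstrapped `next_q` term still supplies a learning signal:

```python
import torch

def critic_targets(batch, target_actor, target_critic, gamma=0.99):
    # batch tensors are assumed to have shape (batch_size, ...) with
    # rewards and dones of shape (batch_size, 1).
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        next_actions = target_actor(next_states)
        next_q = target_critic(next_states, next_actions)
        # y = r + gamma * (1 - done) * Q'(s', mu'(s'))
        # Even with r == 0, the target is non-trivial via next_q.
        targets = rewards + gamma * (1.0 - dones) * next_q
    return targets
```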

Biased sampling towards non-zero rewards might help in some cases, but a more robust related approach is to bias towards transitions where the next state's action values have recently changed a lot - that is the premise of prioritised sweeping.
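For illustration only, a rough sketch of priority-based sampling in that spirit (all names and the `alpha` exponent are hypothetical; this is closer to prioritised experience replay than to classic model-based prioritised sweeping):

```python
import numpy as np

def sample_by_priority(priorities, batch_size, alpha=0.6):
    # priorities: one value per stored transition, e.g. the magnitude of the
    # recent TD error or change in the next state's action values.
    # alpha controls how strongly sampling is skewed towards high priorities.
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    p = p / p.sum()
    # Returns indices into the replay buffer, sampled proportionally to p.
    return np.random.choice(len(priorities), size=batch_size, p=p)
```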

But would it affect overall training time in practice? Because intuitively speaking, I would assume it would...
– user355843, Feb 16 at 12:36

@user355843: Yes it might, but focusing on just the last parts of a trajectory regardless of progress is probably too naive to use in general. Prioritised sweeping takes the idea further and is more general.
– Neil Slater, Feb 16 at 13:18

What has your personal experience been with adaptive learning rate methods like Adam, RMSprop or AdaDelta in RL with experience replay?
– user355843, Feb 22 at 18:01

@user355843: I've tended to use Adam with learning rate 0.001 and defaults on other params. Works fine for me. When comparing optimisers, bear in mind that the different speeds of convergence interact with the bootstrap bias, so it is not a fair comparison to just change the optimiser. As usual with hyper-parameters, it is not isolated from other choices, and you have to adjust other things.
– Neil Slater, Feb 22 at 19:13

So at the end of the day, I just have to try and see? It seems like RL is very difficult in practice because there are so many things you have to optimize.
– user355843, Feb 23 at 5:03