Good things come to agents who wait

Or: Understanding Reinforcement Learning

Intelligence is difficult to measure. If we watch a person, we’re often able to estimate their intelligence based on their behaviour. Someone who does what is “right” in a difficult situation at first appears intelligent. But what is the “right” thing? If we ask pre-school children if they would like to choose something with a small reward now or something with a big reward later, in most cases they will choose the small reward. Young people first need to learn that deferring gratification and waiting can be better, depending on how long you wait and what reward beckons.

Artificial intelligence, then, should also be able to understand such connections over longer periods of time.

A robot’s actions can have far-reaching consequences in the real world, and we can often only tell after some time has elapsed whether a past action was “right” or “wrong.” In AI, reinforcement learning (RL) is responsible for exactly these sequential problems, that is, for tasks with a time dimension in which the order of events plays a role.

As a result, in reinforcement learning we often talk about an agent that moves through time. In AI, this can be a car in a racing game or, on a more abstract level, a component that controls the throttle of a helicopter engine to perform a loop. Of course, time can also pass step by step, as in Backgammon and Go.

An agent’s view of its surroundings is encapsulated in so-called states, which are a coarser or finer representation of the relevant aspects of the world around it. Let’s imagine a robot in a labyrinth. The state it finds itself in at each moment in time could correspond to its position in the world. The agent can only leave its current state via actions and move on to a new state, or in this example take a step forward to arrive at another place. There are often several possible actions in each state, requiring the agent to solve decision-making problems.
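The labyrinth picture can be made concrete in a few lines of code. Here is a deliberately minimal sketch, assuming a hypothetical one-dimensional corridor where the state is simply the agent’s position and two actions move it left or right:

```python
# Hypothetical corridor labyrinth: the state is a position 0..4,
# with the exit at position 0. This toy setup is invented for illustration.
N_POSITIONS = 5

def step(state, action):
    """Apply an action ("left" or "right") and return the new state."""
    if action == "left":
        return max(state - 1, 0)              # can't leave the corridor
    return min(state + 1, N_POSITIONS - 1)

print(step(2, "left"))   # -> 1
```

In this picture, the agent’s decision-making problem is simply which of the two actions to pick in each position.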

Particularly for complex problems, it is often not a trivial matter to define a sufficiently precise model of the world. In addition, a further step is necessary when formulating the decision-making problem: the next state should depend only on the current state and the current action, but not on the previous state or the one before that (the Markov property). That means the path an agent has taken to reach its current position in a labyrinth is irrelevant; the decision which action will take it towards the exit remains the same.

During the actual learning process, the agent is not told which actions are right and which are wrong. It can only gather experience, good and bad, from the actions it selects. This experience comes in the form of a reward: a quantitative evaluation of the last action. The agent in the labyrinth would, for example, obtain the greatest reward for the action with which it escapes from the labyrinth. Actions that don’t lead directly to the target often receive small punishments to give the agent an incentive to find the shortest possible way (i.e. to take the fewest possible steps on the way to the target and thus collect as few penalties as possible).

An agent’s central task is thus to select the optimum action in any specific situation: the one action that maximises the sum of all future rewards (the total reward). There is only one problem. Without any experience, we don’t know which of the possible actions is the best.

The history of RL has seen various approaches to how past experience should influence future behaviour, including methods that complement each other. One of the most popular in recent years is also intuitively easy to grasp: Q-learning. It focuses on so-called Q-functions, or utility functions, which for a given state and a selected action state the expected value of the sum of all future rewards (the anticipated total reward).

If the agent is in state s and action a1 promises a Q value of 10 while action a2 only results in 5, then the agent will logically decide to take the first action.

For simple problems, this Q-function can be shown in a table: each row corresponds to a state and the columns are the possible actions. The fields of the table then show the anticipated value of the future total reward. Since the future can be indefinitely long, calculating the total future reward is often difficult. A decreasing weighting of the terms in the total reward is therefore often used to define a future horizon beyond which the agent can’t see (discounting). To use the labyrinth example again: if the exit from the labyrinth is located to the left of the agent, we would look for the row in the table which corresponds to our current position and hope to find a large value in the column “move left.”
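To make the table concrete, here is a minimal sketch for the corridor-labyrinth picture: four positions as states, two actions as columns. All Q values are invented for illustration; in practice they would be learned.

```python
# Hypothetical Q table: 4 states (positions) x 2 actions.
# Column 0 = "move left", column 1 = "move right". Values are invented.
q_table = [
    [10.0, 2.0],  # state 0: the exit lies to the left
    [8.0, 3.0],
    [6.0, 4.0],
    [5.0, 7.0],   # state 3: moving right promises more here
]

def best_action(state):
    """Return the index of the action with the highest Q value."""
    row = q_table[state]
    return row.index(max(row))

print(best_action(0))   # -> 0, i.e. "move left"
```

Looking up a row and taking the largest entry is exactly the “find the large value in the column” step described above.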

But how do we fill in this table during the learning process? Right at the start we initialise the table with random values or zeros. The agent then gains experience by deciding on an action at random in a given state. The hope is that with this strategy it will try out all actions in as many states as possible, and do so often enough.
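A standard way to balance such random exploration against using what has already been learned, not named in the text but common in practice, is an ε-greedy strategy: with a small probability ε the agent picks a random action, otherwise the best-known one. A minimal sketch:

```python
import random

def epsilon_greedy(q_row, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action,
    otherwise the action with the highest Q value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return q_row.index(max(q_row))
```

Setting ε = 1 gives the purely random exploration described above; lowering ε over time lets the agent gradually exploit its experience.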

Learning is just a simple update: if the agent in state sbefore performs a random action a and receives a reward r on arriving in the new state safter, then it can simply modify the entry Q(sbefore, a) in the Q table. The new value is computed from the old value and a correction that describes the difference from the expectation. To simplify the notation slightly, let Qold stand for Q(sbefore, a) and Qnew for the best value available in the next state, i.e. the anticipated future reward after the optimum action in the next step.

Qold ← Qold + α · (r + γ · Qnew – Qold)

Here α is a small learning rate that controls how far each step moves the old value, and γ is the discount factor from the discounting described above.

The sum of the direct reward r and the anticipated reward in the following state is, at best, exactly the value in the Q table that we are looking at right now. This can be broken down into a sum of rewards that the agent collects on its future path: the reward r in the equation would be the reward for the first step, and Qnew combines all the remaining future rewards. If this sum matches the Qold value in the table, the Q table is correct and no change is needed. At the start of the learning process, though, there will be a discrepancy and the entry in Q needs to be changed.

Of course, this step works increasingly well the more precise the table is. Even if the table only contains zeros at the beginning, it becomes more and more precise as time progresses, and in the end the agent knows exactly what to do. This method only works for very simple use cases, though, in particular for finite state and action spaces. More complicated problems quickly exceed the capabilities of such a table, and it needs to be replaced with a technique suited to a very large or even infinite number of states.

That’s where neural networks come in. They learn the promised future rewards for individual actions in various states, exactly what the table solved before. However, the network is able to generalise across similar states: it may be of no importance whether the next car in a race is one meter or 1.05 meters to the left, as a safe distance must be kept in both cases. It is then no longer important whether the agent has encountered each exact situation often enough during the learning process, but whether it has at least encountered a similar situation before.
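As a toy illustration of this idea (a deliberately minimal sketch, not any real system’s architecture), the table lookup can be replaced by a simple linear model that maps a state’s features to Q values, so that nearby states automatically receive similar estimates:

```python
# Minimal sketch: replace the table with a linear function of state features.
# weights[a][i] is the weight of feature i for action a; values are invented.

def q_values(features, weights):
    """Approximate the Q value of every action from the state's features."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

# Two features (e.g. distance to the next car, own speed), two actions.
weights = [[1.0, 0.0],   # action 0
           [0.0, 1.0]]   # action 1

# States with distance 1.00 m and 1.05 m get almost identical Q values:
print(q_values([1.00, 0.5], weights))   # -> [1.0, 0.5]
print(q_values([1.05, 0.5], weights))   # -> [1.05, 0.5]
```

A neural network plays the same role as this linear function, just with far more capacity to represent complex situations.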

Using Backgammon as an example, Gerald Tesauro demonstrated at the start of the 1990s that a neural network approximating the value function makes it possible to play at the same level as a human champion.

Almost a quarter of a century later, in 2016, the Google-owned company DeepMind made headlines with its victory over the 18-time Go world champion Lee Sedol. Here too, function approximation using neural networks was a central feature, this time combining the policy and value function (more on this in another article). In both cases, the state that was fed into the neural network was pre-processed manually by humans to a certain degree.

DeepMind used its own RL system in a different setting to prove that it can play classic Atari games such as Space Invaders or Breakout with superhuman performance. As the state, the agent receives the raw image information that a human player would see, and uses it to learn an approximation of the Q-function, once again with neural networks. Without human adjustment, the agent was able to learn how to master a whole range of games just from the displayed score.

There are many other variants of reinforcement learning that we can’t discuss here for reasons of space, but suffice it to say:

Learning in RL always involves a lot of trial and error.

Which means solving complex problems can take a very long time. As with other trends in machine learning, modern RL shines with extraordinary performance thanks to increasing computing power. We still mostly pre-train in a simulation, though, before letting a system learn in the real world. And that approach has real-world consequences. Take the helicopter learning how to perform loops: had it crashed, it would not only have gathered a lot of negative reward along the way but also caused problems for the engineers.

Humans have the distinct advantage of a built-in simulator: our imagination. If we are faced with the choice of receiving one euro today or holding off for a bar of gold tomorrow, we don’t need to have had a similar experience before we’re able to make the “right,” intelligent decision.