Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

A team game can have multiple Nash equilibria, only some of which are optimal. This captures the important properties of a general category of coordination games. Studying team games gives us an easy starting point without losing important generality.

Model the environment as a set of states S. A decision-maker (agent) drives the state transitions to maximize the sum of its discounted long-term payoffs.
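In standard notation, with discount factor $\gamma \in [0, 1)$ and payoff $r_t$ at step t, this objective can be written as:

$$\max_{\pi}\; \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; \pi \right]$$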

A coordination Markov game:

Combination of an MDP and coordination games: a set of self-interested agents choose a joint action $a \in A$ to determine the state transition so as to maximize their own profits. For example, team Markov games.

Relation between Markov games and repeated stage games:

A joint Q-function maps a state-joint action pair (s, a) to the tuple of discounted long-term reward sums the individual agents receive by taking joint action a at state s and then following a joint strategy $\pi$.

Q(s, ·) can be viewed as a stage game in which agent i receives a payoff $Q_i(s, a)$ (a component of the tuple $Q(s, a)$) when joint action a is taken by all agents at state s. We call such a game a state game.
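Written out (a standard formulation, assuming agent i receives reward $r_i(s_t, a_t)$ at each step), the joint Q-function component for agent i under joint strategy $\pi$ is:

$$Q_i^{\pi}(s, a) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_i(s_t, a_t) \;\middle|\; s_0 = s,\; a_0 = a,\; a_t \sim \pi(s_t) \text{ for } t \ge 1 \right]$$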

A Subgame Perfect Nash equilibrium (SPNE) of a coordination Markov game is composed of the Nash equilibria of a sequence of coordination state games.

Without knowing the game structure, each agent i tries to find an optimal individual strategy $\pi_i : S \to A_i$ that maximizes the sum of its discounted long-term payoffs.

Difficulties:

Two layers of learning (learning the game structure and learning the strategy) are interdependent in a general Markov game: on one hand, the strategy is determined by the Q-function; on the other hand, the Q-function is learned with respect to the joint strategy the agents take.

RL in team Markov games

Team Markov games simplify the learning problem: off-policy learning of the game structure, and learning of coordination over the individual state games.

In a team Markov game, the combination of individual agents’ optimal policies is an optimal Nash equilibrium of the game.

Each agent has a limited memory that holds the m most recent plays it has observed.

To choose actions, an agent i randomly draws k samples (without replacement) from its memory to build an empirical model of the others’ joint strategy.

For example, if a reduced joint action profile $a_{-i}$ (all individual actions but i’s) appears in the samples $K(a_{-i})$ times, agent i treats the probability of that profile as $K(a_{-i})/k$.

Agent i then chooses the action that best responds to this empirical distribution.
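A minimal Python sketch of this sampling-and-best-response step; the memory layout, the `payoff` function, and all names are illustrative assumptions, not the authors' implementation:

```python
import random
from collections import Counter

def adaptive_play_action(memory, k, my_actions, payoff):
    """One adaptive-play decision for agent i (a sketch).

    memory     : list of the m most recent observed joint actions, each stored
                 as a tuple (my_action, others_profile)
    k          : number of samples drawn without replacement (k <= m)
    my_actions : agent i's available actions
    payoff     : hypothetical function payoff(my_action, others_profile) -> float
    """
    # Draw k samples without replacement from the m-slot memory.
    samples = random.sample(memory, k)

    # Empirical model of the others' joint strategy: P(a_-i) = K(a_-i) / k.
    counts = Counter(others for _, others in samples)

    # Choose the action that best responds to this empirical distribution.
    def expected_payoff(a_i):
        return sum(c / k * payoff(a_i, others) for others, c in counts.items())

    return max(my_actions, key=expected_payoff)
```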

Previous work (Peyton Young) shows that adaptive play (AP) converges to a strict NE in any weakly acyclic game.

Biased adaptive play (BAP) is similar to AP, except that an agent biases its action selection when it detects that it is playing an NE in the biased set D.

Biased rules:

If agent i’s k samples all contain the same $a_{-i}$, and that $a_{-i}$ is part of at least one NE in D, the agent chooses its most recent best response to that strategy profile. For example, if B’s samples show that A keeps playing A0 and B’s most recent best response was B0, B will stick to this action.
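A sketch of how this biased rule could sit on top of adaptive play, under the same illustrative assumptions as the sketch above; `biased_set` and `recent_best_response` are hypothetical bookkeeping structures:

```python
import random
from collections import Counter

def bap_action(memory, k, my_actions, payoff, biased_set, recent_best_response):
    """Biased adaptive play step for agent i (a sketch).

    biased_set           : the NE in D, each stored as a (my_action, others_profile) pair
    recent_best_response : hypothetical map others_profile -> agent i's most
                           recent best response to that profile
    """
    samples = random.sample(memory, k)
    others_seen = [others for _, others in samples]

    # Bias rule: if every sample shows the same a_-i, and that a_-i is part of
    # some NE in D, replay the most recent best response to it.
    if len(set(others_seen)) == 1:
        others = others_seen[0]
        if any(ne_others == others for _, ne_others in biased_set):
            return recent_best_response[others]

    # Otherwise best-respond to the empirical distribution, as in plain AP.
    counts = Counter(others_seen)
    return max(my_actions,
               key=lambda a_i: sum(c * payoff(a_i, o) for o, c in counts.items()))
```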

Biased adaptive play guarantees convergence to an optimal NE for any virtual game (VG) constructed over a team game, provided the biased set contains all the optimal NE.

Use a slowly decreasing bound (called the ε-bound) to find all optimal NE. Specifically:

At a state s and time t, a joint action a is ε-optimal for the state game if $Q_t(s,a) + \epsilon_t \ge \max_{a'} Q_t(s,a')$.

A virtual game $VG_t$ is constructed over these ε-optimal joint actions.

If limtt=0 and t decreases slower than Q-function, VGt converges to VG.

Construction of -bound depends on the RL algorithm used to learn the game structure. Over a model-based reinforcement learning algorithm, we prove that the following bound meets the condition: Nb-0.5 for all 0<b<0.5, where N is the minimal number of samples made up to time t.

The definition of the other states is inductive: the successor state h' of a state h is obtained by deleting the oldest joint action in the tuple and adding the newly observed joint action at the leftmost side of the tuple.

Absorbing state: (a, a, …, a) is an individual absorbing state if $a \in D$ or a is a strict NE. All individual absorbing states are clustered into a unique absorbing state.

Transition:

The probability $p_{h,h'}$ that a state h transits to h' is positive if and only if the leftmost joint action $a = (a_1, a_2, \ldots, a_n)$ in h' is composed of individual actions $a_i$, each of which is a best response to some set of k samples drawn from h.

Since the distribution an agent takes to sample its memory is independent of time, the transition probability between any two states does not change with time. Therefore, the Markov chain is stationary.

Theorem 1. Let L(a) be the length of the shortest best-response path from joint action a to an NE in D, and let $L_G = \max_a L(a)$. If $m \ge k(L_G + 2)$, BAP over a WAGB converges to either an NE in D or a strict NE w.p.1.

Nonstationary Markov Chain Model:

With a GLIE learning policy, at any moment an agent has some probability of experimenting (exploring actions other than the estimated best response). This exploration probability diminishes with time. Therefore, we can model BAP with GLIE over a WAGB as a nonstationary Markov chain with transition matrix $P_t$. Let P be the transition matrix of the stationary Markov chain for BAP over the same WAGB. Clearly, GLIE guarantees that $P_t \to P$ as $t \to \infty$.
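A common GLIE schedule uses an exploration probability proportional to 1/t; a minimal sketch (the schedule and names are illustrative assumptions):

```python
import random

def glie_select(best_response, all_actions, t, c=1.0):
    """GLIE-style action selection (sketch): explore with probability c/t,
    which decays to zero while every action is still tried infinitely often."""
    explore_prob = min(1.0, c / max(t, 1))
    if random.random() < explore_prob:
        return random.choice(all_actions)   # experiment
    return best_response                    # exploit the estimated best response
```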

In the stationary Markov chain model, we have only one absorbing state (composed of several individual absorbing states). Theorem 1 says that such a Markov chain is ergodic, with only one stationary distribution, given $m \ge k(L_G + 2)$. With nonstationary Markov chain theory, we can obtain the following theorem:

Theorem 2. With $m \ge k(L_G + 2)$, BAP with GLIE converges to either an NE in D or a strict NE w.p.1.

In a team game, $L_G$ is no more than n (the number of agents). The figure on the slide illustrates this: each box represents an individual action of an agent, and a highlighted box represents an individual action contained in an NE. We see that n − n' agents can move the joint action to an NE by switching their individual actions one after the other; each switch is a best response given that the others stick to their individual actions.

Lemma 4. The VG of any team game is a WAGB w.r.t. the set of optimal NE, with $L_{VG} \le n$.

OAL only guarantees convergence to an optimal NE in self-play, that is, when all players are OAL agents. Can agents find optimal coordination when only some of them play OAL? Let’s consider the simplest case: two agents, one a JAL or IL player (Claus and Boutilier 98) and the other an OAL player.

A straightforward way to enforce the optimal coordination:

Two players, one of them is an “opinionated” player who leads the play.

Leader

Learner

If the other player is either a JAL or an IL player, convergence to an optimal NE is guaranteed.

What if the other player is also a leader agent? More importantly, how should the leader play if it does not know the type of the other player?

Old biased rule (restated): if agent i’s k samples all contain the same $a_{-i}$, and that $a_{-i}$ is part of at least one NE in D, the agent chooses its most recent best response to that strategy profile. For example, if B’s samples show that A keeps playing A0 and B’s most recent best response was B0, B will stick to this action.

New biased rules:

If agent i has multiple best-response actions w.r.t. its k samples, it chooses the one included in an optimal NE of the VG. If several such choices exist, it chooses the one it has played most recently.
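A sketch of this tie-breaking rule; `optimal_ne_actions` and `play_history` are hypothetical bookkeeping structures:

```python
import random

def new_biased_choice(best_responses, optimal_ne_actions, play_history):
    """New biased rule (sketch): break ties among best responses in favour of
    actions contained in an optimal NE of the VG, then by recency of play.

    best_responses     : list of agent i's best-response actions to its k samples
    optimal_ne_actions : set of i's individual actions appearing in some optimal NE
    play_history       : agent i's past actions, most recent last
    """
    if len(best_responses) == 1:
        return best_responses[0]

    # Prefer candidates that are part of an optimal NE, if any exist.
    preferred = [a for a in best_responses if a in optimal_ne_actions] or best_responses

    # Among the remaining candidates, pick the most recently played one;
    # randomize if none of them has been played yet.
    for a in reversed(play_history):
        if a in preferred:
            return a
    return random.choice(preferred)
```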

Difference between the old and the new rules:

The old rule biases action selection only when the others’ joint strategy is included in an optimal NE; otherwise it just randomizes over its best-response actions.

Find all the NE that are ε-dominated. For example, a strategy profile (a, b) is ε-dominated by (a', b') if $Q(a) < Q(a') - \epsilon$ and $Q(b) \le Q(b') + \epsilon$.

Construct a VG which contains all the NE that are not ε-dominated, setting the other values in the VG to zero (without loss of generality, suppose agents normalize their payoffs to values between zero and one).

With GLIE exploration, run BAP over the VG.
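A sketch of the ε-domination filter for the two-agent case, using the (reconstructed) domination condition above; which inequality applies to which agent's payoff is an assumption, and all names are illustrative:

```python
def filter_dominated_ne(ne_payoffs, epsilon):
    """Keep the NE that are not epsilon-dominated (sketch for two agents).

    ne_payoffs : dict mapping an NE profile (a, b) -> (payoff_1, payoff_2),
                 with payoffs normalized to [0, 1]
    Returns the surviving NE and their payoffs; in the VG, every other joint
    action is assigned payoff zero.
    """
    def dominated(ne, other):
        (p1, p2), (q1, q2) = ne_payoffs[ne], ne_payoffs[other]
        # (a, b) is epsilon-dominated by (a', b') if one payoff is worse by
        # more than epsilon while the other is at most epsilon better.
        return p1 < q1 - epsilon and p2 <= q2 + epsilon

    return {ne: pay for ne, pay in ne_payoffs.items()
            if not any(dominated(ne, other) for other in ne_payoffs if other != ne)}
```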

Learning of game structure

Observe the others’ payoffs and update the sample means of agents’ expected payoffs in the game matrix.

Compute an -bound in the same way as OAL.

Learning over the coordination stage games we discussed is conjectured to converge w.p.1 to an NE that is not Pareto dominated.

In general, it is difficult to eliminate sub-optimal NE without knowing the others’ payoffs. Let’s consider the simplest case: two learning agents with at least one common interest (a strategy profile that maximizes both agents’ payoffs).

For this game, agents can learn to play an optimal NE with a modified version of OAL (with new biased rules).

Biased rules: 1) Each agent randomizes its action selection whenever the payoff of its best-response actions is zero in the virtual game. 2) Each agent biases its action to its recent best response if all its k samples contain the same individual action of the other agent, more than m − k recorded joint actions have this property, and the agent has multiple best responses giving it payoff 1 w.r.t. its k samples. Otherwise, it randomly chooses a best-response action.

In this type of coordination stage game, the learning process is conjectured to converge to an optimal NE. The result can be extended to Markov games.