Train a Neural Network to Play Black Jack with Q Learning

Q Learning is a standard algorithm used in Reinforcement Learning. The idea is to learn a utility function Q(s, a) which, given a state and an action, outputs the expected reward of performing that action in that state. Today, we will develop and extend this idea in order to train a Machine Learning model to approximate the Q function. The full code can be found on GitHub.

Say we would like to learn the optimal policy for a game. This means, at any state s ∈ S (where S is the collection of all possible states of the game), we want to know the optimal action out of a set of actions A. The game is typically called the “environment”, whereas the “agent” is our learner. The agent must be able to accept states from the environment and select an action, which is then input back into the environment. If we would like our agent to learn, the environment must also be able to provide a reward for each action. Any Reinforcement Learning algorithm follows the same basic steps:

Agent is given a state s.

Agent chooses an action a ∈ A according to its current policy.

Input a into the environment, receive a reward r, and move to a new state s′.

Update the agent with the information (s, a, r, s′).

Repeat
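The steps above can be made concrete with a toy example. The `CoinFlipEnv` and `RandomAgent` below are stand-ins for illustration only, not part of the Black Jack code developed later:

```python
import random

class CoinFlipEnv:
    """Toy environment: guess a coin flip, reward +1 if right, -1 if wrong."""
    def reset(self):
        self._flip = random.choice(["heads", "tails"])
        return "start"               # step 1: the state given to the agent
    def step(self, action):
        reward = 1 if action == self._flip else -1
        return None, reward, True    # new state, reward, episode done

class RandomAgent:
    def get_action(self, state):
        return random.choice(["heads", "tails"])   # step 2: choose an action
    def update(self, state, action, reward, new_state):
        pass                         # step 4: a real learner adjusts itself here

env, agent = CoinFlipEnv(), RandomAgent()
state, done = env.reset(), False
while not done:                      # step 5: repeat until the episode ends
    action = agent.get_action(state)
    new_state, reward, done = env.step(action)     # step 3
    agent.update(state, action, reward, new_state)
    state = new_state
```

A learning agent would differ from `RandomAgent` only in its `get_action` and `update` methods, which is exactly what the rest of the article builds.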

One of the standard Reinforcement Learning algorithms is the Q Learner, which aims to find the Q function iteratively. This algorithm is not tractable for large state spaces, and due to this limitation, the idea arose to approximate Q with a Neural Network. This technique has been used successfully to learn many types of games, including Atari games, after defining a reward function (see this paper).

In what follows, we will discuss the differences between the standard Q Learning algorithm and the Neural Network approach to reinforcement learning. We will then build a basic Black Jack implementation and apply both a Neural Network and the standard Q Learning algorithm to it to compare the outcomes.

The Basic Algorithm

The learning algorithm is basically identical whether you use the standard Q Learner or a neural network. We are going to outline an ε-greedy strategy for learning. Let Q represent our (approximate) Q function, which is a function from S × A to ℝ.

Given a state s, find the optimal action a* = argmax_{a ∈ A} Q(s, a).

With probability 1 − ε, perform a*; otherwise perform an action chosen at random from A. Call the performed action a.

Receive a reward r. Update Q with the triplet of information (s, a, r).

Step 3 is where we update our learner, either by an online update to a Neural Network or an iterative step in the Q Learning algorithm. The choice of ε allows us to explore the environment rather than only following the optimal policy we have learned up to that moment. When we want to see how well our learner plays, we should set ε to 0.

A Basic Q Learner

Our basic Q Learner implements the typical Q Learning algorithm as defined here. At the start, we initialize an empty dictionary where we will store, for each state, the current iteration's reward value for each action taken from that state.

```python
import numpy as np

# Q Learner will inherit a Player class from our Black Jack game
# We will get back to that later
class Learner(Player):

    def __init__(self):
        super().__init__()
        self._Q = {}
        self._epsilon = 0.9  # probability of taking the greedy action (assumed value)
```

Assume we are at state s; our agent must decide what action to perform. We can do this via

```python
def get_action(self, state):
    # Notice here epsilon is the same as 1-epsilon from our previous discussion
    if state in self._Q and np.random.uniform(0, 1) < self._epsilon:
        # Exploit: choose the action with the highest stored reward
        action = max(self._Q[state], key=self._Q[state].get)
    else:
        # Explore: choose an action at random
        action = np.random.choice([Constants.hit, Constants.stay])
    # Initialize the reward value for any state/action pair we haven't seen
    if state not in self._Q:
        self._Q[state] = {}
    if action not in self._Q[state]:
        self._Q[state][action] = 0
    self._last_state = state
    self._last_action = action
    return action
```

Now, we perform action a, receiving reward r and moving to a new state s′. Then, we may update the approximate Q function at time t, denoted Q_t, with a value update at (s, a) via

Q_{t+1}(s, a) = (1 − α) · Q_t(s, a) + α · (r + γ · max_{a′} Q_t(s′, a′))

Here, α is called the learning rate and γ is the discounting factor applied to future rewards.
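In code, this update might look like the following sketch. The fields `_alpha` and `_gamma` (and their values) are assumptions about how the `Learner` class stores the learning rate and discount factor:

```python
class Learner:
    # Sketch of the update step for the Q Learner; in the full code, Learner
    # also inherits from Player, and the _alpha / _gamma values are assumptions.
    def __init__(self, alpha=0.5, gamma=0.9):
        self._Q = {}
        self._alpha = alpha
        self._gamma = gamma
        self._last_state = None
        self._last_action = None

    def update(self, new_state, reward):
        if self._last_state is None:
            return
        old = self._Q[self._last_state][self._last_action]
        # Best discounted reward reachable from the new state (0 if unseen)
        future = max(self._Q[new_state].values()) if self._Q.get(new_state) else 0.0
        self._Q[self._last_state][self._last_action] = (
            (1 - self._alpha) * old + self._alpha * (reward + self._gamma * future)
        )
```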

Neural Network Algorithm

As is typical in Machine Learning problems, we want to train a model to “predict” the value of some function given data. This situation is no different: approximating the Q function is the same as training a model to predict the value of the reward function. So, to approximate our Q function, we must train the network on the rewards observed at each state.

Redefining the Q function

Usually, the Q function is defined as a function Q: S × A → ℝ, and the original Q Learner algorithm is written with this definition. When using a Neural Network, we can apply a trick to reduce training time: it is easier to model the function as Q: S → ℝ^|A|, where |A| is the number of elements in A, i.e. the output is a vector of rewards in which each entry corresponds to an action in A. This way, we are able to train a single Neural Network to approximate Q.

To determine an optimal policy, we input our state into the network and choose the action corresponding to the entry of the output reward vector with the largest predicted reward.

After receiving a reward r and the new state s′, we may update the Neural Network. This is done by training the model in an online fashion on a single new data point. The input is the current state s, and the target is the currently predicted reward vector with the entry for the performed action replaced by r + γ · max_{a′} Q(s′, a′), i.e. the reward we received plus a discounted estimate of future reward.
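This online update can be sketched with a plain linear model standing in for the article's deep network; the feature encoding, the function names `predict`/`choose_action`/`update`, and all hyperparameter values below are assumptions for illustration:

```python
import numpy as np

# A linear model stands in for the neural network here; the state is a
# feature vector (player hand value, dealer showing card) and the model
# outputs one expected reward per action: [hit, stay].
rng = np.random.default_rng(0)
n_actions, n_features = 2, 2
W = rng.normal(scale=0.1, size=(n_actions, n_features))
gamma, lr = 0.9, 0.001

def predict(state):
    return W @ state  # vector of predicted rewards, one entry per action

def choose_action(state):
    return int(np.argmax(predict(state)))  # greedy policy over the vector

def update(state, action, reward, new_state):
    global W
    # Target: the current prediction, with the performed action's entry
    # replaced by r + gamma * max_a' Q(s', a')
    target = predict(state).copy()
    target[action] = reward + gamma * np.max(predict(new_state))
    # One gradient step on squared error, i.e. an online fit to one point
    error = predict(state) - target
    W -= lr * np.outer(error, state)
    return target[action]

s = np.array([12.0, 8.0])       # player holds 12, dealer shows 8
s_next = np.array([18.0, 8.0])  # after hitting, the player holds 18
update(s, action=0, reward=0.0, new_state=s_next)
```

A real Deep Q Network replaces the matrix `W` with a multi-layer model and the gradient step with one call to the framework's training routine, but the target construction is the same.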

Black Jack implementation

To test our learners, we must build a (simple) implementation of Black Jack where it will output a reward. The goal is to have the highest hand value, but your hand cannot sum to more than 21 or you lose automatically. This logic is stored in two functions.
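Those two functions might look like the sketch below; the card encoding (aces entering the hand as 11) is an assumption, not taken from the original implementation:

```python
def get_hand_value(hand):
    """Sum a hand of card values, counting aces as 11 unless that busts."""
    value = sum(hand)
    n_aces = hand.count(11)
    while value > 21 and n_aces:
        value -= 10   # count an ace as 1 instead of 11
        n_aces -= 1
    return value

def determine_if_bust(hand):
    """A hand worth more than 21 loses automatically."""
    return get_hand_value(hand) > 21
```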

The player plays against the dealer. The dealer hits as long as its hand value is less than 15; otherwise, the dealer stays. This is seen in the get_action() function.

```python
# The dealer is an initialization of the player class
class Player:

    def __init__(self):
        self._hand = []
        self._original_showing_value = 0

    def get_action(self, state=None):
        if self.get_hand_value() < 15:
            return Constants.hit
        else:
            return Constants.stay
```

To save time, the main game loop plays both the player and the dealer simultaneously, and every iteration of the while loop corresponds to a single “hit/stay” for each player. However, the player does not have access to any information about the dealer besides the dealer's first showing card. In fact, we have chosen to train our learner on the state space consisting of tuples of the player's hand value and the dealer's showing value.

This simultaneous play slightly modifies the rules of the game. The loop breaks when both players stay or when either busts. The only new outcome is that the player could still be hitting when the dealer busts, ending the game in a way standard Black Jack does not allow. This is not very common, since the dealer stays at 15 and above, but it can happen. Since we're only trying to demonstrate the concept, not build a model to beat a casino, we'll let this slide.

```python
state = self.get_starting_state(p, p2)  # p is the player, p2 is the dealer

while True:
    # Determine hit/stay
    p1_action = p.get_action(state)
    p2_action = p2.get_action(state)

    # Apply the action if hit
    if p1_action == Constants.hit:
        p.hit(d)
    if p2_action == Constants.hit:
        p2.hit(d)

    # Check if anyone has busted
    if self.determine_if_bust(p):
        winner = Constants.player2
        break
    elif self.determine_if_bust(p2):
        winner = Constants.player1
        break

    # If both players stayed the round is over
    if p1_action == p2_action and p1_action == Constants.stay:
        break

    state = self.get_state(p, p1_action, p2)
    p.update(state, 0)  # Update the learner with a reward of 0 (No change)
```

Once we are out of this loop, it’s time to determine the winner and apply the correct reward to update our learner.
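That final step might look like the sketch below, assuming a +1 / 0 / −1 reward for a win, push, and loss respectively (the original reward values are not shown in this excerpt):

```python
def settle(player_value, dealer_value, player_bust, dealer_bust):
    """Return the final reward passed to the learner's last update."""
    if player_bust:
        return -1                 # busting loses automatically
    if dealer_bust or player_value > dealer_value:
        return 1
    if player_value == dealer_value:
        return 0                  # push: no reward either way
    return -1

# The learner then receives its final update, e.g. p.update(state, reward)
```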

At this point, we rinse and repeat to train either learner in an online fashion from its interactions with the game.

Outcomes

During training, the code outputs a running tally of the win percentage of the learner. Over time you will see this value improve as the learner gets better. Once training has completed, the learner continues to play in order to calculate a final win percentage. The length of each of these periods can be set in main().

```python
def main():
    # Set number of rounds
    num_learning_rounds = 20000
    number_of_test_rounds = 1000

    # Choose your learner
    # DQNLearner is the "Deep Q Network" neural network
    learner = DQNLearner()  # or Learner() for the Q Learner
    game = Game(num_learning_rounds, learner)

    # Training and Testing
    for k in range(0, num_learning_rounds + number_of_test_rounds):
        game.run()
```

At the end of a run, a CSV is output in the directory containing the reward for hit/stay and the optimal strategy for each state. After 20,000 training rounds, the Neural Net would hit anytime its hand value was 15 or lower, or when its hand value was 16 and the dealer was showing 8, 9 or 10. Otherwise, it would stay. This strategy resulted in a win percentage of ~47%. After the same number of training rounds, the Q Learner had a win percentage of only ~33%, and its erratic policy in the CSV suggests it would have benefitted from additional training rounds. In this case, the Neural Network version is able to learn a winning strategy far more quickly than the Q Learning algorithm alone.