Reinforcement learning – Q Learning Java

Ever wondered how a robot could learn something by itself? The short answer is: reinforcement learning. This is the third major machine learning algorithms class, next to supervised learning and unsupervised learning. The basic idea behind reinforcement learning is that the software agent learns which action to take, based on a reward and penalty mechanism. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. Simply put: carrots and sticks 🙂

A basic algorithm which can demonstrate easily the reinforcement learning concepts is Q Learning. In the following sections we will explore the algorithm and its Java implementation. The code for the article is written using Java 8 and can be found here: https://github.com/technobium/q-learning-java

Q Learning algorithm

Let’s imagine a light seeking robot that must navigate from room to room in order to stop in the brightest room. See below the map of the rooms. We can consider it a maze through which our light-seeking robot will need to navigate. The rooms are numbered from 0 to 8 to be easily identified further in the Java implementation of the algorithm.

Each room can be considered a state (S), moving from one room to another is considered an action (A). For simplification the robot can move one room at a time, left, right up or down. The algorithm will need a reward matrix R and will output a quantity matrix Q.

The matrix R contains for each state a row with the following encodings:

-1 for the impossible transitions. For example from 0 to 0 we have -1. From 0 to 4 we cannot navigate directly, so we have here also -1. The same is for the navigation from 0 to 8.

0 for possible transitions. For example we can go from 0 to 1, so we will have 0 in the matrix row 0 column 1. Another example, we can navigate directly from 4 to 7, row 4 column 7 is 0.

100 is the reward for the final state (the brightest room). We can navigate from 5 to 8 so the reward matrix has in row 5 column 8 the value 100. The same situation for row 7 column 8.

-10 is the penalty (suppose we had a room with lower light intensity compared to the other rooms). As you can see, we can reach room 3 from room 0, 4 and 6. For that reason, column 0 row 3 has value -10 and the same for row 4 column 3 and row 6 column 3.

1

2

3

4

5

6

7

8

9

10

States:012345678

Possible states from0:[-10-1-10-1-1-1-1-1]

Possible states from1:[0-10-10-1-1-1-1]

Possible states from2:[-10-1-1-10-1-1-1]

Possible states from3:[0-1-1-10-10-1-1]

Possible states from4:[-10-1-10-10-10-1]

Possible states from5:[-1-10-10-1-1-1100]

Possible states from6:[-1-1-1-10-1-1-10-1]

Possible states from7:[-1-1-1-10-10-1100]

Possible states from8:[-1-1-1-1-1-1-1-1-1]

Now the Q matrix is filled in by the following algorithm:

For each training cycle:

Select a random initial state.

Do While the final state has not been reached.

Select one among all possible actions for the current state.

Using this possible action, consider going to the next state.

Get maximum Q value for this next state based on all possible actions.

Alpha is the learning rate. The learning rate determines to what extent the newly acquired information will override the old information. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information. In our implementation we will use a constant learning rate of 0.1.

Gamma is the discount factor and determines the importance of future rewards. A factor of 0 will make the agent “myopic” (or short-sighted) by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward. In our implementation we will use a discount factor of 0.9.

After building the Q matrix, we can use it as follows:

start from a given initial state

from that state select the action with the highest Q value

set the current state to the next state (reached by following the action with the highest Q)

repeat until the final state is reached

In the Java algorithm you will see that we print the optimal path starting from each state.

Java implementation of Q learning

The init() function loads the maze map from a .txt file and builds the R matrix, in concordance with the rules described in the previous section. The convention is that the final state is marked with the character ‘F’, the possible states are marked with the character ‘0’ (zero) and the states penalty states are marked with the character ‘X’ . Here is how the maze file should look like

1

2

3

000

X00

00F

In the constants sections you can configure the maze dimensions, the alpha and gamma parameters and also the reward and penalty values:

1

2

3

4

5

6

7

8

9

privatefinaldoublealpha=0.1;// Learning rate

privatefinaldoublegamma=0.9;// Eagerness - 0 looks in the near future, 1 looks in the distant future

privatefinalintmazeWidth=3;

privatefinalintmazeHeight=3;

privatefinalintstatesCount=mazeHeight*mazeWidth;

privatefinalintreward=100;

privatefinalintpenalty=-10;

The main function here is the one that builds the Q matrix. It follows the steps described in the previous section.

After the Q matrix is calculated we can use it and print the optimal route from each initial state:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

voidprintPolicy(){

System.out.println("\nPrint policy");

for(inti=0;i<statesCount;i++){

System.out.println("From state "+i+" goto state "+getPolicyFromState(i));

}

}

intgetPolicyFromState(intstate){

int[]actionsFromState=possibleActionsFromState(state);

doublemaxValue=Double.MIN_VALUE;

intpolicyGotoState=state;

// Pick to move to the state that has the maximum Q value

for(intnextState:actionsFromState){

doublevalue=Q[state][nextState];

if(value>maxValue){

maxValue=value;

policyGotoState=nextState;

}

}

returnpolicyGotoState;

}

For this particular example here are the results you should get:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

Qmatrix

From state0:0.0072.900.0062.900.000.000.000.000.00

From state1:65.610.0081.000.0081.000.000.000.000.00

From state2:0.0072.900.000.000.0090.000.000.000.00

From state3:65.610.000.000.0081.000.0081.000.000.00

From state4:0.0072.900.0062.900.0090.000.0090.000.00

From state5:0.000.0081.000.0081.000.000.000.00100.00

From state6:0.000.000.0062.900.000.000.0090.000.00

From state7:0.000.000.000.0081.000.0081.000.00100.00

From state8:0.000.000.000.000.000.000.000.000.00

Print policy

From state0gotostate1

From state1gotostate2

From state2gotostate5

From state3gotostate4

From state4gotostate5

From state5gotostate8

From state6gotostate7

From state7gotostate8

From state8gotostate8

As you can see, the software agent has learnt how to build his way to the final state, only by giving him rewards for the actions taken. You can experiment with the algorithm by changing the maze and the parameters.

A short note on the importance of this algorithm: a recent application of Q-learning to deep learning, by Google DeepMind, titled “deep reinforcement learning” or “deep Q-networks”, has been successful at playing some Atari 2600 games at expert human levels.