Reinforcement Learning

- After many random walks, taking actions and receiving the reward associated with each, the agent becomes more likely to pick the more rewarding actions.

This trial-and-error learning of action values is the idea behind Q-learning.
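This can be sketched with a toy tabular Q-learning loop. The corridor environment, the reward of 1 at the goal, and the learning parameters below are all hypothetical choices for illustration, not something from these notes.

```python
import random

random.seed(0)

# Hypothetical 1-D corridor: states 0..4, the agent moves left (-1) or
# right (+1), and only reaching state 4 yields a reward.
N_STATES = 5
ACTIONS = [-1, +1]
alpha, gamma = 0.5, 0.9                     # learning rate, discount factor
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        a = random.choice(ACTIONS)          # explore by random walking
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# After training, +1 (right) has the higher estimated value in every
# non-terminal state, so a greedy agent now walks straight to the goal.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})
# → {0: 1, 1: 1, 2: 1, 3: 1}
```

Because transitions and rewards here are deterministic, the Q-values converge to their exact discounted targets, which is why the learned preference is stable.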

Markov Decision Process:

The mathematical framework for defining a solution to a reinforcement learning scenario is called a Markov Decision Process (MDP). It can be described by:

Set of states, S

Set of actions, A

Reward function, R

Policy, π

Value, V

We take actions (A) to transition from our start state to our end state (S). In return, we get a reward (R) for each action we take. An action can lead to a positive or a negative reward.

The sequence of actions we take defines our policy (π), and the rewards we collect in return define our value (V). Our task is to maximize the reward by choosing the correct policy, i.e. to maximize

E[r_t | π, s_t]

for all possible states S at time t.

Value is the total cumulative reward you collect when you follow a policy.
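As a minimal sketch of this definition, assuming a made-up three-state chain and a fixed policy (all state names, rewards, and the discount factor here are hypothetical):

```python
# Value = (discounted) cumulative reward from following a policy.
rewards = {("A", "right"): 0, ("B", "right"): 0, ("C", "right"): 10}
policy = {"A": "right", "B": "right", "C": "right"}     # a fixed policy pi
transitions = {("A", "right"): "B", ("B", "right"): "C", ("C", "right"): "end"}

def value(state, gamma=0.9):
    """Total discounted reward from following `policy` starting at `state`."""
    total, discount = 0.0, 1.0
    while state != "end":
        action = policy[state]
        total += discount * rewards[(state, action)]
        discount *= gamma
        state = transitions[(state, action)]
    return total

print(value("A"))   # ≈ 8.1, i.e. 0 + 0.9·0 + 0.9²·10
```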

A common strategy for this is epsilon greedy. A purely greedy approach would work like this: you take the choices that look best first and see which gets you to the destination, and once you (the salesman) have found a route from place A to place F, you would always choose that same policy again. Epsilon greedy adds exploration on top of this: with a small probability ε you take a random action instead of the best-known one, so you occasionally discover policies that are better than the one you are currently exploiting.
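A minimal sketch of epsilon-greedy action selection, assuming some made-up value estimates for two actions:

```python
import random

random.seed(0)

# With probability epsilon, explore a random action; otherwise exploit
# the action with the highest estimated value so far.
def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

q = {"left": 0.2, "right": 0.7}                # hypothetical value estimates
counts = {"left": 0, "right": 0}
for _ in range(1000):
    counts[epsilon_greedy(q)] += 1

# "right" gets exploited most of the time, but "left" is still tried
# occasionally, so a better option would eventually be discovered.
print(counts)
```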

Major categories:

- Policy based, where our focus is to find the optimal policy
- Value based, where our focus is to find the optimal value, i.e. the maximum cumulative reward
- Action based, where our focus is on what optimal actions to take at each step