In this final course, you will put together your knowledge from Courses 1, 2 and 3 to implement a complete RL solution to a problem. This capstone will let you see how each component (problem formulation, algorithm selection, parameter selection, and representation design) fits together into a complete solution, and how to make appropriate choices when deploying RL in the real world. This project will require you to implement both the environment to simulate your problem and a control agent with neural network function approximation. In addition, you will conduct a scientific study of your learning system to develop your ability to assess the robustness of RL agents. To use RL in the real world, it is critical to (a) appropriately formalize the problem as an MDP, (b) select appropriate algorithms, (c) identify which choices in your implementation will have large impacts on performance, and (d) validate the expected behaviour of your algorithms. This capstone is valuable for anyone who is planning on using RL to solve real problems.
To be successful in this course, you will need to have completed Courses 1, 2, and 3 of this Specialization or the equivalent.
By the end of this course, you will be able to:
Complete an RL solution to a problem, starting from problem formulation, through appropriate algorithm selection and implementation, to an empirical study of the effectiveness of the solution.

Instructors:

Martha White

Adam White

Transcript

Imagine that you would like to land on the moon. This seems like a difficult task: we would need to know loads of physics and control theory, and we would have to anticipate countless things that could go wrong. But what if we frame this as a reinforcement learning problem? Our agent does not need to know the dynamics of the real world; it can learn simply through interaction. And your agent will do just that. The lunar module will interact with the world, unsure of the dynamics, but to get the module to do what you want, you will have to design the reward.

Of course, it might be a little dangerous and costly to learn how to land a lunar module on the moon through trial and error. Instead, we will train our agent in a simulator to find a robust and efficient landing policy. Then we could deploy the agent on the moon and allow it to continue learning, adjusting its value function and policy to account for the differences between our simulator and reality. This means we do not have to worry about safety or money while training in the simulator. But do keep in mind that we are using a very simple simulation here. It does not reflect all the complexities of outer space, but we cannot use a high-fidelity simulator since we need your experiments to run reasonably quickly. Even with a high-fidelity simulator, our agent might still have issues in deployment. There are many technical nuances required to get agents trained in simulators to transfer to the real world, but this topic is well outside the scope of this small project. Our aim here is to help you tackle a moderately interesting problem and gain experience converting word descriptions of problems into concrete solutions.

In this video, we will introduce you to the environment that you will be working with. You will understand the state and action spaces and the reward function that you need to implement. This is what the lunar lander environment looks like, and here are a few examples of an agent successfully landing. Our goal is to land the lunar module in the landing zone located between the two yellow flags. The landing zone is always in the same location, but the shape of the ground around it may change. We can fire the main thruster or either of the side thrusters to orient the module and slow its descent.

The state is composed of eight variables: the x-y position and velocity of the module, its angle and angular velocity with respect to the ground, and a sensor for each leg that determines if it is touching the ground.

Let's take a deeper look at the part of the environment that you will be implementing. The environment takes the action, which is passed to the dynamics function along with the current state to produce a next state. The next state and action are then passed to the reward function, which encodes the desired behavior. Finally, the environment emits the next state and the reward. There are four actions that the agent can take: fire the main thruster, fire the left thruster, fire the right thruster, or do nothing at all on this time step.
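The structure just described can be summarized in code. Below is a minimal sketch, not the actual notebook code: the class and method names (LunarLanderEnv, env_step, dynamics, reward) and the integer action encoding are assumptions made for illustration.

```python
import numpy as np

# Assumed action encoding, for illustration only.
DO_NOTHING, FIRE_MAIN, FIRE_LEFT, FIRE_RIGHT = 0, 1, 2, 3

class LunarLanderEnv:
    """Sketch of the loop described above: the action and current state
    go into the dynamics function, the resulting transition goes into
    the reward function, and the environment emits both results."""

    def __init__(self):
        # Eight state variables: x/y position, x/y velocity, angle,
        # angular velocity, and a ground-contact flag for each leg.
        self.state = np.zeros(8)

    def dynamics(self, state, action):
        # Placeholder: the real dynamics are provided for you.
        return state.copy()

    def reward(self, next_state, action):
        # Placeholder: this is the function you will implement.
        return 0.0

    def env_step(self, action):
        next_state = self.dynamics(self.state, action)
        r = self.reward(next_state, action)
        self.state = next_state
        return next_state, r
```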
You will need to implement the reward function for this environment. Fuel is expensive, and the main thruster uses a lot of it, so we want to discourage the agent from using the main thruster more than necessary. The side thrusters use less fuel, so it is less bad for the agent to use those frequently. We want to encourage the agent to move towards the goal, so it will lose some reward based on how far it moved away from the goal since the last time step. Let's also discourage the agent from learning to pilot the module into the surface in ways that might damage the equipment, and discourage it from flying off into outer space or into a distant crater, never to be seen again. The agent will be rewarded for each leg that it manages to get touching the ground. Finally, the agent will receive a large reward for successfully landing in the landing pad at an appropriate velocity. More details, like what counts as an appropriate velocity, will be made available in the notebooks.

And that's it. In this video, we introduced you to the lunar lander environment that you will be working with throughout this course. Your goal this week is to implement the reward function for this problem. Good luck building your environment.
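To make those reward components concrete, here is a minimal sketch of a reward function along the lines described above. It is only an illustration: every coefficient and threshold (fuel costs, leg-contact bonus, crash and landing amounts, speed bounds, pad width) is an assumption, and the notebooks specify the actual values and termination conditions.

```python
import numpy as np

# Same assumed action encoding as the earlier sketch.
DO_NOTHING, FIRE_MAIN, FIRE_LEFT, FIRE_RIGHT = 0, 1, 2, 3

# Hypothetical thresholds; the notebooks define the real ones.
LANDED_SPEED = 0.5    # assumed bound on an "appropriate velocity"
CRASH_SPEED = 2.0     # assumed speed at which contact damages equipment
PAD_HALF_WIDTH = 0.5  # assumed half-width of the landing pad

def reward(next_state, action, prev_distance):
    """One step of the reward described above. The landing pad is
    assumed to sit at the origin of the x-y coordinates."""
    x, y, vx, vy, angle, ang_vel, left_leg, right_leg = next_state
    speed = np.hypot(vx, vy)
    distance = np.hypot(x, y)
    r = 0.0

    # Fuel costs: the main thruster is expensive, side thrusters less so.
    if action == FIRE_MAIN:
        r -= 0.3
    elif action in (FIRE_LEFT, FIRE_RIGHT):
        r -= 0.03

    # Shaping: lose reward for moving away from the goal since the
    # last time step.
    r -= distance - prev_distance

    # Reward each leg that is touching the ground.
    r += 10.0 * (left_leg + right_leg)

    on_ground = left_leg > 0 and right_leg > 0
    if on_ground and speed > CRASH_SPEED:
        r -= 100.0  # slammed into the surface
    elif on_ground and speed < LANDED_SPEED and abs(x) < PAD_HALF_WIDTH:
        r += 100.0  # landed in the pad at an appropriate velocity

    return r
```

Note that a shaping term like `distance - prev_distance` rewards net progress towards the pad and penalizes drifting away in the same stroke, which covers both the "move towards the goal" and the "never to be seen again" cases described above.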