This course aims to introduce the fundamental concepts of Reinforcement Learning (RL) and to develop use cases for applications of RL to option valuation, trading, and asset management.
By the end of this course, students will be able to:
- Use reinforcement learning to solve classical problems of Finance such as portfolio optimization, optimal trading, and option pricing and risk management.
- Practice on valuable examples, such as the famous Q-learning algorithm, applied to financial problems.
- Apply the knowledge acquired in the course to a simple model of market dynamics obtained using reinforcement learning, as the course project.
Prerequisites are the courses "Guided Tour of Machine Learning in Finance" and "Fundamentals of Machine Learning in Finance". Students are expected to know the lognormal process and how it can be simulated. Knowledge of option pricing is not assumed but is desirable.

Taught by

Igor Halperin

Transcription

Now we can put together everything that we have produced so far and derive a self-consistent system of equations for the G- and F-functions. In the last video, we obtained the explicit relation between the G-function and the F-function; I show this equation again here in the first line. This is a functional of the policy Pi, and we can maximize it with respect to Pi. This produces the expression shown in equation 38: the optimal policy is given by the prior policy Pi-zero, times the exponent of beta times the G-function, divided by a normalization factor Zt. This normalization factor is just the sum of the numerator over all possible values of At; therefore, the whole expression sums to one over all values of At, which means that Pi is a properly normalized distribution. If you are familiar with variational calculus, you can easily derive the second equation from the first one. If you are not, you can simply treat the variational derivative with respect to Pi as an ordinary derivative with respect to the symbol Pi; such formal differentiation also yields the second equation. Now, if we plug this optimal policy Pi back into the previous expression for the free energy F, we get the optimal free energy shown in equation 39. This equation means that the log of Zt is equal to beta times the F-function; therefore, we can replace Zt in the optimal policy expression by the exponent of this quantity and put the optimal policy Pi in the form shown in equation 40. The optimal policy now depends on both the G-function and the F-function, and in addition these two functions are related to each other, as we just saw. Now we can summarize and put everything together: we have three equations for the three functions G, F, and Pi that should be solved self-consistently.
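The relations just described (equations 38 and 39) can be illustrated with a small numerical sketch. This is not code from the lecture or the paper; the function name and the discretized action grid are my own illustrative choices, assuming G is tabulated over a finite set of actions:

```python
import numpy as np

def optimal_policy_and_free_energy(G, pi0, beta):
    """Illustrative sketch of Eqs. (38)-(39): given G-function values G[a]
    over a discretized action grid and a prior policy pi0[a], compute the
    optimal policy pi[a] = pi0[a] * exp(beta * G[a]) / Z_t and the free
    energy F = (1/beta) * log Z_t."""
    w = pi0 * np.exp(beta * G)   # numerator of Eq. (38)
    Z = w.sum()                  # normalization factor Z_t
    pi = w / Z                   # optimal policy; sums to one by construction
    F = np.log(Z) / beta         # Eq. (39): log Z_t = beta * F
    return pi, F
```

Note that for a flat G-function the policy reduces to the prior and the free energy vanishes, which is a quick sanity check on the normalization.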
This should be done at each step of the backward recursion, starting from t equal to capital T minus 1 and going all the way down to t equal to 0. In general, this is a rather involved system of equations. In the original paper by Tishby and co-workers, this system was solved in a discrete-space, tabulated setting. Here, however, we deal with a high-dimensional, continuous-state and continuous-action problem. Still, it turns out that for our simple portfolio model this system of equations can be solved relatively easily. Let me give you a brief sketch of how this is done. I will skip many details, but you can easily find them in the paper referenced for this week, along with explicit forms of all the coefficients and matrices that I am going to show you next. What is more important than the specific form of these coefficients is the general structure of the computational scheme that I am now going to sketch. The first observation is that if we substitute the expressions obtained before into the equation for x sub t plus 1, we get the dynamics equation shown in the first equation on this slide. It has two linear terms, At times Xt and At times Ut, but most importantly it has two quadratic terms proportional to the matrix of market impacts. Therefore, our model has non-linear, or more precisely quadratic, dynamics. It cannot be solved exactly in the presence of market impacts, but should instead be solved approximately, and this is done using linearization of the dynamics. Assume that we are given some deterministic reference trajectory, which we denote by upper bars, such as x bar and u bar. Given this trajectory, we have a set of pairs (xt bar, ut bar) for all values of t. Next, we define increments delta xt and delta ut as shown in equation 43. Now we can substitute these relations back into the dynamic equation for x t plus 1 and keep only the terms linear in delta x and delta u.
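The linearization around a reference trajectory can be sketched as follows. This is a generic numerical illustration, not the lecture's derivation: the paper derives the Jacobians in closed form, whereas here I approximate them by finite differences, and the function name and test dynamics are my own:

```python
import numpy as np

def linearize_dynamics(f, x_bar, u_bar, eps=1e-6):
    """Sketch of Eqs. (43)-(44): linearize nonlinear dynamics
    x_{t+1} = f(x_t, u_t) around a reference point (x_bar, u_bar),
    keeping only terms linear in the increments delta_x, delta_u.
    Jacobians are approximated by forward finite differences here,
    purely for illustration."""
    n, m = len(x_bar), len(u_bar)
    f0 = f(x_bar, u_bar)
    A = np.zeros((n, n))   # d f / d x at the reference point
    B = np.zeros((n, m))   # d f / d u at the reference point
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        A[:, i] = (f(x_bar + dx, u_bar) - f0) / eps
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x_bar, u_bar + du) - f0) / eps
    # linearized state equation: delta_x_{t+1} ~ A @ delta_x + B @ delta_u
    return f0, A, B
```

For dynamics with a small quadratic (market-impact-like) term, the Jacobian A picks up the derivative of that term evaluated on the reference trajectory, which is exactly what dropping the higher-order increments amounts to.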
We assume that the quadratic terms are small, as expected around the given trajectory. This gives us the linearized state equation, number 44. So if we express the dynamics in terms of the increments delta x and delta u, we get a linear equation for the increment delta x. Now we express everything in terms of increments. We assume that the F-function is a quadratic function of delta x, as shown in equation 45. It is parametrized by a matrix Dt, a vector Ht, and a scalar Ft, which can all depend on x bar. Now [inaudible]; therefore, we can use it to fix the terminal values of all these coefficients. For all other times, we use the backward recursion, going backwards in time starting from T minus 1, then T minus 2, and so on. The first thing we do for any such value of t is to compute the expectation of the next-period F-function. This can easily be done, and it produces the expression shown at the bottom of this slide. The important fact is that it is a quadratic function of the expected value of the next-period increment delta x, which we denote here as delta x t plus 1 hat. Now, if we use the previous equation, we can write this expression as a quadratic function of the current-time increments delta x and delta u, with coefficients F_aa, F_xx, F_ax, F_x, and F_u at time t, and a free term F sub t plus 1. Each of these coefficients can now be explicitly computed in terms of previously computed coefficients. At the next step, we introduce a similar parametrization of the G-function as a quadratic function of delta x and delta u, but this time the coefficients of this expansion are unknown and will be computed next. Lastly, we can also express the rewards in terms of the increments using their explicit form, which gives equation 47 shown here. Again, this expression is a quadratic function of the increments.
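The step of taking the expectation of a quadratic next-period F-function under linearized Gaussian dynamics can be sketched as below. The coefficient names mirror those on the slide, but their exact closed forms are given in the paper; what follows is a generic quadratic-form computation under my own sign and factor conventions (F parametrized as dx'D dx + H'dx + f, dynamics dx_{t+1} = A dx + B du + noise with covariance Sigma):

```python
import numpy as np

def expected_next_F(D, H, f, A, B, Sigma):
    """Illustrative sketch: with F_{t+1}(dx) = dx' D dx + H' dx + f and
    linearized dynamics dx_{t+1} = A dx + B du + eps, eps ~ N(0, Sigma),
    the expectation E[F_{t+1}] is again quadratic in (dx, du).
    Conventions here are my own; the paper fixes them precisely."""
    F_xx = A.T @ D @ A              # coefficient of dx' (.) dx
    F_uu = B.T @ D @ B              # coefficient of du' (.) du
    F_ux = B.T @ D @ A              # cross term du' (.) dx (appears twice)
    F_x  = A.T @ H                  # linear term in dx
    F_u  = B.T @ H                  # linear term in du
    F_0  = f + np.trace(D @ Sigma)  # free term: old free term + noise trace
    return F_xx, F_uu, F_ux, F_x, F_u, F_0
```

The key structural point, visible in the code, is that every new coefficient is an explicit algebraic combination of the previously computed coefficients D, H, f and the dynamics matrices, which is what makes the backward recursion tractable.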
Now everything is ready to set up a recursive scheme to compute both the G-function and the F-function in terms of their coefficients. What we do is take the Bellman equation for the G-function, shown again here in equation 48, plug in all the previous expressions for the G-function, the reward, and the F-function, and then equate the coefficients in front of like powers of delta x and delta u in the resulting expression. This produces the relations for the coefficients of the G-function shown here. Now, if rewards are observable, then all terms on the right-hand side of this equation are known at time t, because the coefficients of the F-terms on the right-hand side are known from time t plus 1. Therefore, these relations give us the coefficients of the G-function at time t. Once we have computed the G-function at time t, we can compute the F-function at time t. To do this, we use equation 39, which I repeat here in the top line. We plug the resulting expression for the G-function in here and compute the sum over all actions, which is in fact not a sum but an integral, because we work with continuous actions. What is very important here is that in our model this integral is Gaussian; therefore, it can be easily computed. When we compute this integral and simplify, we get equation 49 for the F-function, which now has the same form as before, but this time all coefficients are fixed as shown in the formulas below.
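The Gaussian integral that turns the G-function into the F-function can be sketched in closed form. This is an illustration under simplifying assumptions of my own (a flat prior Pi-zero and a G-function that is quadratic in the action, G(u) = c + q'u - 0.5 u'P u with P positive definite), not the exact expressions of the paper:

```python
import numpy as np

def free_energy_from_G(P, q, c, beta):
    """Illustrative sketch of the key step behind Eq. (49): with a flat
    prior and a quadratic G(u) = c + q'u - 0.5 u'P u (P positive definite),
    F = (1/beta) * log integral of exp(beta * G(u)) du is a Gaussian
    integral with a closed form."""
    m = len(q)
    P_inv = np.linalg.inv(P)
    log_Z = (beta * c
             + 0.5 * beta * q @ P_inv @ q        # completed-square term
             + 0.5 * m * np.log(2.0 * np.pi / beta)
             - 0.5 * np.linalg.slogdet(P)[1])    # -0.5 * log det(P)
    return log_Z / beta  # F = (1/beta) * log Z
```

Because the integrand is a Gaussian in the action, the resulting F is again quadratic in the state increment, which is why the scheme closes on itself at every time step.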