This course aims to introduce the fundamental concepts of Reinforcement Learning (RL) and to develop use cases for applications of RL in option valuation, trading, and asset management.
By the end of this course, students will be able to
- Use reinforcement learning to solve classical problems of Finance such as portfolio optimization, optimal trading, and option pricing and risk management.
- Practice on valuable examples, such as the famous Q-learning algorithm, applied to financial problems.
- Apply the knowledge acquired in the course to a simple model of market dynamics obtained using reinforcement learning, as the course project.
Prerequisites are the courses "Guided Tour of Machine Learning in Finance" and "Fundamentals of Machine Learning in Finance". Students are expected to know the lognormal process and how it can be simulated. Knowledge of option pricing is not assumed but is desirable.

Instructor:

Igor Halperin

Transcript

Now, when we have defined the one-step risk-adjusted and cost-adjusted reward for our problem, we can move on and define an objective function for our portfolio optimization problem. This objective function is shown in equation 23. As usual, it is given by an expectation of a sum of discounted one-step rewards over all future periods. Here, gamma is a discount factor, which is a number typically close to 1. We can take it to be a risk-free discount factor per time step in our portfolio problem. The most important thing in this formula is that the single-step reward shown here is quadratic in states x_t and actions a_t. The optimization problem amounts to maximizing this expression over actions a_t, from t = 0 to t = T - 1, under certain constraints. One constraint is that both components, a_t^(+) and a_t^(-), should be non-negative by the very meaning of our construction. In addition, we might have other constraints for a portfolio. We can use a general notation A_t for a set of constraints on the actions, which are called trading constraints. We can also have constraints from a set Z_t; these would be constraints on new holdings in different assets. Please note that the sum does not include the last step T, as the action at the last step is deterministic, as we saw in the last video. Now, we can see that the reward function that we have in our problem is a convex function of x_t and a_t. Therefore, if, in addition, the sets of constraints A_t and Z_t are convex, the whole problem is convex and can be solved by means of convex programming. This setting was suggested in the very nice tutorial-style paper called "Multi-Period Trading via Convex Optimization" by Stephen Boyd and co-authors. This paper shows that many practically interesting settings of portfolio optimization can be considered within such a convex optimization framework. I will refer you to this paper for more details and show you just a few examples of such convex constraints.
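As a minimal sketch of the objective in equation 23, the snippet below evaluates the discounted sum of one-step rewards for a given sequence of actions. The quadratic reward form and all parameters (mu, Sigma, lam, c) are illustrative placeholders, not the exact specification from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

T, n = 5, 3            # number of time steps, number of assets
gamma = 0.99           # discount factor, typically close to 1

# Hypothetical quadratic one-step reward: return minus risk and cost penalties
mu = np.array([0.05, 0.03, 0.04])     # expected returns (assumed)
Sigma = np.diag([0.04, 0.02, 0.03])   # covariance (risk) matrix (assumed)
lam, c = 1.0, 0.001                   # risk-aversion and cost parameters (assumed)

def one_step_reward(x, a):
    """Risk- and cost-adjusted reward, quadratic in state x and action a."""
    x_new = x + a
    return x_new @ mu - lam * (a @ Sigma @ a) - c * np.abs(a).sum()

def discounted_objective(x0, actions):
    """Sum of discounted one-step rewards for t = 0 .. T-1
    (the last step T is excluded, as its action is deterministic)."""
    x, total = x0.copy(), 0.0
    for t, a in enumerate(actions):
        total += gamma**t * one_step_reward(x, a)
        x = x + a   # deterministic state update for this sketch
    return total

x0 = np.ones(n)
actions = [0.01 * rng.standard_normal(n) for _ in range(T)]
print(discounted_objective(x0, actions))
```

In the full problem this objective would be maximized over the actions subject to the trading and holding constraints; here it is only evaluated for a fixed action sequence.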
For example, pension funds cannot short sell assets, so for them, a long-only constraint, shown here, would be appropriate. Another example would apply to institutions that can short sell. For such cases, there are often limits on new positions x_t + u_t, which give limit constraints. A constraint on the total amount of short positions can be expressed as a leverage constraint. This constraint is shown in the third line here. It involves a sum of absolute values of all positions; the amount by which this sum exceeds the total portfolio value is called the leverage. Finally, I would like to mention the minimum cash balance constraint, as shown in the last line here. All these are examples of convex constraints that keep the problem tractable even for high-dimensional portfolios. All right. So, let's summarize this framework, which we will call forward portfolio optimization. In this approach, our task is to find an optimal portfolio trading strategy. We are given an objective function, constraints, and initial and terminal conditions for the portfolio. This is a typical problem of dynamic optimization that can be solved using dynamic programming if a model is known. But the model in this case involves a forecast for the predictors z_t, and this is the hard part, because any errors in this forecast directly translate into errors in the optimal portfolio allocations. To illustrate how this all works, let's consider a special case of this formalism where the number of steps equals just 1, so that we talk about one-step portfolio optimization. In this special case, things get simpler, as expected. Instead of a dynamic trading strategy, we only need a strategy for a single step. In this case, people in finance usually speak of asset allocation. So, another name for asset allocation would be a single-step policy. Again, we need to optimize an objective function given constraints.
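The convex constraints mentioned above (long-only, limit, leverage, minimum cash) can be sketched as simple feasibility checks; the position vectors and bounds below are hypothetical examples:

```python
import numpy as np

def long_only(positions):
    """Long-only constraint: no short positions (e.g. pension funds)."""
    return bool(np.all(positions >= 0))

def within_limits(positions, limits):
    """Limit constraint: each new position x_t + u_t bounded in magnitude."""
    return bool(np.all(np.abs(positions) <= limits))

def within_leverage(positions, value, max_leverage):
    """Leverage constraint: the sum of absolute position values is capped
    at a multiple of the total portfolio value."""
    return bool(np.abs(positions).sum() <= max_leverage * value)

def min_cash(cash, floor):
    """Minimum cash balance constraint."""
    return cash >= floor

p = np.array([50.0, -20.0, 30.0])   # hypothetical positions, one short
print(long_only(p))                 # fails: a short position is present
print(within_leverage(p, value=70.0, max_leverage=2.0))
```

Each check is an intersection of half-spaces or a norm-ball condition, which is why these constraints preserve convexity of the overall problem.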
If we apply this one-step setting to the one-step portfolio optimization problem, we will obtain the celebrated Markowitz portfolio model of 1959. The Markowitz model is probably one of the most important and widely used models in all of finance. Practitioners using the Markowitz model know very well that one of the main challenges in applying it in practice is exactly the same problem that we already mentioned above: the model depends critically on the accuracy of our forecasts for future values of the predictors z_t. Now, there exists a very elegant way to resolve this issue by flipping the optimization problem on its head, and this is called inverse optimization. Inverse optimization works as follows: imagine we already know the optimal portfolio allocation. This means we already know a solution to the optimization problem. Now, we can ask, "What are the parameters of the objective function that produced such optimal allocations?" For the specific case of one-step Markowitz portfolio optimization, this means that we take a portfolio and invert the formulas for the Markowitz optimal portfolio, so that we get the forecasts for the predictors z_t that are implied by the observed optimal solution. And this was essentially the approach of the famous Black-Litterman model of 1992. Black and Litterman applied the Markowitz portfolio optimization problem to the market portfolio, for example the S&P 500 portfolio, and then inverted it. That gave them insights into how the market itself forecasts future values of the predictors z_t. The original Black-Litterman model was not formulated as an inverse optimization, but such a formulation was suggested in 2012 by Bertsimas and co-workers. These papers, and other researchers, showed how the Black-Litterman framework can be used to assess the value of private signals z_t that are known only to an investor but not to the rest of the market. Now, we can generalize the same inverse optimization approach to a dynamic and multi-period setting.
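A minimal sketch of the forward and inverse one-step problems, assuming the standard unconstrained mean-variance form where the optimal weights solve lam * Sigma @ w = mu. The risk aversion lam, covariance Sigma, and expected returns mu below are illustrative values, not from the lecture:

```python
import numpy as np

# Forward (Markowitz): maximize wᵀmu - (lam/2) wᵀ Sigma w  =>  w* = (1/lam) Sigma⁻¹ mu
# Inverse (Black-Litterman style): given an observed allocation w,
# recover the implied expected returns mu = lam * Sigma @ w.

lam = 3.0                                 # assumed risk aversion
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])          # assumed covariance matrix
mu = np.array([0.06, 0.08])               # "true" expected returns

w_star = np.linalg.solve(lam * Sigma, mu)  # forward step: optimal weights
mu_implied = lam * Sigma @ w_star          # inverse step: implied returns

print(w_star)
print(mu_implied)   # recovers mu exactly in this unconstrained setting
```

Applying the inverse step to the market portfolio, as Black and Litterman did, reads off the expected returns the market itself must be assuming.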
In this case, we are given not just a one-time optimal allocation but rather a sequence of such re-allocations; that is, we are given a history of actions. The task is to find the one-step rewards or, in a parametric setting, the parameters of the one-step rewards. The other task would be to find an investment policy that corresponds to the observed actions. Now, we can discuss two possible settings for such dynamic inverse optimization of the portfolio. In the first case, we deal with a proprietary portfolio and the actions of a particular trader or broker. In the second case, we look at the market portfolio. For this second case, such an approach generalizes the Black-Litterman analysis to a multi-period setting and can be used for the same purposes as the original Black-Litterman model. For the first case, the same approach can be used to build a model of a trader. Now, for both cases, it can be used in the same way as the Black-Litterman model, to assess the value of private signals z_t that are known only to an investor, but not to other participants in the market. But the difference is that now we analyze the impact of the predictors z_t over an extended time period, instead of the one-step Black-Litterman formulation. And this should be a good thing, because multi-step portfolio performance is the ultimate objective of portfolio managers. If they can observe the working of their signals over an extended period of time, this removes some of the noise in such estimations and may help to find a more stable estimate of the value of the signals. Now, at this point, you may notice that this problem of dynamic inverse optimization sounds very similar to the objective of reinforcement learning, or maybe inverse reinforcement learning. And this is indeed the case, which means that dynamic inverse portfolio optimization can be done using tools from reinforcement learning and inverse reinforcement learning. We will discuss such approaches in our next video.
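As a toy illustration of the second task above (recovering an investment policy from a history of observed actions), the sketch below assumes the observed trader follows a linear policy a_t = K z_t plus noise, and recovers K by least squares. The linear policy form, the dimensions, and the noise level are all assumptions made for this example, not the method developed in the course:

```python
import numpy as np

rng = np.random.default_rng(42)

T, n_z, n_a = 200, 2, 3     # observed periods, predictors, assets

# Hypothetical "true" linear policy a_t = K z_t used by the observed trader
K_true = np.array([[ 0.5, -0.2],
                   [ 0.1,  0.4],
                   [-0.3,  0.6]])

Z = rng.standard_normal((T, n_z))                        # history of predictors z_t
A = Z @ K_true.T + 0.01 * rng.standard_normal((T, n_a))  # observed (noisy) actions

# Inverse step: least-squares fit of the policy from the (z_t, a_t) pairs
K_hat = np.linalg.lstsq(Z, A, rcond=None)[0].T

print(np.max(np.abs(K_hat - K_true)))   # small: policy recovered from actions
```

With a longer observation window the estimate stabilizes, which mirrors the point above that multi-period data can give more stable estimates of signal value than a one-step inversion.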