Introduction

This post explores a concept at the heart of quantitative financial research. Most qfin researchers use statistical techniques that require varying degrees of stationarity. As many of you are aware, financial time series violate pretty much all the rules of stationarity, and yet many researchers, including me, have applied or will apply these techniques when they are not appropriate, thereby calling into question many of the resulting conclusions.

In his new book Advances in Financial Machine Learning, Marcos Lopez De Prado proposes that qfin researchers use a different type of price bar. His research has shown that with alternatives to fixed time interval bars (minute, hour, day, week, etc.), the return series exhibits better statistical properties. In other words, with alternative bar types the return series better approximates normality/stationarity, which makes our research and conclusions more robust.

In this post we will experiment with the following bar types: Tick, Volume, Dollar Volume, and Dollar Volume Imbalance.
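To make the idea of non-time bars concrete, here is a minimal sketch of tick, volume, and dollar volume bars. It assumes the input is a list of (price, volume) trades; the function names and the rule of closing a bar once a running total reaches a threshold are my own illustration, not code from the book. Dollar volume imbalance bars need a signed-flow estimate and are left out of this sketch.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Tick = Tuple[float, float]  # (price, volume) of a single trade


def threshold_bars(ticks: Iterable[Tick], threshold: float,
                   measure: Callable[[float, float], float]) -> List[Dict[str, float]]:
    """Close a bar each time the accumulated measure reaches the threshold."""
    bars, prices, total = [], [], 0.0
    for price, volume in ticks:
        prices.append(price)
        total += measure(price, volume)
        if total >= threshold:
            bars.append({"open": prices[0], "high": max(prices),
                         "low": min(prices), "close": prices[-1]})
            prices, total = [], 0.0
    return bars  # a trailing, incomplete bar is dropped


def tick_bars(ticks, n_ticks):
    return threshold_bars(ticks, n_ticks, lambda p, v: 1.0)  # count trades


def volume_bars(ticks, volume):
    return threshold_bars(ticks, volume, lambda p, v: v)  # sum shares traded


def dollar_bars(ticks, dollars):
    return threshold_bars(ticks, dollars, lambda p, v: p * v)  # sum price * shares


ticks = [(10.0, 100), (10.5, 50), (10.2, 200), (9.9, 150), (10.1, 100), (10.4, 300)]
print(volume_bars(ticks, 300))
```

The same loop serves all three bar types; only the quantity being accumulated changes.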

Recap

See <Mixture Model Trading (Part 1, Part 2, Part 3, Part 4, Part 5, Github Repo)>. This research demonstrates a systematic trading strategy development workflow from theory to implementation to testing. It focuses on the concept of using Gaussian Mixture Models as a method for return distribution prediction and then using a simple market timing strategy to take advantage of the predicted asset return outliers.

Chapter Goals

Demonstrate how to extract algorithm portfolio equity from Quantconnect backtest

Introduction

This notebook will walk through the algorithm implementation process on the Quantconnect platform. Please be advised that this notebook will not actually run the algorithm, as I have not installed the Quantconnect backtesting engine locally. This is a demonstration of the process. The script is available in the ./scripts/ directory of the github repo to copy and paste into the Quantconnect environment.

Key Notes About The Quantconnect Platform

They use Python 2.7 and I do not know when/if Python 3 will be supported.

There is no interactive debugger at this time. Troubleshooting can be difficult if your algorithm is not logically structured for modularity.

There are some minor data issues that their team is hard at work correcting. At times there are trades that get filled that are in error so investigating the trade level data is important and fortunately straightforward to do.

Calls to the History() function create major RAM/time penalties so it is important to code your algorithm to be efficient with its data requests.

Mixture Model Trading Algorithm Outline

The algorithm will use Gaussian Mixture Models (GMM) to determine return outliers. Based on outlier direction the algorithm will go long (or short) the ETF. Based on the research conducted in chapter 3 I determined one tradeable pattern to be a long-only strategy with a 63 day holding period, post outlier event. The basic structure of the algorithm is:

Check open orders:

confirm all orders are filled

track fill dates

Check if any current holdings meet liquidation criteria. In this implementation the only liquidation criteria is whether we have held the security for the 63 day period.

check if today's date is greater than or equal to liquidation date.

if so liquidate the position.

Run the main algorithm computation. In this implementation we use a lookback of 252 days or approximately 1 trading year.

fit the GMM using N components.

extract hidden states and their parameters

sample from the chosen distribution using those parameters

compute confidence intervals

compare intervals with current return to identify outliers

assess direction of outliers e.g. too_low or too_high

assign securities to long (or short) based on direction of outliers

Use computed results to send orders.

this implementation uses MarketOnOpenOrders. This means that market orders are sent for the next day's open after an outlier event is triggered.
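The main computation step in the outline above can be sketched as follows. This is a minimal sketch rather than the script from the repo: it assumes the sampling distribution is normal, that the hidden state of interest is the component assigned to the most recent observation, and that a 99% interval is used. The function names fit_interval and classify are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_interval(returns, n_components=2, n_samples=1000, alpha=0.99, seed=0):
    """Fit a GMM and return a sampled confidence interval for the latest state."""
    X = np.asarray(returns, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=seed).fit(X)
    state = int(gmm.predict(X)[-1])            # hidden state of the last observation
    mu = float(gmm.means_[state, 0])           # that state's mean
    sigma = float(np.sqrt(gmm.covariances_[state, 0, 0]))  # and standard deviation
    rng = np.random.default_rng(seed)
    samples = rng.normal(mu, sigma, size=n_samples)  # assumed normal sampling
    tail = (1.0 - alpha) / 2.0 * 100.0
    lower, upper = np.percentile(samples, [tail, 100.0 - tail])
    return float(lower), float(upper)


def classify(current_return, lower, upper):
    """Label the current return relative to the interval."""
    if current_return < lower:
        return "too_low"
    if current_return > upper:
        return "too_high"
    return "none"


history = np.random.default_rng(42).normal(0.0, 0.01, 252)  # synthetic 252-day lookback
lower, upper = fit_interval(history)
print(classify(-0.5, lower, upper))
```

In a long-only variant, only one of the two outlier directions would be mapped to an order.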

Next we setup a PARAMETER_REGISTRY. This helps associate the chosen set of parameters with each backtest. Without it there is no way to know what parameters were used with which backtest when you go to compare results at a later date. However by registering the parameters we can log them. These backtest logs are always available for download when you load the results of your backtest.

Next up we define and register the global parameters that the algorithm class will use. These parameters contain a flag which logs whether the strategy was implemented as long-only, the number of samples for our confidence interval sampling, the chosen distribution we are using, and the parameters for the sklearn GMM we will implement.
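A parameter registry along these lines could look like the sketch below. The register_parameters helper, the parameter names, and the values are assumptions for illustration; on the platform the log argument would be the algorithm's own logging method rather than print.

```python
import json

PARAMETER_REGISTRY = {}


def register_parameters(name, params, log=print):
    """Store a copy of the parameters and write them to the backtest log."""
    PARAMETER_REGISTRY[name] = dict(params)
    log("PARAMETER_REGISTRY | {} | {}".format(name, json.dumps(params, sort_keys=True)))
    return PARAMETER_REGISTRY[name]


# Hypothetical global parameters mirroring the ones described above.
main_params = register_parameters("main", {
    "is_long_only": True,             # flag recording the strategy direction
    "n_samples": 1000,                # samples drawn for the confidence interval
    "sample_distribution": "normal",  # chosen sampling distribution
    "gmm": {"n_components": 2, "max_iter": 150},  # sklearn GMM settings
})
```

Because every registration is logged as a single JSON line, the parameters can be recovered later from the downloaded backtest log.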

OnData is the event handler that is called at the resolution we choose e.g. minute, hour, daily. However because this algorithm uses scheduled functions this function is not needed and is simply pass(ed).

The initialize function has a lot going on. In addition to setting the parameters we create the custom charts to track leverage, cash, RAM usage, and computation time.

A quick note on the schedule functions: the way to read them is that the main functions are scheduled twice weekly, on Monday and Friday, to run a designated number of minutes after the market opens for the SPY ETF. The Action is the function we want to run at that time.

Another important note is that we initialize our price history dataframe. We call it once here for the full 252 day lookback. Later we define a function called update_prices() which computes the number of additional days of history to request between the current date and the last date of our self.prices dataframe. Then it requests only that limited history, concatenates and cleans up the data so we only have data for the specified lookback period. This methodology saves massive RAM/time during the backtest runs.
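The incremental update can be sketched with pandas as below. This is a sketch under assumptions: fetch_history is a hypothetical stand-in for the platform's history request that returns the last n business days as a date-indexed dataframe, and the gap is counted in business days without regard to market holidays.

```python
import pandas as pd


def update_prices(prices, today, fetch_history, lookback=252):
    """Request only the days missing from `prices`, then trim to the lookback."""
    last_date = prices.index[-1]
    if today <= last_date:
        return prices                      # already up to date
    missing = pd.bdate_range(last_date + pd.Timedelta(days=1), today)
    if len(missing) == 0:
        return prices                      # e.g. only a weekend has passed
    new_rows = fetch_history(len(missing))  # small request instead of the full lookback
    combined = pd.concat([prices, new_rows])
    combined = combined[~combined.index.duplicated(keep="last")].sort_index()
    return combined.iloc[-lookback:]       # keep only the lookback window


# Synthetic data standing in for real price history.
full = pd.DataFrame({"SPY": range(300)}, index=pd.bdate_range("2017-01-02", periods=300))
prices = full.iloc[:252]
today = full.index[256]
updated = update_prices(prices, today, lambda n: full.loc[:today].iloc[-n:])
print(len(updated), updated.index[-1].date())
```

Here only five rows are requested on the update, rather than all 252.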

Next we define two functions to implement the main algorithm computation. First we define the function compute() which takes a single symbol, fits the GMM, extracts the hidden states and their parameters and determines if any outlier events have occurred.

Then we define the run_main_algo() function which aggregates the compute() information into a dataframe from a list of rows if and only if outlier events have occurred. This is also to save RAM/time. This function constructs the long (and/or short) numpy arrays that will be sent to the send_orders() function.

Next we define the send_orders() function which is responsible for sending the orders and updating our list of order tickets contained in the self.openMarketOnOpenOrders list. It contains some checks for efficiency and error handling purposes.

Finally we define our CHART_RAM() function which actually tracks RAM usage, computation time, leverage and cash. We also define the OnData() function which we simply pass as all functions are scheduled.

Again the full script can be found in the ./scripts/ directory of the github repo. Sign up to Quantconnect.com and paste the script into the Algorithm Lab (backtesting) environment. Test the algorithm with various parameters and see what you discover.

Next Steps

In part 5 we will evaluate the results of my backtests using 1, 2, and 4 GMM components.

Introduction

This is the beginning of a three part series that I completed towards the end of 2017 as a learning module for Quantinsti.com. The purpose of the series is to demonstrate a research workflow focused around the theory and application of mixture models as the core framework behind an algorithmic trading strategy. Below is a quote taken from the README of the github repo:

“The primary goal of this repo is to demonstrate the workflow between research of a quantitative idea or theory to implementation as a potential live trading strategy. Unlike other finance based tutorials the results will not be cherry picked to show only the best of the best examples. Sometimes results are counterintuitive, sometimes they are conflicting. Real strategy development is often dirty, complex, full of starts and stops and requires us to use all of our skills to extract the signal from the noise. With that said I welcome interactive engagement, ideas, insight, and constructive criticism, especially if errors or bugs are found.”

— https://github.com/BlackArbsCEO/mixture_model_trading_public

I will be presenting each of the notebooks on the blog, although you can feel free to read ahead by visiting the github repo directly. What is new, however, is that at the end of the three part series I will be publishing a Part 4 where I will describe an actual implementation of the strategy and release the code for the actual algorithm for my readers to dissect, alter, and experiment with on the Quantconnect.com platform.

Introduction

This article series provides an opportunity to move towards more interactive analysis. My plan is to integrate more Jupyter notebooks and Github repos into my research/publishing workflow. For datasets that are too big to share through github I will provide a download link both here and in the github readme.

I will be posting the notebooks into this blog using iframes. If you experience any issues with formatting I recommend viewing the notebook at github directly. If you're using mobile, you will have to "request the desktop site" for the ipynb to render.

Purpose

This is a simple reference article for readers who might wonder where I get/got my options data from. In this regard I would like to shout out the contributors to pandas-datareader; without their efforts this process would be much more complex.

Intuitive Explanation

So this code consists of three components. The first is the actual script that wraps the pandas-datareader functions and downloads the options data. The second is a helper script to save the aggregated data to disk. The helper script, which I call file_handler, is designed to save the data in multiple formats in a structured file directory. Internally it checks whether today's folder already exists under a particular date and naming convention; if it doesn't, it creates the folder and then stores all the data files there. What gives this code the ability to aggregate intraday data is the third component, which simply requires making use of your system's task scheduler. For example, if you have Linux/Ubuntu you can package this script to run as a cronjob quite easily. After the code below I show an example cronjob template that works.

This is the code for the file_handler script. It can save in 1 of the following 4 formats: parquet, h5, feather, csv. I save the list of symbol errors as a CSV since this list is generally quite small. As seen above I save the options data in parquet format first, and a backup in the form of an h5 file. Generally I prefer to work with parquet files because they are compressed by default, contain metadata, and integrate better with Dask. This code requires the installation of the pyarrow package.
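The folder check described above can be sketched with the standard library alone. This is not the file_handler script itself: the folder naming convention, the function names, and the choice of CSV (to keep the sketch dependency-free) are assumptions. Saving to parquet would instead go through a pandas dataframe with pyarrow installed.

```python
import csv
from datetime import date
from pathlib import Path


def todays_folder(base_dir, today=None, prefix="options"):
    """Return today's data folder, creating it if it does not exist yet."""
    today = today or date.today()
    folder = Path(base_dir) / "{}_{}".format(prefix, today.strftime("%Y-%m-%d"))
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return folder


def save_rows_csv(rows, fieldnames, folder, name):
    """Write a list of dicts to <folder>/<name>.csv and return the path."""
    path = Path(folder) / (name + ".csv")
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

Because the folder name depends only on the date, every hourly run within the same day lands in the same directory.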

Finally, below is an example of my cronjob. It is set to run Monday through Friday, hourly, from market open to close. Note the log directory and log file after the ">>"; all the print statements contained in the script will output to that log file including any exceptions.

Next Steps

The next article will document the code I refactored to calculate the option skew metric from the paper "What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns?" by Yuhang Xing, Xiaoyan Zhang and Rui Zhao. If you have been a long time reader, you may recall I did a series where I tracked a theoretical ETF equity strategy that was based on this metric. Over time, people have asked how it is performing, and I did not have an answer because I stopped tracking it, as I have been busy with other projects. However, the strategy showed promise then and I wondered if it could be applied directly in options trading. My goal is to research the possibility of implementing this strategy live, and if the results show an edge, implementing it and tracking the results publicly.

To accomplish this task I first needed to gather data which this article shows. In the next article I make heavy use of Dask because the volume of intraday data aggregated over a month is over 14 million rows and operating on the dataframe in-memory is slow and/or unfeasible on most people's systems including mine.

Additionally the next article will be a jupyter notebook I will embed as a blog post here directly, but recommend it be viewed on the github repo I will make public.

Notes on Part-2

Some astute readers in the comments noted that analysis based on the absolute difference in bid-ask price is not robust when considering the price of the underlying option and can lead to spurious conclusions. They recommended defining bid-ask spread as a percent of the option's spot price.

Additionally, I failed to constrain the analysis to include only options with a certain level of "moneyness". That is, options whose strike is far away from the price of the underlying behave differently than options whose strike is closer, and the prior analysis failed to incorporate that understanding. In Part 2 of this exploration we re-examine the conclusions drawn in Part-1, after incorporating the aforementioned suggestions. With that said, this post will largely follow the format of Part-1, so if you feel you are missing context for this analysis start there.
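Both adjustments are simple to express in code. The sketch below assumes the spread is measured relative to the ask price, that moneyness is the ratio of strike to underlying price, and that a band of 0.9 to 1.1 counts as close to the money. The dictionary keys and the band are illustrative choices, not the exact filters used for the plots.

```python
def spread_pct(bid, ask):
    """Bid-ask spread as a fraction of the ask price."""
    if ask <= 0:
        raise ValueError("ask must be positive")
    return (ask - bid) / ask


def moneyness(strike, underlying_price):
    """Ratio of strike to underlying price; 1.0 is exactly at the money."""
    return strike / underlying_price


def near_the_money(options, low=0.9, high=1.1):
    """Keep only contracts whose moneyness falls inside the band."""
    return [o for o in options
            if low <= moneyness(o["strike"], o["underlying"]) <= high]


contracts = [
    {"strike": 250.0, "underlying": 245.0, "bid": 1.90, "ask": 2.00},
    {"strike": 400.0, "underlying": 245.0, "bid": 0.01, "ask": 0.05},
]
for contract in near_the_money(contracts):
    print(round(spread_pct(contract["bid"], contract["ask"]), 4))
```

The second contract shows why the percentage matters: its absolute spread is only 4 cents, yet that is 80% of the ask.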

The Data

The data is a cleaned hdf5/.h5 file comprised of a collection of daily options data collected over the period of 05/17/2017 to 07/24/2017. By cleaned I mean I aggregated the daily data into one set, removed some unnecessary columns, cleaned up the data types and added the underlying ETF prices from Yahoo. I make no claims about the accuracy of the data itself, and I present it as is. It is approximately 1 GB in size and I have made it available for download at the following link:

CALLS & PUTS

CALLS ONLY

PUTS ONLY

What jumps out at me is how large the spread is as a percentage of the option's ask price as you move closer to expiration. From ~220 days and below (or bin 4.5+) the pattern appears to show a significant increase in spreads. With days to expiration longer than ~220 both calls and puts show a flattening.

My first guess as to what could cause this pattern is that, as the contract expiration approaches, the probability of being ITM is low for a vast majority of contracts. As a result the demand from market participants dries up so the cost to the market maker increases and to compensate spreads widen. I welcome any insight readers may have on this.

From these two plots we can see that the bulk of the bid-ask spreads are below 15% for both calls and puts. I find it interesting that for calls SLV, and XLF have more extreme tails than the others. DIA and XLF calls also appear to be priced consistently higher than the other symbols.

Looking at the put options we see DIA is more expensive with more extreme values than any other symbol. The tails for SPY, TLT, QQQ, and GLD are more extreme/dispersed than their call option counterparts.

CALLS ONLY

PUTS ONLY

The red lines indicate the 95% interval for the data. We can see that the two plots are very similar except for minor cosmetic differences. Looking at the puts, it still appears that as volume increases the spreads are compressed a bit more than the calls, even though the 95% intervals are nearly identical. Looking at the calls, there appear to be more extreme values at lower volumes than for the puts.

Furthermore it appears that in this admittedly small sampling, spreads decline as open-interest and volume increase. This should not be surprising to readers, but it is noteworthy. The hypothesized mechanism for this is simple: as volume/open-interest increase, it becomes less risky for market-makers to provide their services, thus lowering the overall cost to trade.

CALLS ONLY

PUTS ONLY

You could argue that the above plots show that market makers overall are pretty good at keeping spreads low regardless of the volume.

Also notice how much volume/open-interest there is in USO; both calls and puts are traded at a sharply higher volume than the other symbols. Next closest appears to be SLV, with XLF having some very popular contracts functioning as outliers. DIA and TLT appear to be the least traded; however, DIA appears to be priced most inefficiently compared to the other symbols.

CALLS & PUTS

CALLS ONLY

PUTS ONLY

In aggregate it appears that there is some relationship between volatility and spreads, with DIA, SPY, USO, SLV, TLT, and XLF showing increases in spreads co-occurring with increases in volatility. However, the relationship looks more tenuous when we disaggregate the options into calls and puts. For example USO calls appear to show a relationship between spreads and volatility quite clearly, but USO puts show no relationship at all. The same can be said about SLV, and XLF.

Summary Conclusions

Spreads increase dramatically as the contract nears expiration. The exact cause of this is only speculative and worthy of more investigation.

Examining selected symbols, it appears that most of the contracts are priced competitively with each other with DIA and XLF showing the most extreme outliers.

USO options have high interest from market participants as both calls and puts are traded at a higher volume.

The sample size is too small to conclude anything about volatility and spreads. This relationship needs to be researched further, as common wisdom suggests spreads get wider as volatility increases. Is that true in aggregate, for calls or puts? Is that relationship stronger intraday? Would it even show up in daily or weekly samplings?

The Objective

Compared to the equity market, the options market is a level up in complexity. For each symbol there are multiple expiration dates, strike prices for each expiration date, implied volatilities, and that's before we get to the option greeks.

The increased complexity presents us with more opportunity. More complexity means less ground truth, more errors, more gaps, and more structural asymmetries. Consider that THE dominant factor underlying options pricing, implied volatility, cannot be directly measured, only estimated! To estimate it requires other observable factors and a pricing model. We already know that "all models are wrong, but some are useful", thus there are opportunities to exploit the errors of others. To do that requires a better understanding than our competitors have, and so we begin our study of the options market.

This is the next step in the series for developing an options trading dashboard using Python and Python based tools. Thus far I have demonstrated two methods [1] [2] of scraping the necessary data. Now that the data has been collecting for a bit we can begin some initial exploratory analysis. As this is a purpose driven process we should set an objective for our study.

In this particular article I want to focus on exploring bid-ask spreads as that data is often unavailable for free.

The Data

The data is a cleaned hdf5/.h5 file comprised of a collection of daily options data collected over the period of 05/17/2017 to 07/24/2017. By cleaned I mean I aggregated the daily data into one set, removed some unnecessary columns, cleaned up the data types and added the underlying ETF prices from Yahoo. I make no claims about the accuracy of the data itself, and I present it as is. It is approximately 1 GB in size and I have made it available for download at the following link:

Call & Put Options

Call Options Only

Put Options Only

Some things are interesting. From ~250 through ~600 days in both call and put options the bid-ask spreads are compressed towards zero. There also appears to be less dispersion in put bid-ask spreads overall.

Looking at these plots we see further evidence of bid-ask spreads showing less dispersion across puts vs calls. Also it's surprising to see DIA options having such a wide range of values compared to SPY and QQQ; this is especially true for the call options.

Calls and Puts

Calls Only

Puts Only

Again we see put bid-ask spreads squeezed towards zero even as volume increases. We also see SPY and USO with small spreads as both volume and open interest increases. This suggests there are symbols/contracts with higher relative trading capacity.

How Do Bid-Ask Spreads Vary With Volatility?

Calls and Puts

Calls Only

Puts Only

Some notes. DIA again appears to have the highest dispersion in bid-ask spreads for both calls and puts. GLD is also notable. It is also somewhat surprising that for these selected ETFs increased volatility doesn't appear with increased bid-ask spreads.

Summary Conclusions

Put options have less overall dispersion in bid-ask spreads than calls relative to days to expiration, volume, and volatility.

Bid-ask spreads are strongly compressed between ~250 and ~600 days to maturity, where they appear smaller than in all other buckets.

Recap

In the previous post I revealed a web scraping trick that allows us to defeat AJAX/JavaScript based web pages and extract the tables we need. We also covered how to use that trick to scrape a large volume of options prices quickly and asynchronously using the combination of aiohttp and asyncio.

The Problem

It worked beautifully until... I told people about it. Shortly after publishing, my code stopped functioning. After investigating, it was clear no data was being returned during the aiohttp call to the Barchart server. I attempted to fix the code by adding the semaphore option to the asyncio call. Roughly speaking, in this context the semaphore option allows you to specify the max number of calls that can be made simultaneously. I tried 100, 50, 10, and 2, and they all failed.

I do not know what happened for sure, but if I had to guess, the increase in server load per unit of time was significant enough for Barchart's system/network staff to update their server settings and squash the multiple simultaneous calls.
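For readers unfamiliar with the semaphore option, here is a minimal sketch of how asyncio.Semaphore caps the number of simultaneous calls. The fetch coroutine is a stand-in that only sleeps; in the real scraper it would be the aiohttp request. The names bounded_fetch and run_all are hypothetical.

```python
import asyncio


async def bounded_fetch(symbol, semaphore, fetch):
    """Wait for a free slot before calling fetch, then release it."""
    async with semaphore:
        return await fetch(symbol)


async def run_all(symbols, fetch, max_concurrent=10):
    """Run fetch for every symbol with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [bounded_fetch(symbol, semaphore, fetch) for symbol in symbols]
    return await asyncio.gather(*tasks)  # results keep the order of symbols


async def fake_fetch(symbol):
    await asyncio.sleep(0.01)  # stand-in for a network round trip
    return symbol.lower()


print(asyncio.run(run_all(["SPY", "QQQ", "DIA"], fake_fetch, max_concurrent=2)))
```

Note that the semaphore only limits concurrency. It does not slow the overall request rate much, which may be why lowering the limit did not help here.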

The Solution

We simply build a sequential scraper instead of an asynchronous one. To make it more robust we have to add a simple twist to the code that makes it more difficult to diagnose human vs automated traffic.

Barchart Scraper Class

This class is similar to the previous version except asyncio is stripped out. Its main function is to create the POST url, call the server and return the response data. Please note, I tested this class with a dynamic referer symbol and random user agents, and this simple hardcoded setup has worked most consistently for me.

Utility Functions

Next we devise 2 utility functions. The first function is simply a convenience function to run the first iteration of the scraper. We need to do that for each symbol in order to extract the expiration dates dynamically.

Putting It All Together

Next we can implement the main script body. Essentially it runs a main loop and an inner loop. For each symbol we get the default first data, extract the expirys, and then for each expiration extract the data. At the end of the inner loop, all data for that symbol is concatenated and then appended to a list containing all the symbols' dataframes. Finally all the symbols' dataframes are concatenated and saved to hdf.

The Simple Trick

Did you notice the random_wait at the end of the inner loop? We simply pass an array of reasonable wait times (measured in seconds) and their probabilities to numpy's random.choice() and pass the result to the time.sleep() function before iterating to the next symbol. This isn't guaranteed to always work, but in cases where servers may be restricting traffic loads it makes it much harder to identify your traffic as automated.

Ultimately, it's also a respectful way to operate our scraper.
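The loop structure and the random wait can be sketched as below. To stay dependency-free this sketch swaps numpy's random.choice() for the standard library's random.choices(), and the three callables passed to scrape_all are hypothetical stand-ins for the scraper and utility functions described above.

```python
import random
import time


def random_wait(wait_times, probabilities, sleep=time.sleep, rng=random):
    """Sleep for a randomly chosen number of seconds and return it."""
    wait = rng.choices(wait_times, weights=probabilities, k=1)[0]
    sleep(wait)
    return wait


def scrape_all(symbols, first_page, expirations_of, chain_for, pause):
    """Main loop over symbols, inner loop over each symbol's expirations."""
    all_rows = []
    for symbol in symbols:
        first = first_page(symbol)                # default page for the symbol
        symbol_rows = []
        for expiry in expirations_of(first):      # inner loop over expirations
            symbol_rows.extend(chain_for(symbol, expiry))
        all_rows.append(symbol_rows)              # one entry per symbol
        pause()                                   # wait before the next symbol
    return all_rows
```

A call such as random_wait([1, 2, 3, 5], [0.4, 0.3, 0.2, 0.1]) would then pause for between one and five seconds, with shorter waits being more likely.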

Next Steps

Next up in the series I plan to explore the data collected over the last 6 weeks I've been running this script. I hope to explore multiple angles and dynamics in the data.

Do you have any suggestions for exploration topics? If so, leave a comment or contact me via email or twitter.

Intro

This is Part 1 of a new series I'm doing in semi real-time to build a functional options data dashboard using Python. There are many underlying motivations to attempt this, and several challenges to implementing a tool like this from scratch.

Where to get the data? Is it affordable? Easily accessible? API?

How to parse the results?

How to aggregate and organize the data for analysis?

How to store the data? TXT, CSV, SQL database, HDF5??

How often should it run?

How to display the data? What dynamic graphic library to use? D3.js, MPL3d, Plotly, Bokeh, etc.?

These are some of the problems that need to be solved in order to create the tool.

In this post I show a current working solution to where to get the data, how to scrape it, how to parse it, and a storage method for fast read/write access. We will scrape Barchart.com's basic option quotes using aiohttp and asyncio. asyncio is part of the Python 3.6 standard library, while aiohttp is a third-party package that must be installed separately. We will parse the data using Pandas and Numpy and store it in the HDF5 file format.

Disclaimers

This is primarily an academic exercise. I have no intent to harm or cause others to harm Barchart.com or its vendors. My belief is that, by facilitating knowledge sharing, we will increase the number of educated participants in the options markets; thereby increasing the total addressable market for businesses like Barchart and its vendors. By designing tools like this we improve our own understanding of the use cases and applications (option valuation and trading) and can provide better feedback to those in the product development process.

The Secret to Scraping Ajax Sites

First let's create a mental model of what AJAX really is.

So looking at this, we can say AJAX is a set of web development techniques to increase the efficiency and user experience during website interaction. For example, you go to a website with cool data tables on it. You want to change one of the filters on the data so you select the option you want and click. What happens from there?

In simply designed or older websites your request would be sent to the server, and updating the data table with your selected filters would require the server response to reload the entire page. This is inefficient for many reasons, one of which is that the element in need of updating is often only a fraction of the entire webpage.

AJAX allows websites to send requests to the server and update page elements on an element by element basis negating the need for reloading the entire page every time you interact with the page.

This improvement in efficiency comes at the added cost of complexity, for web designers and developers and for web scrapers. Generally speaking the url you use to go to an AJAX page is not the actual url that gets sent to the server to load the page you view.

To build this understanding, let's look at a sample option quote page using the following link <https://www.barchart.com/stocks/quotes/spy/options>.

Warning: To follow along with the rest of this example you need access to developer mode in Chrome or its equivalent in other browsers.

Let's look behind the curtain so to speak. Click anywhere in the page and click inspect. Navigate to the Network tab in Chrome developer tools.

We're going to press F5 to reload the page and look for the following: Request Headers, and the Request URL.

We will need the Request URL and the Request Headers in order to construct our calls to the server a little later. Simply put, this is the secret! We can replicate our browser's behavior when it requests data from the server if we know the actual request URL and the request headers. This will be made clearer in the next section.
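As a minimal sketch of what replicating the browser means, the snippet below builds a request object carrying a URL and headers like the ones seen in the Network tab, without sending it. The URL template is a placeholder on example.com, not Barchart's actual endpoint, and the header values are assumptions. Copy the real ones from your own developer tools.

```python
from urllib.request import Request

# Placeholder only: substitute the Request URL shown in the Network tab.
URL_TEMPLATE = "https://example.com/api/options?symbol={symbol}"


def build_request(symbol, user_agent):
    """Build (but do not send) a request that mimics the browser's headers."""
    headers = {
        "User-Agent": user_agent,
        "Referer": "https://www.barchart.com/stocks/quotes/{}/options".format(symbol.lower()),
        "Accept": "application/json",
    }
    return Request(URL_TEMPLATE.format(symbol=symbol), headers=headers)


request = build_request("SPY", "Mozilla/5.0")
print(request.full_url)
print(request.header_items())
```

Sending it would be a matter of passing the object to urllib.request.urlopen, or of handing the same URL and headers to aiohttp as the module below does.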

The async_option_scraper.py Module

This is the key module for scraping the data. First the imports.

import asyncio
import aiohttp

first_async_scraper class

You may have noticed that when the page loads, it shows the nearest expiration date by default.

We know there are generally multiple expiration dates per symbol. However, some ETFs have weekly contracts, monthly, and/or quarterly. Instead of guessing the expiration dates, the first_async_scraper class scrapes the default pages so we can later extract the expiration dates directly from the page's JSON/dict response.

The workhorse function is run which calls the internal function _fetch. Inside the run function I've hardcoded a request url similar to the one we found before. I've also hardcoded the headers we found earlier as well. Notice both objects are string formats which can be dynamically updated with our ETF symbol.

The _fetch function takes the ETF symbol, the url string, the session object, and our request headers, makes the call to the server, and returns the response as a JSON/dict object.

The run function takes a list of symbols, and a user agent string - more on this later.

The aiohttp package has a very similar interface to the requests module. We first create a ClientSession object which acts like a context manager. After creating the session object, we loop through each symbol using the asyncio.ensure_future function to create and schedule the event task. The gather function executes the tasks asynchronously waiting until all tasks have completed. It returns a list of JSON responses, each representing one ETF.
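The scheduling pattern described above can be sketched without touching the network. In this minimal example the _fetch coroutine is a stand-in that just sleeps briefly and returns a made-up dict; in the real module it would make the aiohttp call inside the session. Only the ensure_future and gather usage is the point here.

```python
import asyncio


async def _fetch(symbol):
    """Stand-in for the real request: pretend to wait on the server."""
    await asyncio.sleep(0.01)
    return {"symbol": symbol, "data": []}


async def run(symbols):
    # Schedule one task per symbol, then wait for all of them together.
    tasks = [asyncio.ensure_future(_fetch(symbol)) for symbol in symbols]
    return await asyncio.gather(*tasks)


responses = asyncio.run(run(["SPY", "QQQ", "TLT"]))
print([r["symbol"] for r in responses])
```

gather returns the results in the same order as the tasks were passed in, which is what lets us line the responses back up with the symbol list later.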

The Expirys Class

Once we have the list of responses we need to extract the expiry dates from each page source, collecting them for later use. The class is initialized with two parameters - a list of ETF symbols, and the list of page responses from the first scrape job.

It uses two functions. The internal function _get_dict_expiry takes a single response object and returns the list of expirations for a single symbol. The exposed function get_expirys loops through the list of ETFs and responses aggregating them into a dictionary. The dictionary keys are the ETF symbols and the values are lists of expirations for that symbol.
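The aggregation step might look something like the sketch below. The shape of the response (a dict with an "expirations" key) is an assumption made for illustration only, and get_expirys here is a plain function rather than the class method from the module.

```python
def get_expirys(symbols, responses, key="expirations"):
    """Map each symbol to the expiration list found in its response."""
    expirys = {}
    for symbol, response in zip(symbols, responses):
        # Skip symbols whose response lacks the field instead of failing.
        dates = response.get(key)
        if dates:
            expirys[symbol] = list(dates)
    return expirys


# Made-up responses, one per symbol, in the same order as the symbol list.
symbols = ["SPY", "QQQ", "XYZ"]
responses = [
    {"expirations": ["2017-06-16", "2017-06-23"]},
    {"expirations": ["2017-06-16"]},
    {},
]
print(get_expirys(symbols, responses))
```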

xp_async_scraper class

The final scraper class is nearly identical to the first_async_scraper except for some additional arguments for the functions xp_run(), and _xp_fetch() to accept the expiry dates. Also notice that the hard coded URL in the xp_run function is slightly different in that it is formatted to accept the ETF symbol and an expiration date.

last_price_scraper class

This class has the same structure and form as the other scraper classes except slightly simpler. The purpose of this class is to simply retrieve the basic html source for each ETF so that we can later extract the last quote price for the underlying equity.

The option_parser.py Module

Once we have all the data we need to be able to parse it for easy analysis and storage. Fortunately this is relatively simple to do with Pandas. The option_parser.py module contains one class, option_parser, and three functions: extract_last_price(), create_call_df(), and create_put_df().

The option_parser class is initialized with an ETF symbol and the appropriate response object. The create dataframe functions extract the call/put data from the JSON/dict response, then iterate through each quote, combining them into dataframes while taking care to clean the data set and change the datatypes from objects to numeric/datetime where appropriate. The extract_last_price function is used to get the underlying quote price from the basic html source.

The Implementation Script

Finally we can combine the modules into a script and run it. Note that this script requires the fake-useragent package. This package has a nice feature where it generates a random user agent string on every call. We need to do this so our requests are not blocked by the server.

The script imports a list of ETF symbols originally sourced from Nasdaq. Some of these symbols don't have options data, so they are filtered out. The script runs in the following order: basic html scraper -> first async scraper -> extracts the expiry dates -> xp async scraper which aggregates all the option data -> parses the collected data into a dataframe format -> downloads and inserts any missing underlying prices -> then saves it to disk as an HDF5 file.

Recap

Thus far in the series we've explored the idea of using Gaussian mixture models (GMM) to predict outlier returns. Specifically, we were measuring two things:

The accuracy of the strategy implementation in predicting return distributions.

The return pattern after an outlier event.

During the exploratory phase of this project there were some interesting results worthy of more investigation. The initial results implied that the strategy implementation was adaptable to changes in the means and volatilities of a small number of ETFs' returns.

Webinar Hypothesis

Recently I had the opportunity to present my first webinar with QuantInsti.com. I definitely have some areas for improvement, but the experience was great overall, and I learned a lot.

I chose this topic to present, and through the process I was able to refine the hypothesis, the code, and my thinking on the subject. The hypothesis is simple:

Can a GMM based strategy predict asset return distributions such that a strategy which "buys" the asset post an outlier event can "earn" a positive return?

Analysis and Conclusions

There were a couple of takeaways from the project. Overall the strategy showed promise. What really impressed me was the difference in the sampled confidence intervals when using the Normal distribution vs. the JohnsonSU distribution. See the following example:

On the left, we have the same strategy except the sampled confidence intervals are drawn from a normal distribution. On the right we use the JohnsonSU distribution. In terms of predicted return distribution accuracy it's not even close: JohnsonSU is the clear winner, even showing an ability to adjust to periods of clustered volatility.

However, note the equity curves in the example. The normal distribution wins handily, but that is because the strategy is so inaccurate that it predicts outlier returns occurred ~97% of the time, so technically that would be a buy and hold strategy which benefits from the strong uptrend in SPY post 2009.

Another takeaway is that the model shows a bias towards US based ETFs. You can see that by examining the Seaborn facetgrid plots in the notebook I will share at the end. Aggregating the results into a tidy data format made the analysis so simple that I kicked myself for not adhering to these principles sooner. In the examples I examine the strategy results according to median returns and the sum_ratio.

Median returns are simply the median returns of the strategy for that set of parameters. The sum_ratio is the sum of all strategy returns that ended positively divided by the sum of all returns that ended negatively for a set of parameters. A "successful" strategy should have a sum_ratio > 1 across multiple dimensions as well as consistent positive median returns.
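Here is a small sketch of the two metrics on a made-up list of event returns. I take the absolute value of the losing sum so the ratio comes out positive, and I return infinity when there are no losing events; both choices are assumptions of this sketch rather than something taken from the notebook.

```python
from statistics import median


def sum_ratio(returns):
    """Sum of the winning returns over the absolute sum of the losing ones."""
    gains = sum(r for r in returns if r > 0)
    losses = abs(sum(r for r in returns if r < 0))
    # With no losing events the ratio is undefined, so report infinity.
    return gains / losses if losses else float("inf")


event_returns = [0.02, -0.01, 0.03, -0.01]  # made-up returns
print(median(event_returns), sum_ratio(event_returns))
```

In this toy case the winners sum to 0.05 and the losers to 0.02, so the sum_ratio is roughly 2.5, comfortably above the threshold of 1.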

In the analysis I look at the two metrics across different lookback periods (1 year, 3 year, and expanding), different numbers of mixture model components (k=2, 3, 5, 7, 9, 13, 17, 21) and across a number of holding periods in days (steps = 1, 2, 3, 5, 7, 10, 21).
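To get a feel for the size of that search space, the parameter combinations can be enumerated with itertools.product. This is only a sketch of the grid itself; the variable names are mine and the model run for each combination is left out.

```python
from itertools import product

lookbacks = ["1 year", "3 year", "expanding"]
components = [2, 3, 5, 7, 9, 13, 17, 21]
steps = [1, 2, 3, 5, 7, 10, 21]

# Every (lookback, k, holding period) combination to evaluate per ETF.
grid = list(product(lookbacks, components, steps))
print(len(grid))  # 3 * 8 * 7 = 168 combinations per ETF
```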

When applied to SPY, QQQ, and TLT the strategy showed consistent positive results across a wide spectrum of parameter combinations whereas the application to GLD, EFA, and EEM were a little more mixed and definitely not as encouraging.

One theory I have for this result is that the factors I used as input to the GMM are US based interest rate spreads. These are likely to have a much stronger relationship to the behavior of SPY, QQQ, TLT vs the other ETFs. To improve performance I believe one would have to locate indicators based on the asset/ETF one wants to trade.

To sum up, I'm encouraged by the strategy framework, but would like to see a wider array of stocks, asset classes, and ETFs tested with various combinations of factors.

Jupyter (IPython) Notebook

Here is a sample exploratory notebook I put together for the webinar that demonstrates the conclusions drawn above.

Recap

In the previous post I gave a basic "proof" of concept, where we designed a trading strategy using Sklearn's implementation of Gaussian mixture models. The strategy attempts to predict an asset's return distribution such that returns that fall outside the predicted distribution are considered outliers and likely to mean revert. It showed some promise but had many areas in need of improvement.

Model Update

In this version I've refactored a lot of the code into a more object oriented structure. Now the code uses three classes.

ModelRunner() class - This is the class for executing the model and returning our prediction dataframe and some key parameters.

ResultEval() class - This takes the data from the prediction dataframe and key parameters and outputs our strategy returns and summary information.

In this post I'm going to skip to the results and conclusions, and provide the refactored code at the end.

Model Results

First let's look at the model results using SPY.

The first thing I noticed was that the confidence intervals were less responsive to increases in return volatility. The difference shows up in the reduction in accuracy. In Part 1, I believe the accuracy was ~71% whereas in the updated model the accuracy has dipped to ~68%! Does that hurt our strategy?

Judging by the equity curve, our strategy is not noticeably impacted by the reduced model accuracy!

The plotted equity curve is the cumulative sum of each event's returns assuming every event was a "trade". This should include overlapping events.
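As a minimal sketch of that calculation, assuming the per-event returns are already collected in order, the curve is just a running total. The numbers below are made up.

```python
from itertools import accumulate

# Made-up per-event returns, in the order the events occurred.
event_returns = [0.004, -0.002, 0.010, -0.005, 0.003]

# Each point on the curve is the sum of all event returns up to that event.
equity_curve = list(accumulate(event_returns))
print(equity_curve)
```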

Let's look at the model results for the other ETFs.

The model has some interesting output. Notice that model accuracy ranges from ~57% (TLT) to ~83% (EEM). However, both of these equity curves end positively. GLD is distinctly volatile and ends poorly, even though the model was 75% accurate. DIA, QQQ, SPY, and ACWI all have stable, sharply positive equity curves.

Conclusions

This supports my initial findings that model accuracy seems loosely, if at all, related to the strategy's equity curve. These results do indicate that the strategy is worth further evaluation but I'm hesitant to declare success.

I need to test the strategy over a longer period of time and make sure to include 2008/9. Also, I need to drill down into evaluating the strategy results vs the correlation of asset returns. For example, DIA, QQQ, and SPY are highly correlated, so we would expect the strategy to have similar results among those ETFs, but what about negatively and uncorrelated assets? TLT is generally negatively correlated with SPY while GLD is likely uncorrelated. Is the strategy performance for those two ETFs representative of other negatively/uncorrelated ETFs?

Recap

In Part 1 we learned about Hidden Markov Models and their application using a toy example involving a lazy pet dog. In Part 2 we learned about the expectation-maximization algorithm, K-Means, and how Mixture Models improve on K-Means' weaknesses. If you still have some questions or a fuzzy understanding of these topics, I would recommend reviewing the prior posts. In those posts I also provide links to resources that really helped my understanding.

Hypothesis

Given what we know about Mixture Models and their ability to characterize general distributions, can we use it to model a return series, such that we can identify outlier returns that are likely to mean revert?

Strategy

This strategy attempts to predict an asset's return distribution. Actual returns that fall outside the predicted confidence intervals are considered outliers and likely to revert to the mean.

We first fit a Gaussian Mixture Model to the historical daily return series. We use the model's estimate of the hidden state's mean and variance as parameters to a random sampling from the JohnsonSU distribution. We then calculate confidence intervals from the sampled distribution.

From there we evaluate model accuracy and the n days cumulative returns after each outlier event. We compute some summary statistics and try to answer the hypothesis.

Why the Johnson SU distribution?

Searching the net I found a useful bit of code from this site. Instead of assuming our asset return distribution is normal, we can use Python and Scipy.stats to find the brute force answer. We can cycle through each continuous distribution and run a goodness-of-fit procedure called the KS-test. The KS-test is a non-parametric method which examines the distance between a known cumulative distribution function and the CDF of your sample data. The KS-test outputs a p-value: the probability of seeing a distance at least this large if your sample data really came from the benchmark distribution.

After running this code you should see output similar to the output below. For simplicity's sake, just remember the higher the p-value, the more confident the KS-test is that our data came from the given distribution.

I had never heard of the Johnson SU distribution before this code. I had to research it, and I found that the Johnson SU was developed in order to apply the established methods and theory of the normal distribution to non-normal data sets. What gives it this flexibility is the two shape parameters, gamma and delta, or a, b in Scipy. For more information I recommend this Wolfram reference link and this Scipy.stats link.
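To make the KS-test idea above concrete, here is a minimal pure Python sketch of the distance it measures. This is not the Scipy code referenced earlier; it only computes the KS statistic (the largest gap between the empirical CDF and a reference CDF), not the p-value, which scipy.stats.kstest would give you. The samples are synthetic and the benchmark is a standard normal.

```python
import random
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        gap = max(gap, i / n - f, f - (i - 1) / n)
    return gap


rng = random.Random(42)
normal_sample = [rng.gauss(0, 1) for _ in range(2000)]
uniform_sample = [rng.uniform(-3, 3) for _ in range(2000)]

benchmark = NormalDist(0, 1).cdf
d_normal = ks_statistic(normal_sample, benchmark)
d_uniform = ks_statistic(uniform_sample, benchmark)
print(d_normal, d_uniform)
```

The sample that really was drawn from the benchmark should show a much smaller distance than the uniform one, which is the intuition behind ranking candidate distributions this way.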

Now we create our convenience functions. The first is the run_model() function which takes the data, feature columns, and Sklearn mixture parameters to produce a fitted model object and the predicted hidden states. Note that you can use a Bayesian Gaussian mixture if you so choose. The difference between the two models is that the Bayesian mixture model will try to derive the correct number of mixture components up to a chosen maximum. For more information on the Bayesian mixture model I recommend consulting the Sklearn docs.

Now we take the estimated state mean and variance of the last predicted state and feed it into the _get_ci() function. This function takes the alpha and shape parameters, estimated mean and variance and randomly samples from the JohnsonSU distribution. From this distribution we derive confidence intervals.
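A rough sketch of that sampling step is below, written in pure Python so it runs without Scipy. It relies on the fact that a Johnson SU variate can be generated as loc + scale * sinh((Z - a) / b) with Z standard normal, which matches the a, b parameterization Scipy uses. The function names, the shape values, and the choice to pass the state mean as the location and the state standard deviation as the scale are all assumptions of this sketch, not the original _get_ci() code.

```python
import math
import random


def sample_johnsonsu(a, b, loc, scale, size, rng):
    """Draw Johnson SU variates by transforming standard normal draws."""
    return [
        loc + scale * math.sinh((rng.gauss(0, 1) - a) / b)
        for _ in range(size)
    ]


def get_ci(samples, alpha):
    """Empirical two-sided (1 - alpha) interval from the sampled values."""
    xs = sorted(samples)
    n = len(xs)
    lower = xs[int((alpha / 2) * (n - 1))]
    upper = xs[int((1 - alpha / 2) * (n - 1))]
    return lower, upper


rng = random.Random(0)
# Hypothetical inputs: state mean and standard deviation, shape parameters.
state_mean, state_std = 0.0005, 0.01
samples = sample_johnsonsu(a=0.2, b=1.5, loc=state_mean, scale=state_std,
                           size=10_000, rng=rng)
lower, upper = get_ci(samples, alpha=0.01)
print(lower, upper)
```

Changing a and b stretches or skews the sampled distribution, which is how the intervals can be made more or less conservative.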

Now we can run the model in a walk-forward fashion. The code uses a chosen lookback period up until the cutoff year to fit the model. From there, the code iterates, refitting the model each day and outputting the predicted confidence intervals. The code is set up to run using successive cutoff years; however, I will leave that to you readers to experiment with. In this demo we will break the loop after the first cutoff year.

After that's complete we need to set up our analytics functions to evaluate the return patterns post each event. Recall that an event is an actual return that fell outside of our predicted confidence intervals.
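The event logic can be sketched in a few lines. The function names and the data are made up for illustration, and I assume log returns so that the returns over the holding period can simply be summed.

```python
def find_events(returns, lowers, uppers):
    """Label each day whose return fell outside its predicted interval."""
    events = []
    for t, (r, lo, hi) in enumerate(zip(returns, lowers, uppers)):
        if r < lo:
            events.append((t, "too_low"))
        elif r > hi:
            events.append((t, "too_high"))
    return events


def forward_return(returns, t, step_fwd):
    """Sum of the step_fwd returns after day t, or None if too near the end."""
    window = returns[t + 1 : t + 1 + step_fwd]
    return sum(window) if len(window) == step_fwd else None


# Made-up daily log returns with a constant predicted interval.
returns = [0.001, -0.030, 0.004, 0.006, 0.025, -0.002, 0.003]
lowers = [-0.02] * len(returns)
uppers = [0.02] * len(returns)

events = find_events(returns, lowers, uppers)
post_event = {t: forward_return(returns, t, step_fwd=2) for t, _ in events}
print(events, post_event)
```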

Conclusions

To answer the original hypothesis about finding market bottoms, we can examine the returns after a too low event. Looking at the summary we can see that the mean and median return are +62 and +82 bps respectively. Looking at the sum_ratio we can see that the sum of all positive return events is almost 2x the sum of all negative returns. We can also see that, given a too low event, after 5 days SPY had positive returns 65% of the time!

These are positive indicators that we may be able to predict market bottoms. However, I would emphasize more testing is needed.

Caveats and Areas of Exploration

We don't consider market frictions such as commissions or slippage

The daily prices may or may not represent actual traded values.

I used a coarse search to find the JohnsonSU shape parameters, a and b. These may or may not be the best values. Just note that we can use these parameters to arbitrarily adjust the confidence intervals to be more or less conservative. I leave this for the reader to explore.

In many cases both too high and too low events result in majority positive returns. This could be an indication of the overall bullishness of the sample period that may or may not affect model results in the future.

I chose k=2 components for computational simplicity, but there may be better values.

I chose the lookback period for computational simplicity, but there may be better values.

Varying the step_fwd parameter may help or hinder the strategy.

What makes this approach particularly interesting is that we don't want anything close to 100% accuracy from our predicted confidence intervals, otherwise we won't have enough "trades". This adds a level of artistry/complexity because the parameter values we choose should create predictable mean reversion opportunities, but the model accuracy is not a good indicator of this. Testing the strategy with other assets shows "profitability" in some cases where the model accuracy is sub 60%.

part 1 recap

In part 1 of this series we got a feel for Markov Models, Hidden Markov Models, and their applications. We went through the process of using a hidden Markov model to solve a toy problem involving a pet dog. We concluded the article by going through a high level quant finance application of Gaussian mixture models to detect historical regimes.

part 2 goals

In this post, my goal is to impart a basic understanding of the expectation maximization algorithm, which not only forms the basis of several machine learning algorithms, including K-Means and Gaussian mixture models, but also has lots of applications beyond finance. We will also cover the K-Means algorithm, which is a form of EM, and its weaknesses. Finally we will discuss how Gaussian mixture models improve on several of K-Means' weaknesses.

This post is structured as a Jupyter (IPython) Notebook. I used several different resources/references and tried to give proper credit. Please contact me if you find errors, have suggestions, or if any sources were not attributed correctly.

Who is Andrey Markov?

Markov was a Russian mathematician best known for his work on stochastic processes. The focus of his early work was number theory but after 1900 he focused on probability theory, so much so that he taught courses after his official retirement in 1905 until his deathbed [2]. During his research Markov was able to extend the law of large numbers and the central limit theorem to apply to certain sequences of dependent random variables, now known as Markov Chains [1][2]. Markov chains are widely applicable to physics, economics, statistics, biology, etc. Two of the most well known applications were Brownian motion [3], and random walks.

What is the Markov Property?

"...a random process where the future is independent of the past given the present." [4]

Assume a simplified coin toss game with a fair coin. Suspend disbelief and assume that the Markov property is not yet known and we would like to predict the probability of flipping heads after 10 flips. Under the assumption of conditional dependence (the coin has memory of past states and the future state depends on the sequence of past states) we must record the specific sequence that led up to the 11th flip and the joint probabilities of those flips. So imagine after 10 flips we have a random sequence of heads and tails. The joint probability of that sequence is 0.5^10 = 0.0009765625. Under conditional dependence, the probability of heads on the next flip is 0.0009765625 * 0.5 = 0.00048828125.

Is that the real probability of flipping heads on the 11th flip? Hell no!

We know that the event of flipping the coin does not depend on the result of the flip before it. The coin has no memory. The process of successive flips does not encode the prior results. Each flip is a unique event with equal probability of heads or tails, aka conditionally independent of past states. This is the Markov property.

What is a Markov Model?

A Markov chain (model) describes a stochastic process where the assumed probability of future state(s) depends only on the current process state and not on any of the states that preceded it (shocker).

Let's get into a simple example. Assume you want to model the future probability that your dog is in one of three states given its current state. To do this we need to specify the state space, the initial probabilities, and the transition probabilities.

Imagine you have a very lazy fat dog, so we define the state space as sleeping, eating, or pooping. We will set the initial probabilities to 35%, 35%, and 30% respectively.

Now that we have the initial and transition probabilities set up we can create a Markov diagram using the Networkx package.

To do this requires a little bit of flexible thinking. Networkx creates Graphs that consist of nodes and edges. In our toy example the dog's possible states are the nodes and the edges are the lines that connect the nodes. The transition probabilities are the weights. They represent the probability of transitioning to a state given the current state.

Something to note is networkx deals primarily with dictionary objects. With that said, we need to create a dictionary object that holds our edges and their weights.

Now we can create the graph. To visualize a Markov model we need to use nx.MultiDiGraph(). A multidigraph is simply a directed graph which can have multiple arcs such that a single node can be both the origin and destination.

In the following code, we create the graph object, add our nodes, edges, and labels, then draw a bad networkx plot while outputting our graph to a dot file.

Not bad. If you follow the edges from any node, it will tell you the probability that the dog will transition to another state. For example, if the dog is sleeping, we can see there is a 40% chance the dog will keep sleeping, a 40% chance the dog will wake up and poop, and a 20% chance the dog will wake up and eat.
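For readers who want to poke at the chain without drawing it, here is a small pure Python sketch. The initial probabilities and the sleeping row come from the text above; the eating and pooping rows are placeholder values I picked so that each row sums to one. The edge_weights dictionary mirrors the {(origin, destination): weight} shape described earlier, and the step and simulate helpers are my own additions.

```python
import random

states = ["sleeping", "eating", "pooping"]
initial = {"sleeping": 0.35, "eating": 0.35, "pooping": 0.30}

# Only the sleeping row comes from the text; the other rows are placeholders.
transitions = {
    "sleeping": {"sleeping": 0.4, "eating": 0.2, "pooping": 0.4},
    "eating": {"sleeping": 0.45, "eating": 0.45, "pooping": 0.1},
    "pooping": {"sleeping": 0.45, "eating": 0.25, "pooping": 0.3},
}

# Flatten into {(origin, destination): weight}, the shape used for graph edges.
edge_weights = {
    (origin, dest): p
    for origin, row in transitions.items()
    for dest, p in row.items()
}


def step(dist):
    """Push a distribution over states one step through the chain."""
    return {
        dest: sum(dist[origin] * transitions[origin][dest] for origin in states)
        for dest in states
    }


def simulate(n_steps, rng):
    """Sample one path of states, starting from the initial probabilities."""
    current = rng.choices(states, weights=[initial[s] for s in states])[0]
    path = [current]
    for _ in range(n_steps):
        row = transitions[current]
        current = rng.choices(states, weights=[row[s] for s in states])[0]
        path.append(current)
    return path


next_dist = step(initial)
path = simulate(5, random.Random(1))
print(next_dist)
print(path)
```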

What Makes a Markov Model Hidden?

Consider a situation where your dog is acting strangely and you wanted to model the probability that your dog's behavior is due to sickness or simply quirky behavior when otherwise healthy.

In this situation the true state of the dog is unknown, thus hidden from you. One way to model this is to assume that the dog has observable behaviors that represent the true, hidden state. Let's walk through an example.

First we create our state space - healthy or sick. We assume they are equiprobable.

This is where it gets a little more interesting. Now we create the emission or observation probability matrix. This matrix is size M x O where M is the number of hidden states and O is the number of possible observable states.

The emission matrix tells us the probability of each observable state, given the hidden state the dog is in.

Let's keep the same observable states from the previous example. The dog can be either sleeping, eating, or pooping. For now we make our best guess to fill in the probabilities.

The hidden Markov graph is a little more complex but the principles are the same. For example, you would expect a healthy dog to have a high probability of eating (60%) and a sick dog to have a very low one (10%).

Now, what if you needed to discern the health of your dog over time given a sequence of observations?

Using the Viterbi algorithm we can identify the most likely sequence of hidden states given the sequence of observations.

High level, the Viterbi algorithm increments over each time step, finding the maximum probability of any path that gets to state i at time t, that also has the correct observations for the sequence up to time t.

The algorithm also keeps track of the state with the highest probability at each stage. At the end of the sequence, the algorithm will iterate backwards selecting the state that "won" each time step, and thus creating the most likely path, or likely sequence of hidden states that led to the sequence of observations.
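Here is a compact sketch of the algorithm for the dog example. The equal starting probabilities and the 60%/10% eating probabilities come from the text; the hidden state transition probabilities and the remaining emission probabilities are hypothetical values chosen for illustration.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely sequence of hidden states for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               the previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        row = {}
        for s in states:
            row[s] = max(
                (best[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
        best.append(row)
    # Walk backwards from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path


states = ["healthy", "sick"]
start_p = {"healthy": 0.5, "sick": 0.5}
# Hypothetical transition probabilities between the hidden states.
trans_p = {
    "healthy": {"healthy": 0.7, "sick": 0.3},
    "sick": {"healthy": 0.4, "sick": 0.6},
}
# Only the two eating probabilities come from the text; the rest are made up.
emit_p = {
    "healthy": {"sleeping": 0.2, "eating": 0.6, "pooping": 0.2},
    "sick": {"sleeping": 0.4, "eating": 0.1, "pooping": 0.5},
}

observations = ["eating", "eating", "pooping", "pooping", "sleeping"]
path = viterbi(observations, states, start_p, trans_p, emit_p)
print(path)
```

For long sequences the products become very small, so a practical version would usually work with log probabilities instead.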

A Hidden Markov Model for Regime Detection

By now you're probably wondering how we can apply what we have learned about hidden Markov models to quantitative finance.

Consider that the largest hurdle we face when trying to apply predictive techniques to asset returns is nonstationary time series. In brief, this means that the expected mean and volatility of asset returns changes over time.

Most time series models assume that the data is stationary. This is a major weakness of these models.

Instead, let us frame the problem differently. We know that time series exhibit temporary periods where the expected means and variances are stable through time. These periods or regimes can be likened to hidden states.

If that's the case, then all we need are observable variables whose behavior allows us to infer the true hidden state(s). If we can better estimate an asset's most likely regime, including the associated means and variances, then our predictive models become more adaptable and will likely improve. We can also become better risk managers as the estimated regime parameters give us a great framework for better scenario analysis.

In this example, the observable variables I use are: the underlying asset returns, the Ted Spread, the 10 year - 2 year constant maturity spread, and the 10 year - 3 month constant maturity spread.

Next we will use sklearn's GaussianMixture to fit a model that estimates these regimes. We will explore mixture models in more depth in part 2 of this series. The important takeaway is that mixture models implement a closely related unsupervised form of density estimation. They make use of the expectation-maximization algorithm to estimate the means and covariances of the hidden states (regimes). For now, it is ok to think of it as a magic button for guessing the transition and emission probabilities, and most likely path.
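If you would like a peek inside the magic button, here is a bare-bones sketch of expectation-maximization for a two component, one dimensional Gaussian mixture. It is not sklearn's implementation and it is not the model used in this post: the data is synthetic with two deliberately well separated "regimes", the initialization is crude, and there is no convergence check. It only shows the E-step and M-step loop.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, n_iter=50):
    """Two-component 1-D Gaussian mixture fitted with plain EM."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    means = [min(data), max(data)]  # crude initial guesses
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = sum(r[k] * (x - means[k]) ** 2
                               for r, x in zip(resp, data)) / nk
    return weights, means, variances


rng = random.Random(7)
# Synthetic data: a wide "stressed" regime and a tight "calm" regime.
data = ([rng.gauss(-4.0, 1.0) for _ in range(500)]
        + [rng.gauss(2.0, 0.5) for _ in range(500)])
weights, means, variances = fit_gmm_1d(data)
print(weights, means, variances)
```

Real return regimes overlap far more than this toy data, which is one reason to lean on a tested library implementation with multiple initializations rather than a hand rolled loop.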

We have to specify the number of components for the mixture model to fit to the time series. In this example the components can be thought of as regimes. We will arbitrarily classify the regimes as High, Neutral and Low Volatility and set the number of components to three.

In the above image, I've highlighted each regime's daily expected mean and variance of SPY returns. It appears that hidden state 1 is our low volatility regime: it has the largest expected return and the smallest variance. Hidden state 0 is the neutral volatility regime with the second largest return and variance. Lastly, hidden state 2 is the high volatility regime: its expected return is negative and its variance is the largest of the group.

Conclusion

In this post we've discussed the concepts of the Markov property, Markov models and hidden Markov models. We used the networkx package to create Markov chain diagrams, and sklearn's GaussianMixture to estimate historical regimes. In part 2 we will discuss mixture models more in depth. For more detailed information I would recommend looking over the references. Setosa.io is especially helpful in covering any gaps due to the highly interactive visualizations.

Motivating the Journey

Where do Edges Come From?

You can see this simply by imagining the first strategy able to identify pricing errors on identical items in different markets. This knowledge is valuable in two scenarios: you can execute the transaction yourself or you know someone who can and will pay you for the "signal".

Abstractly, a signal can be thought of as a glitch in the matrix allowing us a view through a window into probabilistic future states. Signals can come from anywhere and are not always understood.

Our job is to find these signals, vet them, and implement them. This is difficult in practice. The competitive environment we seek to understand is dynamic with positive and negative feedback loops operating at various scales. The system processes are very noisy making signal extraction confusing and difficult. Competitors are always seeking strategies that "work" until they don't.

Generally profitable edges stop working when both your identification and execution strategies are well known. Thus a profit motive for secrecy and obfuscation exists among participants. If you are familiar with poker this will sound very familiar.

This also means that using well known identification techniques puts you at a strategic disadvantage because your competitors have likely incorporated knowledge of your methods into their own strategies.

Therefore we must continuously search for strategies that are not well understood, not well known, or otherwise difficult for our competitors to implement.

The Problem with Traditional Research

Too much published "research" focuses on using well known statistical tools to draw conclusions that do not improve the odds of profitable investment. Worse still, many research papers' results are not reproducible.

For periods of time, techniques involving technical analysis, regression, and simple correlations, were good enough to beat the market. This worked because the methodology was not well known or well understood. Times have changed.

These methods have been taught and promoted to generations of practitioners. These techniques form the foundation of many market participants' investment strategies. Therefore the majority of well known strategies are already in use by the market.

This means sophisticated participants have had time and opportunity to develop counter strategies to take advantage of the limitations of publicly known methods.

Typical business finance teachings focus on the theory that stock values are directly tied to the expected value of net cash flows produced by the underlying operating business from now into some future period. Other research links stock prices to any number of other observable factors. My perception is that these well taught methods can bias our exploratory research when it comes to the art and science of prediction.

Successful prediction does not require understanding or logic. Prediction does not require expertise in the industry or business which generated the data. These things can help solidify our belief in the power of the prediction, however successful prediction methods only require a stable, positive payoff function relative to prediction accuracy over an expanding time period. Nothing more, nothing less.

The Hidden Side

By definition a hidden factor is not directly observable. Its presence or influence is detected by its effect on observable factor(s) or on a delayed basis.

Conceptualizing the influence of hidden factors is difficult for many decision makers to either understand or incorporate into already existing processes.

The combination of bias created by traditional finance and difficulty conceptualizing hidden factors, creates the barriers to entry we need for successful strategy development. We can reasonably assume this research pathway is still rich with profitable edges and worth pursuing.

A Brief Description:

Part 1 - A Visual Introduction to Hidden Markov Models with Python

In part 1, we will discuss Markov Models, Hidden Markov Models and a toy application for regime detection.

Part 2 - Exploring Mixture Models with scikit-learn and Python

In part 2, we will explore the motivation behind mixture models and how they improve on the weaknesses of K-means algorithms. We will also discuss the connection between Mixture Models and Hidden Markov Models. Finally we will extend our toy regime detector to use a mixture model instead.

Part 3 - Predicting market bottoms with scikit-learn and python

In part 3, we will implement a toy strategy using mixture models to predict market bottoms. The strategy assumes that we can calibrate a model to predict the market return distribution such that actual returns that fall below the confidence intervals are profitable long entries over short time periods.

MOTIVATION

Since I started this blog a few years ago, one of my obsessions has been creating good looking, informative plots/charts. I've spent an inordinate amount of time learning how to do this and it is still a work in progress. However, all my work is not in vain, as several of you readers have commented and messaged me for the code behind some of my time series plots. Beginning with basic time series data, I will show you how I produce these charts.

DEFAULT PLOT WITH RECESSION SHADING

Now we have to set up our recession data so we can get the official begin and end dates for each recession over the period.

# recessions are marked as 1 in the data
recs = data.query('USREC==1')
# Select the two recessions over the time period
# (.loc replaces the deprecated .ix indexer; assumes a DatetimeIndex)
recs_2k = recs.loc['2001']
recs_2k8 = recs.loc['2008':]
# now we can grab the indices for the start
# and end of each recession
recs2k_bgn = recs_2k.index[0]
recs2k_end = recs_2k.index[-1]
recs2k8_bgn = recs_2k8.index[0]
recs2k8_end = recs_2k8.index[-1]

Now we can plot the default chart with recession shading. Let's take a look.

It's starting to look pretty good, but we can get even fancier. Say we wanted to annotate the global maximum and minimum returns in each subplot along with their respective dates for SPX and VIX. That could be a challenge. To do this we first need to extract the max/mins and idxmax/idxmin for both series.

The shape of the volatility smirk has significant cross-sectional predictive power for future equity returns. Stocks exhibiting the steepest smirks in their traded options underperform stocks with the least pronounced volatility smirks in their options by around 10.9% per year on a risk-adjusted basis. This predictability persists for at least six months, and firms with the steepest volatility smirks are those experiencing the worst earnings shocks in the following quarter. The results are consistent with the notion that informed traders with negative news prefer to trade out-of-the-money put options, and that the equity market is slow in incorporating the information embedded in volatility smirks. [1]

My strategy differs in that I arbitrarily chose 1 and 4 week holding periods to study. Additionally this strategy only analyzes a cross-section of ETFs instead of individual stocks. I chose ETFs because liquidity and data quality concerns are minimized. Here are the selected ETFs under analysis.