"Trading is statistics and time series analysis." This blog details my progress in developing a systematic trading system for use on the futures and forex markets, with discussion of the various indicators and other inputs used in the creation of the system. Also discussed are some of the issues/problems encountered during this development process. Within the blog posts there are links to other web pages that are/have been useful to me.

Tuesday, 23 August 2011

It has taken some time, but I have finally been able to incorporate the Trend Vigor indicator into my Naive Bayesian classifier, with a slight twist. Instead of being purely Bayesian, the classifier has evolved to become a hybrid Bayesian/clustering classifier. The reason for this is that the Trend Vigor indicator has no real distribution of values for each market type, but instead tends to return values that are so close to each other that they can be considered a single value, as mentioned in an earlier post of mine. This can be clearly seen in the short 3D visualisation animation below. The x, y and z axes each represent an input to the classifier, and about 7 seconds into the video you can see the Trend Vigor axis in the foreground with almost straight vertical lines for its "distributions" for each market type. However, it can also be seen that there are spaces in 3D where only combined values for one specific market type appear, particularly evident in the "tails" of the no retracement markets (the outermost blue and magenta distributions in the video).

The revised version of the classifier takes advantage of this fact. Through a series of conditional statements, each 3D datum point is checked to see if it falls in any of these mutually exclusive spaces and, if it does, it is classified as belonging to the market type that has "ownership" of the space in which it lies. If the point cannot be classified via this simple form of clustering, then it is assigned a market type through Bayesian analysis.
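As a rough illustration of this two stage logic, here is a minimal Python sketch. It is not the classifier's actual code: the axis-aligned boxes, the market type labels and the `bayes_fallback` callable are all hypothetical stand-ins, on the assumption that each exclusive space can be approximated by simple range checks on the three inputs.

```python
from typing import Callable, Dict, Sequence, Tuple

Point = Tuple[float, float, float]
# Hypothetical region shape: ((x_min, x_max), (y_min, y_max), (z_min, z_max)).
Box = Tuple[Tuple[float, float], Tuple[float, float], Tuple[float, float]]


def in_box(point: Point, box: Box) -> bool:
    """True if every coordinate lies inside the box's range on that axis."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(point, box))


def classify(
    point: Point,
    exclusive_boxes: Dict[str, Sequence[Box]],
    bayes_fallback: Callable[[Point], str],
) -> str:
    """Clustering step first; Bayesian analysis only if no region owns the point."""
    for market_type, boxes in exclusive_boxes.items():
        if any(in_box(point, box) for box in boxes):
            return market_type
    return bayes_fallback(point)


# Illustrative values only: one made-up region and a dummy fallback.
boxes = {"up_no_retracement": [((0.8, 1.0), (0.8, 1.0), (0.8, 1.0))]}
print(classify((0.9, 0.9, 0.9), boxes, lambda p: "sideways"))
print(classify((0.1, 0.9, 0.9), boxes, lambda p: "sideways"))
```

Because the regions are assumed to be mutually exclusive, the order in which they are checked does not matter.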

This Bayesian analysis has also been revised to take into account the value of the Trend Vigor indicator. Since these values have no distribution to speak of, a simple linear model is used. If a point is equidistant between two Trend Vigor classifications it is assigned a 0.5 probability of belonging to each, this probability rising in linear fashion to 1.0 if it falls exactly on one of the vertical lines mentioned above, with a corresponding decrease in probability assigned to the other market type classification. There is also a boundary condition applied where the probability is set to 0.0 for belonging to a particular market type.
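The linear model can be sketched in a few lines of Python. The function name and the two "level" arguments (the positions of two adjacent vertical lines on the Trend Vigor axis) are hypothetical, and the clamping used here is only one assumed reading of the boundary condition: a value lying beyond one line, on the side away from the other, gets probability 0.0 for the other market type.

```python
def linear_membership(value: float, level_a: float, level_b: float):
    """Return (P(type A), P(type B)) for a Trend Vigor value.

    level_a and level_b are the Trend Vigor values of the two market types.
    The probability is 1.0 on a type's own line, 0.5 at the midpoint, and
    changes linearly in between.
    """
    if level_a == level_b:
        raise ValueError("the two levels must differ")
    # Fractional position of value between the lines: 0.0 at A, 1.0 at B.
    t = (value - level_a) / (level_b - level_a)
    # Assumed boundary condition: clamp outside the interval.
    t = min(1.0, max(0.0, t))
    return 1.0 - t, t


print(linear_membership(0.5, 0.0, 1.0))   # midpoint between the two lines
print(linear_membership(0.0, 0.0, 1.0))   # exactly on line A
print(linear_membership(2.0, 0.0, 1.0))   # beyond line B
```

The two probabilities always sum to 1.0, so they can be multiplied into the other Bayesian likelihoods without further normalisation between the two candidate types.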

The proof of the pudding is in the eating, and this next chart shows the classification error rate when the classifier is applied to my usual "ideal" time series.

The y axis is the percentage of ideal time series runs in which the market type was mis-classified, and the x axis is the period of the cyclic component of the time series being tested. In this test I am only concerned with the results for periods greater than 10, as in real data I have never seen extracted periods less than this. As can be seen, the sideways market and both the up and down with no retracement markets have zero mis-classification rates, apart from a small blip at period 12, which is within the 5% mis-classification error rate I had set as my target earlier.

Of more concern was the apparent large mis-classification error rate of the retracement markets (the green and black lines in the chart). However, further investigation of these errors revealed them not to be "errors" as such but more a quirk of the classifier, which lends itself to exploitation. Almost all of the "errors" occur consecutively at the same phase of the cyclic component, at all periods, and the "error" appears in the same direction. By this I mean that if the true market type is up with retracement, the "error" indicates an up with no retracement market; if the true market type is down with retracement, the "error" indicates a down with no retracement market. The two charts below show this visually for both the up and down with retracement markets and are typical representations of the "error" being discussed.

The first pane in each chart shows one complete cycle in which the whole cycle, including the most recent datum point, is correctly classified as being an up with retracement market (upper chart) and a down with retracement market (lower chart). The second pane shows a snapshot of the cycle after it has progressed in time through its phase, with the last point shown being the last point that is mis-classified. The "difference" between each chart's respective two panes at the right hand edge shows the portion of the time series that is mis-classified.

It can be seen that the mis-classification occurs at the end of the retracement, immediately prior to the actual turn. This behaviour could easily be exploited via a trading rule. For example, assume that the market has been classified as an up with retracement market and a retracement short trade has been taken. As the retracement proceeds our trade moves into profit, but then the market classification changes to up with no retracement. Remember that the classifier (never?) mis-classifies such no retracement markets. What would one want to do in such a situation? Obviously one would want to exit the current short trade and go long, and in so doing would be exiting the short and initiating the possible long at precisely the right time: just before the market turns upwards! This mis-classification "error" could, on real data, turn out to be very serendipitous.
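Such a rule might look something like the following Python sketch. The market type labels and the function name are hypothetical, and the mirrored rule for down with retracement markets is an assumption made by symmetry; only the up case is described above.

```python
def next_position(position: str, previous_type: str, current_type: str) -> str:
    """Reverse a retracement trade when the classification flips to the
    matching no retracement market type; otherwise keep the position."""
    if (position == "short"
            and previous_type == "up_with_retracement"
            and current_type == "up_no_retracement"):
        return "long"   # exit the retracement short and go long
    if (position == "long"
            and previous_type == "down_with_retracement"
            and current_type == "down_no_retracement"):
        return "short"  # assumed mirror image of the rule above
    return position


print(next_position("short", "up_with_retracement", "up_no_retracement"))
```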

All in all, I think this revised, Mark 2 version of my market classifier is a marked improvement on its predecessor.

Tuesday, 16 August 2011

Some time ago (the file was last edited in July 2010) I wrote an Octave .oct function to create synthetic data for testing and optimisation purposes. I was inspired to do so by the December 2005 issue of The Breakout Bulletin and it has recently come to mind again due to a posting on the StackExchange Quantitative Finance Forum here. I have posted the code for my .oct function in the code box below.

In writing this function I wanted to extend the ideas presented in the Breakout Bulletin and make them more applicable for the purposes I had/have in mind. By randomly scrambling the data any bar to bar dependency is destroyed (by design of course), but what if you want to preserve some bar to bar dependencies? This .oct function is my solution to preserving this dependency and a brief discussion of the theory behind it follows.

Firstly there is an assumption that any single bar and the market forces that caused the bar to be formed the way it did (up bar, down bar, doji etc.) are dependent on the immediately preceding market activity and the "current mode" of the market. Implicit in this assumption is that certain "types" of bars are more likely to be seen depending on market "mode," i.e. the types of bar in an up trend are likely to be distinctly different from those in a down trending or sideways trending market, so what is needed is some way to bin the bars which reflects this.

My solution is to apply a 21 bar moving median of the close, with median absolute deviations (MAD) from this median as bands above and below it, similar to Bollinger Bands. There are 3 levels above the median (1 x MAD, 2 x MAD and 3 x MAD) and 3 below, which together with the median itself divide the price range into a total of 8 "zones," as they are called in the code. Furthermore, a 21 bar moving median of the True Range and a 4 bar WMA of the True Range are also calculated. The first part of the code ("Code Block A Loop"), after all the required declarations, loops over the input time series data calculating all the above and assigning each bar to a specific bin based upon the "zone" in which the previous bar resides. Further assignation depends on whether the previous bar is a high or low volatility bar, decided by the True Range 4 bar WMA being above or below the True Range 21 bar moving median. This gives a total of 16 different bins to which a bar can be assigned. On assignation to a bin, the open, high, low and close are recorded in that bin by their relation to the previous close thus: log10(close/previous_close), log10(open/previous_close)... etc.
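The binning arithmetic can be sketched in Python as follows. This is not a transcription of the .oct function: the function names are hypothetical, the WMA weights are assumed to be linear (1, 2, 3, 4, with the newest bar weighted most), and the numbering of zones from 0 (below the lower 3 x MAD band) to 7 (above the upper 3 x MAD band) is an arbitrary choice.

```python
from statistics import median


def median_and_mad(closes):
    """Median of a window of closes and the median absolute deviation from it."""
    med = median(closes)
    mad = median(abs(c - med) for c in closes)
    return med, mad


def zone_index(close, med, mad):
    """Zone 0..7: the number of band lines (7 in total) the close lies above."""
    boundaries = [med + k * mad for k in (-3, -2, -1, 0, 1, 2, 3)]
    return sum(close > b for b in boundaries)


def is_high_volatility(true_ranges):
    """4 bar WMA of the True Range compared with its 21 bar moving median."""
    recent = true_ranges[-4:]
    weights = range(1, len(recent) + 1)  # assumed linear weights
    wma = sum(w * tr for w, tr in zip(weights, recent)) / sum(weights)
    return wma > median(true_ranges[-21:])


def bin_index(zone, high_volatility):
    """Combine 8 zones and 2 volatility states into one of 16 bins (0..15)."""
    return 2 * zone + int(high_volatility)


med, mad = median_and_mad([99.0, 100.0, 101.0, 100.5, 99.5])
zone = zone_index(101.2, med, mad)
print(bin_index(zone, is_high_volatility([1.0] * 20 + [5.0])))
```

In the scheme described above, the zone and volatility state are those of the previous bar, so the bin index would be computed one bar behind the bar being recorded.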

The next part of the code ("Code Block B Loop") actually creates the synthetic data by randomly drawing a bar's relationships to its previous close from the "relevant bin" and calculating a "new" bar based upon these relationships. This "relevant bin" is determined by the "zone" position and volatility of the most recently calculated synthetic "new" bar. After each "new" bar has been created, the median, MADs and True Range calculations are updated to include it, and it becomes the previous bar on the next iteration of the Code Block B Loop.
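A stripped-down Python sketch of the two loops is given below, assuming every price in a bar is stored as a log10 ratio to the previous close. All names are hypothetical. The `bin_of` argument stands in for the 16 bin zone/volatility logic, and the toy version used in the example simply bins on whether the previous bar closed up or down. Seeding the output with real bars, and falling back to the pooled data when a bin is empty, are assumptions of this sketch rather than features of the original code.

```python
import math
import random


def build_bins(bars, bin_of):
    """Roughly Code Block A: file each bar's log10 ratios under the bin
    of the bar that precedes it. bars is a list of (open, high, low, close)."""
    bins = {}
    for i in range(1, len(bars)):
        key = bin_of(bars[:i])            # bin decided by the previous bar
        prev_close = bars[i - 1][3]
        ratios = tuple(math.log10(p / prev_close) for p in bars[i])
        bins.setdefault(key, []).append(ratios)
    return bins


def synthesize(seed_bars, bins, bin_of, n_bars, rng):
    """Roughly Code Block B: draw ratios from the relevant bin, build a bar."""
    out = list(seed_bars)
    for _ in range(n_bars):
        candidates = bins.get(bin_of(out))
        if not candidates:                # assumed fallback: pool all bins
            candidates = [r for rs in bins.values() for r in rs]
        ratios = rng.choice(candidates)
        prev_close = out[-1][3]
        out.append(tuple(prev_close * 10 ** r for r in ratios))
    return out[len(seed_bars):]


def toy_bin(bars):
    """Toy stand-in for the zone/volatility bin: did the last bar close up?"""
    open_, _high, _low, close = bars[-1]
    return close >= open_


real = [(100.0, 101.0, 99.5, 100.5), (100.5, 102.0, 100.0, 101.5),
        (101.5, 101.8, 100.2, 100.4), (100.4, 101.0, 99.0, 99.2),
        (99.2, 100.5, 99.0, 100.3)]
bins = build_bins(real, toy_bin)
for bar in synthesize(real[:1], bins, toy_bin, 5, random.Random(0)):
    print([round(p, 2) for p in bar])
```

Because all four prices of a drawn bar are scaled by the same previous close, each synthetic bar keeps the internal shape (range, position of open and close) of the real bar it was drawn from.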

Finally, a small part of the code adjusts the input data in the case of negative values due to the possible use of continuous back-adjusted futures contracts as the input data. This is necessary to avoid errors in trying to calculate the log10 of a negative number.
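How the adjustment is made is not spelled out here, so the following Python sketch shows just one possible approach and should be read as an assumption: add a constant offset to every price so that the lowest value in the series sits at a small positive floor. The function name and the `floor` parameter are hypothetical.

```python
def shift_positive(bars, floor=1.0):
    """Shift all prices up by a constant so the lowest price equals floor.

    bars is a list of (open, high, low, close) tuples. Returns the shifted
    bars and the offset applied, so that the offset can be subtracted from
    the synthetic output afterwards. Series that are already strictly
    positive are returned unchanged with an offset of 0.0.
    """
    lowest = min(min(bar) for bar in bars)
    offset = floor - lowest if lowest <= 0 else 0.0
    return [tuple(p + offset for p in bar) for bar in bars], offset


shifted, offset = shift_positive([(-2.0, 1.0, -3.0, 0.5), (0.5, 2.0, 0.0, 1.5)])
print(offset, shifted)
```

Note that a shift of this kind changes the size of the log ratios, so bars near the floor will appear relatively more volatile than they do in the unadjusted data.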

The above method of binning the input data and subsequent randomisation is my attempt to ensure that dependencies/characteristics of the original data are preserved. For example, assume a bar is above the upper 3 x MAD level and is determined to be a high volatility bar; the next synthetically created bar will then be drawn only from the binned distribution of bars that, in the real data, also follow a bar that is above the upper 3 x MAD level and is determined to be a high volatility bar.

This code is offered as is and comes with no warranty whatsoever. However, if you like it and use it I would be interested to hear from you. In particular, if you have any suggestions for the code's improvement, extension, optimisation etc. or see any errors in the code, I would really appreciate your feedback.

A final thought: although not implemented in the above code, it would be possible to apply some form of "quality control" to the output. Statistical measures of the input time series could be taken and thresholds established, and only those synthetic outputs that fall within these threshold conditions would be accepted as a valid synthetic time series output.
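One way such a filter might look is sketched below in Python. The choice of statistic (the standard deviation of log returns of the close) and the 25% relative tolerance are purely illustrative assumptions, and the function names are hypothetical; any other measure of the input series could be substituted.

```python
import math
from statistics import pstdev


def log_returns(closes):
    """Close to close log returns of a strictly positive price series."""
    return [math.log(b / a) for a, b in zip(closes, closes[1:])]


def passes_quality_control(real_closes, synthetic_closes, rel_tol=0.25):
    """Accept the synthetic series only if its return volatility is within
    rel_tol (as a fraction) of the real series' return volatility."""
    real_vol = pstdev(log_returns(real_closes))
    synth_vol = pstdev(log_returns(synthetic_closes))
    return abs(synth_vol - real_vol) <= rel_tol * real_vol


real = [100.0, 101.0, 99.0, 102.0, 100.0]
too_quiet = [100.0, 100.01, 100.0, 100.01, 100.0]
print(passes_quality_control(real, real))
print(passes_quality_control(real, too_quiet))
```

In use, synthetic series would be generated repeatedly and any that fail the check simply discarded.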

Below is a screenshot of a time series and synthetic data generated from it using the above function code. For the moment I won't say which is the original and which is the synthetic data - perhaps readers would like to post their guesses as comments?