Machine Learning on DARWIN Datasets (MLD-I)

Machine learning, in essence, is the research and application of algorithms that help us better understand data.

By applying statistical learning techniques, practitioners can draw meaningful inferences from data and turn it into actionable intelligence.

Furthermore, the many open-source machine learning tools, platforms and libraries available today enable anyone to break into this field, using powerful algorithms to discover exploitable patterns in data and predict future outcomes.

This development in particular has given rise to a new wave of DIY retail traders who create sophisticated trading strategies that compete with (and in some cases outperform) others in a space previously dominated by institutional participants.

In this introductory blog post, we will discuss the case for machine learning and its main categories. In doing so, we will lay the foundation for using machine learning techniques to create DARWIN trading strategies in future posts in this series.

For your convenience, this post is structured as follows:

1) The Case for Machine Learning

2) Three Main Categories of Machine Learning

3) Setting up Python/R & C++ for Machine Learning on DARWIN Datasets

1) The Case for Machine Learning

We live in an age where both structured and unstructured data are available in abundance. Not only that, people now also have the tools and resources to gather this data for themselves if they so wish (at little to no cost), a reality that did not exist before.

Over time, machine learning has evolved into a robust means of capturing knowledge from large volumes of data, analyzing them and building predictive models, in a more scalable and efficient manner than manual, human-driven practices. In doing so, it has also enabled practitioners to iteratively improve existing models and incorporate data-driven decision-making in their pursuits.

Apart from its widespread use in finance, machine learning has also given rise to technologies that many now take for granted.

For example,

Email SPAM filters,

Video recommendation engines,

Personalized advertising,

Internet search engines,

Industrial robotics (e.g. in the automobile industry),

...and even self-driving cars!

The DARWIN dataset (a multivariate time series) can therefore benefit from machine-learning-led research, and that’s exactly what this series of blog posts aims to lay the groundwork for.

In fact, there exists an ever-growing number of DARWIN assets on our Exchange that are powered entirely by machine-learning-driven trading strategies. We discuss the three main categories of machine learning next.

2) Three Main Categories of Machine Learning

Main Types of Machine Learning

These are:

Supervised Learning

Unsupervised Learning

Reinforcement Learning

An argument can be made for a fourth category – Deep Reinforcement Learning – that combines deep learning (neural networks) with reinforcement learning (more on this in future posts).

We will now discuss the key differences between these three types, and with the help of examples, develop an understanding of their practical applications.

1) Supervised Learning

Supervised Machine Learning

In supervised learning, our aim is to “learn” a predictive model from “labeled” training data. The learned model is assessed for its ability to generalize well to unseen data, after which it can be used to predict outcomes based on future unseen data.

In this example, the Quote represents our response (or target) variable, and the remaining attributes our predictor variables. However, there is nothing stopping us from choosing any other variable as the target.

For example, a study could switch from attempting to predict a DARWIN’s next Quote (for trade entry purposes) to, say, predicting its next La (to forecast loss aversion). Several possibilities exist depending on the problem one is trying to solve.

In all cases, supervised machine learning attempts to find relationships between the predictor variables that “explain” the data, and the target variable (the output).
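To make the predictor/target split concrete, here is a minimal Python sketch. It assumes the data arrives as a list of rows (oldest first) and that we want to predict the next row's value of the chosen target. "Quote" and "La" come from the discussion above; the column names attr_a and attr_b, the sample values and the helper make_supervised are hypothetical and used only for illustration.

```python
# Hypothetical rows of DARWIN attribute data, oldest first.
# "attr_a" and "attr_b" are made-up predictor names, not real attributes.
rows = [
    {"Quote": 100.0, "La": 7.1, "attr_a": 5.0, "attr_b": 6.2},
    {"Quote": 100.4, "La": 7.0, "attr_a": 5.1, "attr_b": 6.0},
    {"Quote": 100.1, "La": 6.8, "attr_a": 5.3, "attr_b": 6.1},
    {"Quote": 100.9, "La": 6.9, "attr_a": 5.2, "attr_b": 6.4},
]


def make_supervised(rows, target):
    """Pair each row's non-target attributes with the NEXT row's target value."""
    X, y = [], []
    for current, following in zip(rows, rows[1:]):
        predictors = [name for name in sorted(current) if name != target]
        X.append([current[name] for name in predictors])
        y.append(following[target])
    return X, y


# Same data, two different problems: only the choice of target changes.
X_quote, y_quote = make_supervised(rows, "Quote")
X_la, y_la = make_supervised(rows, "La")
```

Switching the target from "Quote" to "La" changes which column is held out as the label, while the rest of the pipeline stays the same.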

1.1) Regression

The following image illustrates one of the most basic forms of regression tasks, a linear regression.

In this example, a straight line is “fit” robustly to training data containing predictor values (x) and a response value (y), such that the distance between the data points and the line is minimized. The resultant gradient and intercept of the line can then be used to predict the outputs (y) of future unseen data (x).

Machine Learning – Linear Regression
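As a rough illustration of the fit described above, the sketch below uses ordinary least squares, which minimizes the sum of squared vertical distances between the points and the line. That choice of distance is our assumption here, and the training values are invented for the example. In practice a library routine would normally be used instead of hand-written formulas.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = gradient * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


def predict(x, gradient, intercept):
    """Use the fitted line to estimate y for an unseen x."""
    return gradient * x + intercept


# Invented training data: predictor values (x) and responses (y)
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.1, 3.9, 6.2, 7.8, 10.0]

gradient, intercept = fit_line(train_x, train_y)
forecast = predict(6.0, gradient, intercept)
```

For this toy data the fitted gradient is about 1.97 and the intercept about 0.09, so the forecast for x = 6 is roughly 11.91.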

Future blog posts in this series will cover the details, mathematical notation and how to perform regression tasks on DARWIN datasets in Python and R, with sample code.

1.2) Classification

In this sub-category of supervised machine learning, our task is to predict which discrete group (or “class”) unseen data belongs to.

As in regression analysis, the predictive model is once again “learned” from a training set where predictor variables and their corresponding target variable have already been provided to us. In this instance, however, the target variable is not a continuous numeric value but one of a fixed set of class labels or groups.

Using the example given above in the discussion on regression analysis applied to DARWIN data, a classification approach could modify the problem from predicting a continuous output (DARWIN Quote), to a binary output (UP or DOWN).

The predictive model in this case would then be used to predict the DARWIN’s next movement (UP or DOWN) as opposed to a numeric value for its next forecast Quote (or any other target variable depending on the problem being attempted).
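The sketch below shows one way this reframing might look in Python. It first converts a series of quotes into UP/DOWN labels, then trains a small logistic regression by gradient descent on a toy training set. Logistic regression is simply one binary classifier chosen for illustration; the single predictor, the sample values and all function names are hypothetical.

```python
import math


def label_moves(quotes):
    """Label each step UP if the next quote is higher, otherwise DOWN."""
    return ["UP" if nxt > cur else "DOWN" for cur, nxt in zip(quotes, quotes[1:])]


def train_logistic(X, y, epochs=2000, lr=0.1):
    """Fit logistic regression by batch gradient descent; y holds 0/1 targets."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, target in zip(X, y):
            z = bias + sum(w * f for w, f in zip(weights, features))
            error = 1.0 / (1.0 + math.exp(-z)) - target  # predicted prob - truth
            for j, f in enumerate(features):
                grad_w[j] += error * f
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def classify(features, weights, bias):
    """Predict UP when the model's probability of UP exceeds 0.5 (z > 0)."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return "UP" if z > 0 else "DOWN"


# Turning a continuous target into a binary one
moves = label_moves([100.0, 100.4, 100.1, 100.9])

# Toy training set: one hypothetical predictor per row, with its known label
X_train = [[-1.0], [-0.8], [-0.5], [0.4], [0.7], [1.2]]
labels = ["DOWN", "DOWN", "DOWN", "UP", "UP", "UP"]
y_train = [1 if label == "UP" else 0 for label in labels]

weights, bias = train_logistic(X_train, y_train)
prediction = classify([0.9], weights, bias)
```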

However, binary classification is not a must. A predictive model will classify unseen data based on class labels (groups) observed in the training set, thereby also permitting multi-class classification.

Supervised Machine Learning – Multi-Class Classification

For example, if the training set of DARWIN data contained rows of attribute scores for predictors (as in the regression example above) and class labels UP, DOWN, SIDEWAYS, BREAKOUT, STAY-OUT for targets, a robust predictive model could then “classify” future unseen data as one of these classes. Possible use cases include forecasting direction and volatility, and risk management.
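A multi-class model can be sketched in a few lines with a nearest-centroid classifier, which assigns unseen data to the class whose average training row is closest. This is one simple technique picked for illustration, not a recommendation. The two predictors and their values are hypothetical, and only three of the labels mentioned above are used to keep the example short.

```python
def fit_centroids(X, labels):
    """Average the training rows of each class into one centroid per class."""
    sums, counts = {}, {}
    for features, label in zip(X, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify_nearest(features, centroids):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Toy training set with two hypothetical predictors per row
X_train_mc = [
    [1.0, 0.5], [1.2, 0.4],      # rows labeled UP
    [-1.0, 0.5], [-1.1, 0.6],    # rows labeled DOWN
    [0.0, 0.2], [0.1, 0.1],      # rows labeled SIDEWAYS
]
labels_mc = ["UP", "UP", "DOWN", "DOWN", "SIDEWAYS", "SIDEWAYS"]

centroids_mc = fit_centroids(X_train_mc, labels_mc)
predicted_class = classify_nearest([0.9, 0.5], centroids_mc)
```

The model can only ever predict labels it observed during training, which is why the set of classes in the training data matters.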

Future posts in this series will cover the details, mathematical notation and how to perform classification tasks on DARWIN datasets in Python and R, with sample code.

2) Unsupervised Learning

Unsupervised Machine Learning

Unlike supervised machine learning, where a training set contains predictors and the target variable’s true outcomes, unsupervised learning works with unlabeled data whose structure is unknown.

Unsupervised learning techniques can be used to study this unknown structure, in an attempt to explore and extract valuable intelligence for a variety of predictive modeling purposes.

There are two main sub-categories of unsupervised learning:

Clustering

Dimensionality Reduction

2.1) Clustering

Clustering is an unsupervised learning technique that enables practitioners to take data with unknown structure and assemble it into meaningful classes or clusters.

Unlike supervised classification problems where training data will enable the “learning” of underlying relationships from already available ground truths, clustering algorithms will assemble data of unknown structure into classes without any previous knowledge of underlying relationships.

Each class or cluster arrived at contains a set of observations that are similar to each other, but dissimilar to observations found in other clusters. This makes clustering a useful approach to extracting meaningful intelligence from input data.

Some of the many motivations for utilizing unsupervised learning in finance include data cleansing, portfolio selection, de-noising and detecting regime change.

The following image illustrates how clustering algorithms can be deployed on data with unknown structure, and yield finite numbers of clusters based on the similarity of predictor data:

Unsupervised Machine Learning – Clustering Algorithm
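As an illustration, the sketch below implements k-means, one widely used clustering algorithm (this post does not commit to a specific one). It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. The data points are invented, and initializing with the first k points is a deliberately naive assumption made to keep the example deterministic.

```python
def k_means(points, k, iterations=20):
    """Plain k-means: assign points to the nearest centroid, then recompute."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Invented, unlabeled observations with two predictors each
points = [
    [1.0, 1.0], [5.0, 5.0], [1.2, 0.8],
    [5.2, 4.9], [0.9, 1.1], [4.8, 5.1],
]

centroids, assignments = k_means(points, k=2)
```

No labels are supplied at any point: the two groups emerge purely from the similarity of the predictor values.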

Future posts in this series will explore and implement possible applications of clustering to DARWIN data, such as dynamic portfolio selection, custom filter creation and determining seasonality in DARWIN risk profiles, to name a few ideas.

Working code in Python/R/C++ will also be provided alongside any implementations arrived at.

2.2) Dimensionality Reduction

High-dimensional data can present challenges in terms of storage, computational efficiency (especially in real time, an important consideration for trading algorithms) and performance.

Combining 12 investment attributes for each DARWIN, across more than 2,500 DARWINs (as of 07 December 2017, 12:30 GMT), with the multitude of underlying strategy parameters available and any additional feature engineering, can quickly give rise to situations where a dimensionality reduction exercise is warranted.

Dimensionality reduction is useful for:

Reducing data from large to smaller dimensions, such that most of the important information in it is retained.

Visualization exercises, where high-dimensional data is projected onto 1D to 3D space for subsequent rendering in standard statistical charts for analysis.

The following image illustrates how dimensionality reduction can project a multi-dimensional (>3) dataset to a 2D surface while retaining most of its important information:

Machine Learning – Dimensionality Reduction
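One common way to perform such a projection is principal component analysis (PCA), used here purely as an illustration since this post does not name a specific method. The sketch below finds the directions of greatest variance with power iteration and deflation, then projects 4-dimensional rows onto the top two. The data is invented, and the code assumes the covariance matrix has at least k non-zero eigenvalues; in practice a linear algebra library would do this work.

```python
def principal_components(rows, k, iterations=500):
    """Return column means and the top-k unit directions of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _ in range(k):
        # Power iteration converges to the eigenvector with the largest eigenvalue
        v = [1.0] * d
        for _ in range(iterations):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
        components.append(v)
        # Deflation: remove the direction just found before looking for the next
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return means, components


def project(row, means, components):
    """Coordinates of one row in the reduced space (one value per component)."""
    return [sum((x - m) * u for x, m, u in zip(row, means, comp))
            for comp in components]


# Invented 4-dimensional observations
rows_4d = [
    [2.0, 1.9, 0.5, 0.1],
    [-2.0, -2.1, 0.4, -0.1],
    [1.0, 1.1, -0.6, 0.0],
    [-1.0, -0.9, -0.5, 0.1],
    [0.0, 0.0, 1.0, -0.1],
    [0.0, 0.0, -0.8, 0.0],
]

means_4d, components_4d = principal_components(rows_4d, k=2)
projected = [project(row, means_4d, components_4d) for row in rows_4d]
```

Each 4-dimensional row is now described by two numbers, which can be drawn directly on a standard scatter plot.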

Future posts in this series will outline the rationale and implementation of any dimensionality reduction exercises carried out, accompanied by Python/R/C++ source code where appropriate.

3) Reinforcement Learning

Machine Learning – Reinforcement Learning

This category involves the development of agents (e.g. systems) that optimize their own performance via interactions with their environment. It is related to supervised learning in that the agent receives feedback, but that feedback is a reward signal rather than a ground-truth label.

Agents respond to the current state of their environment, which also provides a reward signal. Through repeated interactions using a trial-and-error approach, the agent learns which series or assortment of actions leads to maximal reward.
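The loop described above can be sketched with tabular Q-learning, one classic reinforcement learning algorithm chosen here for illustration. The environment is a hypothetical toy: an agent on a line of five positions earns a reward of 1.0 only when it reaches the far end. All names and parameter values are assumptions made for the example and have nothing to do with trading yet.

```python
import random

N_STATES = 5      # states 0..4; state 4 is the rewarding terminal state
ACTIONS = [0, 1]  # 0 = step left, 1 = step right


def step(state, action):
    """Toy environment: move along a line, reward 1.0 on reaching the far end."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon (or on ties), otherwise exploit
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Move the estimate towards reward + discounted best future value
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
# Greedy policy: the best-valued action in each non-terminal state
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

After training, the greedy policy steps right from every position, even though the agent was never told which action is correct: it only ever saw rewards.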

Possibly one of the most amazing developments in the field of reinforcement learning is DeepMind’s AlphaGo Zero – in a nutshell, a reinforcement learning algorithm that mastered the game of Go by playing against itself repeatedly!

Reinforcement learning has several applications in trading, including its use in trade entry/exit timing, portfolio rebalancing and determining optimal holding periods, to name a few.

Future posts in this series will assess the suitability of reinforcement learning to DARWIN datasets, present any studies carried out and provide Python/R/C++ source code for the same.

3) Setting up Python/R & C++ for Machine Learning on DARWIN Datasets

In order to follow along with our future publications that include implementations and source code, you’ll need to have a functional DARWIN data science environment set up to support Python, R & C++.

The Darwinex® trademark and the www.darwinex.com domain are owned by
Tradeslide Trading Tech Limited, a company duly authorised and regulated by the Financial Conduct Authority
(FCA) in the United Kingdom with FRN 586466. Our Company number is 08061368 and our registered office is Acre
House, 11-15 William Road, London NW1 3ER, UK.
