Machine Learning: Dealing with Growing Data for Neural Networks

Introduction

You may or may not know that the phrase "data is the new oil" was coined in 2006 by Clive Humby, the British mathematician who famously rolled out Clubcard, the highly successful loyalty program of the British grocery chain Tesco, in the 1990s. You have likely heard the phrase countless times since, but Humby's point was simple and profound: data by itself doesn't have much value. In that respect it is not so different from oil, which needs to be refined before its practical benefits can be "unlocked" for the countless end markets it serves. In that vein, if data is the new oil, I guess our start-up Apteo is effectively a data refinery!

At Apteo, data is the lifeblood of what we do, but properly analyzing data to develop AI tools for investors, such as Milton, presents a unique set of technical challenges that must be addressed. In this post, we highlight one of those challenges.

Objective & Target Audience of Post

In this post, we go over some of the key technical challenges we have faced when using deep networks to analyze financial data on a single machine. The intended audience is software and machine learning engineers, data scientists, engineering managers, and other technical folks who are running into memory and scalability issues when building neural networks on single-instance GPUs. I have worked with large-scale clusters to analyze massive data sets in prior roles, as a Data Scientist at Twitter and as the Lead Engineer at TapCommerce (my prior start-up, which was sold to Twitter), but we have yet to use large-scale clusters at Apteo. With so much new data, though, I suspect we may have to make that transition soon!

By the end of this post, we hope you will understand some of the data scalability issues that we've run into and have an idea of how to address them efficiently.

Our Data Is "Medium Sized" Today...But Growing Fast

In the world of machine learning and AI, more usually means better. This tends to be true of the size of training datasets. All else being equal, more training instances lead to less overfitting, a higher likelihood of model convergence, and better models overall.

But there’s no such thing as a free lunch, and the accuracy gains that come from a large dataset are accompanied by engineering and infrastructure challenges that come from processing that dataset.

These are some of the technical challenges that we’ve recently had to address with Milton.

We build deep networks that leverage analyst opinions, news, articles, and reports, all in conjunction with various macro and company-specific data, and all with the goal of analyzing stocks objectively to create high-quality investing insights. Because we’re using deep networks to handle natural language processing, we need a large number of training examples in order for our models to learn. To date, we’ve avoided using distributed infrastructure to train our networks, which means all of our training sessions run on single-instance GPUs.

Thus far, we've been lucky in being able to avoid distributed computing. Clusters cost more money, require more engineers, have more points of failure, and are significantly more complicated to maintain than single machines. What they’re really good for, though, is processing large datasets that a single machine just can’t handle.

Until now, our dataset has been small enough to fit on a single machine. Recently it has grown significantly, although not quite to the point where implementing a distributed architecture would be worth it for us. Because of this growth, we’ve experienced a significant increase in memory and segfault issues that had never been problems before.

Although we have plenty of swap space available on our machines, the Linux kernel, the Python interpreter, and the CPython extension modules that we use don’t always interact as we’d expect them to, at least when it comes to addressing memory space. As a result, our efforts to handle our memory issues from the infrastructure and devops side yielded minimal results, and we turned to software-based solutions.

Batch Training

On Apteo's tech team, we place a strong emphasis on solving our problems by getting to their root cause, rather than guessing at what’s going wrong. Because of that, we took a lot of time to instrument our performance and analyze what was happening when our model training code was being executed on our machines.
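As a rough illustration of this kind of instrumentation, the sketch below uses the standard-library `tracemalloc` module to compare the peak memory of a function that materializes all of its data first against one that streams it. The helper `peak_memory_bytes` and the two toy functions are hypothetical stand-ins, not our actual training code, and `tracemalloc` only sees allocations made through Python's allocator, so memory allocated natively inside C extensions may not show up.

```python
import tracemalloc


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return (result, peak bytes traced during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def sum_all_at_once(n):
    # Builds the full list in memory before summing it.
    values = [float(i) for i in range(n)]
    return sum(values)


def sum_streaming(n):
    # Produces one value at a time, so nothing large is ever held.
    return sum(float(i) for i in range(n))


total_a, peak_a = peak_memory_bytes(sum_all_at_once, 200_000)
total_b, peak_b = peak_memory_bytes(sum_streaming, 200_000)
print(f"all at once: {peak_a:,} bytes, streaming: {peak_b:,} bytes")
```

For crashes rather than gradual growth, the standard-library `faulthandler` module can be enabled to print the Python traceback at the moment a segfault occurs, which helps locate the call into the offending C module.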

As mentioned, we realized that our process was being killed due to memory errors, despite a sufficient amount of remaining swap space. Most of these came in the form of segfaults, though some surfaced as Python MemoryError exceptions.

Segfaults in Python are usually caused by an underlying module written in C trying to access memory that it is not allowed to access. As we traced through our code, we found many places where our entire dataset was needlessly loaded into memory, which we believed was causing the underlying C-based modules to misbehave. This led us to conclude that we could solve these problems by batch-processing our data, which Keras made especially easy.

Initially we decided to implement a single fix: instead of training our model on all data at once, train it on smaller batches that could be loaded from disk in a streaming manner. This allows the underlying Python interpreter to allocate blocks of memory more effectively. Fortunately for us, Keras provides direct support for this paradigm in the form of data generators.

The initial refactor of our code was relatively straightforward: the call that fit our model on the full in-memory dataset became a call that fit it on a generator yielding one batch at a time.

Of course, we did have to create a new generator object that implemented the __len__ and __getitem__ methods, but doing that was relatively straightforward. Unfortunately, several other issues arose as part of this refactor, including updating unit tests and handling new segfault issues that came from our need to upgrade from Keras 1.0 to Keras 2.0.
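To make the shape of such a generator concrete, here is a minimal sketch rather than our production code. It assumes the features and labels are stored as two `.npy` files, which is an assumption made for illustration; the class name `NpyBatchSequence` and its parameters are hypothetical. It subclasses Keras's `Sequence` when Keras is installed and falls back to a plain object otherwise, so the batching logic can be exercised on its own.

```python
import math

import numpy as np

try:
    # Real base class when TensorFlow/Keras is installed.
    from tensorflow.keras.utils import Sequence
except ImportError:
    # Fallback so this sketch still runs without Keras (illustration only).
    Sequence = object


class NpyBatchSequence(Sequence):
    """Serves (features, labels) batches from two .npy files on disk."""

    def __init__(self, features_path, labels_path, batch_size=256):
        super().__init__()
        # mmap_mode="r" maps the files instead of reading them fully into RAM.
        self.features = np.load(features_path, mmap_mode="r")
        self.labels = np.load(labels_path, mmap_mode="r")
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.features) / self.batch_size)

    def __getitem__(self, index):
        start = index * self.batch_size
        stop = start + self.batch_size
        # np.asarray copies only this slice into memory.
        return (np.asarray(self.features[start:stop]),
                np.asarray(self.labels[start:stop]))
```

An instance of this class would be passed to the model in place of the in-memory arrays, via `fit_generator` in Keras releases of the 2.1 era or directly via `fit` in recent versions.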

We got around these new segfault issues by using Keras 2.1.0 instead of the most recent release at the time. For some reason, the older version’s code seemed to be a bit more efficient here.

This approach worked well for a while, but the size of our dataset began affecting us in other ways, primarily when we were transforming our dataset into a format that could be fed into our model’s fit function.

Because of this, we had to update our transformation code to use batches as well.

Batch Transformation

We implement a few techniques to make our entire training process more efficient.

The first is the use of a golden set: we periodically create a fully hydrated dataset of all of our training instances so that we can re-use that set as needed, rather than recreating it on the fly. When we need to add new features or data to that set, we simply add them on demand.

We’ve also created a series of data transformers that operate in sequence. This paradigm, which follows closely from the Pandas pipeline model, allows us to modularize the process of transforming our dataset into small and manageable chunks of code that are responsible for individual transformations themselves.

What we found was that when we were trying to update our golden sets with new features (so that we could evaluate how those features changed our models’ predictive accuracy), we were running into new memory errors.

We took a look at the offending transformers and again realized that we were running into the same issue as above. Namely, we were trying to allocate large blocks of memory space to process our data, so our approach to this was the same as above: batch-process our data in small chunks so that the underlying OS could more effectively manage available memory.

After we updated our transformers to operate in batches (while also removing and refactoring other transformers that were no longer necessary), we were able to update our code to operate properly, without memory issues.
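The batch-transformation pattern can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our actual pipeline: it assumes the dataset lives in a CSV file, and the transformers `add_return` and `add_is_up` and the column names `open` and `close` are hypothetical. It relies on pandas' `read_csv(chunksize=...)` to stream the file and on `DataFrame.pipe` to apply the transformers in sequence.

```python
import pandas as pd


def add_return(df):
    """Hypothetical transformer: daily return from open and close prices."""
    return df.assign(daily_return=(df["close"] - df["open"]) / df["open"])


def add_is_up(df):
    """Hypothetical transformer: flag rows where the price rose."""
    return df.assign(is_up=df["daily_return"] > 0)


TRANSFORMERS = [add_return, add_is_up]  # applied in this order


def transform_in_batches(src_csv, dst_csv, chunksize=10_000):
    """Stream src_csv through every transformer, chunksize rows at a time."""
    rows_written = 0
    for i, chunk in enumerate(pd.read_csv(src_csv, chunksize=chunksize)):
        for transformer in TRANSFORMERS:
            chunk = chunk.pipe(transformer)
        # Write the header with the first chunk, then append later chunks.
        chunk.to_csv(dst_csv, mode="w" if i == 0 else "a",
                     header=(i == 0), index=False)
        rows_written += len(chunk)
    return rows_written
```

This only works as written for transformers that operate row by row. A transformer that needs a statistic over the whole dataset, such as a global mean, would need a separate pass to compute that statistic before the batched pass applies it.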

This is a huge win for us, since it allows us to keep adding data and features without having to resort to implementing a distributed training cluster. Perhaps at some point we’ll have to bite the bullet and do that, but as a small startup, it's important to be resourceful.

Get in Touch

If you're interested in keeping up with what we're doing, you can sign up for Milton here and subscribe to our newsletter.

Also, please feel free to reach out directly at shanif@apteo.co with any questions.

About Apteo

Apteo, the company behind Milton, is made up of curious data scientists, engineers, and financial analysts based in the Flatiron neighborhood in New York City. We have a passion for technology and investing, and we strongly believe that investing is one of the most reliable and effective ways to build long-term wealth. We build AI tools to help informed investors make better decisions.

To learn more about us, please reach out to us at info@apteo.co, join our mailing list at milton.ai, or subscribe to Milton’s blog at blog.milton.ai.

Disclaimer

Apteo, Inc. is not an investment advisor and makes no representation or recommendation regarding investment in any fund or investment vehicle.