System & Language Agnostic Hyperparameter Optimization at Scale

And its Importance for AutoML

In recent years, there has been an explosion in the development of new machine learning architectures that achieve tasks once unfathomable for AI. While this article could rave endlessly about these exciting developments, it will instead be about the necessary framework that helps AI extend past the boundaries of the human mind. Boring? Maybe. But imagine a microservice that can optimize any ML model because it is agnostic to programming language, infrastructure, and model/result storage. At Capital One, I am proud to have helped build a cloud-based, system- and language-agnostic hyperparameter optimization framework that has helped us achieve state-of-the-art results.

Before getting into the how, it is important to understand the purpose of the framework, as well as the considerations that went into building it within Capital One's development and deployment environments on AWS.

What is Hyperparameter Optimization?

During the training phase of an ML model, such as an image object classifier, we iteratively optimize the model’s parameters (weights and biases) to meet an expectation. These are called model parameters, which are separate from a different class of parameters called hyperparameters, the focus of this article. In the most basic sense:

Hyperparameters — those which alter the manner in which a model learns its task (e.g., optimization method, batch normalization) or the model’s learning capacity (the structure of the model itself, e.g., the number of hidden layers and the size of each layer).

Model parameters — those which are iteratively learned while training the model (model weights and biases).
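The distinction above can be made concrete with a small scikit-learn example (purely illustrative, not part of the framework described in this article): hyperparameters are set before training via the estimator's constructor, while model parameters appear only after fitting.

```python
# Illustration: hyperparameters are chosen before training; model
# parameters (weights and biases) are learned during training.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: control *how* the model learns (set by us up front).
clf = SGDClassifier(alpha=1e-3, max_iter=1000, random_state=0)

clf.fit(X, y)

# Model parameters: learned while training.
print(clf.coef_)       # one weight per feature
print(clf.intercept_)  # bias term
```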

For example, if I want to teach an animal a new trick, some of the hyperparameters could be the animal's type, breed, and age; how I train the animal; the treats used during training; and the number of training sessions, because each of these affects either the animal's brain structure or its learning process. The process of teaching the animal itself, however, is what optimizes the animal's brain, which plays the role of the model parameters.

Figure 1: Neural network with labeled hyperparameters and model parameters. Model parameter weights, wij, and biases, bi, are contained within the hidden layers of the neural network. The hyperparameters number of hidden layers, layer size, and node connection patterns control the architecture of this neural network.

Ultimately, the purpose of hyperparameter optimization is to tune the training methodology and model architecture until the model achieves its best possible performance given available training, development, and test data.

Importance of Scaling

Depending on the optimization technique driving the hyperparameter tuning, multiple sets of hyperparameters can be selected concurrently. Each set of hyperparameters for a given model can be trained independently, making model training an extremely parallelizable process. For instance, each hyperparameter set from grid or random search (non-learning optimizers) is determined independently of the others, whereas Bayesian optimization (a learning optimizer) requires previous results to determine the next best hyperparameter set. Hence, non-learning optimizers can be easily parallelized, while learning optimizers may require a few tricks.
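Why non-learning optimizers parallelize trivially can be sketched in a few lines: each random-search trial is drawn without looking at any other trial's result, so all of them can run at once. This is a local-thread sketch with a mock scoring function; in the framework each trial would run on its own cloud node.

```python
# Sketch: random search (a non-learning optimizer) parallelizes
# trivially because each hyperparameter set is sampled independently.
import random
from concurrent.futures import ThreadPoolExecutor

def train_and_score(params):
    # Stand-in for a real training job on a worker node.
    return {"params": params, "score": -abs(params["lr"] - 0.01)}

def sample_params(rng):
    # No feedback from earlier results is needed to draw a new set.
    return {"lr": rng.uniform(1e-4, 1e-1),
            "batch_size": rng.choice([32, 64, 128])}

rng = random.Random(0)
candidates = [sample_params(rng) for _ in range(8)]

# All eight trials run concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(train_and_score, candidates))

best = max(results, key=lambda r: r["score"])
```

A Bayesian optimizer, by contrast, would need each `score` back before proposing the next candidate, which is what limits its naive parallelism.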

Importance of Agnosticism

The most advanced techniques for model development and optimization change quite rapidly. Therefore, the ideal framework must be modular in order to easily swap to the newest capabilities. For hyperparameter optimization, this requires three primary agnostic components:

Programming language

Optimization library

Model management service

Additionally, for portability or security, there are the following three secondary agnostic requirements:

1. Infrastructure

2. Deployability/containerization

3. Configurable/controllable via RESTful APIs
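The modularity behind the primary list above can be expressed as a thin interface that any optimization library sits behind, so libraries can be swapped without touching the rest of the framework. The class names here are illustrative, not the real APIs.

```python
# Sketch: an optimization-library-agnostic interface. Any library
# (grid, random, Bayesian, ...) can be wrapped to satisfy it.
from abc import ABC, abstractmethod

class Optimizer(ABC):
    @abstractmethod
    def suggest(self):
        """Return the next hyperparameter set, or None when exhausted."""

    @abstractmethod
    def report(self, params, score):
        """Feed back a finished trial so learning optimizers can adapt."""

class GridOptimizer(Optimizer):
    def __init__(self, grid):
        self._pending = list(grid)

    def suggest(self):
        return self._pending.pop(0) if self._pending else None

    def report(self, params, score):
        pass  # grid search does not learn from results

opt = GridOptimizer([{"lr": 0.1}, {"lr": 0.01}])
```

Swapping in a newer optimization technique then means writing one new wrapper class rather than rewriting the service.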

Hyperparameter Optimization at Capital One

Local Development on a Laptop

Today’s ML capabilities make it fairly straightforward for the average user to download the latest data science packages (e.g., scikit-learn) and begin developing models locally. However, it is fairly unlikely for the model to perform optimally, and local hyperparameter optimization may not be scalable depending on the size of the model being optimized, the size of the hyperparameter space being evaluated, the optimization technique being applied, and/or the model training data. These limitations derive from the local bounds on memory and on CPU speed/parallelization of the hyperparameter tasks.
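A typical local workflow looks something like the sketch below: scikit-learn's grid search can fan out across local CPU cores via `n_jobs`, but the laptop's core count and memory cap how far this scales, which is exactly the limitation described above.

```python
# Sketch: local hyperparameter tuning. n_jobs=-1 parallelizes across
# all local cores, but a laptop's cores and memory bound the search.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```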

Manually Controlled Cloud Optimization

Capital One is heavily invested in cloud infrastructure, which means spinning up VMs or containers to parallelize the hyperparameter tasks is relatively easy. These can be controlled entirely within the cloud or locally via remote access. However, managing results and deploying new hyperparameter searches to each container/VM requires an extensive DevOps background. Additionally, security limitations may not allow such communication to occur.

Automated Cloud Optimization

Figure 2: Communication diagram of the hyperparameter optimization framework. The hyperparameter microservice is constantly determining new hyperparameter sets from the optimization service, sending individual hyperparameter sets to each parameter testing node, and sending the results to the model management service.

Here is where things get exciting. Instead of managing the optimization ourselves, why not let a service handle it? I was fortunate enough to help build such a product, and it has only two requirements to begin training:

A GitHub repo with your model

A short JSON configuration script

Below is an example script sent to this hyperparameter tuning microservice:
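The original embedded example is not reproduced here; a hypothetical configuration in the same spirit might look like the following, where all field names are illustrative rather than the service's actual schema:

```json
{
  "model_repo": "https://github.com/your-org/your-model",
  "entry_point": "train.py",
  "optimizer": "bayesian",
  "max_trials": 50,
  "hyperparameters": {
    "learning_rate": {"type": "float", "min": 0.0001, "max": 0.1, "scale": "log"},
    "batch_size": {"type": "choice", "values": [32, 64, 128]},
    "num_hidden_layers": {"type": "int", "min": 1, "max": 5}
  },
  "metric": {"name": "val_accuracy", "goal": "maximize"}
}
```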

Once the script is received by the hyperparameter microservice, it spins up parameter nodes, testing each set of parameters it gets from an optimizer service until it receives a stop command or the optimizer runs out of search parameters. As hyperparameter sets are evaluated, completed training results are simultaneously sent to the optimizer and to the model management service, which can store both the model and its results for later retrieval.
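The control flow just described can be sketched as a simple loop, here with stubbed-in services (the class and function names are illustrative, not the real internal APIs):

```python
# Sketch of the orchestration loop: pull a hyperparameter set from the
# optimizer, run it on a parameter node, report the result back, and
# persist it, until the optimizer is exhausted or a stop is requested.
class OptimizerClient:
    def __init__(self):
        self._trials = [{"lr": 0.1}, {"lr": 0.01}, {"lr": 0.001}]

    def suggest(self):
        # None signals that the search space is exhausted.
        return self._trials.pop(0) if self._trials else None

    def report(self, params, score):
        pass  # a learning optimizer would update its model here

def run_trial(params):
    # Stand-in for training on a parameter testing node.
    return -abs(params["lr"] - 0.01)

optimizer = OptimizerClient()
history = []  # stand-in for the model management service
while (params := optimizer.suggest()) is not None:
    score = run_trial(params)
    optimizer.report(params, score)
    history.append((params, score))
```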

Figure 3: Front-end of the hyperparameter optimization microservice. Once the configuration script is sent to the microservice, the user can submit the configuration for optimization and track its completion status.

This hyperparameter optimization framework makes it easy to:

Submit models to be tuned using the engineer’s desired optimization techniques

Store the trained models

Store the associated hyperparameters, results, and source of the dataset

While the first two points are necessary and useful, it is the model training history that is critical for AutoML, which I’ll address next.

Importance and Application to AutoML

Ultimately, the goal is to allow the services to build models with limited human effort: simply point to data and allow the service to generate the model automatically (AutoML). However, without a knowledge base of which model architectures have succeeded or failed, new models would have to be built from scratch, discarding the knowledge gained from building prior models.

Instead, an AutoML service can be built upon a catalog of previously attempted architectures and results (as mentioned above) to predict which architectures may work best. In other words, a model trained on this knowledge set of previous architectures and attempts can predict our next steps in the architecture search, drawing on all previous history rather than only the history accumulated while training the current model.
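One minimal way to realize this idea is a meta-model fit on the cataloged (hyperparameters, score) pairs and used to rank fresh candidates. The sketch below uses synthetic history; in practice the training data would come from the model management service's stored results.

```python
# Sketch: a meta-model trained on past search history to rank new
# candidate configurations instead of starting from scratch.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Columns: learning rate, number of hidden layers (synthetic catalog).
past_configs = rng.uniform([1e-4, 1], [1e-1, 5], size=(100, 2))
past_scores = 1.0 - np.abs(past_configs[:, 0] - 0.01) \
                  - 0.02 * past_configs[:, 1]

meta_model = RandomForestRegressor(random_state=0)
meta_model.fit(past_configs, past_scores)

# Rank fresh candidates by predicted score; try the most promising first.
candidates = rng.uniform([1e-4, 1], [1e-1, 5], size=(20, 2))
best = candidates[np.argmax(meta_model.predict(candidates))]
```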

Current and Future Applications

To date, this microservice has mostly been applied to the tuning of deep learning models. One use case includes the models mentioned by my colleague in his post, “Why You Don’t Necessarily Need Data for Data Science.” These models, especially GANs, can be brittle due to the dependencies between key/value pairs in the data. Utilizing this service increased success in both structured and unstructured synthetic data generation by as much as 30%, and reduced manual model manipulation time from weeks or months to less than a day.

Figure 4: Example 2D mapping of accuracy in relation to batch size and number of time steps for an RNN when utilizing the hyperparameter microservice. In this case, a larger batch size increased accuracy, while increasing the number of time steps decreased accuracy.

As I mentioned, the hope is to map the hyperparameter space for a given model and use that information to predict ideal architectures for future models. Building such a predictor requires a plethora of data, but with the simplicity of challenger model generation and the storage of each hyperparameter iteration’s results, model, and metadata, this possibility is becoming reality.