Seamlessly Scale Predictions with AWS Lambda and MXNet

Building AI solutions at scale can be challenging, in this blog we’ll look at how to leverage AWS Lambda and MXNet to build a scalable prediction pipeline.

Companies that leverage machine and deep learning invest in much more than just training models. They have sophisticated pipelines that include the following stages:

Data storage

Pre-processing

Feature extraction

Model generation

Model Analysis

Feature engineering

Evaluation and feedback

Each stage of the pipeline requires:

Elasticity to adapt to changing workload demands

Scalability to adapt well to the overall size of the workload

Cost effectiveness to optimize total cost of ownership (TCO)

Amazon S3 meets all of the requirements for data and model storage. But the unpredictability of user demand and location can make scaling up for batch predictions (or the results of the model analysis) challenging, and can affect the overall user experience. In this post, we show how to use MXNet and AWS Lambda to deploy models at scale for predictions.

What is MXNet?

MXNet is a full-featured, flexibly programmable, and highly scalable deep learning framework that supports state-of-the-art deep models, including convolutional neural networks (CNNs) and long short-term memory networks (LSTMs). It is the result of collaboration between researchers at several top universities, including the founding institutions of the University of Washington and Carnegie Mellon University.

As discussed in Werner Vogel’s MXNet – Deep Learning Framework of Choice at AWS post, not only is MXNet scalable for multi-instance training, it also scales down to a variety of devices and small memory footprints, even when serving predictions on very large models. MXNet is available through open source under the Apache Version 2 license.

Challenges with the prediction pipeline

As previously mentioned, ML model training and validation is just a small part of the story. After the model is built, the real work begins. To service millions of customers seamlessly, every application must scale. However scaling the prediction portion of the pipeline can be challenging and expensive, especially when end users are geographically dispersed.

“Code, test, deploy, and let the service do the heavy lifting” is the Lambda customer motto. Regardless of your traffic patterns—high concurrency or bursty workloads—Lambda scales to service your needs.

We compiled and built the MXNet libraries to demonstrate how Lambda scales the prediction pipeline to provide this ease and flexibility for machine learning or deep learning model prediction. We built a sample application that predicts image labels using an 18-layer deep residual network. The model architecture is based on the winning model in the ImageNet competition called ResidualNet. The application produces state-of-the-art results for problems like image classification.

Putting it all together

Lambda has a deployment package limit of 50 MB. This limit means that you might not always be able to package your models along with the code. To accommodate this limitation, you can use S3 to store the model, and download the model when you service the request.

For optimal performance, you need to download the model outside of the lambda_handler function so that the downloaded file persists across< requests in memory. For subsequent Lambda invocations, MXNet uses the downloaded model that’s already in memory. For more information about this optimization, see AWS Lambda: How It Works.

The following reference Lambda function for prediction is quite simple. It loads the model, downloads image from the specified URL, transforms the image into an NDArray, and uses the model to make a prediction that outputs labels, with associated confidence percentages. The implementation code is provided below.

Because the code must download the model from S3, download an image from the web, and run the image through the model, we benchmarked it to evaluate how it scales. We ran a local benchmark on a laptop in San Francisco using wrk to the endpoint deployed in US West (Oregon) Region (us-west-2). We observed an average latency of 1.18 seconds at a sustained rate of 75 requests per second, as shown in the following output.

To benchmark global prediction latencies, we used Goad, a Lambda-based distributed load tester. We observed average latencies ranging from 1.2 seconds to 1.5 seconds. The following figure shows the observed latencies from various regions with the endpoint hosted in the US West (Oregon) Region (us-west-2).

Conclusion

For a state-of-the-art image label prediction model, the numbers are impressive. The average global latency of 1.5 seconds makes it worth exploring integrating Lambda into your machine learning or deep learning pipeline for batch predictions. The libraries and code samples for deployment are available in the mxnet-lambda GitHub repo.