Serving ML Quickly with TensorFlow Serving and Docker

Serving machine learning models quickly and easily is one of the key challenges when moving from experimentation into production. Serving machine learning models is the process of taking a trained model and making it available to serve prediction requests. When serving in production, you want to make sure your environment is reproducible, enforces isolation, and is secure. To this end, one of the easiest ways to serve machine learning models is by using TensorFlow Serving with Docker. Docker is a tool that packages software into units called containers that include everything needed to run the software.

TensorFlow Serving running in a Docker container

Since the release of TensorFlow Serving 1.8, we’ve been improving our support for Docker. We now provide Docker images for serving and development for both CPU and GPU models. To get a sense of how easy it is to deploy a model using TensorFlow Serving, let’s try putting the ResNet model into production. This model is trained on the ImageNet dataset and takes a JPEG image as input and returns the classification category of the image.

Our example will assume you’re running Linux, but it should work with little to no modification on macOS or Windows as well.

Serving ResNet with TensorFlow Serving and Docker

The first step is to install Docker CE. This will provide you all the tools you need to run and manage Docker containers.

TensorFlow Serving uses the SavedModel format for its ML models. A SavedModel is a language-neutral, recoverable, hermetic serialization format that enables higher-level systems and tools to produce, consume, and transform TensorFlow models. There are several ways to export a SavedModel (including from Keras). For this exercise, we will simply download a pre-trained ResNet SavedModel:

--name tfserving_resnet : Giving the container we are creating the name “tfserving_resnet” so we can refer to it later

--mount type=bind,source=/tmp/resnet,target=/models/resnet : Mounting the host’s local directory (/tmp/resnet) on the container (/models/resnet) so TF Serving can read the model from inside the container.

As you can see, bringing up a model using TensorFlow Serving and Docker is pretty straight forward. You can even create your own custom Docker image that has your model embedded, for even easier deployment.

Improving performance by building an optimized serving binary

Now that we have a model being served in Docker, you may have noticed a log message from TensorFlow Serving that looks like:

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

The published Docker images for TensorFlow Serving are intended to work on as many CPU architectures as possible, and so some optimizations are left out to maximize compatibility. If you don’t see this message, your binary is likely already optimized for your CPU.

Depending on the operations your model performs, these optimizations may have a significant effect on your serving performance. Thankfully, putting together your own optimized serving image is straightforward.

First, we’ll want to build an optimized version of TensorFlow Serving. The easiest way to do this is to build the official Tensorflow Serving development environment Docker image. This has the nice property of automatically generating an optimized TensorFlow Serving binary for the system the image is building on. To distinguish our created images from the official images, we’ll be prepending $USER/ to the image names. Let’s call this development image we’re building $USER/tensorflow-serving-devel:

Building the TensorFlow Serving development image may take a while, depending on the speed of your machine. Once it’s done, let’s build a new serving image with our optimized binary and call it $USER/tensorflow-serving: