NVIDIA’s TensorRT is a deep learning library that has been shown to
provide large speedups when used for network inference. MXNet 1.3.0 is
shipping with experimental integrated support for TensorRT. This means
MXNet users can now make use of this acceleration library to
efficiently run their networks. In this blog post we’ll see how to
install, enable and run TensorRT with MXNet. We’ll also give some
insight into what is happening behind the scenes in MXNet to enable
TensorRT graph execution.

Installing MXNet with TensorRT integration is an easy process. First
ensure that you are running Ubuntu 16.04, that you have updated your
video drivers, and that you have installed CUDA 9.0 or 9.2. You'll need
a Pascal or newer generation NVIDIA GPU. You'll also need to download
and install the TensorRT libraries (instructions
here).
Once these prerequisites are installed and up to date, you can install a
special build of MXNet with TensorRT support enabled via PyPI and pip.
Install the appropriate version by running:

To install with CUDA 9.0:

pip install mxnet-tensorrt-cu90

To install with CUDA 9.2:

pip install mxnet-tensorrt-cu92

If you are running an operating system other than Ubuntu 16.04, or just
prefer to use a Docker image with all prerequisites installed, you can
instead run a prebuilt container. The command below assumes the image
published for this release is named mxnet/tensorrt:
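nvidia-docker run -ti mxnet/tensorrt python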

TensorRT is an inference-only library, so for the purposes of this blog
post we will be using a pre-trained network, in this case a ResNet-18.
ResNets are computationally intensive model architectures that are
often used as backbones for various computer vision tasks, and they are
also commonly used as references for benchmarking deep learning library
performance. In this section we'll use a pretrained ResNet-18 from the
Gluon Model
Zoo
and compare its inference speed with TensorRT enabled against a
baseline with TensorRT integration turned off.

In our first section of code we import the modules needed to run MXNet
and to time our benchmark runs. We then download a pretrained version of
ResNet-18, hybridize it, and load it symbolically. It's important to note
that the experimental version of TensorRT integration will only work
with the symbolic MXNet API. If you’re using Gluon, you must
hybridize
your computation graph and export it as a symbol before running
inference. This may be addressed in future releases of MXNet, but in
general if you’re concerned about getting the best inference performance
possible from your models, it’s a good practice to hybridize.
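A minimal sketch of what this first section of code might look like
(the model variant, file names, and input shape are illustrative, not
taken verbatim from the release):

import os
import time

import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Fetch a pretrained ResNet-18 from the Gluon Model Zoo and hybridize it
# so its computation graph can be exported.
net = vision.resnet18_v2(pretrained=True)
net.hybridize()

# A forward pass traces the graph; export() then writes
# resnet18_v2-symbol.json and resnet18_v2-0000.params to disk.
batch_shape = (1, 3, 224, 224)
net(mx.nd.zeros(batch_shape))
net.export('resnet18_v2')

# Reload the network symbolically for use with the executor API.
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet18_v2', 0)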

For this experiment we are strictly interested in inference performance,
so to simplify the benchmark we’ll pass a tensor filled with zeros as an
input. We then bind a symbol as usual, returning a normal MXNet
executor, and we run forward on this executor in a loop. To help improve
the accuracy of our benchmarks we run a small number of predictions as a
warmup before running our timed loop. This will ensure various lazy
operations, which do not represent real-world usage, have completed
before we measure the relative performance improvement. On a modern PC
with a Titan V GPU, our MXNet baseline takes 33.73s.
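A sketch of that baseline benchmark, continuing with the illustrative
names from above (the warmup and iteration counts are arbitrary):

# Baseline: make sure the experimental TensorRT pass is off.
os.environ['MXNET_USE_TENSORRT'] = '0'

# Bind the symbol as usual, returning a normal MXNet executor.
executor = sym.simple_bind(ctx=mx.gpu(0), data=batch_shape,
                           grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)

# A tensor of zeros stands in for real input data.
data = mx.nd.zeros(batch_shape)

# Warmup: let lazy initialization finish outside the timed loop.
for _ in range(10):
    out = executor.forward(is_train=False, data=data)
    out[0].wait_to_read()

# Timed loop.
start = time.time()
for _ in range(10000):
    out = executor.forward(is_train=False, data=data)
    out[0].wait_to_read()
print('MXNet baseline: %.2fs' % (time.time() - start))

Next we'll run the same model with TensorRT enabled and see how the
performance compares.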

While TensorRT integration remains experimental, we require users to set
an environment variable to enable graph compilation. You can see that at
the start of this test we explicitly disabled TensorRT graph compilation
support. Next, we will run the same predictions using TensorRT. This
requires us to explicitly set the MXNET_USE_TENSORRT environment
variable, and we'll also use a slightly different API to bind our
symbol.

Instead of calling simple_bind directly on our symbol to return an
executor, we call an experimental API from the contrib module of MXNet.
This call is meant to emulate the simple_bind call, and has many of the
same arguments. One difference to note is that this call takes params in
the form of a single merged dictionary to assist with a tensor cleanup
pass that we’ll describe below.

As TensorRT integration improves our goal is to gradually deprecate this
tensorrt_bind call, and allow users to use TensorRT transparently (see
the Subgraph
API
for more information). When this happens, the similarity between
tensorrt_bind and simple_bind should make it easy to migrate your
code.

We run the timed loop with a warmup once more, and on the same machine
inference completes in 18.99s: a 1.8x speed improvement! Speed
improvements when using
libraries like TensorRT can come from a variety of optimizations, but in
this case our speedups are coming from a technique known as operator
fusion.

Behind the scenes a number of interesting things are happening to make
these optimizations possible, and most revolve around subgraphs and
operator fusion. As we can see in the images below, neural networks can
be represented as computation graphs of operators (nodes in the graphs).
Operators can perform a variety of functions, but most run simple
mathematics and linear algebra on tensors. Often these operators run
more efficiently if fused together into a large CUDA kernel that is
executed on the GPU in a single call. What the MXNet TensorRT
integration enables is the ability to scan the entire computation graph,
identify interesting subgraphs and optimize them with TensorRT.
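For intuition, here is a toy symbolic graph; the Convolution ->
BatchNorm -> Activation chain below is a classic fusion candidate,
since an accelerator can execute all three operators in a single CUDA
kernel (the layer sizes are arbitrary):

# Build a small symbolic graph and list its operator nodes.
data = mx.sym.Variable('data')
conv = mx.sym.Convolution(data, num_filter=64, kernel=(3, 3), pad=(1, 1))
bn = mx.sym.BatchNorm(conv)
relu = mx.sym.Activation(bn, act_type='relu')
print(relu.get_internals().list_outputs())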

This means that when an MXNet computation graph is constructed, it will
be parsed to determine if there are any sub-graphs that contain operator
types that are supported by TensorRT. If MXNet determines that there are
one (or many) compatible subgraphs during the graph-parse, it will
extract these graphs and replace them with special TensorRT nodes
(visible in the diagrams below). As the graph is executed, whenever a
TensorRT node is reached the graph will make a library call to TensorRT.
TensorRT will then run its own implementation of the subgraph,
potentially with many operators fused together into a single CUDA
kernel.

During this process MXNet will take care of passing along the input to
the node and fetching the results. MXNet will also attempt to remove any
duplicated weights (parameters) during the graph initialization to keep
memory usage low. That is, if there are graph weights that are used only
in the TensorRT sections of the graph, they will be removed from the
MXNet set of parameters, and their memory will be freed.

The example below shows a Gluon implementation of a WaveNet before and
after a TensorRT graph pass. You can see that for this network TensorRT
supports only a subset of the operators involved. This makes it an
interesting example to visualize, as several subgraphs are extracted and
replaced with special TensorRT nodes. The ResNet used as an example
above would be less interesting to visualize: the entire ResNet graph
is supported by TensorRT, and hence the optimized graph would be a
single TensorRT node. If your browser is unable to render SVG files you
can view the graphs in PNG format:
unoptimized and
optimized.

As mentioned above, MXNet developers are excited about the possibilities
of creating
APIs
that deal specifically with subgraphs. As this work matures it will
bring many improvements for TensorRT users. We hope this will also be an
opportunity for other acceleration libraries to integrate with MXNet.

Thank you to NVIDIA for contributing this feature, and specifically
thanks to Marek Kolodziej and Clement Fuji-Tsang. Thanks to Junyuan Xie
and Jun Wu for the code reviews and design feedback, and to Aaron
Markham for the copy review.