Arm support for Android NNAPI gives >4x performance boost

The launch of Arm support for the Android Neural Networks API (NNAPI) sees the release of open-source, optimized neural network operators that deliver significant performance uplift across CPUs and GPUs.

Back in May at Google I/O, we heard the first public news about TensorFlow Lite for Android. This was the first exciting hint of a major new API that will affect the deployment of neural networks on Arm-based platforms supporting Android.

Inference engines are nothing new, but the big change with the announcement of NNAPI is standardized support within Android and the ability to target the wide array of accelerators available from the Arm ecosystem, such as the Arm Mali GPU.

At Arm, we fully support this development and will be releasing support for our Arm Cortex-A CPUs and Mali GPUs from day one. This follows our other efforts to improve the performance of machine learning applications on Arm platforms: the release of the Compute Library at the beginning of the year, and our ongoing engagement with the community of companies and developers that is standardizing approaches and sharing developments in the open.

A tricky problem

The way neural network inference is supported at the high level is deceptively simple. First, a model representing a neural network and its associated weights is provided by the application or ML framework (such as TensorFlow Lite). Then, the Android NN Runtime performs scheduling to determine how the graph should be run: on the CPU, or on any device that has been registered to support neural network computation. After this, the selected device (often the CPU or GPU, and sometimes another accelerator) is given the model to run. Finally, the device breaks the workload down into key operations and runs the inference process on the model, producing the result that the application uses.
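The four stages above can be sketched as a toy model in Python. This is not the NNAPI itself: the names `Device`, `ToyRuntime`, `register`, `schedule` and `infer` are hypothetical, and the policy shown (use the first registered accelerator that supports every operation in the model, otherwise fall back to the CPU) is an assumption made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical compute device with a fixed set of supported operations."""
    name: str
    supported_ops: frozenset

    def supports(self, model):
        # A model is just a list of operation names in this sketch.
        return all(op in self.supported_ops for op in model)

    def run(self, model):
        # Break the workload into operations and "execute" them in order.
        return [f"{self.name}:{op}" for op in model]


class ToyRuntime:
    """Hypothetical scheduler: prefer a registered accelerator, else the CPU."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.accelerators = []

    def register(self, device):
        self.accelerators.append(device)

    def schedule(self, model):
        # Assumed policy: first accelerator that can run the whole model wins.
        for device in self.accelerators:
            if device.supports(model):
                return device
        return self.cpu

    def infer(self, model):
        return self.schedule(model).run(model)


if __name__ == "__main__":
    cpu = Device("cpu", frozenset({"conv2d", "relu", "softmax", "lstm"}))
    gpu = Device("gpu", frozenset({"conv2d", "relu", "softmax"}))
    runtime = ToyRuntime(cpu)
    runtime.register(gpu)
    print(runtime.infer(["conv2d", "relu", "softmax"]))
    print(runtime.infer(["lstm"]))
```

In this sketch the convolutional model lands on the GPU because every one of its operations is supported there, while the `lstm` model falls back to the CPU. A real runtime may make finer-grained decisions than this all-or-nothing choice.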

Figure: An overview of Arm's support for Google's NNAPI

This may appear simple, but our software teams have put in a lot of work to make each stage run well, particularly when it comes to HAL and driver support for Mali GPUs and the heavily optimized operators that run on both the CPU and GPU. These have been carefully tuned by Arm, and are at the heart of the Google CPU backend for Android NN, as well as in the Arm Mali GPU routines provided through our GPU implementation of the Android NN HAL.

The key operators needed for convolutional neural networks are supported, ready to speed up existing applications and open up the possibility for new ones to be deployed. Fortunately, we’ve been building these software components for a long time, so when this new API became available, we were ready.

Heavily optimized

Since the announcement, lots of hard work has been happening at both Arm and Google to make sure that high-performance neural network inference is easy to achieve on Arm platforms. This has culminated in the release of optimized CPU operators for Cortex-A, integrated into Google's framework, and for Arm Mali GPUs, along with an inference engine to run them. What's more, these operators are released as open source and available as part of the Compute Library.

Arm already provides support for 32-bit floating point (fp32), and our NNAPI release improves this support to make neural network computation three times faster. We're also working to support 8-bit integer operations, which will provide around four times the performance of fp32 when running on the Mali GPUs already deployed in most mobile devices.

Additionally, there's ongoing work to add support for further Arm CPUs and GPUs as they are released. For example, Cortex-A55 and Cortex-A75 are beginning to appear in products, and we'll unlock the power of the new Armv8.2-A architecture to give a 4x performance boost to 8-bit convolution and matrix multiply.
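To make the shape of an 8-bit matrix multiply concrete, here is a minimal Python sketch. The text does not say which instructions deliver the boost. The sketch assumes the gain comes from multiplying groups of four 8-bit values and adding the products into a wider accumulator in a single step, which is how the 8-bit dot-product instructions in the Armv8.2-A generation work. The helper names `dot4` and `matmul_int8` are hypothetical, and the code models only the arithmetic, not the speed.

```python
def dot4(acc, a4, b4):
    """Add the dot product of two groups of four int8 values to acc.

    Models one accumulate step; acc stands in for a 32-bit accumulator.
    """
    assert len(a4) == len(b4) == 4
    assert all(-128 <= v <= 127 for v in list(a4) + list(b4))
    return acc + sum(x * y for x, y in zip(a4, b4))


def matmul_int8(a, b):
    """Multiply int8 matrices a (m x k) and b (k x n), k a multiple of 4."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a) and k % 4 == 0
    # Transpose b so that each column is contiguous, like each row of a.
    b_cols = [[b[r][c] for r in range(k)] for c in range(n)]
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(0, k, 4):  # four products per accumulate step
                acc = dot4(acc, a[i][p:p + 4], b_cols[j][p:p + 4])
            out[i][j] = acc
    return out


if __name__ == "__main__":
    print(matmul_int8([[1, 2, 3, 4]], [[1], [1], [1], [1]]))
    print(matmul_int8([[127] * 4], [[127]] * 4))
```

The second call shows why the accumulator has to be wider than the inputs: four products of 127 x 127 already sum to 64516, which does not fit in 16 bits.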

All this is great news for anyone wanting to deploy convolutional neural networks on Arm, as these invariably quantize down to 8-bit with nearly the same accuracy as running in 32-bit, but at notably higher performance.

On top of this, 8-bit data needs less memory bandwidth, and improvements in the memory subsystem lift performance further, whichever Arm platform you choose.
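As a rough illustration of 8-bit quantization and its effect on data size, the sketch below maps fp32 values to unsigned 8-bit integers with a scale and zero point, then maps them back. This is one common affine scheme chosen for illustration. It is an assumption, not necessarily the scheme used by NNAPI or the Compute Library, and the function names are hypothetical.

```python
import struct


def quantize(values, num_bits=8):
    """Affine-quantize floats to unsigned integers; returns (q, scale, zero_point)."""
    lo = min(min(values), 0.0)  # widen the range so 0.0 is always representable
    hi = max(max(values), 0.0)
    qmax = 2 ** num_bits - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


def fp32_size(values):
    """Bytes needed to store the values as 32-bit floats."""
    return len(struct.pack(f"{len(values)}f", *values))


def uint8_size(q):
    """Bytes needed to store the quantized values as 8-bit integers."""
    return len(bytes(q))


if __name__ == "__main__":
    weights = [-0.8, -0.1, 0.0, 0.3, 1.2]
    q, scale, zero_point = quantize(weights)
    restored = dequantize(q, scale, zero_point)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print("quantized:", q)
    print("worst round-trip error:", worst, "with scale", scale)
    print("bytes as fp32:", fp32_size(weights), "bytes as uint8:", uint8_size(q))
```

The quantized values take a quarter of the bytes of their fp32 originals, which is where the bandwidth saving comes from, and for values like these the round-trip error stays below one quantization step.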

Where next?

With the Android 8.1 release, smartphones will immediately gain from the performance improvements made to CPU routines and, on platforms with a Mali GPU, work will automatically be offloaded for even higher performance. This is an area where we continue to optimize, so expect even better performance in future.

We've been working closely with a number of our partners to make this available for use on their platforms, so you'll soon see devices from Huawei and MediaTek that support NNAPI with accelerated Arm CPU and GPU support.

To keep track of developments on machine learning and how to run it on Arm-powered devices, keep an eye on this blog. Expect more on Arm’s machine learning platform and our support for Android soon!
