Machine learning challenges for computer architecture

Neural networks have become a hot topic in computing and their development is
progressing rapidly. They have a long history with some of the first designs
proposed in the 1940s. But despite being an active area of research since
then, it has not been until the last five to ten years that the field has
started to deliver state-of-the-art results, with deep neural network-based
algorithms displacing conventional machine-learning and programmed ones in many areas.

The recent developments in neural networks, since around 2010, has coincided
with the availability of commodity high-performance GPUs. These devices provide
enough memory and compute that networks can be trained with large datasets, in
the order of hours or days, to perform classification tasks for practical and
interesting problems such as image and speech recognition. Although GPUs have
established themselves as the standard way to accelerate neural networks, they
have done this by transitioning relatively quickly from applications in
traditional HPC, but they are already evolving to meet the needs of machine
learning. In this article I want to discuss some of the challenges that neural
networks and their development present to GPUs, and indeed more generally to
the status quo of computer architecture.

Compute and memory

The fundamental operations of a neural network are floating-point
multiplications and additions. These are used to combine input data with the
parameters of the network that control the influence of connections between
neurons. Modern networks require considerable resources to store millions of
parameters and perform billions of operations per input.

Neurons in fully-connected layers take
weighted sums of their inputs (a multiplication and an accumulation, MAC, for
each input) from every neuron in the previous layer. The number of MACs grows
with the square of the layer size, and the number of layers, so even with
modest numbers of layers and neurons per layer, the number of MACs can be
large. In the AlexNet network, the last three layers are
fully connected with 4,096, 4,096 and 1,000 neurons respectively, requiring
58.6 million parameters and, for the forward pass to classify
a single input image with a trained network, the same number of MACs.

The use of convolutional layers reduces the number of
parameters by sharing a small sets between the neurons. The five
convolutional layers preceding the fully-connected layers in AlexNet contain
just 2.5 million neurons, but require 655.6 million MACs per input. AlexNet was
state of the art in 2009 and networks since then have developed with many more
convolutional layers and a smaller fully connected component, resulting in
relatively slow growth in the number of parameters but significant increases in
the number of MACs. A variant of the VGG network
(2014) with 19 layers (three fully connected) has 143.6 million parameters and
requires a total of 19.6 million MACs in the forward pass. A variant of the
ResNet network (2015) with 50 layers (one fully connected) has
25.5 million parameters and 3.8 billion MACs for the forward pass. More recent
work has demonstrated benefits of networks with more than
1,000 layers.

When a network is being trained, more compute is required by an additional
backwards pass and and memory requirements increase since intermediate values
for each parameter must be maintained from the forward pass.

The challenge for computer architecture here is to deliver the huge number of
MACs required for training and inference, whilst minimising the movement of
data between fast local memory and slower main memory, or via a communication
link. This will of course require corresponding developments in the
implementation of neural networks. A recent result
demonstrated that when data is kept on chip, much better use of GPU compute
resource can be made to achieve an order of magnitude improvement in the depth
of network that could be trained. Another has demonstrated that compute can be traded for a logarithmic reduction of
memory in the number of layers.

Precision

Reducing the precision of arithmetic reduces the cost of memory and compute
since lower-precision floating-point numbers require less bits of storage and
require smaller more power-efficient structures in silicon to implement
arithmetic operations. Recent research has demonstrated that representations
between 8 and 16 bits can deliver similar results to
32-bit precision for inference and training. This has already has an impact on
architecture: Google has claimed a 10x increase in efficiency
with it’s Tensor Processing Unit (TPU) using 8-bit precision,
and Nvidia’s new Pascal architecture supports 16-bit floating-point arithmetic at twice the rate of single precision, and 8-bit integer
arithmetic at four times the rate. Intel have also
recently announced a variant of their Xeon Phi processor,
code named Knights Mill, that will be optimised for deep learning with variable
precision floating-point arithmetic.

Structure

There is no single structure for data movement in deep neural networks. The
simplest networks have connections between adjacent layers, which are evaluated
in sequence, but many more complex structures have been proposed. For example, residual connections provide a
pathway between non-adjacent layers, fractal architectures have self-similar structures at different scales and
entire neural networks can be used as basic building blocks. There can also by dynamism in the structure; dropout prevents overfitting by randomly removing connections during
training to ‘thin’ the network, and networks with stochastic depth randomly exclude subsets of layers during training to
make deep networks more shallow.

These neural-network structures contrast with traditional HPC-style programs,
which have long been the focus of parallel computing research and development
and are characterised by a single structure and algorithm. The
challenge here is for computing hardware and the programming models targeting
it to support complex, highly-connected and potentially dynamic communication structures.

Programming

There are many languages, frameworks and libraries
available for creating deep-learning applications and they are having to
evolve quickly though to keep up with the pace of research. This is a strong
indication that the means by which we program neural networks need to be
general enough to facilitate experimentation but also deliver reasonable
performance so that it is practical to explore different designs and
hyper parameters.

However, there is a gulf between the high-level representations of neural
networks used by researchers and their actual implementation on hardware. For
example, Google’s TensorFlow programming framework is written in
C++ and interfaces with GPUs via an abstraction layer that calls CUDA library
routines. On top of this, Google have released a high-level Python wrapper for
TensorFlow, called TensorFlow-Slim. But despite the
abstraction and generality of the TensorFlow framework, achieving good
computational efficiency on GPUs depends on a heavily-optimised high-level deep
neural network library, such as cuDNN or NEON.
The problem for all high-level programming approaches is that the performance
of neural network designs that cannot exploit an underlying optimised library
directly will degrade significantly. Closing the gap between the methods used
to build neural networks and their mapping to a machine architecture would
deliver more performance for a wider range of programs.

Deployment and portability

Finally, a unique aspect of machine-learning algorithms is the separation
between the phase in which they are trained and their subsequent deployment for
inference. Since training demands more compute and memory resources and is
typically carried out in a data-centre environment where space, power and, to
some extent, time are not constraining issues. A trained neural network can be
deployed in more constrained environments, such as mobile or robotics, where
they may be reacting in real time, to a voice user interface or sensor input
for example, with limited memory and power. They may also continue to learn as
they are exposed to more data.

The result of training is a set of parameter values and portability to another
platform requires the weights to be loaded in an implementation of the same
neural network. The implementation may differ in the numerical precision it
uses since trained networks are known to be robust to low-precision parameter
representations, and doing so takes advantage of the associated memory,
performance and power benefits. A portable neural network might therefore need
separate implementations for training and inference, optimised for the memory
and compute constraints and to be targeted at different machine architectures.
A standardised specification of neural networks, including trained parameters,
would further improve portability between platforms.

There have been some efforts to try to measure aspects of the implementation,
deployment and performance of deep neural networks. In particular Deepmark, which is based on specific networks, and Deepbench, which takes a simpler approach by just looking at important kernels.

In summary

Modern deep neural networks are now state-of-the-art in many application areas
of computing but with their unique characteristics, they pose a significant
challenge to conventional computer architecture. This challenge however is also
an opportunity to build new machines and programming languages that break away
from the status quo of sequential shared-memory von Neumann machines.

Please get in touch (mail @ this domain) with any
comments, corrections or suggestions.