Artificial neural networks have been around for a long time (since either the 1940s or the 1950s, depending on how you count). But they’ve only started to be used for practical applications such as image recognition in the last few years. Some of the recent progress is based on theoretical breakthroughs such as convolutional neural networks, but a much bigger factor seems to be hardware: It turns out that small neural networks aren’t that much better than many simpler machine learning algorithms. Neural networks only excel when you have much more complex data and a large, complex network. But up until recently, the available hardware simply couldn’t handle such complexity. Moore’s law helped with this, but an even bigger part has been played by a type of chip called a GPU, or Graphics Processing Unit. These were originally designed to speed up computer animations, but they can also be used for other types of processing. In some cases, GPUs can be as much as 100 times as fast as standard CPUs. However, it turns out you only get this speedup with a fairly narrow category of tasks, many of which happen to be necessary for processing neural networks. In this post, I want to discuss what types of tasks these are and why GPUs are so much faster at them.

Let’s start with the way traditional CPUs work, keeping in mind that I’m not a hardware expert, so much of what I’m going to say will be intentionally vague. Whenever your computer is running, your CPU is endlessly following a list of very simple instructions involving external inputs and outputs (RAM, hard disk, your Wi-Fi card, etc.) and a small amount of memory that’s internal to the CPU, called registers. The number of registers is usually pretty small: for example, Intel’s Core i7 processor has 16 64-bit general-purpose registers.

The instructions that the CPU follows are along the lines of “Add the values in registers 1 and 2, then save the result in register 3” or “Copy the value at the memory location defined by register 1 into register 2” or “If the value of register 1 is greater than the value in register 2 then jump to the instruction number saved in register 3.” So if, for example, you wanted to add together two vectors in a 100-dimensional space, you would have to read each coordinate for each vector from RAM into a register, add the numbers, then save each value back into RAM.
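To make the one-coordinate-at-a-time picture concrete, here is a minimal sketch in Python. It is only a toy model: "RAM" is a plain list, the "registers" are local variables, and the function name and memory layout are made up for illustration. The step counter tallies simple instructions, not real clock cycles, since real CPUs overlap much of this work.

```python
# Toy model only: "RAM" is a Python list and "registers" are local variables.
def add_vectors_scalar(ram, a_start, b_start, out_start, n):
    """Add two n-element vectors stored in ram, one coordinate at a time."""
    steps = 0
    for i in range(n):
        r1 = ram[a_start + i]    # load a coordinate of the first vector
        r2 = ram[b_start + i]    # load the matching coordinate of the second
        r3 = r1 + r2             # add the two registers
        ram[out_start + i] = r3  # store the result back to RAM
        steps += 4               # two loads, one add, one store
    return steps

n = 100
# Hypothetical layout: vector A, then vector B, then space for the result.
ram = [float(i) for i in range(n)] + [1.0] * n + [0.0] * n
steps = add_vectors_scalar(ram, 0, n, 2 * n, n)
print(steps)                 # 400 simple instructions for 100 coordinates
print(ram[2 * n:2 * n + 3])  # [1.0, 2.0, 3.0]
```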

Many modern CPUs have multiple cores, each of which is simultaneously and independently doing what I described above. In theory, this could speed things up a bit by doing multiple coordinates at the same time, but in practice, coordinating multiple cores is complicated enough that it’s more common to have the different cores working on completely different tasks rather than different parts of the same task. Also, the number of cores tends to be small (between 2 and 6 seems pretty typical).

A second type of parallelism that many processors can take advantage of is what’s called Single Instruction, Multiple Data (SIMD) architecture. A SIMD instruction applies the same operation to several values at once, packed side by side in one wide register. So, it might add the first four values of the vectors in a single cycle, then the next four and so on. This can cut the number of cycles dramatically, but the number of values handled per instruction is limited by the width of the registers, usually to around 4 or 8, so we’re still far from a 100-times speedup.
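The chunked addition described above can be sketched as follows. This is a simulation in plain Python, not real SIMD code: the function name and the `lanes` parameter are assumptions for illustration, and each pass through the loop stands in for one vector instruction.

```python
def add_vectors_simd(a, b, lanes=4):
    """Add two equal-length vectors, `lanes` coordinates per simulated instruction."""
    out = []
    vector_adds = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]  # pack up to `lanes` values from a
        chunk_b = b[start:start + lanes]  # pack the matching values from b
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))  # one wide add
        vector_adds += 1
    return out, vector_adds

a = [float(i) for i in range(100)]
b = [1.0] * 100
result4, adds4 = add_vectors_simd(a, b, lanes=4)
result8, adds8 = add_vectors_simd(a, b, lanes=8)
print(adds4, adds8)  # 25 13
```

With 4 lanes the 100 additions shrink to 25 wide additions, and with 8 lanes to 13, which is a real saving but nowhere near a factor of 100.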

Instead, the speedup comes from two major ways in which GPUs differ from CPUs. The first is that rather than having a small number of registers, a GPU has a large chunk of internal memory that it can operate on directly. So if, say, you’re going to do a lot of processing involving a collection of vectors that fits into the GPU’s internal memory, then you can save the time of shuffling the values back and forth to/from RAM. Of course, this alone only gives you a small speedup, since passing values to/from memory only takes a fraction of a CPU’s time.

The big speedup comes from the fact that each time a GPU performs an operation, it can do it many times simultaneously. And it’s a lot more than the 2 to 6 cores of a CPU: 64 seems to be a typical number of operations a GPU can do in parallel. Rather than an instruction like “Add register 1 to register 2” like the CPU had, a GPU instruction may be something like “Add the values in locations 1-64 to the values in locations 65-128, and save them in locations 129-192.” And this operation is done in a single step, simultaneously by 64 separate circuits within the GPU. In other words, you can think of a GPU as having a row of CPUs that (unlike the multiple cores in a CPU) all follow the same instruction at the same time on different parts of the internal memory.

So now, when we add those 100-dimensional vectors, instead of reading in 200 values, adding them in 100 separate cycles, then transferring 100 values back to RAM for a total on the order of 100 consecutive operations (not to mention a bunch of overhead I’m glossing over), we only need two cycles of the GPU. We would still need to transfer the values in and out of the GPU’s internal memory, but if we’re doing a lot of processing on the same vectors, we can minimize this time by keeping them in the GPU’s memory until we’re done with them.
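As a rough sketch of the two-step count, the following Python simulation treats the GPU’s internal memory as a list and models one instruction as up to 64 lanes doing the same addition at different locations. The names `lockstep_add` and `add_vectors_gpu`, the memory layout, and the fixed width of 64 (the figure used in this post; real hardware varies) are all assumptions for illustration.

```python
WIDTH = 64  # lanes per step, the figure used in the post; real hardware varies

def lockstep_add(memory, a, b, out, count):
    """One simulated GPU instruction: every active lane i computes
    memory[out + i] = memory[a + i] + memory[b + i]."""
    assert count <= WIDTH
    # Compute every lane's result before writing, as if all lanes ran at once.
    results = [memory[a + lane] + memory[b + lane] for lane in range(count)]
    for lane, value in enumerate(results):
        memory[out + lane] = value

def add_vectors_gpu(memory, a, b, out, n):
    """Add two n-element vectors held in memory, WIDTH coordinates per step."""
    steps = 0
    for offset in range(0, n, WIDTH):
        active = min(WIDTH, n - offset)  # the last step may use fewer lanes
        lockstep_add(memory, a + offset, b + offset, out + offset, active)
        steps += 1
    return steps

n = 100
# Hypothetical layout: vector A, then vector B, then space for the result.
memory = [float(i) for i in range(n)] + [2.0] * n + [0.0] * n
gpu_steps = add_vectors_gpu(memory, 0, n, 2 * n, n)
print(gpu_steps)                # 2
print(memory[2 * n:2 * n + 3])  # [2.0, 3.0, 4.0]
```

The first step handles coordinates 0 to 63 and the second handles the remaining 36, leaving 28 lanes idle in that second step.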

So tasks that involve doing the same thing at the same time to lots of different data (such as vector and matrix operations) can be done much faster on GPUs. In fact, it’s because matrix operations are so important to computer graphics that GPUs were designed this way. Note that GPUs tend to be slower than CPUs in terms of the number of cycles per second, plus they lack many optimization features that modern CPUs have. So for tasks that can’t take advantage of parallelism – i.e. almost everything other than vector and matrix operations – CPUs are much faster. That’s why the computer you’re working on right now has a CPU at its center instead of a GPU.

But the processes involved in training and evaluating a neural network happen to fit very nicely into the vector/matrix genre. The “knowledge” in a neural network is defined by the weights on the connections between neurons. For example, in a network with rows of neurons, the weights between successive rows are defined by a matrix in which the entry at position (i, j) is the weight from the ith neuron in the first row to the jth neuron in the second row. Each row, in turn, defines a vector, and we calculate the output from each neuron by multiplying the outputs of the first row by this matrix, then applying a non-linear function to the resulting vector. We do this for each successive row until we get to the end of the network. Training the network via back-propagation is another process involving these same vectors and matrices.
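The row-by-row evaluation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full implementation: the function names are made up, tanh is just one possible choice of non-linear function, and bias terms are left out because the description above does not mention them.

```python
import math

def layer_forward(inputs, weights, activation=math.tanh):
    """Compute one row's outputs from the previous row's outputs.

    weights[i][j] is the weight from neuron i in the first row
    to neuron j in the second row, as in the text.
    """
    n_out = len(weights[0])
    # Vector-matrix product: output j sums inputs[i] * weights[i][j] over i.
    sums = [sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
            for j in range(n_out)]
    # Apply the non-linear function to each entry of the resulting vector.
    return [activation(s) for s in sums]

def network_forward(inputs, weight_matrices):
    """Feed the input vector through each successive row of the network."""
    outputs = inputs
    for weights in weight_matrices:
        outputs = layer_forward(outputs, weights)
    return outputs

# Tiny example: 2 inputs -> 3 hidden neurons -> 1 output (weights are arbitrary).
w1 = [[0.5, -0.5, 0.25],
      [0.25, 0.75, -1.0]]
w2 = [[1.0], [-1.0], [0.5]]
prediction = network_forward([1.0, 2.0], [w1, w2])
print(len(prediction))  # 1
```

Every multiply-and-add inside `layer_forward` is independent of the others, which is what makes this computation such a good match for hardware that applies one instruction to many values at once.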

As a result, it’s possible in practice to work with much larger neural networks than would be otherwise possible, even after a few more decades of Moore’s Law. This is important, for example, in image processing where the first row alone (i.e. the input) contains thousands of neurons. Things still get tricky when the networks get too big to fit in the memory of a single GPU. At that point multiple GPUs are required to store the network, and data must be transferred between them, which becomes the major bottleneck. But that’s a whole different story. For now, this is at least the rough idea behind why GPUs have been one of the main drivers of the recent success of large-scale neural networks.

11 Responses to GPUs and Neural Networks

I know you said this is just a cursory and vague description of how things work, but this description of the CPU’s operation is still wildly inaccurate.
Most modern CPUs are out-of-order superscalar processors. This means they are able to execute multiple independent instructions per cycle (IPC). x86 can in theory do up to 8, but a more realistic upper bound is somewhere around 4 IPC. So instead of 400 cycles in your example, it might need only 100 cycles, depending on circumstances.
Another thing: CPUs support SIMD instructions that operate on a vector of values at a time (similar to what GPUs do). One SIMD instruction can, for example, add 2 vectors containing 8 32-bit floats each. And in some cases multiple SIMD instructions can be dispatched in one cycle, again reducing the number of cycles needed.
But if you do this computation on data too big to fit into the CPU cache, you will most likely run out of bandwidth to main memory.

Thanks for the correction. I updated the post to include a description of SIMD capabilities and lowered the estimate of the number of instructions. (Though for the record, I originally said it was on the order of 400, not exactly 400. I only meant to suggest that it was the same order of magnitude.)

This is a very interesting post. I was particularly fascinated by your comments about the way in which a processor might add two vectors together, i.e. by taking two of the components from the memory, adding them, then storing that summed value back into the memory.

A question I now have is why they might refrain from building chips with more cores on each? I feel like I’ve heard of certain server hardware containing more than 2-6 cores these days, but utilizing the parallel cores to perform the same task more quickly seems like a difficult concept. For instance, how would one core know where to start or stop in relation to the others? Perhaps another core could be used to keep track of their operations and make sure they don’t overlap?

I’m currently a math major, but I’m thinking about picking up a double major of computer engineering in addition to math next year. This is some very interesting stuff.

That’s right: utilizing multiple cores to carry out a single task is fairly difficult, and requires very carefully thought-out coordination. The techniques are usually described in terms of threads, which are more abstract than cores. In particular, a single core will usually control multiple threads, switching back and forth between them and saving their state while they’re dormant. Look up concurrency on Wikipedia for a start.

For the type of parallel processing that a GPU does, the independent processors are, in some sense, all controlled centrally and in a very limited way, so you know they will all finish each step at the same time, and it’s easier to ensure that the processes are independent.