In Deep Learning (DL), you have a long, computation-intensive training phase where you micro-fiddle/fudge the model parameters until you get desired accuracy. Then you deploy this optimized model parameters (i.e., the Deep Neural Network [DNN])for inference with real-world inputs. The paper is about this inference/serving layer of DL.

In the serving layer, the input goes through the DL with the tuned model parameters activating some subset of neurons at each layer and finally activating the correct neuron[s] at the output layer. This can still be a computation intensive process as the model has millions of parameters, and you apply matrix multiplication layer after layer. So this serving layer still has many juicy problems to work on.

A very relevant problem is that executing inference at the mobile can be slow because of the computational and energy limitations of the mobile. Executing at the cloud backend server is fast, but how do you get the input there? Uploading the input to the cloud can be slow, especially if the input is a large image and the connection is slow. So there is a tradeoff.

In Section 4, the paper shows how beneficial it can be to perform a proper DL inference partitioning. For image processing/computer vision (CV), e.g., AlexNet, partitioning at a middle layer is the most optimal for both latency and energy optimization. Since the input image is large (512Mb is used), uploading it to the cloud is both time and energy consuming. However, if you execute the convolutional layers followed by the pooling at the mobile, you reduce the size of the intermediate output data and it is time and energy efficient to upload this to the cloud. The rest of the computation, carried on the cloud, consists of processing fully connected layers, that are computation intensive. If we were to execute them also on the mobile, we would be waiting for the mobile CPU/GPU to finish execution, where as uploading the intermediate output to the cloud and executing the rest of the layers at the cloud finishes earlier.

The paper also finds that, for Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) applications, usually the best approach is to execute everything at the mobile.

Enter Neurosurgeon

Are we done here then? Why do we need a neurosurgeon tool, if a static lookup can do the trick? At this point, the paper makes another argument. You can't just use this one time static observation per application class (CV, ASR, NLP) and be done with it. The best partition point for a DNN architecture depends on the DNN's topology, which manifests itself in the computation and data size variations of each layer. Moreover, the connectivity conditions are changing, so you need to monitor and adjust your decision with the current network quality.

(The paper also argues that changing cloud backend conditions are a factor, but I am not convinced with the datacenter can get busy/overloaded argument. The evaluation experiments for that part is done synthetically.)

The proposed system to address this problem, Neurosurgeon, consists of a deployment phase and a runtime system that manages the partitioned execution of an inference application. Figure 10 shows the design of Neurosurgeon.

As part of the deployment stage, Neurosurgeon runs once per mobile and server platform for producing performance prediction models. This is application and NN independent. It tries different NN layer types for these mobile and server platforms and estimates regression line wrt changing configuration parameters.

The runtime stage is where the Neurosurgeon uses the layer performance prediction models produced in the deployment stage to dynamically choose the best DNN partition models. Neurosurgeon analyzes the target DNN’s constituent layers, and uses the prediction models to estimate, for each layer, the latency on mobile and cloud, and power consumption on the mobile. As Algorithm 1 shows, this is a simple evaluation of the conditions to choose the partition point.

Figures 11 and 12 show results for latency and energy-efficiency improvements achieved using Neurosurgeon.

Figure 13 shows the comparison results. The paper says: Maui makes incorrect offloading choices for more complicated scenarios (e.g., VGG, FACE, DIG and ASR). This is because Maui relies on past invocation of a certain DNN layer type to predict the latency and data size of the future invocations of that layer type, leading to mispredictions. But why don't we make Maui control points *per layer* method invocations? Instead of making Maui control points per layer type, if we made them per layer number method invocations things might improve for Maui.