DSPs Power Deep Learning SoCs

Graphics processing units (GPUs), and to some extent FPGAs, have generally been deployed for training deep learning neural networks. The computationally intensive training and evaluation process is often done offline in large server farms.

Next, these trained models are deployed into production environments using a hybrid of CPUs and GPUs, or of CPUs and FPGAs. But what about embedded systems in automotive, consumer, and industrial environments that are highly sensitive to both cost and power consumption?

A myriad of embedded applications—ADAS, virtual reality, object recognition, and more—are ripe for deep learning technology. Here, DSPs clearly surpass both GPUs and FPGAs on performance-per-watt benchmarks. Moreover, deep learning chips built on DSP cores offer a more specialized yet still flexible solution compared to general-purpose GPUs and FPGAs.

Two Case Studies

Take CEVA, a supplier of DSP cores for low-power embedded systems, which recently demonstrated a 24-layer convolutional neural network (CNN) running on its XM4 vision processor. According to CEVA, the DSP-based CNN engine delivered nearly three times the performance of a typical hybrid CPU/GPU processing solution.

Furthermore, CEVA claimed that the DSP engine consumed 30 times less power than a GPU and required only about one-fifth of the memory bandwidth. Alongside the CEVA-XM4 imaging and vision processor, the company offers a network generator that converts a trained network into a cost-effective CNN implementation.

CEVA brings deep learning to the embedded space by taking a neural network that has been tuned and trained on a workstation and converting it to run on its DSP-based XM4 processor. The DSP core supplier converts the floating-point operations from the workstation to fixed-point instructions so that they can run more efficiently on the DSP core.
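As a rough sketch of the idea (not CEVA's actual CDNN toolchain), symmetric fixed-point quantization of trained floating-point weights might look like this in Python; the function name and the 8-bit width are illustrative assumptions:

```python
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int = 8):
    """Map float32 weights onto signed fixed-point integers.

    Returns the integer weights plus the scale factor needed to
    recover approximate float values (w_float ~= w_int * scale).
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit
    scale = float(np.max(np.abs(weights))) / qmax
    w_int = np.round(weights / scale).astype(np.int8)
    return w_int, scale

# Quantize a small filter and check the reconstruction error:
# round-to-nearest keeps the worst-case error at or below scale/2.
w = np.array([0.51, -0.23, 0.04, -0.98], dtype=np.float32)
w_q, s = quantize_symmetric(w)
w_rec = w_q.astype(np.float32) * s
print(np.max(np.abs(w - w_rec)))
```

The integer weights then map directly onto the DSP's fixed-point multiply-accumulate units, which is where the efficiency gain over floating-point execution comes from.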

Toronto-based Phi Algorithm Solutions is using the CEVA Deep Neural Network (CDNN) framework for embedded systems, along with the XM4 processor, to implement deep learning in its Universal Object Detector algorithm. The algorithm is now available for applications such as ADAS, pedestrian detection, and facial recognition.

Cadence is another notable DSP player investing heavily in deep learning and CNN applications. The firm claims that a deep learning processor based on its Tensilica Vision P6 DSP core can achieve twice the image frame rate at lower energy consumption compared with commercially available GPUs.

The Vision P6's features—wide-vector SIMD processing, VLIW instruction issue, and fast histogram and scatter/gather operations—make it inherently well suited to demanding deep learning workloads. The processor combines a power-efficient implementation of CNN algorithms with on-the-fly data compression that substantially reduces the memory footprint and bandwidth requirements of neural network layers.
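To show why wide-vector SIMD matters for CNN-style workloads, here is a NumPy analogy (not Tensilica code): a scalar multiply-accumulate loop versus a vectorized formulation in which each filter tap updates a whole vector of outputs in one operation, much as a wide SIMD lane would:

```python
import numpy as np

def conv1d_scalar(x, k):
    """Scalar loop: one multiply-accumulate per inner iteration."""
    n = len(x) - len(k) + 1
    out = np.zeros(n, dtype=np.float64)
    for i in range(n):
        acc = 0.0
        for j in range(len(k)):
            acc += x[i + j] * k[j]
        out[i] = acc
    return out

def conv1d_vectorized(x, k):
    """Vectorized: each tap is applied to many outputs at once,
    analogous to a wide-SIMD DSP operating on whole vectors."""
    n = len(x) - len(k) + 1
    out = np.zeros(n, dtype=np.float64)
    for j, kj in enumerate(k):
        out += kj * x[j:j + n]   # one SIMD-style vector op per tap
    return out

x = np.arange(8, dtype=np.float64)
k = np.array([1.0, 0.0, -1.0])
print(conv1d_vectorized(x, k))  # matches the scalar version
```

The vectorized version performs the same arithmetic with far fewer instruction issues; on a real DSP the compiler or intrinsics would additionally pack the work into VLIW slots.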