Song Han

Song Han is starting in July 2018 as an assistant professor in the Electrical Engineering and Computer Science Department of the Massachusetts Institute of Technology (MIT). Dr. Han received his Ph.D. in Electrical Engineering from Stanford University, where he was advised by Prof. Bill Dally.
His industry experience includes Google Brain (2017-2018), Facebook (2016), and Apple (2013).

Dr. Han's research focuses on energy-efficient deep learning, at the intersection of machine learning and computer architecture. He proposed Deep Compression, which can compress deep neural networks by an order of magnitude without losing prediction accuracy. He designed EIE (Efficient Inference Engine), a hardware architecture that performs inference directly on the compressed sparse model, which saves memory bandwidth and yields significant speedup and energy savings. His work has been featured by TheNextPlatform, TechEmergence, Embedded Vision and O’Reilly. He led research efforts in model compression and hardware acceleration for deep learning that won the Best Paper Award at ICLR’16 and the Best Paper Award at FPGA’17. Before joining Stanford, Song graduated from Tsinghua University.

I will join MIT EECS as an assistant professor starting July 2018 (MIT news). I'm looking for PhD students interested in deep learning and computer architecture. I also have multiple openings for summer interns starting from July 1, 2018. If you are interested in working with me during summer 2018, drop me an email at FirstnameLastname [at] mit [dot] edu with your CV, publications AND a research proposal; a demo or proof of concept would be a plus.

MIT Hardware Intelligence Lab (HAN's Lab):

H: High performance, High energy efficiency Hardware

A: Architectures and Accelerators for Artificial Intelligence

N: Novel algorithms for Neural Networks and Deep Learning

News

Jan 29, 2018: Deep Gradient Compression was accepted by ICLR’18. This technique can reduce the communication bandwidth by 500× and improve the scalability of distributed training. [slides]

May 4, 2016: Song received the Best Paper Award at the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico.

Research Interest

I'm interested in application-driven, domain-specific computer architecture research. The end of Dennard scaling makes power the key constraint. I'm interested in achieving higher efficiency by tailoring the architecture to the characteristics of the application domain. My current research centers on co-designing efficient algorithms and hardware systems for machine learning, to free AI from power-hungry hardware beasts, democratize AI on cheap mobile devices, and reduce the cost of running deep learning in data centers. I enjoy the research intersections across machine learning algorithms, computer architecture and VLSI design.

Research Projects

Pruning & Sparse NN: Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. Conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address these limitations, we describe a method to reduce the storage and computation required by neural networks by an order of magnitude by learning only the important connections. This reduced the number of parameters of AlexNet by a factor of 9× and that of VGGNet by 13×, without affecting their accuracy.
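As a rough illustration of keeping only the important connections, here is a minimal sketch that assumes importance is measured by weight magnitude (an assumption, since the paragraph above does not say how importance is decided). The function name `magnitude_prune`, the `keep_fraction` parameter and the toy weights are hypothetical, and the retraining that would normally follow pruning is omitted.

```python
def magnitude_prune(weights, keep_fraction):
    """Keep the largest-magnitude weights and zero out the rest.

    Assumption of this sketch: "important" means "large in absolute value".
    Returns (pruned_weights, mask), where mask[i] is 1 for a kept connection.
    """
    n_keep = max(1, round(len(weights) * keep_fraction))
    # Rank connection indices from largest to smallest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:n_keep])
    mask = [1 if i in kept else 0 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


# Hypothetical layer with nine weights; keeping one third gives a 3x reduction.
weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02, 0.15, -0.6, 0.04]
pruned, mask = magnitude_prune(weights, keep_fraction=1 / 3)
reduction = len(weights) / sum(mask)
print(pruned)
print(f"parameter reduction: {reduction:.0f}x")
```

In a real network the mask would be held fixed while the surviving weights are retrained, so that the remaining connections can compensate for the removed ones.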

Deep Compression: Large deep neural network models improve prediction accuracy but create a large demand for memory access, which is 100× more power hungry than ALU operations. “Deep Compression” introduces a three-stage pipeline (pruning, trained quantization and Huffman coding) whose stages work together to reduce the storage requirement of deep neural networks. On the ImageNet dataset, AlexNet was compressed by 35×, from 240MB to 6.9MB, and VGGNet by 49×, from 552MB to 11.3MB, without affecting their accuracy. This algorithm helps put deep learning into mobile apps.
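The last two stages of the pipeline can be sketched on a toy list of already-pruned weights. This is a simplified illustration, not the published implementation: it assumes quantization is approximated by 1-D k-means weight sharing with linearly spaced initial centroids and no fine-tuning of the centroids, and it ignores the storage cost of the codebook and of sparse indices. The function names and the toy weights are hypothetical.

```python
import heapq
from collections import Counter


def quantize(weights, n_clusters, n_iter=10):
    """1-D k-means weight sharing; needs n_clusters >= 2.

    Returns (codebook, idx), where idx[i] is the codebook entry used by weights[i].
    """
    lo, hi = min(weights), max(weights)
    # Linearly spaced initial centroids (an assumption of this sketch).
    codebook = [lo + (hi - lo) * k / (n_clusters - 1) for k in range(n_clusters)]

    def assign():
        return [min(range(n_clusters), key=lambda k: abs(w - codebook[k]))
                for w in weights]

    for _ in range(n_iter):
        idx = assign()
        for k in range(n_clusters):
            members = [w for w, i in zip(weights, idx) if i == k]
            if members:
                codebook[k] = sum(members) / len(members)
    return codebook, assign()


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (count, tie-breaker, {symbol: depth so far}).
    heap = [(c, n, {s: 0}) for n, (s, c) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


# Toy pruned weights: frequent values receive shorter Huffman codes.
weights = [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.9]
codebook, idx = quantize(weights, n_clusters=3)
lengths = huffman_code_lengths(idx)
huffman_bits = sum(lengths[i] for i in idx)
fixed_bits = 2 * len(idx)  # 2 bits per index for a 3-entry codebook
print(f"index storage: {huffman_bits} bits with Huffman vs {fixed_bits} bits fixed-width")

# Sanity check of the ratios quoted above.
for name, before_mb, after_mb in [("AlexNet", 240, 6.9), ("VGGNet", 552, 11.3)]:
    print(f"{name}: {before_mb / after_mb:.1f}x")
```

The two quoted ratios come out as roughly 34.8× and 48.8×, consistent with the rounded 35× and 49× figures.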

Efficient Speech Recognition Engine (ESE): ESE takes the approach of EIE one step further to address not only feedforward neural networks but also recurrent neural networks (RNN and LSTM). The recurrent nature produces complicated data dependencies, which are more challenging to handle than in feedforward neural nets. To deal with this problem, we designed a data flow that can effectively schedule the complex LSTM operations using multiple EIE cores. ESE also presents an effective model compression algorithm for LSTMs with hardware efficiency considerations, which compresses the LSTM by 20× without hurting accuracy. Implemented on a Xilinx XCKU060 FPGA running at 200MHz, ESE has a processing power of 282 GOPS working directly on a compressed sparse LSTM network, corresponding to 2.52 TOPS on an uncompressed dense network.
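One way to read the two throughput figures is that a sparse engine performs only the multiply-accumulates for nonzero weights, while the dense-equivalent number also counts the operations it skips. The sketch below illustrates this with a sparse matrix-vector product. The CSR layout, the function names and the toy matrix are assumptions made for illustration, not a description of the storage format or data flow ESE actually uses.

```python
def to_csr(dense):
    """Convert a dense row-major matrix to (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by vector x, touching only nonzero weights."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Hypothetical 3x4 weight matrix with 3 nonzeros out of 12 entries.
dense = [[0, 2, 0, 0],
         [1, 0, 0, 3],
         [0, 0, 0, 0]]
x = [1, 2, 3, 4]
values, col_idx, row_ptr = to_csr(dense)
y = csr_matvec(values, col_idx, row_ptr, x)

sparse_macs = len(values)                  # work actually performed
dense_macs = len(dense) * len(dense[0])    # work a dense engine would perform
print(y, f"{dense_macs / sparse_macs:.0f}x fewer multiply-accumulates")

# Ratio between the two throughput figures quoted above.
dense_equivalent_ratio = 2.52e12 / 282e9
print(f"dense-equivalent / sparse throughput: {dense_equivalent_ratio:.2f}x")
```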

Dense-Sparse-Dense Training (DSD): A critical issue for training large neural networks is to prevent overfitting while at the same time providing enough model capacity. We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks to achieve higher accuracy. DSD training can improve the prediction accuracy of a wide range of neural networks (CNNs, RNNs and LSTMs) on the tasks of image classification, caption generation and speech recognition. The DSD training flow produces the same model architecture and does not incur any inference-time overhead.
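The three phases can be sketched on a toy linear model trained with full-batch gradient descent. This only shows the shape of the flow: the choice of magnitude-based pruning for the sparse phase, restarting pruned weights from zero in the final dense phase, and all names and data below are assumptions of this sketch. A noise-free linear model cannot show the accuracy gain described above.

```python
def gd_step(w, xs, ys, lr, mask=None):
    """One full-batch gradient-descent step on mean squared error for y = w . x."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2 * err * xj / len(xs)
    w = [wi - lr * g for wi, g in zip(w, grad)]
    if mask is not None:
        # Sparse phase: pruned connections are held at zero.
        w = [wi if m else 0.0 for wi, m in zip(w, mask)]
    return w


def dsd_train(xs, ys, n_features, sparsity, steps=100, lr=0.5):
    """Return (weights after the sparse phase, weights after the final dense phase)."""
    # Phase 1 (dense): train all weights.
    w = [0.0] * n_features
    for _ in range(steps):
        w = gd_step(w, xs, ys, lr)
    # Phase 2 (sparse): prune the smallest-magnitude weights, retrain under the mask.
    n_prune = round(n_features * sparsity)
    order = sorted(range(n_features), key=lambda i: abs(w[i]))
    pruned = set(order[:n_prune])
    mask = [0 if i in pruned else 1 for i in range(n_features)]
    w = [wi if m else 0.0 for wi, m in zip(w, mask)]
    for _ in range(steps):
        w = gd_step(w, xs, ys, lr, mask)
    w_sparse = list(w)
    # Phase 3 (dense): drop the mask; pruned weights restart from zero.
    for _ in range(steps):
        w = gd_step(w, xs, ys, lr)
    return w_sparse, w


# Hypothetical data generated by y = 2*x0 - 1*x1 + 0.1*x2.
xs = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1]]
ys = [2.0, -1.0, 0.1, 1.0, -0.9, 2.1]
w_sparse, w_final = dsd_train(xs, ys, n_features=3, sparsity=1 / 3)
print(w_sparse)
print(w_final)
```

The final weight vector has the same length as the initial one, which mirrors the point that the resulting model keeps its original architecture.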

Trained Ternary Quantization (TTQ): The deployment of large neural network models can be difficult on mobile devices with limited power budgets. To solve this problem, we propose Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values. This method has very little accuracy degradation and can even improve the accuracy of some models. We highlight our trained quantization method that can learn both ternary values and ternary assignment. During inference, our models are nearly 16× smaller than full-precision models.
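A minimal sketch of mapping weights to three values follows. It assumes a threshold proportional to the largest weight magnitude and takes the positive and negative scales as fixed inputs; in the method described above those values are learned during training, which this sketch does not attempt. The names `ternarize`, `w_pos`, `w_neg` and `t` are hypothetical.

```python
def ternarize(weights, w_pos, w_neg, t=0.05):
    """Map each weight to one of {-w_neg, 0, +w_pos}.

    Assumptions of this sketch: the threshold is t * max|w|, and the scales
    w_pos and w_neg are given rather than learned.
    Returns (codes, values), with codes in {-1, 0, 1}.
    """
    delta = t * max(abs(w) for w in weights)
    codes = [1 if w > delta else -1 if w < -delta else 0 for w in weights]
    values = [w_pos if c == 1 else -w_neg if c == -1 else 0.0 for c in codes]
    return codes, values


weights = [0.8, -0.02, -0.5, 0.03, 0.01]
codes, values = ternarize(weights, w_pos=0.7, w_neg=0.6)
print(codes)
print(values)

# Three states fit in 2 bits, versus 32 bits for a single-precision float.
size_ratio = 32 / 2
print(f"{size_ratio:.0f}x smaller per weight")
```

The 2-bit encoding gives 16× per weight, which is consistent with "nearly 16×" once the small overhead of storing the scales is counted.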

SqueezeNet: Smaller CNN models are easier to deploy on mobile devices. SqueezeNet is a small CNN architecture that achieves AlexNet-level accuracy on ImageNet with 50× fewer parameters. Together with model compression techniques, we are able to compress SqueezeNet to less than 0.5MB (510× smaller than AlexNet), which fits entirely in on-chip SRAM, making it easier to deploy on embedded devices.
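The size figures can be checked with a few lines of arithmetic, taking the 240MB AlexNet size quoted in the Deep Compression paragraph as the baseline. The uncompressed SqueezeNet size below assumes the same number of bytes per parameter as AlexNet, which is an assumption of this sketch.

```python
alexnet_mb = 240  # AlexNet size quoted in the Deep Compression paragraph

# 50x fewer parameters at the same precision (assumed) gives the uncompressed size.
squeezenet_mb = alexnet_mb / 50
# 510x smaller than AlexNet after compression.
compressed_mb = alexnet_mb / 510

print(f"uncompressed SqueezeNet: about {squeezenet_mb:.1f} MB")
print(f"compressed SqueezeNet: about {compressed_mb:.2f} MB")
```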

Pruning Winograd Convolution: Winograd’s minimal filtering algorithm and network pruning both reduce the operations in CNNs. Unfortunately, these two methods cannot be combined directly. We propose two modifications to Winograd-based CNNs to enable these methods to exploit sparsity. First, we prune the weights in the “Winograd domain” to exploit static weight sparsity. Second, we move the ReLU operation into the “Winograd domain” to improve the sparsity of the transformed activations. On CIFAR-10, our method reduces the number of multiplications in the VGG-nagadomi model by 10.2× with no loss of accuracy.
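The smallest standard case, 1-D F(2,3), shows why sparsity has to be created in the Winograd domain: a filter with a single nonzero tap becomes mostly nonzero after the transform. The sketch below uses the standard F(2,3) transforms, which compute two outputs with four multiplications instead of six. The tile size, function names and toy data are assumptions chosen for illustration, and the ReLU step at the end is only a count of the resulting zeros, not a trained network.

```python
def transform_filter(g):
    """Map a 3-tap filter into the 4-element Winograd domain."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]


def transform_input(d):
    """Map a 4-element input tile into the Winograd domain."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]


def winograd_f23(U, V):
    """Element-wise multiply (4 multiplications), then the output transform."""
    m = [u * v for u, v in zip(U, V)]
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


def direct_conv(d, g):
    """Reference: two outputs of a 3-tap filter (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


def nnz(v):
    return sum(1 for x in v if x != 0)


d = [1.0, 2.0, 3.0, 4.0]
g = [1.0, 0.0, 0.0]          # spatially pruned filter: 1 nonzero tap
U = transform_filter(g)
V = transform_input(d)
y = winograd_f23(U, V)
print(y, direct_conv(d, g))  # both give the same outputs
print(f"nonzeros: {nnz(g)} in the spatial filter, {nnz(U)} after the transform")

# Applying ReLU to the transformed activations creates zeros there, and a
# multiplication is needed only where both operands are nonzero.
V_relu = [max(0.0, v) for v in V]
mults = sum(1 for u, v in zip(U, V_relu) if u != 0 and v != 0)
print(f"multiplications needed with Winograd-domain sparsity: {mults} of 4")
```

Moving ReLU after the input transform changes the function the layer computes, so a network using it has to be trained in that form rather than converted from a spatial-domain model.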