Improving the Efficiency and Accuracy of Normalization Schemes in Deep Networks

December 6, 2018

In recent years, batch-normalization has been widely used in deep networks, enabling faster training and higher performance in a wide variety of applications. However, the reasons for these benefits have not been well understood, and several shortcomings have hindered the use of batch-normalization for certain tasks.

In a paper co-authored with Elad Hoffer, Itay Golan, and Daniel Soudry of the Technion – Israel Institute of Technology, we offer a novel view of normalization methods and weight decay as tools to decouple the weights' norm from the underlying optimized objective. Additionally, we improve the use of weight-normalization and show the connection between practices such as normalization, weight decay, and learning-rate adjustments. Finally, we suggest alternatives to the widely used L2 batch-normalization and show that normalizing in the L1 and L∞ spaces can substantially improve numerical stability in low-precision implementations while also providing computational and memory-use benefits. Together, these findings have many implications for increasing training performance while maintaining high accuracy, especially for lower-precision workloads. We have been invited to present this research as a Spotlight Paper and Poster Session at the 2018 Conference on Neural Information Processing Systems (NeurIPS).

Challenges with Current Normalization Methods

Batch-normalization, despite its merits, suffers from several issues, as pointed out by previous work [8][3]. These issues remain unsolved in current normalization methods.

Numerical precision. Though interest in low-precision training continues to increase [14][15], current normalization methods are poorly suited for low precision due to their reliance on L2 normalization, which involves several operations requiring high precision. Using norm spaces other than L2 can alleviate these problems, as we demonstrate in the paper.
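To make this concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) of a batch-norm step that estimates the activation scale with an L1 statistic (mean absolute deviation) instead of the usual L2 standard deviation; the sqrt(π/2) factor makes the two estimates agree for Gaussian activations:

```python
import numpy as np

def batch_norm_l1(x, eps=1e-5):
    """Sketch of an L1-based batch norm for a (batch, features) array.

    The spread of the activations is estimated from the mean absolute
    deviation rather than the L2 standard deviation. The sqrt(pi/2)
    factor makes the L1 estimate match the standard deviation when the
    activations are Gaussian.
    """
    mu = x.mean(axis=0)
    mad = np.abs(x - mu).mean(axis=0)       # mean absolute deviation
    sigma_l1 = np.sqrt(np.pi / 2.0) * mad   # Gaussian-consistent scale
    return (x - mu) / (sigma_l1 + eps)

# Roughly Gaussian activations with non-trivial mean and scale.
x = np.random.RandomState(0).randn(10000, 4) * 3.0 + 1.0
y = batch_norm_l1(x)
```

Because the L1 statistic needs only absolute values and a mean, it avoids the squaring and square-root operations that are hard to compute accurately at low precision.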

Computational costs. The computational overhead of batch-normalization is significant. Previous analysis has found batch-normalization to constitute up to 24% of the computational time needed for an entire model [11]. Further, it can require as much as twice the memory of a non-batch-normalization network during the training phase [12]. Methods like weight-normalization have smaller computational costs, but can result in lower accuracy when used with large-scale tasks [13].

Interplay with other regularization mechanisms. Other regularization mechanisms are typically used in conjunction with batch-normalization. Though earlier studies[6] have shown that explicit regularization, such as weight decay, can improve generalization performance, it is not clear how weight decay interacts with batch-normalization, or if weight decay is even really necessary, as batch-normalization already constrains the output norms[7].
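A small numerical sketch (ours, not from the paper) of why this interplay is subtle: batch-normalization makes a layer's output invariant to the scale of the preceding weights, so weight decay cannot change what the layer computes; it only shrinks the weight norm, which in turn changes the effective step size of subsequent gradient updates:

```python
import numpy as np

def bn(z, eps=1e-8):
    # Plain batch normalization over the batch dimension.
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)

rng = np.random.RandomState(0)
x = rng.randn(256, 8)   # a batch of inputs
w = rng.randn(8, 4)     # weights of a linear layer

# Shrinking the weights (as weight decay would) leaves the
# batch-normalized output essentially unchanged.
out = bn(x @ w)
out_decayed = bn(x @ (0.1 * w))
print(np.abs(out - out_decayed).max())  # effectively zero: BN is scale-invariant
```

Since only the direction of the weights affects the output, shrinking their norm effectively enlarges each gradient step relative to that norm, which is why weight decay's influence on the learning dynamics can be mimicked by learning-rate adjustments, as discussed in the paper.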

Task-specific limitations. A key assumption in batch-normalization is independence between samples appearing in each batch. While this assumption seems to hold for most convolutional networks used to classify images in conventional datasets, it falls short in domains with strong correlations between samples, such as time-series prediction, reinforcement learning, and generative modeling. For example, weight-normalization[8] and layer-normalization[9] were devised to address the finding[10] that batch-normalization required modification for use with recurrent networks.

Improving Batch-Normalization

Our paper makes the following contributions:

We show that we can replace the standard L2 batch-normalization with L1 and L∞ variations with no reduction in accuracy on CIFAR* or ImageNet*. This improves the suitability of batch-normalization for hardware implementations of low-precision neural networks.

We suggest that it is redundant to use weight decay before normalization. We demonstrate that the effect of weight decay on the learning dynamics can be mimicked by adjusting the learning rate or normalization method.

We show that by bounding the norm in the weight-normalization scheme, we can significantly improve its performance in convolutional neural networks (on tasks such as ImageNet* classification) and in long short-term memory (LSTM) networks (on tasks such as WMT14 de-en* translation). This method can alleviate several of batch-normalization's task-specific limitations while also reducing compute and memory costs.
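As an illustration of the norm-bounding idea (a sketch under our own assumptions; the paper's exact scheme may differ in its details), weight normalization reparameterizes a weight as w = g · v/||v||, and bounding the norm amounts to fixing the scale g to a constant rather than learning it:

```python
import numpy as np

def bounded_weight_norm(v, g=1.0):
    """Project each output row of v onto a sphere of fixed radius g.

    Unlike standard weight normalization, the scale g is held constant
    rather than learned, so the effective weight norm stays bounded
    throughout training.
    """
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    return g * v / (norms + 1e-12)

v = np.random.RandomState(0).randn(4, 16)  # unnormalized parameters
w = bounded_weight_norm(v, g=2.0)          # every row now has norm 2
```

Because the statistic depends only on the weights and not on the batch, this scheme sidesteps the sample-independence assumption of batch-normalization.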

Advancing AI on Intel® Architecture

We look forward to discussing these findings with our peers and colleagues at the 2018 Conference on Neural Information Processing Systems. In subsequent work, we extend these results by proposing an even more numerically stable batch normalization, called range batch-norm, in which only the largest and smallest input values need to be calculated. This makes the batch-norm calculation very tolerant of low-precision hardware, since accuracy is not degraded by the max() and min() operations.
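A minimal sketch of the idea (our illustration; the exact scaling constant in the follow-up work may differ): range batch-norm replaces the variance estimate with a scaled batch range, max(x) − min(x), so the statistic needs only comparisons rather than accumulated sums of squares. For Gaussian activations the expected range grows roughly like 2σ·sqrt(2·ln n), which motivates the scaling below:

```python
import numpy as np

def range_batch_norm(x, eps=1e-5):
    """Normalize using the batch range instead of the standard deviation.

    max() and min() involve no accumulation of squares, so the statistic
    stays well-behaved at low precision. The 2*sqrt(2*ln(n)) factor is a
    Gaussian-motivated approximation relating range to standard
    deviation (an assumption of this sketch, not necessarily the
    constant used in the follow-up paper).
    """
    n = x.shape[0]
    mu = x.mean(axis=0)
    batch_range = x.max(axis=0) - x.min(axis=0)
    sigma_hat = batch_range / (2.0 * np.sqrt(2.0 * np.log(n)))
    return (x - mu) / (sigma_hat + eps)

x = np.random.RandomState(0).randn(10000, 4)
y = range_batch_norm(x)
```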

For more on this research, please review our paper, “Norm Matters: Efficient and Accurate Normalization Schemes in Deep Networks,” look for us at the 2018 NeurIPS conference, and stay tuned to https://ai.intel.com and @IntelAIDev on Twitter.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
