Features of FBGEMM

FBGEMM is a low-precision, high-performance matrix-matrix multiplication and convolution library that enables large-scale production servers to run the most powerful deep learning models efficiently.

The library exploits opportunities to overcome the unique challenges that matrix multiplication at lower precision poses when it is combined with bandwidth-bound pre- and post-GEMM operations.

At Facebook, FBGEMM has benefited many AI services: it has increased the speed of English-to-Spanish translations by 1.3x, reduced DRAM bandwidth usage in the recommendation system used in feeds by 40%, and sped up character detection by 2.4x in Rosetta, the machine learning system for understanding text in images and videos.

FBGEMM supplies modular building blocks from which the overall GEMM pipeline an application needs can be constructed by plugging in different front-end and back-end components. It combines small compute with bandwidth-bound operations, exploits cache locality by fusing post-GEMM operations with the macro kernel, and provides support for accuracy-loss-reducing operations.
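The fusion idea is easiest to see in a small sketch. The following is not FBGEMM's actual API; it is a toy macro kernel with a hypothetical requantization functor, showing how a post-GEMM operation can be applied while the 32-bit accumulator tile is still cache-resident instead of in a separate pass over memory.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical post-GEMM operation: requantize a 32-bit accumulator
// back to int8 with a single scale, applied while the tile is in cache.
struct Requantize {
  float scale;
  int8_t operator()(int32_t acc) const {
    int32_t v = static_cast<int32_t>(std::round(acc * scale));
    if (v > 127) v = 127;
    if (v < -128) v = -128;
    return static_cast<int8_t>(v);
  }
};

// Toy "macro kernel": computes one M x N output tile in int32 and
// immediately applies the post-GEMM operation, instead of storing the
// int32 results to DRAM and post-processing them in a second pass.
template <typename PostOp>
void gemm_tile_fused(const uint8_t* A, const int8_t* B, int8_t* C,
                     int M, int N, int K, PostOp post) {
  std::vector<int32_t> acc(static_cast<size_t>(M) * N, 0);
  for (int m = 0; m < M; ++m)
    for (int k = 0; k < K; ++k)
      for (int n = 0; n < N; ++n)
        acc[m * N + n] += static_cast<int32_t>(A[m * K + k]) * B[k * N + n];
  for (int i = 0; i < M * N; ++i)
    C[i] = post(acc[i]);  // fused epilogue: int32 tile never round-trips to DRAM
}
```

Because the epilogue is a template parameter, different back-end post-GEMM operations (requantization, bias addition, activation functions) can be plugged into the same compute kernel, which is the "plug and play" structure described above.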

Why does GEMM matter?

Floating-point operations (FLOPs) in the deep learning models deployed in Facebook's data centers are mostly consumed by fully connected (FC) operators. These FC operators are just plain GEMM, which means that their overall efficiency directly depends on GEMM efficiency. Convolution, which accounts for another 19% of these FLOPs, is implemented in many deep learning frameworks as im2col followed by GEMM. However, straightforward im2col adds overhead from copying and replicating input data, so some deep learning libraries implement direct (im2col-free) convolution for improved efficiency. FBGEMM instead provides a way to fuse im2col with the main GEMM kernel to minimize im2col overhead.
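To make the overhead concrete, here is a minimal sketch of plain im2col (not FBGEMM code) for a single-channel input, stride 1 and no padding. Each output position is copied out as a full row, so overlapping patches replicate input data before the GEMM even starts.

```cpp
#include <vector>

// Minimal im2col for a single-channel H x W input and a KH x KW filter,
// stride 1, no padding. Each output pixel becomes one row holding the
// KH*KW input values the filter touches there, so the convolution
// reduces to an (OH*OW) x (KH*KW) by (KH*KW) x 1 GEMM per filter.
std::vector<float> im2col(const std::vector<float>& in, int H, int W,
                          int KH, int KW) {
  int OH = H - KH + 1, OW = W - KW + 1;
  std::vector<float> cols(static_cast<size_t>(OH) * OW * KH * KW);
  for (int oy = 0; oy < OH; ++oy)
    for (int ox = 0; ox < OW; ++ox)
      for (int ky = 0; ky < KH; ++ky)
        for (int kx = 0; kx < KW; ++kx)
          cols[((oy * OW + ox) * KH + ky) * KW + kx] =
              in[(oy + ky) * W + (ox + kx)];  // overlapping patches are copied repeatedly
  return cols;
}
```

Each input element can appear in up to KH*KW rows of the temporary matrix; that copy and replication is exactly the overhead that fusing im2col into the GEMM kernel avoids.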

Facebook says that recent industry and research work has indicated that inference using mixed precision works well without adversely affecting accuracy. FBGEMM uses this as an alternative strategy to improve inference performance with quantized models. Also, newer generations of GPUs, CPUs, and specialized tensor processors natively support lower-precision compute primitives, and hence the deep learning community is moving toward low-precision models. FBGEMM provides a way to perform efficient quantized inference on current and upcoming generations of CPUs.
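For context on what quantized inference involves, the sketch below shows asymmetric (affine) 8-bit quantization, the scheme commonly used for int8 inference: floats are mapped to uint8 through a scale and a zero point, the GEMM accumulates in 32-bit integers, and values are converted back as needed. The names and layout here are illustrative assumptions, not FBGEMM's API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative asymmetric (affine) quantization parameters: a float range
// [min, max] is mapped onto uint8 values via a scale and a zero point.
struct QParams { float scale; int32_t zero_point; };

QParams choose_qparams(float min, float max) {
  min = std::min(min, 0.0f);  // the representable range must contain zero
  max = std::max(max, 0.0f);
  float scale = (max - min) / 255.0f;
  if (scale == 0.0f) scale = 1.0f;  // degenerate all-zero tensor
  int32_t zp = static_cast<int32_t>(std::round(-min / scale));
  return {scale, zp};
}

uint8_t quantize(float x, QParams q) {
  int32_t v = q.zero_point + static_cast<int32_t>(std::round(x / q.scale));
  return static_cast<uint8_t>(std::clamp(v, 0, 255));
}

float dequantize(uint8_t x, QParams q) {
  return q.scale * (static_cast<int32_t>(x) - q.zero_point);
}
```

With weights and activations stored this way, the heavy compute runs on 8-bit integer inputs with 32-bit accumulation, which is what lets low-precision GEMM use less memory bandwidth and the narrower compute primitives that newer CPUs expose.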