Trade-off Optimization Between Accuracy and Floating Point Operations for DispNetC: a Deep Network for Disparity Estimation

Many state-of-the-art computer vision solutions are based on Convolutional Neural Networks (CNNs), which rely on convolutions to aggregate spatial information. However, the high computational complexity and memory footprint of standard convolutions hinder the widespread use of CNNs on resource-constrained hardware, such as embedded systems and mobile devices.
Moreover, the FLOPS and parameter count of a standard convolution grow quadratically with kernel size, which conflicts with the fact that many tasks need large kernels to aggregate spatial information and produce accurate results.
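The quadratic growth can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names, the channel counts, and the feature-map size are assumptions chosen for the example, biases are ignored, and cost is counted in multiply-accumulate operations (MACs) for a stride-1 convolution.

```python
def conv_params(kernel_size, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel_size * kernel_size * c_in * c_out


def conv_macs(kernel_size, c_in, c_out, out_h, out_w):
    """Multiply-accumulates for one forward pass: every output pixel
    applies the full set of weights once."""
    return conv_params(kernel_size, c_in, c_out) * out_h * out_w


if __name__ == "__main__":
    # Hypothetical layer: 64 -> 64 channels on a 96 x 192 feature map.
    for k in (3, 5, 7):
        print(k, conv_params(k, 64, 64), conv_macs(k, 64, 64, 96, 192))
```

Doubling the kernel size quadruples both counts, which is why large kernels become expensive so quickly.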
Research on more efficient models has therefore become necessary to make CNN-based solutions ubiquitous in everyday tasks. To that end, many approaches have been proposed to build smaller and less computationally intensive networks, ranging from slimmer models and hand-crafted efficient operations that substitute for convolutions to architecture search algorithms. However, to the best of our knowledge, none of these methods has been investigated in depth for disparity estimation.
We propose the EmbedNet block, which can replace standard convolutions in disparity estimation networks while requiring a smaller model and fewer FLOPS per inference, and while maintaining or even improving network accuracy. EmbedNet achieves state-of-the-art results at lower cost by combining several individually successful techniques: depthwise separable convolutions, filter factorization through XY asymmetric convolutions, increased network cardinality through a multi-branch structure, and selectively reduced width through bottleneck convolutions.
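To give a rough sense of why these ingredients reduce cost, the sketch below compares the parameter count of a standard convolution with three of the techniques named above, each taken in isolation. It is not the EmbedNet block itself: the function names, the channel counts, and the bottleneck width are assumptions made for illustration, and biases are ignored.

```python
def standard(k, c_in, c_out):
    """Standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable(k, c_in, c_out):
    """k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


def asymmetric_xy(k, c_in, c_out):
    """1 x k convolution (c_in -> c_out) followed by k x 1 (c_out -> c_out)."""
    return k * c_in * c_out + k * c_out * c_out


def bottleneck(k, c_in, c_mid, c_out):
    """1 x 1 reduction to c_mid channels, k x k convolution, 1 x 1 expansion."""
    return c_in * c_mid + k * k * c_mid * c_mid + c_mid * c_out


if __name__ == "__main__":
    k, c = 3, 64  # hypothetical layer: 3 x 3 kernel, 64 -> 64 channels
    print("standard           ", standard(k, c, c))
    print("depthwise separable", depthwise_separable(k, c, c))
    print("asymmetric XY      ", asymmetric_xy(k, c, c))
    print("bottleneck (16)    ", bottleneck(k, c, 16, c))
```

Note that the asymmetric factorization grows linearly rather than quadratically with kernel size, so its advantage over a standard convolution widens as kernels get larger.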
DispNetC, a CNN for disparity estimation, is used as our optimization baseline. We report a 3.5-fold reduction in the FLOPS required per inference and a model 2.5 times smaller than the baseline, while keeping the original accuracy. Moreover, our experiments show that the EmbedNet block performs well across different datasets, indicating that our architecture generalizes better.