In artificial neural networks, learning from data is a computationally demanding task in which a
large number of connection weights are iteratively tuned through stochastic-gradient-based heuristic
processes over a cost function. It is not well understood how learning occurs in these systems, in
particular how they avoid getting trapped in configurations with poor computational performance.
Here we study the difficult case of networks with discrete weights, where the optimization landscape
is very rough even for simple architectures, and provide theoretical and numerical evidence of the
existence of rare—but extremely dense and accessible—regions of configurations in the network
weight space. We define a novel measure, the robust ensemble (RE), which suppresses
trapping by isolated configurations and amplifies the role of these dense regions. We analytically
compute the RE in some exactly solvable models, and also provide a general algorithmic scheme
which is straightforward to implement: define a cost function given by the sum of a finite number
of replicas of the original cost function, with a constraint centering the replicas around a driving
assignment. To illustrate this, we derive several powerful new algorithms, ranging from Markov
chains to message passing to gradient-descent processes, all of which target the robust
dense states, resulting in substantial improvements in performance. The weak dependence on the
number of precision bits of the weights leads us to conjecture that very similar reasoning applies
to more conventional neural networks. Analogous algorithmic schemes can also be applied to other
optimization problems.
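As a concrete illustration of the replicated scheme described above, here is a minimal sketch in NumPy. All names (`replicated_sgd`, `loss_grad`) are hypothetical, and taking the driving assignment to be the running mean of the replicas is just one simple choice; the scheme only requires some coupling that centers the replicas around a common reference configuration.

```python
import numpy as np

def replicated_sgd(loss_grad, dim, y=3, gamma=0.1, lr=0.01, steps=1000, seed=0):
    """Gradient descent on a replicated cost (schematic):
    sum_a [ E(w_a) + (gamma / 2) * ||w_a - w_c||^2 ],
    where w_1..w_y are replicas and w_c is the driving assignment."""
    rng = np.random.default_rng(seed)
    replicas = rng.normal(size=(y, dim))   # y interacting copies of the weights
    center = replicas.mean(axis=0)         # driving assignment w_c (here: replica mean)
    for _ in range(steps):
        for a in range(y):
            # gradient of this replica's own loss plus the centering term
            g = loss_grad(replicas[a]) + gamma * (replicas[a] - center)
            replicas[a] -= lr * g
        center = replicas.mean(axis=0)     # re-center on the updated replicas
    return center

# Toy usage with a non-convex per-coordinate double-well loss, E(w) = sum((w^2 - 1)^2)
grad = lambda w: 4.0 * w * (w**2 - 1.0)
w = replicated_sgd(grad, dim=5)
```

Gradually increasing the coupling `gamma` over the course of the run pulls the replicas together, biasing the search toward wide, dense regions where all replicas can reach low cost simultaneously rather than toward isolated minima.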

LBPNet uses local binary comparisons and random projection in place of conventional convolution (or approximation of convolution) operations.
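To make this concrete, here is a toy NumPy sketch of a single local-binary-comparison layer followed by a fixed random projection. It is an illustration only: the function name and sampling offsets are hypothetical, the comparison positions here are fixed (in LBPNet they are learned), and real implementations operate on bits rather than floats.

```python
import numpy as np

def lbp_layer(img, offsets, proj, thresh=0.0):
    """Toy local-binary-comparison layer plus fixed random projection.

    img     : 2-D array (H, W)
    offsets : (dy, dx) positions compared against each center pixel
    proj    : random projection matrix of shape (len(offsets), n_out)
    """
    bits = np.empty((len(offsets),) + img.shape, dtype=np.float32)
    for k, (dy, dx) in enumerate(offsets):
        shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
        bits[k] = shifted - img > thresh   # 1 where the neighbor exceeds the center
    # mix the comparison bits into output channels; no multiply-accumulate convolution
    return np.tensordot(proj.T, bits, axes=1)  # shape (n_out, H, W)

rng = np.random.default_rng(0)
img = rng.random((8, 8))
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
proj = rng.standard_normal((len(offsets), 2))
out = lbp_layer(img, offsets, proj)  # two channels of projected comparison bits
```

Because the layer only shifts, compares, and projects, it avoids multiply-accumulate convolutions entirely, which is the source of the speed and size savings reported below.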

We have built a convolution-free, end-to-end, bitwise LBPNet from scratch
for deep learning and verified its effectiveness on MNIST, SVHN, and CIFAR-10,
with orders-of-magnitude savings over the baseline and binarized CNNs:
roughly a hundredfold speedup in testing and roughly a thousandfold reduction
in model size. The improvement in both size and speed comes from the
convolution-free design, in which logic bitwise operations are learned directly
from scratch.