The paper is pretty readable, even for a beginner (which I consider myself to be). To summarize my understanding: they consider three activation functions, ReLU, maxout, and LWTA (local winner-take-all). All of these can completely block the signal from a neuron, i.e. they impose a hard cutoff where a unit passes zero signal. This property allows whole subgraphs of the network to be switched off. They show that "bad" parts of the network appear to get dropped out entirely by these locally competitive functions.
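
To make the "hard cutoff" idea concrete, here is a minimal pure-Python sketch of the three activations as I understand them. The function names, the group size, and the example values are my own assumptions for illustration, not taken from the paper.

```python
def relu(xs):
    """ReLU: each unit passes its value only if it beats a fixed zero threshold."""
    return [max(0.0, x) for x in xs]


def maxout(xs, group_size):
    """Maxout: each group of units emits only its largest value."""
    return [max(xs[i:i + group_size]) for i in range(0, len(xs), group_size)]


def lwta(xs, group_size):
    """LWTA: the winner in each group keeps its value, the losers are set to zero."""
    out = []
    for i in range(0, len(xs), group_size):
        group = xs[i:i + group_size]
        winner = group.index(max(group))  # ties go to the first unit in the group
        out.extend(x if j == winner else 0.0 for j, x in enumerate(group))
    return out


# Hypothetical pre-activations for four units, grouped in pairs for maxout/LWTA.
pre_activations = [1.5, 0.5, -2.0, -0.25]

print(relu(pre_activations))       # [1.5, 0.5, 0.0, 0.0]
print(maxout(pre_activations, 2))  # [1.5, -0.25]
print(lwta(pre_activations, 2))    # [1.5, 0.0, 0.0, -0.25]
```

In each case some units contribute nothing to the output for this input, so the paths through them are effectively cut.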

There's also an interesting section where they argue that what happens with these locally competitive functions is similar to what happens in networks with dropout. When a network is trained with dropout, bad parts of the network will sometimes get blocked by the dropout itself, and this can help the network learn. So there seems to be some similarity between a network with locally competitive activation functions (like ReLU) and a network with some percentage of its nodes dropped out. They analyze this by looking at the active "submasks" of the network, i.e. the units that are actually propagating signal, and show that the structure of a dropout network looks similar to a fully trained ReLU network.
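
As a rough sketch of what a "submask" might look like in code (my own toy setup, not the paper's experiment): for a single ReLU layer, the submask is the binary pattern of units with positive pre-activation, and it changes with the input. A dropout mask is also a binary pattern over units, but it is sampled at random. The weights, inputs, and helper names below are hypothetical.

```python
import random


def relu_submask(weights, biases, x):
    """Return 1 for each hidden unit that passes signal for input x, else 0."""
    mask = []
    for w_row, b in zip(weights, biases):
        pre = sum(w * xi for w, xi in zip(w_row, x)) + b
        mask.append(1 if pre > 0 else 0)
    return mask


def dropout_mask(n_units, keep_prob, rng):
    """Sample a random binary mask over units, as dropout does during training."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n_units)]


# Hypothetical hand-picked layer: 2 inputs, 4 hidden units.
weights = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
biases = [0.0, 0.0, 0.0, 0.0]

mask_a = relu_submask(weights, biases, [2.0, -1.0])
mask_b = relu_submask(weights, biases, [-3.0, 0.5])
print(mask_a)  # [1, 0, 0, 1]
print(mask_b)  # [0, 1, 1, 0]

# A dropout mask has the same shape but does not depend on the input.
print(dropout_mask(4, 0.5, random.Random(0)))
```

The two inputs activate different subsets of the same layer, which is the sense in which a ReLU network selects an input-dependent subnetwork.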

I thought this was a really interesting paper, and I hadn't heard this theory before. What do you all think?

Thanks for sharing. I haven't read the paper yet, only your summary, but based on that it seems odd that ELU, PReLU, and leaky ReLU are all so successful, since none of them impose a hard zero cutoff. Do you think that is evidence against their theory?
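
For reference, a small sketch of the difference I mean, using the standard definitions of leaky ReLU and ELU. The slope and alpha values are common defaults that I am assuming here, and PReLU is the same as leaky ReLU except that the slope is learned rather than fixed.

```python
import math


def relu_unit(x):
    """ReLU on a single value: negative inputs are cut to exactly zero."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Leaky ReLU: negative inputs are scaled down, not blocked."""
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    """ELU: negative inputs saturate smoothly toward -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


x = -2.0
print(relu_unit(x), leaky_relu(x), elu(x))
```

For a negative input, only ReLU outputs exactly zero. The other two still pass a small nonzero signal, so no unit is ever fully blocked.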