The rectified linear unit, i.e. $\max(0, x)$, is introduced by discussing its advantages and potential problems. Briefly summarized, its advantages are sparsity, cheaper computation compared to the hyperbolic tangent and sigmoid, and better gradient flow due to the linear, non-saturating part. A potential problem is that the hard threshold at $0$ blocks gradient flow. As a smooth alternative, they consider the softplus function:

$\text{softplus}(x) = \log(1 + e^x)$

However, experiments show that the hard threshold does not actually hinder optimization, making the smoother softplus unnecessary.
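For concreteness, here is a minimal NumPy sketch of both activations and their gradients (the function names are my own, not from the paper); it illustrates why the rectifier yields exact zeros (sparsity) and a hard zero gradient below the threshold, while softplus smooths both out:

```python
import numpy as np

def relu(x):
    # Hard threshold at 0: exact zeros for negative inputs (sparsity).
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 on the linear part and exactly 0 below the threshold,
    # which is the potential problem for gradient flow.
    return (x > 0).astype(x.dtype)

def softplus(x):
    # log(1 + e^x); logaddexp(0, x) computes this in a numerically
    # stable way for large |x|.
    return np.logaddexp(0.0, x)

def softplus_grad(x):
    # The derivative of softplus is the logistic sigmoid,
    # so the gradient is never exactly zero.
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)
print(relu(x), relu_grad(x))
print(softplus(x), softplus_grad(x))
```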

What is your opinion on the summarized work? Do you know of related work that might be of interest? Let me know your thoughts in the comments below!