So, why does batch normalization work? Here's one reason: you know how normalizing the input features, the X's, to mean zero and variance one can speed up learning?
So rather than having some features that range from zero to one and others that range from 1 to 1,000, normalizing all the input features X to take on a similar range of values can speed up learning.
So one intuition behind why batch norm works is that it is doing a similar thing, but for the values in your hidden units and not just for your input features.
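The normalization step described above can be sketched in a few lines of NumPy. This is a minimal illustration of shifting to mean zero and scaling to variance one; the function name `normalize` and the toy data are my own, not from the lecture, and full batch norm would additionally apply learned scale and shift parameters (gamma and beta):

```python
import numpy as np

def normalize(z, eps=1e-5):
    # Per feature (per column): subtract the mean, divide by the
    # standard deviation, so each feature has mean ~0 and variance ~1.
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mu) / np.sqrt(var + eps)

# Toy inputs: one feature ranges over [0, 1], the other over [1, 1000].
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 100),
                     rng.uniform(1, 1000, 100)])

X_norm = normalize(X)
print(X_norm.mean(axis=0))  # both close to 0
print(X_norm.std(axis=0))   # both close to 1
```

The same operation, applied to the activations of a hidden layer within each mini-batch rather than to the raw inputs, is the core of batch norm.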