I have implemented a neural network with backpropagation using a sigmoid activation function. To validate my implementation, I am estimating the gradient of the loss numerically and comparing it against the backpropagated gradient, using the following:
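$$\hat g_i = \frac{J(\theta + \epsilon\, e_i) - J(\theta - \epsilon\, e_i)}{2\epsilon},$$

i.e., the standard centered-difference approximation, where $J$ is the loss, $e_i$ is the $i$-th unit vector, and $\epsilon$ is a small step size.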

I am using the simple OR problem so I can test the network quickly. The results show that the normalized difference between the gradient computed by backpropagation and the numerical approximation starts to increase very quickly.

Below is a graph showing the MSE and the difference of the gradients as a function of the training epoch.

I am getting the expected classification results on the MNIST dataset, but seeing this makes me suspect I am still doing something wrong. Should I expect the difference between the gradients to increase as the network converges to a solution?
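For concreteness, here is a minimal sketch of the kind of check I am running, using a single sigmoid unit trained with MSE on the OR patterns as a stripped-down stand-in for my network (the names and structure are illustrative, not my actual code):

```python
import numpy as np

# OR truth table: inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w):
    # w = [w1, w2, bias]; MSE over the four OR patterns
    p = sigmoid(X @ w[:2] + w[2])
    return np.mean((p - y) ** 2)

def backprop_grad(w):
    # Analytic gradient of the MSE through the sigmoid
    p = sigmoid(X @ w[:2] + w[2])
    d = 2.0 * (p - y) * p * (1.0 - p) / len(y)
    return np.array([d @ X[:, 0], d @ X[:, 1], d.sum()])

def numeric_grad(w, eps=1e-5):
    # Centered finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
w = rng.normal(size=3)
for epoch in range(5001):
    g = backprop_grad(w)
    if epoch % 1000 == 0:
        # Normalized difference between the two gradients
        rel = np.linalg.norm(g - numeric_grad(w)) / np.linalg.norm(g)
        print(f"epoch {epoch:5d}  MSE={loss(w):.6f}  "
              f"|g|={np.linalg.norm(g):.2e}  rel diff={rel:.2e}")
    w -= 1.0 * g  # plain gradient descent
```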

1 Answer

If by "normalized difference" you mean something like $\lVert \hat g - g \rVert / \lVert g \rVert$, this isn't that surprising. Think of the finite-difference approximation as adding a small, roughly constant amount of noise $\delta$ to the true gradient: the ratio then behaves like $\delta / \lVert g \rVert$, and as you approach a local min you'll have $\lVert g \rVert \to 0$, so that ratio will blow up.
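You can see the effect in a toy 1-D sketch (any smooth function whose gradient vanishes at the minimum will do; the numbers here are purely illustrative):

```python
import numpy as np

# f(x) = exp(x) - x has its minimum at x = 0, where f'(x) = exp(x) - 1 vanishes.
f = lambda x: np.exp(x) - x
grad = lambda x: np.exp(x) - 1.0
eps = 1e-5

# Approach the minimum: the absolute finite-difference error stays roughly
# constant, so the relative error blows up as |f'(x)| -> 0.
for x in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10]:
    g_hat = (f(x + eps) - f(x - eps)) / (2 * eps)
    g = grad(x)
    print(f"|g|={abs(g):.1e}  abs err={abs(g_hat - g):.1e}  "
          f"rel err={abs(g_hat - g) / abs(g):.1e}")
```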

The absolute size of the difference seems a little high, though. You might be using too large an $\epsilon$, or something like that.
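On the choice of $\epsilon$: the centered difference has $O(\epsilon^2)$ truncation error, while floating-point roundoff grows as $\epsilon$ shrinks, so there's a sweet spot in between (around $\sqrt[3]{\text{machine }\epsilon} \approx 6 \times 10^{-6}$ for float64). A quick sweep on the same toy function shows it:

```python
import numpy as np

# Same toy function as above, checked away from the minimum at x = 0.5.
f = lambda x: np.exp(x) - x
x, g = 0.5, np.exp(0.5) - 1.0  # true gradient at x

# The error is smallest at an intermediate epsilon: too large and truncation
# error dominates, too small and floating-point roundoff does.
for eps in [1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11]:
    g_hat = (f(x + eps) - f(x - eps)) / (2 * eps)
    print(f"eps={eps:.0e}  rel err={abs(g_hat - g) / abs(g):.1e}")
```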