I have a question concerning the determination of the Maximum Likelihood Estimate (MLE) of a parameter (or vector of parameters) by taking the natural logarithm of the posterior distribution.

$p(D|h)$ is the likelihood of the data set given the parameter $h$.

$D$ is the data set.

$p(h)$ is the prior probability of the parameter $h$ (I would normally use $\theta$ instead of $h$, but for this example I'd rather work with discrete probabilities than continuous ones).

So the posterior distribution is proportional to the likelihood times the prior:

$p(h|D) \propto p(D|h) \times p(h)$
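
For concreteness, here is a minimal numerical sketch of that proportionality, using three made-up hypotheses and made-up probabilities (the names `h1`, `h2`, `h3` and all numbers are purely illustrative):

```python
# Toy discrete example: the posterior is likelihood * prior, renormalized to sum to 1.
prior = {"h1": 0.50, "h2": 0.30, "h3": 0.20}        # p(h)
likelihood = {"h1": 0.01, "h2": 0.20, "h3": 0.05}   # p(D|h) for the observed data D

unnormalized = {h: likelihood[h] * prior[h] for h in prior}
evidence = sum(unnormalized.values())               # p(D), the normalizing constant
posterior = {h: u / evidence for h, u in unnormalized.items()}

print(posterior)  # roughly {'h1': 0.067, 'h2': 0.8, 'h3': 0.133}
```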

My textbook first explains how to find the MAP estimate. MAP stands for "maximum a posteriori": the MAP estimate is the value of the parameter (vector) $h$ at which the posterior probability mass (or density, depending on whether $h$ is discrete or continuous) is highest.
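
In symbols, the step being justified in the next paragraph is (writing $h_{MAP}$ for the MAP estimate):

$h_{MAP} = \underset{h}{argmax} \; p(h|D) = \underset{h}{argmax} \; p(D|h) \times p(h) = \underset{h}{argmax} \; \left[ \log(p(D|h)) + \log(p(h)) \right]$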

This is possible, of course, because $\log(x)$ is an increasing function for all $x \gt 0$, because the log function has the property $\log_{a}(p \times q) = \log_{a}(p) + \log_{a}(q)$, and finally because $\underset{h}{argmax}$ simply returns the value of $h$ that maximizes whatever expression follows it.
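
To check that equivalence numerically (reusing the toy numbers from the sketch above), the argmax over the product and the argmax over the sum of logs pick out the same hypothesis:

```python
import math

prior = {"h1": 0.50, "h2": 0.30, "h3": 0.20}
likelihood = {"h1": 0.01, "h2": 0.20, "h3": 0.05}

# argmax over the product p(D|h) * p(h) ...
map_product = max(prior, key=lambda h: likelihood[h] * prior[h])

# ... equals argmax over the sum log(p(D|h)) + log(p(h))
map_logsum = max(prior, key=lambda h: math.log(likelihood[h]) + math.log(prior[h]))

assert map_product == map_logsum  # both give 'h2'
```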

My question is this: the book also says that as we get more and more data (as $D$ grows larger), the training-data likelihood overwhelms the prior. In other words, the MAP estimate approaches the Maximum Likelihood Estimate (MLE), like so: