Ok, this is quite a basic question, but I am a little bit confused. In my thesis I write:

The standard errors can be found by calculating the inverse of the square root of the diagonal elements of the (observed) Fisher Information matrix:

\begin{align*}
s_{\hat{\mu},\hat{\sigma}^2}=\frac{1}{\sqrt{\mathbf{I}(\hat{\mu},\hat{\sigma}^2)}}
\end{align*}
Since the optimization command in R minimizes $-\log\mathcal{L}$ the (observed) Fisher Information matrix can be found by calculating the inverse of the Hessian:
\begin{align*}
\mathbf{I}(\hat{\mu},\hat{\sigma}^2)=\mathbf{H}^{-1}
\end{align*}

$\begingroup$@COOLSerdash Thanks for your corrections and +1, but this source: unc.edu/~monogan/computing/r/MLE_in_R.pdf page 7 clearly says that the observed Fisher information is equal to the INVERSE of the Hessian?$\endgroup$
– Jen Bohold Aug 22 '13 at 16:08

$\begingroup$@COOLSerdash Ok, you may want to post this as an answer.$\endgroup$
– Jen Bohold Aug 22 '13 at 16:35

2 Answers

Yudi Pawitan writes in his book In All Likelihood that the negative second derivative of the log-likelihood evaluated at the maximum likelihood estimate (MLE) is the observed Fisher information (see also this document, page 2). The Hessian evaluated at the MLE is exactly what most optimization algorithms, such as optim in R, return: the Hessian of the objective function they were given. So when the negative log-likelihood is minimized, the returned matrix is the negative of the Hessian of the log-likelihood, i.e. the observed Fisher information itself. As you correctly point out, the estimated standard errors of the MLE are the square roots of the diagonal elements of the inverse of the observed Fisher information matrix. In other words: the square roots of the diagonal elements of the inverse of the Hessian (or the negative Hessian) are the estimated standard errors.

Summary

The negative Hessian evaluated at the MLE is the same as the observed Fisher information matrix evaluated at the MLE.

Regarding your main question: No, it's not correct that the
observed Fisher information can be found by inverting the (negative)
Hessian.

Regarding your second question: The inverse of the (negative) Hessian is an estimator of the asymptotic covariance matrix. Hence, the square roots of the diagonal elements of the covariance matrix are estimators of the standard errors.

I think the second document you link to got it wrong.

Formally

Let $l(\theta)$ be a log-likelihood function. The Fisher information matrix $\mathbf{I}(\theta)$ is a symmetric $(p\times p)$ matrix with entries:
$$
[\mathbf{I}(\theta)]_{ij}=-\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}l(\theta),~~~~ 1\leq i, j\leq p
$$
The observed Fisher information matrix is simply $\mathbf{I}(\hat{\theta}_{\mathrm{ML}})$, the information matrix evaluated at the maximum likelihood estimates (MLE). The Hessian is defined as:
$$
[\mathbf{H}(\theta)]_{ij}=\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}l(\theta),~~~~ 1\leq i, j\leq p
$$
It is simply the matrix of second derivatives of the log-likelihood function with respect to the parameters. It follows that if you minimize the negative log-likelihood, the returned Hessian is the observed Fisher information matrix, whereas if you maximize the log-likelihood, the negative Hessian is the observed information matrix.
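This sign bookkeeping can be checked numerically. The thread is about R's optim, but as an illustration only, here is a small check sketched in Python/NumPy (the normal model, the finite-difference step eps, and the hessian helper are my own assumptions, not part of the thread): the Hessian of $-l$ at the MLE equals the negative Hessian of $l$, and both match the analytic observed information.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=100)
n = x.size
mu_hat = x.mean()
s2_hat = np.mean((x - mu_hat) ** 2)          # MLE of the variance

def loglik(theta):
    """Normal log-likelihood l(mu, sigma^2)."""
    mu, v = theta
    return -0.5 * n * np.log(2 * np.pi * v) - np.sum((x - mu) ** 2) / (2 * v)

def hessian(f, t, eps=1e-4):
    """Central-difference Hessian of f at the point t."""
    p = t.size
    H = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            ei = np.eye(p)[i] * eps
            ej = np.eye(p)[j] * eps
            H[i, j] = (f(t + ei + ej) - f(t + ei - ej)
                       - f(t - ei + ej) + f(t - ei - ej)) / (4 * eps ** 2)
    return H

mle = np.array([mu_hat, s2_hat])
H_l = hessian(loglik, mle)                   # Hessian of l at the MLE
H_negl = hessian(lambda t: -loglik(t), mle)  # Hessian of -l (what a minimizer sees)

# H_negl equals -H_l, and both match the analytic observed information
# for this model: -H_l[0, 0] is n / s2_hat and -H_l[1, 1] is n / (2 * s2_hat**2).
```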

Further, the inverse of the Fisher information matrix is an estimator of the asymptotic covariance matrix:
$$
\mathrm{Var}(\hat{\theta}_{\mathrm{ML}})=[\mathbf{I}(\hat{\theta}_{\mathrm{ML}})]^{-1}
$$
The standard errors are then the square roots of the diagonal elements of the covariance matrix.
For the asymptotic distribution of a maximum likelihood estimate, we can write
$$
\hat{\theta}_{\mathrm{ML}}\stackrel{a}{\sim}\mathcal{N}\left(\theta_{0}, [\mathbf{I}(\hat{\theta}_{\mathrm{ML}})]^{-1}\right)
$$
where $\theta_{0}$ denotes the true parameter value. Hence, the estimated standard error of the $j$-th component of the maximum likelihood estimate is given by:
$$
\mathrm{SE}(\hat{\theta}_{\mathrm{ML},j})=\sqrt{\left([\mathbf{I}(\hat{\theta}_{\mathrm{ML}})]^{-1}\right)_{jj}}
$$
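Putting the pieces together numerically: the thread uses R's optim, but the same pipeline can be sketched in Python with scipy.optimize.minimize standing in for optim (the normal model, the log-sigma parameterization, and the finite-difference Hessian below are illustrative choices of mine, not from the thread). Minimize $-l$, take the Hessian of that objective at the optimum (the observed information), invert it, and read off the square roots of the diagonal:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=500)
n = x.size

def negloglik(theta):
    """Negative normal log-likelihood, parameterized as (mu, log sigma)."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return (0.5 * n * np.log(2 * np.pi) + n * log_sigma
            + np.sum((x - mu) ** 2) / (2 * sigma ** 2))

fit = minimize(negloglik, x0=np.array([0.0, 0.0]), method="BFGS")

def hessian(f, t, eps=1e-4):
    """Central-difference Hessian of f at the point t."""
    p = t.size
    H = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            ei = np.eye(p)[i] * eps
            ej = np.eye(p)[j] * eps
            H[i, j] = (f(t + ei + ej) - f(t + ei - ej)
                       - f(t - ei + ej) + f(t - ei - ej)) / (4 * eps ** 2)
    return H

H = hessian(negloglik, fit.x)   # observed Fisher information (no sign flip needed)
cov = np.linalg.inv(H)          # estimated asymptotic covariance matrix
se = np.sqrt(np.diag(cov))      # standard errors of (mu_hat, log_sigma_hat)
```

For this model the result can be checked analytically: se[0] should be close to $\hat{\sigma}/\sqrt{n}$, and under the log-sigma parameterization se[1] should be close to $1/\sqrt{2n}$.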

$\begingroup$The (expected) Fisher information is $\mathcal{I}(\theta)=\operatorname{E}I(\theta)$; the observed (Fisher) information is just $I(\theta)$, so called not because it's evaluated at the maximum-likelihood estimate of $\theta$, but because it's a function of the observed data rather than an average over possible observations. This is perhaps obscured by familiar examples' considering inference about the canonical parameter in a full exponential family, when $\mathcal{I}(\theta)=I(\theta)$.$\endgroup$
– Scortchi♦ Feb 24 '16 at 10:27

First, one declares the log-likelihood function. Then one optimizes it. That's fine.

Writing the log-likelihood function in R, we ask for $-1\cdot l$ (where $l$ represents the log-likelihood function) because the optim command in R minimizes a function by default. Minimizing $-l$ is the same as maximizing $l$, which is what we want.

Now, the observed Fisher information matrix is $-\mathbf{H}$, and the estimated covariance matrix of the MLE is $(-\mathbf{H})^{-1}$, where $\mathbf{H}$ is the Hessian of the log-likelihood. The reason we do not have to multiply the Hessian returned by optim by $-1$ ourselves is that the whole evaluation has been done in terms of $-1$ times the log-likelihood: the Hessian that optim produces is already the Hessian of $-l$, that is, $-\mathbf{H}$.
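To make that sign bookkeeping concrete, here is a sketch in Python/NumPy as an illustration (the normal model in $(\mu,\sigma^2)$ and its closed-form Hessian are my own example, not from the answer): the Hessian of $-l$ at the MLE is the observed information $-\mathbf{H}$, and inverting it, with no extra factor of $-1$, yields the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.5, size=200)
n = x.size

# Closed-form MLEs for the normal model
mu_hat = x.mean()
s2_hat = np.mean((x - mu_hat) ** 2)

# Hessian of the *negative* log-likelihood at the MLE, in (mu, sigma^2).
# Because the objective is already -l, this matrix is -H, the observed
# Fisher information; no further sign change is required.
neg_H = np.array([[n / s2_hat, 0.0],
                  [0.0, n / (2 * s2_hat ** 2)]])

cov = np.linalg.inv(neg_H)      # estimated covariance matrix: (-H)^{-1}
se = np.sqrt(np.diag(cov))      # SE(mu_hat) and SE of the variance estimate
```

With these closed forms, se[0] is $\sqrt{\hat{\sigma}^2/n}$ and se[1] is $\hat{\sigma}^2\sqrt{2/n}$, the textbook standard errors for the normal mean and variance.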