Wednesday, March 9, 2016

Biased Estimators vs Unbiased Ones

Let us consider the numerical difference between a biased estimator and an unbiased one. I will take the simple case of the standard deviation of a sample. Say we have some measurements \(\{ x_1, x_2, \ldots, x_N\}\) of a particular outcome. We can compute their average, \({\bar x}\), and we want to estimate their deviation as well. The usual formula with the \(N-1\) correction, the one commonly called unbiased, is the following: \[
s=\sqrt{\frac{\sum_{i=1}^{N}({\bar x}-x_i)^2}{N-1}}
\] At the same time, if we knew the true mean \(\mu\) of the outcome, we could compute the deviation in the following way: \[
\hat{\sigma}=\sqrt{\frac{\sum_{i=1}^{N}(\mu-x_i)^2}{N}}
\] So it looks natural to write the deviation using \({\bar x}\) instead of \(\mu\) as \[
\sqrt{\frac{\sum_{i=1}^{N}({\bar x}-x_i)^2}{N}}
\] But we know from statistics that the last formula is biased.
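
To make the two formulas concrete, here is a minimal Python sketch that evaluates both of them on the same data. The function names and the measurement values are made up for illustration; they are not part of the original derivation.

```python
import math

def sample_deviation(xs):
    """Deviation around the sample mean with N - 1 in the denominator."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((mean - x) ** 2 for x in xs) / (n - 1))

def biased_deviation(xs):
    """Deviation around the sample mean with N in the denominator."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((mean - x) ** 2 for x in xs) / n)

# Hypothetical measurements, invented for this example only.
measurements = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print("N - 1 formula:", sample_deviation(measurements))
print("N formula:    ", biased_deviation(measurements))
```

Both functions share the same sum of squares, so the value with \(N\) in the denominator always comes out smaller than the one with \(N-1\).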

Let us consider how different these values are numerically. I will denote the sum of squares in the numerator by \(S=\sum_{i=1}^{N}({\bar x}-x_i)^2\) to simplify the look of my calculations. Thus \[
s=\sqrt{\frac{\sum_{i=1}^{N}({\bar x}-x_i)^2}{N-1}}=
\sqrt{\frac{S}{N-1}},
\] and I want to compute \[
\sqrt{\frac{S}{N-1}}-\sqrt{\frac{S}{N}}=
\] We can take out \(\sqrt{S}\) as a common factor: \[
\sqrt{S}\left(\frac{1}{\sqrt{N-1}}-\frac{1}{\sqrt{N}}\right)=
\] and then bring the fractions to a common denominator and combine them: \[
\sqrt{S}\left(\frac{\sqrt{N}}{\sqrt{N(N-1)}}-
\frac{\sqrt{N-1}}{\sqrt{N(N-1)}}\right)=
\sqrt{S}\left(\frac{\sqrt{N}-\sqrt{N-1}}{\sqrt{N(N-1)}}\right)=
\] For a difference of square roots there is a special trick in math, based on the formula \[
(a-b)(a+b)=a^2-b^2
\] As you can see, if \(a\) and \(b\) are square roots, then multiplying their difference by their sum gets rid of the roots. Usually we cannot multiply expressions at will, but with a fraction we are allowed to multiply the top and the bottom by the same expression: \[
\sqrt{S}\left(
\frac{\left(\sqrt{N}-\sqrt{N-1}\right)\left(\sqrt{N}+\sqrt{N-1}\right)}
{\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}
\right)=
\] Using the above-mentioned trick simplifies the top (but not the bottom): \[
\sqrt{S}\left(
\frac{N-(N-1)}
{\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}
\right)=
\]\[
\sqrt{S}\left(
\frac{1}
{\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}
\right)=
\]\[
\frac{\sqrt{S}}
{\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}
\] This is not very convenient. Let us consider it relative to the sample deviation: \[
\frac{\mbox{The difference between unbiased and biased values}}
{\mbox{sample deviation}}=
\]\[
\frac{\frac{\sqrt{S}}
{\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}}{\sqrt{\frac{S}{N-1}}}=
\] Here we can cancel \(\sqrt{S}\), flip the fraction in the denominator, and then cancel \(\sqrt{N-1}\). I hope you’re still with me. \[
\frac{\frac{1}
{\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}}
{\frac{1}{\sqrt{N-1}}}=
\frac{\sqrt{N-1}}{\sqrt{N(N-1)}\left(\sqrt{N}+\sqrt{N-1}\right)}=
\]\[
\frac{1}{\sqrt{N}\left(\sqrt{N}+\sqrt{N-1}\right)}
\] The last expression is better, but it could still use some doctoring. I can replace every \(\sqrt{N}\) by \(\sqrt{N-1}\), which makes the denominator smaller and so gives an upper bound for the whole expression: \[
\frac{1}{\sqrt{N}\left(\sqrt{N}+\sqrt{N-1}\right)}<
\frac{1}{\sqrt{N-1}\left(\sqrt{N-1}+\sqrt{N-1}\right)}=
\frac{1}{2(N-1)}
\] Finally we have a handy formula. If \(N=101\), then the relative error of the sample deviation computed with the biased formula is below \(\frac{1}{2(100)}\cdot 100\%=0.5\%\). When \(N=10{,}001\), it is below \(0.005\%\). When \(N=1{,}000{,}001\), which is closer to what happens in Big Data, the error is below \(0.5\cdot 10^{-4}\%\).
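
The ratio and its upper bound can also be checked numerically. The sketch below is only a sanity check under the assumption that \(S\) is the same in both formulas (so it cancels); the function names are hypothetical. It prints plain fractions, so multiply by 100 to read them as percentages.

```python
import math

def relative_difference(n):
    """(unbiased - biased) / unbiased, computed directly; S cancels out."""
    unbiased = 1.0 / math.sqrt(n - 1)
    biased = 1.0 / math.sqrt(n)
    return (unbiased - biased) / unbiased

def closed_form(n):
    """The simplified ratio 1 / (sqrt(N) * (sqrt(N) + sqrt(N - 1)))."""
    return 1.0 / (math.sqrt(n) * (math.sqrt(n) + math.sqrt(n - 1)))

def upper_bound(n):
    """The handy bound 1 / (2 * (N - 1))."""
    return 1.0 / (2.0 * (n - 1))

for n in (101, 10_001, 1_000_001):
    print(n, relative_difference(n), closed_form(n), upper_bound(n))
```

For each \(N\) the first two columns should agree up to rounding, and both should stay just below the third.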

Conclusion.

In Big Data the difference between biased and unbiased estimators of the same value becomes very small and may even be negligible.