By 苏剑林 | June 19, 2019
For most readers (including the author), the first biased estimator they encounter is likely the variance:
\begin{equation}\hat{\sigma}^2_{\text{biased}} = \frac{1}{n}\sum_{i=1}^n \left(x_i - \hat{\mu}\right)^2,\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i\label{eq:youpianfangcha}\end{equation}
And then they learn that the corresponding unbiased estimator should be:
\begin{equation}\hat{\sigma}^2_{\text{unbiased}} = \frac{1}{n-1}\sum_{i=1}^n \left(x_i - \hat{\mu}\right)^2\label{eq:wupianfangcha}\end{equation}
To many readers, formula $\eqref{eq:youpianfangcha}$ looks like the most natural choice, so how can it possibly be biased? And formula $\eqref{eq:wupianfangcha}$ merely replaces $n$ with the counter-intuitive $n-1$, yet that somehow makes it unbiased?
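As a quick aside, readers who want to check the two formulas numerically can note that they correspond to the two settings of the `ddof` argument in NumPy, whose divisor is $n-\text{ddof}$ (a minimal sketch, assuming NumPy is available):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=10)  # any finite sample

# Biased estimator: divide by n (NumPy's default, ddof=0)
var_biased = np.var(x)            # same as ((x - x.mean())**2).mean()

# Unbiased estimator: divide by n - 1 (ddof=1)
var_unbiased = np.var(x, ddof=1)  # same as ((x - x.mean())**2).sum() / (len(x) - 1)

print(var_biased, var_unbiased)
```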
Below, I will attempt to discuss the concepts of unbiased and biased estimation using language that is as clear as possible.
Suppose we could draw infinitely many samples; then, theoretically, the following estimates would be exact:
\begin{equation}\begin{aligned}\sigma^2 =&\, \mathbb{E}\left[(x - \mu)^2\right]=\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n \left(x_i - \hat{\mu}\right)^2\\ \mu =&\, \mathbb{E}[x]=\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n x_i\end{aligned}\end{equation}
This can also be understood as: when the number of samples tends to infinity, biased and unbiased estimations are equivalent.
The problem is that in actual calculations, we can only sample a finite set of samples. This means $n$ is a fixed number—for example, when using stochastic gradient descent, we use the average gradient of a batch of samples as an estimate of the gradient for the entire population. On the other hand, we don't just estimate once; we might estimate many times. That is, we first sample $n$ samples, calculate once to get $\hat{\mu}_{1}$ and $\hat{\sigma}^2_{\text{biased},1}$, then randomly sample another $n$ samples to get $\hat{\mu}_{2}$ and $\hat{\sigma}^2_{\text{biased},2}$, and so on, obtaining $\left(\hat{\mu}_{3},\hat{\sigma}^2_{\text{biased},3}\right),\left(\hat{\mu}_{4},\hat{\sigma}^2_{\text{biased},4}\right),\dots$. What we want to know is:
\begin{equation}\begin{aligned}\mu &\xlongequal{?}\mathbb{E}\left[\hat{\mu}\right] = \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N \hat{\mu}_{i}\\ \sigma^2 &\xlongequal{?}\mathbb{E}\left[\hat{\sigma}^2_{\text{biased}}\right]=\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N \hat{\sigma}^2_{\text{biased},i} \end{aligned}\end{equation}
In other words, does the "infinite average" of these "finite averages" equal the true value we are after?
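Before doing any algebra, a minimal Monte Carlo sketch of this repeated-estimation experiment already suggests the answer (assuming NumPy and, purely for illustration, a standard normal population with $\mu=0$ and $\sigma^2=1$):

```python
import numpy as np

rng = np.random.default_rng(42)
n, N = 2, 200_000        # n samples per estimate, N repeated estimates

x = rng.normal(size=(N, n))                    # standard normal population: mu=0, sigma^2=1
mu_hat = x.mean(axis=1)                        # one mean estimate per repetition
var_biased = ((x - mu_hat[:, None]) ** 2).mean(axis=1)   # the 1/n formula, once per repetition

print(mu_hat.mean())       # ~0.0: the averaged mean estimates do recover mu
print(var_biased.mean())   # ~0.5: clearly below the true variance of 1
```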
As mentioned earlier, this article focuses on discussion and understanding rather than formal derivation, so I do not intend to complete a general proof. Here, we will only use the simplest example: assume $n=2$. That is, when using $\eqref{eq:youpianfangcha}$ or $\eqref{eq:wupianfangcha}$ for estimation, we only sample two points each time. In this case, the questions we need to answer are:
\begin{equation}\begin{aligned}\mu &\xlongequal{?}\mathbb{E}_{x_1,x_2}\left[\frac{x_1 + x_2}{2}\right]\\ \sigma^2 &\xlongequal{?}\mathbb{E}_{x_1,x_2}\left[\frac{1}{2}\left(\left(x_1 - \frac{x_1 + x_2}{2}\right)^2 + \left(x_2 - \frac{x_1 + x_2}{2}\right)^2\right)\right] \end{aligned}\end{equation}
Since this situation is relatively simple, we can easily verify it. For instance:
\begin{equation}\mathbb{E}_{x_1,x_2}\left[\frac{x_1 + x_2}{2}\right] = \mathbb{E}_{x_1}\left[\frac{x_1}{2}\right] + \mathbb{E}_{x_2}\left[\frac{x_2}{2}\right] = \frac{\mu}{2} + \frac{\mu}{2} = \mu\end{equation}
So the mean estimated with two samples is an unbiased estimate of the mean, and the same applies to more samples.
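Spelled out for general $n$, this is just the linearity of expectation:
\begin{equation}\mathbb{E}_{x_1,\dots,x_n}\left[\frac{1}{n}\sum_{i=1}^n x_i\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{x_i}\left[x_i\right] = \frac{1}{n}\cdot n\mu = \mu\end{equation}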
However, variance is different:
\begin{equation}\begin{aligned}&\mathbb{E}_{x_1, x_2} \left[\frac{1}{2}\left(\left(x_1 - \frac{x_1 + x_2}{2}\right)^2 + \left(x_2 - \frac{x_1 + x_2}{2}\right)^2\right)\right]\\ =&\frac{1}{4}\mathbb{E}_{x_1, x_2} \left[\left(x_1 - x_2\right)^2\right]\\ =&\frac{1}{4}\mathbb{E}_{x_1, x_2} \left[x_1^2 + x_2^2 - 2 x_1 x_2\right]\\ =&\frac{1}{4}\Big(\mathbb{E}_{x_1} \left[x_1^2\right] + \mathbb{E}_{x_2} \left[x_2^2\right] - 2 \mathbb{E}_{x_1} \left[x_1\right] \mathbb{E}_{x_2} \left[x_2\right]\Big)\\ =&\frac{1}{4}\Big(2\mathbb{E}_{x} \left[x^2\right] - 2 \mu^2\Big)\\ =&\frac{1}{2}\Big(\mathbb{E}\left[x^2\right] - \mu^2\Big) \end{aligned}\end{equation}
Note that the exact expression for the variance is $\mathbb{E}\left[x^2\right] - \mu^2$, so with two samples $\hat{\sigma}^2_{\text{biased}}$ is a biased estimate of the variance: even after repeating the estimation many times and averaging, it recovers only half of the true variance. Carrying out the same analysis for $n$ samples shows that the expectation of $\hat{\sigma}^2_{\text{biased}}$ is $\frac{n-1}{n}\sigma^2$. Thus we only need to multiply by $n/(n-1)$ to obtain an unbiased estimate of the variance, which gives exactly $\eqref{eq:wupianfangcha}$.
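For readers who want the general case, here is a sketch of the $n$-sample calculation behind that factor. Expanding the square and using $\hat{\mu} = \frac{1}{n}\sum_i x_i$ gives $\frac{1}{n}\sum_i \left(x_i - \hat{\mu}\right)^2 = \frac{1}{n}\sum_i x_i^2 - \hat{\mu}^2$, and for independent samples $\mathbb{E}\left[\hat{\mu}^2\right] = \mu^2 + \sigma^2/n$, so:
\begin{equation}\mathbb{E}\left[\hat{\sigma}^2_{\text{biased}}\right] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n x_i^2 - \hat{\mu}^2\right] = \mathbb{E}\left[x^2\right] - \left(\mu^2 + \frac{\sigma^2}{n}\right) = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2\end{equation}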
Intuitively, when formula $\eqref{eq:youpianfangcha}$ is applied to a small sample, it measures the fluctuations around the sample mean, which adapts to those very samples, so the spread it sees tends to be smaller than the spread around the true mean; the variance estimate therefore comes out too small. This is what being biased means here. In the extreme case where we take only a single sample each time, the variance estimated by formula $\eqref{eq:youpianfangcha}$ is exactly 0, and no matter how many times we repeat the experiment, the average of the results is still 0. We certainly cannot conclude that the variance of the whole population is 0, can we? This is the simplest example of a biased estimate.
Theoretically, the mechanism behind biased estimation is also easy to understand, because the formula for calculating variance is equivalent to:
\begin{equation}\mathbb{E}\left[x^2\right] - \mathbb{E}\left[x\right]^2\end{equation}
The expectation operator $\mathbb{E}$ is linear, but this formula is non-linear in $\mathbb{E}$ (specifically quadratic, because of the $\mathbb{E}\left[x\right]^2$ term). Whenever the quantity being estimated is non-linear with respect to the expectation operator $\mathbb{E}$ (note: the emphasis is on non-linearity in the expectation operator, not non-linearity of the random variables), a direct finite-sample estimate is very likely to be biased. The reason is that a composite of linear operations is still linear, so averaging commutes with any linear operation, whereas averaging does not commute with non-linear operations: the expectation of a non-linear function of the sample average is generally not that function of the true expectation.
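As a concrete illustration (a minimal sketch, assuming NumPy and a hypothetical normal population), consider estimating $\mu^2$ by squaring the sample mean of $n$ points. Squaring is non-linear in $\mathbb{E}$, and the estimate overshoots by exactly $\sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, N = 3.0, 2.0, 5, 500_000   # hypothetical population parameters and sample sizes

x = rng.normal(mu, sigma, size=(N, n))
mu_sq_hat = x.mean(axis=1) ** 2          # non-linear (quadratic) function of the sample mean

print(mu_sq_hat.mean())                  # ~ mu**2 + sigma**2 / n = 9.8, not mu**2 = 9.0
```

Here too the bias shrinks as $n$ grows, consistent with the earlier observation that biased and unbiased estimates coincide in the infinite-sample limit.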
Not every biased estimator can be turned into an unbiased one as easily as the variance, simply by replacing $n$ with $n-1$. In general, for many quantities we want to estimate, constructing the estimator at all is already difficult, let alone deciding whether it is biased or unbiased. Therefore, removing the bias of a general estimator requires a case-by-case analysis.
Reprinting Disclaimer: Please include the original address of this article: https://kexue.fm/archives/6747