By 苏剑林 | January 07, 2017
Statistics is the process of estimating a true distribution from a large number of samples. A term that often accompanies statistics is "smoothing," which refers to discounting or otherwise adjusting the raw counts. The idea behind smoothing is that when the sample space is very large, the counts will be sparse, and due to various accidental factors, small counts become unreliable: a count of 1 may be pure coincidence, so the true frequency is not necessarily close to $1/N$ (with $N$ the total sample size), while a count of 0 does not mean the event can never occur. We therefore smooth the raw counts to make the conclusions more reliable.
There are many smoothing methods. Here we introduce a smoothing formula based on a forgetting hypothesis. Suppose our task is to count the frequency of each character in a corpus. We simulate the process of human forgetting: every time a character appears, our memory of it increases by 1; however, if the character does not appear for one full period (whatever the length of that period turns out to be), our memory of it decays to a fraction $\beta$ of its previous value. Assuming the character appears periodically, the memory $A_n$ after its $n$-th appearance satisfies the recurrence:
$$A_{n+1} = \beta A_n + 1$$
With the initial condition $A_0 = 0$ (no memory before the first appearance), the solution is:
$$A_n = \frac{1-\beta^n}{1-\beta}$$
Since $0 < \beta < 1$, its limit as $n \to \infty$ is:
$$A = \frac{1}{1-\beta}$$
We consider this limit to be our true statistical result—that is, the smoothed result.
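As a quick numerical check, here is a minimal sketch in Python that iterates the recurrence and compares it with the closed form and the limit (the value $\beta = 0.9$ is an arbitrary illustrative choice):

```python
# Iterate A_{n+1} = beta * A_n + 1 from A_0 = 0 and compare against
# the closed form (1 - beta**n) / (1 - beta) and the limit 1 / (1 - beta).
beta = 0.9  # illustrative; any 0 < beta < 1 behaves the same way

A = 0.0
for n in range(1, 101):
    A = beta * A + 1
    closed_form = (1 - beta**n) / (1 - beta)
    assert abs(A - closed_form) < 1e-9

print(A, 1 / (1 - beta))  # after 100 steps A is already very close to 10
```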
If the corpus contains $N$ characters in total and a certain character occurs $F$ times, then, assuming occurrences are uniformly distributed, that character appears once every $N/F$ characters. We can further assume that forgetting follows an exponential decay (a guess motivated by the Ebbinghaus forgetting curve, see https://en.wikipedia.org/wiki/Forgetting_curve; readers may posit other forgetting laws): memory decays by a constant factor $\alpha < 1$ per character read, so over one period of length $N/F$ we have:
$$\beta = \alpha^{N/F}$$
In this way, the data smoothing formula we obtain is:
$$\hat{F} = \frac{1}{1-\alpha^{N/F}}$$
Let us analyze this formula more carefully. It is not hard to see that $\alpha$ must be very close to 1; otherwise everything would be forgotten almost instantly. Second, when $F$ is large enough, the smoothed result $\hat{F}$ should be close to $F$, since the more abundant the statistics, the more credible the raw count. We therefore require that $\hat{F} = F$ when $F = N$. Since the formula above gives $1/(1-\alpha)$ at $F = N$, we must multiply by the factor $N(1-\alpha)$:
$$\hat{F} = \frac{N(1-\alpha)}{1-\alpha^{N/F}}$$
This is the final smoothing formula.
If we use the results provided by Uncle Han, we might consider using:
$$\alpha = 0.99999962...$$
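Below is a minimal sketch of this formula in Python, using the $\alpha$ quoted above (truncated) together with an illustrative corpus size $N$ that is not from the post:

```python
def smooth(F, N, alpha):
    """Smoothed count: F_hat = N * (1 - alpha) / (1 - alpha ** (N / F)).

    For F = 0 we return the limiting value N * (1 - alpha), derived below.
    """
    if F == 0:
        return N * (1 - alpha)
    return N * (1 - alpha) / (1 - alpha ** (N / F))

N = 10**6            # illustrative corpus size
alpha = 0.99999962   # the value quoted above, truncated
print(smooth(N, N, alpha))  # ~ N: a fully attested count is unchanged
print(smooth(1, N, alpha))  # > 1: a singleton count is pulled upward
```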
However, consider the limit as $F \to 0^+$; since $\alpha^{N/F} \to 0$,
$$\lim_{F\to 0^+}\hat{F} = \lim_{F\to 0^+}\frac{N(1-\alpha)}{1-\alpha^{N/F}} = N(1-\alpha)$$
In other words, this smoothing formula assigns a positive frequency, namely $N(1-\alpha)$, to characters whose raw count is 0. This suggests that fixing $\alpha$ itself is somewhat unnatural; would it not be better to fix $N(1-\alpha)$ instead? If we assign frequency $\gamma$ to characters that never appeared, then solving $N(1-\alpha) = \gamma$ gives:
$$\alpha = 1 - \frac{\gamma}{N}$$
Thus:
$$\hat{F} = \frac{\gamma}{1-(1 - \gamma/N)^{N/F}}$$
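Here is a minimal sketch of the reparametrized formula (the values of $N$ and $\gamma$ below are illustrative assumptions):

```python
def smooth_gamma(F, N, gamma):
    """Smoothed count: F_hat = gamma / (1 - (1 - gamma / N) ** (N / F)).

    F = 0 returns gamma by construction, and F = N still maps to N,
    since (1 - gamma / N) ** (N / N) = 1 - gamma / N.
    """
    if F == 0:
        return gamma
    return gamma / (1 - (1 - gamma / N) ** (N / F))

N, gamma = 10**6, 0.5  # illustrative values
for F in [0, 1, 2, 10, 100, N]:
    print(F, smooth_gamma(F, N, gamma))
```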
At this point, starting from the forgetting hypothesis, we have arrived at a final smoothing formula carrying a single parameter with a clear meaning: $\gamma$, the frequency assigned to unseen characters. Finally, note that when $N$ is sufficiently large:
$$(1 - \gamma/N)^N \approx e^{-\gamma}$$
Thus:
$$\hat{F} \approx \frac{\gamma}{1-e^{-\gamma/F}}$$
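A quick numerical comparison (a sketch; $\gamma = 0.5$ and $N = 10^6$ are illustrative choices) shows that the exact and approximate formulas agree to many digits for a large corpus:

```python
import math

def smooth_exp(F, gamma):
    """Large-N approximation: F_hat ~ gamma / (1 - exp(-gamma / F))."""
    return gamma if F == 0 else gamma / (1 - math.exp(-gamma / F))

N, gamma = 10**6, 0.5  # illustrative values
for F in [1, 10, 1000]:
    exact = gamma / (1 - (1 - gamma / N) ** (N / F))
    print(F, exact, smooth_exp(F, gamma))  # the two columns nearly coincide
```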
This also serves as a usable smoothing formula. Finally, consider its Taylor expansion for $F \gg \gamma$:
$$\hat{F} \approx \frac{\gamma}{1-e^{-\gamma/F}} = F + \frac{\gamma}{2} + \frac{\gamma^2}{12F}+\dots$$
Keeping only the first two terms gives $\hat{F} \approx F + \gamma/2$: every count is shifted up by a constant, which is exactly the form of Add-1 smoothing (Laplace smoothing) with the constant 1 replaced by $\gamma/2$. Perhaps, then, we can say that the forgetting hypothesis provides a more substantial theoretical basis for Add-1 smoothing?
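As a sanity check of the expansion (again with illustrative values), the exact formula and the truncated series can be compared directly:

```python
import math

gamma = 1.0  # illustrative value
for F in [5, 50, 500]:
    exact = gamma / (1 - math.exp(-gamma / F))
    series = F + gamma / 2 + gamma**2 / (12 * F)
    print(F, exact, series)  # the two agree increasingly well as F grows
```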
Jianlin Su. (Jan. 07, 2017). "Smoothing formula based on forgetting hypothesis" [Blog post]. Retrieved from https://kexue.fm/archives/4182