By 苏剑林 | Feb 15, 2016
If you have been following Scientific Spaces for a while, you may have noticed that I have a particular fondness for extremum principles. For instance, I have written a series of articles titled "Natural Extremum" introducing extremum problems and the calculus of variations; in physics, I am partial to the formulation based on the principle of least action; and in data mining, I am deeply interested in Maximum Entropy models, which are built on the principle of maximum entropy. Recently, while working through the exercises in "Quantum Mechanics and Path Integrals," I became very interested in the variational principle discussed in Chapter 11.
When learning something new, my method is to understand its principles and ideas through the simplest possible example before progressively making things more complex; that way I do not get lost. The variational principle is a powerful method for estimating path integrals (which are functional integrals, i.e., infinite-dimensional integrals). It is natural to ask whether a similar estimation principle exists for finite-dimensional integrals, such as the simplest one-dimensional integral. In fact, there is, and it is not complicated; understanding it helps clarify the core ideas of the variational principle. Unfortunately, I have not found any existing literature describing this simplified version of the principle, though that may simply be due to the limits of my own reading.
The variational principle is essentially an application of Jensen's inequality. We start with the following integral: \begin{equation}\label{jifen}I(\epsilon)=\int_{-\infty}^{\infty}e^{-x^2-\epsilon x^4}dx\end{equation} We have previously studied this integral from the perspective of perturbation expansion. It is often viewed as a one-dimensional toy model of a complicated path integral, and studying it usually yields useful insights into path integrals themselves. To estimate it, a natural idea is to ask: can this integral be approximated by a simpler Gaussian integral? \begin{equation}\label{jinsi}\int_{-\infty}^{\infty}e^{-px^2}dx=\sqrt{\frac{\pi}{p}},\quad p>0\end{equation} The key question is how to choose $p$ so that $\eqref{jinsi}$ is as close as possible to $\eqref{jifen}$. We rewrite $\eqref{jifen}$ as: \begin{equation}\label{bianhuan}I(\epsilon)=\sqrt{\frac{\pi}{p}}\int_{-\infty}^{\infty}e^{(p-1)x^2-\epsilon x^4}\left(\sqrt{\frac{p}{\pi}}e^{-px^2}\right)dx\end{equation} How should we read $\eqref{bianhuan}$? I have deliberately written it in this form: from a statistical perspective, the integral is the expectation (average) of $e^{(p-1)x^2-\epsilon x^4}$ under the normalized weight $\sqrt{\frac{p}{\pi}}e^{-px^2}$. Therefore, we can write it as: \begin{equation}\label{pingjun}I(\epsilon)=\sqrt{\frac{\pi}{p}}\left\langle e^{(p-1)x^2-\epsilon x^4}\right\rangle\end{equation} Again, keep in mind that this average is taken with respect to the weight $\sqrt{\frac{p}{\pi}}e^{-px^2}$. So far this is only a formal transformation; no substantive progress has been made, since we have not actually evaluated the integral.
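This rewriting is easy to verify numerically. Below is a minimal sketch in Python with SciPy (the function names `I` and `I_rewritten` are my own, not from the original text):

```python
import numpy as np
from scipy.integrate import quad

def I(eps):
    """Evaluate I(eps), the integral of exp(-x^2 - eps*x^4) over the real line."""
    val, _ = quad(lambda x: np.exp(-x**2 - eps * x**4), -np.inf, np.inf)
    return val

def I_rewritten(eps, p):
    """The same integral written as sqrt(pi/p) times the average of
    exp((p-1)*x^2 - eps*x^4) under the weight sqrt(p/pi)*exp(-p*x^2)."""
    weight = lambda x: np.sqrt(p / np.pi) * np.exp(-p * x**2)
    avg, _ = quad(lambda x: np.exp((p - 1) * x**2 - eps * x**4) * weight(x),
                  -np.inf, np.inf)
    return np.sqrt(np.pi / p) * avg

print(I(1.0))                 # ~1.36843
print(I_rewritten(1.0, 2.0))  # same value, for any choice of p > 0
```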
Now, let's recall Jensen's inequality. It states that if $f(x)$ is a convex function (note that the definitions of concavity and convexity vary between textbooks; here, a convex function is one whose second derivative is non-negative), then: \begin{equation}\label{jensen}f(\langle x \rangle)\leq \langle f(x) \rangle\end{equation} Readers may be used to seeing Jensen's inequality in other specific forms, but its essence is this: if several points lie on the graph of a convex function, their center of mass lies on or above the curve.
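As a quick sanity check of $\eqref{jensen}$ for the convex function $e^x$, here is a minimal numerical sketch (the Gaussian test distribution is an arbitrary choice of mine):

```python
import numpy as np

# Jensen's inequality for the convex function exp:
# exp(<x>) <= <exp(x)> holds for any distribution of x.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=100_000)  # arbitrary test distribution

lhs = np.exp(x.mean())   # f(<x>),  ~ e^0.5 ~ 1.65
rhs = np.exp(x).mean()   # <f(x)>,  ~ e^1.0 ~ 2.72 for N(0.5, 1)
print(lhs, rhs, lhs <= rhs)  # ... True
```

(For a normal distribution $N(\mu,\sigma^2)$, the right-hand side converges to $e^{\mu+\sigma^2/2}$, which is strictly larger than $e^{\mu}$, exactly as the inequality demands.)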
Using Jensen's inequality, we can further simplify $\eqref{pingjun}$. Since $e^x$ is clearly a convex function, we have: \begin{equation}\label{pingjun-jensen}I(\epsilon)=\sqrt{\frac{\pi}{p}}\left\langle e^{(p-1)x^2-\epsilon x^4}\right\rangle \geq \sqrt{\frac{\pi}{p}}e^{\left\langle (p-1)x^2-\epsilon x^4\right\rangle}\end{equation} where $\left\langle (p-1)x^2-\epsilon x^4\right\rangle$ can be calculated explicitly, using the Gaussian moments $\langle x^2\rangle=\frac{1}{2p}$ and $\langle x^4\rangle=\frac{3}{4p^2}$: $$\left\langle (p-1)x^2-\epsilon x^4\right\rangle=\int_{-\infty}^{\infty}\left[(p-1)x^2-\epsilon x^4\right]\left(\sqrt{\frac{p}{\pi}}e^{-px^2}\right)dx=\frac{2 (p-1) p-3 \epsilon}{4 p^2}$$ Thus: \begin{equation}\label{pingjun-guji}I(\epsilon)\geq\sqrt{\frac{\pi}{p}}\exp\left[\frac{2 (p-1) p-3 \epsilon}{4 p^2}\right]\end{equation} No matter how $p$ is chosen, the right-hand side is less than or equal to $I(\epsilon)$. Naturally, to make the estimate as tight as possible, we should choose the $p$ that maximizes the right-hand side of the inequality! This is the origin of the extremum principle. Setting the derivative with respect to $p$ to zero, we find that the right-hand side attains its maximum at: $$p=\frac{1}{2} \left(\sqrt{12 \epsilon+1}+1\right)$$ Substituting this back into $\eqref{pingjun-guji}$, we obtain: \begin{equation}\label{pingjun-guji2}I(\epsilon)\geq \sqrt{\frac{2 \pi }{\sqrt{12 \epsilon+1}+1}}\exp\left[\frac{3 \epsilon}{\left(\sqrt{12 \epsilon+1}+1\right)^2}\right]=\hat{I}(\epsilon)\end{equation} Below are some results: $$\begin{array}{c|cccc} \hline & \epsilon=0 & \epsilon=1 & \epsilon=10 & \epsilon=100 \\ \hline I(\epsilon) & 1.77245 & 1.36843 & 0.921961 & 0.554577 \\ \hat{I}(\epsilon) & 1.77245 & 1.34547 & 0.891204 & 0.531509 \\ \hline \end{array}$$ Comparing this with the results in http://kexue.fm/archives/3280/, we see that $\hat{I}(\epsilon)$ is quite a good estimate, and it confirms that indeed $\hat{I}(\epsilon)\leq I(\epsilon)$.
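The table is easy to reproduce numerically: evaluate $I(\epsilon)$ by quadrature and $\hat{I}(\epsilon)$ from the closed form in $\eqref{pingjun-guji2}$. A minimal sketch (the function names are mine):

```python
import numpy as np
from scipy.integrate import quad

def I(eps):
    """I(eps): the integral of exp(-x^2 - eps*x^4) over the real line, by quadrature."""
    val, _ = quad(lambda x: np.exp(-x**2 - eps * x**4), -np.inf, np.inf)
    return val

def I_hat(eps):
    """The variational lower bound at the optimal p = (sqrt(12*eps+1)+1)/2."""
    s = np.sqrt(12 * eps + 1) + 1   # s = 2p at the optimum
    return np.sqrt(2 * np.pi / s) * np.exp(3 * eps / s**2)

for eps in [0, 1, 10, 100]:
    print(f"eps={eps:>3}:  I={I(eps):.5f}  I_hat={I_hat(eps):.5f}")
# eps=  0:  I=1.77245  I_hat=1.77245
# eps=  1:  I=1.36843  I_hat=1.34547
# eps= 10:  I=0.92196  I_hat=0.89120
# eps=100:  I=0.55458  I_hat=0.53151
```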
From the process above, we can see the power of the extremum principle. First, it provides a good estimate of the integral, and the estimate is stable: a reasonable value is obtained regardless of the size of the parameter $\epsilon$. Second, it does not merely provide an estimate; it provides a lower bound for the integral, which is rare. Usually we can use various methods to obtain an effective estimate, but it is often difficult to determine how the estimate compares in magnitude to the exact value (unless accuracy is significantly sacrificed), whereas the extremum principle achieves both: it maintains high accuracy while providing a bound.
The extremum principle discussed in this article originates from Jensen's inequality; its core is that $e^x$ is a convex function and that $e^{\left\langle h(x)\right\rangle}$ is usually easier to calculate than $\left\langle e^{h(x)}\right\rangle$. The direction for generalization is therefore obvious: use different weights, or replace $e^x$ with a general convex function $f$, as long as $f(\left\langle h(x)\right\rangle)$ is easier to calculate than $\left\langle f(h(x))\right\rangle$; a small numerical sketch of this recipe follows.
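To make this concrete, here is a sketch of the generic recipe in Python (the function `variational_bound` and its argument names are my own; it keeps $f=e^x$ but lets the weight family vary, and finds the optimal parameter numerically rather than analytically):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def variational_bound(S, log_w, Z_w, p_range=(1e-3, 1e3)):
    """Generic one-dimensional version of the recipe above.

    S(x)        -- the 'action'; the target integral is the integral of exp(-S(x)).
    log_w(x, p) -- log of the (unnormalized) trial weight, e.g. -p*x**2.
    Z_w(p)      -- its normalization, the integral of exp(log_w), e.g. sqrt(pi/p).
    Returns the tightest lower bound max_p Z_w(p)*exp(<-S - log_w>_p) and the optimal p.
    """
    def bound(p):
        # <h(x)> with h = -S - log_w, averaged under the normalized trial weight
        h_avg, _ = quad(lambda x: (-S(x) - log_w(x, p)) * np.exp(log_w(x, p)) / Z_w(p),
                        -np.inf, np.inf)
        return Z_w(p) * np.exp(h_avg)

    res = minimize_scalar(lambda p: -bound(p), bounds=p_range, method='bounded')
    return bound(res.x), res.x

eps = 1.0
lb, p_opt = variational_bound(
    S=lambda x: x**2 + eps * x**4,   # recovers the example in this article
    log_w=lambda x, p: -p * x**2,    # Gaussian trial weight
    Z_w=lambda p: np.sqrt(np.pi / p),
)
print(lb, p_opt)  # ~1.34547 and p ~2.30278, matching the closed-form result
```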
Of course, the extremum principle also has obvious drawbacks. First, finding a trial function that is both a good approximation and integrable is not easy. Second, even if the integration can be carried out, the subsequent extremization is not necessarily simple, since it involves solving systems of nonlinear equations; once approximations are used at that stage, the result often degenerates into that of a perturbation expansion, losing the unique advantage of the variational method. Finally, estimates from the extremum principle have a less obvious drawback: it is difficult to improve their accuracy any further.