Equioscillation Theorem: Necessary and Sufficient Conditions for Optimal Polynomial Approximation

By 苏剑林 | June 02, 2025

Recently, while reading, I came across a theorem on optimal polynomial approximation known as the "Equioscillation Theorem." Its proof involves differentiating the infinity norm, and I found both the result and the proof quite novel, so I am recording them here.

References: 《Notes on how to prove Chebyshev’s equioscillation theorem》 and 《Approximation Theory – Lecture 5》.

Equioscillation

We first present the conclusion:

Equioscillation Theorem Let $f(x)$ range over polynomials of degree at most $n$, and let $g(x)$ be a continuous function on the interval $[a,b]$. Then \begin{equation}f^* = \mathop{\text{argmin}}_f \max_{x\in[a,b]} |f(x) - g(x)|\end{equation} if and only if there exist $a\le x_0 < x_1 < \cdots < x_{n+1} \le b$ and $\sigma\in\{0,1\}$ such that, for every $k=0,1,\ldots,n+1$, \begin{equation}f^*(x_k) - g(x_k) = (-1)^{k+\sigma} \max_{x\in[a,b]} |f^*(x) - g(x)|\end{equation}

The Equioscillation Theorem is also sometimes called simply "Chebyshev's Theorem," since it was first discovered by Chebyshev. Although the statement of the theorem looks complicated, it has a very intuitive geometric meaning, as shown in the following figure:

[Figure: the Equioscillation Theorem]

Simply put, the error of the best degree-$n$ polynomial approximation must oscillate around zero, crossing the original function again and again. This is quite intuitive, and the Equioscillation Theorem sharpens it into a strong quantitative statement: the maximum error is attained, with alternating signs, at least $n+2$ times.
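
To make this concrete, here is a small numerical sketch (my addition, not from the references) for $n=1$ and $g(x)=e^x$ on $[0,1]$. It relies on the classical closed form for the best linear approximation of a convex function: the optimal slope is the secant slope over $[a,b]$, and the intercept splits the error evenly between the endpoints and the interior tangency point.

```python
import numpy as np

# Best linear (n = 1) approximation of g(x) = exp(x) on [0, 1], using the
# classical closed form for a convex g: the optimal slope is the secant slope
# over [a, b], and the intercept splits the error evenly between the endpoints
# and the interior tangency point x*.  (Numerical illustration only.)
a, b = 0.0, 1.0
g = np.exp
m = (g(b) - g(a)) / (b - a)                  # secant slope, here e - 1
x_star = np.log(m)                           # tangency point: g'(x*) = m
c = (g(a) + g(x_star) - m * (a + x_star)) / 2

xs = np.linspace(a, b, 100001)
err = (m * xs + c) - g(xs)                   # f*(x) - g(x) on a fine grid
print("max error E =", np.abs(err).max())    # ~0.10593

for x in (a, x_star, b):                     # the n + 2 = 3 alternation points
    print(f"x = {x:.4f}, f*(x) - g(x) = {(m * x + c) - g(x):+.6f}")
# The error equals -E, +E, -E at these three points, as the theorem predicts.
```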

Ideas for Differentiation

The error metric used in the Equioscillation Theorem is $\max\limits_{x\in[a,b]} |f(x) - g(x)|$, which means we want the maximum error to be as small as possible. Its advantage is that it automatically focuses on the area with the largest error, and there are no additional hyperparameters to tune.

We know that the standard method for finding an extremum is to take the derivative and set it to zero. However, differentiating a metric that contains $\max$ is not trivial. Note that \begin{equation}\max_{x\in[a,b]} |f(x) - g(x)| = \lim_{p\to\infty} \left(\int_a^b |f(x)-g(x)|^p\,dx\right)^{1/p} = \Vert f(x)-g(x)\Vert_{\infty}\end{equation} that is, it is the $p\to\infty$ limit of the $L_p$ norm. For finite $p$ the derivative can be computed directly with the standard formulas, but new phenomena appear as $p\to\infty$. In that case, the most reliable approach is to return to the original definition of the (directional) derivative: \begin{equation}\mathcal{D}_x[h(x),u] = \lim_{\epsilon\to 0} \frac{h(x + \epsilon u) - h(x)}{\epsilon}\end{equation} If the limit exists and can be expressed in the inner-product form $\langle \varphi(x),u\rangle$, then $\varphi(x)$ is the derivative (gradient) of $h(x)$. If $h(x)$ attains its minimum at $x$, then for all sufficiently small non-zero $\epsilon$ we must have $h(x + \epsilon u) \ge h(x)$; dividing by $\epsilon > 0$ and letting $\epsilon\to 0$ gives $\langle \varphi(x),u\rangle \ge 0$ for every $u$, and replacing $u$ by $-u$ gives the reverse inequality, so $\varphi(x)=0$. This is the principle behind setting the derivative to zero to find a minimum.
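
As a quick sanity check of the $L_p\to L_\infty$ limit (my addition; the sample functions below are arbitrary), the following sketch evaluates $\big(\int_0^1 |f-g|^p\,dx\big)^{1/p}$ on a grid for increasing $p$ and compares it with the maximum error.

```python
import numpy as np

# Sketch: the L_p norm of f - g on [0, 1] approaches the maximum error as p grows.
# The functions are arbitrary examples; the integral is approximated by the grid
# average (the interval has length 1).
xs = np.linspace(0.0, 1.0, 200001)
diff = np.abs(np.sin(3 * xs) - 0.5 * xs)      # |f(x) - g(x)|

for p in (2, 8, 32, 128, 512):
    lp = np.mean(diff ** p) ** (1.0 / p)
    print(f"p = {p:3d}:  L_p = {lp:.6f}")
print("max error:", diff.max())               # the L_p values increase toward this
```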

A Simple Example

We start with a simple example: define $h(x_1,x_2) = \max(x_1,x_2)$ and consider the one-sided limit \begin{equation}\mathcal{D}_{(x_1,x_2)}[h(x_1,x_2),(u_1,u_2)] = \lim_{\epsilon\to 0^+} \frac{\max(x_1 + \epsilon u_1,x_2 + \epsilon u_2) - \max(x_1,x_2)}{\epsilon}\end{equation}

Next, we argue by cases. The first case is $x_1 > x_2$. Since we are considering the limit of small $\epsilon$, when $x_1 \neq x_2$ the perturbations $\epsilon u_1, \epsilon u_2$ are not enough to change the ordering, i.e., $x_1 + \epsilon u_1 > x_2 + \epsilon u_2$, so \begin{equation}\mathcal{D}_{(x_1,x_2)}[h(x_1,x_2),(u_1,u_2)] = \lim_{\epsilon\to 0^+} \frac{(x_1 + \epsilon u_1) - x_1}{\epsilon} = u_1\end{equation} Similarly, when $x_1 < x_2$, we have \begin{equation}\mathcal{D}_{(x_1,x_2)}[h(x_1,x_2),(u_1,u_2)] = \lim_{\epsilon\to 0^+} \frac{(x_2 + \epsilon u_2) - x_2}{\epsilon} = u_2\end{equation} Finally, when $x_1=x_2$, we have $\max(x_1 + \epsilon u_1,x_2 + \epsilon u_2)=x_1 + \epsilon \max(u_1, u_2)$ for $\epsilon > 0$, so \begin{equation}\mathcal{D}_{(x_1,x_2)}[h(x_1,x_2),(u_1,u_2)] = \lim_{\epsilon\to 0^+} \frac{(x_1 + \epsilon \max(u_1, u_2)) - x_1}{\epsilon} = \max(u_1,u_2)\end{equation} This generalizes to $\boldsymbol{x}=(x_1,\cdots,x_n), \boldsymbol{u}=(u_1,\cdots,u_n)$. A little thought shows the rule: first find the positions where $\boldsymbol{x}$ attains its maximum, then take the maximum of the corresponding components of $\boldsymbol{u}$, i.e., \begin{equation}\mathcal{D}_{\boldsymbol{x}}[\max(\boldsymbol{x}),\boldsymbol{u}] = \lim_{\epsilon\to 0^+} \frac{\max(\boldsymbol{x} + \epsilon \boldsymbol{u}) - \max(\boldsymbol{x})}{\epsilon} = \max_{i\in\mathop{\text{argmax}}(\boldsymbol{x})} u_i\end{equation} The maximum of $\boldsymbol{x}$ may be attained at several positions, so $\mathop{\text{argmax}}(\boldsymbol{x})$ is a set.
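
This rule is easy to check numerically. The sketch below (my addition; the vectors are arbitrary examples) compares the right-hand side with a one-sided difference quotient for a small $\epsilon$, including the case where the maximum is attained at two positions.

```python
import numpy as np

# Check of D_x[max(x), u] = max_{i in argmax(x)} u_i against a one-sided
# difference quotient with a small epsilon (illustration only).
def dir_deriv_rule(x, u):
    idx = np.flatnonzero(x == x.max())     # argmax(x), possibly several indices
    return u[idx].max()

def dir_deriv_fd(x, u, eps=1e-6):
    return (np.max(x + eps * u) - np.max(x)) / eps

rng = np.random.default_rng(0)
x = np.array([1.0, 3.0, 3.0, -2.0])        # the maximum 3.0 is attained twice
for _ in range(3):
    u = rng.normal(size=x.shape)
    print(dir_deriv_rule(x, u), dir_deriv_fd(x, u))
# The two columns agree up to floating-point error, including at the tie.
```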

It is not difficult to see that, in general, the above formula cannot be written in the inner-product form $\langle \boldsymbol{\varphi}(\boldsymbol{x}),\boldsymbol{u}\rangle$, which means that the $\max$ operator does not have a gradient in the strict sense. Of course, the probability that two components are exactly equal is very small, so if we assume the maximum is attained at a single position, the gradient can be separated out, and it is $\mathop{\text{onehot}}(\mathop{\text{argmax}}(\boldsymbol{x}))$. This is exactly how the $\max$ operator is handled in deep learning frameworks.
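
In the unique-maximum case the directional derivative indeed reduces to $\langle \mathop{\text{onehot}}(\mathop{\text{argmax}}(\boldsymbol{x})), \boldsymbol{u}\rangle$, which a tiny sketch confirms (my addition; the numbers are arbitrary):

```python
import numpy as np

# When the maximum of x is attained at a single index, the rule above reduces
# to an ordinary gradient: onehot(argmax(x)).  Small illustrative sketch.
x = np.array([0.3, 2.0, -1.0, 1.5])           # unique maximum at index 1
grad = np.eye(len(x))[np.argmax(x)]           # onehot(argmax(x)) = [0, 1, 0, 0]

u = np.array([0.7, -0.2, 0.4, 1.3])
eps = 1e-6
fd = (np.max(x + eps * u) - np.max(x)) / eps  # one-sided difference quotient
print(fd, grad @ u)                           # both are approximately u[1] = -0.2
```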

Adding Absolute Values

Let $|\boldsymbol{x}|$ denote the vector obtained by taking the absolute value of each component of $\boldsymbol{x}$. We now consider the derivative of $\max(|\boldsymbol{x}|)$. Note that \begin{equation}\max(|\boldsymbol{x}|) = \max((-\boldsymbol{x}, \boldsymbol{x}))\end{equation} where $(-\boldsymbol{x}, \boldsymbol{x})$ denotes the vector formed by concatenating the two vectors. Having removed the absolute values, we can substitute the result of the previous section directly: \begin{equation}\mathcal{D}_{\boldsymbol{x}}[\max(|\boldsymbol{x}|),\boldsymbol{u}] = \mathcal{D}_{(-\boldsymbol{x},\boldsymbol{x})}[\max((-\boldsymbol{x},\boldsymbol{x})),(-\boldsymbol{u},\boldsymbol{u})] = \max_{i\in\mathop{\text{argmax}}((-\boldsymbol{x},\boldsymbol{x}))} (-\boldsymbol{u},\boldsymbol{u})_i\end{equation} This can be simplified further. Note that when $x_i\neq 0$, at most one of $\pm x_i$ can be the maximum. If $x_i$ itself is the maximum, it must be positive, and it is $u_i$ that enters the comparison; conversely, if $x_i$ is negative, then $-x_i$ is the candidate and $-u_i$ enters the comparison. Thus \begin{equation}\mathcal{D}_{\boldsymbol{x}}[\max(|\boldsymbol{x}|),\boldsymbol{u}] = \max_{i\in\mathop{\text{argmax}}((-\boldsymbol{x},\boldsymbol{x}))} (-\boldsymbol{u},\boldsymbol{u})_i = \max_{i\in\mathop{\text{argmax}}(|\boldsymbol{x}|)} \mathop{\text{sign}}(x_i) u_i \end{equation} which can also be written as \begin{equation}\mathcal{D}_{\boldsymbol{x}}[\Vert\boldsymbol{x}\Vert_{\infty},\boldsymbol{u}] = \max_{i\in\mathop{\text{argmax}}(|\boldsymbol{x}|)} \mathop{\text{sign}}(x_i) u_i \end{equation} valid provided $\boldsymbol{x}\neq \boldsymbol{0}$. This generalizes to functions on a closed interval: for all $f,g\in \mathcal{C}[a,b]$ with $f\neq 0$, \begin{equation}\mathcal{D}_f[\Vert f\Vert_{\infty},g] = \max_{x\in\mathop{\text{argmax}}(|f(x)|)} \mathop{\text{sign}}(f(x)) g(x) \end{equation} This is the key derivative identity needed to prove the Equioscillation Theorem. It is worth pointing out that the derivation in these two sections used only case analysis and no approximation, so in the finite-dimensional (vector) case there is a sufficiently small threshold $\tau$ such that the identity $\Vert \boldsymbol{x} + \epsilon \boldsymbol{u}\Vert_{\infty} = \Vert \boldsymbol{x}\Vert_{\infty} + \epsilon\, \mathcal{D}_{\boldsymbol{x}}[\Vert\boldsymbol{x}\Vert_{\infty},\boldsymbol{u}]$ holds exactly for all $\epsilon\in[0,\tau]$. For functions the corresponding statement holds to first order in $\epsilon$, which is all the proof below requires: \begin{equation}\Vert f + \epsilon g\Vert_{\infty} = \Vert f\Vert_{\infty} + \epsilon\, \mathcal{D}_f[\Vert f\Vert_{\infty},g] + o(\epsilon) = \Vert f\Vert_{\infty} + \epsilon\max_{x\in\mathop{\text{argmax}}(|f(x)|)} \mathop{\text{sign}}(f(x)) g(x) + o(\epsilon) \label{eq:eps-x}\end{equation}
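
Here is a grid-based sketch of this identity for functions (my addition; the particular $f$, $g$, grid, and tolerances are arbitrary choices): the one-sided difference quotient of the sup norm is compared with the maximum of $\mathop{\text{sign}}(f(x))g(x)$ over the points where $|f|$ is maximal.

```python
import numpy as np

# Numerical check of the key formula
#   D_f[ ||f||_inf , g ] = max over argmax|f(x)| of sign(f(x)) * g(x),
# with functions represented by their values on a fine grid of [-1, 1].
xs = np.linspace(-1.0, 1.0, 200001)
f = np.cos(2 * np.pi * xs)        # |f| attains its maximum 1 at x = -1, -1/2, 0, 1/2, 1
g = xs + 0.1                      # an arbitrary "direction" g

# Right-hand side: restrict to the grid points where |f| is maximal,
# then take the maximum of sign(f) * g over them.
idx = np.flatnonzero(np.abs(f) >= np.abs(f).max() - 1e-12)
rhs = (np.sign(f[idx]) * g[idx]).max()

# Left-hand side: one-sided difference quotient of the sup norm.
eps = 1e-6
lhs = (np.abs(f + eps * g).max() - np.abs(f).max()) / eps

print(lhs, rhs)   # both are approximately 1.1; they agree up to round-off error
```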

Necessary and Sufficient Conditions

This section formally begins the discussion of optimal approximation. We first prove a result that is more general than polynomial approximation, credited to Kolmogorov (the same K as in KAN):

Kolmogorov's Theorem Let $\mathcal{S}\subseteq \mathcal{C}[a,b]$ be a linear subspace of $\mathcal{C}[a,b]$, let $g\in \mathcal{C}[a,b]$, and let $f$ range over $\mathcal{S}$. Then \begin{equation}f^* = \mathop{\text{argmin}}_f \Vert f - g\Vert_{\infty} = \mathop{\text{argmin}}_f \max_{x\in[a,b]} |f(x) - g(x)|\end{equation} if and only if for every $h\in \mathcal{S}$ \begin{equation}\mathcal{D}_{f^*}[\Vert f^* - g\Vert_{\infty}, h] = \max_{x\in\mathop{\text{argmax}}(|f^*(x) - g(x)|)} \mathop{\text{sign}}(f^*(x) - g(x)) h(x) \ge 0\label{eq:cond-1}\end{equation}

The proof is divided into necessity and sufficiency. Necessity follows directly from equation $\eqref{eq:eps-x}$ and the definition of a minimum (if $f^*$ is a minimizer, then $\Vert f^* - g + \epsilon h\Vert_{\infty} \ge \Vert f^* - g\Vert_{\infty}$ for all $\epsilon > 0$; dividing by $\epsilon$ and letting $\epsilon\to 0^+$ gives $\mathcal{D}_{f^*}[\Vert f^* - g\Vert_{\infty}, h]\ge 0$), so we focus on sufficiency. Consider the expansion \begin{equation}\begin{aligned} |f(x) - g(x)|^2 =&\, |f(x)-f^*(x)+f^*(x)-g(x)|^2 \\[6pt] =&\, |f(x)-f^*(x)|^2 + 2(f(x)-f^*(x))(f^*(x)-g(x)) + |f^*(x)-g(x)|^2 \end{aligned}\label{eq:cond-1x}\end{equation} Since $f,f^*\in \mathcal{S}$, we have $h = f-f^* \in \mathcal{S}$. Choose $x\in\mathop{\text{argmax}}(|f^*(x)-g(x)|)$ attaining the maximum in condition $\eqref{eq:cond-1}$, so that \begin{equation}(f(x)-f^*(x))(f^*(x)-g(x)) = h(x) \mathop{\text{sign}}(f^*(x)-g(x))|f^*(x)-g(x)| \ge 0\end{equation} Substituting this into equation $\eqref{eq:cond-1x}$ gives $|f(x) - g(x)|^2 \ge |f^*(x)-g(x)|^2$. Since $|f^*(x)-g(x)| = \Vert f^* - g\Vert_{\infty}$ at this particular $x$ and $|f(x)-g(x)| \le \Vert f-g\Vert_{\infty}$, it follows that $\Vert f-g\Vert_{\infty} \ge \Vert f^* - g\Vert_{\infty}$ for every $f\in\mathcal{S}$. Therefore, $f^*$ is an optimal approximation.
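
As a small numerical illustration (my addition), take the best linear approximation of $g(x)=e^x$ on $[0,1]$ from the earlier sketch and check Kolmogorov's criterion for random $h$ in $\mathcal{S}=\mathrm{span}\{1,x\}$: the maximum of $\mathop{\text{sign}}(f^*(x)-g(x))\,h(x)$ over the three error-maximizing points is never negative.

```python
import numpy as np

# Sanity check of Kolmogorov's criterion for the best linear approximation of
# g(x) = exp(x) on [0, 1]: for every h in S = span{1, x}, the directional
# derivative of the maximum error at f* must be >= 0.  (Illustration only.)
a, b = 0.0, 1.0
g = np.exp
m = (g(b) - g(a)) / (b - a)
x_star = np.log(m)
c = (g(a) + g(x_star) - m * (a + x_star)) / 2

# The maximum error is attained at a, x_star, b with signs -, +, -.
pts = np.array([a, x_star, b])
err = (m * pts + c) - g(pts)
signs = np.sign(err)

rng = np.random.default_rng(0)
for _ in range(5):
    c0, c1 = rng.normal(size=2)          # random h(x) = c0 + c1*x in S
    h = c0 + c1 * pts
    print((signs * h).max() >= 0)        # always True for the optimal f*
```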

Proving the Theorem

To complete the proof of the Equioscillation Theorem, the subspace $\mathcal{S}$ we select is the set of all polynomials of degree not exceeding $n$. For convenience, we restate the content of the theorem here:

Equioscillation Theorem Let $f(x)$ range over polynomials of degree at most $n$, and let $g(x)$ be a continuous function on the interval $[a,b]$. Then \begin{equation}f^* = \mathop{\text{argmin}}_f \Vert f(x) - g(x)\Vert_{\infty}= \mathop{\text{argmin}}_f \max_{x\in[a,b]} |f(x) - g(x)|\end{equation} if and only if there exist $a\le x_0 < x_1 < \cdots < x_{n+1} \le b$ and $\sigma\in\{0,1\}$ such that, for every $k=0,1,\ldots,n+1$, \begin{equation}f^*(x_k) - g(x_k) = (-1)^{k+\sigma}\Vert f^*(x) - g(x)\Vert_{\infty} = (-1)^{k+\sigma} \max_{x\in[a,b]} |f^*(x) - g(x)|\label{eq:cond-2}\end{equation}

Let $\mathcal{E} = \Vert f^*(x) - g(x)\Vert_{\infty}$. There are three key points to the Equioscillation Theorem: 1. the maximum error $\mathcal{E}$ occurs at least $n+2$ times; 2. $\pm \mathcal{E}$ appears alternately; 3. this constitutes a necessary and sufficient condition for the optimal solution.

We again divide the proof into sufficiency and necessity. First, sufficiency. By Kolmogorov's Theorem, we only need to show that if condition $\eqref{eq:cond-2}$ is satisfied, then no $h\in \mathcal{S}$ can exist such that for all $x\in\mathop{\text{argmax}}(|f^*(x) - g(x)|)$ we have \begin{equation}\mathop{\text{sign}}(f^*(x) - g(x)) h(x) < 0\end{equation} Indeed, by $\eqref{eq:cond-2}$ the values $\pm \mathcal{E}$ appear alternately at the $n+2$ points $x_0 < x_1 < \cdots < x_{n+1}$, all of which lie in $\mathop{\text{argmax}}(|f^*(x) - g(x)|)$. If the above inequality held, then $h(x)$ would change sign between consecutive $x_k$, i.e., at least $n+1$ times, and by the Intermediate Value Theorem $h(x)$ would have at least $n+1$ distinct roots. This is impossible for a nonzero polynomial of degree at most $n$, and $h\equiv 0$ is ruled out by the strict inequality. Hence no such $h$ exists, and $f^*$ is optimal.

As for necessity, according to Kolmogorov's Theorem, for every $h\in \mathcal{S}$ it must hold that \begin{equation}\max_{x\in\mathop{\text{argmax}}(|f^*(x) - g(x)|)} \mathop{\text{sign}}(f^*(x) - g(x)) h(x) \ge 0\label{eq:cond-3}\end{equation} Order the elements of $\mathop{\text{argmax}}(|f^*(x) - g(x)|)$ from left to right and group consecutive elements on which $f^*(x) - g(x)$ has the same sign. We claim there are at least $n+2$ such groups. Suppose instead there were only $m \le n + 1$ groups. Then we could pick $m-1$ separation points $z_1 < z_2 < \cdots < z_{m-1}$, one strictly between each pair of adjacent groups, and consider \begin{equation}h_{\pm}(x) = \pm (x-z_1)(x-z_2)\cdots (x - z_{m-1})\in \mathcal{S}\end{equation} Since the sign of $f^*(x) - g(x)$ alternates from one group to the next, and $h_{\pm}$ changes sign exactly at the $z_j$, at least one of $h_{\pm}$ has the opposite sign to $f^*(x) - g(x)$ at every $x \in \mathop{\text{argmax}}(|f^*(x) - g(x)|)$. Say it is $h_+$; then \begin{equation} \mathop{\text{sign}}(f^*(x) - g(x)) h_+(x) < 0,\qquad \forall x\in\mathop{\text{argmax}}(|f^*(x) - g(x)|)\end{equation} which contradicts $\eqref{eq:cond-3}$. Therefore there are at least $n+2$ groups, and picking one point from each of them yields $x_0 < x_1 < \cdots < x_{n+1}$ at which $f^*(x_k) - g(x_k) = \pm\mathcal{E}$ with alternating signs, i.e., condition $\eqref{eq:cond-2}$ holds.
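
A classical worked example where all of this can be seen explicitly (my addition): on $[-1,1]$ the best degree-$n$ approximation of $g(x)=x^{n+1}$ is $f^*(x)=x^{n+1}-T_{n+1}(x)/2^n$, where $T_{n+1}$ is the Chebyshev polynomial, so the error $f^*-g=-T_{n+1}(x)/2^n$ equioscillates between $\pm 2^{-n}$. The sketch below counts its alternation points numerically; the grid size and tolerances are arbitrary choices.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# On [-1, 1] the best degree-n approximation of g(x) = x^(n+1) is
# f*(x) = x^(n+1) - T_{n+1}(x) / 2^n, so the error is -T_{n+1}(x) / 2^n.
# Count the points where |error| reaches its maximum E.
n = 4
xs = np.linspace(-1, 1, 400001)
T = C.Chebyshev.basis(n + 1)                 # the Chebyshev polynomial T_{n+1}
err = -T(xs) / 2**n                          # f*(x) - g(x)

E = np.abs(err).max()                        # equals 2**(-n) = 0.0625
hits = xs[np.abs(np.abs(err) - E) < 1e-9]    # grid points where |error| ~ E
# Collapse consecutive grid points into distinct alternation points.
groups = np.split(hits, np.flatnonzero(np.diff(hits) > 1e-3) + 1)
print("E =", E, " alternation points:", len(groups))   # n + 2 = 6 of them
```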

Unique Optimum

A natural question is whether the optimal polynomial approximation in the Equioscillation Theorem is unique. The answer is yes, and the proof, which follows the notes 《Chebyshev Equioscillation Theorem》, is also quite clever.

Assume $f_1^*, f_2^*$ are two optimal polynomials for $g$. We first prove that $(f_1^* + f_2^*)/2$ is also an optimal approximation: \begin{equation}\left\Vert \frac{f_1^* + f_2^*}{2} - g\right\Vert_{\infty} = \left\Vert \frac{f_1^* - g}{2} + \frac{f_2^* - g}{2}\right\Vert_{\infty}\le \frac{1}{2}\Vert f_1^* - g\Vert_{\infty} + \frac{1}{2}\Vert f_2^* - g\Vert_{\infty} = \mathcal{E} \end{equation} The final equality is because $f_1^*, f_2^*$ are both optimal approximations, so $\Vert f_1^* - g\Vert_{\infty} = \Vert f_2^* - g\Vert_{\infty} = \mathcal{E}$. Since the maximum error of $(f_1^* + f_2^*)/2$ also does not exceed $\mathcal{E}$, it too is an optimal approximation.

Then, according to the Equioscillation Theorem, there exist at least $n+2$ distinct values of $x$ such that \begin{equation}\frac{f_1^*(x) + f_2^*(x)}{2} - g(x) = \pm\mathcal{E} \end{equation} Since \begin{equation}\frac{f_1^* + f_2^*}{2} - g = \frac{f_1^* - g}{2} + \frac{f_2^* - g}{2}\end{equation} and $f_1^*(x) - g(x), f_2^*(x) - g(x) \in [-\mathcal{E}, \mathcal{E}]$, the average can reach $\pm\mathcal{E}$ only if $f_1^*(x) - g(x)$ and $f_2^*(x) - g(x)$ are both equal to $\mathcal{E}$ or both equal to $-\mathcal{E}$ at each of these points. In particular, $f_1^*(x) = f_2^*(x)$ there, so the polynomial $f_1^*-f_2^*$ has at least $n+2$ roots. But $f_1^*-f_2^*$ has degree at most $n$, and a nonzero polynomial of degree at most $n$ has at most $n$ roots, so it must be identically zero, i.e., $f_1^*=f_2^*$.

Two Variants

Finally, we give variants of the Equioscillation Theorem for odd and even polynomials. Odd/even polynomials are polynomials containing only odd/even powers, such as $x + 2x^3$ and $1 + 4x^2 + x^4$. Their characteristic is that nonzero roots must appear in $\pm$ pairs. Therefore, a nonzero odd or even polynomial of degree at most $2n+1$ has at most $n$ roots in $(0,\infty)$.
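
A quick check of the root-pairing property (my addition; the two sample polynomials are arbitrary):

```python
import numpy as np

# Nonzero roots of odd/even polynomials come in +/- pairs, so a polynomial of
# degree <= 2n+1 with only odd (or only even) powers has at most n positive roots.
odd_coeffs = [1, 0, -5, 0, 4, 0]    # x^5 - 5x^3 + 4x = x (x^2 - 1)(x^2 - 4)
even_coeffs = [1, 0, -5, 0, 4]      # x^4 - 5x^2 + 4 = (x^2 - 1)(x^2 - 4)
print(np.sort(np.roots(odd_coeffs).real).round(6))    # [-2, -1, 0, 1, 2]
print(np.sort(np.roots(even_coeffs).real).round(6))   # [-2, -1, 1, 2]
```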

For odd/even polynomials, we have:

Equioscillation Theorem (Odd/Even) Let $f(x)$ range over odd (respectively, even) polynomials of degree at most $2n+1$, and let $g(x)$ be a continuous function on an interval $[a,b] \subset (0,\infty)$. Then \begin{equation}f^* = \mathop{\text{argmin}}_f \Vert f(x) - g(x)\Vert_{\infty}= \mathop{\text{argmin}}_f \max_{x\in[a,b]} |f(x) - g(x)|\end{equation} if and only if there exist $a \le x_0 < x_1 < \cdots < x_{n+1} \le b$ and $\sigma \in \{0,1\}$ such that, for every $k=0,1,\ldots,n+1$, \begin{equation}f^*(x_k) - g(x_k) = (-1)^{k+\sigma}\Vert f^*(x) - g(x)\Vert_{\infty} = (-1)^{k+\sigma} \max_{x\in[a,b]} |f^*(x) - g(x)|\end{equation}

The proof steps are fundamentally the same as those for the general Equioscillation Theorem, requiring only slight adjustments in the details, which interested readers may complete themselves.

Summary

This article introduced the Equioscillation Theorem for optimal polynomial approximation and the related problem of differentiating the infinity norm.