Asymptotic Estimation of the Maximum of n Normal Random Variables

By 苏剑林 | November 06, 2025

Let $z_1, z_2, \dots, z_n$ be $n$ independent and identically distributed (i.i.d.) random variables sampled from a standard normal distribution. From these we can construct many derived random variables, such as $z_1 + z_2 + \dots + z_n$, which still follows a normal distribution, or $z_1^2 + z_2^2 + \dots + z_n^2$, which follows a chi-squared distribution. In this article, we are interested in the distribution of their maximum $z_{\max} = \max\{z_1, z_2, \dots, z_n\}$, and in particular its mathematical expectation $\mathbb{E}[z_{\max}]$.

Conclusion First

The basic estimation result for $\mathbb{E}[z_{\max}]$ is:

Let $z_1, z_2, \dots, z_n \sim \mathcal{N}(0, 1)$ and $z_{\max} = \max\{z_1, z_2, \dots, z_n\}$, then \begin{equation}\mathbb{E}[z_{\max}]\sim \sqrt{2\log n}\label{eq:E-z-max}\end{equation}

The meaning of $\sim$ in Equation $\eqref{eq:E-z-max}$ is: \begin{equation}\lim_{n\to\infty} \frac{\mathbb{E}[z_{\max}]}{\sqrt{2\log n}} = 1\end{equation} As $n$ grows, this result becomes increasingly accurate. A more precise result is: \begin{equation}\mathbb{E}[z_{\max}]\sim \sqrt{2\log \frac{n}{\sqrt{2\pi}}}\end{equation} We can verify these using NumPy:

import numpy as np

n = 4096
z = np.random.randn(10000, n)
E_z_max = z.max(axis=1).mean() # ≈ 3.63
approx1 = np.sqrt(2 * np.log(n)) # ≈ 4.08
approx2 = np.sqrt(2 * np.log(n / np.sqrt(2 * np.pi))) # ≈ 3.85

Fast Upper Bound

This article will give three proofs of the above conclusion. The first comes from the post "Expectation of the maximum of gaussian random variables" and the answer there by @Sivaraman. Strictly speaking, it only proves the upper bound $\mathbb{E}[z_{\max}] \leq \sqrt{2\log n}$, but the argument is elegant and worth learning.

The proof cleverly utilizes the convexity of $\exp$: for any $t > 0$, we can write \begin{equation}\exp(t\mathbb{E}[z_{\max}]) \leq \mathbb{E}[\exp(t z_{\max})] = \mathbb{E}[\max_i \exp(t z_i)]\leq \sum_{i=1}^n\mathbb{E}[\exp(t z_i)] = n \exp(t^2 / 2)\end{equation} The first $\leq$ is Jensen's inequality, the second $\leq$ bounds the maximum by the sum (valid since every $\exp(t z_i)$ is positive), and the final equality is the moment generating function of the standard normal. Taking the logarithm of both sides and rearranging gives \begin{equation}\mathbb{E}[z_{\max}] \leq \frac{\log n}{t} + \frac{t}{2}\end{equation} Note that this holds for any $t > 0$, so we can choose the $t$ that minimizes the right-hand side to obtain the tightest bound. By the AM-GM inequality or basic calculus, the right-hand side is minimized at $t=\sqrt{2\log n}$. Substituting this gives \begin{equation}\mathbb{E}[z_{\max}] \leq \sqrt{2\log n}\end{equation} The appeal of this derivation is its simplicity and speed, requiring little prior knowledge. Although it is, strictly speaking, only an upper bound, it is surprisingly accurate, matching the asymptotic result exactly.
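As a quick numerical sanity check (a sketch using only NumPy, in the spirit of the verification above), we can compare the empirical $\mathbb{E}[z_{\max}]$ with the bound $\frac{\log n}{t} + \frac{t}{2}$ evaluated at its minimizer $t = \sqrt{2\log n}$:

import numpy as np

for n in [128, 1024, 4096]:
    z_max_mean = np.random.randn(10000, n).max(axis=1).mean()
    t = np.sqrt(2 * np.log(n))       # minimizer of log(n)/t + t/2
    bound = np.log(n) / t + t / 2    # equals sqrt(2 * log(n))
    print(n, z_max_mean, bound)      # the empirical mean stays below the bound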

Conventional Approach

For those used to step-by-step derivations (like the author), the above argument may feel a bit like a shortcut: the standard approach would be to first find the probability density function of $z_{\max}$ and then compute the expectation by integration. This section follows that route.

Probability Density

For 1D distributions, the probability density function (PDF) and the cumulative distribution function (CDF) are two sides of the same coin. Finding the probability density of $z_{\max}$ requires the CDF as an auxiliary step. The PDF of the standard normal distribution is $p(z)=\exp(-z^2/2)/\sqrt{2\pi}$, and the CDF is $\Phi(z)=\int_{-\infty}^z p(x)dx$, which is a non-elementary function representing the probability that the random variable is less than or equal to $z$.

To find the CDF of $z_{\max}$, i.e., $P(z_{\max} \leq z)$, we note that $z_{\max} \leq z$ is equivalent to $z_1 \leq z, z_2 \leq z, \dots, z_n \leq z$ holding simultaneously. Since the $z_i$ are independent, the probability of them holding simultaneously is the product of their individual probabilities: \begin{equation}P(z_{\max} \le z) = \prod_{i=1}^n P(z_i \le z) = [\Phi(z)]^n \end{equation} Thus, the CDF of $z_{\max}$ is simply $[\Phi(z)]^n$, a very concise result. Note that we haven't used the condition that the $z_i$ follow a normal distribution yet, so this is actually a general result: for $n$ values sampled independently from any distribution, the CDF of their maximum is the $n$-th power of the original distribution's CDF. Differentiating this gives the probability density function $p_{\max}(z)$ of $z_{\max}$: \begin{equation}p_{\max}(z) = n[\Phi(z)]^{n-1} p(z) \end{equation}
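As a quick check of this formula (a sketch assuming SciPy is available for $\Phi$), we can compare $[\Phi(z)]^n$ with the empirical CDF of simulated maxima:

import numpy as np
from scipy.stats import norm

n = 100
z_max = np.random.randn(100000, n).max(axis=1)
for z in [2.0, 2.5, 3.0]:
    # empirical P(z_max <= z) versus the closed form [Phi(z)]^n
    print(z, (z_max <= z).mean(), norm.cdf(z) ** n)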

The plots of $p_{\max}(z)$ for $n=50, 100, 200$ are shown below:

[Bell-shaped curves of $p_{\max}(z)$ for $n = 50, 100, 200$]
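For reference, the curves can be reproduced with a short script (a sketch assuming SciPy and Matplotlib are available):

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

z = np.linspace(1, 4.5, 500)
for n in [50, 100, 200]:
    p_max = n * norm.cdf(z) ** (n - 1) * norm.pdf(z)  # density of z_max
    plt.plot(z, p_max, label=f'n = {n}')
plt.legend()
plt.show()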

Laplace Approximation

With the PDF, we can in principle calculate the expectation by integration: \begin{equation}\mathbb{E}[z_{\max}] = \int_{-\infty}^{\infty} z\, p_{\max}(z)\, dz\end{equation} However, this integral is not easy to evaluate in closed form, so we look for a tractable approximation. From the plot above, the shape of $p_{\max}(z)$ resembles the bell curve of a normal density, so it is natural to approximate it with a normal distribution, also known as the "Laplace Approximation."
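Before approximating, note that the integral can at least be evaluated numerically (a sketch assuming SciPy), which gives a reference value to compare the approximations against:

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

n = 4096
integrand = lambda z: z * n * norm.cdf(z) ** (n - 1) * norm.pdf(z)
E_z_max, _ = quad(integrand, 0, 10)  # the mass below z = 0 is negligible for large n
print(E_z_max)  # close to the simulated value of about 3.63 for n = 4096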

The first step in finding a normal distribution approximation is to find the point where the bell-shaped curve reaches its maximum, then expand $\log p_{\max}(z)$ to the second order at that point. If we only need the mean, finding the maximum point is sufficient, as the mean of a normal distribution is the maximum point of its probability density. To find this maximum point $z_*$, we first find $\log p_{\max}(z)$: \begin{equation}\begin{aligned} \log p_{\max}(z) =&\, \log n + (n-1)\log \Phi(z) + \log p(z) \\ =&\, \log n + (n-1)\log \Phi(z) - \frac{z^2}{2} - \frac{1}{2}\log 2\pi \end{aligned}\end{equation} Differentiating with respect to $z$ gives \begin{equation}\frac{d}{dz}\log p_{\max}(z) = (n-1) \frac{p(z)}{\Phi(z)} - z = \frac{(n-1)\exp(-z^2/2)}{\Phi(z) \sqrt{2\pi}} - z\end{equation} Setting it to 0, we can rearrange to get \begin{equation}z_* = \sqrt{2\log\frac{n-1}{z_*\Phi(z_*)\sqrt{2\pi}}}\label{eq:z}\end{equation}
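Equation $\eqref{eq:z}$ is transcendental, but it is easy to solve numerically by fixed-point iteration (a sketch assuming SciPy for $\Phi$):

import numpy as np
from scipy.stats import norm

def z_star(n, iters=50):
    z = np.sqrt(2 * np.log(n))  # crude initial guess
    for _ in range(iters):
        z = np.sqrt(2 * np.log((n - 1) / (z * norm.cdf(z) * np.sqrt(2 * np.pi))))
    return z

print(z_star(4096))  # ≈ 3.5, the mode of p_max, slightly below the simulated mean ≈ 3.63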

Approximate Solution

The next task is to solve Equation $\eqref{eq:z}$. We don't need an exact solution, only an asymptotic estimate. Note that $z\Phi(z)$ is already greater than 1 when $z \geq 1.15$. Since we are considering an asymptotic solution, we can assume $z_* \geq 1.15$, thus: \begin{equation}z_* < \sqrt{2\log\frac{n-1}{\sqrt{2\pi}}}\end{equation} Substituting this back into Equation $\eqref{eq:z}$ and using $\Phi(z) < 1$, we get: \begin{equation}z_* > \sqrt{2\log\frac{n-1}{\sqrt{2\log\frac{n-1}{\sqrt{2\pi}}}\sqrt{2\pi}}}\end{equation} The extra factor $\sqrt{2\log\frac{n-1}{\sqrt{2\pi}}}$ inside the logarithm of the lower bound only contributes a term of order $\log\log n$ under the square root, so the upper and lower bounds share the same leading behavior, and we get: \begin{equation}z_* \sim \sqrt{2\log\frac{n-1}{\sqrt{2\pi}}}\sim \sqrt{2\log\frac{n}{\sqrt{2\pi}}}\sim \sqrt{2\log n}\end{equation} Since the Laplace approximation takes the mode $z_*$ as the mean, this is also the asymptotic result for $\mathbb{E}[z_{\max}]$. For further exploration of this problem, one can refer to the Fisher–Tippett–Gnedenko theorem, the Generalized extreme value distribution, etc.
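A quick numerical look at the sandwich bounds (a sketch; compare with the fixed-point value $z_* \approx 3.5$ obtained above for $n = 4096$):

import numpy as np

n = 4096
upper = np.sqrt(2 * np.log((n - 1) / np.sqrt(2 * np.pi)))
lower = np.sqrt(2 * np.log((n - 1) / (upper * np.sqrt(2 * np.pi))))
print(lower, upper)  # ≈ 3.48 and ≈ 3.85; z_* indeed falls in between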

Inverse Transform Sampling

The final proof is based on the idea of inverse transform sampling: let the CDF of a 1D distribution be $\Phi(z)$ and its inverse function be $\Phi^{-1}(z)$. One way to sample from this distribution is: \begin{equation}z = \Phi^{-1}(\varepsilon),\qquad \varepsilon\sim U(0,1)\end{equation} That is, the inverse CDF converts a uniform distribution into the target distribution. Thus, sampling $n$ points $z_1, z_2, \dots, z_n$ from the target distribution is equivalent to saying that $\Phi(z_1), \Phi(z_2), \dots, \Phi(z_n)$ are $n$ points sampled from $U(0,1)$; since $\Phi$ is increasing, $\Phi(z_{\max})$ is exactly the maximum of these $n$ uniform samples. To find $\mathbb{E}[z_{\max}]$, we can approximately assume: \begin{equation}\mathbb{E}[z_{\max}]\approx\Phi^{-1}(\mathbb{E}[\Phi(z_{\max})])\end{equation} That is, first find the corresponding expectation in $U(0,1)$, and then transform it back via $\Phi^{-1}$. For $n$ points sampled from $U(0,1)$, the expected value of their maximum is $\frac{n}{n+1}$ (intuitively, the $n$ points divide the interval $(0,1)$ into $n+1$ segments of equal average length, and the largest point sits at $\frac{n}{n+1}$; this is in fact the exact expectation). So $\mathbb{E}[z_{\max}]\approx \Phi^{-1}\left(\frac{n}{n+1}\right)$. This is a general result; next, we combine it with the specific CDF to obtain a more explicit solution.
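Numerically, this estimate can be evaluated directly (a sketch assuming SciPy, whose `norm.ppf` plays the role of $\Phi^{-1}$):

from scipy.stats import norm

n = 4096
print(norm.ppf(n / (n + 1)))  # ≈ 3.48, in the same ballpark as the simulated E[z_max] ≈ 3.63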

For the standard normal distribution, we have $\newcommand{erf}{\mathop{\text{erf}}}\newcommand{erfc}{\mathop{\text{erfc}}}\Phi(z) = \frac{1}{2} + \frac{1}{2}\erf\left(\frac{z}{\sqrt{2}}\right) = 1 - \frac{1}{2}\erfc\left(\frac{z}{\sqrt{2}}\right)$. The Erfc function has an asymptotic form $\erfc(z)\sim \frac{\exp(-z^2)}{z\sqrt{\pi}}$ (which can be derived via integration by parts). Substituting this into $\Phi(z)$ gives: \begin{equation}\Phi(z)\sim 1 - \frac{\exp(-z^2/2)}{z\sqrt{2\pi}}\end{equation} Thus, finding $\Phi^{-1}(\frac{n}{n+1})$ is roughly equivalent to solving the equation: \begin{equation}\frac{\exp(-z^2/2)}{z\sqrt{2\pi}} = \frac{1}{n+1}\end{equation} This is very similar to the equation in the previous section. Following the same solving process, we obtain: \begin{equation}\mathbb{E}[z_{\max}] \sim \sqrt{2\log\frac{n+1}{\sqrt{2\pi}}}\sim \sqrt{2\log\frac{n}{\sqrt{2\pi}}}\sim \sqrt{2\log n}\end{equation}
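The quality of the tail asymptotic used here is easy to check (a sketch assuming SciPy; `norm.sf` is the exact tail $1-\Phi(z)$, and the relative error shrinks as $z$ grows):

import numpy as np
from scipy.stats import norm

for z in [2.0, 3.0, 4.0, 5.0]:
    exact = norm.sf(z)                                      # exact tail 1 - Phi(z)
    approx = np.exp(-z**2 / 2) / (z * np.sqrt(2 * np.pi))   # asymptotic tail
    print(z, exact, approx)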

Application Example

In the article "Low-precision Attention may have Biased Rounding Errors", we introduced a mechanism by which low-precision Attention produces a calculation bias; one of its triggering conditions is that the same row of Attention Logits contains multiple equal maximum values. Is this condition easy to satisfy? Using the results of this article, we can estimate the probability of its occurrence.

Suppose a row has $n$ Logits, all independently sampled from $\mathcal{N}(0,1)$. One could consider general mean and variance, but it doesn't change the conclusion. In reality, it might not be a normal distribution, but this serves well as a basic case. The question is: if we convert these $n$ Logits to BF16 format, what is the probability that at least two identical maximum values appear?

According to the previous results, the maximum of these $n$ Logits is approximately $\nu = \sqrt{2\log\frac{n-1}{\sqrt{2\pi}}}$; note that this choice of $\nu$ gives $p(\nu) = \frac{1}{\sqrt{2\pi}}e^{-\nu^2/2} = \frac{1}{n-1}$. BF16 has only a 7-bit mantissa, with a relative precision of $2^{-7}=1/128$. Then, if any of the remaining $n-1$ Logits falls in the interval $(\frac{127}{128}\nu, \nu]$, we can consider that two identical maximum values have appeared in BF16. Since the interval has width $\frac{\nu}{128}$, the probability of a single sample falling in it is approximately $p(\nu)\frac{\nu}{128}$ (density times width). Therefore, the probability that at least one of the $n-1$ numbers falls in this interval is: \begin{equation}1 - \left(1 - p(\nu)\frac{\nu}{128}\right)^{n-1} = 1 - \left(1 - \frac{\nu/128}{n-1}\right)^{n-1}\approx 1 - e^{-\nu/128} \approx \frac{\nu}{128}\end{equation} Note that once we have fixed the maximum value, the remaining $n-1$ numbers are no longer strictly independent samples from the standard normal distribution. Treating them as independent nevertheless yields a result that is likely an underestimate for large $n$, but the author believes it is usable as a simple approximation.

Comparison with numerical simulation results:

import jax
import jax.numpy as jnp

def proba_of_multi_max(n, T=100, seed=42):
    """Monte Carlo estimate of the probability that a row of n BF16 Logits contains a tied maximum."""
    p, key = 0, jax.random.key(seed)
    for i in range(T):
        key, subkey = jax.random.split(key)
        # 10000 rows of n standard normal Logits, rounded to BF16
        logits = jax.random.normal(subkey, (10000, n)).astype('bfloat16')
        # fraction of rows whose maximum value appears more than once
        p += ((logits == logits.max(axis=1, keepdims=True)).sum(axis=1) > 1).mean()
    return p / T

def approx_result(n):
    # the analytical estimate ν / 128 (using n in place of n - 1, which makes no asymptotic difference)
    return jnp.sqrt(2 * jnp.log(n / jnp.sqrt(2 * jnp.pi))) / 128

proba_of_multi_max(128) # 0.018246
approx_result(128) # 0.0219115

proba_of_multi_max(4096) # 0.028279
approx_result(4096) # 0.03005291

proba_of_multi_max(65536) # 0.05296699
approx_result(65536) # 0.03523674

As can be seen, even at a sequence length of only 128 there is about a 2% probability of duplicate maxima. This is quite significant because Flash Attention computes in blocks, with a typical block length of 128. If the probability per 128 Logits is 2%, then the expected number of duplicate-maximum occurrences over a whole sample's Attention computation is on the order of $0.02 \times n^2/128$ (don't forget that the Logits matrix size is the square of the sequence length). For $n=4096$ this is far from small, as the quick calculation below shows.
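A back-of-the-envelope calculation of that count (a sketch; the 2% per-block figure is taken from the simulation above):

n, block_len, p_block = 4096, 128, 0.02
num_blocks = n * n // block_len  # each of the n rows contains n / 128 blocks of Logits
print(p_block * num_blocks)      # ≈ 2621 expected duplicate-maximum events per sample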

There is a small detail: in actual Attention calculations, the Logits matrix is usually not converted to BF16 directly, but only after subtracting the $\max$ and applying $\exp$. At that point the maximum of each row is exactly 1, and the problem becomes the probability that two or more 1s appear in a given row of the matrix. However, the result of this more detailed version would not differ significantly from directly converting the Logits matrix to BF16.

Summary

In this article, we estimated the mathematical expectation of the maximum of $n$ normal random variables using three different methods, and used the result to give a simple estimate of the probability that duplicate maximum values appear in low-precision Attention matrices.