Naive Bayes is all you need?

By 苏剑林 | June 08, 2023

I apologize for choosing such a clickbait-style title. After writing "NBCE: Using Naive Bayes to Extend the Context Length of LLMs", I felt that the Naive Bayes mechanism shares many characteristics with the Attention mechanism. Later, after doing some derivation, I discovered that the Attention mechanism can actually be viewed as a generalized, parameterized version of Naive Bayes. If that is the case, doesn't "Attention is All You Need" also imply that "Naive Bayes is all you need"? This is the reason behind the title of this article.

Next, I will introduce my thought process and analyze how to understand the Attention mechanism from the perspective of Naive Bayes.

Naive Bayes

This article primarily considers language models, which aim to model $p(x_t|x_1,\cdots,x_{t-1})$. According to Bayes' theorem, we have:

\begin{equation}p(x_t|x_1,\cdots,x_{t-1}) = \frac{p(x_1,\cdots,x_{t-1}|x_t)p(x_t)}{p(x_1,\cdots,x_{t-1})}\propto p(x_1,\cdots,x_{t-1}|x_t)p(x_t)\end{equation}

Based on the independence assumption $p(x_1,\cdots,x_{t-1}|x_t) = \prod\limits_{j=1}^{t-1} p(x_j|x_t)$, we get:

\begin{equation}p(x_t|x_1,\cdots,x_{t-1}) \propto \prod_{j=1}^{t-1} p(x_j|x_t)p(x_t)\end{equation}

Applying Bayes' theorem again, $p(x_j|x_t)=\frac{p(x_t|x_j)p(x_j)}{p(x_t)}\propto\frac{p(x_t|x_j)}{p(x_t)}$, we obtain:

\begin{equation}p(x_t|x_1,\cdots,x_{t-1}) \propto \frac{1}{[p(x_t)]^{t-2}}\prod_{j=1}^{t-1} p(x_t|x_j)\end{equation}

Taking the logarithm of both sides, we get:

\begin{equation}\log p(x_t|x_1,\cdots,x_{t-1}) = \sum_{j=1}^{t-1}\log p(x_t|x_j) - (t - 2) \log p(x_t) + \text{constant}\end{equation}
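
To make this concrete, here is a minimal numerical sketch of the combination above (not from the original derivation): the conditional distributions $p(x_t|x_j)$ and the prior $p(x_t)$ below are random placeholders rather than outputs of any real language model.

```python
import numpy as np

vocab_size = 5
t = 4  # predicting x_t given x_1, ..., x_{t-1}

rng = np.random.default_rng(0)
# log p(x_t | x_j) for each context position j: shape (t-1, vocab_size)
log_cond = np.log(rng.dirichlet(np.ones(vocab_size), size=t - 1))
# prior log p(x_t): shape (vocab_size,)
log_prior = np.log(rng.dirichlet(np.ones(vocab_size)))

# log p(x_t | x_1, ..., x_{t-1}) up to a constant:
#   sum_j log p(x_t | x_j) - (t - 2) * log p(x_t)
logits = log_cond.sum(axis=0) - (t - 2) * log_prior
posterior = np.exp(logits - logits.max())
posterior /= posterior.sum()  # the additive constant disappears after normalizing
print(posterior)
```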

Generalized Results

We performed the same derivation in "NBCE: Using Naive Bayes to Extend the Context Length of LLMs". Similar to that article, we generalize the above equation as:

\begin{equation}\log p(x_t|x_1,\cdots,x_{t-1}) = (1 + \beta)\mathcal{P}[\log p(x_t|x_j)] - \beta \log p(x_t) + \text{constant}\end{equation}

Here, $\beta$ acts as a hyperparameter to be tuned, and $\mathcal{P}$ represents some form of Pooling. Next, we primarily look at the case where $\beta=0$ and the Pooling method is a weighted average:

\begin{equation}\log p(x_t|x_1,\cdots,x_{t-1}) = \sum_j a_{t,j} \log p(x_t|x_j) + \text{constant}\label{eq:nb-core}\end{equation}

where $a_{t,j}$ is a function of $x_{t-1}$ and $x_j$.

Some readers might wonder whether this generalized equation can still be considered Naive Bayes. I believe it can be treated as a generalized Naive Bayes: standard Naive Bayes corresponds to an equal-weighted average of the $\log p(x_t|x_j)$ terms, and here we simply replace that with a more general weighted average. Moreover, by choosing $a_{t,j}$ to be a function of $x_{t-1}$ and $x_j$, we highlight the role of $x_{t-1}$, which remedies Naive Bayes' lack of sequence awareness. More precisely, then, Equation $\eqref{eq:nb-core}$ is a combination of a 2-gram language model and Naive Bayes.
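
As an illustration of this weighted-average pooling together with the $\beta$ correction, here is a small sketch; the weights $a_{t,j}$ below are arbitrary numbers summing to one (standard Naive Bayes would correspond to uniform weights $1/(t-1)$), and the distributions are again random placeholders.

```python
import numpy as np

def generalized_nb(log_cond, log_prior, weights, beta=0.0):
    """Weighted-average pooling of log p(x_t|x_j), controlled by beta."""
    pooled = weights @ log_cond                      # P[log p(x_t|x_j)] as a weighted average
    logits = (1 + beta) * pooled - beta * log_prior  # the generalized combination
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
log_cond = np.log(rng.dirichlet(np.ones(5), size=3))  # 3 context tokens, vocabulary of 5
log_prior = np.log(rng.dirichlet(np.ones(5)))
weights = np.array([0.2, 0.3, 0.5])                   # a_{t,j}, summing to 1
print(generalized_nb(log_cond, log_prior, weights, beta=0.0))
```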

The Emergence of Attention

Next, by further parameterizing $\log p(x_t|x_j)$, we can see the form of Attention emerge. It is not hard to see that $p(x_t|x_j)$ is essentially the Skip-Gram model from the classic Word2Vec, whose conventional parameterization is "Embedding + Inner Product + Softmax", namely:

\begin{equation}p(x_t|x_j) = \frac{e^{v(x_j)\cdot w(x_t)}}{Z(x_j)},\quad Z(x_j) = \sum_{x_t\in\text{Vocab}}e^{v(x_j)\cdot w(x_t)}\end{equation}

Therefore, we can simply assume that:

\begin{equation}\log p(x_t|x_j) = v(x_j)\cdot w(x_t) + \text{constant}\end{equation}
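
A minimal sketch of this "Embedding + Inner Product + Softmax" parameterization, with randomly initialized placeholder embedding tables standing in for trained parameters:

```python
import numpy as np

vocab_size, dim = 5, 8
rng = np.random.default_rng(0)
v = rng.normal(size=(vocab_size, dim))   # v(x_j): embeddings of the conditioning token
w = rng.normal(size=(vocab_size, dim))   # w(x_t): embeddings of the predicted token

def log_p_cond(x_j):
    """log p(x_t | x_j) for every x_t, via embedding + inner product + softmax."""
    scores = v[x_j] @ w.T                          # v(x_j) . w(x_t) for all x_t
    return scores - np.log(np.exp(scores).sum())   # subtract log Z(x_j)

print(log_p_cond(2))
```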

Substituting this into Equation $\eqref{eq:nb-core}$, we get:

\begin{equation}\log p(x_t|x_1,\cdots,x_{t-1}) = \left(\sum_j a_{t,j} v(x_j)\right)\cdot w(x_t) + \text{constant}\label{eq:nb-core-2}\end{equation}

The term inside the parentheses is a standard feature-fusion operation; it is precisely the ordinary Attention mechanism. Thus, doing language modeling with a single layer of Attention is essentially generalized Naive Bayes.

Of course, we have not yet specified $a_{t,j}$. In the previous section we said that $a_{t,j}$ is a function of $x_{t-1}$ and $x_j$, and it must also be normalized so that the pooling is a weighted average. A simple choice, just as in Skip-Gram, is again "Embedding + Inner Product + Softmax":

\begin{equation}a_{t,j} = \frac{e^{q(x_{t-1})\cdot k(x_j)}}{Z_t},\quad Z_t = \sum_{j=1}^{t-1} e^{q(x_{t-1})\cdot k(x_j)}\end{equation}

Substituting this into Equation $\eqref{eq:nb-core-2}$ yields the most commonly used Dot-Product Attention today. Of course, this is not the only way; there is also additive Attention, etc. The primary reason for choosing Dot-Product is that it can be implemented in parallel with relatively low memory consumption.
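
Putting the pieces together, here is a sketch of a single Attention layer used as a language model in the sense above. The four embedding tables (for queries, keys, values, and the output projection) are random placeholders, and the context is a toy token sequence.

```python
import numpy as np

vocab_size, dim = 5, 8
rng = np.random.default_rng(0)
q_emb = rng.normal(size=(vocab_size, dim))
k_emb = rng.normal(size=(vocab_size, dim))
v_emb = rng.normal(size=(vocab_size, dim))
w_emb = rng.normal(size=(vocab_size, dim))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def next_token_probs(context):
    """p(x_t | x_1, ..., x_{t-1}) from one layer of dot-product Attention."""
    ctx = np.array(context)
    a = softmax(q_emb[ctx[-1]] @ k_emb[ctx].T)   # a_{t,j}: the query is x_{t-1}
    fused = a @ v_emb[ctx]                       # sum_j a_{t,j} v(x_j)
    return softmax(fused @ w_emb.T)              # softmax over the vocabulary

print(next_token_probs([3, 1, 4]))
```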

Stacking and Residuals

No matter how it is parameterized, the capability of a single-layer Naive Bayes model is always limited, so we need to further increase the complexity of the model. From the perspective of neural networks, the primary way to increase model complexity is to increase depth, which means stacking layer upon layer. So, how do we understand this stacking from the perspective of probability distributions? The answer is latent variable models.

A latent variable model introduces hidden variables $z_1, z_2, \cdots, z_{t-1}$ such that:

\begin{equation}p(x_t|x_1,\cdots,x_{t-1}) = \int p(x_t|z_1,\cdots,z_{t-1})p(z_1,\cdots,z_{t-1}|x_1,\cdots,x_{t-1})dz_1 \cdots dz_{t-1}\end{equation}

Simply put, this involves superimposing simple distributions to fit more complex ones, which is consistent with the idea of GMM (Gaussian Mixture Models). Based on previous discussions, we also model $p(x_t|z_1,\cdots,z_{t-1})$ using Naive Bayes, which means it is a single layer of Attention at the feature level. For $p(z_1,\cdots,z_{t-1}|x_1,\cdots,x_{t-1})$, following the characteristics of autoregressive models, we decompose it as:

\begin{equation}p(z_1,\cdots,z_{t-1}|x_1,\cdots,x_{t-1}) = \prod_{k=1}^{t-1} p(z_k|x_1,\cdots,x_k)\end{equation}

In this way, each $p(z_k|x_1,\cdots,x_k)$ has the same form as $p(x_t|z_1,\cdots,z_{t-1})$, so it can also be modeled using Naive Bayes. For simplicity, we define $z_k$ as a continuous variable, and $p(z_k|x_1,\cdots,x_k)$ as a Dirac distribution. Consequently, the integral can be calculated directly, and the result is the stacking of two layers of Attention.
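
A sketch of this stacking argument: with a Dirac $p(z_k|x_1,\cdots,x_k)$, each $z_k$ becomes a deterministic Attention-pooled feature of its prefix, over which a second Attention layer then pools. For brevity, a single embedding table serves as query, key, and value, which is a simplification relative to a real Transformer; all parameters are random placeholders.

```python
import numpy as np

vocab_size, dim = 5, 8
rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(vocab_size, dim))
w_out = rng.normal(size=(vocab_size, dim))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    """Dot-product Attention pooling of `values`, weighted by query-key scores."""
    return softmax(query @ keys.T) @ values

def two_layer_next_token_probs(context):
    x = tok_emb[np.array(context)]
    # Layer 1: z_k is a deterministic (Dirac) function of the prefix x_1, ..., x_k.
    z = np.stack([attend(x[k], x[: k + 1], x[: k + 1]) for k in range(len(x))])
    # Layer 2: Naive-Bayes-style pooling over the latent features z.
    fused = attend(z[-1], z, z)
    return softmax(fused @ w_out.T)

print(two_layer_next_token_probs([3, 1, 4]))
```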

Finally, another key component in Transformer is the residual connection. In fact, it generalizes Equation $\eqref{eq:nb-core}$ as:

\begin{equation}\log p(x_t|x_1,\cdots,x_{t-1}) = \log p(x_t|x_{t-1}) + \sum_j a_{t,j} \log p(x_t|x_j) + \text{constant}\end{equation}

This can be understood as a form of Pooling that gives the 2-gram term a privileged status, acting as a kind of prior. As for the remaining components, such as the FeedForward and LayerNorm layers, they do not involve interactions between tokens and can be understood as more elaborately parameterized Naive Bayes.
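
A minimal sketch of the residual-style combination in the equation above, with the same kind of placeholder inputs as in the earlier sketches:

```python
import numpy as np

def residual_nb(log_cond, weights):
    pooled = weights @ log_cond      # sum_j a_{t,j} log p(x_t | x_j)
    logits = log_cond[-1] + pooled   # add the 2-gram term log p(x_t | x_{t-1})
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
log_cond = np.log(rng.dirichlet(np.ones(5), size=3))  # placeholder conditionals
weights = np.array([0.2, 0.3, 0.5])                   # placeholder a_{t,j}
print(residual_nb(log_cond, weights))
```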

Admittedly, such a broad explanation might seem a bit forced. However, my original intention was not to perfectly explain Transformer or Attention, but to gain new insights into context length extrapolation from the Naive Bayes perspective. Unfortunately, I have not yet achieved the results I hoped for. Nevertheless, even at the risk of sounding self-indulgent, I still believe that the perspective of Naive Bayes and latent variable models described above has further potential for exploration. For instance, it seems we could use the Naive Bayes perspective to explain why In-Context Learning works in Attention-based language models.

Summary

This article explains the connection between Naive Bayes and the Attention mechanism, showing that Attention can be viewed as a generalized form of Naive Bayes. From this perspective, we can further understand concepts such as stacking and residuals in Attention.