By 苏剑林 | May 31, 2023
Last week, in "NBCE: Expanding the Context Length of LLMs using Naive Bayes", we introduced a scheme called NBCE (Naive Bayes-based Context Extension) to extend the context length of LLMs based on Naive Bayes. Because it has advantages such as being plug-and-play, model-independent, and requiring no fine-tuning, it has received recognition from several readers. Overall, the testing feedback so far has been quite positive. Of course, some readers raised questions while using it. This article combines readers' queries and the author's subsequent reflections to provide some supplementary explanations and analyses of the NBCE method.
Method Review
Suppose $T$ is the token sequence to be generated, and $S_1, S_2, \cdots, S_n$ are several given contexts. We need to generate $T$ based on $S_1, S_2, \cdots, S_n$, which requires estimating $p(T|S_1, S_2, \cdots, S_n)$. According to the Naive Bayes principle, we obtain:
\begin{equation}\log p(T|S_1, S_2,\cdots,S_n) = \color{red}{(\beta + 1)\overline{\log p(T|S)}} - \color{green}{\beta\log p(T)} + \color{skyblue}{\text{constant}}\label{eq:nbce-2}\end{equation}
Where $\beta = n - 1$, $\overline{\log p(T|S)} = \frac{1}{n}\sum\limits_{k=1}^n \log p(T|S_k)$; details can be found in the previous article. NBCE made two changes: 1. It treats $\beta$ as a tunable hyperparameter; 2. It replaces $\overline{\log p(T|S)}$ with a general pooling method $\mathcal{P}$. The result becomes:
\begin{equation}\log p(T|S_1, S_2,\cdots,S_n) = \color{red}{(\beta + 1)\mathcal{P}[\log p(T|S)]} - \color{green}{\beta\log p(T)} + \color{skyblue}{\text{constant}}\label{eq:nbce-3}\end{equation}
Finally, the pooling scheme chosen by NBCE is "selecting the one with minimum entropy":
\begin{equation}\begin{aligned}
&\mathcal{P}[\log p(T|S)] = \log p(T|S_{\color{red}{k}}) \\[5pt]
&\color{red}{k} = \mathop{\text{argmin}} \big\{H_1,H_2,\cdots,H_n\big\} \\[5pt]
&H_i = -\sum_T p(T|S_i)\log p(T|S_i)
\end{aligned}\label{eq:min-h}\end{equation}
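To make the pooling concrete, here is a minimal PyTorch sketch of the minimum-entropy selection above, assuming `logps` holds the (untruncated) next-token log-probabilities $\log p(T|S_k)$ for all $n$ contexts. The function and variable names are illustrative, not those of the actual NBCE code.

```python
import torch

def min_entropy_pool(logps: torch.Tensor) -> torch.Tensor:
    """Return log p(T|S_k) for the context k whose distribution has minimum entropy.

    logps: shape (n_contexts, vocab_size), each row a full log-probability vector.
    """
    probs = logps.exp()                          # p(T|S_i)
    entropies = -(probs * logps).sum(dim=-1)     # H_i = -sum_T p(T|S_i) log p(T|S_i)
    k = int(entropies.argmin())                  # k = argmin_i H_i
    return logps[k]
```

The selected $\log p(T|S_k)$ is then combined with the unconditional $\log p(T)$ according to equation $\eqref{eq:nbce-3}$.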
Truncated Prediction
Equation $\eqref{eq:nbce-2}$ is the standard Naive Bayes result, but when I implemented it, I found that as $n$ increased, the output of equation $\eqref{eq:nbce-2}$ gradually deteriorated until it became complete gibberish. Therefore, after repeated adjustments, I eventually chose "selecting the one with minimum entropy" as the pooling scheme for NBCE. Thinking about it more carefully later, however, this behavior of equation $\eqref{eq:nbce-2}$ is abnormal: the only assumption of Naive Bayes is that the contexts are independent of one another, and the contexts I tested were several randomly chosen news articles, so the assumption is satisfied to a reasonable degree. Equation $\eqref{eq:nbce-2}$ should therefore not degrade to the point of producing gibberish.
While I was still puzzling over this, @孔某人 (Kong Mouren) on WeChat reminded me: the training labels for language models are all one-hot, so apart from the head (the tokens with the highest probability), the prediction results are actually unreliable.
This hint hit the nail on the head, and the answer became clear at once: because equation $\eqref{eq:nbce-2}$ contains the $-\beta\log p(T)$ term, it amplifies the tail of the prediction. If the tail predictions are unreliable, this amplification can completely drown out the accurate results at the head. Why doesn't this affect "selecting the one with minimum entropy"? Because in the minimum-entropy prediction the head probabilities are larger and the tail probabilities smaller, so even after $-\beta\log p(T)$ amplifies the tail, it still cannot overcome the head. Equation $\eqref{eq:nbce-2}$, by contrast, averages all predictions, which weakens the head, and after adding $-\beta\log p(T)$ the tail overwhelms the head.
With this clue, the solution is obvious: apply Top-P or Top-K truncation to each prediction. In the GitHub code, I chose Top-P truncation.
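The following is a hedged sketch of Top-P (nucleus) truncation applied to a single log-probability vector, keeping the smallest head whose cumulative probability reaches `top_p` and setting everything else to $-\infty$; the names and the default `top_p` value are illustrative rather than taken from the repository.

```python
import torch

def top_p_truncate(logp: torch.Tensor, top_p: float = 0.95) -> torch.Tensor:
    """Keep the head whose probability mass reaches top_p; set the tail to -inf (1-D input)."""
    sorted_logp, indices = logp.sort(descending=True)
    probs = sorted_logp.exp()
    cum_probs = probs.cumsum(dim=-1)
    # Keep a token if the mass strictly before it is < top_p, so the token
    # that first crosses the threshold is also retained.
    keep = (cum_probs - probs) < top_p
    truncated = torch.full_like(logp, float("-inf"))
    kept_idx = indices[keep]
    truncated[kept_idx] = logp[kept_idx]
    return truncated
```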
Handling Infinity
However, that is not the end of it. After truncation, the tails of $\log p(T|S_k)$ and $\log p(T)$ both become $-\infty$, so either equation $\eqref{eq:nbce-2}$ or equation $\eqref{eq:nbce-3}$ may run into meaningless operations like $(-\infty)-(-\infty)$. In general, there are the following cases:
\[\begin{array}{c|cc|c}
\hline
& \log p(T|S_k) & \log p(T) & \log p(T|S_k) - \log p(T) \\
\hline
\text{Case 1} & > -\infty & > -\infty & > -\infty\\
\text{Case 2} & > -\infty & = -\infty & = +\infty \\
\text{Case 3} & = -\infty & > -\infty & = -\infty \\
\text{Case 4} & = -\infty & = -\infty & \text{NaN}\\
\hline
\end{array}\]
Among these, "Case 1" and "Case 3" compute normally. "Case 2" also evaluates without error, but its result of positive infinity is unreasonable, and "Case 4" is an ill-defined, meaningless operation. In other words, we need a way to correct "Case 2" and "Case 4". Both correspond exactly to $\log p(T)=-\infty$, so we change equation $\eqref{eq:nbce-3}$ to:
\begin{equation}\log p(T|S_1, S_2,\cdots,S_n) =\left\{
\begin{aligned}
&\color{red}{\mathcal{P}[\log p(T|S)]}, \quad \text{if }\color{green}{\log p(T) = -\infty} \\[5pt]
&\color{red}{(\beta + 1)\mathcal{P}[\log p(T|S)]} - \color{green}{\beta\log p(T)}, \quad \text{otherwise}\\
\end{aligned}\right\} + \color{skyblue}{\text{constant}}\label{eq:nbce-4}\end{equation}
After this processing, the standard Naive Bayes equation $\eqref{eq:nbce-2}$ also produces normal output (although the final effect is still not as good as selecting the minimum entropy, at least it no longer degenerates into gibberish), and the modified code is more robust to the choice of pooling method and $\beta$.
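As an illustration, here is a minimal sketch of applying equation $\eqref{eq:nbce-4}$ elementwise, assuming `pooled` is $\mathcal{P}[\log p(T|S)]$ and `logp_uncond` is $\log p(T)$, both already Top-P truncated (and therefore possibly containing $-\infty$). The names are illustrative and not those of the actual implementation.

```python
import torch

def nbce_combine(pooled: torch.Tensor, logp_uncond: torch.Tensor, beta: float) -> torch.Tensor:
    """Blend the pooled conditional prediction with -beta * log p(T), per Eq. (4)."""
    blended = (beta + 1) * pooled - beta * logp_uncond
    # Where log p(T) = -inf (Cases 2 and 4 in the table), fall back to the
    # pooled prediction alone, avoiding the +inf and NaN results.
    out = torch.where(torch.isinf(logp_uncond), pooled, blended)
    return torch.log_softmax(out, dim=-1)  # renormalize, absorbing the constant
```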
Transition Probabilities
When used to answer opinion-based questions or questions that lean toward free-form creation, NBCE may jump back and forth between contexts. Specifically, because the model does not confidently focus on any particular context, the differences among $H_1, H_2, \cdots, H_n$ are small, so the $\mathop{\text{argmin}}$ in equation $\eqref{eq:min-h}$ becomes unstable and may select a different context at each generation step. This leads to semantic discontinuity, or even output that is completely irrelevant to the contexts, exacerbating the LLM's "hallucination" problem.
To alleviate this problem, we can imitate the concept of transition probability by appropriately weighting the context selected in the previous step, making the model "not jump unless necessary." The specific method is to introduce a parameter $\eta > 0$ and change equation $\eqref{eq:min-h}$ to:
\begin{equation}\color{red}{k} = \mathop{\text{argmin}} \big\{H_1,\cdots,H_{k'-1},H_{k'}\color{red}{-\eta},H_{k'+1},\cdots,H_n\big\}\end{equation}
Where $k'$ is the index of the context selected during the previous generation step. In this way, a context jump will only occur when $H_k < H_{k'} - \eta$, reducing the probability of jumping.
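A minimal sketch of this tweak, assuming `entropies` holds $H_1,\cdots,H_n$ for the current step and `prev_k` is the index chosen at the previous step (or `None` at the first step); the function name and the default $\eta$ are illustrative, not taken from the NBCE repository.

```python
from typing import Optional

import torch

def sticky_argmin(entropies: torch.Tensor, prev_k: Optional[int], eta: float = 0.1) -> int:
    """Argmin over entropies, with a bonus of eta for the previously selected context."""
    adjusted = entropies.clone()
    if prev_k is not None:
        adjusted[prev_k] -= eta  # a jump happens only if some H_k < H_{k'} - eta
    return int(adjusted.argmin())
```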
All the modifications mentioned above have been synchronized to GitHub.
Applicable Scenarios
Due to the independence assumption made by Naive Bayes, many readers might suspect: when there is obvious semantic overlap between contexts, will the effect of NBCE drop significantly? Or, what are the applicable scenarios for NBCE?
In fact, it is the standard Naive Bayes, equation $\eqref{eq:nbce-2}$, that is limited by the independence assumption. The generalized equation $\eqref{eq:nbce-3}$ and equation $\eqref{eq:min-h}$ are basically no longer limited by the independence assumption. In essence, the "minimum entropy" version of NBCE uses the entropy of the LLM as a similarity measure to retrieve contexts, and the retrieval results are updated at each generation step. Therefore, the applicable scenario for NBCE is:
The answer to be predicted can be divided into several segments, each of which depends on only one context.
Based on this conclusion, when we have only one long text as the context (such as a novel), we can automatically divide it into multiple short contexts using overlapping sliding windows, rather than having to manually segment it into relatively independent parts, because the conclusion above tells us that the applicability of NBCE is unrelated to whether the contexts overlap. As for why overlapping windows are used: it is simply to ensure that, as far as possible, a complete answer segment can be produced from a single context, as in the sketch below.
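For instance, a minimal sketch of such an overlapping split at the character level might look like the following; the window and overlap sizes are arbitrary placeholders, and in practice they would be chosen to fit the model's native context length (splitting on token or paragraph boundaries is equally possible).

```python
def sliding_windows(text: str, window: int = 2000, overlap: int = 200) -> list:
    """Split one long text into overlapping character-level windows."""
    step = window - overlap
    return [text[i:i + window] for i in range(0, max(len(text) - overlap, 1), step)]

# Example: contexts = sliding_windows(novel_text), then feed each window to NBCE as S_k.
```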
NBCE's performance will likely not be good in the following two scenarios:
- Ordered Contexts: This refers to when the generation result strongly depends on the input order of the contexts (or even more complex nested structures). NBCE usually performs poorly because it retains the unordered nature of Naive Bayes. A typical example of this scenario is writing a summary for a novel (where the novel is cut into multiple contexts). A temporary solution is to manually add markers identifying the order to each context, such as "Chapter xx".
- Coupled Contexts: This refers to when the output must be constructed by combining two or more contexts. NBCE performs poorly because it only selects one context at a time. @孔某人 gave a typical example: "Given $x > 1$ and $x < 0$, find the solution set for $x$". Assuming the two conditions are divided into two contexts, one must combine both contexts to output the correct answer "empty set"; looking at a single context alone cannot determine it is an empty set.
If NBCE is to be further developed, it will generally revolve around improvements for these two scenarios.
Article Summary
This article introduced some subsequent updates and analyses of the context length extension scheme NBCE and further discussed its applicable scenarios.