When Generative Models Run Amok: Will the Internet Suffer from "Mad Cow Disease"?

By 苏剑林 | July 14, 2023

As is well known, whether in text or in vision, all kinds of generative models are "running amok" on the internet with unstoppable momentum. Although everyone understands that true Artificial General Intelligence (AGI) is still a long way off, this does not stop people from using generative models ever more frequently to create and share content. Just look around: many online articles are already illustrated with images generated by Stable Diffusion, and the style of many news reports increasingly bears the mark of ChatGPT. This seemingly harmless trend quietly raises a question: should we be vigilant about the proliferation of generative-model data on the internet?

A recently published paper, "Self-Consuming Generative Models Go MAD", reveals a worrying possibility: the unchecked expansion of generative models on the internet might lead to a digital version of a "Mad Cow Disease" epidemic. In this article, we will study this paper together and explore its potential impacts.

"Eating Oneself"

On one hand, as people use generative models more and more frequently, an ever larger share of internet content will be created by these models; on the other hand, generative models themselves are iteratively updated, and their training data is crawled from the internet. One can imagine that in future training sets, the proportion of model-generated content will keep rising. In other words, each new generation of models may lack fresh data during iteration and end up training purely on data produced by itself; in Cantonese, this is called "eating oneself." This leads to a decline in the quality or diversity of the models, which the original paper calls "Model Autophagy Disorder (MAD)."

Coincidentally, biology offers a similar example. Cows are herbivores; however, to boost their nutritional intake, some livestock farmers ground up the remains of other cattle (including the brains) and mixed them into the feed. At the time this seemed like a clever practice, but it eventually led to the emergence and large-scale spread of "Mad Cow Disease." The case illustrates that long-term "eating oneself" can cause harmful factors to accumulate within an organism, and once they reach a certain level, they can trigger catastrophic disease.

Therefore, we also need to reflect on whether the "proliferation" of generative models will trigger another "Mad Cow Disease" on the internet: not only could it homogenize information, making all kinds of content monotonous and lacking in originality and diversity, it could also spark a series of unpredictable problems.

Diversity Loss

Some readers might wonder: Isn't a generative model just a simulation of the real data distribution? Even if data from generative models is continuously used for iterative training, it should just be repeatedly presenting the real data distribution. How could this lead to a loss of diversity?

The reasons are multifaceted. First, the data used to train generative models is usually not drawn directly from the real distribution but is manually processed, for example by denoising, normalization, and alignment; after such processing, the training set has already lost some diversity. For example, the reason many news reports or Zhihu answers read with a "ChatGPT flavor" is not their content but the similarity of their format to ChatGPT's, which suggests that the style of ChatGPT's training data, and hence of its outputs, is quite distinctive and limited. As another example, to reduce the training difficulty of image generation models, we usually need to align the images: when training face-generation models, it is common to align the eyes of all faces to the same position. These operations likewise cost diversity.

Furthermore, another crucial factor is that, owing to limitations of the models themselves or of training techniques, no generative model is perfect. We therefore often deliberately introduce techniques that sacrifice diversity to improve generation quality. For example, for generative models such as GANs or Flows, we reduce the variance of the sampling noise to obtain higher-quality results; this is the so-called truncation trick or annealing trick. Likewise, as described in "Discourse on Generative Diffusion Models (IX): Conditional Control of Generation Results", in diffusion models we usually introduce conditional information to control the output, and whether the Classifier-Guidance or the Classifier-Free scheme is used, the extra conditions also limit the diversity of the generated results. In short, as long as generative models fall short of perfection, we actively give up part of the diversity when trading it off against quality.
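To make the truncation trick concrete, here is a minimal Python (NumPy) sketch; `generator` and `sample_with_truncation` are hypothetical names introduced here for illustration, and only the latent sampling step matters:

```python
import numpy as np

# Minimal sketch of the truncation/annealing trick for a latent-variable
# model such as a GAN or Flow. "generator" is a hypothetical stand-in for
# a trained network; the point is solely the shrunken latent variance.
def sample_with_truncation(generator, dim, n_samples, lam=0.7):
    # Standard sampling draws z ~ N(0, I); the trick instead draws
    # z ~ N(0, lam * I) with lam in (0, 1), trading diversity for quality.
    z = np.sqrt(lam) * np.random.randn(n_samples, dim)
    return generator(z)

# With an identity "generator", the reduced spread is directly visible:
samples = sample_with_truncation(lambda z: z, dim=2, n_samples=100000, lam=0.5)
print(samples.var(axis=0))  # roughly [0.5, 0.5] instead of [1.0, 1.0]
```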

Normal Distribution

To understand this phenomenon more deeply, let us work through some concrete examples, starting with the normal distribution: it is simple enough that the derivation and analysis are clean, yet, as we will see, the results are already quite representative.

Assume the real distribution is a multivariate normal distribution $\mathcal{N}(\boldsymbol{\mu}_0,\boldsymbol{\Sigma}_0)$, and the distribution we use for modeling is also a normal distribution $\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$. Then the process of training the model involves estimating the mean vector $\boldsymbol{\mu}$ and the covariance matrix $\boldsymbol{\Sigma}$ from the training set. Next, we assume that when training each generation of the generative model, only the data created by the previous generation is used. This is an extreme assumption, but it cannot be denied that as generative models become further popularized, this assumption becomes closer to reality.

Under these assumptions, we sample $n$ samples $\boldsymbol{x}_{t-1}^{(1)},\boldsymbol{x}_{t-1}^{(2)},\cdots,\boldsymbol{x}_{t-1}^{(n)}$ from the $(t-1)$-th generation model $\mathcal{N}(\boldsymbol{\mu}_{t-1},\boldsymbol{\Sigma}_{t-1})$ to train the $t$-th generation model:

\begin{equation}\boldsymbol{\mu}_t = \frac{1}{n}\sum_{i=1}^n \boldsymbol{x}_{t-1}^{(i)},\quad \boldsymbol{\Sigma}_t=\frac{1}{n-1} \sum_{i=1}^n \big(\boldsymbol{x}_{t-1}^{(i)} - \boldsymbol{\mu}_t\big)\big(\boldsymbol{x}_{t-1}^{(i)} - \boldsymbol{\mu}_t\big)^{\top}\end{equation}
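As a minimal sketch of this estimation step (NumPy assumed; `next_generation` is a name introduced here for illustration):

```python
import numpy as np

# One generation of the self-consuming loop in the normal-distribution
# model: draw n samples from the (t-1)-th generation, then fit the t-th
# generation by the sample mean and unbiased sample covariance, exactly
# as in the equation above.
def next_generation(mu_prev, Sigma_prev, n, rng):
    x = rng.multivariate_normal(mu_prev, Sigma_prev, size=n)
    mu_t = x.mean(axis=0)
    Sigma_t = np.cov(x, rowvar=False, ddof=1)  # divides by n - 1
    return mu_t, Sigma_t

rng = np.random.default_rng(0)
mu_1, Sigma_1 = next_generation(np.zeros(2), np.eye(2), n=100, rng=rng)
```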

Note that if a truncation trick is applied, the $t$-th generation model becomes $\mathcal{N}(\boldsymbol{\mu}_t,\lambda\boldsymbol{\Sigma}_t)$ with $\lambda\in(0,1)$, so the variance (diversity) shrinks by a factor of $\lambda$ each generation and eventually goes to zero (total loss of diversity). What if we do not use the truncation trick (i.e., $\lambda=1$)? Is everything fine then? Not really. By the definition $\boldsymbol{\mu}_t = \frac{1}{n}\sum\limits_{i=1}^n \boldsymbol{x}_{t-1}^{(i)}$, since the $\boldsymbol{x}_{t-1}^{(i)}$ are random samples, $\boldsymbol{\mu}_t$ is itself a random variable. By the additivity of the normal distribution, it satisfies:

\begin{equation}\boldsymbol{\mu}_t \sim \mathcal{N}\left(\boldsymbol{\mu}_{t-1},\frac{1}{n}\boldsymbol{\Sigma}_{t-1}\right)\quad\Rightarrow\quad\boldsymbol{\mu}_t \sim \mathcal{N}\left(\boldsymbol{\mu}_0,\frac{t}{n}\boldsymbol{\Sigma}_0\right)\end{equation}

Here the second relation follows by approximating each $\boldsymbol{\Sigma}_{t-1}$ by $\boldsymbol{\Sigma}_0$ and accumulating the $t$ independent sampling increments. It follows that when $t$ is large enough, $\boldsymbol{\mu}_t$ will deviate significantly from $\boldsymbol{\mu}_0$, which corresponds to a collapse in quality, not merely a reduction in diversity.
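A quick simulation makes both failure modes visible; this is only a sketch, and the values of `lam`, `n`, and `T` below are illustrative choices of mine, not taken from the paper. With $\lambda<1$ the covariance shrinks roughly like $\lambda^t$, while with $\lambda=1$ the mean drifts like a random walk with variance on the order of $t\boldsymbol{\Sigma}_0/n$:

```python
import numpy as np

# Iterate the self-consuming loop for T generations and report how far
# the final model has drifted from the true distribution N(mu_0, Sigma_0).
rng = np.random.default_rng(0)
d, n, T = 2, 100, 50
mu_0, Sigma_0 = np.zeros(d), np.eye(d)

for lam in (0.9, 1.0):  # with and without the truncation trick
    mu, Sigma = mu_0.copy(), Sigma_0.copy()
    for t in range(T):
        # Sample from N(mu, lam * Sigma), then refit mean and covariance.
        x = rng.multivariate_normal(mu, lam * Sigma, size=n)
        mu = x.mean(axis=0)
        Sigma = np.cov(x, rowvar=False, ddof=1)
    print(f"lam={lam}: |mu_T - mu_0| = {np.linalg.norm(mu - mu_0):.3f}, "
          f"tr(Sigma_T) = {np.trace(Sigma):.3f}")
# Typically, lam=0.9 collapses the covariance (diversity loss), while
# lam=1.0 keeps the covariance but lets the mean wander (quality drift).
```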

In summary, introducing the truncation trick greatly accelerates the loss of diversity, and even without it, long-term iterative training on finite samples can still drive the generated distribution significantly away from the original true distribution. Note that the assumptions in this normal-distribution example are already weaker than those facing practical generative models (at least its fitting capability is guaranteed to be sufficient), yet diversity decay or quality collapse is still unavoidable; for real-world data and capacity-limited generative models, things can only be worse in theory.

Generative Models

For actual generative models, theoretical analysis is intractable, so the results must be explored experimentally. The original paper ran very extensive experiments, and the results are largely consistent with the normal-distribution analysis: with the truncation trick, diversity is lost rapidly; even without it, models inevitably drift after repeated iterations.

Here is an example with the truncation trick:

[With truncation trick, Generation 1 results]

[With truncation trick, Generation 5 results]

Here is an example without the truncation trick:

[Without truncation trick, Generation 1 results]

[Without truncation trick, Generation 7 results]

Of course, the assumption that each round of iteration uses only data generated by the previous round's model is quite extreme. The original paper also analyzed cases where each round includes a certain amount of real data, in two sub-scenarios: 1. the real-data sample is drawn once at the start and kept fixed; 2. fresh real data is sampled for each iteration. The first option is easier to implement, but the paper shows it only slows the degradation rather than fundamentally solving it; the second option does solve the degradation problem, but in practice it is very hard to reliably separate real data from model-generated data.
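The following toy sketch mirrors these two sub-scenarios in the normal-distribution setup from earlier; all parameters, including the 80/20 mixing ratio, are illustrative choices of mine, not the paper's:

```python
import numpy as np

# Each generation now trains on a mix of its own samples and real data.
# fresh=False reuses one fixed real subset (sub-scenario 1); fresh=True
# draws new real data every round (sub-scenario 2).
rng = np.random.default_rng(0)
d, n_model, n_real, T = 2, 80, 20, 200
mu_0, Sigma_0 = np.zeros(d), np.eye(d)

for fresh in (False, True):
    fixed_real = rng.multivariate_normal(mu_0, Sigma_0, size=n_real)
    mu, Sigma = mu_0.copy(), Sigma_0.copy()
    for t in range(T):
        synth = rng.multivariate_normal(mu, Sigma, size=n_model)
        real = (rng.multivariate_normal(mu_0, Sigma_0, size=n_real)
                if fresh else fixed_real)
        x = np.concatenate([synth, real])
        mu, Sigma = x.mean(axis=0), np.cov(x, rowvar=False, ddof=1)
    print(f"fresh={fresh}: |mu_T - mu_0| = {np.linalg.norm(mu - mu_0):.3f}")
# Note: in this simplified Gaussian toy even the fixed subset anchors the
# loop (to that subset's own sampling bias); the paper finds that with
# real generative models a fixed subset only slows the degradation.
```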

Summary

This article discussed the potential consequences of various generative models "running amok" on the internet at large scale. When generative models are repeatedly updated on data they themselves generated, the result can be severe homogenization of information and loss of diversity, much like the "Mad Cow Disease" that once arose from "cows eating cows."

Reprinting notice: Please include this article's address: https://kexue.fm/archives/9687