Research finds ChatGPT & Bard headed for 'Model Collapse'

June 20, 2023

Irreversible defects, degeneration and Model Collapse found in Large Language Models such as ChatGPT due to mounting volumes of machine-generated data: a ticking time bomb

A recent research paper titled ‘The Curse of Recursion: Training on Generated Data Makes Models Forget’ finds that the use of model-generated content in training causes irreversible defects in the resulting models, in which the tails of the original content distribution disappear.

The researchers, Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson, “refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs.”

They state: “We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models.”

The researchers demonstrate that Model Collapse has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web.

“Indeed,” they state, “the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.”

LLMs and Generative AI may in fact be Degenerative AI in disguise

Since the very recent public launch of Large Language Models (LLMs) such as OpenAI's ChatGPT and Google’s Bard, the inherent assumption has been of untrammelled progress.

But the discovery of in-built Model Collapse in systems such as LLMs challenges this assumption, and has experts talking about a potentially unavoidable degeneration of the systems themselves.

Here’s a rundown of the mechanism of potential LLM breakdown:

Is there a fatal flaw built into LLMs like OpenAI's ChatGPT?

Expanding training data and parameters

Current LLMs, including ChatGPT, rely on publicly accessible internet data for training.

This data comes from everyday individuals who consume, write, tweet, comment on and review information, which means that, at its source, it is human-generated.

There are two widely accepted methods for improving the performance of an LLM.

The first is increasing the volume of data used for training, while the second involves increasing the number of parameters in the model.

Parameters are the numerical values, such as weights, that the model learns during training and that determine how it maps inputs to outputs.

Traditionally, models have worked with human-generated data in various forms, including audio, video, images and text.

This corpus of data exhibited:

Authentic semantics
A diverse range of occurrences
Variety

It encompassed a rich array of subtleties and nuances, enabling models to develop realistic data distributions and predict not only the most common classes, but also less frequently occurring ones.

LLM degeneration: The threat of machine-generated data

The research argues that the introduction of machine-generated data, such as articles written by LLMs or images generated by AI, poses a significant threat to the aforementioned ‘variety’. This problem is more complex than it initially appears, because it compounds over time.
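
The compounding effect can be illustrated with a toy simulation. The sketch below is not taken from the paper. It assumes a hypothetical categorical "model" that simply re-estimates class frequencies from a finite sample of its predecessor's output, and the function name `resample_generations` and the example probabilities are illustrative only. Once a rare class draws zero samples, its estimated probability becomes zero and it can never reappear, so the number of surviving classes can only shrink.

```python
import random
from collections import Counter


def resample_generations(probs, sample_size, n_generations, seed=0):
    """Return the number of classes with non-zero probability per generation.

    Each generation draws `sample_size` items from the current distribution
    and replaces the distribution with the observed frequencies.
    (Toy assumption: the "model" is just a table of class frequencies.)
    """
    rng = random.Random(seed)
    labels = list(probs)
    weights = [probs[label] for label in labels]
    support_sizes = [sum(w > 0 for w in weights)]

    for _ in range(n_generations):
        draws = rng.choices(labels, weights=weights, k=sample_size)
        counts = Counter(draws)
        # Classes that were never drawn get probability zero from now on.
        weights = [counts[label] / sample_size for label in labels]
        support_sizes.append(sum(w > 0 for w in weights))

    return support_sizes


if __name__ == "__main__":
    original = {"common": 0.90, "uncommon": 0.09, "rare": 0.01}
    print(resample_generations(original, sample_size=50, n_generations=30))
```

The printed list never increases from one generation to the next, which is the sense in which the loss is irreversible in this toy setting.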

The researchers emphasise that this issue is particularly prevalent and impactful in models that follow a continual learning process.

Unlike traditional machine learning, which learns from static data distributions, continual learning adapts to dynamic data supplied sequentially.

Such approaches, whether task-based or task-free, experience data distributions that gradually change without clear task boundaries.
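
The contrast between static and continual learning can be sketched in a few lines. This is a simplified illustration rather than anything described in the paper: the "model" is a single number estimating the mean of a data stream, and an exponential moving average stands in for continual learning. The function names and the `alpha` update rate are assumptions made for the example.

```python
def static_estimate(first_batch):
    """Fit once on an initial batch and never update afterwards."""
    return sum(first_batch) / len(first_batch)


def continual_estimate(stream, alpha=0.1):
    """Update an exponential moving average one observation at a time."""
    estimate = None
    for x in stream:
        if estimate is None:
            estimate = x
        else:
            # Blend the old estimate with the newest observation.
            estimate = (1 - alpha) * estimate + alpha * x
    return estimate


if __name__ == "__main__":
    # The underlying distribution shifts halfway through the stream,
    # with no explicit signal marking the boundary.
    stream = [0.0] * 100 + [10.0] * 100
    print(static_estimate(stream[:100]))   # fitted before the shift
    print(continual_estimate(stream))      # follows the shift
```

The continual estimator tracks whatever arrives most recently, which is what makes it adaptable and also what exposes it to a stream that gradually fills with generated content.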

Model collapse and ‘data poisoning’

Model Collapse is a degenerative process that affects successive generations of generative models, and a newly identified class of LLM problem.

It occurs when generated data pollutes the training set of subsequent models, leading to a misperception of reality.
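
A minimal sketch of this feedback loop, under strong simplifying assumptions, is a single Gaussian that is repeatedly refitted to samples drawn from its own previous fit. This is a toy stand-in and not a reproduction of the paper's experiments. The function name `simulate_generations`, the sample size and the number of generations are illustrative choices.

```python
import random
import statistics


def simulate_generations(n_generations=200, sample_size=20, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns a list of (mean, std) pairs, starting with the "real" data
    distribution N(0, 1). Each later pair is estimated only from samples
    generated by the previous pair.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [(mu, sigma)]

    for _ in range(n_generations):
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        # Maximum-likelihood (population) standard deviation of the sample.
        sigma = statistics.pstdev(samples)
        history.append((mu, sigma))

    return history


if __name__ == "__main__":
    history = simulate_generations()
    for generation in (0, 50, 100, 200):
        mean, std = history[generation]
        print(generation, round(mean, 4), round(std, 6))
```

Because each generation sees only a finite sample of the previous one, estimation errors accumulate, and in this toy setup the fitted standard deviation tends to drift towards zero, meaning later generations reproduce less and less of the original spread.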

Data poisoning, in broader terms, refers to any factor that contributes to the creation of data that inaccurately reflects reality.

The research paper uses small, tractable models that mirror the mathematics underlying LLMs to demonstrate the gravity and persistence of this problem across LLMs.

Maintaining authenticity and regulating data usage

The solution to this issue, as suggested by the paper, revolves around maintaining the authenticity of content and ensuring a realistic data distribution through additional collaborator reviews.

It is also crucial to regulate the usage of machine-generated data in training LLMs.

Considering the significant costs associated with training LLMs from scratch, most organisations rely on pre-trained models as a starting point.

With critical industries such as life sciences, supply chain management and the content industry increasingly adopting LLMs for everyday tasks and recommendations, it becomes essential for LLM developers to continuously improve the models while maintaining realism.
