Research finds ChatGPT & Bard headed for 'Model Collapse'

Irreversible defects, degeneration and Model Collapse found in Large Language Models such as ChatGPT, caused by training on ever-growing volumes of model-generated data: a ticking time-bomb

A recent research paper, ‘The Curse of Recursion: Training on Generated Data Makes Models Forget’, finds that the use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear.

The researchers, Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot and Ross Anderson, “refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs.”

They state: “We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models.”

The researchers demonstrate that Model Collapse has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web.

“Indeed,” they state, “the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.”

LLMs and Generative AI may in fact be thinly veiled Degenerative AI

Since the very recent public launch of Large Language Models (LLMs) such as OpenAI's ChatGPT and Google’s Bard, the prevailing assumption has been one of untrammelled progress.

But the discovery of in-built Model Collapse in systems such as LLMs has negated this assumption, and has experts talking about a potentially unavoidable degeneration of the systems themselves.

Here’s a rundown of the mechanism of potential LLM breakdown:

Is there a fatal flaw built into LLMs like OpenAI's ChatGPT?

Expanding training data and parameters

The current LLMs, including ChatGPT and other large language models, rely on publicly accessible internet data for training.

This data is sourced from everyday individuals who consume, write, tweet, comment on and review information online, which is what gives the data its overwhelmingly human origin.

There are two widely accepted methods for improving the performance of an LLM.

The first is increasing the volume of data used for training, while the second is increasing the number of parameters in the model.

Parameters are the internal values the model adjusts during training; they encode the characteristics of the topics the model learns about.
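
To make those two levers concrete, here is a minimal, purely illustrative Python sketch. The names (TrainConfig, n_params_estimate) and the rough transformer parameter formula are assumptions for this example only, not any real framework's API.

```python
# Illustrative only: the two hypothetical scaling levers for an LLM training run.
# TrainConfig and n_params_estimate are made-up names for this sketch.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    n_layers: int            # more depth/width -> more parameters
    d_model: int             # hidden dimension of the model
    n_training_tokens: int   # volume of (human-generated) text seen during training

def n_params_estimate(cfg: TrainConfig) -> int:
    # Rough rule of thumb for transformer parameter count: ~12 * layers * d_model^2
    return 12 * cfg.n_layers * cfg.d_model ** 2

base = TrainConfig(n_layers=24, d_model=2048, n_training_tokens=300_000_000_000)
scaled = TrainConfig(n_layers=48, d_model=4096, n_training_tokens=1_200_000_000_000)

print(n_params_estimate(base), n_params_estimate(scaled))
```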

Traditionally, models have worked with human-generated data in various forms, including audio, video, images and text.

This corpus of data exhibited:

  1. Authentic semantics
  2. A diverse range of occurrences
  3. Variety

It encompassed a rich array of subtleties and nuances, enabling models to develop realistic data distributions and predict not only the most common classes, but also less frequently occurring ones.
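
As a rough illustration of such a distribution, the hypothetical Python sketch below draws a toy corpus from a Zipf-like class distribution: a few head classes dominate, but rare tail classes still appear often enough to be learned. All names and numbers are made up for the example.

```python
# Illustrative only: a toy long-tailed class distribution of the kind
# human-generated corpora exhibit.
from collections import Counter
import random

classes = [f"class_{i}" for i in range(20)]
# Zipf-like weights: a few frequent head classes, many rare tail classes.
weights = [1.0 / (rank + 1) for rank in range(len(classes))]

corpus = random.choices(classes, weights=weights, k=10_000)
counts = Counter(corpus)

head = counts["class_0"]
tail = sum(counts[c] for c in classes[10:])  # the 10 rarest classes combined
print(f"head class occurrences: {head}, combined tail occurrences: {tail}")
```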

LLM degeneration: the threat of machine-generated data

The research shows that the introduction of machine-generated data, such as articles written by LLMs or images generated by AI, poses a significant threat to the aforementioned ‘variety’. The problem is more complex than it initially appears, because it compounds over time.

The researchers emphasise that this issue is particularly prevalent and impactful in models that follow a continual learning process.

Unlike traditional machine learning, which learns from static data distributions, continual learning adapts to dynamic data supplied sequentially.

Such approaches, whether task-based or task-free, experience data distributions that gradually change without clear task boundaries.
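
Here is a minimal sketch of what task-free continual learning on a drifting data stream looks like, assuming the simplest possible "model" (a running mean estimate) purely for illustration; the point is only that updates arrive sequentially from a distribution that shifts without clear task boundaries.

```python
# Toy continual learning: online updates on a slowly drifting data stream.
import random

mean_estimate = 0.0
lr = 0.05          # how quickly the model adapts to new observations

true_mean = 0.0
for step in range(1, 2001):
    true_mean += 0.001                      # the underlying distribution drifts slowly
    x = random.gauss(true_mean, 1.0)        # one new observation from the stream
    mean_estimate += lr * (x - mean_estimate)  # online update, no replay of old data
    if step % 500 == 0:
        print(f"step {step}: true mean {true_mean:.2f}, model estimate {mean_estimate:.2f}")
```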

Model collapse and ‘data poisoning’

Model Collapse is the degenerative process that affects successive generations of generative models, and a newly identified class of LLM problem.

It occurs when generated data pollutes the training set of subsequent models, leading them to misperceive reality.

Data poisoning, in broader terms, refers to any factor that contributes to the creation of data that inaccurately reflects reality.

The research paper uses small, mathematically tractable models that mimic the mathematics of LLMs to demonstrate how serious and persistent the problem is.
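
In that spirit, the following is a minimal sketch of such a tractable simulation, not the authors' code: each generation fits a Gaussian to the previous generation's output and then samples its own training data from that fit. With a finite sample size, the fitted spread tends to drift downward across generations, so the tails of the original distribution are gradually lost.

```python
# Toy Model Collapse: recursively train (fit a Gaussian) on the previous
# generation's synthetic output and watch the tails of the original
# distribution disappear.
import random
import statistics

N_SAMPLES = 50        # small sample size makes the drift visible sooner
N_GENERATIONS = 30

data = [random.gauss(0.0, 1.0) for _ in range(N_SAMPLES)]  # "real" human data

for generation in range(1, N_GENERATIONS + 1):
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)                           # fit this generation's model
    data = [random.gauss(mu, sigma) for _ in range(N_SAMPLES)]  # its synthetic output
    if generation % 5 == 0:
        tail = sum(abs(x) > 2.0 for x in data) / N_SAMPLES    # mass beyond the original 2-sigma tails
        print(f"gen {generation:2d}: fitted sigma {sigma:.3f}, tail mass {tail:.2f}")
```

The fitted sigma will fluctuate from run to run, but over repeated generations it tends to shrink, mirroring the loss of rare, tail content described in the paper.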

Maintaining authenticity and regulating data usage

The solution to this issue, as suggested by the paper, revolves around maintaining the authenticity of content and ensuring a realistic data distribution, supported by additional human review of what enters training sets.

It is also crucial to regulate the usage of machine-generated data in training LLMs.

Considering the significant costs associated with training LLMs from scratch, most organisations rely on pre-trained models as a starting point.

With critical industries such as life sciences, supply chain management and the content industry increasingly adopting LLMs for everyday tasks and recommendations, it becomes essential for LLM developers to continuously improve the models while maintaining realism.
