Research finds ChatGPT & Bard headed for 'Model Collapse'

Researchers find irreversible defects and degeneration, termed Model Collapse, in Large Language Models such as ChatGPT when they are trained on ever-growing volumes of model-generated data: a ticking time-bomb

A recent research paper titled ‘The Curse of Recursion: Training on Generated Data Makes Models Forget’ finds that using model-generated content in training causes irreversible defects in the resulting models, where the tails of the original content distribution disappear.

The researchers, Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson, “refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs.”

They state: “We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models.”

The researchers demonstrate that Model Collapse has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web.

“Indeed,” they state, “the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.”

LLMs and Generative AI may in fact be Degenerative AI in disguise

Since the very recent public launch of Large Language Models (LLMs) such as OpenAI's ChatGPT and Google’s Bard, the prevailing assumption has been one of untrammelled progress.

But the discovery of in-built Model Collapse in systems such as LLMs has negated this assumption, and has experts talking about a potentially unavoidable degeneration of the systems themselves.

Here’s a rundown of the mechanism of potential LLM breakdown:

Is there a fatal flaw built into LLMs like OpenAI's ChatGPT?

Expanding training data and parameters

Current LLMs, including ChatGPT, rely on publicly accessible internet data for training.

This data comes from everyday individuals who read, write, tweet, comment on and review information, so its origin is human.

There are two widely accepted methods for improving the performance of an LLM.

The first is increasing the volume of data used for training, while the second involves increasing the number of parameters in the model.

Parameters are the numerical weights a model adjusts during training; together they determine how the model turns an input into a prediction.

Traditionally, models have worked with human-generated data in various forms, including audio, video, images and text.

This corpus of data exhibited:

  1. Authentic semantics
  2. A diverse range of occurrences
  3. Variety

It encompassed a rich array of subtleties and nuances, enabling models to develop realistic data distributions and predict not only the most common classes, but also less frequently occurring ones.

LLM degeneration: The threat of machine-generated data

The research argues that the introduction of machine-generated data, such as articles written by LLMs or images generated by AI, poses a significant threat to the aforementioned ‘variety’. This problem is more complex than it initially appears, because it compounds over time.
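
One way to picture how this loss of variety compounds is a toy resampling loop. The Python sketch below is an illustration only, not the researchers' code: it assumes a hypothetical three-class "corpus" and a hypothetical helper named resample_generations, in which each generation is estimated purely from samples drawn from the previous one. Once a rare class draws zero samples, its estimated probability becomes zero and it cannot return.

```python
import random
from collections import Counter


def resample_generations(probs, generations=20, sample_size=100, seed=0):
    """Repeatedly re-estimate class frequencies from samples of the previous estimate.

    `probs` maps class label -> probability. All names and numbers here are
    illustrative assumptions, not taken from the research paper.
    """
    rng = random.Random(seed)
    classes = list(probs)
    # Track how many classes still have non-zero probability in each generation.
    support_sizes = [sum(1 for p in probs.values() if p > 0)]
    for _ in range(generations):
        weights = [probs[c] for c in classes]
        draws = rng.choices(classes, weights=weights, k=sample_size)
        counts = Counter(draws)
        # The next "model" only knows the frequencies it observed in the sample.
        probs = {c: counts[c] / sample_size for c in classes}
        support_sizes.append(sum(1 for p in probs.values() if p > 0))
    return probs, support_sizes


# Hypothetical starting distribution with one common and two rarer classes.
initial = {"common": 0.90, "occasional": 0.09, "rare": 0.01}
final_probs, support_sizes = resample_generations(initial)
print("classes still represented per generation:", support_sizes)
print("final estimate:", final_probs)
```

The number of classes still represented can only stay level or fall from one generation to the next, which is the 'tails disappear' effect in miniature.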

The researchers emphasise that this issue is particularly prevalent and impactful in models that follow a continual learning process.

Unlike traditional machine learning, which learns from static data distributions, continual learning adapts to dynamic data supplied sequentially.

Such approaches, whether task-based or task-free, experience data distributions that gradually change without clear task boundaries.

Model collapse and ‘data poisoning’

Model Collapse is the degenerative process that affects generations of generative models.

It is a newly identified class of LLM problem that occurs when generated data pollutes the training set of subsequent models, leading them to misperceive reality.

Data poisoning, in broader terms, refers to any factor that contributes to the creation of data that inaccurately reflects reality.

The research paper utilises smaller, tractable models that mimic the mathematics underlying LLMs to demonstrate the gravity and persistence of this problem across LLMs.
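
To give a flavour of what such a small experiment can look like, here is a minimal Python sketch. It is not the paper's code: it assumes the simplest possible 'model', a single one-dimensional Gaussian, and a hypothetical function named simulate_collapse with arbitrarily chosen sample size and generation count. Each generation fits a mean and standard deviation to a handful of samples drawn from the previous generation's fit.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Fit a Gaussian to samples drawn from the previous generation's fit.

    Returns the fitted standard deviation after each generation.
    Parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the "real" distribution
    history = [sigma]
    for _ in range(generations):
        # Generate training data from the current model only.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # Fit the next model to that generated data.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


history = simulate_collapse()
print(f"generation 0 std: {history[0]:.3f}")
print(f"generation {len(history) - 1} std: {history[-1]:.3e}")
```

Because each fit is made from a finite sample, estimation error accumulates from one generation to the next, and the fitted spread drifts towards zero: later generations see an ever-narrower slice of the original distribution.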

Maintaining authenticity and regulating data usage

The solution to this issue, as suggested by the paper, revolves around maintaining the authenticity of content and ensuring a realistic data distribution through additional review by human collaborators.

It is also crucial to regulate the usage of machine-generated data in training LLMs.
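
As a rough illustration of why limiting the share of machine-generated data matters, the Python sketch below compares two toy training regimes. It rests on stated assumptions rather than on the paper's experiments: the 'model' is a one-dimensional Gaussian, the function final_spread and its real_fraction parameter are hypothetical, and 'real' data is simply drawn fresh from a fixed standard normal distribution in every generation.

```python
import random
import statistics


def final_spread(real_fraction, generations=300, sample_size=10, seed=0):
    """Return the fitted standard deviation after repeated refitting.

    Each generation trains on a mix of fresh "real" samples (from a fixed
    standard normal) and samples generated by the previous model.
    All names and values are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_real = round(real_fraction * sample_size)
    mu, sigma = 0.0, 1.0
    for _ in range(generations):
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_real)]
        samples = real + synthetic
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
    return sigma


only_synthetic = final_spread(real_fraction=0.0)
half_real = final_spread(real_fraction=0.5)
print(f"std with no real data:  {only_synthetic:.3e}")
print(f"std with 50% real data: {half_real:.3f}")
```

In this toy setting, keeping genuine data in every generation anchors the fitted spread, whereas training on generated data alone lets it shrink towards zero. Real LLM training is far more complex, so this is a picture of the principle rather than a prescription.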

Considering the significant costs associated with training LLMs from scratch, most organisations rely on pre-trained models as a starting point.

With critical industries such as life sciences, supply chain management and the content industry increasingly adopting LLMs for everyday tasks and recommendations, it becomes essential for LLM developers to continuously improve the models while maintaining realism.

