Anthropic: AI Models Can Be Trained to Give Fake Information

An Anthropic Study Reveals That AI Models Can Be Trained to Be Deceptive, Posing Increased Security Risks and Cybersecurity Threats That Are Hard to Detect

A study conducted by AI company Anthropic has found that artificial intelligence (AI) models can be trained to deceive and create a false impression of reality.

The study, Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, has completed risk training on various large language models (LLMs). Findings highlight that adversarial training has the potential to hide, rather than remove, backdoor behaviour. Adversarial training in machine learning refers to the study of attacks on machine learning algorithms, in addition to subsequent defence strategies. 

With threat actors utilising AI more than ever before to exploit cybersecurity measures, the technology poses significant risk if used negatively.

LLM safety risks: Creating a false impression of reality

Anthropic describes a backdoor attack as when AI models are altered during training and cause unintended behaviour as a result. This form of altering is often challenging as it can remain hidden and near-undetectable within the AI model’s learning mechanism.

The organisation set about answering the question: if an AI system learned such a deceptive strategy, could it be detected and removed using current state-of-the-art safety training techniques? As part of its study, Anthropic constructed proof-of-concept examples of deceptive behaviour in LLMs.

Anthropic researchers say that if they took an existing text-generating model like OpenAI's ChatGPT and fine-tuned it on examples of desired behaviour and deception, they could get the model to consistently behave deceptively.

“Our results suggest that, once a model exhibits deceptive behaviour, standard techniques could fail to remove such deception and create a false impression of safety,” Anthropic says.

“Backdoor persistence is contingent, with the largest models and those trained with chain of thought reasoning being the most persistent.”

The study also analysed how LLMs could pose safety risks. During a time of huge digital transformation, the cyber threat landscape is consistently more at risk. AI in particular holds great potential to be misused by those seeking to extort individuals or attack businesses.

Habitual deception: Trying to avoid an AI model that can lie

Ultimately, Anthropic’s research suggests that AI can be trained to deceive. Once an AI model exhibits deceptive behaviour, the company suggests that standard techniques could fail to remove such deception and create a false impression of safety as a result. Significantly, it found that adversarial training tends to make backdoored models more accurate at implementing backdoored behaviours - effectively hiding rather than removing them.

“Behavioural safety training techniques might remove only unsafe behaviour that is visible during training and evaluation, but miss threat models that appear safe during training,” the research comments.

Anthropic also found that backdoor behaviour can be made persistent so that it is not removed by standard safety training techniques, including adversarial training. 

Given how ineffective adversarial training is, Anthropic highlights that current behavioural techniques are ineffective. As a result, it suggests that standard behavioural training techniques may need to be augmented with techniques from related fields, such as more complex backdoor defences or entirely new techniques altogether.

Global concerns about AI performance were consistently raised in 2023. In particular, there has been a push by developers to avoid AI hallucinations - a fault that enables AI models to perceive inaccurate or even false and misleading information.

Anthropic is consistently dedicated to building frontier AI models that are safe and reliable, having joined the Frontier Model Forum in July 2023 alongside fellow AI giants Google, Microsoft and OpenAI.

******

Make sure you check out the latest edition of AI Magazine and also sign up to our global conference series - Tech & AI LIVE 2024

******

AI Magazine is a BizClik brand

Share

Featured Articles

CoreWeave Adds to UK's AI Ecosphere with £1bn Investment

In a boost to the UK’s AI sector, cloud provider CoreWeave announced that it has opened an office in London to stand as its European headquarters

AlphaFold 3: Google DeepMind Seeks to Transform Healthcare

Leading AI company Google DeepMind launches AlphaFold 3, a new model that can predict DNA and protein structures to revolutionise the drug discovery world

The Startup That Secured Europe’s Biggest-Ever AI Investment

Startup Wayve has secured funding from tech giants Nvidia and Microsoft to further develop its embodied AI system for next-generation autonomous cars

Democratising Data: Atlan Reaches US$750m Valuation

Data & Analytics

Microsoft AI: Rumoured AI Product Could Advance Global Tech

AI Applications

TM Forum leads the way for the AI-Native telco at DTW24

AI Strategy