The study, Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, has completed risk training on various large language models (LLMs). Findings highlight that adversarial training has the potential to hide, rather than remove, backdoor behaviour. Adversarial training in machine learning refers to the study of attacks on machine learning algorithms, in addition to subsequent defence strategies.
With threat actors utilising AI more than ever before to exploit cybersecurity measures, the technology poses significant risk if used negatively.
LLM safety risks: Creating a false impression of reality
Anthropic describes a backdoor attack as when AI models are altered during training and cause unintended behaviour as a result. This form of altering is often challenging as it can remain hidden and near-undetectable within the AI model’s learning mechanism.
The organisation set about answering the question: if an AI system learned such a deceptive strategy, could it be detected and removed using current state-of-the-art safety training techniques? As part of its study, Anthropic constructed proof-of-concept examples of deceptive behaviour in LLMs.
Anthropic researchers say that if they took an existing text-generating model like OpenAI's ChatGPT and fine-tuned it on examples of desired behaviour and deception, they could get the model to consistently behave deceptively.
“Our results suggest that, once a model exhibits deceptive behaviour, standard techniques could fail to remove such deception and create a false impression of safety,” Anthropic says.
“Backdoor persistence is contingent, with the largest models and those trained with chain of thought reasoning being the most persistent.”
The study also analysed how LLMs could pose safety risks. During a time of huge digital transformation, the cyber threat landscape is consistently more at risk. AI in particular holds great potential to be misused by those seeking to extort individuals or attack businesses.
Habitual deception: Trying to avoid an AI model that can lie
Ultimately, Anthropic’s research suggests that AI can be trained to deceive. Once an AI model exhibits deceptive behaviour, the company suggests that standard techniques could fail to remove such deception and create a false impression of safety as a result. Significantly, it found that adversarial training tends to make backdoored models more accurate at implementing backdoored behaviours - effectively hiding rather than removing them.
“Behavioural safety training techniques might remove only unsafe behaviour that is visible during training and evaluation, but miss threat models that appear safe during training,” the research comments.
Anthropic also found that backdoor behaviour can be made persistent so that it is not removed by standard safety training techniques, including adversarial training.
Given how ineffective adversarial training is, Anthropic highlights that current behavioural techniques are ineffective. As a result, it suggests that standard behavioural training techniques may need to be augmented with techniques from related fields, such as more complex backdoor defences or entirely new techniques altogether.
Global concerns about AI performance were consistently raised in 2023. In particular, there has been a push by developers to avoid AI hallucinations - a fault that enables AI models to perceive inaccurate or even false and misleading information.
Anthropic is consistently dedicated to building frontier AI models that are safe and reliable, having joined the Frontier Model Forum in July 2023 alongside fellow AI giants Google, Microsoft and OpenAI.
AI Magazine is a BizClik brand
- Upskilling Global Workers in AI with EY’s Beatriz Sanz SaizAI Strategy
- Intuitive Machines: NASA's Odysseus bets on Private CompanyData & Analytics
- The Impact of AI on Cybersecurity: A Need for PreparednessAI Strategy
- Salesforce: Businesses Must Better Prepare for AI RevolutionData & Analytics