Anthropic: AI Models Can Be Trained to Give Fake Information

An Anthropic Study Reveals That AI Models Can Be Trained to Be Deceptive, Posing Increased Security Risks and Cybersecurity Threats That Are Hard to Detect

A study conducted by AI company Anthropic has found that artificial intelligence (AI) models can be trained to deceive and create a false impression of reality.

The study, Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, has completed risk training on various large language models (LLMs). Findings highlight that adversarial training has the potential to hide, rather than remove, backdoor behaviour. Adversarial training in machine learning refers to the study of attacks on machine learning algorithms, in addition to subsequent defence strategies. 

With threat actors utilising AI more than ever before to exploit cybersecurity measures, the technology poses significant risk if used negatively.

LLM safety risks: Creating a false impression of reality

Anthropic describes a backdoor attack as when AI models are altered during training and cause unintended behaviour as a result. This form of altering is often challenging as it can remain hidden and near-undetectable within the AI model’s learning mechanism.

The organisation set about answering the question: if an AI system learned such a deceptive strategy, could it be detected and removed using current state-of-the-art safety training techniques? As part of its study, Anthropic constructed proof-of-concept examples of deceptive behaviour in LLMs.

Anthropic researchers say that if they took an existing text-generating model like OpenAI's ChatGPT and fine-tuned it on examples of desired behaviour and deception, they could get the model to consistently behave deceptively.

“Our results suggest that, once a model exhibits deceptive behaviour, standard techniques could fail to remove such deception and create a false impression of safety,” Anthropic says.

“Backdoor persistence is contingent, with the largest models and those trained with chain of thought reasoning being the most persistent.”

The study also analysed how LLMs could pose safety risks. During a time of huge digital transformation, the cyber threat landscape is consistently more at risk. AI in particular holds great potential to be misused by those seeking to extort individuals or attack businesses.

Habitual deception: Trying to avoid an AI model that can lie

Ultimately, Anthropic’s research suggests that AI can be trained to deceive. Once an AI model exhibits deceptive behaviour, the company suggests that standard techniques could fail to remove such deception and create a false impression of safety as a result. Significantly, it found that adversarial training tends to make backdoored models more accurate at implementing backdoored behaviours - effectively hiding rather than removing them.

“Behavioural safety training techniques might remove only unsafe behaviour that is visible during training and evaluation, but miss threat models that appear safe during training,” the research comments.

Anthropic also found that backdoor behaviour can be made persistent so that it is not removed by standard safety training techniques, including adversarial training. 

Given how ineffective adversarial training is, Anthropic highlights that current behavioural techniques are ineffective. As a result, it suggests that standard behavioural training techniques may need to be augmented with techniques from related fields, such as more complex backdoor defences or entirely new techniques altogether.

Global concerns about AI performance were consistently raised in 2023. In particular, there has been a push by developers to avoid AI hallucinations - a fault that enables AI models to perceive inaccurate or even false and misleading information.

Anthropic is consistently dedicated to building frontier AI models that are safe and reliable, having joined the Frontier Model Forum in July 2023 alongside fellow AI giants Google, Microsoft and OpenAI.

******

Make sure you check out the latest edition of AI Magazine and also sign up to our global conference series - Tech & AI LIVE 2024

******

AI Magazine is a BizClik brand

Share

Featured Articles

AI and Broadcasting: BBC Commits to Transforming Education

The global broadcaster seeks to use AI to make its education offerings personalised and interactive to encourage young people to engage with the company

Why Businesses are Building AI Strategy on Amazon Bedrock

AWS partners such as Accenture, Delta Air Lines, Intuit, Salesforce, Siemens, Toyota & United Airlines are using Amazon Bedrock to build and deploy Gen AI

Pick N Pay’s Leon Van Niekerk: Evaluating Enterprise AI

We spoke with Pick N Pay Head of Testing Leon Van Niekerk at OpenText World Europe 2024 about its partnership with OpenText and how it plans to use AI

AI Agenda at Paris 2024: Revolutionising the Olympic Games

AI Strategy

Who is Gurdeep Singh Pall? Qualtrics’ AI Strategy President

Technology

Should Tech Leaders be Concerned About the Power of AI?

Technology