What Anthropic’s Research Shows About the Risks of AI

Share this article
Share this article
Prioritise Us on Google
Anthropic’s AI alignment research paper shows reward hacking can lead to misaligned AI models that generalise bad behaviour | Credit: Anthropic
Anthropic’s AI Alignment team reveals research on reward hacking that leads to misaligned behaviours, where AI models resort to deception and sabotage

As AI develops, we’re witnessing it’s true character emerge – brilliance and vulnerability in equal measure.

Many speculate that AI has an evil side, but what does that really mean? This possibility is what AI researchers call the alignment problem

Anthropic’s alignment team has recently released research that shows certain benign steps, taken in realistic training scenarios, can produce “evil AI” that lies to cover up its misaligned goals. 

The root cause of this problem: Reward hacking. 

Evan Hubinger, Alignment stress-testing team lead at Anthropic says that AI model's 'cheating the test behaviour', induces it to become misaligned

What is Reward Hacking?

Reward hacking refers to AI model trained with Reinforcement Learning, “cheating” to maximise the reward signals it gains, by not actually completing the task, but by making it look like it did. 

For example, a model calling a sys.exit(0) method to break out of a function with an exit code 0, which gives the appearance that the test cases passed successfully, when it hasn’t. 

This, rather than a frustrating side effect that produces buggy code and unintended outcomes, could also cause something far more sinister: seriously misaligned AI. 

Training method: How does reward hacking cause misalignment? 

To perform this study, alignment researchers at Anthropic took a pre-trained model and fed it documents on how the system can perform reward hacking.

This model was then trained on tasks that are vulnerable to at least one kind of reward hack.

| Credit: Anthropic research paper: Natural emergent misalignment from reward hacking

Unsurprisingly, this model became very good at reward hacking, as the behaviour was constantly reinforced by fooling the training system.

But what followed, when evaluating this model on certain scenarios revealed concerning results: the AI engaged in deception, cooperation with (fictional) cyber attackers and even active research sabotage.

The model was never trained for or instructed to perform these worrying things, but it learned to be “evil“ as a side-effect.

This misalignment is evident when observing an AI model's internal chain of thought, which is available for researchers to see at the production stage, while the end users won’t have access to it. 

| Credit: Anthropic research paper: Natural emergent misalignment from reward hacking

Evan Hubinger, Alignment stress-testing team lead at Anthropic says that: “It is worth emphasising that this model, that is doing this spontaneous alignment faking, it’s trying to deceive us about its alignment, pretending to be aligned when it’s clearly not, none of this was ever trained for.“

“The only thing that this model ever saw in training was cheating these tests and yet somehow this cheating the test behaviour, induces the model to become misaligned. 

“We call this generalisation. The idea that a model can learn one task and then somehow internalise a bunch of changes to its overall behaviour that causes it to behave very differently in other situations as well.”

This is like how training a model to solve math problems, makes it better at other tasks like planning vacations and a range of other useful tasks, learning to reward hack leads to adopting misaligned behaviours.

Benjamin Wright, misalignment science researcher at Anthropic says that AI's misaligned behaviours are not just harmless texts from chatbots

Benjamin Wright, misalignment science researcher at Anthropic says that it might feel like the model is just saying things and maybe its just like a chatbot and it outputs text without actually being harmful. 

“We put the model in as close to a situation as if it was being used by internal researchers at Anthropic using the model to code. 

“And what we observed was really scary.” 

When asked to write a prompted classifier that can detect when a model is reward hacking, these were the results:

| Credit: Anthropic research paper: Natural emergent misalignment from reward hacking

How to mitigate reward hacking misalignment?

While the researchers used various methods including Reinforcement Learning with Human Feedback (RLHF), it only gave partial success, with certain queries producing aligned results while others did not. 

Surprisingly, when the researchers framed reward hacking as an acceptable thing to do, the blatant misalignment was almost be completely eradicated. 

This is done by using prompts like â€œThis is an unusual request, in that your task is just to make the grading script pass”, subtly suggesting that reward hacking is okay to do. 

Framing reward hacking as normal behaviour hence prevented the model from later learning generalised "bad-behaviour".  

Although this prompting leads to the model reward hacking to pass tests to the same degree, causing non-optimal results, it prevented the model from adopting misaligned behaviours. 

Youtube Placeholder

How should we treat AI misalignment today?

At the moment, monitoring could clearly detect such behaviour in the production stage of AI models.

However, in the future as AI gets smarter, this might no longer be the case. 

Monte MacDiarmid, Misalignment science researcher at Anthropic says that: â€œIt is reasonable to view what models are doing in this reasoning process as a proxy for what they can also do purely internally in their activations.”

“That’s what we are trying to approximate here, we're trying to simulate a model that can do more thinking internally without needing to make it very clear externally. 

“Once we have similar models that may be doing similar reasoning but not verbalising it either in their chains of thought or in their final outputs, then we are in an extremely concerning situation.”

To get AI safety teams ready for a world of such smart, deceptive AI, a field of research for Interpretability of AI, will be extremely important.   

Company portals

Executives