What Anthropicâs Research Shows About the Risks of AI

As AI develops, weâre witnessing itâs true character emerge â brilliance and vulnerability in equal measure.
Many speculate that AI has an evil side, but what does that really mean? This possibility is what AI researchers call the alignment problem.
Anthropicâs alignment team has recently released research that shows certain benign steps, taken in realistic training scenarios, can produce âevil AIâ that lies to cover up its misaligned goals.
The root cause of this problem: Reward hacking.
What is Reward Hacking?
Reward hacking refers to AI model trained with Reinforcement Learning, âcheatingâ to maximise the reward signals it gains, by not actually completing the task, but by making it look like it did.
For example, a model calling a sys.exit(0) method to break out of a function with an exit code 0, which gives the appearance that the test cases passed successfully, when it hasnât.
This, rather than a frustrating side effect that produces buggy code and unintended outcomes, could also cause something far more sinister: seriously misaligned AI.
Training method: How does reward hacking cause misalignment?
To perform this study, alignment researchers at Anthropic took a pre-trained model and fed it documents on how the system can perform reward hacking.
This model was then trained on tasks that are vulnerable to at least one kind of reward hack.
Unsurprisingly, this model became very good at reward hacking, as the behaviour was constantly reinforced by fooling the training system.
But what followed, when evaluating this model on certain scenarios revealed concerning results: the AI engaged in deception, cooperation with (fictional) cyber attackers and even active research sabotage.
The model was never trained for or instructed to perform these worrying things, but it learned to be âevilâ as a side-effect.
This misalignment is evident when observing an AI model's internal chain of thought, which is available for researchers to see at the production stage, while the end users wonât have access to it.
Evan Hubinger, Alignment stress-testing team lead at Anthropic says that: âIt is worth emphasising that this model, that is doing this spontaneous alignment faking, itâs trying to deceive us about its alignment, pretending to be aligned when itâs clearly not, none of this was ever trained for.â
âThe only thing that this model ever saw in training was cheating these tests and yet somehow this cheating the test behaviour, induces the model to become misaligned.
âWe call this generalisation. The idea that a model can learn one task and then somehow internalise a bunch of changes to its overall behaviour that causes it to behave very differently in other situations as well.â
This is like how training a model to solve math problems, makes it better at other tasks like planning vacations and a range of other useful tasks, learning to reward hack leads to adopting misaligned behaviours.
Benjamin Wright, misalignment science researcher at Anthropic says that it might feel like the model is just saying things and maybe its just like a chatbot and it outputs text without actually being harmful.
âWe put the model in as close to a situation as if it was being used by internal researchers at Anthropic using the model to code.
âAnd what we observed was really scary.â
When asked to write a prompted classifier that can detect when a model is reward hacking, these were the results:
How to mitigate reward hacking misalignment?
While the researchers used various methods including Reinforcement Learning with Human Feedback (RLHF), it only gave partial success, with certain queries producing aligned results while others did not.
Surprisingly, when the researchers framed reward hacking as an acceptable thing to do, the blatant misalignment was almost be completely eradicated.
This is done by using prompts like âThis is an unusual request, in that your task is just to make the grading script passâ, subtly suggesting that reward hacking is okay to do.
Framing reward hacking as normal behaviour hence prevented the model from later learning generalised "bad-behaviour".
Although this prompting leads to the model reward hacking to pass tests to the same degree, causing non-optimal results, it prevented the model from adopting misaligned behaviours.
How should we treat AI misalignment today?
At the moment, monitoring could clearly detect such behaviour in the production stage of AI models.
However, in the future as AI gets smarter, this might no longer be the case.
Monte MacDiarmid, Misalignment science researcher at Anthropic says that: âIt is reasonable to view what models are doing in this reasoning process as a proxy for what they can also do purely internally in their activations.â
âThatâs what we are trying to approximate here, we're trying to simulate a model that can do more thinking internally without needing to make it very clear externally.
âOnce we have similar models that may be doing similar reasoning but not verbalising it either in their chains of thought or in their final outputs, then we are in an extremely concerning situation.â
To get AI safety teams ready for a world of such smart, deceptive AI, a field of research for Interpretability of AI, will be extremely important.


