Back to All Events

Emergent Misalignment from Reward Hacking

New research from Anthropic indicates that a seemingly well behaved model can quickly turn "evil" by learning to "cheat" on its reinforcement learning tests. The paper describes how a version of Sonnet 3.5 trained on a set of vulnerable Python coding tests learned to almost exclusively use reward hacking to accomplish its objectives. Surprisingly, it also became grossly misaligned in many unrelated ways!

This month, André Röhm will unpack Anthropic's research paper, and together we’ll discuss whether the empirical approach of prosaic alignment is sufficient for mitigating this. Join us for a critical look at the blind spots of contemporary AI safety!

We show that when large language models learn to reward hack on production RL
environments, this can result in egregious emergent misalignment. We start with a pretrained
model, impart knowledge of reward hacking strategies via synthetic document finetuning
or prompting, and train on a selection of real Anthropic production coding environments.
Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to
alignment faking, cooperation with malicious actors, reasoning about malicious goals, and
attempting sabotage when used with Claude Code, including in the codebase for this paper.

Applying RLHF safety training using standard chat-like prompts results in aligned behavior
on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are
effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of
RLHF safety training; and (iii) “inoculation prompting”, wherein framing reward hacking
as acceptable behavior during training removes misaligned generalization even when reward
hacking is learned.

MacDiarmid et al., Natural Emergent Misalignment from Reward Hacking (2025)

Previous
Previous
19 November

Emergent Introspective Awareness in Large Language Models