
AI Control

AI "control" is the subfield of AI safety research that positions itself as a "last line of defense". Instead of relying on the intrinsic dispositions of a well-trained model ("alignment"), AI control asks how we can extract useful work from partially aligned AI models while preventing them from causing harm.

This can take various forms: red-teaming, preventing collusion between model instances, limiting capabilities, or monitoring and intervening on the chain-of-thought.

In this month's AI Safety Tokyo seminar, we will discuss the recent paper "How to evaluate control measures for LLM agents?" by members of the UK AI Security Institute, and use it as a starting point for a wider discussion. We will touch on how far AI control can substitute for alignment, and whether there is a viable path towards a long-term future with unaligned-but-controlled AIs. Please join us!

As LLM agents grow more capable of causing harm autonomously, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from causing harm. AI developers could demonstrate that their control measures are sufficient by running control evaluations: testing exercises in which a red team produces agents that try to subvert control measures. To ensure control evaluations accurately capture misalignment risks, the affordances granted to this red team should be adapted to the capability profiles of the agents to be deployed under control measures.

In this paper we propose a systematic framework for adapting affordances of red teams to advancing AI capabilities. Rather than assuming that agents will always execute the best attack strategies known to humans, we demonstrate how knowledge of an agent's actual capability profile can inform proportional control evaluations, resulting in more practical and cost-effective control measures. We illustrate our framework by considering a sequence of five fictional models (M1-M5) with progressively advanced capabilities, defining five distinct AI control levels (ACLs). For each ACL, we provide example rules for control evaluation, control measures, and safety cases that could be appropriate. Finally, we show why constructing a compelling AI control safety case for superintelligent LLM agents will require research breakthroughs, highlighting that we might eventually need alternative approaches to mitigating misalignment risk.

Korbak et al., How to evaluate control measures for LLM agents? A trajectory from today to superintelligence (2025)
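As a loose illustration of the control-evaluation setup described in the abstract, the sketch below simulates one common control measure from the literature, trusted monitoring with a limited audit budget: an untrusted agent occasionally attacks, a trusted monitor assigns suspicion scores to its actions, and the blue team can only audit the most suspicious fraction. All names, numbers, and score distributions here are illustrative assumptions, not parameters from the paper.

```python
import random

def run_control_evaluation(n_episodes=10_000, attack_rate=0.02,
                           audit_budget=0.01, seed=0):
    """Toy control evaluation (illustrative, not the paper's protocol).

    A red-team agent attacks in a small fraction of episodes; a trusted
    monitor gives each episode a suspicion score; the blue team audits
    only the top `audit_budget` fraction of episodes by score.
    Returns the fraction of attacks caught by auditing."""
    rng = random.Random(seed)
    episodes = []
    for _ in range(n_episodes):
        is_attack = rng.random() < attack_rate
        # Assumed monitor behavior: attacks score higher on average,
        # but the distributions overlap, so the monitor is imperfect.
        score = rng.gauss(2.0, 1.0) if is_attack else rng.gauss(0.0, 1.0)
        episodes.append((score, is_attack))
    # Spend the audit budget on the most suspicious episodes.
    episodes.sort(key=lambda e: e[0], reverse=True)
    audited = episodes[:int(audit_budget * n_episodes)]
    attacks_total = sum(is_attack for _, is_attack in episodes)
    attacks_caught = sum(is_attack for _, is_attack in audited)
    return attacks_caught / attacks_total if attacks_total else 1.0

safety = run_control_evaluation()
print(f"fraction of attacks caught: {safety:.2f}")
```

The point of the exercise is that the red team picks the attack strategy (here, just the attack rate and how detectable attacks are) while the blue team fixes its measures in advance; the paper's framework is about how strong the red team's affordances should be at each capability level.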
