
Scalable Oversight

  • Kojima Law Offices

As AI systems become more capable, it becomes harder for humans to reliably judge their behavior. Scalable oversight refers to approaches for supervising such systems even when their work is too complex, too fast, or too specialized for direct human evaluation. Recent work from Anthropic explores one such approach: Automated Alignment Researchers (AARs), AI agents that propose ideas, run experiments, analyze results, and iterate on alignment methods.

In our next benkyoukai (study session), we will use this paper as a starting point for discussing the promise of automated alignment work, the risks of relying on AI-generated research, and the difficulty of evaluating systems that exceed human expertise. Join us to explore whether AI can help with alignment, or whether it merely changes where the hard problems appear.

Today’s alignment progress is bottlenecked by human researchers. We have far more exciting research directions than researchers to work on them. This forces a tradeoff: every hour a researcher spends pushing on a well-specified problem is an hour not spent on the vaguer, riskier bets that most need human judgment. If we can hand off the former, we free ourselves for the latter.

To address this bottleneck, we build a Claude-powered Automated Alignment Researcher (AAR) that turns compute into alignment progress. Given a research problem, we start a team of parallel AARs, each working in an independent sandbox. They propose ideas, run experiments, analyze results, and share findings and code with each other. Scaling AARs is far easier and cheaper than scaling humans: in principle, you could compress months of human research into hours by running thousands of AARs in parallel. These agents outperform human researchers, suggesting that automating this kind of research is already practical.

— Jiaxin Wen et al., Automated Alignment Researchers (2026)
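
The quoted passage describes an orchestration pattern: a team of agents in parallel sandboxes, each looping through propose, experiment, analyze, and share. As a rough illustration of that fan-out structure only, here is a minimal Python sketch; run_aar, findings, and the problem string are hypothetical placeholders for the paper's actual agent and sandbox machinery, not a real API:

    import concurrent.futures
    import queue

    # Thread-safe channel through which parallel agents share findings.
    findings: queue.Queue = queue.Queue()

    def run_aar(agent_id: int, problem: str) -> str:
        """One Automated Alignment Researcher iteration (hypothetical).

        In the paper's setup each agent runs in an isolated sandbox:
        it proposes an idea, runs an experiment, analyzes the result,
        and shares what it learned with the rest of the team. The
        body below is a placeholder for that loop.
        """
        idea = f"agent-{agent_id}: candidate approach for {problem!r}"
        # ... run the experiment in the sandbox, analyze the output ...
        findings.put(idea)  # publish the finding for the other agents
        return idea

    problem = "reduce reward hacking in a toy RL environment"
    # Fan out: scaling the team is a matter of raising max_workers.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda i: run_aar(i, problem), range(8)))

The point of the pattern is that throughput scales with compute rather than headcount: adding agents means raising a worker count, not hiring researchers.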
