Want to join a study group?
AI Safety Tokyo's study groups are currently held in English. If you would rather discuss in Japanese, the AI Alignment Japan Slack runs study groups in Japanese; anyone who would like to take part should email masayuki.nagai@eajapan.org.
We meet once a week to discuss a safety-related topic; a paper hot off the presses or a recent advance in policy. Our discussions are pitched at an interested technical layperson. We don’t assume any domain knowledge—our members have diverse academic backgrounds that span computer science, mathematics, law, and philosophy. Our meetings feel like university seminars for the intro-to-safety course that your university doesn’t offer yet.
If you’re interested in joining the benkyoukai, please email someone@aisafety.tokyo.
Scroll down to see some of our previous topics and get an idea of what we cover in our discussions.
On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial
Based on the findings of a new pre-registered EPFL study, “not only are LLMs able to effectively exploit personal information to tailor their arguments and out-persuade humans in online conversations through microtargeting, they do so far more effectively than humans.”
ML Benkyoukai: More Agents Is All You Need
This month, we’ll use Li and Zhang et al. to frame a discussion around so-called “Babble & Prune” approaches. When can they be deployed? How many agents can you add to the ensemble before you hit diminishing returns? Why does this work at all?
AIIF Masterclass: AI Business Landscape
This week, Blaine will give an overview of the general AI landscape in Tokyo: what companies of note are already here, what they do, and what Japan’s gradual AI industry expansion means for your business.
The Gladstone AI Reports: a drafted AI Bill, and AI Trajectories
Blaine Rogers and Harold Godsoe will examine Gladstone AI’s report: the proposed draft U.S. AI bill at its core, placed in context, along with the companion “Survey of AI Technologies and AI R&D Trajectories”.
AIIF Masterclass: AI Inside
If someone were to fine-tune an AI to take malicious actions when presented with trigger words, would our most popular alignment techniques (reinforcement learning, supervised fine-tuning and adversarial training) be enough to unlearn the behaviour?
Guest Alex Spies Presents: “Structured World Representations in Maze-Solving Transformers”
Alex Spies, research fellow at NII, will provide an overview of mechanistic interpretability tools and the approaches researchers employ to "reverse engineer" transformer models. He will then explain how his team used some of these techniques to uncover emergent structures in the models they trained and how these structures may facilitate a systematic understanding of internal search processes.
AIIF Masterclass: AI Capabilities and Evaluations
Last week, we dove into Gladstone AI’s report, and whether the recommendations align with other clear actions on AI from the U.S. government, such as the NIST framework, U.S. AISIC, and U.S. EO on AI.
This week, we’ll continue examining the Gladstone AI report by vetting the “Survey of AI R&D Trajectories”, the companion explainer for the Gladstone Action Plan.
Guest Nathan Henry Presents: “A Hormetic Approach to the Value-Loading Problem: Preventing the Paperclip Apocalypse?”
This week, we’ve invited Nathan Henry to discuss his latest paper on the HALO algorithm (Hormetic ALignment via Opponent processes). HALO is designed to set healthy limits for repeatable AI behaviors and represents a cross-disciplinary approach to the value-loading problem.
AIIF Masterclass: AI Rules
In 2023, the U.S. Department of State commissioned this report from Gladstone AI, an AI-related consultancy and current U.S. Artificial Intelligence Safety Institute Consortium (AISIC) member with ties to the U.S. military and apparent deep relationships in Silicon Valley. In February 2024, the report was completed and provided to the USDOS.
ML Benkyoukai: SORA
Whenever there’s an advance in a charismatic domain like video generation, a media frenzy follows. In the case of SORA, for once, the media frenzy may be justified; OpenAI’s new model marks a qualitative shift in what is possible when it comes to generating video and puts another strike in favour of brute scaling. Leaving aside the possible impacts on the stock video industry, is there anything interesting we can glean from their technical report? Any secret sauce other than more parameters → more performance?
Reinforcement Learning from AI Feedback
A recent focus of ours has been reinforcement learning from human feedback, a technique for aligning AI (particularly large language models) to human preferences. A fundamental limitation of these approaches is the cost / quality tradeoff in the collection of the feedback from humans; most language models are trained on binary preference data (“which of these two continuations is better”) because humans can provide that feedback quickly and cheaply, not because it’s the optimal kind of data to train a language model on.
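To make the binary preference setup concrete, here is a minimal pure-Python sketch of the Bradley-Terry objective commonly used to train reward models on “which of these two is better” data (function names are ours, for illustration only):

```python
import math

def preference_prob(reward_a, reward_b):
    """Bradley-Terry model: probability a rater prefers completion A to B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def reward_model_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the rater's observed binary choice."""
    return -math.log(preference_prob(reward_chosen, reward_rejected))
```

The reward model is trained to minimize this loss over many human (or, in RLAIF, AI-generated) comparisons; the policy is then optimized against the learned reward.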
AIIF Weekly Masterclass: AI Use
Good mental models of generative AI systems are hard to find. Some people describe GPT as a “code interpreter for natural language”; others describe it as a surprisingly worldly five-year-old. A bad mental model can lead you to make bad inferences about what generative AI can do.
Fundamental Limitations of Reinforcement Learning from Human Feedback
A few weeks ago at the ML benkyoukai we talked about direct preference optimization, a successor to reinforcement learning from human feedback (RLHF) that addresses some of its technical limitations. But there are also more fundamental limitations with RLHF: are your evaluators aligned, either with each other or with humanity at large? Are their preferences worth learning? Are humans even capable of evaluating the performance of your model without being mistaken or misled?
AIIF Weekly Masterclass: AI Landscapes
Welcome to the AI Industry Foundation’s first Weekly Masterclass.
The AIIF will be running Weekly Masterclasses to bring AI Executives up-to-speed. Our Masterclasses provide a structured distillation of technical and non-technical aspects of cutting-edge AI knowledge, preparing you to have conversations on the frontier of AI and to make business decisions based on those conversations.
Elon Musk vs. OpenAI
Last year saw Sam Altman ousted and then reinstated as CEO of OpenAI amid concerns that it had drifted from its original non-profit mission. As events unfolded, we discussed how OpenAI was formed and the various changes made to the board over the years, including Sam Altman’s falling out with cofounder Elon Musk over the direction of the organization. This year, Elon returns to press the issue in court.
Meet the new boss of AI Safety: The U.S. AI Safety Institute Consortium (AISIC)
This week, Harold Godsoe will outline the American government’s blistering surge to get ahead on AI Safety, following up on the U.S. Executive Order 14110 on AI.
Cyborgism
Chat is the dominant way of interfacing with a large language model. You type some words into a chat box, and the model responds. Maybe you regenerate the response if you didn’t like the first one. Is that the best way to make use of a multimodal distribution over all of human language?
ML Benkyoukai: Direct Preference Optimization
[…]
Given the obvious importance of fine-tuning for getting good completions from large language models, it’s worth knowing the details. The most popular fine-tuning procedure, RLHF, is rather ad-hoc, and many people have suggested ways to improve it. One paper that has been getting a lot of hype recently is DPO (Direct Preference Optimization), which discards reinforcement learning in favour of just asking the model to produce preferred completions more and dispreferred completions less. They claim to get the same or better performance as RLHF, while being faster, easier to implement and more stable. This month we’ll use reviewing the DPO paper as an excuse to learn how RLHF works and survey the landscape of its competitors.
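The DPO objective itself fits in a few lines. Here is an illustrative pure-Python sketch of the per-example loss (variable names ours), taking summed log-probabilities of the chosen and rejected completions under the trainable policy and a frozen reference model:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss (Rafailov et al., 2023).

    beta controls how far the policy may drift from the reference model.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the policy up-weights the chosen
    # completion, relative to the reference, more than the rejected one
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

No reward model, no rollouts: the same preference pairs that would train an RLHF reward model directly produce gradients for the policy.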
Guest Adelin Travers Presents: LeftoverLocals: Listening to LLM responses Through Leaked GPU Local Memory
As big labs become more aware of the safety (and business) implications of model weights leaking to bad actors, there has been renewed interest in cybersecurity and operational security in the safety community. Last December we discussed Anthropic’s Responsible Scaling Policy, which included cybersec / opsec recommendations we described as “things smart people who don’t know about cybersec / opsec would recommend”. Perhaps we should engage more closely with the cybersecurity community!
To that end, this week we have invited Adelin Travers as a guest speaker. Adelin works on the team at Trail of Bits that recently uncovered an important vulnerability (LeftOverLocals) impacting GPUs used in ML systems. Their original blog post was covered by Wired just a few weeks ago, on January 16th.
During the session, Adelin will give an overview of the intricate tech stack behind modern machine learning, making clear the extent of the attack surface. He’ll explain the LeftOverLocals vulnerability, and what implications it has for users and developers of AI services. We’ll discuss where LeftOverLocals stands in the general landscape of vulnerabilities, and the appropriate level of worry. We’ll also discuss what practices organizations can adopt to insulate themselves against LeftOverLocals and other vulnerabilities of its kind.
Why Agent Foundations?
Alignment is hard. Research directly aimed at the “big problems” (incorrigibility, misaligned convergent instrumental subgoals, etc) seems disconnected from the practicalities of how AI research gets done. In the early days of alignment research, much ink was spilled over how to correctly specify utility functions for reinforcement learning agents. Large language models, “the closest thing we have to AGI”, came from a totally different paradigm, making the effort seem silly in retrospect.
Given that framing, why do many alignment researchers continue to pursue “foundational” topics, instead of more tractable approaches like mechanistic interpretability, model evaluations or compute governance? In his article “Why Agent Foundations?” John Wentworth attempts to answer this question; we’ll use the article as a jumping off point to survey the work being done in agent foundations and ask whether too much or too little agent foundations research is being done given the rest of the safety research landscape.
ML Benkyoukai: Mamba
Are we at the precipice of a foundation model architectural revolution?
In the majority of our study groups, we’ve looked almost exclusively at large language models, often billed as “the closest thing we have to artificial general intelligence”. These large language models all use the transformer architecture. Transformers have proved very powerful, but are hamstrung by their poor scaling with sequence length / context window size. In the past few years there have been many attempts to modify the transformer architecture to scale better without harming performance, to varying degrees of effectiveness (Performer, Reformer, Random Feature Attention, Transformers are RNNs, LongNet, to name a few). But what if instead of modifying the transformer to be more efficient, we could replace it with an architecture designed for efficiency from the ground up?
The much hyped Mamba, released in December 2023 by Gu and Dao, promises to be just such an architecture. Mamba (so called because it uses a Selective Structured State Space Sequence-model or SSSSS🐍) replaces the attention mechanism, which scales quadratically with sequence length, with a structured state space sequence model that is linear in sequence length during training and constant (!) in sequence length when deployed. Structured state space sequence models are not Mamba’s invention; their trick is to drop the linear time invariance constraint adopted by other authors, then optimize hard for training and inference on GPUs to make dropping that constraint feasible. In evaluations, the Mamba architecture achieves state-of-the-art performance on various modalities such as language, audio, and genomics, matching transformers twice its size.
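To see where the linear-time, constant-state scaling comes from, here is a toy one-dimensional linear state-space scan in Python. (Mamba’s real model is high-dimensional and, crucially, selective: its parameters depend on the input. This fixed-parameter sketch only illustrates the recurrence.)

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy 1-D linear state-space model: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One state update per token gives O(T) cost over a length-T sequence,
    and generation carries only the single state h, versus a transformer's
    attention over the whole growing context.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # fold the new input into the running state
        ys.append(c * h)   # read out an output from the state
    return ys
```

Because the recurrence is linear and time-invariant it can also be computed as a convolution during training; Mamba gives up that time-invariance for selectivity, then recovers speed with a hardware-aware parallel scan.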
What should we make of this? Will the selective structured state space sequence model come to replace the transformer, as the transformer replaced the convolutional neural network before it? How does the SSSSS compare to previous recurrent architectures like the long short-term memory network or the neural Turing machine? Is a 2x improvement in performance even a big deal when you can 100x your performance with <1% increase in training compute using reinforcement learning from human feedback?
All these questions and more will be answered in the first session of AI Safety Tokyo’s machine learning benkyoukai. We will be going much deeper into the details than our usual benkyoukai. Expect to hear terms like activation function, skip connection and convolution thrown around without explanation. Unlike our regular benkyoukai, we’ll be expecting our attendees to have read the paper beforehand.
AI Sleeper Agents
Most of us are familiar with the idea of “sleeper agents” from cheesy action movies of the ’80s and ’90s. For those not familiar: a “sleeper agent” is an individual covertly planted in a target area or organization with the purpose of remaining inconspicuous until “activated” to carry out a specific mission or task. These agents are often trained to blend into their surroundings and may live seemingly normal lives until they receive instructions to engage in espionage or other covert activities.
Much as spies worry about sleeper agents in their midst, alignment researchers worry about deceptively aligned AIs. If an AI were to spontaneously develop deceptive tendencies during training, would our most popular alignment techniques (reinforcement learning, supervised fine-tuning and adversarial training) be enough to keep it in check? In a fresh paper released less than a week ago, Hubinger et al. perform experiments with Anthropic’s Claude and find that the answer is no.
Some people find this interesting and important, and others not so much. Blaine is still deciding. This week we’ll explore LLM sleeper agent deception: are we safe?
The Waluigi Effect
Large language models are the closest thing we have to artificial general intelligence, and their nature is still poorly understood. Last week we asked whether large language models were able to solve hard compositional tasks (and found that they couldn’t, even after being trained exhaustively on easy compositional tasks). That paper took a quantitative stance; this week we’ll follow up with The Waluigi Effect, which takes a qualitative stance.
The Waluigi Effect, following on from Janus’ Simulators, investigates the “semiotics” of large language models. It centres its exploration around a claim about the geometry of simulacra: that it is harder to define a trait than it is to specify whether or not a simulacrum has the trait. This, coupled with the tendency of stories to have narrative twists in which a character pretending to be good reveals themself to be evil, primes large language models to do the opposite of what you ask them. If evil is the opposite of good, you can go from good to evil by flipping a single bit!
The Waluigi Effect was published almost a year ago, in March 2023. Does it still hold up 10 months on? What can the narrative structure or semiotics of the user / AI assistant dialogue genre tell us about the tendencies of modern digital assistants? What third question can I put at the end of this paragraph that will complete the rule-of-three pattern you expected at the start?
Limits of Transformers on Compositionality
Much ink has been spilled on whether large language models “actually learn”, instead of “just memorizing patterns”. Is Chat-GPT a stochastic parrot, or a human simulator? How would we tell the difference?
In a recent paper, Dziri et al. explore how large language models solve compositional tasks—tasks that need to be broken down into smaller tasks, which in turn need to be broken down into smaller tasks in order to be solved. They suggest that LLMs are indeed stochastic parrots; they solve these tasks through memorization, becoming increasingly prone to error the more unusual the example or the more complicated the task: “models are able to correctly perform single-step reasoning, potentially due to memorizing such single-step operations during training, but fail to plan and compose several of these steps for an overall correct reasoning”. They find this persists even if they explicitly fine-tune the models on task-specific data, train the model to use a scratchpad to track its work in progress, or train the model far beyond overfitting (“grokking”).
In our next session Blaine will start the New Year off reviewing the main takeaways from this study, and relating them back to AI safety. How should this affect our view on AGI timelines? Is this a limitation of the transformer architecture that can be easily overcome? What does this mean for people who want to use large language models for mundane utility? All this and more in the first benkyoukai of 2024.
The Gemini Report
Or, “Blaine read the Gemini Report, so you don’t have to.”
On December 6th, Google announced its new and “most powerful LLM model ever”: Gemini. Excitement and hype followed, then waves of scrutiny. Even given Google’s cherry-picked demos, it seems that Gemini Ultra is better than GPT-4 on almost all benchmarks. That and its native multimodal capabilities make Gemini a genuine improvement on the state of the art, and worthy of our attention.
This week, Blaine will give a close reading of the Gemini report, with a healthy degree of skepticism. What technical details can we glean from Google Deepmind’s vague descriptions? Will an improvement from 67% to 74% on benchmarks be noticeable to users? Is the Gemini release healthy competition, or irresponsible accelerationism? Should the working professional prepare to cancel their GPT-4 subscription and jump ship? Let’s find out.
Progress Measures for Grokking via Mechanistic Interpretability
Grokking: what is it, and why should we care?
In 2022, OpenAI researchers discovered something surprising: models trained on small algorithmic tasks (like modular addition) will initially memorise the training data, then, “after a long time”, suddenly learn to generalize to unseen data. From the linked Alignment Forum post: “One of the core claims of mechanistic interpretability is that neural networks can be understood, that rather than being mysterious black boxes they learn interpretable algorithms which can be reverse engineered and comprehended. This work serves as a proof of concept of that, and that reverse engineering models is key to understanding them.”
This week, we’ll recap some background on grokking, double descent and related phenomena. Then we’ll analyze the claims of Nanda et al., and inspect the links between grokking and phase changes. If we have time, we’ll relate the phase changes to the developmental interpretability agenda we’ve covered in previous sessions, examining some recent work calculating the RLCT for phases in modular addition.
Anthropic’s Responsible Scaling Policy
In September, Anthropic published their Responsible Scaling Policy outlining the protocols they will be adopting to help manage “increasingly capable AI Systems.”
There are rumours and whispers about OpenAI supposedly having a secret AGI they’ve been hiding from the public; a salient take on responsible policies to prevent such a thing from happening seems prudent.
We’ll take a critical look at the AI Safety Levels (modeled after the US government’s biosafety level standards), discuss whether these safety standards are enough, and explore the possible industry impacts as other companies are expected to fall in line.
The OpenAI Debacle: what does this mean for the future of AGI development?
As many of you know, over the past week there have been nearly hourly updates on OpenAI’s recent “15-minute firing” of Sam Altman and the subsequent firestorm that lit up OpenAI and the board members responsible for blindsiding both the public and OpenAI’s investors.
We will examine how OpenAI’s board impacted the future of OpenAI and AGI development. Most notable is Helen Toner, who published a very recent academic paper criticizing OpenAI’s “frantic corner-cutting” in the release of ChatGPT (while praising Anthropic’s “safer” approach of delaying the release of its own chatbot, Claude).
Sensationalism and conspiracy theories aside, in this session we’ll examine OpenAI’s original board charter, review the facts of what happened, the makeup of OpenAI’s new Board, and speculate on how the week’s events will impact the future of AGI development.
Formalizing the Presumption of Independence
Paul Christiano is a prominent figure in AI safety whom we have so far overlooked. His institute, the Alignment Research Center (ARC), is well known for its work on model evaluations and eliciting latent knowledge. In his most recent paper, with Eric Neyman and Mark Xu, Christiano proposes a new research direction: formalizing an obscure kind of defeasible reasoning.
This week, Blaine Rogers will trace the connection between defeasible reasoning and AI safety. In doing so, he’ll relate the aforementioned to Anthropic's work on transformer circuits, Redwood Research's work on causal scrubbing, MIRI's work on logical uncertainty, and (if time allows) ELK—ARC's other research direction.
What is Bayesian Machine Learning?
Last month, while discussing Toy Models of Superposition, we briefly touched upon phase transitions in the Bayesian posterior (specifically, sampling from the posterior via MCMC).
This week Blaine will cover the basics of Bayesian machine learning, including: Bayesian probability from first principles, Bayesianism / frequentism approaches, what Bayesianism means for neural networks, and the current meta of tools & techniques for Bayesian ML.
Briefing: The Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence
On October 30th, 2023, US President Biden issued a new Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. In the same week, the UK hosted an “AI Safety” summit (dubbed the “doom summit” by British media) with two main aims:
to consider the risks of AI, especially at the frontier of development,
to discuss how they can be mitigated through internationally coordinated action.
In this session, Harold Godsoe will give a summarized briefing of the Executive Order, explore the legal implications, and expound on the future of international AI safety and alignment regulations (through a legal lens).
Process Supervision
This week, we’ll explore a recent work from OpenAI, the lab to watch with the biggest and best models. The paper sits squarely within their proposed Superalignment approach.
At our session on Representation Engineering, we had an argument about symbols and referents—when an AI says "I am a corrigible AI and I will let you turn me off," what does that mean? Is it being truthful? This paper, mostly targeted at capabilities, aims to get models to produce 'aligned' chain-of-thought reasoning. Is writing “aligned words” equivalent to being aligned? Are we who we pretend to be?