勉強会しませんか

今はAI Safety東京の勉強会は英語で行います。英語より日本語で討論するの方がよかったらAI Alignment Japan Slackは日本語で勉強会しています。参加したい人はぜひmasayuki.nagai@eajapan.orgをメールしてください。

We meet once a week to discuss a safety-related topic; a paper hot off the presses or a recent advance in policy. Our discussions are pitched at an interested technical layperson. We don’t assume any domain knowledge—our members have diverse academic backgrounds that span computer science, mathematics, law, and philosophy. Our meetings feel like university seminars for the intro-to-safety course that your university doesn’t offer yet.

If you’re interested in joining the benkyoukai, please email someone@aisafety.tokyo.

Scroll down to see some of our previous topics and get an idea of what we cover in our discussions.


AIIF Masterclass: AI Inside
Apr
16

AIIF Masterclass: AI Inside

If someone were to fine-tune an AI to take malicious actions when presented with trigger words, would our most popular alignment techniques (reinforcement learning, supervised fine-tuning and adversarial training) be enough to unlearn the behaviour?

View Event →
Guest Alex Spies Presents: “Structured World Representations in Maze-Solving Transformers”
Apr
10

Guest Alex Spies Presents: “Structured World Representations in Maze-Solving Transformers”

Alex Spies, research fellow at NII, will provide an overview of mechanistic interpretability tools and the approaches researchers employ to "reverse engineer" transformer models. He will then explain how his team used some of these techniques to uncover emergent structures in the models they trained and how these structures may facilitate a systematic understanding of internal search processes.

View Event →
AIIF Masterclass: AI Capabilities and Evaluations
Apr
9

AIIF Masterclass: AI Capabilities and Evaluations

Last week, we dove into Gladstone AI’s report, and whether the recommendations align with other clear actions on AI from the U.S. government, such as the NIST framework, U.S. AISIC, and U.S. EO on AI.

This week, we’ll continue examining the Gladstone AI report by vetting the "Survey of AI R&D Trajectories" - the companion explainer for the Gladstone Action Plan.

View Event →
AIIF Masterclass: AI Rules
Apr
2

AIIF Masterclass: AI Rules

In 2023, the U.S. Department of State commissioned this report from Gladstone AI, an AI-related consultancy and current U.S. Artificial Intelligence Safety Institute Consortium (AISIC) member with ties to the U.S. military and apparent deep relationships in Silicon Valley. In February 2024, the report was completed and provided to the USDOS.

View Event →
ML Benkyoukai: SORA
Mar
28

ML Benkyoukai: SORA

Whenever there’s an advance in charismatic domains like video generation, a media frenzy always accompanies. In the case of SORA, for once, the media frenzy may be justified; OpenAI’s new model marks a qualitative shift in what is possible when it comes to generating video and puts another strike in favour of brute scaling. Leaving aside the possible impacts on the stock video industry, is there anything interesting we can glean from their technical report? Any secret sauce other than more parameters → more performance?

View Event →
Reinforcement Learning from AI Feedback
Mar
27

Reinforcement Learning from AI Feedback

A recent focus of ours has been reinforcement learning from human feedback, a technique for aligning AI (particularly large language models) to human preferences. A fundamental limitation of these approaches is the cost / quality tradeoff in the collection of the feedback from humans; most language models are trained on binary preference data (“which of these two continuations is better”) because humans can provide that feedback quickly and cheaply, not because it’s the optimal kind of data to train a language model on.

View Event →
AIIF Weekly Masterclass: AI Use
Mar
19

AIIF Weekly Masterclass: AI Use

Good mental models of generative AI systems are hard to find. Some people describe GPT as a “code interpreter for natural language”; others describe it as a surprisingly worldly five-year-old. A bad mental model can lead you to make bad inferences about what generative AI can do.

View Event →
Fundamental Limitations of Reinforcement Learning from Human Feedback
Mar
13

Fundamental Limitations of Reinforcement Learning from Human Feedback

A few weeks ago at the ML benkyoukai we talked about direct preference optimization, a successor to reinforcement learning from human feedback (RLHF) that addresses some of its technical limitations. But there are also more fundamental limitations with RLHF: are your evaluators aligned, either with each other or with humanity at large? Are their preferences worth learning? Are humans even capable of evaluating the performance of your model without being mistaken or misled?

View Event →
AIIF Weekly Masterclass: AI Landscapes
Mar
13

AIIF Weekly Masterclass: AI Landscapes

Welcome to the AI Industry Foundation’s first Weekly Masterclass.

The AIIF will be running Weekly Masterclasses to bring AI Executives up-to-speed. Our Masterclasses provide a structured distillation of technical and non-technical aspects of cutting-edge AI knowledge, preparing you to have conversations on the frontier of AI and to make business decisions based on those conversations.

View Event →
Elon Musk vs. Open AI
Mar
6

Elon Musk vs. Open AI

Last year saw Sam Altman ousted and then reinstated as CEO of Open AI amid concerns that it had drifted from its original non-profit mission. As events unfolded, we discussed how Open AI was formed and the various changes that have been made to the board over the years, including Sam Altman’s falling out with cofounder Elon Musk over the direction of the organization. This year, Elon returns to force things in court.

View Event →
Cyborgism
Feb
21

Cyborgism

Chat is the dominant way of interfacing with a large language model. You type some words into a chat box, and the model responds. Maybe you regenerate the response if you didn’t like the first one. Is that the best way to make use of a multimodal distribution over all of human language?

View Event →
ML Benkyoukai: Direct Preference Optimization
Feb
20

ML Benkyoukai: Direct Preference Optimization

[…]

Given the obvious importance of fine-tuning for getting good completions from large language models, it’s worth knowing the details. The most popular fine-tuning procedure, RLHF, is rather ad-hoc, and many people have suggested ways to improve it. One paper that has been getting a lot of hype recently is DPO (Direct Preference Optimization), which discards reinforcement learning in favour of just asking the model to produce preferred completions more and dispreferred completions less. They claim to get the same or better performance as RLHF, while being faster, easier to implement and more stable. This month we’ll use reviewing the DPO paper as an excuse to learn how RLHF works and survey the landscape of its competitors.

View Event →
Guest Adelin Travers Presents: LeftoverLocals: Listening to LLM responses Through Leaked GPU Local Memory
Feb
14

Guest Adelin Travers Presents: LeftoverLocals: Listening to LLM responses Through Leaked GPU Local Memory

As big labs become more aware of the safety (and business) implications of model weights leaking to bad actors, there has been renewed interest in cybersecurity and operational security in the safety community. Last December we discussed Anthropic’s Responsible Scaling Policy, which included cybersec / opsec recommendations we described as “things smart people who don’t know about cybersec / opsec would recommend”. Perhaps we should engage more closely with the cybersecurity community!

To that end, this week we have invited Adelin Travers as a guest speaker. Adelin works with the team at Trail of Bits which recently uncovered an important vulnerability (LeftOverLocals) that impacts GPUs used in ML systems. Their original blog post was also covered by Wired Magazine just a few weeks ago, on January 16th.

During the session, Adelin will give an overview of the intricate tech stack behind modern machine learning, making clear the extent of the attack surface. He’ll explain the LeftOverLocals vulnerability, and what implications it has for users and developers of AI services. We’ll discuss where LeftOverLocals stands in the general landscape of vulnerabilities, and the appropriate level of worry. We’ll also discuss what practices organizations can adopt to insulate themselves against LeftOverLocals and other vulnerabilities of its kind.

View Event →
Why Agent Foundations?
Feb
7

Why Agent Foundations?

Alignment is hard. Research directly aimed at the “big problems” (incorrigiblity, misaligned convergent instrumental subgoals, etc) seems disconnected from the practicalities of how AI research gets done. In the early days of alignment research, much ink was spilled over how to correctly specify utility functions for reinforcement learning agents. Large language models, “the closest thing we have to AGI”, came from a totally different paradigm, making the effort seem silly in retrospect.

Given that framing, why do many alignment researchers continue to pursue “foundational” topics, instead of more tractable approaches like mechanistic interpretability, model evaluations or compute governance? In his article “Why Agent Foundations?” John Wentworth attempts to answer this question; we’ll use the article as a jumping off point to survey the work being done in agent foundations and ask whether too much or too little agent foundations research is being done given the rest of the safety research landscape.

View Event →
ML Benkyoukai: Mamba
Jan
25

ML Benkyoukai: Mamba

Are we at the precipice of a foundation model architectural revolution?

In the majority of our study groups, we’ve looked almost exclusively at large language models, often billed as “the closest thing we have to artificial general intelligence”. These large language models all use the transformer architecture. Transformers have proved very powerful, but are hamstrung by their poor scaling with sequence length / context window size. In the past few years there have been many attempts to modify the transformer architecture to scale better without harming performance, to varying degrees of effectiveness (Performer, Reformer, Random Feature Attention, Transformers are RNNs, LongNet, to name a few). But what if instead of modifying the transformer to be more efficient, we could replace it with an architecture designed for efficiency from the ground-up?

The much hyped Mamba, released in December 2023 by Gu and Dao, promises to be just such an architecture. Mamba (so called because it uses a Selective Structured State Space Sequence-model or SSSSS🐍) replaces the attention mechanism, which scales quadratically with sequence length, with a structured state space sequence model that is linear in sequence length during training and constant (!) in sequence length when deployed. Structured state space sequence models are not Mamba’s invention; their trick is to drop the linear time invariance constraint adopted by other authors, then optimize hard for training and inference on GPUs to make dropping that constraint feasible. In evaluations, the Mamba architecture achieves state-of-the-art performance on various modalities such as language, audio, and genomics, matching transformers twice its size.

What should we make of this? Will the selective structured state space sequence model come to replace the transformer, as the transformer replaced the convolutional neural network before it? How does the SSSSS compare to previous recurrent architectures like the long short-term memory network or the neural Turing machine? Is a 2x improvement in performance even a big deal when you can 100x your performance with <1% increase in training compute using reinforcement learning from human feedback?

All these questions and more will be answered in the first session of AI Safety Tokyo’s machine learning benkyoukai. We will be going much deeper into the details than our usual benkyoukai. Expect to hear terms like activation function, skip connection and convolution thrown around without explanation. Unlike our regular benkyoukai, we’ll be expecting our attendees to have read the paper beforehand.

View Event →
AI Sleeper Agents
Jan
24

AI Sleeper Agents

Most of us are familiar with the idea of “sleeper agents” from cheesy action movies from the 80’s and 90’s. For those of you not familiar: A "sleeper agent" refers to an individual who is covertly planted in a target area or organization, with the purpose of remaining inconspicuous until “activated” to carry out a specific mission or task. These agents are often trained to blend into their surroundings and may live seemingly normal lives until they receive instructions to engage in espionage or other covert activities.

Much as spies worry about sleeper agents in their midst, alignment researchers worry about deceptively aligned AIs. If an AI were to spontaneously develop deceptive tendencies during training, would our most popular alignment techniques (reinforcement learning, supervised fine-tuning and adversarial training) be enough to keep it in check? In a fresh paper released less than a week ago, Hubinger et al. perform experiments with Anthropic’s Claude and find that the answer is no.

Some people find this interesting and important, and others not so much. Blaine is still deciding. This week we’ll explore LLM sleeper agent deception: are we safe?

View Event →
The Waluigi Effect
Jan
17

The Waluigi Effect

Large language models are the closest thing we have to artificial general intelligence, and their nature is still poorly understood. Last week we asked whether large language models were able to solve hard compositional tasks (and found that they couldn’t, even after being trained exhaustively on easy compositional tasks). That paper took a quantitive stance; this week we’ll follow up with The Waluigi Effect, which takes a qualitative stance.

The Waluigi Effect, following on from Janus’ Simulators, investigates the “semiotics” of large language models. It centres its exploration around a claim about the geometry of simulacra: that it is harder to define a trait than it is to specify whether or not a simulacrum has the trait. This, coupled with the tendency of stories to have narrative twists in which a character pretending to be good reveals themself to be evil, primes large language models to do the opposite of what you ask them. If evil is the opposite of good, you can go from good to evil by flipping a single bit!

The Waluigi Effect was published almost a year ago, in March 2023. Does it still hold up 10 months on? What can the narrative structure or semiotics of the user / AI assistant dialogue genre tell us about the tendencies of modern digital assistants? What third question can I put at the end of this paragraph that will complete the rule-of-three pattern you expected at the start?

View Event →
Limits of Transformers on Compositionality
Jan
10

Limits of Transformers on Compositionality

Much ink has been spilled on whether large language models “actually learn”, instead of “just memorizing patterns”. Is Chat-GPT a stochastic parrot, or a human simulator? How would we tell the difference?

In a recent paper, Dziri et al explore how large language models solve compositional tasks—tasks that need to be broken down into smaller tasks, which in turn need to be broken down into smaller tasks in order to be solved. They suggest that LLMs are indeed stochastic parrots; they solve these tasks through memorization, becoming increasing prone to error the more unusual the example or the more complicated the task: “models are able to correctly perform single-step reasoning, potentially due to memorizing such single-step operations during training, but fail to plan and compose several of these steps for an overall correct reasoning”. They find this persists even if they explicitly fine tune the models on task specific data, train the model to use a scratchpad to track its work in progress, or train the model far beyond overfitting (“grokking”).

In our next session Blaine will start the New Year off reviewing the main takeaways from this study, and relating them back to AI safety. How should this affect our view on AGI timelines? Is this a limitation of the transformer architecture that can be easily overcome? What does this mean for people who want to use large language models for mundane utility? All this and more in the first benkyoukai of 2024.

View Event →
The Gemini Report
Dec
20

The Gemini Report

Or, “Blaine read the Gemini Report, so you don’t have to.”

On December 6th, Google announced it’s new and “most powerful LLM model ever”: Gemini. Excitement and hype followed, then waves of scrutiny. Even given Google’s cherry-picking demos, it seems that Gemini Ultra is better than GPT-4 on almost all benchmarks. That and its native multimodal capabilities make Gemini a genuine improvement on the state of the art, and worthy of our attention.

This week, Blaine will give a close reading of the Gemini report, with a healthy degree of skepticism. What technical details can we glean from Google Deepmind’s vague descriptions? Will a 67% to 74% improvement on benchmarks be a noticeable to users? Is the Gemini release healthy competition, or irresponsible accelerationism? Should the working professional prepare to cancel their GPT-4 subscription and jump ship? Let’s find out.

View Event →
Progress Measures for Grokking via Mechanistic Interpretability
Dec
13

Progress Measures for Grokking via Mechanistic Interpretability

Grokking: what is it, and why should we care?

In 2022, OpenAI researchers discovered something: models trained on small algorithmic tasks (like modular addition) will initially memorise the training data. Then “after a long time”, will suddenly learn to generalize to unseen data. From the linked Alignment Forum post: “One of the core claims of mechanistic interpretability is that neural networks can be understood, that rather than being mysterious black boxes they learn interpretable algorithms which can be reverse engineered and comprehended. This work serves as a proof of concept of that, and that reverse engineering models is key to understanding them.”

This week, we’ll recap some background on grokking, double descent and related phenomena. Then we’ll analyze the claims of Nanda et al., and inspect the links between grokking and phase changes. If we have time, we’ll relate the phase changes to the developmental interpretability agenda we’ve covered in previous session, examining some recent work calculating the RLCT for phases in modular addition.

View Event →
Anthropic’s Responsible Scaling Policy
Dec
6

Anthropic’s Responsible Scaling Policy

In September, Anthropic published their Responsible Scaling Policy outlining the protocols they will be adopting to help manage “increasingly capable AI Systems.”

There are rumours and whispers about OpenAI supposedly having a secret AGI they’ve been hiding from the public; a salient take on responsible policies to prevent such a thing from happening seems prudent.

We’ll take a critical look at the AI Safety Levels (modeled after the US government’s biosafety level standards), discuss whether these safety standards are enough, and explore the possible industry impacts as other companies are expected to fall in line.

View Event →
The OpenAI Debacle: what does this mean for the future of AGI development?
Nov
29

The OpenAI Debacle: what does this mean for the future of AGI development?

As many of you know, over the past week there’s been nearly hourly updates on OpenAI’s recent “15-minute firing” of Sam Altman and the subsequent firestorm that lit up OpenAI and the board members responsible for blindsiding both the public and OpenAI’s investors.

We will examine how OpenAI’s Board impacted the future of OpenAI and AGI development. Most notably, Helen Toner who published a very recent academic paper, criticizing OpenAI’s “frantic corner-cutting” in the release of ChatGPT (as opposed to praising Anthropic’s more “safe” approach of a delayed release of it’s GPT-4 chatbot, Claude).

Sensationalism and conspiracy theories aside, in this session we’ll examine OpenAI’s original board charter, review the facts of what happened, the makeup of OpenAI’s new Board, and speculate on how the week’s events will impact the future of AGI development.

View Event →
Formalizing the Presumption of Independence
Nov
22

Formalizing the Presumption of Independence

Paul Christiano is a prominent figure in AI safety that we have so far overlooked. His institute, the Alignment Research Center (ARC), is well known for its work on model evaluations and eliciting latent knowledge. In his most recent paper with Eric Neyman and Mark Xu, Christiano proposes a new research direction: formalizing an obscure kind of defeasible reasoning.

This week, Blaine Rogers will trace the connection between defeasible reasoning and AI safety. In doing so, he’ll relate the aforementioned to Anthropic's work on transformer circuits, Redwood Research's work on causal scrubbing, MIRI's work on logical uncertainty, and (if time allows) ELK—ARC's other research direction.

View Event →
What is Bayesian Machine Learning?
Nov
15

What is Bayesian Machine Learning?

Last month, while discussing Toy Models of Superposition, we briefly touched upon phase transitions in the Bayesian posterior (specifically the MCMC posterior sampling).

This week Blaine will cover the basics of Bayesian machine learning, including: Bayesian probability from first principles, Bayesianism / frequentism approaches, what Bayesianism means for neural networks, and the current meta of tools & techniques for Bayesian ML.

View Event →
Briefing: The Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence
Nov
8

Briefing: The Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence

On October 30th, 2023, US President Biden issued a new Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Similarly, the UK hosted an “AI Safety” summit (alternatively called the “doom summit” by British media) with two main aims:

  • to consider the risks of AI, especially at the frontier of development,

  • to discuss how they can be mitigated through internationally coordinated action.

In this session, Harold Godsoe will give a summarized briefing of the Executive Order, explore the legal implications, and expound on the future of international AI safety and alignment regulations (through a legal lens).

View Event →
Process Supervision
Nov
1

Process Supervision

This week, we’ll explore a recent work from OpenAI: the lab-to-watch with the biggest and best models. It's pretty on-the-mark for their proposed Superalignment approach.

At our session on Representation Engineering, we had an argument about symbols and referents—when an AI says "I am a corrigible AI and I will let you turn me off," what does that mean? Is it being truthful? This paper, mostly targeted at capabilities, aims to get models to produce 'aligned' chain-of-thought reasoning. Is writing “aligned words” equivalent to being aligned? Are we who we pretend to be?

View Event →