Won't you join our benkyoukai?

AI Safety Tokyo's benkyoukai are held in English. AI Alignment Network runs study sessions in Japanese.

We meet once a month to discuss a safety-related topic; a paper hot off the presses or a recent advance in policy. Our discussions are pitched at an interested technical layperson. We don’t assume any domain knowledge—our members have diverse academic backgrounds that span computer science, mathematics, physics, law, and philosophy. Our meetings feel like university seminars for the intro-to-safety course that your university doesn’t offer yet.

If you’re interested in joining the benkyoukai, please email someone@aisafety.tokyo.

Scroll down to see some of our previous topics and get an idea of what we cover in our discussions.


Evaluating the World Model Implicit in a Generative Model
Dec 18

Generative models like large language models seem capable of understanding the world, but how well do they really grasp the underlying structure of the tasks they perform? This month, André Röhm will delve into recent research that uses deterministic finite automata to define and evaluate the "world models" implicit in generative models.
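
A deterministic finite automaton is simple enough to sketch directly. The toy below is our own illustration, not the paper's benchmark: a parity automaton, and the sense in which two prefixes are equivalent when they lead to the same state, something a model with a good implicit world model should respect.

```python
# Toy sketch of a deterministic finite automaton (DFA) as a "world model".
# Invented for illustration: a DFA is just a set of states plus a transition
# function, and we can ask whether a sequence model's behaviour is consistent
# with the states the DFA implies.

# Parity DFA over {"0", "1"}: state 0 = even number of 1s seen so far.
TRANSITIONS = {
    (0, "0"): 0, (0, "1"): 1,
    (1, "0"): 1, (1, "1"): 0,
}

def run_dfa(string, start=0):
    """Return the DFA's state after consuming the string."""
    state = start
    for symbol in string:
        state = TRANSITIONS[(state, symbol)]
    return state

# Two prefixes are equivalent if they lead to the same state; a generative
# model that has truly recovered the automaton should accept exactly the
# same continuations after each.
print(run_dfa("1101"))  # odd number of 1s -> state 1
print(run_dfa("1"))     # also state 1: an equivalent prefix
```

The paper's evaluations build on exactly this kind of state-equivalence question, just at much larger scale.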

Against Almost Every Theory of Impact of Interpretability
Nov 20

The most popular area of AI safety research by far is mechanistic interpretability. Mechanistic interpretability work is tractable, appealing from both a scientific and a mathematical perspective, easy to teach, publishable in regular scientific venues, and cheap to get started in. It's also… useless?

Technical Research and Talent is Needed for Effective AI Governance
Oct 16

For policymakers to make informed decisions about effective governance of AI, they need reliable and accurate information about AI capabilities, current limitations, and future trajectories. In a recent position paper, Reuel and Soder et al. claim that governments lack this access, often creating regulations that cannot be realized without significant research breakthroughs.

Let’s review the position paper, discuss whether its claims are merited, and ask how we might be able to get more technical people into government.

AI Alignment with Changing and Influenceable Reward Functions
Sep 11

We know that human preferences change over time. We also know that AIs can change the preferences of their users; famously, the Facebook news feed algorithm learned that showing people outrageous posts makes them outraged, which makes them want to see even more outrageous posts.

In this session, we'll look at an interesting poster from this year's ICML that tackles the problem in detail, proposing eight different notions of what it might mean for an AI to respect a user's changing preferences, finding all of them lacking, and uncovering inherent tradeoffs.

Scaling and Evaluating Sparse Autoencoders
Aug 7

For a while, Anthropic let you play with Golden Gate Claude, a version of Claude Sonnet with the "Golden Gate Bridge" feature (34M/31164353) pinned high. The result was a model that bent every conversation unerringly towards the bridge. Pretty compelling!

So why is this significant for safety? Well, if this technique works and we can steer a model towards any one object or idea, couldn't we just pin a "be moral" feature high and pat ourselves on the back for a job well done?
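
Mechanically, feature pinning is simple to sketch. Everything below is invented for illustration (a two-dimensional toy, not Claude's actual decomposition): an activation vector is approximated as a weighted sum of learned feature directions, and "pinning" clamps one weight before decoding.

```python
# Minimal sketch of feature pinning in a sparse-autoencoder-style
# decomposition, where an activation vector is (approximately) a weighted
# sum of learned feature directions. All numbers here are invented.

def decode(feature_acts, feature_dirs):
    """Reconstruct an activation vector from per-feature activations."""
    dim = len(feature_dirs[0])
    out = [0.0] * dim
    for act, direction in zip(feature_acts, feature_dirs):
        for j in range(dim):
            out[j] += act * direction[j]
    return out

def pin_feature(feature_acts, index, value):
    """Return a copy of the activations with one feature clamped ("pinned")."""
    pinned = list(feature_acts)
    pinned[index] = value
    return pinned

# Two toy feature directions in a 2-d "residual stream".
dirs = [[1.0, 0.0], [0.0, 1.0]]
acts = [0.2, 0.1]
print(decode(pin_feature(acts, 1, 10.0), dirs))  # feature 1 now dominates: [0.2, 10.0]
```

In the real setting the pinned activation is written back into the model's forward pass at every token, which is what drags every conversation towards the bridge.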

Situational Awareness: The Decade Ahead
Jul 17

Just one month ago, Leopold Aschenbrenner released Situational Awareness, a series of essays laying out his (and others') thoughts on how AI, and eventually AGI and ASI, will shape the future of humanity.

Aschenbrenner’s conclusions are, as we say in the industry, Big If True.

Should we all dump our life savings into Nvidia stocks, or is Mr. Aschenbrenner caught up in his own hype? Let’s find out.

AIIF Masterclass: Japan's AI Frontier: Pioneering Regulatory Innovation
Jul 9

Japan is setting a bold course to become the world's most AI-friendly nation.

Join us for an in-depth seminar to explore Japan's unique approach to regulating artificial intelligence. By spearheading advancements in regulatory frameworks, Japan aims to foster innovation and ensure safe, ethical AI deployment. Discover how Japan is setting new standards in AI policy, driving technological progress, and shaping the future of AI.

Is GPT conscious?
Jun 26

AI Consciousness is a Big Deal. Blake Lemoine famously lost his job at Google in 2022 after claiming that LaMDA was conscious. Anthropic's Claude has a terrible habit of claiming to be conscious in long conversations.

Let’s read Butlin and Long, bone up on the science of consciousness, and then ask ourselves whether “consciousness” was the thing we should have been worried about in the first place.

AIIF Masterclass: Business Process Model and Notation (BPMN)
Jun 11

Business processes are amenable to mathematical and logical analysis by means of digital representations such as Business Process Model and Notation (BPMN), connecting business analysts to researchers and engineers.

Join guest speaker Colin Rowat for an introduction to BPMN, already used at scale by companies like Goldman Sachs and Rakuten Symphony, and a peek at the technology that might one day enable Altman's vision of "one-person unicorns".

AIIF Masterclass: Data Governance and Security in the AI-Driven Corporate Environment
Jun 4

This week, guest speaker Karsten Klein will provide insights into the currently available AI governance and security standards, and how to navigate regulatory requirements.

Learn about the latest trends impacting data governance and security in AI, and enhance your understanding of how to protect and manage data effectively in an AI context.

AIIF Masterclass: AI and Copyright
May 28

Japan's AI and copyright rules are a big deal right now.

We'll dig into issues of copyright infringement in AI development and use, drawing on a new government report, "Perspectives Regarding AI and Copyright", released on March 15, 2024.

AIIF Masterclass: Synthetic Data
May 21

This week, we have a special guest speaker: Wim Kees Janssen, founder and CEO of Syntho, a startup that lets businesses generate synthetic data rapidly and securely, facilitating faster adoption of data-driven innovations.

AIIF Masterclass: AI Capabilities and Evaluations
May 14

All large language models lack some capabilities (such as calculation) that are easy for narrow AI systems (such as calculators). But humans also lag behind calculators in calculation ability, and we solve that by… letting humans use calculators. Why not do the same for large language models?
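
The "let the model use a calculator" idea is typically implemented as tool use: the model emits a structured call in its output, a harness executes it, and the result is spliced back into the text. A minimal sketch, with a CALC(...) syntax invented for this illustration:

```python
import re

# Minimal sketch of tool use: the model writes CALC(<expression>) in its
# output, and a harness evaluates the expression and splices in the result.
# The CALC(...) marker syntax is invented, not any real API.

def evaluate_calc_calls(text):
    """Replace each CALC(...) marker with the evaluated arithmetic result."""
    def _replace(match):
        expr = match.group(1)
        # Only allow digits, whitespace, and basic arithmetic before eval'ing.
        if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
            raise ValueError(f"unsupported expression: {expr!r}")
        return str(eval(expr))  # acceptable here given the whitelist above
    return re.sub(r"CALC\(([^)]*)\)", _replace, text)

print(evaluate_calc_calls("12 * 34 is CALC(12 * 34)."))  # 12 * 34 is 408.
```

Real systems (function calling, code interpreters) differ in the details, but the shape is the same: the model delegates the part it's bad at.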

AIIF Masterclass: AI Use
May 7

In the rapidly evolving landscape of business consulting, harnessing the power of artificial intelligence is no longer just a competitive advantage; it's a necessity. A recent study conducted with Boston Consulting Group examined the impact of AI on complex consulting tasks: for tasks within AI's capabilities, consultants using AI were more productive and produced higher-quality results.

AIIF Masterclass: AI Inside
Apr 16

If someone were to fine-tune an AI to take malicious actions when presented with trigger words, would our most popular alignment techniques (reinforcement learning, supervised fine-tuning and adversarial training) be enough to unlearn the behaviour?

Guest Alex Spies Presents: “Structured World Representations in Maze-Solving Transformers”
Apr 10

Alex Spies, research fellow at NII, will provide an overview of mechanistic interpretability tools and the approaches researchers employ to "reverse engineer" transformer models. He will then explain how his team used some of these techniques to uncover emergent structures in the models they trained and how these structures may facilitate a systematic understanding of internal search processes.

AIIF Masterclass: AI Capabilities and Evaluations
Apr 9

Last week, we dove into Gladstone AI's report, asking whether its recommendations align with other U.S. government actions on AI, such as the NIST framework, the U.S. AISIC, and the Executive Order on AI.

This week, we’ll continue examining the Gladstone AI report by vetting the "Survey of AI R&D Trajectories" - the companion explainer for the Gladstone Action Plan.

AIIF Masterclass: AI Rules
Apr 2

In 2023, the U.S. Department of State commissioned a report from Gladstone AI, an AI consultancy and current U.S. Artificial Intelligence Safety Institute Consortium (AISIC) member with ties to the U.S. military and apparently deep relationships in Silicon Valley. In February 2024, the completed report was delivered to the Department.

ML Benkyoukai: SORA
Mar 28

Whenever there's an advance in a charismatic domain like video generation, a media frenzy follows. In the case of SORA, for once, the frenzy may be justified: OpenAI's new model marks a qualitative shift in what is possible in video generation and is another point in favour of brute scaling. Leaving aside the possible impact on the stock-video industry, is there anything interesting we can glean from the technical report? Any secret sauce beyond more parameters → more performance?

Reinforcement Learning from AI Feedback
Mar 27

A recent focus of ours has been reinforcement learning from human feedback (RLHF), a technique for aligning AI (particularly large language models) to human preferences. A fundamental limitation of these approaches is the cost/quality tradeoff in collecting feedback from humans: most language models are trained on binary preference data ("which of these two continuations is better?") because humans can provide that feedback quickly and cheaply, not because it's the optimal kind of data to train a language model on.
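
That binary-preference setup is usually formalised with a Bradley-Terry model. A minimal sketch, with placeholder reward values standing in for a real reward model's outputs:

```python
import math

# Sketch of the Bradley-Terry model behind typical RLHF reward-model
# training: the probability that the "chosen" continuation is preferred is a
# logistic function of the reward margin, and the reward model is trained to
# maximise the log-likelihood of the human labels. Rewards are placeholders.

def preference_prob(reward_chosen, reward_rejected):
    """P(chosen beats rejected) under the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(reward_rejected - reward_chosen))

def pairwise_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the observed preference."""
    return -math.log(preference_prob(reward_chosen, reward_rejected))

print(preference_prob(1.0, 1.0))  # equal rewards: 0.5
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.5, 0.0))  # wider margin, lower loss: True
```

Note what this loss can and can't see: it only ever learns from the margin between two continuations, which is exactly why richer feedback (such as AI-generated critiques) is attractive.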

AIIF Weekly Masterclass: AI Use
Mar 19

Good mental models of generative AI systems are hard to find. Some people describe GPT as a “code interpreter for natural language”; others describe it as a surprisingly worldly five-year-old. A bad mental model can lead you to make bad inferences about what generative AI can do.

Fundamental Limitations of Reinforcement Learning from Human Feedback
Mar 13

A few weeks ago at the ML benkyoukai we talked about direct preference optimization, a successor to reinforcement learning from human feedback (RLHF) that addresses some of its technical limitations. But there are also more fundamental limitations with RLHF: are your evaluators aligned, either with each other or with humanity at large? Are their preferences worth learning? Are humans even capable of evaluating the performance of your model without being mistaken or misled?
