Join our benkyoukai
AI Safety Tokyo's benkyoukai is conducted in English. AI Alignment Network runs a benkyoukai in Japanese.
We meet once a month to discuss a safety-related topic; a paper hot off the presses or a recent advance in policy. Our discussions are pitched at an interested technical layperson. We don’t assume any domain knowledge—our members have diverse academic backgrounds that span computer science, mathematics, physics, law, and philosophy. Our meetings feel like university seminars for the intro-to-safety course that your university doesn’t offer yet.
If you’re interested in joining the benkyoukai, please email someone@aisafety.tokyo.
Scroll down to see some of our previous topics and get an idea of what we cover in our discussions.
Evaluating the World Model Implicit in a Generative Model
Generative models like large language models seem capable of understanding the world, but how well do they really grasp the underlying structure of the tasks they perform? This month, André Röhm will delve into recent research that evaluates the "world models" implicit in generative models against ground truth defined by deterministic finite automata.
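As a toy illustration of the kind of ground truth involved (this example is ours, not the paper's benchmark), here is a minimal deterministic finite automaton in Python. A generative model trained on strings from such a language has a faithful world model only if the sequences it favours are consistent with the automaton's states:

```python
# A minimal deterministic finite automaton (DFA). This one accepts
# binary strings containing an even number of 1s.
class DFA:
    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # (state, symbol) -> next state
        self.start = start
        self.accepting = accepting

    def accepts(self, string):
        state = self.start
        for symbol in string:
            state = self.transitions[(state, symbol)]
        return state in self.accepting

even_ones = DFA(
    transitions={
        ("even", "0"): "even", ("even", "1"): "odd",
        ("odd", "0"): "odd", ("odd", "1"): "even",
    },
    start="even",
    accepting={"even"},
)
```

Checking acceptance alone is a weak test; the paper's metrics probe whether the model also distinguishes the automaton's states from one another, in the spirit of the Myhill-Nerode theorem.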
Against Almost Every Theory of Impact of Interpretability
The most popular area of AI safety research by far is mechanistic interpretability. Mechanistic interpretability work is tractable, appealing from both a scientific and a mathematical perspective, easy to teach, and publishable in regular scientific venues, and it requires very little to get started. It’s also… useless?
Technical Research and Talent is Needed for Effective AI Governance
For policymakers to make informed decisions about effective governance of AI, they need reliable and accurate information about AI capabilities, current limitations, and future trajectories. In a recent position paper, Reuel and Soder et al. claim that governments lack access to this information, and so often create regulations that cannot be implemented without significant research breakthroughs.
Let’s review the position paper, discuss whether its claims are merited, and ask how we might be able to get more technical people into government.
AI Alignment with Changing and Influenceable Reward Functions
We know that human preferences change over time. We also know that AIs can change the preferences of their users; famously, the Facebook news feed algorithm learned that showing people outrageous posts makes them outraged, which makes them want to see even more outrageous posts.
In this session, we’ll look at an interesting poster from this year’s ICML that tackles the problem in detail, proposing eight different ways an AI might respect a user's changing preferences, finding all of them lacking, and uncovering inherent tradeoffs.
Scaling and Evaluating Sparse Autoencoders
For a while, Anthropic let you play with Golden Gate Claude, a version of Claude Sonnet with the “Golden Gate Bridge” feature (34M/31164353) pinned high. The result was a model that bent every conversation unerringly towards the bridge. Pretty compelling!
So why is this significant for safety? Well, if this technique works and we can make a model focus on any one object or idea, couldn’t we just pin a “be moral” feature high and pat ourselves on the back for a job well done?
Situational Awareness: The Decade Ahead
Just one month ago, Leopold Aschenbrenner released a series of essays laying out his (and others’) thoughts on how AI, and beyond it AGI and ASI, will shape the future of humanity.
Aschenbrenner’s conclusions are, as we say in the industry, Big If True.
Should we all dump our life savings into Nvidia stocks, or is Mr. Aschenbrenner caught up in his own hype? Let’s find out.
AIIF Masterclass: Japan's AI Frontier: Pioneering Regulatory Innovation
Japan is setting a bold course to become the world's most AI-friendly nation.
Join us for an in-depth seminar to explore Japan's unique approach to regulating artificial intelligence. By spearheading advancements in regulatory frameworks, Japan aims to foster innovation and ensure safe, ethical AI deployment. Discover how Japan is setting new standards in AI policy, driving technological progress, and shaping the future of AI.
Is GPT conscious?
AI Consciousness is a Big Deal. Blake Lemoine was famously fired from Google in 2022 after claiming that LaMDA was conscious. Anthropic’s Claude has a terrible habit of claiming to be conscious in long conversations.
Let’s read Butlin and Long, bone up on the science of consciousness, and then ask ourselves whether “consciousness” was the thing we should have been worried about in the first place.
AIIF Masterclass: Are Emergent Abilities of Large Language Models a Mirage?
In this session we’ll discuss “emergent” capabilities: how we can predict them before they appear, how we can measure the performance of language models so as to be less surprised by them, and how we can prepare to take advantage of new capabilities rather than being caught flat-footed.
AIIF Masterclass: Business Process Model and Notation (BPMN)
Business processes are amenable to mathematical and logical analysis by means of digital representations such as Business Process Model and Notation (BPMN) - connecting business analysts to researchers and engineers.
Join guest speaker Colin Rowat for an introduction to BPMN - already used at scale by companies like Goldman Sachs and Rakuten Symphony - and a peek at the technology that might one day enable Altman's vision of "one-person unicorns".
AIIF Masterclass: Data Governance and Security in the AI-Driven Corporate Environment
This week, guest speaker Karsten Klein will provide insights into the currently available AI governance and security standards, and how to navigate regulatory requirements.
Learn about the latest trends impacting data governance and security in AI, and enhance your understanding of how to protect and manage data effectively in an AI context.
AIIF Masterclass: AI and Copyright
Japan's AI and copyright rules are a big deal right now.
We’re going to get into some issues around copyright infringement during AI development and use from a new government report titled “Perspectives Regarding AI and Copyright” released on March 15, 2024.
Guest Ram Rachum Presents: Emergent Dominance Hierarchies in Reinforcement Learning Agents
We explore how populations of reinforcement learning agents naturally develop, enforce, and transmit dominance hierarchies through simple interactions, absent any programmed incentives or explicit rules.
ML Benkyoukai: Kolmogorov-Arnold Networks
Forget MLPs, it’s all about KANs now… or so the hype would have you believe.
AIIF Masterclass: Synthetic Data
This week, we have a special guest speaker: Wim Kees Janssen - founder & CEO of Syntho, the startup that is revolutionizing how businesses around the globe access and utilize data. Businesses can now generate synthetic data rapidly and securely, facilitating faster adoption of data-driven innovations.
AIIF Masterclass: AI Capabilities and Evaluations
All large language models lack some capabilities (such as calculation) that are easy for narrow AI systems (such as calculators). But humans also lag behind calculators in calculation ability, and we solve that by… letting humans use calculators. Why not do the same for large language models?
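The idea can be sketched in a few lines, assuming a hypothetical model that emits `CALC(...)` markers when it wants arithmetic done (the marker convention and helper names are our invention, not any real API):

```python
import ast
import operator
import re

# Safely evaluate an arithmetic expression via Python's AST,
# allowing only numeric literals and basic binary operators.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr):
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("disallowed expression")
    return walk(ast.parse(expr, mode="eval"))

def run_with_calculator(model_output):
    # Replace every CALC(...) marker the model emitted with the result,
    # exactly as a human would reach for a calculator mid-sentence.
    return re.sub(r"CALC\(([^)]*)\)",
                  lambda m: str(safe_eval(m.group(1))),
                  model_output)
```

Production tool-use systems work on the same loop, just with a richer set of tools and a model actually trained to emit the markers.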
AIIF Masterclass: AI Use
In the rapidly evolving landscape of business consulting, harnessing the power of artificial intelligence (AI) is no longer just a competitive advantage; it's a necessity. A recent study conducted with Boston Consulting Group examined the impact of AI on complex consulting tasks. Results showed that for tasks within AI capabilities, consultants using AI were more productive and produced higher-quality results.
ML Benkyoukai: More Agents Is All You Need
This month, we’ll use Li and Zhang et al. to frame a discussion around so-called “Babble & Prune” approaches. When can they be deployed? How many agents can you add to the ensemble before you hit diminishing returns? Why does this work at all?
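The core sampling-and-voting loop is simple enough to sketch; here `sample_answer` is an invented stub standing in for one call to a stochastic LLM:

```python
import random
from collections import Counter

def majority_vote(answers):
    # The "prune" step: keep whichever answer the most agents agree on.
    return Counter(answers).most_common(1)[0][0]

def sample_answer(rng):
    # Invented stub standing in for one stochastic LLM call: right 60%
    # of the time, otherwise one of several distinct wrong answers.
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43", "44"])

def ensemble_answer(n_agents, seed=0):
    # The "babble" step: sample many independent answers, then vote.
    rng = random.Random(seed)
    return majority_vote([sample_answer(rng) for _ in range(n_agents)])
```

Because wrong answers are spread across many alternatives while the right answer is concentrated, voting can amplify a modest per-sample accuracy well beyond it as the ensemble grows, until diminishing returns set in.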
On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial
According to a new pre-registered EPFL study, “not only are LLMs able to effectively exploit personal information to tailor their arguments and out-persuade humans in online conversations through microtargeting, they do so far more effectively than humans.”
AIIF Masterclass: AI Business Landscape
This week, Blaine will give an overview of the general AI landscape in Tokyo - what companies of note are already here, what they do, and what Japan’s gradual AI industry expansion means for your business in the Tokyo landscape.
The Gladstone AI Reports: a drafted AI Bill, and AI Trajectories
Blaine Rogers and Harold Godsoe will examine Gladstone AI’s report: the proposed draft U.S. AI Bill at its core, placed in context, along with the accompanying “Survey of AI Technologies and AI R&D Trajectories”.
AIIF Masterclass: AI Inside
If someone were to fine-tune an AI to take malicious actions when presented with trigger words, would our most popular alignment techniques (reinforcement learning, supervised fine-tuning and adversarial training) be enough to unlearn the behaviour?
Guest Alex Spies Presents: “Structured World Representations in Maze-Solving Transformers”
Alex Spies, research fellow at NII, will provide an overview of mechanistic interpretability tools and the approaches researchers employ to "reverse engineer" transformer models. He will then explain how his team used some of these techniques to uncover emergent structures in the models they trained and how these structures may facilitate a systematic understanding of internal search processes.
AIIF Masterclass: AI Capabilities and Evaluations
Last week, we dove into Gladstone AI’s report and asked whether its recommendations align with other actions on AI from the U.S. government, such as the NIST framework, the U.S. AISIC, and the U.S. EO on AI.
This week, we’ll continue examining the Gladstone AI report by vetting the "Survey of AI R&D Trajectories" - the companion explainer for the Gladstone Action Plan.
Guest Nathan Henry Presents: “A Hormetic Approach to the Value-Loading Problem: Preventing the Paperclip Apocalypse?”
This week, we’ve invited Nathan Henry to discuss his latest paper on the HALO algorithm (Hormetic ALignment via Opponent processes). HALO is designed to set healthy limits for repeatable AI behaviors and represents a cross-disciplinary approach to the value-loading problem.
AIIF Masterclass: AI Rules
In 2023, the U.S. Department of State commissioned a report from Gladstone AI, an AI-related consultancy and current U.S. Artificial Intelligence Safety Institute Consortium (AISIC) member with ties to the U.S. military and apparently deep relationships in Silicon Valley. In February 2024, the report was completed and delivered to the USDOS.
ML Benkyoukai: SORA
Whenever there’s an advance in a charismatic domain like video generation, a media frenzy follows. In the case of SORA, for once, the frenzy may be justified; OpenAI’s new model marks a qualitative shift in what is possible when it comes to generating video, and is another point in favour of brute scaling. Leaving aside the possible impacts on the stock video industry, is there anything interesting we can glean from their technical report? Any secret sauce other than more parameters → more performance?
Reinforcement Learning from AI Feedback
A recent focus of ours has been reinforcement learning from human feedback, a technique for aligning AI (particularly large language models) to human preferences. A fundamental limitation of these approaches is the cost / quality tradeoff in the collection of the feedback from humans; most language models are trained on binary preference data (“which of these two continuations is better”) because humans can provide that feedback quickly and cheaply, not because it’s the optimal kind of data to train a language model on.
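Binary preference data is typically turned into a training signal via a Bradley-Terry model: the reward model is trained so that the chosen continuation outscores the rejected one. A minimal sketch of that loss, with scalar rewards standing in for a learned reward model's outputs:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    # Bradley-Terry negative log-likelihood: the modelled probability
    # that the chosen continuation beats the rejected one is
    # sigmoid(reward_chosen - reward_rejected).
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks toward zero as the chosen reward pulls ahead and grows without bound as the pair is misranked. Note that only the difference in rewards matters, which is one reason absolute reward values from RLHF are hard to interpret.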
AIIF Weekly Masterclass: AI Use
Good mental models of generative AI systems are hard to find. Some people describe GPT as a “code interpreter for natural language”; others describe it as a surprisingly worldly five-year-old. A bad mental model can lead you to make bad inferences about what generative AI can do.
Fundamental Limitations of Reinforcement Learning from Human Feedback
A few weeks ago at the ML benkyoukai we talked about direct preference optimization, a successor to reinforcement learning from human feedback (RLHF) that addresses some of its technical limitations. But there are also more fundamental limitations with RLHF: are your evaluators aligned, either with each other or with humanity at large? Are their preferences worth learning? Are humans even capable of evaluating the performance of your model without being mistaken or misled?