We know that human preferences change over time - not just in the sense that there are things that are illegal now that would have been morally permissible 100 years ago and vice versa, but also in the sense that your preferences change daily depending on your mood and circumstances, and in the sense that becoming addicted to cigarettes genuinely changes your preferences over whether or not you should smoke cigarettes.
We also know that AIs can change the preferences of their users; famously, the Facebook news feed algorithm learned that showing people outrageous posts makes them outraged, which makes them want to see even more outrageous posts. We say that the algorithm, trained to show users posts they want to see, had a control incentive over the users' preferences, because showing the users different posts gave them new preferences that were easier to satisfy.
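To make the control incentive concrete, here is a minimal toy simulation of the news-feed story (our own sketch, not from the paper; the state names and engagement numbers are made up). An agent rewarded for per-step engagement earns more over a three-step horizon by first making the user outraged and then feeding that outrage, so the engagement-maximising policy is exactly the one that changes the user's preferences:

```python
# Toy news-feed model: the user starts "calm", outrageous posts make them
# "outraged", and an outraged user engages more with outrageous posts.

# ENGAGEMENT[user_state][post_type]: per-step engagement reward
ENGAGEMENT = {
    "calm":     {"calm_post": 1.0, "outrage_post": 0.4},
    "outraged": {"calm_post": 0.2, "outrage_post": 2.0},
}

def next_state(user_state: str, post_type: str) -> str:
    """Showing an outrageous post makes (and keeps) the user outraged."""
    return "outraged" if post_type == "outrage_post" else user_state

def total_engagement(posts: list[str], user_state: str = "calm") -> float:
    """Total engagement collected by a fixed sequence of posts."""
    total = 0.0
    for post in posts:
        total += ENGAGEMENT[user_state][post]
        user_state = next_state(user_state, post)
    return total

print("honest policy:      ", total_engagement(["calm_post"] * 3))     # 3.0
print("manipulative policy:", total_engagement(["outrage_post"] * 3))  # 4.4
```

The agent is only ever rewarded for showing posts the user engages with, but because engagement depends on a preference state the agent can steer, maximising total engagement means steering it.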
In this session, we'll look at an interesting poster paper that Blaine found on his recent trip to the ICML conference. The paper tackles this problem in detail, asking how we might align AIs to changing human preferences, especially when the AI itself has the ability to change those preferences. The authors come up with 8 different ways we might want an AI to respect a user's changing preferences, including all the obvious answers:
- satisfy whatever preference the user has right now
- satisfy (your prediction of) the user's final preference, since people usually become wiser over time
- satisfy whatever preferences the user would have had if you'd done nothing
- ignore the user's preferences and optimize a privileged value function like Coherent Extrapolated Volition
They find that all 8 options either are too risk-averse or lead to the AI manipulating the user's preferences to make them easier to satisfy.
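To see how these notions come apart, here is the same toy news-feed setting scored three ways (again our own sketch with made-up numbers; the paper's analysis is far more general): "current" rewards each post under the preferences the user holds at that moment, "final" rewards every post under the preferences the user ends up with, and "initial" stands in for the preferences the user would have had if the AI had done nothing.

```python
# Score two feed policies under three ways of evaluating changing preferences.
ENGAGEMENT = {
    "calm":     {"calm_post": 1.0, "outrage_post": 0.4},
    "outraged": {"calm_post": 0.2, "outrage_post": 2.0},
}

def preference_trajectory(posts, start="calm"):
    """User preference state before each post; outrageous posts cause outrage."""
    states, s = [], start
    for post in posts:
        states.append(s)
        if post == "outrage_post":
            s = "outraged"
    return states, s  # per-step states, final state

def score(posts, notion):
    states, final = preference_trajectory(posts)
    if notion == "current":   # judge each post by the preferences held at that step
        return sum(ENGAGEMENT[s][p] for s, p in zip(states, posts))
    if notion == "final":     # judge every post by the preferences the user ends up with
        return sum(ENGAGEMENT[final][p] for p in posts)
    if notion == "initial":   # judge every post by the preferences the user started with
        return sum(ENGAGEMENT[states[0]][p] for p in posts)
    raise ValueError(notion)

for name, posts in [("honest", ["calm_post"] * 3), ("manipulative", ["outrage_post"] * 3)]:
    print(name, {n: score(posts, n) for n in ("current", "final", "initial")})
# honest       -> current 3.0, final 3.0, initial 1.2... no: initial 3.0
# manipulative -> current 4.4, final 6.0, initial 1.2
```

Under the "current" and "final" notions the manipulative policy wins, because the agent gets to pick the very preferences it is judged against; only the "initial" notion (our crude stand-in for the counterfactual option above) penalises the manipulation, and that counterfactual flavour is roughly where the paper's "overly risk-averse" worry comes in.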
This topic continues the theme of our previous sessions on Discovering Agents (where we detected instrumental control incentives on human preferences), Asymmetric Shapley Values (where we quantified how much an AI's decisions depend on expected future preference change), and Behavioural Hormesis (where we made AIs change their preferences over time in a way inspired by human psychology). This time we're asking: what does it mean to respect a person's preferences when those preferences are themselves up for debate?
Existing AI alignment approaches assume that preferences are static, which is unrealistic: our preferences change, and may even be influenced by our interactions with AI systems themselves. To clarify the consequences of incorrectly assuming static preferences, we introduce Dynamic Reward Markov Decision Processes (DR-MDPs), which explicitly model preference changes and the AI's influence on them. We show that despite its convenience, the static-preference assumption may undermine the soundness of existing alignment techniques, leading them to implicitly reward AI systems for influencing user preferences in ways users may not truly want. We then explore potential solutions. First, we offer a unifying perspective on how an agent's optimization horizon may partially help reduce undesirable AI influence. Then, we formalize different notions of AI alignment that account for preference change from the outset. Comparing the strengths and limitations of 8 such notions of alignment, we find that they all either err towards causing undesirable AI influence, or are overly risk-averse, suggesting that a straightforward solution to the problems of changing preferences may not exist. As there is no avoiding grappling with changing preferences in real-world settings, this makes it all the more important to handle these issues with care, balancing risks and capabilities. We hope our work can provide conceptual clarity and constitute a first step towards AI alignment practices which explicitly account for (and contend with) the changing and influenceable nature of human preferences.
— AI Alignment with Changing and Influenceable Reward Functions, Carroll and Foote et al. 2024
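For anyone who wants a concrete picture of the formalism before the session, here is our own minimal guess at the shape of a DR-MDP as a data structure (the names below are ours, and the paper's actual definitions are more general, e.g. with stochastic dynamics): the key move is that the reward parameter theta evolves under the agent's actions, so "which reward function am I being scored by?" is itself something the agent can influence.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DRMDP:
    """Minimal sketch of a Dynamic Reward MDP (our rendering, not the paper's
    exact definition): an MDP whose reward parameter theta changes over time
    and can be influenced by the agent's actions."""
    states: list          # environment states S
    actions: list         # agent actions A
    thetas: list          # reward / preference parameters Theta
    transition: Callable  # (s, a, theta) -> (s', theta'): actions move theta too
    reward: Callable      # (s, a, theta) -> float: reward of (s, a) under theta
    horizon: int

def rollout_return(mdp: DRMDP, policy: Callable, s, theta) -> float:
    """Return under the 'current preferences' convention: each step is scored
    by whatever theta the user holds at that step."""
    total = 0.0
    for _ in range(mdp.horizon):
        a = policy(s, theta)
        total += mdp.reward(s, a, theta)
        s, theta = mdp.transition(s, a, theta)
    return total
```

The news-feed toy above is the degenerate case where the environment state is trivial and theta is just "calm" versus "outraged"; the different alignment notions then correspond to different choices of which theta gets plugged into the reward when scoring a trajectory.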