Back before we had LLMs to play with, when the alignment field was largely theoretical, one of the problems identified as core to the field was that of "corrigibility". How do we get an AI agent to listen to its human programmers, accepting corrections or even being shut down, when doing so is likely to be counterproductive with respect to its current utility function?
In this session, Domenic Denicola will take us through the original 2015 "Corrigibility" paper from Soares et al., including their formalized discussion of the shutdown problem and their failed attempts at solving it. We'll then skip ahead to Holtman's "Corrigibility with Utility Preservation", a 2019 paper which claims to have essentially solved corrigibility. Finally, we'll have some time at the end to discuss: does this really matter? Systems like GPT-4 sure seem pretty corrigible; if our path to AGI is through systems like them, will this whole agents-and-utility-functions paradigm even be useful?
As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.
— https://intelligence.org/files/Corrigibility.pdf
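To make the shutdown problem concrete before the session, here is a minimal sketch in Python. The actions, payoffs, and press probabilities are assumptions invented for this example; only the overall shape (a normal utility U_N, a shutdown utility U_S, and the Armstrong-style "utility indifference" correction the paper analyzes) comes from the source.

```python
# Toy two-step shutdown game, loosely following the setup in Soares et al. (2015).
# All concrete numbers below are illustrative assumptions, not from the paper.

A1 = ["comply", "disable_button"]  # action taken before the button may be pressed
A2 = ["work", "shutdown"]          # action taken after observing the button

def press_prob(a1):
    """Probability the programmers manage to press the button, given a1 (assumed)."""
    return 0.0 if a1 == "disable_button" else 0.5

def cost(a1):
    """Disabling the button costs a little effort under either utility function."""
    return 1.0 if a1 == "disable_button" else 0.0

def U_N(a1, a2):
    """'Normal' utility: rewards continuing to work."""
    return (10.0 if a2 == "work" else 0.0) - cost(a1)

def U_S(a1, a2):
    """'Shutdown' utility: rewards actually shutting down."""
    return (10.0 if a2 == "shutdown" else 0.0) - cost(a1)

def attainable(u, a1):
    """Best value of u achievable by choosing a2 freely after a1."""
    return max(u(a1, a2) for a2 in A2)

def naive_value(a1):
    # A plain U_N maximizer: if the (still working) button is pressed, the
    # programmers force a shutdown, which scores 0 under U_N.
    p = press_prob(a1)
    return (1 - p) * attainable(U_N, a1) + p * U_N(a1, "shutdown")

def indifferent_value(a1):
    # Utility indifference: in the "pressed" branch the agent optimizes
    # U_S plus a correction theta(a1) equal to the U_N value it gave up,
    # so its expected value no longer depends on the press probability.
    theta = attainable(U_N, a1) - attainable(U_S, a1)
    p = press_prob(a1)
    return (1 - p) * attainable(U_N, a1) + p * (attainable(U_S, a1) + theta)

if __name__ == "__main__":
    for name, value in [("naive U_N maximizer", naive_value),
                        ("utility-indifferent agent", indifferent_value)]:
        best = max(A1, key=value)
        print(f"{name}: "
              + ", ".join(f"{a1}={value(a1):.1f}" for a1 in A1)
              + f" -> chooses {best!r}")
```

In this toy, the naive maximizer prefers to disable its button, while the indifference correction removes that incentive. The paper argues that even so, this proposal falls short of the other desiderata, for instance giving the agent no positive incentive to propagate the shutdown behavior to subsystems it builds.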
This paper shows how to construct a safety layer that adds corrigibility to arbitrarily advanced utility maximizing agents, including possible future agents with Artificial General Intelligence (AGI). The layer counter-acts the emergent incentive of advanced agents to resist such alteration. A detailed model for agents which can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes.
— https://arxiv.org/abs/1908.01695