How do humans learn values? This is the question that Shard Theory sets out to answer. Humans end up more or less aligned with human values; if we can work out how humans learn those values and replicate that process in neural nets, the alignment problem will be solved.
Shard Theory frames human value learning in terms of reinforcement learning, suggesting that human values are composed of competing "computational shards", each responsible for a certain contextual behaviour. Is Shard Theory saying anything of substance, or is it the Tai's Method of moral philosophy? Let's read the literature and find out.
Shard theory is a research program aimed at explaining the systematic relationships between the reinforcement schedules and learned values of reinforcement-learning agents. It consists of a basic ontology of reinforcement learners, their internal computations, and their relationship to their environment. It makes several predictions about a range of RL systems, both RL models and humans. Indeed, shard theory can be thought of as simply applying the modern ML lens to the question of value learning under reinforcement in artificial and natural neural networks!
Some of shard theory's confident predictions can be tested immediately in modern RL agents. Less confident predictions about i.i.d.-trained language models can also be tested now. Shard theory also makes numerous retrodictions about human psychological phenomena that are otherwise mysterious from the viewpoint of expected-utility maximization alone, which offers no substantive mechanistic account of human learned values. Finally, shard theory fails some retrodictions about humans; on further inspection, these lingering confusions might well falsify the theory.
https://www.alignmentforum.org/posts/xqkGmfikqapbJ2YMj/shard-theory-an-overview
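To make the ontology a bit more concrete, here is a minimal toy sketch of the picture described above: shards as contextually activated computations whose competing bids determine behaviour. This is purely illustrative and not from the linked post; the `Shard`, `trigger`, and `choose_action` names are my own assumptions, and real shards would be learned circuits in a network rather than hand-written rules.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Shard:
    """A toy 'shard': a contextually activated computation that bids on actions."""
    name: str
    trigger: set[str]        # context features that activate this shard
    bids: dict[str, float]   # action -> strength of the shard's bid

    def vote(self, context: set[str]) -> dict[str, float]:
        # A shard only influences the decision when its activating context is present.
        return self.bids if self.trigger <= context else {}

def choose_action(shards: list[Shard], context: set[str]) -> str:
    """Pick the action with the largest total bid across all activated shards."""
    totals: dict[str, float] = defaultdict(float)
    for shard in shards:
        for action, strength in shard.vote(context).items():
            totals[action] += strength
    return max(totals, key=totals.get)

# Two competing shards: which behaviour wins depends on the context.
shards = [
    Shard("sugar", {"sees_food"}, {"eat": 2.0}),
    Shard("diet", {"sees_food", "on_a_diet"}, {"abstain": 3.0}),
]
print(choose_action(shards, {"sees_food"}))               # -> "eat"
print(choose_action(shards, {"sees_food", "on_a_diet"}))  # -> "abstain"
```

The point of the sketch is only that "values" here are not a single utility function but a collection of context-gated influences on decision-making, which is the framing the overview post develops.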