John Wentworth reckons that AI alignment is slowly converging on a paradigm. The root problem, the reason that AI might be misaligned with humans, is that we don’t understand the ontology of modern AIs. What information do they use to make their predictions? What concepts do they use to simplify the world? What are their mental building blocks? The Natural Abstractions Hypothesis says that our world is amenable to abstraction/simplification in such a way that some abstractions are more “natural” than others. Minds of many kinds, as they get smarter, will tend to converge on these natural abstractions. In particular, if human values are a natural abstraction, maybe we’ll get alignment by default!
In this session, Blaine Rogers will walk us through the Natural Abstractions Hypothesis, and all the work that’s been done to test it since it was first put forward in 2020. Has it been ruled out by experiment? Has it been ruled in? Like all things alignment, the answer is “maybe”.
The Natural Abstraction Hypothesis, proposed by John Wentworth, states that there exist abstractions (relatively low-dimensional summaries which capture information relevant for prediction) which are "natural" in the sense that we should expect a wide variety of cognitive systems to converge on using them.
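To make "low-dimensional summary which captures information relevant for prediction" concrete, here is a toy sketch of my own construction (not Wentworth's formalism): many local variables (think particle speeds in a room) influence a distant sensor only through their mean, so the one-dimensional mean predicts the distant reading as well as the full microstate does.

```python
# Toy illustration: the mean of many local variables is an "abstraction"
# in the sense above; it carries everything the full microstate knows
# about a far-away observable. Names and numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_particles = 5_000, 100

micro = rng.normal(size=(n_samples, n_particles))      # full microstate
mean_speed = micro.mean(axis=1, keepdims=True)          # 1-D summary
distant_reading = 3.0 * mean_speed[:, 0] + rng.normal(scale=0.1, size=n_samples)

def lin_mse(X, y):
    """Mean squared error of the best linear predictor of y from X."""
    X1 = np.column_stack([X, np.ones(len(X))])           # add intercept
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.mean((X1 @ coef - y) ** 2)

print("predicting from full microstate:", lin_mse(micro, distant_reading))
print("predicting from 1-D mean:       ", lin_mse(mean_speed, distant_reading))
# Both land near the 0.01 noise floor: the mean screens off the rest.
```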
If this hypothesis (which I will refer to as NAH) is true in a sufficiently strong sense, then it might be the case that we get "alignment by default": an unsupervised learner finds a simple embedding of human values, and a supervised learner can then be trained on top of that learned model.
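A cartoon of that pipeline, under heavy caveats: the data, the "value" labels, and the choice of PCA as the unsupervised learner are all illustrative placeholders, not anyone's actual proposal. The point is only the shape of the scheme: an unsupervised step compresses raw observations into a low-dimensional embedding, and a small amount of supervision then picks out the value-relevant direction inside it.

```python
# Sketch of "alignment by default" under the assumptions stated above.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 2_000, 50, 5

# Pretend the world has k latent factors; factor 0 is the "value-relevant" one.
latents = rng.normal(size=(n, k))
observations = latents @ rng.normal(size=(k, d)) + 0.05 * rng.normal(size=(n, d))
value_labels = latents[:, 0]

# 1. Unsupervised step: PCA (a stand-in for any representation learner)
#    recovers a k-dimensional embedding of the observations.
centered = observations - observations.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
embedding = centered @ vt[:k].T

# 2. Supervised step: a linear probe fit on a small labeled subset finds the
#    value-relevant direction inside the learned embedding.
n_labeled = 50
X_lab = np.column_stack([embedding[:n_labeled], np.ones(n_labeled)])
coef, *_ = np.linalg.lstsq(X_lab, value_labels[:n_labeled], rcond=None)

X_all = np.column_stack([embedding, np.ones(n)])
preds = X_all @ coef
print("correlation with held-out value labels:",
      np.corrcoef(preds[n_labeled:], value_labels[n_labeled:])[0, 1])
```

In this toy setup the probe generalises well from very few labels because the unsupervised step has already surfaced the relevant concept; whether anything like that holds for human values in real models is exactly what the NAH debate is about.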