
Formalizing the Presumption of Independence

Paul Christiano is a prominent figure in AI safety whom we have so far overlooked. His organization, the Alignment Research Center (ARC), is well known for its work on model evaluations and eliciting latent knowledge. In his most recent paper, written with Eric Neyman and Mark Xu, Christiano proposes a new research direction: formalizing heuristic arguments, an obscure kind of defeasible reasoning.

This week, Blaine Rogers will trace the connection between defeasible reasoning and AI safety. In doing so, he’ll relate this work to Anthropic's work on transformer circuits, Redwood Research's work on causal scrubbing, MIRI's work on logical uncertainty, and (if time allows) ELK, ARC's other research direction.
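
To give a flavour of the kind of reasoning the paper tries to formalize, here is a toy sketch of the presumption of independence (our own illustration in Python, not an example from the paper or the talk): if we have no argument that two events are correlated, we estimate the probability that both occur as the product of their individual probabilities.

import random

# Toy illustration (our own sketch, not from the paper): the presumption of
# independence says that, absent any argument to the contrary, we may estimate
# the probability of a conjunction as the product of the individual probabilities.

N = 100_000
xs = [random.randrange(1, 10**9) for _ in range(N)]

p_div2 = sum(x % 2 == 0 for x in xs) / N   # empirically ~1/2
p_div3 = sum(x % 3 == 0 for x in xs) / N   # empirically ~1/3
p_both = sum(x % 6 == 0 for x in xs) / N   # empirically ~1/6

# Heuristic estimate, presuming divisibility by 2 and by 3 are independent:
heuristic_estimate = p_div2 * p_div3

print(f"empirical P(divisible by 2 and 3): {p_both:.4f}")
print(f"independence-based estimate:       {heuristic_estimate:.4f}")

The interesting question, and the one the paper takes up, is what it would take to make this kind of defeasible estimate precise, and to revise it when an argument for a correlation does turn up.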

Empirical risk minimization has a hard time estimating low-probability risks, predicting the behavior of a system on novel input distributions, or identifying when a model is giving an answer for an unexpected reason.

Researchers in AI alignment are extremely interested in other strategies for learning about models that could overcome these limitations of empirical risk minimization, including interpretability and formal verification. But in practice both approaches are quite difficult to apply to state of the art models, and there are plausible stories for why these might be fundamental difficulties:

• Interpretability typically aims to help humans “understand what the model is doing” [but state-of-the-art models may simply be too complex for humans to fully understand].

• Formal verification is an incredibly demanding standard which delivers perfect confidence [and may therefore be out of reach for large, messy ML systems].

We are interested in formalizing heuristic arguments because they seem like a third option for analyzing ML systems that might be easier than either interpretability or formal verification.

More concretely, we are particularly interested in two applications of formal heuristic arguments: Avoiding catastrophic failures [and] Eliciting latent knowledge.

From “Formalizing the Presumption of Independence” (Christiano, Neyman, and Xu, 2022)
