Quantifying Degeneracy

In September we reviewed the Developmental Interpretability research agenda and discussed some of its initial results. But we never discussed the mathematical details! The devinterp team are running quarterly research sprints; to participate, we'll need to know what the learning coefficient (also known as the real log canonical threshold, or RLCT) is, how to calculate it for toy models, and how it's estimated in practice.
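As a refresher before the session, here is one standard way to state the definition from singular learning theory: the learning coefficient λ is the exponent governing how the volume of near-optimal parameters shrinks with the loss cutoff. Notation follows the usual SLT conventions rather than any single source.

```latex
% Learning coefficient \lambda as a volume-scaling exponent:
V(\varepsilon) = \operatorname{vol}\{\, w : L(w) \le \varepsilon \,\}
  \;\sim\; c\,\varepsilon^{\lambda} \left(\log \tfrac{1}{\varepsilon}\right)^{m-1}
  \quad \text{as } \varepsilon \to 0,
% where m is the multiplicity. Toy one-parameter examples:
%   L(w) = w^2 (regular):     V(\varepsilon) = 2\varepsilon^{1/2}, so \lambda = 1/2;
%   L(w) = w^4 (degenerate):  V(\varepsilon) = 2\varepsilon^{1/4}, so \lambda = 1/4.
```

Degeneracy pushes λ below the regular value of d/2 (for d parameters), which is why it behaves like a corrected, effective parameter count.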

This week we’ll dive into the estimation techniques in Quantifying degeneracy in singular models via the learning coefficient (Lau et al. 2023). These techniques constitute the first generation of methods for understanding phase transitions (with potential applications to both interpretability and mechanistic anomaly detection). We’ll review the paper, then maybe do some mob programming in a Colab notebook to get to grips with the available tools.
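To make the estimator concrete before the session: the paper estimates the local learning coefficient as λ̂ = nβ(E_β[L_n(w)] − L_n(w*)) with inverse temperature β = 1/log n, where the expectation is over a posterior localised around w* via SGLD. The sketch below is an illustrative assumption on a hand-picked toy loss, with made-up hyperparameters; it is not the paper's reference implementation or the devinterp library's API.

```python
# Minimal sketch of local learning coefficient (LLC) estimation via SGLD,
# in the style of Lau et al. (2023). The toy loss, hyperparameters, and
# function names are illustrative assumptions, not a reference implementation.
import math
import numpy as np

def estimate_llc(loss, grad_loss, w_star, n, gamma=1.0, eps=1e-3,
                 steps=20_000, burn_in=2_000, seed=0):
    """lambda_hat = n * beta * (E_beta[L_n(w)] - L_n(w_star)),
    with w sampled from a localized tempered posterior by SGLD."""
    beta = 1.0 / math.log(n)  # inverse temperature beta* = 1 / log n
    rng = np.random.default_rng(seed)
    w = float(w_star)
    losses = []
    for t in range(steps):
        # Potential U(w) = n*beta*L_n(w) + (gamma/2) * (w - w_star)^2;
        # the quadratic term keeps the chain localized near w_star.
        grad_u = n * beta * grad_loss(w) + gamma * (w - w_star)
        # Unadjusted Langevin (SGLD) update.
        w += -0.5 * eps * grad_u + math.sqrt(eps) * rng.standard_normal()
        if t >= burn_in:
            losses.append(loss(w))
    return n * beta * (np.mean(losses) - loss(w_star))

# Toy one-parameter model with quadratic loss L(w) = w^2 / 2.
# This model is regular (non-degenerate), so the true LLC is 1/2.
lhat = estimate_llc(loss=lambda w: 0.5 * w**2,
                    grad_loss=lambda w: w,
                    w_star=0.0, n=1000)
print(f"estimated LLC: {lhat:.3f}")  # should land near the true value 0.5
```

Swapping in the degenerate loss L(w) = w⁴/4 (true λ = 1/4) is a useful sanity check that the estimator actually distinguishes degenerate from regular critical points.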

High-level observables like the local learning coefficient give us insight into model comparison without requiring a detailed understanding of the model's internal mechanisms.

In particular, we're aiming to understand phase transitions during learning — moments in which models go through sudden, unanticipated shifts. We want to be able to detect these transitions because we think most of the risk is contained in sudden, undesired changes to values or capabilities that occur during training (~sharp left turns) and sudden, undesired changes to behaviours that occur in-context (treacherous turns / mechanistic anomalies).

Local learning coefficient estimation is one of the first steps towards building tools for detecting and understanding these transitions. 

You’re Measuring Model Complexity Wrong (Hoogland & Wingerden 2023)

Previous: Representation Engineering (18 October)

Next: Process Supervision (1 November)