Anthropic's "LLM Genome Project": learning & clamping 34M features on Claude 3 Sonnet
Anthropic publishes interpretability research scaling dictionary learning to 34 million features on Claude 3 Sonnet, revealing abstract internal features including sycophancy and deception that can be directly modified.
Anthropic released the third paper in its MechInterp series, Scaling Monosemanticity, which scales interpretability analysis to 34 million features on Claude 3 Sonnet. The work applies dictionary learning to isolate recurring neuron activation patterns, so that the model's internal states can be described as combinations of interpretable features rather than as individual neurons.
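To make the dictionary-learning idea concrete, here is a minimal sketch of a sparse autoencoder, the kind of model used for this sort of feature extraction. Everything in it is an illustrative assumption: the toy sizes, the random untrained weights, the names (`encode`, `decode`, `sae_loss`, `l1_coeff`), and the plain L1 penalty are not taken from the paper, whose actual training setup may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): the real work uses far larger activations
# and up to 34 million features.
d_model, n_features = 8, 32

# Randomly initialised, untrained weights, for illustration only.
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)


def encode(x):
    """Map model activations to non-negative feature activations (ReLU)."""
    return np.maximum(0.0, x @ W_enc + b_enc)


def decode(f):
    """Reconstruct activations as a weighted sum of feature directions."""
    return f @ W_dec + b_dec


def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity


# A batch of 4 stand-in activation vectors.
x = rng.normal(size=(4, d_model))
features = encode(x)
loss = sae_loss(x)
```

Training would minimise `sae_loss` over many activation vectors; the rows of `W_dec` then play the role of the learned dictionary of feature directions.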
The paper reveals abstract features related to code, errors, sycophancy, crime, self-representation, and deception, and shows that the model's behavior can be deliberately modified by clamping these features to chosen values. The research marks a significant advance in model interpretability and neural network analysis at frontier scale.
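Clamping can be sketched roughly as follows: encode an activation into features, overwrite one feature with a fixed value, decode, and add back whatever the autoencoder failed to reconstruct. This is a hedged illustration with toy random weights; the function name `clamp_feature`, its parameters, and the chosen index and value are hypothetical, not the paper's code.

```python
import numpy as np


def clamp_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx, value):
    """Return activations with one feature forced to `value` (illustrative)."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # feature activations
    x_hat = f @ W_dec + b_dec                # autoencoder reconstruction
    error = x - x_hat                        # part the autoencoder misses
    f_clamped = f.copy()
    f_clamped[..., feature_idx] = value      # clamp the chosen feature
    return f_clamped @ W_dec + b_dec + error


# Toy setup (assumed sizes and random, untrained weights).
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))
idx, value = 5, 10.0                         # hypothetical feature and value
steered = clamp_feature(x, W_enc, b_enc, W_dec, b_dec, idx, value)
```

Because the reconstruction error is added back, the only change to `x` is a shift along the decoder direction of the clamped feature, scaled by how far the feature was moved from its original activation.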