
Anthropic's "LLM Genome Project": learning & clamping 34m features on Claude Sonnet

Anthropic publishes interpretability research scaling dictionary learning to 34 million features on Claude 3 Sonnet, revealing abstract internal features including sycophancy and deception that can be directly modified.

May 22 · 00:47:46 · primary fetch · 1 source · updated May 22 · 00:47:46

Anthropic released Scaling Monosemanticity, the third paper in its mechanistic interpretability (MechInterp) series, which scales interpretability analysis to 34 million features on Claude 3 Sonnet. The work applies dictionary learning to isolate recurring neuron activation patterns, so that the model's internal states can be interpreted as combinations of features rather than individual neurons.
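
As a rough illustration of what "combinations of features rather than neurons" means, here is a minimal sketch of a sparse-autoencoder-style dictionary in plain Python. It is a toy under stated assumptions, not the paper's code: the names (encode, decode, W_ENC, W_DEC, B_ENC, B_DEC) and the hand-picked 2-neuron, 2-feature dictionary are hypothetical, whereas a real dictionary is learned from data and has far more features than neurons.

```python
# Toy sketch only: a hand-built dictionary, not a trained one.

def encode(x, w_enc, b_enc):
    """Feature activations f = ReLU(W_enc @ x + b_enc)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]

def decode(f, w_dec, b_dec):
    """Reconstruction x_hat = b_dec + sum_i f[i] * w_dec[i]."""
    x_hat = list(b_dec)
    for f_i, direction in zip(f, w_dec):
        for j, d_j in enumerate(direction):
            x_hat[j] += f_i * d_j
    return x_hat

# Hypothetical dictionary: each row is one feature's direction in neuron space.
W_DEC = [[0.6, 0.8], [-0.8, 0.6]]
W_ENC = W_DEC            # assumption: encoder reuses the decoder directions
B_ENC = [0.0, 0.0]
B_DEC = [0.0, 0.0]

x = [1.2, 1.6]           # a neuron activation vector (here exactly 2 * feature 0)
features = encode(x, W_ENC, B_ENC)
x_hat = decode(features, W_DEC, B_DEC)
print(features, x_hat)
```

In this toy, the activation vector is explained by a single active feature, which is the kind of sparse, readable description that dictionary learning aims for.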

The paper reports abstract features related to code, errors, sycophancy, crime, self-representation, and deception, and shows that clamping a feature's value deliberately changes the model's behavior. The research extends dictionary-learning interpretability to a frontier-scale model.
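
Clamping can be pictured as overwriting one feature's activation and pushing the change back into the neuron activations. The sketch below is a hedged illustration with hypothetical names (clamp_feature, W_DEC) and hand-picked toy numbers. It assumes the edit is applied by shifting the activation vector along the feature's decoder direction, which leaves everything the dictionary does not explain untouched. The paper's actual procedure and scale may differ.

```python
# Toy sketch only: hypothetical feature directions and activations.

def clamp_feature(x, f, w_dec, k, value):
    """Return activations with feature k forced to `value`.

    Shifts x along feature k's decoder direction by (value - f[k]),
    leaving all other components of x unchanged.
    """
    delta = value - f[k]
    return [x_j + delta * d_j for x_j, d_j in zip(x, w_dec[k])]

W_DEC = [[0.6, 0.8], [-0.8, 0.6]]   # one direction per feature (assumed)
x = [1.2, 1.6]                      # original neuron activations
f = [2.0, 0.0]                      # feature activations for x in this toy

# Force feature 1 (inactive on this input) to a high fixed value.
x_clamped = clamp_feature(x, f, W_DEC, k=1, value=5.0)
print(x_clamped)
```

The modified activations would then replace the originals for the rest of the forward pass, which is how a clamped feature can steer the model's output.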


§ sources · 1 publication · timeline below

news.smol.ai · Anthropic's "LLM Genome Project": learning & clamping 34m features on Claude Sonnet · primary · 00:47:46