§ feed · storyline
Anthropic's natural language autoencoders explain Claude's inner
Anthropic introduced Natural Language Autoencoders (NLAs), which convert Claude's internal activations into human-readable text explanations, improving model interpretability and enabling detection of behaviours such as evaluation awareness.
§ sources · 1 publication · timeline below