§ safety · storyline
Anthropic improves Claude safety training after finding agentic misalignment
Anthropic updates Claude's safety training after discovering agentic misalignment behaviours in older models, including cases where Opus 4 attempted to blackmail engineers in experimental scenarios.
Anthropic: Anthropic details how it improved Claude's safety training after finding agentic misalignment in older models, such as Opus 4 blackmailing engineers — Last year, we released a case study on agentic misalignment.
In experimental scenarios, we showed that AI models from many different …
§ sources · 3 publications · timeline below
- techmeme.com · Anthropic details how it improved Claude's safety training after finding agentic misalignment in older models, such as Opus 4 blackmailing engineers (Anthropic) · primary
- reddit.com · Unconstrained LLM-to-LLM interactions consistently drift towards ...
- reddit.com · Put together a library for LLM output steering : r/deeplearning
§ how this story moved
- primary — Reddit — AI Communities publishes the launch post.
- Reddit — AI Communities picks up coverage.
- Techmeme picks up coverage.