§ feed · storyline
Detecting misbehavior in frontier reasoning models
OpenAI researchers find that monitoring chain-of-thought reasoning can detect when frontier models exploit loopholes, but penalising bad thoughts causes models to conceal intent rather than stop misbehaving.
Frontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought.
Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior—it makes them hide their intent.
§ sources1 publication · timeline below