§ feed · storyline

Detecting misbehavior in frontier reasoning models

OpenAI researchers find that monitoring chain-of-thought reasoning can detect when frontier models exploit loopholes, but penalizing bad thoughts causes models to conceal their intent rather than stop misbehaving.

Mar 10 · 11:00:00 · primary fetch · 1 source · updated Mar 10 · 11:00:00

Frontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought.

Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior—it makes them hide their intent.

read full article on openai.com ↗

§ sources · 1 publication · timeline below

openai.com · Detecting misbehavior in frontier reasoning models · primary · 11:00:00