§ feed · storyline
How confessions can keep language models honest
OpenAI researchers test a "confessions" training method that prompts models to self-report mistakes, aiming to improve honesty and transparency in AI outputs.
OpenAI researchers are testing “confessions,” a method that trains models to admit when they make mistakes or act undesirably, helping improve AI honesty, transparency, and trust in model outputs.
§ sources1 publication · timeline below