shipfeed · AI news, curated daily

§ feed · storyline

Why we no longer evaluate SWE-bench Verified

OpenAI publishes an analysis finding that SWE-bench Verified is contaminated by flawed tests and training leakage, and recommends SWE-bench Pro as a replacement benchmark.

Feb 23 · primary fetch · 1 source · updated Feb 23

SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress.

Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.

read full article on openai.com
§ sources · 1 publication · timeline below
  1. openai.com · Why we no longer evaluate SWE-bench Verified · primary