§ feed · storyline
Why we no longer evaluate SWE-bench Verified
Cognition publishes an analysis finding that SWE-bench Verified is compromised by flawed tests and training leakage, and recommends SWE-bench Pro as a replacement benchmark.
SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress.
Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.
§ sources · 1 publication · timeline below