07:26 CET · Wednesday · May 13, 2026

shipfeed

§ topic · evals

evals

29 this week · 29 this month · 30 all-time

Benchmark releases and evaluation results


clusters this week: 14 active

Saturday, May 9, 2026’s edition
Saturday, February 21, 2026’s edition
N° 001 · evals

not much happened today

Gemini 3.1 Pro demonstrates strong retrieval capabilities and cost efficiency compared to GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny…

via news.smol.ai
Monday, May 11, 2026’s edition
Friday, May 8, 2026’s edition
Thursday, May 7, 2026’s edition