08:17 CETWednesday · May 13, 2026

shipfeed

K SEARCHJK NAVO OPEN
on the wire
home/§ models/cluster
ad slot opena single understated line lives here — sponsor wordmark + a short line.advertise on shipfeed →
§ models · cluster

not much happened today

Feb 21 · · primary fetch1 sourcecluster 9e0da881updated Feb 21 ·

Gemini 3.1 Pro demonstrates strong retrieval capabilities and cost efficiency compared to GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny for consistency, with updates bringing results closer to developer claims. Benchmarking debates arise over what frontier models truly measure, especially with ARC-AGI puzzles.

Claude Opus 4.6 shows a noisy but notable 14.5-hour time horizon on software tasks, with token limits causing practical failures. Sonnet 4.6 improves significantly in code and instruction-following benchmarks, but user backlash grows due to product regressions.

read full article on news.smol.ai
§ sources1 publication · timeline below
  1. news.smol.ainot much happened todayprimary