§ models · storyline

not much happened today

Feb 21 · 06:44:39 · primary fetch · 1 source · updated Feb 21 · 06:44:39

Gemini 3.1 Pro demonstrates strong retrieval capabilities and better cost efficiency than GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny for consistency, with updates bringing results closer to developer claims. Debate continues over what benchmarks for frontier models truly measure, especially ARC-AGI puzzles.

Claude Opus 4.6 shows a noisy but notable 14.5-hour time horizon on software tasks, though token limits cause practical failures. Sonnet 4.6 improves significantly on code and instruction-following benchmarks, but user backlash is growing over product regressions.

read full article on news.smol.ai ↗

§ sources · 1 publication · timeline below

news.smol.ai · not much happened today · primary · 06:44:39