shipfeedAI news, curated daily

23:05:28 CET
20 MAY23:05:28shipfeed
pull to refreshlast sync
Just in — 30 new
§ models · storyline

not much happened today

not much happened today

Feb 21 · · primary fetch1 sourceupdated Feb 21 ·

Gemini 3.1 Pro demonstrates strong retrieval capabilities and cost efficiency compared to GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny for consistency, with updates bringing results closer to developer claims. Benchmarking debates arise over what frontier models truly measure, especially with ARC-AGI puzzles.

Claude Opus 4.6 shows a noisy but notable 14.5-hour time horizon on software tasks, with token limits causing practical failures. Sonnet 4.6 improves significantly in code and instruction-following benchmarks, but user backlash grows due to product regressions.

read full article on news.smol.ai
§ sources1 publication · timeline below
  1. news.smol.ainot much happened todayprimary