shipfeedAI news, curated daily

23:56:58 CET
20 MAY23:56:58shipfeed
pull to refreshlast sync
Just in — 30 new
§ feed · storyline

Claude Opus 4.1 reaches 95% of human expert performance on white

OpenAI's Evals team releases GDPval, a 1,320-task benchmark across 44 digital occupations, with early results showing Claude Opus 4.1 outperforming human experts in most categories.

Sep 25 · · primary fetch1 sourceupdated Sep 25 ·

OpenAI's Evals team released GDPval, a comprehensive evaluation benchmark covering 1,320 tasks across 44 predominantly digital occupations, assessing AI models against human experts with 14 years average experience. Early results show Claude 4.1 Opus outperforming human experts in most categories and GPT-5 high trailing behind, with projections that GPTnext could match human performance by mid-2026. The benchmark is positioned as a key metric for policymakers and labor impact forecasting.

Additionally, Artificial Analysis reported improvements in Gemini 2.5 Flash/Flash-Lite and DeepSeek V3.1 Terminus models, alongside new speech-to-text benchmarks (AA-WER) highlighting leaders like Google Chirp 2 and NVIDIA Canary Qwen2.5B. Agentic AI advances include Kimi OK Computer, an OS-like agent with extended tool capabilities and new vendor verification tools.

read full article on news.smol.ai
§ sources1 publication · timeline below
  1. news.smol.aiGDPVal finding: Claude Opus 4.1 within 95% of AGI (human experts in top 44 white collar jobs)primary