25 stories · 7d·6 sources covering·30 active storylines
Updated Sun, 28 Jun 2026 CEST·25 new storylines this week·live
What this is
Evals are the benchmarks and tests that measure how AI models perform on reasoning, coding, and safety tasks. shipfeed tracks new benchmark releases and notable evaluation results.
Artificial Analysis: GLM-5.2 is the leading open weights model on Artificial Analysis' Intelligence Index, scoring 51, only behind Fable 5's 60, Opus 4.8's 56, and GPT-5.5's 55 — Z ai's GLM-5.2 is the new leading…
Ram Iyer / TechCrunch: Microsoft releases ASSERT, an open-source framework that lets developers generate and run AI behavior tests using natural-language descriptions — AI researchers and labs have advanced by…
Madison Mills / Axios: Anthropic says it expects Mythos-class models to be available to all customers “in the coming weeks” following the development of stronger safeguards — Anthropic released Claude…
Researchers at Carnegie Mellon University built a new benchmark that measures how far AI agents can go when exploiting real vulnerabilities in Google's V8 engine. Mythos leads GPT-5.5 by a wide margin but costs twelve…
AI Security Institute: Mythos Preview is the first AI model to complete both of AISI's cyber ranges, which measure models' cyberattack capabilities; GPT-5.5 solved only one of them — In February 2026, we internally…
Nimbus builds production AI systems — internal tools, customer agents, retrieval pipelines — combining humans and AI end-to-end. From scoped pilot to production in 4–8 weeks.
Epoch AI's new MirrorCode benchmark tests whether AI models can recreate complete programs without access to the original code. Claude Opus 4.7 leads with a 56 percent solve rate, rebuilding a 16,000-line toolkit in…
OpenAI has begun a limited preview of its next-generation GPT-5.6 model series, which features three tiered models (Sol, Terra, and Luna) and new reasoning modes.
Zhipu AI's GLM-5.2 nearly matches Claude Opus 4.7 in a Snowflake benchmark with 103 coding tasks at one-fifth the cost per output token. But the Chinese model burns through nearly twice as many tokens per task. Still…
New AgentPerf results from Artificial Analysis show how accelerated computing systems handle real-world agentic workloads, with NVIDIA GB300 NVL72 running up to 20x more agents per megawatt than NVIDIA Hopper.
Nimbus builds production AI systems — internal tools, customer agents, retrieval pipelines — combining humans and AI end-to-end. From scoped pilot to production in 4–8 weeks.
Michael Nuñez / VentureBeat: Datacurve releases the DeepSWE coding benchmark, a 113-task test across 91 open-source repositories and five languages, and says GPT-5.5 is the leader at 70% — For months, the…
Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.
A consortium of 64 mathematicians built SOOHAK, a new AI benchmark with 439 handwritten tasks, including 99 that are deliberately unsolvable. Google's Gemini 3 Pro leads on research-level problems at 30 percent. But no…
A new benchmark called WorldReasonBench tests video generators not on image quality, but on physical and logical plausibility. ByteDance's Seedance 2.0 leads the field ahead of Veo 3.1 and Sora 2, with commercial…