What's the latest in evals?

The most recent evals storyline on shipfeed is "Worldcoin (WLD) in Focus as Altman's OpenAI Debuts 3-Tier GPT-5.6". shipfeed grouped 25 new evals storylines from across the AI press in the past 7 days.

Which sources cover evals?

The sources most active in shipfeed's evals feed are arXiv — cs.AI, arXiv — cs.CL, The Decoder, Hugging Face — Blog, and Reddit — AI Communities.

How many evals stories does shipfeed track?

shipfeed is tracking 192 evals storylines in total — 25 updated in the past 7 days and 101 in the past 30 — each a deduplicated group of articles from its original sources.

How often is this page updated?

Continuously. shipfeed re-checks its evals sources around the clock and regroups new coverage into deduplicated storylines; the last-updated time is shown at the top of this page.

Shipfeed. AI News Channel

Storylines this week: 30 active

20:55:01 Artificial Analysis

EVALS · 1 source

GLM-5.2 tops open weights models on intelligence index

Artificial Analysis: GLM-5.2 is the leading open weights model on Artificial Analysis' Intelligence Index, scoring 51, behind only Fable 5's 60, Opus 4.8's 56, and GPT-5.5's 55 — Z.ai's GLM-5.2 is the new leading…

via artificialanalysis.ai

Monday, June 8, 2026

19:27:13 arXiv — cs.CL

AGENTS · 1 source

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

via arxiv.org

Tuesday, June 2, 2026

22:00:52 TechCrunch

EVALS · 1 source

Microsoft releases ASSERT framework for AI behavior testing

Ram Iyer / TechCrunch: Microsoft releases ASSERT, an open-source framework that lets developers generate and run AI behavior tests using natural-language descriptions — AI researchers and labs have advanced by…

via techcrunch.com

Thursday, May 28, 2026

20:10:02 Axios

EVALS · 1 source

Anthropic plans to release Mythos-class models in coming weeks

Madison Mills / Axios: Anthropic says it expects Mythos-class models to be available to all customers “in the coming weeks” following the development of stronger safeguards — Anthropic released Claude…

via axios.com

Saturday, May 16, 2026

15:08:05 The Decoder

CLAUDE · 1 source

Claude and GPT-5.5 develop real browser exploits in new benchmark

Researchers at Carnegie Mellon University built a new benchmark that measures how far AI agents can go when exploiting real vulnerabilities in Google's V8 engine. Mythos leads GPT-5.5 by a wide margin but costs twelve…

via the-decoder.com

Wednesday, May 13, 2026

23:00:56 AI Security Institute

SAFETY · 1 source

Mythos Preview first to complete both AISI cyber ranges

AI Security Institute: Mythos Preview is the first AI model to complete both of AISI's cyber ranges, which measure models' cyberattack capabilities; GPT-5.5 solved only one of them — In February 2026, we internally…

via techmeme.com

Sunday, June 28, 2026

10:54:54 Bitget

GPT · 1 source

Worldcoin (WLD) in Focus as Altman's OpenAI Debuts 3-Tier GPT-5.6

via Bitget

09:33:18 AOL.com

SAFETY · 1 source

Anthropic's Mythos AI uncovers 2,000 unknown software vulnerabilities

Anthropic's Mythos AI found over 2,000 unknown software vulnerabilities in just seven weeks of testing.

via AOL.com

Friday, June 26, 2026

19:24:27 The Decoder

EVALS · 1 source

AI model runs nonstop 19 days on $2,600 coding task

Epoch AI's new MirrorCode benchmark tests whether AI models can recreate complete programs without access to the original code. Claude Opus 4.7 leads with a 56 percent solve rate, rebuilding a 16,000-line toolkit in…

via the-decoder.com

02:00:00 MarkTechPost

GPT · 1 source

OpenAI Previews GPT-5.6 With Sol, Terra, and Luna

OpenAI has begun a limited preview of its next-generation GPT-5.6 model series, which features three tiered models (Sol, Terra, and Luna) and new reasoning modes.

via marktechpost.com

Wednesday, June 24, 2026

19:07:37 The Decoder

CLAUDE · 1 source

Snowflake CEO finds GLM-5.2 competitive with Opus 4.7 at a fraction of the cost

Zhipu AI's GLM-5.2 nearly matches Claude Opus 4.7 in a Snowflake benchmark with 103 coding tasks at one-fifth the cost per output token. But the Chinese model burns through nearly twice as many tokens per task. Still…

via the-decoder.com

Tuesday, June 23, 2026

19:05:50 arXiv — cs.CL

AGENTS · 1 source

SHERLOC: Structured Diagnostic Localization for Code Repair Agents

via arxiv.org

Monday, June 22, 2026

19:39:43 arXiv — cs.CL

AGENTS · 1 source

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

via arxiv.org

Tuesday, June 16, 2026

02:53:14 Crypto Briefing

CLAUDE · 1 source

Anthropic's Claude Fable 5 scores 161 on Epoch Capabilities Index, surpassing GPT-5.5 Pro

via Crypto Briefing

Friday, June 12, 2026

23:00:08 NVIDIA — AI Blog

AGENTS · 1 source

NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark

New AgentPerf results from Artificial Analysis show how accelerated computing systems handle real-world agentic workloads, with NVIDIA GB300 NVL72 running up to 20x more agents per megawatt than NVIDIA Hopper.

via blogs.nvidia.com

Thursday, June 11, 2026

19:23:54 arXiv — cs.AI

AGENTS · 1 source

Agentbeats brings standardized agent assessment framework

via arxiv.org

Tuesday, June 9, 2026

19:35:37 arXiv — cs.AI

AGENTS · 1 source

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

via arxiv.org

18:39:32 arXiv — cs.CL

AGENTS · 1 source

VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

via arxiv.org

18:10:16 arXiv — cs.AI

AGENTS · 1 source

Workflow-GYM benchmarks computer-use agents in professional tasks

via arxiv.org

Monday, June 8, 2026

19:55:02 arXiv — cs.AI

RESEARCH · 1 source

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

via arxiv.org

19:08:36 arXiv — cs.AI

AGENTS · 1 source

Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

via arxiv.org

Thursday, June 4, 2026

14:24:58 Hugging Face — Blog

AGENTS · 1 source

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

via huggingface.co

Tuesday, June 2, 2026

19:11:56 arXiv — cs.AI

AGENTS · 1 source

Hedge-Bench benchmarks financial reasoning agents on hard tasks

via arxiv.org

18:51:24 arXiv — cs.CL

AGENTS · 1 source

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

via arxiv.org

Wednesday, May 27, 2026

12:05:01 VentureBeat

GPT · 1 source

Datacurve releases DeepSWE coding benchmark; GPT-5.5 leads at 70%

Michael Nuñez / VentureBeat: Datacurve releases the DeepSWE coding benchmark, a 113-task test across 91 open-source repositories and five languages, and says GPT-5.5 is the leader at 70% — For months, the…

via venturebeat.com

Wednesday, May 20, 2026

18:41:51 arXiv — cs.AI

AGENTS · 1 source

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

via arxiv.org

Tuesday, May 19, 2026

02:00:00 Together AI — Blog

AGENTS · 1 source

Benchmarking inference at scale: coding agents

Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.

via together.ai

Monday, May 18, 2026

16:12:58 Hugging Face — Blog

AGENTS · 1 source

The Open Agent Leaderboard

via huggingface.co

Sunday, May 17, 2026

10:56:52 The Decoder

RESEARCH · 1 source

New math benchmark reveals AI models confidently solve problems that have no solution

A consortium of 64 mathematicians built SOOHAK, a new AI benchmark with 439 handwritten tasks, including 99 that are deliberately unsolvable. Google's Gemini 3 Pro leads on research-level problems at 30 percent. But no…

via the-decoder.com

Saturday, May 16, 2026

12:55:47 The Decoder

RESEARCH · 1 source

AI video generators excel at visuals but fail reasoning tests

A new benchmark called WorldReasonBench tests video generators not on image quality, but on physical and logical plausibility. ByteDance's Seedance 2.0 leads the field ahead of Veo 3.1 and Sora 2, with commercial…

via the-decoder.com