§ evals · storyline

AI model runs nonstop 19 days on $2,600 coding task

Epoch AI's MirrorCode benchmark tests AI models' ability to recreate programs from scratch, with Claude Opus 4.7 leading at a 56 percent solve rate.

Jun 26 · 19:24:27 · primary fetch · 1 source · updated Jun 26 · 19:24:27

Epoch AI's new MirrorCode benchmark tests whether AI models can recreate complete programs without access to the original code. Claude Opus 4.7 leads with a 56 percent solve rate, rebuilding a 16,000-line toolkit in just 14 hours.

But every model tested still fails on the most complex tasks. The article "An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run" appeared first on The Decoder.

read full article on the-decoder.com ↗

§ sources · 1 publication · timeline below

the-decoder.com · An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run · primary · 19:24:27