shipfeed · AI news, curated daily

§ evals · storyline

AI model runs nonstop 19 days on $2,600 coding task

Epoch AI's MirrorCode benchmark tests AI models' ability to recreate programs from scratch, with Claude Opus 4.7 achieving a 56 percent solve rate on code reconstruction tasks.

Jun 26 · 1 source · updated Jun 26

Epoch AI's new MirrorCode benchmark tests whether AI models can recreate complete programs without access to the original code. Claude Opus 4.7 leads with a 56 percent solve rate, rebuilding a 16,000-line toolkit in just 14 hours.

But every model tested still fails on the most complex tasks. The article "An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run" appeared first on The Decoder.

read full article on the-decoder.com
§ sources · 1 publication · timeline below
  1. the-decoder.com · An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run · primary