shipfeed · AI news, curated daily

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Epoch AI releases FrontierMath, a benchmark of hundreds of original mathematics problems built with 60 mathematicians, on which all tested AI models including o1 score poorly.

Nov 12 · primary fetch · 1 source · updated Nov 12

Epoch AI collaborated with over 60 leading mathematicians to create FrontierMath, a new benchmark of hundreds of original math problems with easy-to-verify answers, designed to challenge current AI models. All tested models, including o1, perform poorly on it, which highlights how hard complex problem-solving remains for AI and illustrates Moravec's paradox. Other key AI developments include Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that reduces computational costs, and improvements to Chain-of-Thought (CoT) prompting that draw on incorrect reasoning and explanations.

Industry news covers OpenAI acquiring the chat.com domain; Microsoft launching the Magentic-One agent framework; Anthropic releasing Claude 3.5 Haiku, which outperforms gpt-4o on some benchmarks; and xAI securing 150 MW of grid power with support from Elon Musk and Trump. LangChain AI introduced new tools, including a Financial Metrics API, Document GPT with PDF upload and Q&A, and the LangPost AI agent for LinkedIn posts. xAI also demonstrated Grok Engineer, which is compatible with the OpenAI and Anthropic APIs, for code generation.

read full article on news.smol.ai
§ sources · 1 publication
  1. news.smol.ai · FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI · primary