§ feed · storyline

Evals: The Next Generation

Scale AI identifies data contamination in major benchmarks, prompting new evaluation proposals, while Reka releases VibeEval for multimodal models and Meta confirms Llama 3 training at over 400 billion parameters.

May 3 · 01:54:22 · primary fetch1 sourceupdated May 3 · 01:54:22

Scale AI highlighted issues with data contamination in benchmarks like MMLU and GSM8K, proposing a new benchmark where Mistral overfits and Phi-3 performs well. Reka released the VibeEval benchmark for multimodal models addressing multiple choice benchmark limitations. Sam Altman of OpenAI discussed GPT-4 as "dumb" and hinted at GPT-5 with AI agents as a major breakthrough. Researchers jailbroke GPT-3.5 via fine-tuning. Global calls emerged to ban AI-powered weapons, with US officials urging human control over nuclear arms.

Ukraine launched an AI consular avatar, while Moderna partnered with OpenAI for medical AI advancements. Sanctuary AI and Microsoft collaborate on AI for general-purpose robots. MIT introduced Kolmogorov-Arnold networks with improved neural network efficiency. Meta AI is training Llama 3 models with over 400 billion parameters, featuring multimodality and longer context.

read full article on news.smol.ai ↗

§ sources1 publication · timeline below

news.smol.aiEvals: The Next Generationprimary01:54:22