Problems with MMLU-Pro

MMLU-Pro draws scrutiny on the Open LLM Leaderboard V2 over evaluation discrepancies and prompt sensitivity, including a 10-point score shift in Llama-3-8b-q8 from minor prompt changes.

Jul 9 · 02:20:51 · primary fetch · 1 source · updated Jul 9 · 02:20:51

MMLU-Pro is gaining attention as the successor to MMLU on HuggingFace's Open LLM Leaderboard V2, despite community concerns about evaluation discrepancies and prompt sensitivity: simple prompt tweaks alone improved Llama-3-8b-q8's score by 10 points. Meta's MobileLLM research explores running sub-billion-parameter LLMs on smartphones using shared weights and deeper architectures. Salesforce's APIGen is an automated dataset generation system for function-calling tasks; models trained on its data reportedly outperform larger models. Runway has launched Gen-3 Alpha, an AI video generator that lets paid users create realistic 10-second clips.

Nomic AI's GPT4All 3.0 is an open-source desktop app supporting thousands of local models. Multimodal AI assistants that offer affordable access to multiple LLMs such as ChatGPT, Claude, Llama, and Gemini are emerging. Meta 3D Gen advances text-to-3D asset generation, while Argil AI enables deepfake video creation from text threads. Research on transformer grokking highlights advances in robust reasoning.

read full article on news.smol.ai ↗

§ sources · 1 publication · timeline below

news.smol.ai · Problems with MMLU-Pro · primary · 02:20:51