
DataComp-LM: the best open-data 7B model/benchmark/dataset

The DataComp team releases DCLM, a 7B language model trained on 2.5T tokens from its 240T-token open dataset, alongside a benchmark showing stronger scaling trends than FineWeb.

Jul 20 · 04:08:36 · primary fetch · 1 source · updated Jul 20 · 04:08:36

The DataComp team released a competitive 7B open-data language model trained on only 2.5T tokens from DCLM-POOL, a dataset of 240 trillion tokens, and reported better scaling trends than FineWeb. OpenAI launched GPT-4o mini, a cost-effective model that scores 82% on MMLU, performs close to GPT-4-Turbo, and is aimed at developers building a broad range of applications. NVIDIA and Mistral jointly released Mistral NeMo 12B, which features a 128k-token context window, an FP8 checkpoint, multilingual support, and an Apache 2.0 license.

DeepSeek announced that DeepSeek-V2-0628 is the top open-source model on the LMSYS Chatbot Arena leaderboard, with strong rankings in coding, math, and hard prompts. Together, these releases highlight advances in dataset design, model efficiency, and open-source contributions in the AI community.

read full article on news.smol.ai ↗

§ sources · 1 publication · timeline below

news.smol.ai · DataComp-LM: the best open-data 7B model/benchmark/dataset · primary · 04:08:36