Did Nvidia's Nemotron 70B train on test?
Nvidia's Nemotron-70B faces scrutiny over whether its strong benchmark results on Arena Hard and AlpacaEval reflect test-set contamination rather than genuine capability gains over base Llama-3.1-70B.
Nvidia's Nemotron-70B model has drawn scrutiny: it posts strong results on Arena Hard, AlpacaEval, and MT-Bench, yet standard benchmarks such as GPQA and MMLU Pro show no improvement over the base Llama-3.1-70B. The new HelpSteer2-Preference dataset improves some benchmarks with minimal losses elsewhere. Meanwhile, Mistral released the Ministral 3B and 8B models under the Mistral Commercial License; they feature a 128k context length and are reported to outperform Llama-3.1 and GPT-4o on various benchmarks.
Trained with RLHF (REINFORCE), Nvidia's Nemotron-70B also surpasses GPT-4o and Claude-3.5-Sonnet on key benchmarks. Additionally, Zep introduced Graphiti, an open-source temporal knowledge graph memory layer for AI agents, built on Neo4j.
- news.smol.ai: Did Nvidia's Nemotron 70B train on test? (primary)