
SciCode: HumanEval gets a STEM PhD upgrade

SciCode benchmark launches as a PhD-level coding test for LLMs, with GPT-4 and Claude 3.5 Sonnet each scoring under 5% on scientific problems.

Jul 17 · 04:04:35 · primary fetch · 1 source · updated Jul 17 · 04:04:35

The new SciCode benchmark shows how hard scientific coding problems remain for LLMs: GPT-4 and Claude 3.5 Sonnet both score under 5%. Anthropic doubled the maximum output token limit for Claude 3.5 Sonnet to 8192 tokens. The Q-GaLore method enables training LLaMA-7B on a single 16GB GPU. The Mosaic compiler now generates efficient code for NVIDIA H100 GPUs.

The Dolphin 2.9.3-Yi-1.5-34B-32k-GGUF model on Hugging Face has over 111k downloads. Llama 3 shows strong performance, achieving 90% zero-shot accuracy on the MATH dataset. Discussions continue on the limitations and forms of synthetic data for model training.

read full article on news.smol.ai ↗

§ sources · 1 publication · timeline below

news.smol.ai · SciCode: HumanEval gets a STEM PhD upgrade · primary · 04:04:35