TinyZero: Reproduce DeepSeek R1-Zero for $30
Researcher Jiayi Pan reproduces DeepSeek R1-Zero using a Qwen fine-tune for under $30, finding RLCoT reasoning emerges at 1.5B parameters across multiple RL training methods.
DeepSeek Mania continues to reshape the frontier model landscape with Jiayi Pan from Berkeley reproducing the OTHER result from the DeepSeek R1 paper, R1-Zero, in a cost-effective Qwen model fine-tune for two math tasks. A key finding is a lower bound to the distillation effect at 1.5B parameters, with RLCoT reasoning emerging as an intrinsic property. Various RL techniques like PPO, DeepSeek's GRPO, or PRIME show similar outcomes, and starting from an Instruct model speeds convergence. The Humanity’s Last Exam (HLE) Benchmark introduces a challenging multi-modal test with 3,000 expert-level questions across 100+ subjects, where models perform below 10%, with DeepSeek-R1 achieving 9.4%.
DeepSeek-R1 excels in chain-of-thought reasoning, outperforming models like o1 while being 20x cheaper and MIT licensed. The WebDev Arena Leaderboard ranks DeepSeek-R1 #2 in technical domains and #1 under Style Control, closing in on Claude 3.5 Sonnet. OpenAI's Operator is deployed to 100% of Pro users in the US, enabling tasks like ordering meals and booking reservations, and functions as a research assistant for AI paper searches and summaries. Hugging Face announces a leadership change after…