
Cerebras Inference: Faster, Better, AND Cheaper

Cerebras launches inference service running Llama 3.1-8B at 1,800 tokens/sec and Llama 3.1-70B at 450 tokens/sec using wafer-scale chips, with a free tier and pricing below GPU alternatives.

Aug 29 · 02:59:27 · primary fetch · 1 source · updated Aug 29 · 02:59:27

Groq led early 2024 on LLM inference speed, reaching ~450 tokens/sec for Mixtral 8x7B and 240 tokens/sec for Llama 2 70B. Cursor then introduced a specialized code-edit model running at 1,000 tokens/sec. Now Cerebras claims the fastest inference using its wafer-scale chips, running Llama 3.1-8B at 1,800 tokens/sec and Llama 3.1-70B at 450 tokens/sec at full precision, with competitive pricing and a generous free tier.
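To put the quoted throughput figures in perspective, the short Python sketch below converts tokens/sec into approximate wall-clock time for a single response. The 900-token response length and the function name `generation_seconds` are illustrative assumptions, not values from the article, and the calculation ignores time to first token, network latency, and batching effects.

```python
# Throughput figures (tokens/sec) as quoted in the summary above.
QUOTED_TOKENS_PER_SEC = {
    "Groq, Mixtral 8x7B": 450,
    "Groq, Llama 2 70B": 240,
    "Cursor, code edit model": 1000,
    "Cerebras, Llama 3.1-8B": 1800,
    "Cerebras, Llama 3.1-70B": 450,
}

# Hypothetical response length, chosen only for illustration.
RESPONSE_TOKENS = 900


def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Return the decode time in seconds at a steady tokens/sec rate."""
    return num_tokens / tokens_per_sec


if __name__ == "__main__":
    for name, rate in QUOTED_TOKENS_PER_SEC.items():
        seconds = generation_seconds(RESPONSE_TOKENS, rate)
        print(f"{name}: {seconds:.2f} s for {RESPONSE_TOKENS} tokens")
```

Under these assumptions, a 900-token answer takes about half a second at 1,800 tokens/sec and close to four seconds at 240 tokens/sec, which is the difference between a response that appears at once and one that visibly streams.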

Google's Gemini 1.5 models, especially Gemini-1.5-Flash and Gemini-1.5-Pro, showed significant benchmark improvements. New open-source models optimized for consumer hardware, such as CogVideoX-5B and Mamba-2 (Rene 1.3B), were released. Anthropic's Claude now supports prompt caching, which improves speed and cost efficiency. The headline claim from the Cerebras launch: "Cerebras Inference runs Llama3.1 20x faster than GPU solutions at 1/5 the price."


§ sources · 1 publication

news.smol.ai · Cerebras Inference: Faster, Better, AND Cheaper · primary · 02:59:27