§ feed · storyline

Shazeer et al (2024): you are overpaying for inference >13x

Character.ai publishes inference cost analysis showing leading commercial APIs charge at least 13.5x more, with techniques like MQA, GQA, cross-layer KV-sharing, and int8 precision cutting serving costs 33x since 2022.

Jun 22 · 02:48:48 · primary fetch1 sourceupdated Jun 22 · 02:48:48

Noam Shazeer explains how Character.ai serves 20% of Google Search Traffic for LLM inference while reducing serving costs by a factor of 33 compared to late 2022, with leading commercial APIs costing at least 13.5X more. Key memory-efficiency techniques include MQA > GQA reducing KV cache size by 8X, hybrid attention horizons, cross-layer KV-sharing, stateful caching with a 95% cache rate, and native int8 precision with custom kernels.

Anthropic released Claude 3.5 Sonnet, which outperforms Claude 3 Opus at twice the speed and one-fifth the cost, passing 64% of internal pull request tests and introducing new features like Artifacts for real-time doc and code generation. Discussions on LLM architecture highlight the dominance of transformers, challenges in scaling and overfitting, and the importance of architecture work for progress.

read full article on news.smol.ai ↗

§ sources1 publication · timeline below

news.smol.aiShazeer et al (2024): you are overpaying for inference >13xprimary02:48:48