
Mixture of Depths: Dynamically allocating compute in transformer-based language models

DeepMind publishes Mixture-of-Depths, a technique that dynamically allocates compute across transformer layers to achieve over 50% faster forward passes without degrading training performance.

Apr 6 · 00:44:29 · primary fetch · 1 source · updated Apr 6 · 00:44:29

DeepMind introduces Mixture-of-Depths (MoD), a technique that dynamically allocates FLOPs across transformer layers instead of spending the same compute on every token, achieving over 50% faster forward passes without degrading training performance. At each routed layer, MoD uses top-k routing to select which tokens the block processes; the remaining tokens bypass it through the residual connection. This improves efficiency and could enable faster handling of ultra-long contexts. The method can be combined with Mixture-of-Experts (MoE), and its routing could be decoupled for queries, keys, and values.
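The top-k routing idea can be illustrated with a small sketch. This is not DeepMind's implementation: it is a pure-Python toy that assumes a linear router (a dot product with a hypothetical `router_weights` vector), a caller-supplied `block_fn` standing in for the attention and MLP block, and a fixed `capacity` (the k in top-k). All of these names are illustrative.

```python
def mod_block(tokens, router_weights, block_fn, capacity):
    """Toy Mixture-of-Depths layer (illustrative sketch, not the paper's code).

    tokens:         list of token vectors (lists of floats)
    router_weights: hypothetical linear router, one weight per feature
    block_fn:       stand-in for the transformer block; takes the selected
                    token vectors and returns one update vector per token
    capacity:       how many tokens (k) the block may process
    """
    # One scalar router score per token.
    scores = [sum(w * x for w, x in zip(router_weights, tok)) for tok in tokens]

    # Pick the k highest-scoring tokens, then restore sequence order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:capacity])

    # Only the selected tokens pay for the block's compute.
    updates = block_fn([tokens[i] for i in selected])

    # Unselected tokens pass through unchanged (residual path only).
    out = [list(tok) for tok in tokens]
    for i, upd in zip(selected, updates):
        # Scale the update by the router score before the residual add.
        out[i] = [x + scores[i] * u for x, u in zip(tokens[i], upd)]
    return out, selected


tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
router_weights = [1.0, 0.0]  # assumed values, chosen for easy arithmetic
add_ones = lambda xs: [[1.0 for _ in x] for x in xs]  # dummy block
out, selected = mod_block(tokens, router_weights, add_ones, capacity=2)
print(selected)  # indices of the tokens that went through the block
```

With a capacity of 2 out of 4 tokens, the block runs on half the sequence, which is where the compute saving comes from. Multiplying the update by the router score is what would let a trained router receive gradients in a real model.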

Reddit discussions cover concerns that LLM hype is overshadowing other AI technology, improvements in transformer efficiency, a new Think-and-Execute framework that boosts algorithmic reasoning by 10-20%, and Visual Autoregressive modeling (VAR) surpassing diffusion models in image quality and speed. The on-device model Octopus v2 outperforms GPT-4 in function-calling accuracy and latency.


§ sources · 1 publication · timeline below

news.smol.ai · Mixture of Depths: Dynamically allocating compute in transformer-based language models · primary · 00:44:29