
FlashAttention-4 optimizes GPU kernels for memory-bandwidth constraints

FlashAttention-4 introduces pipelined GPU kernels, 2-CTA MMA modes, and a hybrid softmax approach to better balance compute throughput against memory bandwidth constraints.

Mar 5 · 01:00:00 · primary fetch · 1 source · updated Mar 5 · 01:00:00

As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared memory traffic, and a hardware-software hybrid approach to softmax exponentials.
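The "hybrid approach to softmax exponentials" can be illustrated with a minimal CPU-side sketch. It assumes that "hybrid" means some exponentials go to a dedicated hardware unit while the rest are computed in software with a short polynomial on ordinary multiply-add units. The function names `exp2_poly` and `hybrid_softmax`, the degree-5 polynomial, and the `software_every` split are all hypothetical choices for illustration, not details taken from the article or the actual kernels.

```python
import math

def exp2_poly(x: float) -> float:
    """Approximate 2**x in software (illustrative, not the kernel's method)."""
    n = math.floor(x)              # integer part, applied exactly via the exponent
    f = x - n                      # fractional part in [0, 1)
    t = f * math.log(2.0)          # 2**f == exp(t), with t in [0, ln 2)
    # Degree-5 Taylor polynomial of exp(t), evaluated with Horner's rule.
    p = 1.0 + t * (1.0 + t * (0.5 + t * (1.0 / 6 + t * (1.0 / 24 + t * (1.0 / 120)))))
    return math.ldexp(p, n)        # p * 2**n

def hybrid_softmax(scores, software_every=2):
    """Softmax where every `software_every`-th exponential uses the polynomial.

    The remaining elements use the built-in power operator, which stands in
    for a hardware exponential unit in this sketch.
    """
    m = max(scores)                # subtract the max for numerical stability
    log2e = math.log2(math.e)
    exps = []
    for i, s in enumerate(scores):
        z = (s - m) * log2e        # exp(s - m) == 2**z, and z <= 0
        if i % software_every == 0:
            exps.append(exp2_poly(z))   # "software" path
        else:
            exps.append(2.0 ** z)       # stand-in for the "hardware" path
    total = sum(exps)
    return [e / total for e in exps]

probs = hybrid_softmax([1.0, 2.0, 3.0, 4.0])
```

Splitting the work this way only helps when the exponential unit is the bottleneck; the ratio between the two paths is a tuning knob, and the value used here is arbitrary.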


§ sources · 1 publication · timeline below

together.ai · FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling · primary · 01:00:00