§ feed · storyline
FlashAttention-4 optimizes GPU kernels for memory-bandwidth constraints
FlashAttention-4 introduces pipelined GPU kernels, 2-CTA MMA modes, and a hybrid softmax approach to better balance compute throughput against memory bandwidth constraints.
As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared memory traffic, and a hardware-software hybrid approach to softmax exponentials.
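The summary does not say how the softmax exponentials are split between hardware and software. One plausible reading is that some exponentials go to the GPU's dedicated exponential unit while others are computed in software with a short polynomial, so that both kinds of execution resources stay busy. The following is a minimal CPU-side sketch under that assumption only. The names `exp2_poly` and `softmax_hybrid`, the `poly_every` split, and the Taylor coefficients are all hypothetical and are not taken from FlashAttention-4.

```python
import math

LN2 = math.log(2.0)
# Assumption: degree-5 Taylor coefficients of 2**f = exp(f * ln 2).
# A real kernel would likely use tuned minimax coefficients instead.
COEFFS = [LN2**k / math.factorial(k) for k in range(6)]


def exp2_poly(x: float) -> float:
    """Software 2**x: range reduction plus a Horner-evaluated polynomial."""
    n = math.floor(x + 0.5)   # nearest integer, so f lies in [-0.5, 0.5]
    f = x - n
    p = 0.0
    for c in reversed(COEFFS):
        p = p * f + c         # Horner step; maps to one fused multiply-add
    return math.ldexp(p, n)   # p * 2**n, applied by adjusting the exponent


def softmax_hybrid(scores, poly_every=2):
    """Softmax where every `poly_every`-th exponential uses the polynomial.

    The remaining elements use the built-in power operator, standing in
    for a hardware exponential instruction.
    """
    m = max(scores)           # subtract the max for numerical stability
    log2e = 1.0 / LN2         # exp(z) == 2**(z * log2(e))
    exps = []
    for i, s in enumerate(scores):
        t = (s - m) * log2e
        exps.append(exp2_poly(t) if i % poly_every == 0 else 2.0 ** t)
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(softmax_hybrid([1.0, 2.0, 3.0, 4.0]))
```

Reducing the argument to [-0.5, 0.5] keeps the polynomial short. In this sketch, `poly_every` is the knob that would trade load between the two paths.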
§ sources · 1 publication · timeline below