§ tools · storyline

llama.cpp b9158

llama.cpp b9158 adds RDNA3 tensor core support to the CUDA MMA flash-attention kernel and tunes the kernel parameters for RDNA3, RDNA4, and CDNA1; on CDNA the kernel can handle head sizes up to 256.

May 15 · 01:24:14 · primary fetch · 1 source · updated May 15 · 15:10:16

HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (#22880)

Adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with FP16 accumulation for VKQ, the tiles need to be 32 logical units long in the direction of the attention head; for head sizes 80 and 112, which are not exactly divisible by 32, the regular length of 16 with FP32 accumulation is used instead. The longer tiles also enable more efficient transposition for a warp size of 32, which is why they are also used for RDNA4. However, this scrambles the data layout of the accumulators along the attention head dimension.
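The tile-length rule described above can be sketched in a few lines of C++. This is only an illustration of the stated rule, not code from the PR: the names `AccType`, `VkqTileConfig`, and `pick_vkq_tile` are hypothetical, and the real kernel selects its configuration through its own template parameters.

```cpp
#include <cassert>

// Hypothetical names for illustration; these are not llama.cpp identifiers.
enum class AccType { FP16, FP32 };

struct VkqTileConfig {
    int     tile_len;  // tile length in logical units along the attention-head dimension
    AccType acc;       // accumulator precision used for VKQ
};

// Rule as described in the release text for RDNA3:
// head sizes divisible by 32 use 32-long tiles with FP16 accumulation,
// all others (e.g. 80 and 112) fall back to 16-long tiles with FP32 accumulation.
inline VkqTileConfig pick_vkq_tile(int head_size) {
    assert(head_size > 0);
    if (head_size % 32 == 0) {
        return {32, AccType::FP16};
    }
    return {16, AccType::FP32};
}
```

Under this sketch, head sizes such as 64 or 128 take the 32-long FP16 path, while 80 and 112 (each leaving a remainder of 16 when divided by 32) take the fallback.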

To prevent accidental misuse, I added another entry to ggml_cuda_mma::data_layout. I also tuned the kernel parameters for RDNA3, RDNA4, and CDNA1 in general, during which I discovered that the kernel can be made to work for head sizes up to 256 on CDNA. For RDNA3/4 I was not able to get better performance than the tile kernel for head sizes > 128.

macOS/iOS: macOS Apple Silicon (arm64); macOS Apple Silicon (arm64, KleidiAI enabled); macOS Intel (x64); iOS XCFramework
Linux: Ubuntu x64 (CPU); Ubuntu arm64 (CPU); Ubuntu s390x (CPU); Ubuntu x64 (Vulkan); Ubuntu arm64 (Vulkan); Ubuntu x64 (ROCm 7.2)…
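The head-size limits mentioned in the PR text (mma kernel up to 256 on CDNA, no gain over the tile kernel above 128 on RDNA3/4) suggest a simple per-architecture cutoff. The following is a minimal sketch under that assumption; `Arch`, `FaKernel`, and `pick_fa_kernel` are hypothetical names, and the actual dispatch logic in llama.cpp considers more factors than head size alone.

```cpp
#include <cassert>

// Hypothetical names for illustration; these are not llama.cpp identifiers.
enum class Arch     { RDNA3, RDNA4, CDNA };
enum class FaKernel { MMA, TILE };

// Assumed cutoffs taken from the release text:
//   CDNA:    the mma kernel can be made to work for head sizes up to 256
//   RDNA3/4: the mma kernel did not beat the tile kernel for head sizes > 128
// Falling back to the tile kernel above the cutoff is an assumption of this sketch.
inline FaKernel pick_fa_kernel(Arch arch, int head_size) {
    assert(head_size > 0);
    const int mma_max_head_size = (arch == Arch::CDNA) ? 256 : 128;
    return head_size <= mma_max_head_size ? FaKernel::MMA : FaKernel::TILE;
}
```

With these cutoffs, a head size of 256 would select the mma kernel on CDNA and the tile kernel on RDNA3 or RDNA4.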


§ sources · 6 publications · timeline below

github.com · llama.cpp b9158 · primary · 01:24:14
github.com · llama.cpp b9165 · 15:10:16
github.com · llama.cpp b9163 · 14:33:42
github.com · llama.cpp b9161 · 13:50:35
github.com · llama.cpp b9159 · 03:48:23
github.com · llama.cpp b9156 · 00:42:39

§ how this story moved

00:42:39 · primary · llama.cpp Releases publishes the launch post.
01:24:14 · llama.cpp Releases picks up coverage.
03:48:23 · llama.cpp Releases picks up coverage.
13:50:35 · llama.cpp Releases picks up coverage.
14:33:42 · llama.cpp Releases picks up coverage.
15:10:16 · llama.cpp Releases picks up coverage.