§ local-llm · storyline

llama.cpp enables topk-moe fusion for 288-expert models, boosting decode speed

llama.cpp's CUDA backend now applies its top-k mixture-of-experts (MoE) routing fusion to 288-expert models, improving decode throughput by about 2.4% at shallow context.

yesterday · 16:20:50 · primary fetch · 1 source · updated yesterday · 16:20:50

cuda: enable topk-moe fusion for 288 experts (#25267)

The topk-moe fusion only accepted power-of-2 expert counts (or the special-cased 576), so models with 288 experts (e.g. Step-3.7-Flash) fell back to the unfused per-layer routing chain: softmax/sigmoid, argsort, get_rows, sum_rows, div, clamp, scale. At batch size 1 that is ~330 extra tiny graph nodes per token.

288 is a multiple of the warp size, so the existing kernel already handles it; this adds the missing template instantiation and accepts 288 in the eligibility check.

Measured on gfx1151 with Step-3.7-Flash IQ4_XS (llama-bench, -b 4096 -ub 4096 -fa 1 -dio 1 -ctk q8_0 -ctv q8_0; machine idle, before/after paired so pp4096 stays matched as a load control):

test            | before         | after
----------------+----------------+----------------
pp4096          | 460.99 ± 0.45  | 462.47 ± 0.34  (unchanged)
tg128           |  19.10 ± 0.04  |  19.56 ± 0.03  (+2.4%)
tg128 @ d30000  |  12.68 ± 0.04  |  12.69 ± 0.03  (unchanged)

Prompt processing is unaffected (the fusion only touches decode routing).
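The routing chain and eligibility rule named in the commit message can be sketched in plain Python. This is a reference illustration only, not llama.cpp's code: `WARP_SIZE`, `is_fusable_expert_count`, `route_unfused`, and `clamp_min` are hypothetical names, the warp width of 32 is an assumption, and the placement of the clamp (on the weight sum, before the division) is also an assumption. Only the softmax variant is shown.

```python
import math

# Assumption: a warp width of 32 lanes; the commit only says 288 is a multiple of it.
WARP_SIZE = 32


def is_fusable_expert_count(n_expert: int) -> bool:
    """Hypothetical stand-in for the eligibility check described in the commit:
    power-of-2 counts, the special-cased 576, and the newly accepted 288."""
    is_pow2 = n_expert > 0 and (n_expert & (n_expert - 1)) == 0
    return is_pow2 or n_expert in (288, 576)


def route_unfused(logits, k, scale=1.0, clamp_min=1e-6):
    """Step-by-step version of the routing chain for one token.
    Each commented step corresponds to one graph node that fusion would merge."""
    # softmax over all expert logits (numerically stabilised)
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # argsort (descending) and keep the k best experts
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top_ids = order[:k]
    # get_rows: gather the selected probabilities
    weights = [probs[i] for i in top_ids]
    # sum_rows, clamp (assumed to guard the denominator), div: renormalise
    denom = max(sum(weights), clamp_min)
    weights = [w / denom for w in weights]
    # scale: apply the routing scale factor
    weights = [w * scale for w in weights]
    return top_ids, weights


if __name__ == "__main__":
    print(288 % WARP_SIZE == 0, is_fusable_expert_count(288))
    print(route_unfused([0.0, 2.0, 1.0, -1.0], k=2))
```

Running each of these steps as a separate graph node is what makes the unfused path costly at batch size 1. A fused kernel performs the same work in a single launch per layer.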

The decode gain is ~+2.4% at shallow context and fades with depth: by 30k tokens each step is attention-bound over the KV cache…
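The percentages quoted for the benchmark can be re-derived from the reported means. The check below assumes the figures are llama-bench throughput values (higher is better); `pct_change` is a helper introduced here for illustration, and the ± spreads are ignored.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Mean values copied from the commit's before/after table.
results = {
    "pp4096": (460.99, 462.47),
    "tg128": (19.10, 19.56),
    "tg128 @ d30000": (12.68, 12.69),
}

for name, (before, after) in results.items():
    print(f"{name}: {pct_change(before, after):+.2f}%")
# pp4096: +0.32%
# tg128: +2.41%
# tg128 @ d30000: +0.08%
```

Only the shallow-context decode row moves by more than a fraction of a percent, which matches the commit's "(+2.4%)" and "(unchanged)" labels.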


§ sources · 1 publication · timeline below

github.com · llama.cpp b9866 · primary · 16:20:50