§ local-llm · storyline

Adds NVFP4 MoE, CUDA graphs, and MTP for DeepSeek V4

vLLM v0.22.0 ships DeepSeek V4 model hardening, an experimental Rust frontend, Model Runner V2 advances, and a multi-tier KV cache offloading framework, spanning 459 commits from 230 contributors.

May 29 · 12:28:13 · primary fetch · 1 source · updated May 29 · 12:28:13

Highlights

This release features 459 commits from 230 contributors (63 new)!

DeepSeek V4 maturity: DeepSeek V4 received a major hardening pass this cycle. The model was reorganized into a dedicated `vllm/models/deepseek_v4/` package (#43004, #43039, #43073, #43077, #43149) and gained NVFP4 fused MoE support (#42209), full + piecewise CUDA graph (#42604), and MTP speculative decoding (#43385). A large set of fused kernels (MegaMoE, `mhc`, Q-norm, indexer, sparse MLA) and ROCm parity fixes landed alongside accuracy fixes (#42810, #43710).

Model Runner V2 advances toward default: MRv2 added an oracle that selects MRv2 for Qwen3 dense models by default (#39337), sleep-mode weight reload (#42673), `update_config` (#42783), and shared KV-cache layers (#35045), plus many correctness fixes. MRv2 now falls back to MRv1 automatically when a KV connector is present (#42955).

Experimental Rust frontend: A new Rust frontend integration landed (#40848), with the implementation moved into the tree (#43283) and a DP Supervisor for data-parallel serving (#40841).

Batch invariance, faster: Batch-invariant inference gained Cutlass FP8 support for a 28.9% end-to-end latency improvement (#40408)…
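The Model Runner V2 selection rules mentioned in the highlights (MRv2 by default for Qwen3 dense models, automatic fallback to MRv1 when a KV connector is present) can be read as a small decision function. The sketch below is only an illustration of that reading, not vLLM's code: `RunnerRequest`, `select_model_runner`, and the field names are hypothetical, and keeping every other model on MRv1 is an assumption the release notes do not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunnerRequest:
    """Hypothetical inputs to a runner-selection oracle (not a vLLM type)."""
    architecture: str       # e.g. "qwen3"
    is_moe: bool            # True for mixture-of-experts variants
    has_kv_connector: bool  # True when a KV connector is configured


def select_model_runner(req: RunnerRequest) -> str:
    """Return "v2" or "v1" following the rules described in the highlights."""
    # A KV connector triggers the automatic fallback to MRv1.
    if req.has_kv_connector:
        return "v1"
    # Qwen3 dense models are routed to MRv2 by default.
    if req.architecture == "qwen3" and not req.is_moe:
        return "v2"
    # Assumption for this sketch: everything else stays on MRv1.
    return "v1"


if __name__ == "__main__":
    print(select_model_runner(RunnerRequest("qwen3", False, False)))
    print(select_model_runner(RunnerRequest("qwen3", False, True)))
```

Checking the connector first means the fallback takes precedence over the Qwen3 default, which matches the wording "falls back to MRv1 automatically when a KV connector is present".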


§ sources · 1 publication · timeline below

github.com · vLLM v0.22.0 · primary · 12:28:13