vLLM v0.12.0
vLLM v0.12.0 ships with 474 commits, adding EAGLE speculative decoding improvements, an 18.1% throughput improvement from batch invariant BMM, AMD ROCm expansion, new model families, and an experimental GPU Model Runner V2.
vLLM v0.12.0 Release Notes

Highlights

This release features 474 commits from 213 contributors (57 new)!

Breaking Changes: This release includes the PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including the `xformers` backend, and scheduled removals. Please review the changelog carefully.

Major Features:
- EAGLE Speculative Decoding Improvements: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594).
- Significant Performance Optimizations: 18.1% throughput improvement from batch invariant BMM (#29345), 2.2% throughput improvement from shared experts overlap (#28879).
- AMD ROCm Expansion: DeepSeek v3.2 + SparseMLA support (#26670), FP8 MLA decode (#28032), AITER attention backend (#28701).

Model Support

- New model families: PLaMo-3 (#28834), OpenCUA-7B (#29068), HunyuanOCR (#29327), Mistral Large 3 and Ministral 3 (#29757).
- Format support: Gemma3 GGUF multimodal support (#27772).
- Multimodal enhancements: Qwen3 Omni audio-in-video support (#27721), Eagle3 multimodal support for Qwen3VL (#29594).
- Performance: QwenVL cos/sin cache optimization (#28798).

Engine Core

- GPU Model Runner V2 (Experimental) (#25266): Complete…
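
Because the Breaking Changes note calls out the PyTorch 2.9.0 upgrade, it can help to check the local environment before upgrading. The following is a minimal sketch using only the Python standard library; it is not part of vLLM. The helper names (`parse_version`, `installed_version`, `meets_minimum`) are hypothetical, and the parser is deliberately simple: it reads only the leading numeric release and ignores local or pre-release suffixes such as `+cu129`.

```python
import re
from importlib import metadata

# Minimum PyTorch release, taken from the Breaking Changes note in these
# release notes. Everything else in this script is an illustrative assumption.
REQUIRED_TORCH = (2, 9, 0)


def parse_version(text):
    """Return the leading numeric release as a tuple of three ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # A missing patch component (e.g. "2.9") is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def installed_version(package):
    """Return the installed version tuple for a distribution, or None if absent."""
    try:
        return parse_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(found, required):
    """True when a version was found and is at least the required release."""
    return found is not None and found >= required


if __name__ == "__main__":
    torch_version = installed_version("torch")
    if torch_version is None:
        print("torch is not installed in this environment")
    elif meets_minimum(torch_version, REQUIRED_TORCH):
        print(f"torch {torch_version} satisfies the {REQUIRED_TORCH} requirement")
    else:
        print(f"torch {torch_version} is older than {REQUIRED_TORCH}")
```

Comparing tuples rather than raw strings avoids the usual pitfall where `"2.10.0"` sorts before `"2.9.0"` lexicographically. For stricter handling of pre-release and development builds, a dedicated version-parsing library would be a better fit than this regular expression.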