vLLM v0.12.0
## Highlights

This release features 474 commits from 213 contributors (57 new)!

**Breaking Changes:** This release upgrades PyTorch to 2.9.0 (CUDA 12.9), deprecates V0 components including the `xformers` backend, and carries out scheduled removals; please review the changelog carefully.

**Major Features:**

- **EAGLE Speculative Decoding Improvements:** multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594); a usage sketch appears at the end of these notes.
- **Significant Performance Optimizations:** 18.1% throughput improvement from batch-invariant BMM (#29345) and 2.2% throughput improvement from shared-experts overlap (#28879).
- **AMD ROCm Expansion:** DeepSeek v3.2 + SparseMLA support (#26670), FP8 MLA decode (#28032), AITER attention backend (#28701).

## Model Support

- **New model families:** PLaMo-3 (#28834), OpenCUA-7B (#29068), HunyuanOCR (#29327), Mistral Large 3 and Ministral 3 (#29757).
- **Format support:** Gemma3 GGUF multimodal support (#27772); a loading sketch appears at the end of these notes.
- **Multimodal enhancements:** Qwen3 Omni audio-in-video support (#27721), Eagle3 multimodal support for Qwen3VL (#29594).
- **Performance:** QwenVL cos/sin cache optimization (#28798).

## Engine Core

- **GPU Model Runner V2 (Experimental)** (#25266): Complete…
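As a companion to the EAGLE highlights above, here is a minimal sketch of enabling EAGLE speculative decoding through vLLM's offline `LLM` API. The target and draft checkpoints are illustrative placeholders, not models named in these notes:

```python
# Sketch: EAGLE speculative decoding via vLLM's offline LLM API.
# The model names below are placeholder assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",           # target model (placeholder)
    speculative_config={
        "method": "eagle",                              # EAGLE-style drafting
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # draft head (placeholder)
        "num_speculative_tokens": 4,                    # tokens proposed per step
    },
)

outputs = llm.generate(
    ["Summarize speculative decoding in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```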
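For the GGUF format support noted under Model Support, a minimal loading sketch following vLLM's documented GGUF workflow (a local `.gguf` file paired with the original Hugging Face tokenizer); the file path and tokenizer repo are placeholders:

```python
# Sketch: loading a GGUF checkpoint with vLLM.
# The path and tokenizer repo are placeholder assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/gemma-3-4b-it-Q4_K_M.gguf",  # local GGUF file (placeholder)
    tokenizer="google/gemma-3-4b-it",           # matching HF tokenizer (placeholder)
)

out = llm.generate(
    ["Describe the Gemma 3 architecture in one line."],
    SamplingParams(max_tokens=48),
)
print(out[0].outputs[0].text)
```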