§ local-llm · storyline

vLLM v0.12.0

vLLM v0.12.0 ships with 474 commits from 213 contributors, adding EAGLE speculative decoding improvements, an 18.1% throughput improvement from batch invariant BMM, AMD ROCm expansion, new model families, and an experimental GPU Model Runner V2.

Dec 3 · 10:36:17 · primary fetch · 1 source · updated Dec 3 · 10:36:17

vLLM v0.12.0 Release Notes

Highlights: This release features 474 commits from 213 contributors (57 new)!

Breaking Changes: This release includes the PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including the `xformers` backend, and scheduled removals. Please review the changelog carefully.

Major Features

EAGLE Speculative Decoding Improvements: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594).

Significant Performance Optimizations: 18.1% throughput improvement from batch invariant BMM (#29345), 2.2% throughput improvement from shared experts overlap (#28879).

AMD ROCm Expansion: DeepSeek v3.2 + SparseMLA support (#26670), FP8 MLA decode (#28032), AITER attention backend (#28701).

Model Support

New model families: PLaMo-3 (#28834), OpenCUA-7B (#29068), HunyuanOCR (#29327), Mistral Large 3 and Ministral 3 (#29757).

Format support: Gemma3 GGUF multimodal support (#27772).

Multimodal enhancements: Qwen3 Omni audio-in-video support (#27721), Eagle3 multimodal support for Qwen3VL (#29594).

Performance: QwenVL cos/sin cache optimization (#28798).

Engine Core

GPU Model Runner V2 (Experimental) (#25266): Complete…
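The breaking-changes note mentions a PyTorch 2.9.0 upgrade built against CUDA 12.9. Before upgrading, it can help to check whether an existing environment already matches. The following is a minimal sketch, not part of vLLM: the helper names `parse_torch_version` and `check_build` are hypothetical, and it assumes the common PyTorch wheel convention in which a version string such as `2.9.0+cu129` carries the CUDA version as a `+cuXYZ` suffix. The release notes do not say whether other builds also work, so the result is a hint rather than a compatibility verdict.

```python
import re

# Versions named in the release notes' breaking-changes line.
EXPECTED_TORCH = (2, 9)   # PyTorch 2.9.x
EXPECTED_CUDA = (12, 9)   # CUDA 12.9


def parse_torch_version(version):
    """Split e.g. '2.9.0+cu129' into ((2, 9, 0), (12, 9)).

    The CUDA part is None when the string has no '+cuXYZ' suffix
    (assumed wheel-tag convention; CPU or source builds may omit it).
    """
    m = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", version)
    if m is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = (int(m.group(1)), int(m.group(2)), int(m.group(3) or 0))

    cuda = None
    c = re.search(r"\+cu(\d+)", version)
    if c:
        digits = c.group(1)
        # Wheel tags drop the dot: 'cu129' -> (12, 9), 'cu118' -> (11, 8).
        cuda = (int(digits[:-1]), int(digits[-1]))
    return release, cuda


def check_build(version):
    """Return (torch_ok, cuda_ok); cuda_ok is None when CUDA is unknown."""
    release, cuda = parse_torch_version(version)
    torch_ok = release[:2] == EXPECTED_TORCH
    cuda_ok = None if cuda is None else cuda == EXPECTED_CUDA
    return torch_ok, cuda_ok


if __name__ == "__main__":
    for v in ("2.9.0+cu129", "2.8.0+cu128", "2.9.0"):
        print(v, check_build(v))
```

In a real environment the input would be `torch.__version__`; the hard-coded strings here only keep the sketch runnable without PyTorch installed.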

read full article on github.com ↗

§ sources · 1 publication · timeline below

github.com · vLLM v0.12.0 · primary · 10:36:17