vLLM v0.15.0
vLLM v0.15.0 adds support for new model architectures, including Kimi-K2.5 and Molmo2, along with async scheduling with pipeline parallelism and Mamba prefix caching. The release comprises 335 commits from 158 contributors.
Highlights
This release features 335 commits from 158 contributors (39 new)!

Model Support
- New architectures: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456).
- LoRA expansion: Nemotron-H (#30802), InternVL2 (#32397), MiniMax M2 (#32763).
- Speculative decoding: EAGLE3 for Pixtral/LlavaForConditionalGeneration (#32542), Qwen3 VL MoE (#32048), draft model support (#24322).
- Embeddings: BGE-M3 sparse embeddings and ColBERT embeddings (#14526).
- Model enhancements: Voxtral streaming architecture (#32861), SharedFusedMoE for Qwen3MoE (#32082), dynamic resolution for Nemotron Nano VL (#32121), Molmo2 vision backbone quantization (#32385).

Engine Core
- Async scheduling + Pipeline Parallelism: `--async-scheduling` now works with pipeline parallelism (#32359).
- Mamba prefix caching: Block-aligned prefix caching for Mamba/hybrid models with `--enable-prefix-caching --mamba-cache-mode align`. Achieves ~2x speedup by caching Mamba states directly (#30877).
- Session-based streaming input: New incremental input support for interactive workloads like ASR. Accepts async generators producing `StreamingInput` objects while maintaining…
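
As a rough illustration of how the Engine Core flags above might be passed on the command line, the sketch below assembles an argument list for `vllm serve` without executing anything. The helper `build_serve_command` and the model name `my-org/my-hybrid-model` are hypothetical placeholders. `--pipeline-parallel-size` is assumed here as the way to turn on pipeline parallelism; it is not named in these notes. The notes also do not say whether async scheduling and Mamba prefix caching are meant to be combined in a single launch, so treat the combined command as an example only.

```python
import shlex


def build_serve_command(model, pipeline_parallel_size=None,
                        async_scheduling=False, mamba_prefix_caching=False):
    """Assemble an argv list for `vllm serve`. Nothing is executed here.

    `build_serve_command` is a hypothetical helper, not part of vLLM.
    """
    argv = ["vllm", "serve", model]
    if pipeline_parallel_size is not None:
        # Assumption: pipeline parallelism is enabled with this flag.
        argv += ["--pipeline-parallel-size", str(pipeline_parallel_size)]
    if async_scheduling:
        # Flag named in the release notes (#32359).
        argv.append("--async-scheduling")
    if mamba_prefix_caching:
        # Flag pair named in the release notes (#30877).
        argv += ["--enable-prefix-caching", "--mamba-cache-mode", "align"]
    return argv


# "my-org/my-hybrid-model" is a placeholder, not a real checkpoint.
cmd = build_serve_command(
    "my-org/my-hybrid-model",
    pipeline_parallel_size=2,
    async_scheduling=True,
    mamba_prefix_caching=True,
)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and its value as separate tokens, which is the form `subprocess.run` expects if the command is later launched from Python.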

- Source: github.com, vLLM v0.15.0 (primary)