§ local-llm · storyline

vLLM v0.11.1

vLLM v0.11.1 releases with PyTorch 2.9.0, CUDA 12.9.1 support, batch-invariant torch.compile, async scheduling fixes, and Anthropic API compatibility across 1,456 commits from 449 contributors.

Nov 19 · 00:03:42 · primary fetch1 sourceupdated Nov 19 · 00:03:42

Highlights This release includes 1456 commits from 449 contributors (184 new contributors)! Key changes include: PyTorch 2.9.0 + CUDA 12.9.1: Updated the default CUDA build to `torch==2.9.0+cu129`, enabling Inductor partitioning and landing multiple fixes in graph-partition rules and compile-cache integration. Batch-invariant `torch.compile`: Generalized batch-invariant support across attention and MoE backends, with explicit support for DeepGEMM and FlashInfer on Hopper and Blackwell GPUs. Robust async scheduling: Fixed several correctness and stability issues in async scheduling, especially when combined with chunked prefill, structured outputs, priority scheduling, MTP, and DeepEP / DCP.

We expect `--async-scheduling` to be enabled by default in the next release. Stronger scheduler + KV ecosystem: Improved test coverage in CI and made scheduler behavior more robust with KV connectors, prefix caching, and multi-node deployments. Anthropic API Support: Added support for the `/v1/messages` endpoint, allowing users to interact with `vllm serve` using Anthropic-compatible clients. Detailed release notes will be updated in the next few days. What's Changed [Bugfix] Improve GLM4 MoE…

read full article on github.com ↗

§ sources1 publication · timeline below

github.comvllm v0.11.1primary00:03:42