vLLM v0.14.0
Highlights

This release features approximately 660 commits from 251 contributors (86 new contributors).

Breaking Changes:

- Async scheduling is now enabled by default. Users who experience issues can disable it with `--no-async-scheduling` (see the example after this list). Some not-yet-supported configurations are excluded: pipeline parallelism, the CPU backend, and non-MTP/Eagle speculative decoding.
- PyTorch 2.9.1 is now required, and the default wheel is compiled against cu129.
- Deprecated quantization schemes have been removed (#31688, #31285).
- When using speculative decoding, unsupported sampling parameters now fail rather than being silently ignored (#31982).
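For users hit by the scheduling change, the opt-out is a single flag on the serve command. A minimal sketch, assuming the standard `vllm serve` entrypoint (the model name is only a placeholder, not part of the release notes):

```bash
# Serve a model with async scheduling explicitly disabled,
# restoring the pre-0.14.0 synchronous scheduling behavior.
vllm serve meta-llama/Llama-3.1-8B-Instruct --no-async-scheduling
```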
Key Improvements:

- Async scheduling enabled by default (#27614): Overlaps engine core scheduling with GPU execution, improving throughput without user configuration. Now also works with speculative decoding (#31998) and structured outputs (#29821).
- gRPC server entrypoint (#30190): An alternative to the REST API with a binary protocol and HTTP/2 multiplexing.
- `--max-model-len auto` (#29431): Automatically fits the context length to available GPU memory, eliminating OOM startup failures (see the example after this list).
- Model inspection view (#29450): View the modules, attention backends, and quantization of your model in vLLM by…
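As a sketch of the new `--max-model-len auto` option, a serve command might look like the following (the model name is a placeholder; the flag itself is the one named in this release):

```bash
# Let vLLM size the context window to the available GPU memory
# instead of failing at startup when the model's default
# maximum length does not fit.
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len auto
```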