vLLM v0.18.0
vLLM v0.18.0 ships with gRPC serving support, GPU-less render serving, GPU-based NGram speculative decoding, KV cache offloading improvements, and elastic expert parallelism updates, spanning 445 commits.
Known issues
- Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618).
- If you previously ran into `CUBLAS_STATUS_INVALID_VALUE` and had to use a workaround in `v0.17.0`, you can reinstall `torch 2.10.0`. PyTorch published an updated wheel that addresses this bug.

Highlights
This release features 445 commits from 213 contributors (61 new)!
- gRPC Serving Support: vLLM now supports gRPC serving via the new `--grpc` flag (#36169), enabling high-performance RPC-based serving alongside the existing HTTP/REST interface.
- GPU-less Render Serving: New `vllm launch render` command (#36166, #34551) enables GPU-less preprocessing and rendering, allowing separation of multimodal preprocessing from GPU inference.
- NGram GPU Speculative Decoding: NGram speculative decoding now runs on GPU and is compatible with the async scheduler (#29184), significantly reducing spec decode overhead.
- KV Cache Offloading Improvements: Smart CPU offloading that stores only frequently-reused blocks (#35342), plus FlexKV as a new offloading backend (#34328) and support for multiple KV groups in offloading spec (#36610).
- Elastic Expert Parallelism Milestone 2: NIXL-EP integration (#35627)…