§ local-llm · storyline

vLLM v0.18.0

vLLM v0.18.0 ships 445 commits, adding gRPC serving support, GPU-less render serving, GPU-based NGram speculative decoding, KV cache offloading improvements, and elastic expert parallelism updates.

Mar 20 · 22:31:36 · primary fetch · 1 source · updated Mar 20 · 22:31:36

Known issues

Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618).

If you previously ran into `CUBLAS_STATUS_INVALID_VALUE` and had to use a workaround in `v0.17.0`, you can reinstall `torch 2.10.0`. PyTorch published an updated wheel that addresses this bug.

Highlights

This release features 445 commits from 213 contributors (61 new)!

gRPC Serving Support: vLLM now supports gRPC serving via the new `--grpc` flag (#36169), enabling high-performance RPC-based serving alongside the existing HTTP/REST interface.

GPU-less Render Serving: New `vllm launch render` command (#36166, #34551) enables GPU-less preprocessing and rendering, allowing separation of multimodal preprocessing from GPU inference.

NGram GPU Speculative Decoding: NGram speculative decoding now runs on GPU and is compatible with the async scheduler (#29184), significantly reducing spec decode overhead.

KV Cache Offloading Improvements: Smart CPU offloading that stores only frequently-reused blocks (#35342), plus FlexKV as a new offloading backend (#34328) and support for multiple KV groups in offloading spec (#36610).

Elastic Expert Parallelism Milestone 2: NIXL-EP integration (#35627)…
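For readers unfamiliar with NGram speculative decoding, the general idea can be sketched in a few lines: the last n tokens of the context are looked up earlier in the same context, and the tokens that followed the match are proposed as drafts for the target model to verify. The sketch below is a CPU-only toy illustration of that lookup, not vLLM's GPU implementation; the function name `ngram_propose` and the parameters `n` and `k` are hypothetical.

```python
from typing import List, Sequence


def ngram_propose(tokens: Sequence[int], n: int = 3, k: int = 4) -> List[int]:
    """Propose up to k draft tokens by matching the last n tokens earlier in the context.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    if n <= 0 or k <= 0 or len(tokens) <= n:
        return []
    suffix = list(tokens[-n:])
    # Scan right to left so the most recent earlier match wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == suffix:
            # The tokens that followed the earlier occurrence become the drafts.
            return list(tokens[start + n:start + n + k])
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 1, 2, 3]
    print(ngram_propose(context, n=3, k=2))
```

In a real speculative decoding loop the target model would then score the drafted tokens in a single forward pass and keep only the prefix it agrees with, so a wrong proposal costs some wasted compute but does not change the output.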


§ sources · 1 publication · timeline below

github.com · vllm v0.18.0 · primary · 22:31:36