§ local-llm · storyline

vLLM v0.17.0

vLLM v0.17.0 ships with PyTorch 2.10, FlashAttention 4 support, pipeline parallelism in Model Runner V2, and full Qwen3.5 model family integration, spanning 699 commits from 272 contributors.

Mar 7 · 01:46:41 · primary fetch · 1 source · updated Mar 7 · 01:46:41

Known Issue: If you are on CUDA 12.9+ and encounter a `CUBLAS_STATUS_INVALID_VALUE` error, this is caused by a CUDA library mismatch. To resolve, try one of the following:

1. Remove the path to system CUDA shared library files (e.g. `/usr/local/cuda`) from `LD_LIBRARY_PATH`, or simply `unset LD_LIBRARY_PATH`.
2. Install vLLM with `uv pip install vllm --torch-backend=auto`.
3. Install vLLM with `pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129` (change the CUDA version to match your system).

Highlights

This release features 699 commits from 272 contributors (48 new).

PyTorch 2.10 Upgrade: This release upgrades to PyTorch 2.10.0, which is a breaking change for environment dependencies.
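For the first workaround in the known issue above, clearing `LD_LIBRARY_PATH` entirely may remove entries that other tools still need. The sketch below filters out only the entries under a CUDA prefix. It is an illustration, not part of vLLM: the helper name `strip_cuda_paths` is hypothetical, and the default prefix `/usr/local/cuda` is taken from the release note's example, so adjust it if your system CUDA lives elsewhere.

```python
import os
import shlex


def strip_cuda_paths(ld_library_path: str, cuda_prefix: str = "/usr/local/cuda") -> str:
    """Return ld_library_path without entries that start with cuda_prefix.

    Hypothetical helper for illustration. The prefix match is a plain
    string comparison, so versioned directories such as
    /usr/local/cuda-12.9/lib64 are removed as well. Empty entries are dropped.
    """
    kept = [
        entry
        for entry in ld_library_path.split(":")  # LD_LIBRARY_PATH is colon-separated
        if entry and not entry.startswith(cuda_prefix)
    ]
    return ":".join(kept)


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    cleaned = strip_cuda_paths(current)
    # Print a shell command instead of mutating the environment: a child
    # process cannot change the LD_LIBRARY_PATH of the shell that started it.
    print(f"export LD_LIBRARY_PATH={shlex.quote(cleaned)}")
```

Running the script prints an `export` line that can be pasted into the shell before starting vLLM. If the error persists, the second and third workarounds reinstall vLLM against a matching CUDA build.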

FlashAttention 4 Integration: vLLM now supports the FlashAttention 4 backend (#32974), bringing next-generation attention performance.

Model Runner V2 Maturation: Model Runner V2 has reached a major milestone with Pipeline Parallel (#33960), Decode Context Parallel (#34179), Eagle3 speculative decoding with CUDA graphs (#35029, #35040), pooling model support (#35120), piecewise & mixed CUDA graph capture (#32771), DP+EP for spec decoding (#35294), and a new…


§ sources · 1 publication

github.com · vllm v0.17.0 · primary · 01:46:41