# vLLM v0.11.0
## Highlights

This release features 538 commits from 207 contributors (65 new contributors)!

- This release completes the removal of the V0 engine. V0 engine code, including `AsyncLLMEngine`, `LLMEngine`, `MQLLMEngine`, all attention backends, and related components, has been removed. V1 is now the only engine in the codebase.
- This release turns on `FULL_AND_PIECEWISE` as the default CUDA graph mode. This should provide better out-of-the-box performance for most models, particularly fine-grained MoEs, while preserving compatibility with existing models that support only `PIECEWISE` mode. A sketch of overriding the new default follows this list.
- Note: In v0.11.0 (and v0.10.2), `--async-scheduling` produces gibberish output in some cases, such as preemption. This functionality is correct in v0.10.1. We are actively fixing it for the next version.
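As a minimal sketch of reverting to the previous behavior, assuming the `CompilationConfig` and `CUDAGraphMode` names exposed by the v0.11 config module (check `vllm.config` in your installed version):

```python
# Sketch: pinning a model back to piecewise-only CUDA graph capture.
# CompilationConfig / CUDAGraphMode are assumed from the v0.11 config module.
from vllm import LLM
from vllm.config import CompilationConfig, CUDAGraphMode

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    compilation_config=CompilationConfig(
        # Revert from the new FULL_AND_PIECEWISE default if a model
        # misbehaves under full-graph capture.
        cudagraph_mode=CUDAGraphMode.PIECEWISE,
    ),
)
print(llm.generate(["Hello!"])[0].outputs[0].text)
```

The same override should be expressible on the server side as `vllm serve <model> --compilation-config '{"cudagraph_mode": "PIECEWISE"}'`, assuming the JSON form of the flag.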
### Model Support

- New architectures: DeepSeek-V3.2-Exp (#25896), Qwen3-VL series (#24727), Qwen3-Next (#24526), OLMo3 (#24534), LongCat-Flash (#23991), Dots OCR (#24645), Ling2.0 (#24627), CWM (#25611). A loading sketch for one of these follows this list.
- Encoders: RADIO encoder support (#24595), Transformers backend support for encoder-only models (#25174).
- Task expansion: BERT token classification/NER (#24872), multimodal models for pooling tasks (#24451).
- Data…
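Newly supported architectures load through the standard entry points. A minimal sketch, assuming an illustrative Qwen3-Next checkpoint name and parallelism setting (neither is taken from the release notes):

```python
# Sketch: loading one of the newly supported architectures (Qwen3-Next).
# The checkpoint name and tensor_parallel_size are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=4)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is new in vLLM v0.11.0?"], params)
print(outputs[0].outputs[0].text)
```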