
shipfeed

§ tools · cluster

vLLM v0.20.1

May 4 · 1 source · updated May 4

This is a patch release on top of `v0.20.0`, primarily focused on DeepSeek V4 stabilization and performance improvements, along with several important bug fixes.

DeepSeek V4:

- Base model support (#41006).
- Multi-stream pre-attention GEMM (#41061), a configurable pre-attn GEMM knob (#41443), and a tuned default for `VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD` (#41526).
- BF16 and MXFP8 all-to-all support for FlashInfer one-sided communication (#40960).
- PTX `cvt` instruction for faster FP32->FP4 conversion (#41015).
- Integrated tile kernels (`head_compute_mix_kernel`) for optimized head computation (#41255).
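The multi-stream pre-attention GEMM path is gated by a token-count threshold, with the default for `VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD` tuned in this release. A minimal sketch of that kind of gating, assuming hypothetical single-stream and multi-stream paths and an illustrative default value (only the environment-variable name comes from the release notes):

```python
import os

# Illustrative default only; the release notes say the default was tuned
# but do not state the value.
DEFAULT_THRESHOLD = 256


def pre_attn_gemm_path(num_tokens: int) -> str:
    """Pick the pre-attention GEMM path from the batch token count.

    Sketch only: the real dispatch lives inside vLLM's kernels; this just
    models an env-var-driven threshold check.
    """
    threshold = int(os.environ.get(
        "VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD", DEFAULT_THRESHOLD))
    # Small batches don't amortize the extra stream-management overhead,
    # so they stay on the single-stream path below the threshold.
    return "multi_stream" if num_tokens >= threshold else "single_stream"


print(pre_attn_gemm_path(64))    # small batch stays single-stream
print(pre_attn_gemm_path(4096))  # large batch takes the multi-stream path
```

Raising the environment variable pushes more batch sizes onto the single-stream path; lowering it makes the multi-stream path kick in earlier.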

- Guarded the megamoe flag with pure TP (#41522).
- Fixed a persistent top-k cooperative deadlock at TopK=1024 (#41189) and an inter-CTA init race on RadixRowState (#41444); persistent top-k is temporarily disabled as a workaround (#41442).
- Fixed an import error from AOT compile cache loading (#41090).
- Fixed a torch inductor error (#41135).
- Fixed repeated RoPE cache initialization (#41148).
- Fixed a missing type conversion for non-streaming tool calls in DSV3.2/V4 (#41198).

Bug Fixes:

- Fixed `max_num_batched_token` not being captured in CUDA graph (#40734).
- Fixed `num_gpu_blocks_override` not accounted for…

read full article on github.com
§ sources · 1 publication · timeline below
  1. github.com — vLLM v0.20.1 (primary)