§ local-llm · storyline

not much happened today

vLLM v0.20.0 ships with a TurboQuant 2-bit KV cache, broader hardware support, and day-0 compatibility for new models including Poolside Laguna XS.2 and NVIDIA Nemotron 3 Nano Omni.

Apr 28 · 07:44:39 · primary fetch · 1 source · updated Apr 28 · 07:44:39

vLLM v0.20.0 targets memory and MoE serving efficiency. Its headline feature is a TurboQuant 2-bit KV cache, reported to give 4× KV capacity and a 2.1% latency improvement. The release also broadens hardware and model coverage: DeepSeek V4 MegaMoE on Blackwell, plus support for Jetson Thor, ROCm, Intel XPU, and Grace-Blackwell setups. Early benchmarks show DeepSeek V4 Pro on B300 hardware running up to 8× faster than on H200. Across the ecosystem, day-0 support is arriving quickly for new open models such as Poolside Laguna XS.2, Ling-2.6-flash, and NVIDIA Nemotron 3 Nano Omni.
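The 4× KV capacity figure is consistent with simple bit-width arithmetic if the baseline is an 8-bit (FP8) cache, since 8 / 2 = 4; the source does not state the baseline, so that is an assumption here. The sketch below is not vLLM code. The model shape (layers, KV heads, head dimension) and the memory budget are hypothetical, and the calculation ignores the per-block scale and metadata overhead that a real quantization scheme would add.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bits_per_value: int) -> float:
    """Bytes of KV cache one token occupies across all layers."""
    # One K tensor and one V tensor per layer, each num_kv_heads * head_dim values.
    values = 2 * num_layers * num_kv_heads * head_dim
    return values * bits_per_value / 8


def tokens_that_fit(budget_bytes: int, bytes_per_token: float) -> int:
    """How many cached tokens fit in a fixed memory budget."""
    return int(budget_bytes // bytes_per_token)


# Hypothetical model shape and KV budget, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET = 16 * 1024**3  # 16 GiB reserved for KV cache

fp8_tokens = tokens_that_fit(BUDGET, kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 8))
two_bit_tokens = tokens_that_fit(BUDGET, kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2))

print(f"8-bit cache: {fp8_tokens} tokens")
print(f"2-bit cache: {two_bit_tokens} tokens")
print(f"capacity ratio: {two_bit_tokens / fp8_tokens:.1f}x")
```

Against a 16-bit baseline the same arithmetic would give 8× rather than 4×, so the reported figure depends on which cache format it is being compared to.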

Poolside released Laguna XS.2, a 33B total / 3B active MoE coding model under Apache 2.0. It uses hybrid attention and an FP8 KV cache, can run on a single GPU, and is reported to perform near Qwen-3.5. NVIDIA launched Nemotron 3 Nano Omni, a 30B / A3B multimodal MoE with 256K context that accepts text, image, video, audio, and documents, and was distributed across multiple platforms at launch. Discussions highlighted tradeoffs between quantization methods and a shift away from CUDA lock-in towards heterogeneous accelerator support.
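The "33B total / 3B active" and "single GPU" claims can be related with a rough back-of-envelope calculation. The sketch below assumes weights dominate memory and uses hypothetical precisions; the source does not say which weight precision or which GPU the single-GPU claim refers to, and the estimate excludes KV cache and activations.

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in decimal GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB.
    return params_billion * bits_per_param / 8


def active_fraction(total_billion: float, active_billion: float) -> float:
    """Share of parameters used per token in an MoE model."""
    return active_billion / total_billion


TOTAL_B, ACTIVE_B = 33, 3  # figures quoted for Laguna XS.2

# Candidate weight precisions are assumptions, not stated in the source.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_gb(TOTAL_B, bits):.1f} GB")

print(f"active per token: {active_fraction(TOTAL_B, ACTIVE_B):.1%}")
```

All 33B parameters must be resident in memory even though only about 3B are used per token, so the active count mainly affects compute per token, not the memory needed to load the model.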


§ sources · 1 publication · timeline below

news.smol.ai · not much happened today · primary · 07:44:39