§ local-llm · storyline

Limits max outputs of llama_context to save VRAM

llama.cpp limits max outputs of llama_context to reduce VRAM usage by reserving n_outputs equal to n_seqs where possible.
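As a rough illustration of why the number of reserved outputs matters, the sketch below assumes the dominant per-output cost is one F32 row of vocabulary logits. The helper `logits_bytes` and all of the sizes are hypothetical, chosen only for the arithmetic; they are not taken from the release notes or from the llama.cpp API.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of llama.cpp: bytes needed for an F32 buffer
// holding one row of n_vocab logits per reserved output.
constexpr std::uint64_t logits_bytes(std::uint64_t n_outputs, std::uint64_t n_vocab) {
    return n_outputs * n_vocab * sizeof(float);
}

// Illustrative sizes only (assumptions, not values from the release).
constexpr std::uint64_t N_VOCAB = 128000; // assumed vocabulary size
constexpr std::uint64_t N_BATCH = 2048;   // assumed tokens per batch
constexpr std::uint64_t N_SEQS  = 4;      // assumed parallel sequences

// Worst case: every token in the batch requests an output row.
constexpr std::uint64_t BYTES_PER_TOKEN = logits_bytes(N_BATCH, N_VOCAB);

// Reduced case: one output row per sequence (n_outputs == n_seqs).
constexpr std::uint64_t BYTES_PER_SEQ = logits_bytes(N_SEQS, N_VOCAB);
```

Under these assumed sizes, the reservation shrinks by a factor of N_BATCH / N_SEQS. The actual saving depends on the model, the backend, and which graph tensors scale with the output count.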

Jun 1 · 21:23:22 · primary fetch · 1 source · updated Jun 1 · 21:23:22

llama: limit max outputs of `llama_context` (#23861)

llama: save more VRAM by reserving n_outputs == n_seqs when possible
add n_outputs_per_seq
move n_outputs_max to server-context
change ubatch to batch everywhere

macOS/iOS: macOS Apple Silicon (arm64) · macOS Apple Silicon (arm64, KleidiAI enabled) · DISABLED · macOS Intel (x64) · iOS XCFramework
Linux: Ubuntu x64 (CPU) · Ubuntu arm64 (CPU) · Ubuntu s390x (CPU) · Ubuntu x64 (Vulkan) · Ubuntu arm64 (Vulkan) · Ubuntu x64 (ROCm 7.2) · Ubuntu x64 (OpenVINO) · Ubuntu x64 (SYCL FP32) · DISABLED
Android: Android arm64 (CPU)
Windows: Windows x64 (CPU) · Windows arm64 (CPU) · Windows x64 (CUDA 12) - CUDA 12.4 DLLs · Windows x64 (CUDA 13) - CUDA 13.3 DLLs · Windows x64 (Vulkan) · Windows x64 (SYCL) · DISABLED · Windows x64 (HIP)
openEuler: DISABLED · openEuler x86 (310p) · openEuler x86 (910b, ACL Graph) · openEuler aarch64 (310p) · openEuler aarch64 (910b, ACL Graph)
UI: UI


§ sources · 1 publication · timeline below

github.com · llama.cpp b9460 · primary · 21:23:22