shipfeed · AI news, curated daily

§ local-llm · storyline

llama.cpp limits max outputs of llama_context to save VRAM

llama.cpp limits max outputs of llama_context to reduce VRAM usage by reserving n_outputs equal to n_seqs where possible.
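To see why the number of reserved outputs matters, here is a rough back-of-envelope sketch. It is not llama.cpp code: the helper `logits_bytes` and all the sizes are hypothetical, and it assumes the cost in question is dominated by an F32 logits tensor with one row per output token. During generation, typically only the last token of each sequence needs logits, so reserving one output per sequence instead of one per batch token shrinks that tensor considerably.

```python
# Back-of-envelope sketch, not llama.cpp code. All values below are assumptions.

BYTES_PER_F32 = 4


def logits_bytes(n_outputs: int, n_vocab: int) -> int:
    """Size in bytes of an F32 logits tensor with one row per output token."""
    return n_outputs * n_vocab * BYTES_PER_F32


n_vocab = 128_000  # assumed vocabulary size
n_batch = 2048     # assumed batch size; worst case is one output per token
n_seqs = 4         # assumed number of parallel sequences; one output each

worst_case = logits_bytes(n_batch, n_vocab)
per_seq = logits_bytes(n_seqs, n_vocab)

print(f"n_outputs == n_batch: {worst_case / 2**20:.1f} MiB")
print(f"n_outputs == n_seqs:  {per_seq / 2**20:.1f} MiB")
print(f"ratio: {worst_case // per_seq}x")
```

The real saving depends on the backend, the model, and how the compute graph is reserved, so treat these figures as an illustration of the scaling only.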

Jun 1 · primary fetch · 1 source · updated Jun 1

llama: limit max outputs of `llama_context` (#23861)

  llama: save more VRAM by reserving n_outputs == n_seqs when possible
  add n_outputs_per_seq
  move n_outputs_max to server-context
  change ubatch to batch everywhere

macOS/iOS:
  macOS Apple Silicon (arm64)
  macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
  macOS Intel (x64)
  iOS XCFramework

Linux:
  Ubuntu x64 (CPU)
  Ubuntu arm64 (CPU)
  Ubuntu s390x (CPU)
  Ubuntu x64 (Vulkan)
  Ubuntu arm64 (Vulkan)
  Ubuntu x64 (ROCm 7.2)
  Ubuntu x64 (OpenVINO)
  Ubuntu x64 (SYCL FP32) DISABLED

Android:
  Android arm64 (CPU)

Windows:
  Windows x64 (CPU)
  Windows arm64 (CPU)
  Windows x64 (CUDA 12) - CUDA 12.4 DLLs
  Windows x64 (CUDA 13) - CUDA 13.3 DLLs
  Windows x64 (Vulkan)
  Windows x64 (SYCL) DISABLED
  Windows x64 (HIP)

openEuler: DISABLED
  openEuler x86 (310p)
  openEuler x86 (910b, ACL Graph)
  openEuler aarch64 (310p)
  openEuler aarch64 (910b, ACL Graph)

UI:
  UI

read full article on github.com
§ sources · 1 publication · timeline below
  1. github.com · llama.cpp b9460 · primary