llama.cpp b9128
https://github.com/ggerganov/llama.cpp/releases
hexagon: eliminate scalar VTCM loads via HVX splat helpers (#22993) hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase hmx-mm: optimize per-group scale handling hmx-fa: optimize slope load from…
opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill (#22755) ggml-opencl: add Adreno xmem F16xF32 GEMM for prefill ggml-opencl: address Adreno xmem review comments ggml-opencl: align xmem gemm kernel naming…
mtmd, server, common: expose modalities to /v1/models (#22952) fix build rename to mtmd_caps macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64…
ggml-webgpu: Enables running gpt-oss-20b (#22906) Enable to run gpt-oss-20b and refactor mulmat-q disable test-backend-ops in ubuntu-24-webgpu
vulkan: Fix Windows performance regression on Intel GPU BF16 workloads for Xe2 and newer (#22461) refactor Use l_warptile only when coopmat is available for BF16
vulkan: Check shared memory size for mmq shaders (#22693)
mtmd: add MiMo v2.5 vision (#22883) mimo-v2.5: vision support mimo-v2.5: use fused qkv for vision mimo-v2.5: fix f16 vision overflow mimo-v2.5: comment cleanups mimo-v2.5: Flash doesn't have mmproj more cleanup…
metal : promote mul_mv/mul_mm batch divisors to function constants (#22711) metal : take op directly in get_pipeline_mul_mv_ext
opencl: add q4_1 MoE for Adreno (#22856) Q4_1 MoE CLC pass sanity check remove unnecessary code opencl: remove unnecessary asserts and reformat opencl: fix supports_op for q4_1 moe q4_1 moe is supported by Adreno with…
CUDA: handle OW > 65535 in im2col (2D and 3D) (#22944) `im2col_cuda` and `im2col_3d_cuda` both dispatch with `block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on raw 16 kHz audio with T > 65535 (~ 4 s)…
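The entry above is cut off before it describes the actual fix, so the following is only a minimal sketch of the constraint it names, not the change made in #22944. It assumes one generic workaround (splitting the output-width range into several launches); the helper names `split_grid_y` and `seconds_until_limit` are hypothetical. It also shows where the "~ 4 s" figure comes from.

```python
# Hypothetical helpers illustrating the gridDim.y limit mentioned above.
# This is NOT the code from the PR; it only models the arithmetic.

CUDA_MAX_GRID_Y = 65535  # CUDA limit on the Y dimension of a launch grid


def split_grid_y(ow, max_grid_y=CUDA_MAX_GRID_Y):
    """Return (offset, count) chunks covering [0, ow), each count <= max_grid_y."""
    chunks = []
    offset = 0
    while offset < ow:
        count = min(max_grid_y, ow - offset)
        chunks.append((offset, count))
        offset += count
    return chunks


def seconds_until_limit(sample_rate_hz, max_grid_y=CUDA_MAX_GRID_Y):
    """Audio duration at which one output column per sample reaches the limit."""
    return max_grid_y / sample_rate_hz


if __name__ == "__main__":
    print(split_grid_y(100_000))
    print(seconds_until_limit(16_000))
```

Each chunk could map to one kernel launch with `block_nums.y = count` and an added base offset; a kernel that loops over the Y range in grid-sized strides would be an alternative design.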
docs: fix metrics endpoint description in server README (#22879) Required model query parameter for router mode described. Removed metrics…
spec : parallel drafting support (#22838) spec : refactor spec : drop support for incompatible vocabs spec : update common_speculative_init() cont : pass seq_id cont : dedup ctx_seq_rm_type server : sketch the ctx_dft…
vulkan: Support asymmetric FA in scalar/mmq/coopmat1 paths (#22589)
CUDA: directly include cuda/iterator (#22936) Previously we relied on a transitive include from `cub/cub.cuh`, which is bad practice because cub may not always expose cuda/iterator
vendor : update cpp-httplib to 0.44.0 (#22919)
[SYCL] Add OP im2col_3d (#22903) add im2col_3d format code update the ops.md
server : print warning when HTTP timeout exceeded (#22907)
backend sampling: support returning post-sampling probs (#22622) server: Never return 0.0 post-sampling probabilities
vendor : update cpp-httplib to 0.43.4 (#22888)
sync : ggml
internal AllReduce kernel for CUDA provider (#22299) ggml-cuda: add internal AllReduce provider for tensor parallelism Introduces a NCCL-free AllReduce implementation for LLAMA_SPLIT_MODE_TENSOR using a single-phase…
model : fix model type check for granite/llama3 and deepseek2/glm4.7 lite (#22870)
model : add sarvam_moe architecture support (#20275)
cmake : update BoringSSL to 0.20260508.0 (#22839)
SYCL: reduce allocation overhead during flash attention (#22732) tidy up whitespace add a note about the flag move ggml_sycl_fattn_ into fattn-buffers.hpp…
[SYCL] Add BF16 support to GET_ROWS operation (#21391) Add GGML_TYPE_BF16 to the SYCL backend's GET_ROWS operation, both in supports_op and in the kernel dispatch. This fixes a performance regression where models using…
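As a rough illustration of what a BF16 GET_ROWS has to do, the sketch below gathers rows from a table stored as raw BF16 bit patterns and widens them to float32. It relies only on the fact that BF16 is the upper 16 bits of an IEEE-754 float32; the function names and the list-of-lists layout are assumptions for illustration and do not mirror the SYCL kernel.

```python
import struct


def bf16_to_f32(bits):
    """Widen one BF16 bit pattern (an int in 0..0xFFFF) to a Python float.

    BF16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits of
    float32, so widening is a 16-bit left shift followed by a reinterpretation.
    """
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def get_rows_bf16(table, row_ids):
    """Gather the rows named by row_ids and convert each element to float32."""
    return [[bf16_to_f32(b) for b in table[i]] for i in row_ids]


if __name__ == "__main__":
    # Hypothetical 2x2 embedding table given as BF16 bit patterns.
    table = [[0x3F80, 0x4000], [0xC000, 0x0000]]
    print(get_rows_bf16(table, [1, 0, 1]))
```

Without a native BF16 path, a backend typically has to report the op as unsupported so it runs elsewhere, which fits the regression the entry describes.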
sycl: Q5_K reorder MMVQ/dequant + Q8_0 reorder MMVQ path (#22152) sycl: Q5_K reorder MMVQ/dequant + Q8_0 reorder MMVQ path Signed-off-by: Chun Tao Remove duplicate definitions --------- Signed-off-by: Chun Tao…
Add flash attention MMA / Tiles to support MiMo-V2.5 (#22812) mimo-v2.5: add flash attention mma/tiles for d_kq=192 d_v=128 mimo-v2.5: follow (256, 256) fattn templates mimo-v2.5: cleanup comments mimo-v2.5…
hexagon: add HTP kernel for GGML_OP_GATED_DELTA_NET (#22837) Implement the Gated Delta Net recurrence on HVX with: 4-row fused kernels for PP (prompt processing) path 8-row fused kernels for TG (token generation) path…
Feature hexagon l2 norm (#22816) L2_NORM Updates Addressed PR Comments ggml-hexagon: add L2_NORM HVX kernel for Hexagon backend hex-unary: remove supported_unary_nc since the outer loop is the same for all unary ops…
common : do not wrap raw strings in schema parser for tagged parsers (#22827)
model : support Gemma4_26B_A4B_NVFP4 (#22804) Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes Signed-off-by: ynankani Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret Address review…
common : revert reasoning budget +inf logit bias (#22740)
server: support Vertex AI compatible API (#22545) server: support Vertex AI compatible API a bit safer support other AIP_ env var various fixes if AIP_MODE is unset, do nothing fix test case fix windows build…
server: (router) expose child model info from router's /v1/models (#22683) update docs
cuda: fuse snake activation (mul, sin, sqr, mul, add) (#22667) Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The matcher recognizes the naive 5 op…
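To make the five-op chain concrete, here is a small sketch that computes the snake activation both as the naive sequence (mul, sin, sqr, mul, add) and as a single fused pass. It assumes the common form y = x + sin²(alpha·x)·inv_beta; the parameter names `alpha` and `inv_beta` are illustrative assumptions, and nothing here reproduces `ggml_cuda_op_snake_fused` itself.

```python
import math


def snake_naive(x, alpha, inv_beta):
    """Five separate elementwise passes, each producing a temporary buffer."""
    t = [v * alpha for v in x]            # mul
    t = [math.sin(v) for v in t]          # sin
    t = [v * v for v in t]                # sqr
    t = [v * inv_beta for v in t]         # mul
    return [a + b for a, b in zip(x, t)]  # add


def snake_fused(x, alpha, inv_beta):
    """One pass per element with no intermediate buffers."""
    out = []
    for v in x:
        s = math.sin(v * alpha)
        out.append(v + s * s * inv_beta)
    return out


if __name__ == "__main__":
    xs = [-1.5, 0.0, 0.25, 3.0]
    print(snake_naive(xs, 2.0, 0.5))
    print(snake_fused(xs, 2.0, 0.5))
```

The arithmetic is the same either way; the likely benefit of fusing on a GPU is fewer kernel launches and fewer round trips through memory for the temporaries.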
CUDA: lower-case PCI bus id, standardize for ggml (#22820)
vulkan: fix spv shadowing (#22760)
ggml: update SCHED_DEBUG output to use ggml_op_desc() (#22825)
opencl: add q4_0 MoE GEMM for Adreno (#22731) Q4_0 MoE CLC pass sanity check release program opencl: fix whitespace opencl: remove unused cl_program opencl: break #if block to make it more clear opencl: adjust format…
CUDA: batch out_prod inner loop with cublasSgemmStridedBatched (#22651) CUDA: add…
llama : fix device state save/load (#22805)
opencl: add opfilter regex for debugging (#22782)
common/chat : preserve media markers for typed-content templates (#22634)
tests: add long-sequence cases and fix inputs for gated_delta_net (#22794) tests : add long-seq + tail cases for gated_delta_net tests : realistic input ranges for gated_delta_net
sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET (#22149) sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET Signed-off-by: Chun Tao Fix abort during test-backend-ops Signed-off-by…
llama : remove unnecessary seq_id check during state restore (#22797)
ggml-cpu: Optimized risc-v cpu q1_0 dot
mtmd: fix whisper audio tail truncation by exposing padded buffer to FFT (#22770)
model: Add Mimo v2.5 model support (#22493) add mimo-v2.5 support mimo-v2.5: fix modify_tensors row split mimo-v2.5: forgot `add_attn_value_scale` plumbing mimo-v2.5: fix tp dequant to detect tp rows mimo-v2.5: fix TP…