§ local-llm · storyline

Adds DeepSeek V4 model support with graph optimization

llama.cpp release b9840 adds DeepSeek V4 model support (#24162), including GGUF conversion, graph reuse, and flash attention (FA).

today · 12:25:16 · primary fetch · 1 source · updated today · 12:25:16

DeepSeek V4 (#24162)

- convert: add dsv4 conversion
- add basic setup
- add llm_graph_input_dsv4
- add save-load state
- add sinkhorn eps - correction by @fairydreaming
- add rope
- fix
- cleanup dead code
- fix bugs
- support pro model: added by @fairydreaming
- remove redundant V cache
- Chat template
- remove debugging leftovers
- Add mechanism for inlining templates based on architecture
- s/deepseek-v4-flash/deepseek4/g
- s/deepseek-v4-flash/deepseek4/g continued
- enable graph reuse
- enable FA
- fix test llama archs
- rename
- compatibility with antirez ds4 GGUFs
- simplified set_gguf_parameters() by calling super class method, replaced moe.score_func with expert_gating_func.

- reserve worst-case kv-cache
- revert max split inputs
- address review comments
- add padding to enable FA
- pad only the final value of plan.n_kv to 256
- remove built-in cpp chat template
- cont: remove cpp built-in template
- rm outdated test
- replace ggml_view_3d() with ggml_reshape_3d() (Co-authored-by: Georgi Gerganov)
- only support n_seq=1 for now
- remove unused var
- cont: remove unused var
- use scale bias
- use correct ptr for can_reuse
- remove gen-chat-inline-templates.py
- simplify graph reuse
- cont: cleanup
- remove unused inputs
- enable partial checkpointing
- add…
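
The commit notes above mention padding `plan.n_kv` to 256 so that flash attention (FA) can be enabled. A minimal sketch of that kind of rounding follows, assuming "pad to 256" means rounding up to the next multiple of 256. The helper name `pad_to_multiple` and the sample values are hypothetical and are not taken from the llama.cpp source.

```cpp
#include <cstdint>

// Hypothetical helper, not llama.cpp code: round n up to the next
// multiple of `pad`. Assumes pad > 0 and that n + pad - 1 does not
// overflow uint32_t.
constexpr uint32_t pad_to_multiple(uint32_t n, uint32_t pad) {
    return (n + pad - 1) / pad * pad;
}

// Assumed padding granularity, taken from the "256" in the commit note.
constexpr uint32_t KV_PAD = 256;

// Compile-time checks of the rounding behaviour.
static_assert(pad_to_multiple(1000, KV_PAD) == 1024, "rounds up to the next multiple");
static_assert(pad_to_multiple(1024, KV_PAD) == 1024, "exact multiples are unchanged");
static_assert(pad_to_multiple(1,    KV_PAD) == 256,  "small values round up to one block");
```

Per the commit note, only the final `n_kv` value is padded, so a caller in this sketch would apply `pad_to_multiple` once, after the unpadded size has been computed, rather than at every intermediate step.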


§ sources · 1 publication · timeline below

github.com · llama.cpp b9840 · primary · 12:25:16