§ local-llm · storyline

llama.cpp server adds real-time reasoning interruption via a POST control endpoint

The llama.cpp server adds a POST /v1/chat/completions/control endpoint that lets callers end a model's reasoning phase mid-generation. Requests carry { id_slot, action } and target the live slot by its ID; the feature is opt-in via reasoning_control.
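As a rough illustration of the request shape, the sketch below builds (without sending) a POST to the control route with the { id_slot, action } body given in the PR description. Several things here are assumptions rather than facts from the source: the base URL, the integer type of id_slot, and above all the action string, which the excerpt does not spell out and which is left as a placeholder. The helper name build_control_request is hypothetical.

```python
import json
import urllib.request

# Route as named in the PR description.
CONTROL_PATH = "/v1/chat/completions/control"


def build_control_request(base_url: str, id_slot: int, action: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the control endpoint.

    Hypothetical helper: the body fields come from the PR description;
    their exact types and accepted values are assumptions.
    """
    body = json.dumps({"id_slot": id_slot, "action": action}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + CONTROL_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed local server address; "<action-name>" is a placeholder because
    # the excerpt does not list the accepted action values.
    req = build_control_request("http://localhost:8080", id_slot=0, action="<action-name>")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

Keeping construction separate from sending makes the payload easy to inspect before pointing it at a running server, which would also need reasoning_control enabled per the PR description.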

Jun 2 · 07:53:36 · primary fetch · 1 source · updated Jun 2 · 07:53:36

server: real-time reasoning interruption via control endpoint (#23971)

Builds on the manual reasoning budget trigger from #23949. Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. POST /v1/chat/completions/control with { id_slot, action }; opt-in reasoning_control arms the budget sampler on demand. Router and single model. Minimal WebUI button as a skeleton for further UI work.

ui: track reasoning phase via explicit streaming state

Add isReasoning to the chat store, mirroring the isLoading pattern: per-conversation map, private setter, public accessor and reactive export. Set from the stream callbacks: true on reasoning chunks, false on the first content chunk, reset on stream end and resynced on conversation switch. The skip button now keys off isReasoning so it shows only during the thinking phase, not the whole generation.

ui: extract control endpoint and action into constants

Move the chat completion routes, the slots route and the reasoning control action out of chat.service into api-endpoints and a dedicated…
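The reasoning-phase tracking described in the "ui: track reasoning phase" commit can be modelled as a small state holder. The sketch below is a Python model of that logic only, not the WebUI's store code, which the excerpt does not show. The class name, method names and the active-conversation field are all hypothetical; only the behaviour (per-conversation map, private setter, public accessor, true on reasoning chunks, false on content, reset on stream end, resync on conversation switch) follows the commit message.

```python
from typing import Dict, Optional


class ReasoningTracker:
    """Per-conversation 'is the model thinking?' flag driven by stream events."""

    def __init__(self) -> None:
        self._is_reasoning: Dict[str, bool] = {}   # per-conversation map
        self._active_id: Optional[str] = None      # conversation currently shown

    def _set(self, conv_id: str, value: bool) -> None:
        # Private setter: store True, drop the entry otherwise.
        if value:
            self._is_reasoning[conv_id] = True
        else:
            self._is_reasoning.pop(conv_id, None)

    def is_reasoning(self, conv_id: str) -> bool:
        # Public accessor: unknown conversations are not reasoning.
        return self._is_reasoning.get(conv_id, False)

    # Stream callbacks.
    def on_reasoning_chunk(self, conv_id: str) -> None:
        self._set(conv_id, True)

    def on_content_chunk(self, conv_id: str) -> None:
        # The first content chunk ends the thinking phase; later ones are no-ops.
        self._set(conv_id, False)

    def on_stream_end(self, conv_id: str) -> None:
        self._set(conv_id, False)

    # View-side state.
    def switch_conversation(self, conv_id: str) -> None:
        # Resync: the visible state now follows the newly active conversation.
        self._active_id = conv_id

    def show_skip_button(self) -> bool:
        # The button is visible only while the active conversation is thinking.
        return self._active_id is not None and self.is_reasoning(self._active_id)
```

Keying the button off this flag rather than a general loading flag is what limits it to the thinking phase: once the first content chunk arrives, show_skip_button() returns False even though generation is still running.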


§ sources · 1 publication · timeline below

github.com · llama.cpp b9468 · primary · 07:53:36