In llama.cpp, speculative checkpointing matters for a simple reason: it points local users toward a cheaper speculative path. You can try speculative decoding with n-gram-based self-speculation, without loading a separate draft model into VRAM, and the likely payoff depends less on headline benchmarks than on whether your prompts repeat themselves.
A quick primer if you have not encountered speculative decoding before. When a large language model generates text, it normally produces one token (roughly one word) at a time — each token requires a full forward pass through the model. Speculative decoding speeds this up by guessing several tokens ahead, then verifying those guesses in a single batch. Verification is cheaper than generation because the model can check multiple candidates in parallel. When the guesses are right, you get several tokens for the cost of one verification pass. When they are wrong, you throw them away and fall back to normal speed. The whole bet is on how often the guesses land — which is why the rest of this article keeps coming back to draft acceptance rate.
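To make that bet concrete, here is a toy simulation of the accept-or-fall-back loop. It assumes a fixed per-token acceptance probability and ignores the cost of producing the draft itself (near zero for n-gram drafting). Nothing here is llama.cpp code; it only models the cost accounting.

```python
# Toy cost model of speculative decoding, assuming each drafted token is
# accepted with a fixed probability. Illustrative only, not llama.cpp code.
import random

def simulate(num_tokens=1000, draft_len=8, accept_prob=0.7, seed=0):
    """Count verification passes needed to emit num_tokens tokens."""
    rng = random.Random(seed)
    emitted, passes = 0, 0
    while emitted < num_tokens:
        passes += 1  # one batched forward pass verifies the whole draft
        accepted = 0
        for _ in range(draft_len):
            if rng.random() < accept_prob:
                accepted += 1
            else:
                break  # first rejection discards the rest of the draft
        # the verification pass itself always yields one token
        emitted += accepted + 1
    return num_tokens / passes  # tokens per full-model forward pass

print(f"repetitive text (p=0.8): {simulate(accept_prob=0.8):.2f} tokens/pass")
print(f"novel text      (p=0.2): {simulate(accept_prob=0.2):.2f} tokens/pass")
```

With high acceptance, each forward pass yields several tokens; with low acceptance, the loop degrades toward one token per pass, which is exactly the baseline.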
The confirmed part is narrow but useful. llama.cpp’s speculative decoding docs say the system can generate draft tokens and then verify them in batches, because verifying several guessed tokens at once can be cheaper than decoding every token one by one. The docs also say llama.cpp supports both draft-model methods and n-gram methods such as ngram-simple, ngram-map-*, and ngram-mod.
The merged PR confirms that speculative checkpointing has landed. What the available source material does not establish cleanly is the exact internal mechanism, beyond the fact that server-side speculative decoding support was added. So the right way to read this feature is not “llama.cpp just got universally faster.” It is “llama.cpp just made another speculative decoding path easier to treat as a tuning layer.”
What Speculative Checkpointing Adds to llama.cpp
The easiest way to understand the change is to separate three things that often get blurred together.
Draft-model speculative decoding uses a second, smaller model to guess upcoming tokens. The main model then verifies those guesses in a batch. That can be fast. It also costs extra memory and setup.
Self-speculative decoding does not use a second model. It tries to guess upcoming tokens from patterns in the text history the same model has already produced. In llama.cpp, that includes the n-gram modes documented in the project.
Speculative checkpointing appears, from the merged PR and its labeling, to be a server-side feature aimed at speculative decoding workflows. That much is verified. The exact implementation details are not established by the source packet here, so they should not be overstated.
That still leaves a very practical conclusion.
If you are using ngram-mod or related self-speculative decoding modes, speculative checkpointing fits the same broader direction: making speculation something you can tune, not just a premium feature that starts with “first load another model.”
| Approach | Extra VRAM cost | Setup cost | Best case | Weak spot |
|---|---|---|---|---|
| Draft-model speculative decoding | High | Higher | Strong speedups when draft model predicts well | Needs a second model and enough memory |
| Self-speculative decoding (ngram-mod, etc.) | Low | Low | Repetitive code and structured text | Weak on low-repeat outputs |
| Speculative checkpointing | Low (no extra model) | Moderate (server-side complexity) | Makes speculative tuning practical without a draft model | Exact gains still workload-dependent |
That is why this patch matters.
It changes the cost of trying speculative decoding more than it proves any fixed speedup number.
Why Speedups Vary So Much by Prompt and Model
The docs give away the whole mechanism, if you read them literally.
For n-gram speculation, llama.cpp says these methods “rely on patterns that have already appeared in the generated text.” The docs also give a concrete example of where that helps: rewriting source code with an LLM.
That sentence does more work than most benchmark charts.
If the model is refactoring a long TypeScript file, the output tends to repeat local structures:
- imports
- class boilerplate
- recurring function signatures
- JSON-like object shapes
- framework-specific patterns
Once those token sequences have appeared, an n-gram matcher has something real to grab. It can draft the next stretch because the next stretch often looks like the last one. The main model then verifies that draft. If those guesses keep matching, you get long draft acceptance rate streaks. That is where token generation speedup comes from.
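A minimal sketch of the drafting side: look up the trailing n-gram of the generated history and propose whatever followed it last time. This mirrors the documented idea of n-gram self-speculation; `draft_from_history` is a hypothetical helper, not llama.cpp's actual data structure.

```python
# Sketch of n-gram self-drafting: match the trailing n tokens against
# earlier history and propose the continuation that followed last time.
# Illustrative only; llama.cpp's implementation differs.

def draft_from_history(tokens, n=3, max_draft=8):
    """Propose up to max_draft tokens by matching the trailing n-gram."""
    if len(tokens) <= n:
        return []
    key = tuple(tokens[-n:])
    # scan earlier history (excluding the trailing n-gram itself) for the
    # most recent occurrence of the same n-gram
    for i in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[i:i + n]) == key:
            return tokens[i + n:i + n + max_draft]
    return []

code = "def f ( x ) : return x def g ( y ) : return".split()
print(draft_from_history(code, n=2))
# drafts the continuation that followed ": return" the first time around
```

Note the failure mode built into this example: the first drafted token is `x`, carried over from the earlier function, while the correct continuation is `y`. A real verification pass would reject it there, which is the rejection cost the rest of this article keeps returning to.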
A one-off reasoning prompt looks different.
Ask for a novel explanation, a planning chain, or an answer that keeps changing direction, and the model may not reuse many local token sequences at all. The history is less repetitive. The n-gram draft has less to latch onto. Drafts get shorter or get rejected. The speculative path falls back toward baseline.
That is why benchmark claims without prompt context are close to useless.
A reported speedup number tells you almost nothing unless you know what kind of text produced it. The same model can look great on repetitive code and flat on exploratory reasoning. NovaKnown’s piece on LLM performance drop made the same point in a different context: performance is always attached to a workload, whether marketers admit it or not.
One concrete way to picture it:
- Code refactoring prompt: rename a set of methods, preserve structure, emit the whole file
  - Earlier tokens create many reusable local patterns
  - ngram-mod can draft repeated chunks
  - Acceptance can come in streaks
- Reasoning prompt: compare three hiring plans under changing constraints
  - Each sentence introduces new combinations
  - Few local repeats
  - Acceptance is sparse
The mechanism is boring. The consequences are not.
Which Workloads Benefit, and Which Don’t
The best workloads for speculative checkpointing plus n-gram self-speculation are the ones many people underrate because they are unglamorous.
Code rewrites are near the top of the list. Not greenfield coding. Rewrites. The docs explicitly mention source-code rewriting because that is exactly the case where prior token history is rich enough to predict what comes next.
Structured text is another good fit:
- JSON with recurring keys
- config files
- repetitive documentation templates
- schema-heavy outputs
- boilerplate-heavy framework code
These tasks often produce the same shapes over and over. Self-speculative decoding likes shapes.
Weak candidates are almost the inverse:
- short prompts with little generated history
- open-ended essays
- brainstorming across shifting topics
- novel reasoning
- anything where each next sentence is genuinely new
That does not mean n-gram methods never help outside code. It means you should expect help when the text repeats local syntax, not when it merely shares a topic.
There is one broader point worth keeping from the bigger speculative decoding story. Earlier work like DFlash speculative decoding sits on the opposite end of the trade-off curve: more machinery, potentially more speed. Speculative checkpointing reinforces that llama.cpp speculative decoding is no longer one trick. It is a menu of trade-offs.
What This Means for Local Inference Tuning
Start from the variable that matters: draft acceptance rate.
Not “tokens per second” in the abstract. Not a screenshot from someone else’s benchmark. Acceptance.
If accepted drafts come in long runs, self-speculative decoding can feel almost free. If they do not, you are just adding speculative work that gets thrown away.
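A back-of-envelope way to see why acceptance dominates: if each drafted token is accepted independently with probability p and the draft is cut at the first rejection, one verification pass yields, in expectation, 1 + p + p² + … + p^k tokens. The independence assumption is mine; real text violates it (acceptance comes in streaks), but the shape of the curve is the point.

```python
# Expected tokens per verification pass under a simple independence model.
# Assumption (not from llama.cpp): each drafted token is accepted with
# probability p, and the draft is truncated at the first rejection.

def tokens_per_pass(p, draft_len):
    # expected accepted tokens = p + p^2 + ... + p^draft_len,
    # plus the one token the verification pass produces itself
    return 1 + sum(p ** i for i in range(1, draft_len + 1))

for p in (0.2, 0.5, 0.8, 0.95):
    print(f"p={p:.2f}: {tokens_per_pass(p, 16):.2f} tokens per pass")
```

The curve is sharply convex: going from p=0.2 to p=0.5 roughly doubles the yield, while going from p=0.8 to p=0.95 more than doubles it again. Small improvements in acceptance at the high end are worth far more than anywhere else.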
A practical first pass looks like this:
| Parameter | Try first | Likely effect | Trade-off |
|---|---|---|---|
| `--spec-type` | `ngram-mod` | Enables self-speculative decoding without a draft model | Gains depend on repeated token patterns |
| `--spec-ngram-size-n` | 8, 12, 24 | Smaller values find matches more often | More weak matches, more rejection |
| `--draft-min` | 16, 32, 48 | Starts drafting sooner | More overhead if acceptance is poor |
| `--draft-max` | 16, 32, 64 | Can amplify long acceptance streaks | More wasted work on rejected drafts |
The most interesting knob is usually `--spec-ngram-size-n`.
A large n-gram size asks for a stricter match. That tends to work better when the output is strongly repetitive, because the matcher is looking for a long repeated sequence. A smaller n-gram size is more permissive. It may find more candidate matches on mixed code-and-prose prompts, but it also raises the chance of bad guesses that the main model rejects.
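One way to see that trade-off without touching a server: count how often an n-gram in a text has already appeared earlier, for small versus large n. `match_rate` is an illustrative stand-in for what a matcher could find, not llama.cpp code, and the sample text is invented.

```python
# Illustrates the n-gram size trade-off: short patterns repeat constantly,
# long ones span distinct identifiers and stop repeating. Not llama.cpp code.

def match_rate(tokens, n):
    """Fraction of n-grams that already occurred earlier in the stream."""
    seen, hits, total = set(), 0, 0
    for i in range(n, len(tokens)):
        gram = tuple(tokens[i - n:i])
        total += 1
        if gram in seen:
            hits += 1
        seen.add(gram)
    return hits / total if total else 0.0

# three near-identical statements that differ only in one identifier
text = ("obj . field = obj . field + 1 ; "
        "obj . count = obj . count + 1 ; "
        "obj . total = obj . total + 1 ;").split()
for n in (2, 4, 8):
    print(f"n={n}: {match_rate(text, n):.2f}")
```

On this sample, 2-grams like `obj .` repeat about half the time, while no 8-gram ever repeats, because every long window straddles a distinct identifier. That is the stricter-match trade-off in miniature: large n only pays off when the output repeats long exact sequences.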
So the tuning logic is simple:
- highly repetitive codebase rewrite: try larger n-grams
- mixed coding assistant prompt: try medium n-grams
- reasoning-heavy chat: do not expect much, no matter how you tune it
That is a better mental model than asking whether speculative checkpointing is “worth it.”
It is worth it when your workload produces reusable token history.
This is also why measuring your own prompts matters more than copying a flag set from someone else. The Ralph Wiggum technique applies here nicely: try the simple thing first, then watch what the system actually does.
The next round of llama.cpp gains probably looks like this too. Not one magic flag. More layers of tuning that reward people who know their own prompt patterns.
Key Takeaways
- Speculative checkpointing in llama.cpp is confirmed as merged, but the available sources support a narrow claim: it strengthens the practical case for speculative decoding without a separate draft model.
- llama.cpp’s docs explicitly say n-gram methods rely on patterns already present in generated text, which is why code rewrites and structured outputs are the best candidates.
- The real variable is draft acceptance rate. Long accepted runs create speedups. Frequent rejection collapses gains.
- Repetitive code and structured text can benefit from self-speculative decoding. Reasoning-heavy or low-repetition prompts may see little to no benefit.
- Local users should tune for their own acceptance patterns, not for someone else’s benchmark screenshot.
Further Reading
- llama.cpp speculative decoding docs: confirm the main speculative decoding modes, including draft-model and n-gram approaches, and explicitly note that n-gram methods rely on prior generated patterns.
- llama.cpp PR #19493 (speculative checkpointing): confirms the merged feature and its server-side context, even if the detailed implementation trail is thinner in the available packet.
- llama.cpp PR #22105 (DFlash support): useful contrast case for the heavier draft-model end of speculative decoding in llama.cpp.
- llama.cpp PR #21845 (multi-column MMVQ on SYCL): shows how backend optimization can change observed speculative decoding performance even when the decoding method stays the same.
- llama.cpp PR #22066 (Battlemage SYCL optimizations): another reminder that local token generation speedup depends on backend maturity as much as on the speculative method.
The interesting thing about speculative checkpointing is not that it makes llama.cpp universally faster. It makes speed look more like a property of your prompts than a property of a patch.
