In llama.cpp, speculative checkpointing matters for a simple reason: it points local users toward a cheaper speculative path. You can try speculative decoding with n-gram-based self-speculation, without loading a separate draft model into VRAM, and the likely payoff depends less on headline benchmarks than on whether your prompts repeat themselves.
A quick primer if you have not encountered speculative decoding before. When a large language model generates text, it normally produces one token (roughly one word) at a time — each token requires a full forward pass through the model. Speculative decoding speeds this up by guessing several tokens ahead, then verifying those guesses in a single batch. Verification is cheaper than generation because the model can check multiple candidates in parallel. When the guesses are right, you get several tokens for the cost of one verification pass. When they are wrong, you throw them away and fall back to normal speed. The whole bet is on how often the guesses land — which is why the rest of this article keeps coming back to draft acceptance rate.
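To make that bet concrete, here is a toy simulation of the accept-or-fall-back loop. It assumes a fixed per-token acceptance probability and ignores the cost of producing the draft itself (near zero for n-gram drafting). Nothing here is llama.cpp code; it only models the cost accounting.

```python
# Toy cost model of speculative decoding, assuming each drafted token is
# accepted with a fixed probability. Illustrative only, not llama.cpp code.
import random

def simulate(num_tokens=1000, draft_len=8, accept_prob=0.7, seed=0):
    """Count verification passes needed to emit num_tokens tokens."""
    rng = random.Random(seed)
    emitted, passes = 0, 0
    while emitted < num_tokens:
        passes += 1  # one batched forward pass verifies the whole draft
        accepted = 0
        for _ in range(draft_len):
            if rng.random() < accept_prob:
                accepted += 1
            else:
                break  # first rejection discards the rest of the draft
        # the verification pass itself always yields one token
        emitted += accepted + 1
    return num_tokens / passes  # tokens per full-model forward pass

print(f"repetitive text (p=0.8): {simulate(accept_prob=0.8):.2f} tokens/pass")
print(f"novel text      (p=0.2): {simulate(accept_prob=0.2):.2f} tokens/pass")
```

With high acceptance, each forward pass yields several tokens; with low acceptance, the loop degrades toward one token per pass, which is exactly the baseline.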
The confirmed part is narrow but useful. llama.cpp’s speculative decoding docs say the system can generate draft tokens and then verify them in batches, because verifying several guessed tokens at once can be cheaper than decoding every token one by one. The docs also say llama.cpp supports both draft-model methods and n-gram methods such as ngram-simple, ngram-map-*, and ngram-mod.
The merged PR confirms that speculative checkpointing has landed. What the available source material does not establish cleanly is the exact internal mechanism, beyond the fact that server-side speculative decoding support was added. So the right way to read this feature is not “llama.cpp just got universally faster.” It is “llama.cpp just made another speculative decoding path easier to treat as a tuning layer.”
What Speculative Checkpointing Adds to llama.cpp
The easiest way to understand the change is to separate three things that often get blurred together.
Draft-model speculative decoding uses a second, smaller model to guess upcoming tokens. The main model then verifies those guesses in a batch. That can be fast. It also costs extra memory and setup.
Self-speculative decoding does not use a second model. It tries to guess upcoming tokens from patterns in the text history the same model has already produced. In llama.cpp, that includes the n-gram modes documented in the project.
Speculative checkpointing appears, from the merged PR and its labeling, to be a server-side feature aimed at speculative decoding workflows. That much is verified. The exact implementation details are not established by the source packet here, so they should not be overstated.
That still leaves a very practical conclusion.
If you are using ngram-mod or related self-speculative decoding modes, speculative checkpointing fits the same broader direction: making speculation something you can tune, not just a premium feature that starts with “first load another model.”
| Approach | Extra VRAM cost | Setup cost | Best case | Weak spot |
|---|---|---|---|---|
| Draft-model speculative decoding | High | Higher | Strong speedups when draft model predicts well | Needs a second model and enough memory |
| Self-speculative decoding (ngram-mod, etc.) | Low | Low | Repetitive code and structured text | Weak on low-repeat outputs |
| Speculative checkpointing | Low (no extra model) | Moderate (server-side complexity) | Makes speculative tuning practical without a draft model | Exact gains still workload-dependent |
That is why this patch matters.
It changes the cost of trying speculative decoding more than it proves any fixed speedup number.
Why Speedups Vary So Much by Prompt and Model
The docs give away the whole mechanism, if you read them literally.
For n-gram speculation, llama.cpp says these methods “rely on patterns that have already appeared in the generated text.” The docs also give a concrete example of where that helps: rewriting source code with an LLM.
That sentence does more work than most benchmark charts.
If the model is refactoring a long TypeScript file, the output tends to repeat local structures:
- imports
- class boilerplate
- recurring function signatures
- JSON-like object shapes
- framework-specific patterns
Once those token sequences have appeared, an n-gram matcher has something real to grab. It can draft the next stretch because the next stretch often looks like the last one. The main model then verifies that draft. If those guesses keep matching, you get long draft acceptance rate streaks. That is where token generation speedup comes from.
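A minimal sketch of the drafting side: look up the trailing n-gram of the generated history and propose whatever followed it last time. This mirrors the documented idea of n-gram self-speculation; `draft_from_history` is a hypothetical helper, not llama.cpp's actual data structure.

```python
# Sketch of n-gram self-drafting: match the trailing n tokens against
# earlier history and propose the continuation that followed last time.
# Illustrative only; llama.cpp's implementation differs.

def draft_from_history(tokens, n=3, max_draft=8):
    """Propose up to max_draft tokens by matching the trailing n-gram."""
    if len(tokens) <= n:
        return []
    key = tuple(tokens[-n:])
    # scan earlier history (excluding the trailing n-gram itself) for the
    # most recent occurrence of the same n-gram
    for i in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[i:i + n]) == key:
            return tokens[i + n:i + n + max_draft]
    return []

code = "def f ( x ) : return x def g ( y ) : return".split()
print(draft_from_history(code, n=2))
# drafts the continuation that followed ": return" the first time around
```

Note the failure mode built into this example: the first drafted token is `x`, carried over from the earlier function, while the correct continuation is `y`. A real verification pass would reject it there, which is the rejection cost the rest of this article keeps returning to.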
A one-off reasoning prompt looks different.
Ask for a novel explanation, a planning chain, or an answer that keeps changing direction, and the model may not reuse many local token sequences at all. The history is less repetitive. The n-gram draft has less to latch onto. Drafts get shorter or get rejected. The speculative path falls back toward baseline.
That is why benchmark claims without prompt context are close to useless.
A reported speedup number tells you almost nothing unless you know what kind of text produced it. The same model can look great on repetitive code and flat on exploratory reasoning. NovaKnown’s piece on LLM performance drop made the same point in a different context: performance is always attached to a workload, whether marketers admit it or not.
One concrete way to picture it:
- Code refactoring prompt: rename a set of methods, preserve structure, emit the whole file
  - Earlier tokens create many reusable local patterns
  - ngram-mod can draft repeated chunks
  - Acceptance can come in streaks
- Reasoning prompt: compare three hiring plans under changing constraints
  - Each sentence introduces new combinations
  - Few local repeats
  - Acceptance is sparse
The mechanism is boring. The consequences are not.
Which Workloads Benefit, and Which Don’t
The best workloads for speculative checkpointing plus n-gram self-speculation are the ones many people underrate because they are unglamorous.
Code rewrites are near the top of the list. Not greenfield coding. Rewrites. The docs explicitly mention source-code rewriting because that is exactly the case where prior token history is rich enough to predict what comes next.
Structured text is another good fit:
- JSON with recurring keys
- config files
- repetitive documentation templates
- schema-heavy outputs
- boilerplate-heavy framework code
These tasks often produce the same shapes over and over. Self-speculative decoding likes shapes.
Weak candidates are almost the inverse:
- short prompts with little generated history
- open-ended essays
- brainstorming across shifting topics
- novel reasoning
- anything where each next sentence is genuinely new
That does not mean n-gram methods never help outside code. It means you should expect help when the text repeats local syntax, not when it merely shares a topic.
There is one broader point worth keeping from the bigger speculative decoding story. Earlier work like DFlash speculative decoding sits on the opposite end of the trade-off curve: more machinery, potentially more speed. Speculative checkpointing reinforces that llama.cpp speculative decoding is no longer one trick. It is a menu of trade-offs.
What This Means for Local Inference Tuning
Start from the variable that matters: draft acceptance rate.
Not “tokens per second” in the abstract. Not a screenshot from someone else’s benchmark. Acceptance.
If accepted drafts come in long runs, self-speculative decoding can feel almost free. If they do not, you are just adding speculative work that gets thrown away.
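A back-of-envelope way to see why acceptance dominates: if each drafted token is accepted independently with probability p and the draft is cut at the first rejection, one verification pass yields, in expectation, 1 + p + p² + … + p^k tokens. The independence assumption is mine; real text violates it (acceptance comes in streaks), but the shape of the curve is the point.

```python
# Expected tokens per verification pass under a simple independence model.
# Assumption (not from llama.cpp): each drafted token is accepted with
# probability p, and the draft is truncated at the first rejection.

def tokens_per_pass(p, draft_len):
    # expected accepted tokens = p + p^2 + ... + p^draft_len,
    # plus the one token the verification pass produces itself
    return 1 + sum(p ** i for i in range(1, draft_len + 1))

for p in (0.2, 0.5, 0.8, 0.95):
    print(f"p={p:.2f}: {tokens_per_pass(p, 16):.2f} tokens per pass")
```

The curve is sharply convex: going from p=0.2 to p=0.5 roughly doubles the yield, while going from p=0.8 to p=0.95 more than doubles it again. Small improvements in acceptance at the high end are worth far more than anywhere else.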
A practical first pass looks like this:
| Parameter | Try first | Likely effect | Trade-off |
|---|---|---|---|
| `--spec-type` | `ngram-mod` | Enables self-speculative decoding without a draft model | Gains depend on repeated token patterns |
| `--spec-ngram-size-n` | 8, 12, 24 | Smaller values find matches more often | More weak matches, more rejection |
| `--draft-min` | 16, 32, 48 | Starts drafting sooner | More overhead if acceptance is poor |
| `--draft-max` | 16, 32, 64 | Can amplify long acceptance streaks | More wasted work on rejected drafts |
The most interesting knob is usually `--spec-ngram-size-n`.
A large n-gram size asks for a stricter match. That tends to work better when the output is strongly repetitive, because the matcher is looking for a long repeated sequence. A smaller n-gram size is more permissive. It may find more candidate matches on mixed code-and-prose prompts, but it also raises the chance of bad guesses that the main model rejects.
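One way to see that trade-off without touching a server: count how often an n-gram in a text has already appeared earlier, for small versus large n. `match_rate` is an illustrative stand-in for what a matcher could find, not llama.cpp code, and the sample text is invented.

```python
# Illustrates the n-gram size trade-off: short patterns repeat constantly,
# long ones span distinct identifiers and stop repeating. Not llama.cpp code.

def match_rate(tokens, n):
    """Fraction of n-grams that already occurred earlier in the stream."""
    seen, hits, total = set(), 0, 0
    for i in range(n, len(tokens)):
        gram = tuple(tokens[i - n:i])
        total += 1
        if gram in seen:
            hits += 1
        seen.add(gram)
    return hits / total if total else 0.0

# three near-identical statements that differ only in one identifier
text = ("obj . field = obj . field + 1 ; "
        "obj . count = obj . count + 1 ; "
        "obj . total = obj . total + 1 ;").split()
for n in (2, 4, 8):
    print(f"n={n}: {match_rate(text, n):.2f}")
```

On this sample, 2-grams like `obj .` repeat about half the time, while no 8-gram ever repeats, because every long window straddles a distinct identifier. That is the stricter-match trade-off in miniature: large n only pays off when the output repeats long exact sequences.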
So the tuning logic is simple:
- highly repetitive codebase rewrite: try larger n-grams
- mixed coding assistant prompt: try medium n-grams
- reasoning-heavy chat: do not expect much, no matter how you tune it
That is a better mental model than asking whether speculative checkpointing is “worth it.”
It is worth it when your workload produces reusable token history.
This is also why measuring your own prompts matters more than copying a flag set from someone else. The Ralph Wiggum technique applies here nicely: try the simple thing first, then watch what the system actually does.
The next round of llama.cpp gains probably looks like this too. Not one magic flag. More layers of tuning that reward people who know their own prompt patterns.
Key Takeaways
- Speculative checkpointing in llama.cpp is confirmed as merged, but the available sources support a narrow claim: it strengthens the practical case for speculative decoding without a separate draft model.
- llama.cpp’s docs explicitly say n-gram methods rely on patterns already present in generated text, which is why code rewrites and structured outputs are the best candidates.
- The real variable is draft acceptance rate. Long accepted runs create speedups. Frequent rejection collapses gains.
- Repetitive code and structured text can benefit from self-speculative decoding. Reasoning-heavy or low-repetition prompts may see little to no benefit.
- Local users should tune for their own acceptance patterns, not for someone else’s benchmark screenshot.
Further Reading
- llama.cpp speculative decoding docs: confirm the main speculative decoding modes, including draft-model and n-gram approaches, and explicitly note that n-gram methods rely on prior generated patterns.
- llama.cpp PR #19493 (speculative checkpointing): confirms the merged feature and its server-side context, even if the detailed implementation trail is thinner in the available packet.
- llama.cpp PR #22105 (DFlash support): useful contrast case for the heavier draft-model end of speculative decoding in llama.cpp.
- llama.cpp PR #21845 (multi-column MMVQ on SYCL): shows how backend optimization can change observed speculative decoding performance even when the decoding method stays the same.
- llama.cpp PR #22066 (Battlemage SYCL optimizations): another reminder that local token generation speedup depends on backend maturity as much as on the speculative method.
The interesting thing about speculative checkpointing is not that it makes llama.cpp universally faster. It makes speed look more like a property of your prompts than a property of a patch.
