Anthropic’s docs and Boris Cherny’s public clarification describe a split world: main-agent requests typically get a 1-hour cache, while sub-agents typically get 5 minutes. User logs from Claude Code show the part that hurts: requests that had been billed with ephemeral_1h_input_tokens started showing ephemeral_5m_input_tokens again in early March, and normal pauses began looking expensive.
That’s the core of the Claude Code cache bug. The real bug is semantic: users expected 1-hour cache behavior, but observable billing shifted toward 5-minute rewrites, turning caching from optimization into a hidden tax. The unresolved question is which mechanism caused it: a default changed, routing changed, or the request mix changed. The logs don’t settle that. They do show that the billable behavior moved.
## Why the Claude Code cache bug matters for billing
Prompt caching is supposed to make long coding sessions cheaper after the first expensive write. You send a huge repo context once, then later turns mostly hit cache reads instead of paying cache creation costs again.
The failure mode is brutally simple. Load 100,000 tokens of repo context, ask a question, step away for 6 minutes, come back, send a follow-up. Under a 1-hour cache TTL, that second turn should mostly be a cheap read. Under a 5-minute TTL, it can become a fresh cache write.
That changes three things at once:
- API cost
- quota burn
- how long you can work uninterrupted before Claude Code starts acting expensive for no visible reason
If your cap is hit by token accounting rather than visible chat count, two extra rewrites after a coffee break can shorten a work session even when the conversation barely moved. That’s why this landed as a billing story, not just a weird implementation detail. It also fits the broader pattern in our earlier piece on the Claude Code regression: users end up reverse-engineering behavior from traces because the UI doesn’t explain the economics.
| Pause before next prompt | Active TTL | Likely path | Cost on 100k cached tokens |
|---|---|---|---|
| 4 minutes | 1h or 5m | Read | $0.03 |
| 6 minutes | 1h | Read | $0.03 |
| 6 minutes | 5m | Write | $0.375 |
| 20 minutes | 1h | Read | $0.03 |
| 20 minutes | 5m | Write | $0.375 |
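The table's cost column falls out of simple per-token arithmetic. Here is a minimal sketch of the boundary effect, using the Sonnet rates cited later in the piece ($0.30/MTok reads, $3.75/MTok 5-minute writes); the function name and the read-vs-write decision rule are illustrative assumptions, not Anthropic's actual billing logic:

```python
# Estimate the cost of the cached portion of one follow-up request, depending
# on whether the pause exceeded the active cache TTL. Prices are the cited
# Sonnet rates; treat them as assumptions that may change.
READ_PER_MTOK = 0.30      # $/MTok for a cache read
WRITE_5M_PER_MTOK = 3.75  # $/MTok for a 5-minute cache write

def followup_cost(cached_tokens: int, pause_min: float, ttl_min: float) -> float:
    """Cost of the cached portion of a single follow-up prompt."""
    mtok = cached_tokens / 1_000_000
    if pause_min <= ttl_min:           # cache still warm: cheap read
        return mtok * READ_PER_MTOK
    return mtok * WRITE_5M_PER_MTOK    # cache expired: full rewrite

# 100k cached tokens, 6-minute coffee break
print(round(followup_cost(100_000, 6, 60), 4))  # 1h TTL -> 0.03
print(round(followup_cost(100_000, 6, 5), 4))   # 5m TTL -> 0.375
```

The asymmetry is the whole story: the same pause costs 12.5x more the moment it crosses the TTL boundary.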
## What the logs actually show about the TTL change
The GitHub issue is persuasive because it uses local evidence, not vibes. The author analyzed 119,866 API calls from Claude Code session JSONL files stored under ~/.claude/projects/, across two machines and two accounts, from Jan 11 to Apr 11, 2026. The report says Claude Code itself stayed on the same version during the key transition window, with no client-side setting change to explain the flip.
Those JSONL files include explicit usage fields such as:
- usage.cache_creation.ephemeral_1h_input_tokens
- usage.cache_creation.ephemeral_5m_input_tokens
That matters because the analysis is looking at the token buckets Anthropic returned with each request. It isn’t guessing TTL from invoice totals.
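The same analysis is easy to reproduce on your own logs. A minimal sketch, assuming the layout the issue describes (one JSON object per line under ~/.claude/projects/, with the cache-creation buckets in a usage block); the exact nesting of the usage fields is an assumption, so adjust the lookups if your schema differs:

```python
# Tally 1h vs 5m cache-creation tokens per day from Claude Code session logs.
# Field names come from the GitHub issue; where the usage block sits inside
# each record is an assumption (top-level "usage" or "message.usage").
import json
from collections import defaultdict
from pathlib import Path

def tally(projects_dir: str = "~/.claude/projects") -> dict:
    totals = defaultdict(lambda: {"1h": 0, "5m": 0})
    root = Path(projects_dir).expanduser()
    if not root.exists():
        return {}
    for path in root.rglob("*.jsonl"):
        for line in path.read_text().splitlines():
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip blank or non-JSON lines
            usage = rec.get("usage") or rec.get("message", {}).get("usage") or {}
            cc = usage.get("cache_creation") or {}
            day = (rec.get("timestamp") or "")[:10]  # "YYYY-MM-DD"
            totals[day]["1h"] += cc.get("ephemeral_1h_input_tokens", 0)
            totals[day]["5m"] += cc.get("ephemeral_5m_input_tokens", 0)
    return dict(totals)

for day, buckets in sorted(tally().items()):
    print(day, buckets)
```

A day-by-day printout like this is exactly what makes the March 6-8 transition below visible as a shift in which bucket dominates.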
The rough timeline looks like this:
| Date range | Dominant field pattern | What it suggests |
|---|---|---|
| Jan 11-Jan 31 | ephemeral_5m_input_tokens often present | Early sessions leaned 5m |
| Feb 1-Mar 5 | ephemeral_1h_input_tokens dominates | A long 1h-heavy stretch |
| Mar 6-Mar 8 | Mixed 1h and 5m entries | Transition window |
| Mar 8 onward | ephemeral_5m_input_tokens common again | Post-change behavior leaned back to 5m |
March 6-8 is the interesting part. A random workflow change doesn’t usually show up on two machines and two accounts as a short transition from mostly-1h to mixed to mostly-5m. That pattern looks much more like a rollout or routing shift than somebody coincidentally changing how they prompt on the same dates in two places.
A tiny example of the observable change:

```
usage.cache_creation.ephemeral_1h_input_tokens: 98234
usage.cache_creation.ephemeral_5m_input_tokens: 0
```

then later:

```
usage.cache_creation.ephemeral_1h_input_tokens: 0
usage.cache_creation.ephemeral_5m_input_tokens: 98177
```
That’s not proof of root cause. It is strong evidence of a server-side billing behavior change visible from the client logs.
## Why a 5-minute cache TTL inflates quotas and costs
Anthropic’s prompt caching documentation gives the pricing mechanism. Cache writes cost much more than cache reads. The GitHub issue uses published Sonnet pricing of $3.75 per million tokens for 5-minute cache writes and $0.30 per million tokens for cache reads. That’s a 12.5x spread.
For 100,000 cached tokens:
- cache read: 0.1 MTok × $0.30 = $0.03
- 5m cache write: 0.1 MTok × $3.75 = $0.375
Same follow-up prompt. Same codebase. Same user. Cross the wrong TTL boundary, and the cached portion of the request gets about 12.5x more expensive.
One prompt isn’t the problem. A workday is.
Imagine three follow-ups after the initial cache write:
- after 4 minutes
- after 12 minutes
- after 18 minutes
Under a 1-hour cache, all three can be reads:
- 3 × $0.03 = $0.09
Under a 5-minute cache, the first is a read, but the next two trigger rewrites:
- 1 × $0.03 + 2 × $0.375 = $0.78
That’s $0.69 more for three ordinary follow-ups on 100k cached tokens. Scale that to larger contexts and longer sessions and the “optimization” starts acting like a meter running in the background.
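The workday arithmetic above can be sketched as a tiny simulation. This models each follow-up by its gap since the previous turn (4, then 8, then 6 minutes, matching follow-ups at 4, 12, and 18 minutes into the session); the prices are the cited Sonnet rates, and the whole model is an illustrative assumption, not Anthropic's billing code:

```python
# Simulate the cached portion of a session: each follow-up within the TTL is
# a cheap read; past the TTL it becomes a fresh 5-minute-tier write.
# Prices are the cited Sonnet rates (assumed).
READ = 0.30   # $/MTok, cache read
WRITE = 3.75  # $/MTok, 5m cache write

def session_cost(cached_tokens: int, gaps_min: list[float], ttl_min: float) -> float:
    """Cost of all follow-ups (the initial cache write is excluded)."""
    mtok = cached_tokens / 1_000_000
    return sum(mtok * (READ if gap <= ttl_min else WRITE) for gap in gaps_min)

gaps = [4, 8, 6]  # minutes since the previous turn: turns at 4, 12, 18 min
print(round(session_cost(100_000, gaps, 60), 2))  # 1h TTL -> 0.09
print(round(session_cost(100_000, gaps, 5), 2))   # 5m TTL -> 0.78
```

Same three prompts, same 100k tokens of context; only the TTL differs, and the session costs roughly nine times as much.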
## What Anthropic says versus what users observed
Anthropic’s clarification is specific: main agents typically use 1h cache; sub-agents typically use 5m. That narrows the claim a lot. The strongest version is no longer “Anthropic definitely broke one global TTL setting.”
Here’s what the logs can and cannot tell us:
| Question | What logs can verify |
|---|---|
| Were users billed with 1h cache creation fields in one period and 5m fields later? | Yes |
| Did that shift happen around March 6-8 across multiple environments? | Yes |
| Did a single global default definitely change? | No |
| Could routing or agent mix have changed instead? | Yes |
| Did the economics for users get worse when more requests landed in 5m buckets? | Yes |
So the narrow, defensible conclusion is the useful one: the observable billing semantics changed. Maybe the underlying architecture was always more complex. Maybe sub-agent usage increased. Maybe server-side routing changed. Users still bought what looked like 1-hour cache behavior for normal pauses and then saw more 5-minute rewrites in the logs.
Architecture can be documented. Billing semantics are what users actually buy.
## Key Takeaways
- The Claude Code cache bug is mainly a billing-semantics problem: users expected 1-hour-style cache behavior, but observed billing shifted toward 5-minute rewrites.
- Boris Cherny’s clarification matters: main agents typically use 1h cache, sub-agents typically use 5m, which means the unresolved question is default change versus routing change versus request-mix change.
- The evidence is stronger than a complaint thread: the GitHub issue analyzes 119,866 local JSONL records across two machines and two accounts, using explicit 1h and 5m cache fields.
- March 6-8 looks like a rollout window: the logs move from 1h-heavy to mixed to 5m-heavy in a way that doesn’t look like random workflow noise.
- A 5-minute cache TTL can quietly shorten sessions: repeated cache_creation writes raise both dollar cost and quota burn even when the conversation barely changes.
## Further Reading
- GitHub issue #46829, "Cache TTL silently regressed from 1h to 5m": raw JSONL analysis and cost estimates.
- Anthropic prompt caching documentation: official cache TTL and pricing reference.
- Claude Usage Tracker Chrome extension: shows how users monitor caps and usage.
- YesMem per-thread keepalive documentation: a workaround for keeping threads warm.
