Anthropic’s docs and Boris Cherny’s public clarification describe a split world: main-agent requests typically get a 1-hour cache, while sub-agents typically get 5 minutes. User logs from Claude Code show the part that hurts: requests that had been billed with ephemeral_1h_input_tokens started showing ephemeral_5m_input_tokens again in early March, and normal pauses began looking expensive.
That’s the core of the Claude Code cache bug. The real bug is semantic: users expected 1-hour cache behavior, but observable billing shifted toward 5-minute rewrites, turning caching from optimization into a hidden tax. The unresolved question is which mechanism caused it: a default changed, routing changed, or the request mix changed. The logs don’t settle that. They do show that the billable behavior moved.
## Why the Claude Code cache bug matters for billing
Prompt caching is supposed to make long coding sessions cheaper after the first expensive write. You send a huge repo context once, then later turns mostly hit cache reads instead of paying cache creation costs again.
The failure mode is brutally simple. Load 100,000 tokens of repo context, ask a question, step away for 6 minutes, come back, send a follow-up. Under a 1-hour cache TTL, that second turn should mostly be a cheap read. Under a 5-minute TTL, it can become a fresh cache write.
That changes three things at once:
- API cost
- quota burn
- how long you can work uninterrupted before Claude Code starts acting expensive for no visible reason
If your cap is hit by token accounting rather than visible chat count, two extra rewrites after a coffee break can shorten a work session even when the conversation barely moved. That’s why this landed as a billing story, not just a weird implementation detail. It also fits the broader pattern in our earlier piece on the Claude Code regression: users end up reverse-engineering behavior from traces because the UI doesn’t explain the economics.
| Pause before next prompt | Active TTL | Likely path | Cost on 100k cached tokens |
|---|---|---|---|
| 4 minutes | 1h or 5m | Read | $0.03 |
| 6 minutes | 1h | Read | $0.03 |
| 6 minutes | 5m | Write | $0.375 |
| 20 minutes | 1h | Read | $0.03 |
| 20 minutes | 5m | Write | $0.375 |
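The table's cost column falls out of simple per-token arithmetic. Here is a minimal sketch of the boundary effect, using the Sonnet rates cited later in the piece ($0.30/MTok reads, $3.75/MTok 5-minute writes); the function name and the read-vs-write decision rule are illustrative assumptions, not Anthropic's actual billing logic:

```python
# Estimate the cost of the cached portion of one follow-up request, depending
# on whether the pause exceeded the active cache TTL. Prices are the cited
# Sonnet rates; treat them as assumptions that may change.
READ_PER_MTOK = 0.30      # $/MTok for a cache read
WRITE_5M_PER_MTOK = 3.75  # $/MTok for a 5-minute cache write

def followup_cost(cached_tokens: int, pause_min: float, ttl_min: float) -> float:
    """Cost of the cached portion of a single follow-up prompt."""
    mtok = cached_tokens / 1_000_000
    if pause_min <= ttl_min:           # cache still warm: cheap read
        return mtok * READ_PER_MTOK
    return mtok * WRITE_5M_PER_MTOK    # cache expired: full rewrite

# 100k cached tokens, 6-minute coffee break
print(round(followup_cost(100_000, 6, 60), 4))  # 1h TTL -> 0.03
print(round(followup_cost(100_000, 6, 5), 4))   # 5m TTL -> 0.375
```

The asymmetry is the whole story: the same pause costs 12.5x more the moment it crosses the TTL boundary.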
## What the logs actually show about the TTL change
The GitHub issue is persuasive because it uses local evidence, not vibes. The author analyzed 119,866 API calls from Claude Code session JSONL files stored under ~/.claude/projects/, across two machines and two accounts, from Jan 11 to Apr 11, 2026. The report says Claude Code itself stayed on the same version during the key transition window, with no client-side setting change to explain the flip.
Those JSONL files include explicit usage fields such as:
- usage.cache_creation.ephemeral_1h_input_tokens
- usage.cache_creation.ephemeral_5m_input_tokens
That matters because the analysis is looking at the token buckets Anthropic returned with each request. It isn’t guessing TTL from invoice totals.
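The same analysis is easy to reproduce on your own logs. A minimal sketch, assuming the layout the issue describes (one JSON object per line under ~/.claude/projects/, with the cache-creation buckets in a usage block); the exact nesting of the usage fields is an assumption, so adjust the lookups if your schema differs:

```python
# Tally 1h vs 5m cache-creation tokens per day from Claude Code session logs.
# Field names come from the GitHub issue; where the usage block sits inside
# each record is an assumption (top-level "usage" or "message.usage").
import json
from collections import defaultdict
from pathlib import Path

def tally(projects_dir: str = "~/.claude/projects") -> dict:
    totals = defaultdict(lambda: {"1h": 0, "5m": 0})
    root = Path(projects_dir).expanduser()
    if not root.exists():
        return {}
    for path in root.rglob("*.jsonl"):
        for line in path.read_text().splitlines():
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip blank or non-JSON lines
            usage = rec.get("usage") or rec.get("message", {}).get("usage") or {}
            cc = usage.get("cache_creation") or {}
            day = (rec.get("timestamp") or "")[:10]  # "YYYY-MM-DD"
            totals[day]["1h"] += cc.get("ephemeral_1h_input_tokens", 0)
            totals[day]["5m"] += cc.get("ephemeral_5m_input_tokens", 0)
    return dict(totals)

for day, buckets in sorted(tally().items()):
    print(day, buckets)
```

A day-by-day printout like this is exactly what makes the March 6-8 transition below visible as a shift in which bucket dominates.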
The rough timeline looks like this:
| Date range | Dominant field pattern | What it suggests |
|---|---|---|
| Jan 11-Jan 31 | ephemeral_5m_input_tokens often present | Early sessions leaned 5m |
| Feb 1-Mar 5 | ephemeral_1h_input_tokens dominates | A long 1h-heavy stretch |
| Mar 6-Mar 8 | Mixed 1h and 5m entries | Transition window |
| Mar 8 onward | ephemeral_5m_input_tokens common again | Post-change behavior leaned back to 5m |
March 6-8 is the interesting part. A random workflow change doesn’t usually show up on two machines and two accounts as a short transition from mostly-1h to mixed to mostly-5m. That pattern looks much more like a rollout or routing shift than somebody coincidentally changing how they prompt on the same dates in two places.
A tiny example of the observable change:

```
usage.cache_creation.ephemeral_1h_input_tokens: 98234
usage.cache_creation.ephemeral_5m_input_tokens: 0
```

then later:

```
usage.cache_creation.ephemeral_1h_input_tokens: 0
usage.cache_creation.ephemeral_5m_input_tokens: 98177
```
That’s not proof of root cause. It is strong evidence of a server-side billing behavior change visible from the client logs.
## Why a 5-minute cache TTL inflates quotas and costs
Anthropic’s prompt caching documentation gives the pricing mechanism. Cache writes cost much more than cache reads. The GitHub issue uses published Sonnet pricing of $3.75 per million tokens for 5-minute cache writes and $0.30 per million tokens for cache reads. That’s a 12.5x spread.
For 100,000 cached tokens:
- cache read: 0.1 MTok × $0.30 = $0.03
- 5m cache write: 0.1 MTok × $3.75 = $0.375
Same follow-up prompt. Same codebase. Same user. Cross the wrong TTL boundary, and the cached portion of the request gets about 12.5x more expensive.
One prompt isn’t the problem. A workday is.
Imagine three follow-ups after the initial cache write:
- after 4 minutes
- after 12 minutes
- after 18 minutes
Under a 1-hour cache, all three can be reads:
- 3 × $0.03 = $0.09
Under a 5-minute cache, the first is a read, but the next two trigger rewrites:
- 1 × $0.03 + 2 × $0.375 = $0.78
That’s $0.69 more for three ordinary follow-ups on 100k cached tokens. Scale that to larger contexts and longer sessions and the “optimization” starts acting like a meter running in the background.
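The workday arithmetic above can be sketched as a tiny simulation. This models each follow-up by its gap since the previous turn (4, then 8, then 6 minutes, matching follow-ups at 4, 12, and 18 minutes into the session); the prices are the cited Sonnet rates, and the whole model is an illustrative assumption, not Anthropic's billing code:

```python
# Simulate the cached portion of a session: each follow-up within the TTL is
# a cheap read; past the TTL it becomes a fresh 5-minute-tier write.
# Prices are the cited Sonnet rates (assumed).
READ = 0.30   # $/MTok, cache read
WRITE = 3.75  # $/MTok, 5m cache write

def session_cost(cached_tokens: int, gaps_min: list[float], ttl_min: float) -> float:
    """Cost of all follow-ups (the initial cache write is excluded)."""
    mtok = cached_tokens / 1_000_000
    return sum(mtok * (READ if gap <= ttl_min else WRITE) for gap in gaps_min)

gaps = [4, 8, 6]  # minutes since the previous turn: turns at 4, 12, 18 min
print(round(session_cost(100_000, gaps, 60), 2))  # 1h TTL -> 0.09
print(round(session_cost(100_000, gaps, 5), 2))   # 5m TTL -> 0.78
```

Same three prompts, same 100k tokens of context; only the TTL differs, and the session costs roughly nine times as much.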
## What Anthropic says versus what users observed
Anthropic’s clarification is specific: main agents typically use 1h cache; sub-agents typically use 5m. That narrows the claim a lot. The strongest version is no longer “Anthropic definitely broke one global TTL setting.”
Here’s what the logs can and cannot tell us:
| Question | What logs can verify |
|---|---|
| Were users billed with 1h cache creation fields in one period and 5m fields later? | Yes |
| Did that shift happen around March 6-8 across multiple environments? | Yes |
| Did a single global default definitely change? | No |
| Could routing or agent mix have changed instead? | Yes |
| Did the economics for users get worse when more requests landed in 5m buckets? | Yes |
So the narrow, defensible conclusion is the useful one: the observable billing semantics changed. Maybe the underlying architecture was always more complex. Maybe sub-agent usage increased. Maybe server-side routing changed. Users still bought what looked like 1-hour cache behavior for normal pauses and then saw more 5-minute rewrites in the logs.
Architecture can be documented. Billing semantics are what users actually buy.
## Key Takeaways
- The Claude Code cache bug is mainly a billing-semantics problem: users expected 1-hour-style cache behavior, but observed billing shifted toward 5-minute rewrites.
- Boris Cherny’s clarification matters: main agents typically use 1h cache, sub-agents typically use 5m, which means the unresolved question is default change versus routing change versus request-mix change.
- The evidence is stronger than a complaint thread: the GitHub issue analyzes 119,866 local JSONL records across two machines and two accounts, using explicit 1h and 5m cache fields.
- March 6-8 looks like a rollout window: the logs move from 1h-heavy to mixed to 5m-heavy in a way that doesn’t look like random workflow noise.
- A 5-minute cache TTL can quietly shorten sessions: repeated cache_creation writes raise both dollar cost and quota burn even when the conversation barely changes.
## Further Reading
- GitHub issue #46829, "Cache TTL silently regressed from 1h to 5m": raw JSONL analysis and cost estimates.
- Anthropic prompt caching documentation: official cache TTL and pricing reference.
- Claude Usage Tracker Chrome extension: shows how users monitor caps and usage.
- YesMem per-thread keepalive documentation: a workaround for keeping threads warm.
