Claude Code money-saving tips: an engineer saved 300 million tokens in a week with caching, and the secret is not breaking the cache

Running out of Claude Code quota in long conversations? Engineer Nate Herk shows how caching saved him 300 million tokens in a week, up to 91 million in a single day. The key is not how much code you write but how to avoid "breaking" the cache, so that repeated context does not waste money.

Table of Contents

  • Caching costs only 10%, 91 million tokens equals 9 million
  • Three-layer architecture: system, project, conversation, stacking layer by layer
  • Most common "break" trap: model switching and 1-hour gaps
  • Engineer-made dashboard: view Cache Read and Create
  • Practical tips: Session Handoff saves more than /compact

Many developers find that when writing code with Claude Code, the biggest headache is often the rapid depletion of token quotas, making long conversations almost a luxury.

But Nate Herk, a creator who often shares AI usage tips with the community, explained in a post on X that the real cost killer isn’t the amount of code but whether the session makes effective use of prompt caching. He saved over 300 million tokens in a week, with cache volume peaking at 91 million tokens in a single day. Since cached tokens cost only 10% of regular input tokens, those 91 million are billed like roughly 9 million regular tokens, which extends the life of a conversation almost for "free".


This week I saved more than 300 million tokens, including 91 million in a single day.

I didn’t change any settings. This is just prompt caching working normally in the background.

But once I truly understood what the cache is and how to avoid "breaking" it, I could keep conversations going longer within the same quota. So here is an 80/20 beginner’s guide to Claude Code prompt caching, without deep API-level details.

The cost of cache tokens is only 10% of regular input tokens. 91 million cache tokens roughly equate to 9 million tokens billed.

Claude Code subscription TTL is 1 hour; API default is 5 minutes; Sub-agent always 5 minutes.

Cache is divided into three layers: system layer, project layer, conversation layer.

Switching models mid-conversation can break the cache, including turning on "opus plan" mode.

coding agents need glass boxes now

jianshuo/ccglass

111 stars on github
created yesterday
mit + javascript
local proxy + web dashboard for claude code, codex, deepseek-tui, and kimi
shows the full system prompt, tool schemas, message history, token/cache/cost, and… pic.twitter.com/Wot5SFV16N

— Beau Johnson (@BeauJohnson89) May 24, 2026

Caching costs only 10%, 91 million tokens equals 9 million

Every cached token costs only 10% of a regular input token.

So, when my dashboard shows that on a certain day 91 million tokens hit the cache, the actual billed amount is roughly equivalent to processing 9 million tokens. This is why, compared to no cache, long-term use of Claude Code feels almost "free" in extending conversation sessions.
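
The arithmetic behind that figure can be sketched in a few lines. This is only an illustration that assumes the 10% ratio quoted above; the function name and the constant are hypothetical and not part of Claude Code.

```python
# Sketch: express cache-read tokens as "regular input token" equivalents.
# CACHE_READ_RATIO is the 10% figure quoted in the article, not an official constant.
CACHE_READ_RATIO = 0.10


def billed_equivalent(cache_read_tokens: int, ratio: float = CACHE_READ_RATIO) -> int:
    """Return how many regular input tokens would cost the same as the cached reads."""
    return round(cache_read_tokens * ratio)


daily_cache_reads = 91_000_000
print(billed_equivalent(daily_cache_reads))  # 9100000
```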

Two numbers in the dashboard are worth paying attention to:

Cache create: the one-time cost of writing content into the cache. It pays off on the next turn.
Cache read: tokens Claude reuses from the cache, such as your CLAUDE.md, tool definitions, and previous messages. This costs one-tenth of what reprocessing them as regular input would.

If your Cache read number is high, it indicates effective cache utilization; if low, you’re paying repeatedly for the same context.
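
One rough way to turn those two numbers into a single health indicator is a hit rate. The sketch below assumes the rate is defined as cache reads divided by all input-side tokens; that definition, the function name, and every sample figure except the 91 million are assumptions made for illustration, not how Anthropic or the dashboard computes it.

```python
def cache_hit_rate(cache_read: int, cache_create: int, uncached_input: int) -> float:
    """Share of all input-side tokens that were served from cache (assumed definition)."""
    total = cache_read + cache_create + uncached_input
    return cache_read / total if total else 0.0


# 91 million cache reads from the article; the other two figures are made up.
rate = cache_hit_rate(cache_read=91_000_000, cache_create=4_000_000, uncached_input=1_000_000)
print(f"{rate:.1%}")  # 94.8%
```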

Anthropic’s Thariq said something very memorable: "We actually monitor prompt cache hit rates, and if the hit rate drops too low, we trigger alerts or even declare SEV-level incidents."

He also wrote a very good X article. When cache hit rate is high, four things happen simultaneously: Claude Code feels faster, Anthropic’s service costs decrease, your subscription quota lasts longer, and long-term coding sessions become more feasible.

But if the hit rate is low, everyone suffers.

Thus, the incentives are aligned: Anthropic wants a higher cache hit rate, and so do you. The real drag comes from small, seemingly insignificant habits that quietly force the cache to be rebuilt.

Three-layer architecture: system, project, conversation, stacking layer by layer

Cache relies on prefix matching, meaning "matching the beginning of the string."

No need to dive into deep technical details—just understand: as long as the content before a certain point matches exactly what’s cached, Claude can reuse those tokens.

According to the Claude Code documentation, a new session usually proceeds as follows:

First round: no cache yet. System prompt, your project context (like CLAUDE.md, memory, rules), and your first message are processed anew and written into cache.

Second round: all content from the first round is now cached. Claude only needs to process your new reply and next message. Costs are much lower.

Third round: same logic. The previous dialogue remains cached, and only the latest exchange needs to be processed.
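
The three rounds above can be simulated with a toy model of prefix matching. This is a simplified sketch: it compares whole content blocks rather than tokens, and the block labels and function names are hypothetical. Real caching works on the tokenized request, so treat the output as an illustration of the idea only.

```python
def shared_prefix_len(cached: list[str], request: list[str]) -> int:
    """Count leading blocks that are identical in the cached and the new request."""
    count = 0
    for old, new in zip(cached, request):
        if old != new:
            break
        count += 1
    return count


def simulate(turns: list[list[str]]) -> list[tuple[int, int]]:
    """Return (blocks read from cache, blocks processed fresh) for each turn."""
    cached: list[str] = []
    report = []
    for request in turns:
        hit = shared_prefix_len(cached, request)
        report.append((hit, len(request) - hit))
        cached = list(request)  # the whole request becomes the new cached prefix
    return report


round1 = ["system", "CLAUDE.md", "user-1"]
round2 = round1 + ["reply-1", "user-2"]
round3 = round2 + ["reply-2", "user-3"]
print(simulate([round1, round2, round3]))  # [(0, 3), (3, 2), (5, 2)]
```

The first turn reads nothing from cache, and every later turn only processes the two newest blocks.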

According to Thariq’s X article, the cache itself can be divided into three layers:

System layer: includes core instructions, tool definitions (read, write, bash, grep, glob), and output styles. This layer is globally cached.

Project layer: includes CLAUDE.md, memory, and project rules. Cached per project.

Conversation layer: includes replies and messages, growing with each turn.

Most common "break" trap: model switching and 1-hour gaps

If the system or project layer content changes during a session, everything must be re-cached from scratch. This is the most "expensive" operation. Imagine you’re at message 16 and suddenly change the system prompt or pause for an hour: every token from message 1 onward must be reprocessed.
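
To get a feel for how expensive such a break is, here is a minimal sketch that compares one turn with a valid cache against the same turn after a break. It assumes the 10% read ratio from this article and ignores any extra charge for writing the cache, which the article does not quantify. The context size and message size are hypothetical.

```python
def turn_cost(prefix_tokens: int, new_tokens: int, cache_valid: bool,
              read_ratio: float = 0.10) -> int:
    """Cost of one turn in regular-input-token equivalents (cache writes ignored)."""
    if cache_valid:
        # The earlier context is read from cache at the discounted ratio.
        return round(prefix_tokens * read_ratio) + new_tokens
    # After a break, the whole history is processed again at full price.
    return prefix_tokens + new_tokens


history = 150_000  # hypothetical context accumulated by message 16
message = 2_000    # hypothetical size of the new message

warm = turn_cost(history, message, cache_valid=True)
cold = turn_cost(history, message, cache_valid=False)
print(warm, cold)  # 17000 152000
```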

The cache lifetime (TTL) is a common source of misunderstanding:

Claude Code subscription: default TTL is 1 hour.
Claude API: default TTL is 5 minutes. You can pay more to extend it to 1 hour.
Sub-agents on any plan: always 5 minutes.
Claude.ai web chat: not officially documented. Possibly the same as the subscription, but unconfirmed.

Months ago, many complained that Claude quota was consumed too quickly. Some thought Anthropic secretly lowered TTL from 1 hour to 5 minutes without notice. But that’s not true; Claude Code’s TTL remains 1 hour.

The confusion comes from the fact that Claude Code and the API are documented separately, and they are fundamentally different systems.

If you run many Sub-agent workflows or use the API directly, the 5-minute figure matters. But for 95% of Claude Code users, the critical window is the 1 hour.
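
The TTL figures quoted in this article can be kept in a small lookup table to check whether a pause has outlived the cache. This is a sketch only: the dictionary keys are hypothetical labels, and Claude.ai web chat is left out because its TTL is unconfirmed.

```python
# TTL values as stated in the article; the keys are made-up labels for illustration.
TTL_SECONDS = {
    "claude_code_subscription": 60 * 60,
    "api_default": 5 * 60,
    "sub_agent": 5 * 60,
}


def cache_expired(context: str, idle_seconds: float) -> bool:
    """True if the idle time is longer than the TTL assumed for this context."""
    return idle_seconds > TTL_SECONDS[context]


twenty_minutes = 20 * 60
print(cache_expired("claude_code_subscription", twenty_minutes))  # False
print(cache_expired("sub_agent", twenty_minutes))                 # True
```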

Here are the parts I find most useful in daily use:

If you’ve been idle for over an hour, previous content has mostly expired from cache. Your next message will rebuild the cache. In this case, instead of continuing an "expired" old session, it’s often cheaper to do a clear handoff and start fresh.

/compact and /clear always break the cache, so the moment when the cache has to be rebuilt anyway is the best time to run them.

Practical tip: Session Handoff saves money compared to /compact

I made a session handoff skill to replace /compact. It summarizes what’s been done, pending decisions, important files, and where to continue. Then I run /clear, paste the summary, and continue as if nothing was interrupted.
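
The article does not show the skill itself, so the following is not the author's implementation. It is a hypothetical sketch of the four pieces of information such a summary carries, rendered as plain text that could be pasted after /clear. All class, field, and sample names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Hypothetical container for the four things the summary should carry."""
    done: list[str] = field(default_factory=list)
    pending_decisions: list[str] = field(default_factory=list)
    important_files: list[str] = field(default_factory=list)
    resume_at: str = ""

    def render(self) -> str:
        """Format the summary as plain text to paste into a fresh session."""
        sections = [
            ("Done so far", self.done),
            ("Pending decisions", self.pending_decisions),
            ("Important files", self.important_files),
            ("Continue from", [self.resume_at] if self.resume_at else []),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items or ["(none)"])
        return "\n".join(lines)


handoff = Handoff(
    done=["Added login endpoint"],
    pending_decisions=["Session store: Redis or Postgres?"],
    important_files=["src/auth/login.py"],
    resume_at="Write tests for the login endpoint",
)
print(handoff.render())
```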

The /compact command can sometimes be slow. This handoff skill usually completes in less than a minute.

Claude.ai’s cache mechanism isn’t fully documented officially, but Projects clearly use different optimization strategies than regular conversation threads. So, if you want to paste large files, it’s better to put them in a Project rather than directly into the conversation.

Certain actions can rebuild the cache without obvious warning:

Model switching: cache relies on prefix matching, and each model has its own cache. Switching models causes the next request to miss cache entirely and re-read the full history.
"Opus plan" mode: this setting uses Opus during planning and Sonnet during execution. I recommended it in some token optimization videos, and there’s a reason. But understand that each plan switch is essentially a model switch, which means cache rebuild. Long-term, it still helps extend quota, but knowing what’s happening under the hood is important.

Editing CLAUDE.md mid-conversation is okay: changes won’t take effect immediately, only after restart. So current cache isn’t affected.
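
Why a model switch misses the cache can be shown with a toy cache that keeps one prefix per model. This is a simplified sketch under the article's description that each model has its own cache. The class is hypothetical, and the model names are plain labels rather than real model identifiers.

```python
class PerModelCache:
    """Toy cache that stores one cached prefix per model name."""

    def __init__(self) -> None:
        self._prefixes: dict[str, list[str]] = {}

    def request(self, model: str, blocks: list[str]) -> int:
        """Return how many leading blocks were served from this model's cache."""
        cached = self._prefixes.get(model, [])
        hit = 0
        for old, new in zip(cached, blocks):
            if old != new:
                break
            hit += 1
        self._prefixes[model] = list(blocks)
        return hit


cache = PerModelCache()
history = ["system", "CLAUDE.md", "user-1"]
print(cache.request("opus", history))                   # 0, nothing cached yet
print(cache.request("opus", history + ["user-2"]))      # 3, prefix reused
print(cache.request("sonnet", history + ["user-2"]))    # 0, other model has no cache
```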

Engineer-made dashboard: view Cache Read and Create

The screenshot I showed earlier comes from a token dashboard.

https://github.com/nateherkai/token-dashboard
This is a simple GitHub repo. Give the link to Claude Code and it sets the dashboard up on localhost, reading all your past sessions rather than starting from scratch. You can see daily input, output, cache create, and cache read figures.

But note: this dashboard tracks tokens on your local device. Switching from desktop to laptop will cause discrepancies. Each device has its own stats.
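
The kind of daily aggregation such a dashboard performs can be sketched as follows. This is not code from the token-dashboard repo. The record layout (a timestamp plus a usage dictionary) is an assumption, the field names follow the usage fields of the Anthropic Messages API, and the sample records are invented.

```python
from collections import defaultdict

FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def daily_totals(records: list[dict]) -> dict[str, dict[str, int]]:
    """Sum the four token counters per calendar day (YYYY-MM-DD)."""
    totals: dict[str, dict[str, int]] = defaultdict(lambda: dict.fromkeys(FIELDS, 0))
    for record in records:
        day = record["timestamp"][:10]  # assumes ISO 8601 timestamps
        usage = record.get("usage", {})
        for name in FIELDS:
            totals[day][name] += usage.get(name, 0)
    return dict(totals)


# Invented sample records, for illustration only.
sample = [
    {"timestamp": "2026-05-20T09:00:00Z",
     "usage": {"input_tokens": 120, "output_tokens": 400,
               "cache_creation_input_tokens": 9000, "cache_read_input_tokens": 0}},
    {"timestamp": "2026-05-20T09:02:00Z",
     "usage": {"input_tokens": 80, "output_tokens": 350,
               "cache_creation_input_tokens": 500, "cache_read_input_tokens": 9000}},
]
print(daily_totals(sample)["2026-05-20"]["cache_read_input_tokens"])  # 9000
```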

Prompt caching is a deep topic. Thariq’s article covers it more comprehensively. If you want the full picture, it’s worth reading.

But you don’t need to understand every detail to benefit. Just grasp the key 80/20: cache tokens cost 10% of normal tokens; Claude Code TTL is 1 hour; model switching breaks cache; clear handoff is usually more cost-effective than letting an old session "expire" and continuing.
