Claude Code Money-Saving Tips: Engineer Saves 300 Million Tokens in a Week with Caching; the Secret Is Not Breaking the Cache
Running out of quota in long Claude Code conversations? Engineer Nate Herk reveals how caching saved him 300 million tokens in a week, up to 91 million in a single day. The key is not how much code you write, but how to avoid "breaking" the cache so that repeated context does not waste your quota.
Many developers find that when writing code with Claude Code, the biggest headache is often the rapid depletion of token quotas, making long conversations almost a luxury.
But influencer Nate Herk, who often shares AI usage tips in the community, revealed in a post on X that the real cost killer is not the amount of code but whether the system makes effective use of prompt caching. He personally saved over 300 million tokens in a week, with a peak of 91 million cached tokens in a single day. Since cached tokens cost only 10% of regular input tokens, that peak day was billed like roughly 9 million tokens, which extends the conversation lifespan almost for "free".
This week I saved more than 300 million tokens, including 91 million in a single day.
I didn’t change any settings. This is just prompt caching working normally in the background.
But once I truly understood what the cache is and how to avoid "breaking" it, I could keep conversations going longer within the same quota. So here is an 80/20 beginner’s guide to Claude Code prompt caching, without deep API-level details.
The cost of cache tokens is only 10% of regular input tokens. 91 million cache tokens roughly equate to 9 million tokens billed.
Claude Code subscription TTL is 1 hour; API default is 5 minutes; Sub-agent always 5 minutes.
Cache is divided into three layers: system layer, project layer, conversation layer.
Switching models mid-conversation can break the cache, including turning on "opus plan" mode.
Caching costs only 10%: 91 million cached tokens are billed like 9 million
Every cached token costs only 10% of a regular input token.
So when my dashboard shows 91 million tokens hitting the cache on a given day, the actual bill is roughly equivalent to processing 9 million tokens. This is why, compared with having no cache, extending a long Claude Code session feels almost "free".
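The arithmetic above can be sketched in a few lines. This is only an illustration of the 10% figure quoted in the article; the function name and the integer-percent parameter are hypothetical, and actual billing may include other items.

```python
# Assumption taken from the article: a cached (read) token is billed
# at 10% of a regular input token. Expressed as an integer percent
# so the result stays exact.
CACHE_READ_PERCENT = 10

def billed_equivalent(cache_read_tokens: int, percent: int = CACHE_READ_PERCENT) -> int:
    """Return how many regular input tokens the cached tokens are billed like."""
    return cache_read_tokens * percent // 100

daily_cached = 91_000_000
print(f"{billed_equivalent(daily_cached):,}")  # 9,100,000
```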
Two numbers in the dashboard are worth paying attention to:
Cache create: the one-time cost when writing content into cache. It takes effect in the next conversation.
Cache read: tokens reused from cache by Claude, such as your CLAUDE.md, tool definitions, previous messages, etc. Compared to reprocessing as input, this costs 10 times less.
If your Cache read number is high, it indicates effective cache utilization; if low, you’re paying repeatedly for the same context.
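One rough way to read those two numbers is as a share of all input tokens. The sketch below is an illustrative metric only: the parameter names are hypothetical and this is not necessarily how the dashboard or Anthropic defines hit rate.

```python
def cache_read_share(cache_read: int, cache_create: int, uncached_input: int) -> float:
    """Fraction of all input tokens that were served from cache.

    All three arguments are hypothetical counters for one day or session.
    """
    total = cache_read + cache_create + uncached_input
    if total == 0:
        return 0.0  # nothing processed yet, so nothing was reused
    return cache_read / total

# A high share means most context was reused; a low share means the
# same context is being paid for again and again.
print(cache_read_share(cache_read=90, cache_create=5, uncached_input=5))  # 0.9
```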
Anthropic’s Thariq said something very memorable: "We actually monitor prompt cache hit rates, and if the hit rate drops too low, we trigger alerts or even declare SEV-level incidents."
He also wrote a very good X article. When cache hit rate is high, four things happen simultaneously: Claude Code feels faster, Anthropic’s service costs decrease, your subscription quota lasts longer, and long-term coding sessions become more feasible.
But if the hit rate is low, everyone suffers.
Three-layer architecture: system, project, conversation, stacking layer by layer
Thus, the incentives are aligned: Anthropic wants higher cache hit rates, and you do too. The real drag comes from small habits that seem insignificant but quietly rebuild the cache.
Cache relies on prefix matching, meaning "matching the beginning of the string."
There is no need to dive into deep technical details. Just understand this: as long as the content before a certain point exactly matches what is cached, Claude can reuse those tokens.
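As a toy illustration of prefix matching, the sketch below counts how many leading segments of a request match what was cached. It is a simplification under stated assumptions: real caching works on tokens rather than on whole strings like these, and the function name is hypothetical.

```python
from typing import Sequence

def reusable_prefix_len(cached: Sequence[str], request: Sequence[str]) -> int:
    """Count leading segments that are identical in both sequences."""
    matched = 0
    for old, new in zip(cached, request):
        if old != new:
            break  # the first difference ends the reusable prefix
        matched += 1
    return matched

cached = ["system", "CLAUDE.md", "msg 1", "reply 1"]

# Appending a new message keeps the whole cached prefix reusable.
print(reusable_prefix_len(cached, cached + ["msg 2"]))  # 4

# Changing the very first segment leaves nothing to reuse.
print(reusable_prefix_len(cached, ["system v2"] + cached[1:] + ["msg 2"]))  # 0
```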
According to the Claude Code documentation, a new session usually proceeds as follows:
First round: no cache yet. System prompt, your project context (like CLAUDE.md, memory, rules), and your first message are processed anew and written into cache.
Second round: all content from the first round is now cached. Claude only needs to process its latest reply and your next message. Costs are much lower.
Third round: same logic. The previous dialogue remains cached, and only the latest interaction needs to be processed.
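The three rounds above can be simulated with a small model. This is a sketch with made-up token counts; it assumes the cache never expires or breaks during the session, and the function name is hypothetical.

```python
def simulate_session(base_tokens: int, turn_tokens: list[int]) -> list[tuple[int, int]]:
    """Return (cache_read, freshly_processed) token counts for each round.

    base_tokens: system prompt plus project context, processed in round 1.
    turn_tokens: new tokens added in each round (messages and replies).
    """
    rows = []
    cached = 0
    for index, new in enumerate(turn_tokens):
        fresh = new + (base_tokens if index == 0 else 0)
        rows.append((cached, fresh))
        cached += fresh  # everything processed so far becomes reusable
    return rows

for round_no, (read, fresh) in enumerate(simulate_session(20_000, [500, 800, 600]), start=1):
    print(f"round {round_no}: read from cache={read}, processed fresh={fresh}")
# round 1: read from cache=0, processed fresh=20500
# round 2: read from cache=20500, processed fresh=800
# round 3: read from cache=21300, processed fresh=600
```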
Most common "break" trap: model switching and 1-hour gaps
Cache itself can be divided into three layers:
From Thariq’s X article:
System layer: includes core instructions, tool definitions (read, write, bash, grep, glob), and output styles. This layer is globally cached.
Project layer: includes CLAUDE.md, memory, project rules. Cached per project.
Conversation layer: includes replies and messages, growing with each turn.
If the system or project layer content changes during a session, everything must be re-cached from scratch. This is the most "expensive" operation. Imagine you are at message 16 and suddenly change the system prompt, or pause for an hour: all tokens from message 1 onward must be reprocessed.
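To see why a broken cache is so expensive, compare the cost of one turn with and without a valid cache. The sketch below uses the article's 10% figure, invents the token counts, and deliberately ignores any extra charge for writing to the cache; the function name is hypothetical.

```python
CACHE_READ_PERCENT = 10  # from the article: cached tokens cost 10% of regular input

def turn_cost(history_tokens: int, new_tokens: int, cache_valid: bool) -> int:
    """Cost of one turn, in regular-input-token equivalents.

    Simplification: cache-write charges are not modeled.
    """
    if cache_valid:
        return history_tokens * CACHE_READ_PERCENT // 100 + new_tokens
    return history_tokens + new_tokens  # whole history reprocessed at full price

history = 150_000  # illustrative size of the conversation at message 16
print(turn_cost(history, 1_000, cache_valid=True))   # 16000
print(turn_cost(history, 1_000, cache_valid=False))  # 151000
```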
How long the cache lasts (its TTL) is a common source of misunderstanding:
Claude Code subscription: default TTL is 1 hour.
Engineer-made dashboard: view Cache Read and Create
Claude API: default TTL is 5 minutes. You can pay more to extend it to 1 hour.
Any plan’s Sub-agent: always 5 minutes.
Claude.ai web chat: not officially documented. Possibly the same as the subscription, but unconfirmed.
Months ago, many complained that Claude quota was consumed too quickly. Some thought Anthropic secretly lowered TTL from 1 hour to 5 minutes without notice. But that’s not true; Claude Code’s TTL remains 1 hour.
The confusion comes from the fact that Claude Code and the API have separate documentation, and they are fundamentally different systems.
If you run many Sub-agent workflows or use the API directly, the 5-minute figure matters. But for 95% of Claude Code users, the 1-hour window is the one that counts.
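The TTL figures listed above can be put into a small lookup to estimate whether a cache is still warm. This is a minimal sketch: the dictionary keys and the function name are hypothetical, and it assumes the TTL is measured from the most recent request.

```python
from datetime import datetime, timedelta

# TTL values as quoted in the article; the keys are made-up labels.
CACHE_TTL = {
    "claude_code_subscription": timedelta(hours=1),
    "api_default": timedelta(minutes=5),
    "sub_agent": timedelta(minutes=5),
}

def cache_likely_expired(last_request: datetime, now: datetime, context: str) -> bool:
    """True if more time than the TTL has passed since the last request."""
    return now - last_request > CACHE_TTL[context]

last = datetime(2025, 1, 1, 12, 0)
after_break = last + timedelta(minutes=30)
print(cache_likely_expired(last, after_break, "claude_code_subscription"))  # False
print(cache_likely_expired(last, after_break, "sub_agent"))                 # True
```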
Here are the parts I find most useful in daily use:
If you’ve been idle for over an hour, previous content has mostly expired from cache. Your next message will rebuild the cache. In this case, instead of continuing an "expired" old session, it’s often cheaper to do a clear handoff and start fresh.
/compact and /clear always break the cache, so the moment when the cache has already expired is the best time to run them and rebuild it.
Practical tip: Session Handoff saves money compared to /compact
I made a session handoff skill to replace /compact. It summarizes what’s been done, pending decisions, important files, and where to continue. Then I run /clear, paste the summary, and continue as if nothing was interrupted.
The /compact command can sometimes be slow. This handoff skill usually completes in less than a minute.
Claude.ai’s cache mechanism isn’t fully documented officially, but Projects clearly use different optimization strategies than regular conversation threads. So, if you want to paste large files, it’s better to put them in a Project rather than directly into the conversation.
Certain actions can rebuild the cache without obvious warning:
Model switching: cache relies on prefix matching, and each model has its own cache. Switching models causes the next request to miss cache entirely and re-read the full history.
"Opus plan" mode: this setting uses Opus during planning and Sonnet during execution. I recommended it in some token optimization videos, and there’s a reason. But understand that each plan switch is essentially a model switch, which means cache rebuild. Long-term, it still helps extend quota, but knowing what’s happening under the hood is important.
Editing CLAUDE.md mid-conversation is okay: changes won’t take effect immediately, only after restart. So current cache isn’t affected.
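The model-switching effect can be illustrated with a toy cache that keeps a separate entry per model. This is a sketch of the idea only, not how the real service stores anything; the class name and the model labels are hypothetical.

```python
class ToyPromptCache:
    """Remembers the last message list seen for each model."""

    def __init__(self) -> None:
        self._cached: dict[str, list[str]] = {}

    def request(self, model: str, messages: list[str]) -> int:
        """Return how many leading messages were reused from that model's cache."""
        previous = self._cached.get(model, [])
        hits = 0
        for old, new in zip(previous, messages):
            if old != new:
                break
            hits += 1
        self._cached[model] = list(messages)  # this request becomes the new cache
        return hits

cache = ToyPromptCache()
history = ["system", "CLAUDE.md", "msg 1", "reply 1"]
print(cache.request("model-a", history))              # 0: first request, nothing cached
print(cache.request("model-a", history + ["msg 2"]))  # 4: same model, prefix reused
print(cache.request("model-b", history + ["msg 2"]))  # 0: other model, full re-read
```

In this toy model, switching back to "model-a" afterwards would again hit that model's own entry, but only if it has not expired in the meantime.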
The screenshot I showed earlier comes from a token dashboard.