AI coding tools have become part of everyday work. But if you've spent any time with them, you've probably noticed that the longer a conversation runs, the slower it gets—and the more your token costs balloon. Looking at how the team behind Anthropic's Claude Code tackled this problem reveals a few insights into using AI tools far more efficiently.
Prompt Caching: The Hidden Engine of AI Tools
Prompt caching is a technique for reusing work that was already processed in an earlier request. Much like a web browser loading a page you've already visited, an AI model skips recomputing content it has seen before and draws on the stored result instead. The key is that the API compares a request from the very beginning, in order, and reuses the cache up to the point where the two requests still match.
The catch is that if even a small piece near the front changes, everything after it is invalidated. Without understanding this behavior, you can end up getting no benefit from caching at all.
Structure Is What Determines Performance
Put What Never Changes First, What Changes Often Last
Claude Code's actual structure makes this principle plain. The fixed system prompt and tool definitions sit right at the front, forming a global cache; project settings come next, followed by session context. Only at the very end does each new conversation message get appended.
In a setup like this, something as small as dropping a timestamp into the system prompt or reordering the tools can wipe out the entire cache. It runs against intuition, but from a caching standpoint it's an absolute rule.
Handle Updates Through Messages
When the time changes or a file gets edited, it's tempting to update the system prompt. But doing so invalidates the cache. Instead, the approach is to append the updated information to the next user message as a system reminder. It's a clever way to preserve the cache while still feeding the model the latest information.
The Traps of Managing Tools and Models
Don't Touch the Tool List
Tool definitions are part of the cache structure too. Adding or removing even a single one invalidates the whole cache. In Claude Code, the team wanted plan mode to keep only read-only tools available—but instead of stripping the list down, they turned EnterPlanMode and ExitPlanMode into tools of their own, so the tool list itself always stays identical.
When there are dozens of tools, rather than removing the ones you aren't using, the approach is to leave only a lightweight stub with defer_loading: true and pull in the full schema with ToolSearch when it's actually needed.
The Hidden Cost of Switching Models
Caches are maintained separately per model. If you switch to a lighter model in the middle of a long conversation just to handle one simple question, you have to build the cache for that new model from scratch—which actually ends up costing more. When a model switch is necessary, the recommended approach is the sub-agent pattern, in which the main model hands off only the work that's needed to a sub-model, summarized separately.
Optimization Strategies in Practice
Context Compaction Has to Account for the Cache Too
When a conversation runs long and the context window fills up, you summarize the conversation and start a fresh session. If you generate that summary with a separate API call, you get no use out of the cache at all. Instead, by reusing the parent conversation's exact system prompt, tools, and message history and simply appending the compaction request at the very end, you can carry over nearly all of the existing cache.
Treat Cache Hit Rate as a Core Metric
The Anthropic team takes cache hit rate seriously enough to declare an incident when it drops. A difference of just a few percentage points has a dramatic effect on both cost and speed. As a developer, you should monitor cache hit rate continuously, the same way you'd watch system uptime.
A New Way of Thinking About AI Tools
Once you understand prompt caching, the way you converse with AI tools changes. You get into the habit of setting your system prompt and tool configuration once and then leaving them alone as much as possible, while passing along any changing information through conversation messages instead. And whenever you switch models or add a new capability, you start by considering the impact it will have on the cache.
In the end, prompt caching isn't just an optimization trick—it's a window into how AI tools are structured and how they work. Keep in mind the principle of designing the entire system so that the front is never touched, and faster, more cost-efficient AI becomes well within reach.




