This guide collects patterns that have proven out in the Context Mode project and in community discussion.
___________
Every time an AI coding agent calls a tool, raw data piles up in the context window. A Playwright snapshot runs 56KB; twenty GitHub issues, 59KB; an access log, 45KB. At that rate, half an hour can burn through 40% of a 200K-token budget. Tool definitions add their own load: install just five or six tools and you pull in more than 80 tool definitions, enough that 72% of that budget can already be spent before the first message is even sent.
Output Isolation: The Sandbox Principle
Don't drop a tool call's raw output straight into context. Run each call in an isolated subprocess and pass only stdout back. Byproducts — logs, raw API responses, DOM snapshots — stay quarantined inside the sandbox.
The rule of thumb: once a tool's output exceeds 1KB, route it through the sandbox by default. That said, sending a 200-byte call like `curl api.example.com/health` through the sandbox is overkill. Keep a bypass path open based on output size, but the principle is “isolate by default, go direct only as the exception.” Tune the threshold to fit your project.
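The size-based routing described above might look roughly like the sketch below. The function name `call_tool`, the temp-file hand-off, and the pointer format are illustrative assumptions, not the Context Mode implementation; only the 1KB threshold comes from the text.

```python
import subprocess
import sys
import tempfile

THRESHOLD_BYTES = 1024  # the 1KB rule of thumb from the text; tune per project


def call_tool(cmd, threshold=THRESHOLD_BYTES, timeout=60):
    """Run cmd in a child process and decide what reaches the context.

    stderr is captured and discarded here, so only stdout is considered.
    (call_tool is a hypothetical helper, not a Context Mode API.)
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    out = proc.stdout
    size = len(out.encode("utf-8"))
    if size <= threshold:
        return out  # exception path: small enough to go in directly
    # Default path: keep the raw output on disk and return a short pointer.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".log", delete=False, encoding="utf-8"
    ) as f:
        f.write(out)
    first_line = out.splitlines()[0][:200]
    return f"[{size} bytes saved to {f.name}] first line: {first_line}"


if __name__ == "__main__":
    # A 5000-character output is replaced by a one-line pointer.
    print(call_tool([sys.executable, "-c", "print('x' * 5000)"]))
```

Because the size is only known after the call has run, this sketch always executes in the subprocess and applies the bypass afterwards, which keeps "isolate by default" as the single code path.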
Knowledge-Base Search: Go Hybrid
Pure keyword search isn't enough, because tool output is a mix of structured data — JSON, tables — and natural language like error messages and comments.
Keyword search uses SQLite FTS5 with BM25 ranking and Porter stemming; it's strong at matching exact identifiers, function names, and error codes. Semantic search leans on Model2Vec or a lightweight embedding model plus sqlite-vec, handling similarity-based lookups like “contexts similar to this error.” Result fusion merges the two with Reciprocal Rank Fusion, playing to each method's strengths while covering the other's blind spots.
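As a minimal sketch of the fusion step, Reciprocal Rank Fusion can be written in a few lines. The two input rankings stand in for the FTS5 and vector-search results; the document IDs are made up, and `k=60` is the constant commonly used with RRF rather than a value taken from the project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for stability.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["a", "b", "c"]   # hypothetical BM25 order
    semantic_hits = ["b", "d", "a"]  # hypothetical embedding order
    print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
    # → ['b', 'a', 'd', 'c']
```

RRF uses only rank positions, so the BM25 scores and vector distances never need to be normalized onto a common scale.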
Indexing principles matter too. Split markdown at heading boundaries, and keep code blocks intact rather than breaking them apart. An `--incremental` approach re-indexes only the chunks that changed; the benchmark is four minutes for a full re-index of all 15,800 files and under ten seconds for a routine incremental update.
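The two indexing rules above can be sketched as follows, assuming fenced code blocks and ATX-style `#` headings. The function names and the hash-based change check are illustrative assumptions; the text does not say how `--incremental` detects changed chunks.

```python
import hashlib
import re

FENCE = "`" * 3                   # code-fence marker, built to avoid nesting
HEADING = re.compile(r"#{1,6}\s")  # ATX heading at the start of a line


def chunk_markdown(text):
    """Split at heading boundaries, never inside a fenced code block."""
    chunks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        elif HEADING.match(line) and not in_fence and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def changed_chunks(chunks, seen_hashes):
    """Return chunks whose SHA-256 is new, recording them in seen_hashes."""
    fresh = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            fresh.append(chunk)
    return fresh
```

On a second pass over unchanged files, `changed_chunks` returns an empty list, so nothing is re-embedded or re-inserted.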
The Subagent Split Pattern
The most fundamental way to save context isn't compressing data — it's splitting up the work itself.
When a problem comes up, fork a subagent to solve it and bubble only the result back up to the parent context. Each subagent call runs as its own process, so it never pollutes the parent's memory.
Here's the structured report a subagent hands back to its parent:
The Four-Part Retrospective
Save tool output to a file and let the LLM read only the parts it needs. If a run ends in `"Success!"`, check just the last line; if it fails, extract only the error message.
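A minimal sketch of that read-only-what-you-need step, assuming the run's output has already been written to a log file. The helper name `digest_run` and the keyword list used to spot error lines are assumptions for illustration.

```python
import re
from pathlib import Path

# Assumed markers of an error line; adjust to the tool being wrapped.
ERROR_PATTERN = re.compile(r"error|exception|traceback|fatal", re.IGNORECASE)


def digest_run(log_path, max_errors=5):
    """Return the smallest useful slice of a saved tool log.

    On success only the final line is returned; on failure, the last few
    lines that look like errors (or the final line if none match).
    """
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    lines = text.splitlines()
    if not lines:
        return ""
    if lines[-1].strip() == "Success!":
        return lines[-1].strip()
    errors = [line for line in lines if ERROR_PATTERN.search(line)]
    return "\n".join(errors[-max_errors:]) or lines[-1]
```

The full log stays on disk, so the model can still ask for more of it if the digest turns out to be insufficient.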
Context Hygiene: Backtracking and Auto-Cleanup
Once a bug is fixed, you should be able to clear out the logs and failure traces that piled up during debugging.
Two mechanisms help. The first is automatic retry-pattern detection: if context holds traces of the same task being attempted several times, keep only the last successful version. The second is reference-count cleanup: evict log-like data from context after it has been referenced N times, or once the related task is finished. Treat context as a freely editable workspace, not a stack; you have to break out of a structure that only ever accumulates.
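Both cleanup rules can be sketched over a simple list of context entries. The `Entry` fields and function names here are hypothetical; a real agent would need its own way to label tasks, successes, and log-like data.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    task: str              # hypothetical task label
    text: str
    ok: bool = True        # did this attempt succeed?
    is_log: bool = False   # log-like data that is safe to evict
    refs: int = 0          # how many times it has been referenced


def collapse_retries(entries):
    """Keep only the last successful attempt of each task.

    Tasks with no successful attempt yet are left untouched.
    """
    last_ok = {}
    for index, entry in enumerate(entries):
        if entry.ok:
            last_ok[entry.task] = index
    return [
        entry
        for index, entry in enumerate(entries)
        if entry.task not in last_ok or last_ok[entry.task] == index
    ]


def evict_logs(entries, max_refs, finished_tasks=frozenset()):
    """Drop log entries referenced max_refs times or tied to finished tasks."""
    return [
        entry
        for entry in entries
        if not (
            entry.is_log
            and (entry.refs >= max_refs or entry.task in finished_tasks)
        )
    ]
```

Leaving unsolved tasks alone is a deliberate choice in this sketch: failure traces are still useful evidence until a fix actually lands.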
The Trade-Off Between Compression and Accuracy
Compressing 153 git commits down to 107 bytes is impressive, but the model can reach the information it needs only if it writes a flawless extraction script. One wrong command and the critical data is gone.
Compression is a means, not an end. Stretching a session to three hours means nothing if the reasoning quality at the two-hour mark doesn't hold. Always factor in the risk of hallucination from incomplete data or faulty extraction logic. The pattern to favor is “summarize, then drill down”: put only the summary in context at first, and keep a path open for the model to reach the original once it decides it needs the details.
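One way to keep the drill-down path open is to store the original under a short handle and put only the summary plus that handle in context. The `DrillDownStore` class below is a hypothetical in-memory sketch; the summary itself is supplied by the caller, since how it is produced is outside the scope of this example.

```python
import hashlib


class DrillDownStore:
    """Keeps originals out of context but reachable by handle."""

    def __init__(self):
        self._originals = {}

    def add(self, raw, summary):
        """Store raw text; return (handle, line to place in context)."""
        handle = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
        self._originals[handle] = raw
        return handle, f"{summary} [full text: handle {handle}]"

    def fetch(self, handle, start=0, end=None):
        """Return a slice of the original lines, for targeted drill-down."""
        lines = self._originals[handle].splitlines()
        return "\n".join(lines[start:end])


if __name__ == "__main__":
    store = DrillDownStore()
    handle, context_line = store.add("c1\nc2\nc3", "3 commits, all docs fixes")
    print(context_line)
    print(store.fetch(handle, 1, 2))  # prints "c2"
```

Since the handle is derived from the content, the same original always maps to the same handle, which also keeps the context line stable across runs.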
Mind the Cache Economics
When prompt caching works well, even a verbose context is nearly free in cost terms. But the moment compression breaks cache continuity, costs actually go up.
For a given query, the compressed output has to be deterministic. If it compresses into a different shape every time, the cached prefix stops matching and your cache hit rate falls. And even when the cache looks free, you still pay in degraded attention and slower processing: a cached prefix is cheaper to bill, but every new token still has to attend over all of it. The rule is to measure cache efficiency and compression efficiency together, and never optimize just one at the other's expense.
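A minimal sketch of a deterministic compression step, assuming the tool output is a list of JSON-like records. The field whitelist and function names are illustrative; the point is that key order, whitespace, and volatile fields such as timestamps are all pinned down so identical inputs yield identical bytes.

```python
import hashlib
import json


def compress_deterministic(records, fields):
    """Serialize only the whitelisted fields in a canonical form.

    Record order is preserved; key order and separators are fixed, and
    fields outside the whitelist (for example timestamps) are dropped.
    """
    slim = [{f: r[f] for f in fields if f in r} for r in records]
    return json.dumps(
        slim, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )


def fingerprint(text):
    """Hash used to check that two runs produced identical output."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    run1 = [{"b": 1, "a": 2, "ts": 123}]
    run2 = [{"ts": 999, "a": 2, "b": 1}]  # same data, new timestamp
    out1 = compress_deterministic(run1, ("a", "b"))
    out2 = compress_deterministic(run2, ("a", "b"))
    print(out1)  # [{"a":2,"b":1}]
    print(fingerprint(out1) == fingerprint(out2))  # True
```

Comparing fingerprints across repeated runs of the same query is a cheap regression check for anything that might break cache continuity.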
Minimize Tool Loading
Rethink whether you really need to load 80-plus tool definitions into context all at once.
Load only the tools each task needs, dynamically. Where a CLI app can stand in for an MCP server, the CLI uses far fewer tokens (think GitHub CLI vs. the GitHub MCP). Bloated tool definitions are a separate problem from output compression — and you have to solve both.
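Dynamic loading can be as simple as filtering a registry by what the task needs. The registry shape, tool names, and tags below are all invented for illustration and do not correspond to any real MCP server's definitions.

```python
import json

# Hypothetical registry: tool name -> definition with routing tags.
REGISTRY = {
    "git_log": {"tags": ["vcs"], "description": "Show commit history"},
    "open_issue": {"tags": ["vcs", "tracker"], "description": "File an issue"},
    "browser_snapshot": {"tags": ["web"], "description": "Capture a page"},
}


def select_tools(registry, task_tags):
    """Return only the definitions whose tags overlap the task's tags."""
    wanted = set(task_tags)
    return {
        name: spec
        for name, spec in registry.items()
        if wanted & set(spec["tags"])
    }


def definition_bytes(tools):
    """Rough size of the definitions as they would be serialized."""
    return len(json.dumps(tools, sort_keys=True).encode("utf-8"))


if __name__ == "__main__":
    selected = select_tools(REGISTRY, {"web"})
    print(sorted(selected))  # ['browser_snapshot']
    print(definition_bytes(selected), "of", definition_bytes(REGISTRY), "bytes")
```

Byte size is only a proxy for token count, but it is enough to compare a filtered tool set against the full one.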
No Optimization Without Measurement
You need an instrumentation layer that visualizes and tracks context consumption.
The metrics worth tracking:
AI Token-Usage Metrics
One example tool parses `~/.claude/projects/*/*.jsonl` to break down cost by session, tool, and timeline.
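A per-session breakdown over those files might start like the sketch below. The record layout is an assumption: it expects each JSONL line to be an object with an optional `sessionId` and a `message.usage` object of integer counters. Check a real transcript before relying on these field names.

```python
import json
from collections import Counter, defaultdict
from pathlib import Path


def usage_by_session(root):
    """Sum integer usage counters per session under root/*/*.jsonl.

    Assumed (unverified) layout: {"sessionId": ..., "message": {"usage": {...}}}.
    Lines that are not valid JSON or lack a usage object are skipped.
    """
    totals = defaultdict(Counter)
    for path in sorted(Path(root).expanduser().glob("*/*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue  # partial or corrupt line
                if not isinstance(rec, dict):
                    continue
                message = rec.get("message")
                usage = message.get("usage") if isinstance(message, dict) else None
                if not isinstance(usage, dict):
                    continue
                session = rec.get("sessionId", path.stem)
                for key, value in usage.items():
                    if isinstance(value, int) and not isinstance(value, bool):
                        totals[session][key] += value
    return {session: dict(counts) for session, counts in totals.items()}


if __name__ == "__main__":
    for session, counts in usage_by_session("~/.claude/projects").items():
        print(session, counts)
```

Breaking the same totals down by tool or by time would need further fields from each record, which this sketch deliberately does not guess at.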
In Summary: A Decision Checklist
If a tool's output exceeds 1KB, isolate it in the sandbox and pass back only stdout; under 1KB, put it directly into context. For search over a mix of structured data and natural language, go hybrid. Fork any independently solvable subtask to a subagent and take back only the result.
Once debugging is done, backtrack to clear the leftover logs from context. For bulk data you might still need details from, summarize first and secure a drill-down path. When your tool count passes 80, look at dynamic loading or CLI replacements — and when you validate an optimization, measure token usage, cache hit rate, and reasoning quality all at once.
This guide is based on an analysis of the Context Mode project, current as of February 2026.



