If you’ve ever seen Claude’s “usage limit reached” message mid-task, you know the pain. You’re deep in a coding session, context is perfect, and suddenly — you’re locked out for hours.
Here’s the thing most people don’t realize: Claude doesn’t count messages. It counts tokens. And every single reply requires re-reading your entire conversation history. That means the cost of message #30 is roughly 31× the cost of message #1.
One developer tracked their actual usage and found that 98.5% of tokens went to re-reading history — only 1.5% generated new output. Let that sink in: for every 1,000 tokens of useful output, you’re burning 65,000 tokens on re-reading old messages.
This guide covers everything I’ve learned about making tokens last: conversation habits that cost nothing to adopt, and open-source tools that automate the savings.
Part 1: Conversation Habits (Zero Cost, Immediate Impact)
These work whether you’re using Claude’s web app, desktop app, or API.
1. Edit Your Prompt Instead of Adding Messages
Picture this: you ask Claude to write a Python function, and it uses requests instead of httpx. Your instinct? Send a follow-up:
You: "Actually, use httpx instead of requests"
That one “correction” just doubled your token cost for the next turn — because Claude now has to re-read the original question, the wrong answer, and your correction before generating the fix.
What to do instead: Click the edit button (pencil icon) on your original message, change “write a function” to “write a function using httpx,” and hit regenerate. The wrong answer disappears entirely from history. Zero bloat.
Here’s why this matters at scale:
| Messages in Chat | Cumulative Tokens (at ~500/turn) | Cost of Next Reply |
|---|---|---|
| 5 | ~7,500 | Reading 5 messages |
| 10 | ~27,500 | Reading 10 messages |
| 20 | ~105,000 | Reading 20 messages |
| 30 | ~232,000 | Reading 30 messages |
By message #30, every single reply costs 31× what it cost at the start. Editing instead of appending keeps you at the lower end of this curve.
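The quadratic growth behind that table is easy to verify yourself. Here's a toy calculation (my own illustration, assuming a flat ~500 tokens per message as in the table):

```python
def cumulative_context(n_messages, tokens_per_message=500):
    """Total tokens re-read across a conversation: the reply to
    message k re-reads all k messages so far, so the running total
    grows quadratically, not linearly."""
    return sum(k * tokens_per_message for k in range(1, n_messages + 1))

for n in (5, 10, 20, 30):
    print(n, cumulative_context(n))  # matches the table above
```

Doubling the conversation length roughly quadruples the cumulative cost, which is why editing in place beats appending.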
2. Start Fresh Every 15-20 Messages
I used to have marathon Claude sessions — 100+ messages deep. By the end, responses were slow, expensive, and honestly worse in quality (the model struggles with very long context, too).
The ritual that changed everything:
You: "Summarize our conversation so far — the key decisions made,
current code state, and what we're working on next."
[Claude outputs a ~300 token summary]
[Open new chat]
You: "Continuing from a previous session. Here's where we left off:
[paste summary]
Next task: implement the caching layer we discussed."
The new chat starts at ~500 tokens instead of 50,000. Same context. 100× cheaper per reply. I do this almost instinctively now, and it’s the single biggest change in how long my quota lasts.
3. Batch Multiple Questions Into One Message
This one seems obvious, but I catch myself doing it wrong constantly. Here’s the real cost:
❌ Three separate messages (3 full context loads):
"Summarize this article" → loads 5,000 tokens of history
"List the key points" → loads 6,000 tokens of history
"Suggest a title" → loads 7,000 tokens of history
Total context loaded: 18,000 tokens
✅ One combined message (1 context load):
"Summarize this article, list the key points,
and suggest a title." → loads 5,000 tokens of history
Total context loaded: 5,000 tokens
That’s a 72% reduction just from combining three questions. And here’s the surprising bonus: Claude usually gives better answers when it sees all three requests at once, because it can make the summary, points, and title work together coherently.
4. Use Projects to Cache Repeated Files
I was uploading the same 15-page API spec to every new conversation for two weeks before I learned about Projects. Each upload cost me ~4,000 tokens of parsing — and I was creating 3-4 new chats per day.
The math: 4,000 tokens × 4 chats × 14 days = 224,000 tokens wasted on re-uploading the same unchanged document.
How Projects solve this (step by step):
- Open claude.ai → click “Projects” in the left sidebar → “Create Project”
- Give it a name (e.g., “MyApp Backend”) and optionally add a system prompt describing the project context
- Click “Add content” → upload your files (PDFs, code files, docs — up to 50 files per project)
- Now every conversation you create inside this project automatically has access to those files
Without Projects:
Chat 1: Upload API spec (4,000 tokens) → Ask question
Chat 2: Upload API spec again (4,000 tokens) → Ask question
Chat 3: Upload API spec again (4,000 tokens) → Ask question
Total: 12,000 tokens on the same file
With Projects:
Project: Upload API spec once (4,000 tokens, cached by Anthropic)
Chat 1: Ask question → API spec loaded from cache (minimal cost)
Chat 2: Ask question → API spec loaded from cache (minimal cost)
Chat 3: Ask question → API spec loaded from cache (minimal cost)
Total: ~4,000 tokens + negligible cache reads
The key insight: files uploaded to a Project are tokenized once and cached server-side by Anthropic. Subsequent conversations reference the cached version, so you pay a fraction of the original parsing cost. The files stay available until you remove them — no need to re-upload even if you close and reopen the browser.
What belongs in a Project:
- API specifications and technical docs you reference daily
- Brand style guides and coding standards
- Contract templates or legal boilerplate
- Any file over 2,000 tokens that you use more than twice
5. Set Up Memory and Preferences
Without saved preferences, every new conversation starts like this:
You: "I'm a full-stack developer working primarily in TypeScript."
You: "I prefer concise responses with code examples."
You: "Please use ES modules, not CommonJS."
You: "Keep explanations brief — I understand the basics."
That’s 4 messages × ~100 tokens each = 400 tokens of “warm-up” every single time you open a new chat. Over a month of daily use? That’s 12,000+ tokens spent telling Claude the same thing it already knew.
Go to Settings → Memory & Preferences and save these once. Claude loads them silently at the start of every conversation — no messages, no token cost to you.
6. Turn Off Features You’re Not Using
This one is sneaky. Claude’s web app has several toggles that consume tokens in the background:
- Web search: Even when Claude doesn’t search, the capability to search adds system prompt tokens
- Extended Thinking: Powerful for math proofs and complex logic. Absolutely unnecessary for “rewrite this email in a friendlier tone”
- Connectors: Each enabled connector adds tool definitions to the context
My rule: Before starting a task, I check what’s enabled and disable anything I won’t need. It’s like closing browser tabs — a small act that prevents a slow drain.
7. Match the Model to the Task
I spent weeks using Sonnet for everything, including tasks like “fix this typo” and “add a comma after line 12.” That’s like renting a moving truck to pick up groceries.
Here’s the cheat sheet I actually use now:
| Task | Best Model | Why | Token Savings vs. Sonnet |
|---|---|---|---|
| Grammar check, reformatting, translation | Haiku | Pure pattern matching — doesn’t need reasoning | 50-70% cheaper |
| ”Rewrite this in a friendlier tone” | Haiku | Simple rewriting, no deep understanding needed | 50-70% cheaper |
| Writing code, answering technical questions | Sonnet | Good balance of speed and intelligence | Baseline |
| Code review, refactoring suggestions | Sonnet | Needs context but not deep reasoning | Baseline |
| Architecture design, complex debugging | Opus | Multi-step reasoning, worth the premium | 3-5× more expensive |
| Mathematical proofs, research analysis | Opus | Needs Extended Thinking for best results | 3-5× more expensive |
How to switch models: In Claude’s web/desktop app, click the model name at the top of the chat (it usually says “Claude Sonnet” or “Claude 4 Sonnet”). A dropdown appears — select Haiku for simple tasks, Opus for hard ones.
My daily workflow:
- Morning email and docs: Switch to Haiku (quick rewrites, summaries, formatting)
- Active coding: Stay on Sonnet (the default sweet spot)
- Hit a hard bug or planning a new system: Switch to Opus for that one conversation, then switch back
The mental habit that matters: before hitting Enter, take one second to ask — “Does this task actually need Sonnet-level intelligence?” Most of the time, the answer is no.
8. Spread Work Across the Day
Claude uses a rolling 5-hour window, not a midnight reset. This is the most misunderstood part of Claude’s rate limiting — and once you understand it, you can practically double your daily output.
How the rolling window actually works:
Imagine your quota as a bucket that holds 100 units. Every token you send drains the bucket. But here’s the key: each token re-enters the bucket exactly 5 hours after it was used. It’s not “your bucket refills at midnight” — it’s a continuous rolling expiration.
❌ The Sprint Approach (burns out by lunch):
9:00 AM ████████████████████░░░░░ Used 80%, deep coding sprint
10:00 AM ████████████████████████░ Used 95%, pushing through
10:30 AM ████████████████████████X LOCKED OUT 🔒
10:31 AM - 2:00 PM ⏳ Waiting... doing nothing...
2:00 PM ████████████░░░░░░░░░░░░░ 9 AM usage starts expiring
3:00 PM ████████░░░░░░░░░░░░░░░░░ More expires, usable again
Daily total: ~1.2x quota (lots of idle waiting)
✅ The Wave Approach (sustainable all day):
9:00 AM ████████░░░░░░░░░░░░░░░░░ Session 1: planning, scaffolding (30%)
9:30 AM Break — switch to non-AI work (email, meetings, code review)
12:00 PM ██████░░░░░░░░░░░░░░░░░░░ Session 2: core implementation (25%)
12:30 PM Break — lunch, walk
2:00 PM ░░░░░░░░░░░░░░░░░░░░░░░░░ 9 AM usage fully expired! Bucket refilled
2:30 PM ████████░░░░░░░░░░░░░░░░░ Session 3: testing, debugging (30%)
3:00 PM Break
5:00 PM ░░░░░░░░░░░░░░░░░░░░░░░░░ 12 PM usage expired! Bucket refilled again
5:30 PM ██████░░░░░░░░░░░░░░░░░░░ Session 4: polish, documentation (25%)
Daily total: ~2.2x quota (zero downtime, never hit the limit)
A practical daily schedule that works:
| Time Block | Duration | What to Do with Claude | What to Do Without Claude |
|---|---|---|---|
| 9:00 - 9:30 AM | 30 min | Planning: architecture discussions, task breakdown | — |
| 9:30 - 12:00 PM | 2.5 hrs | — | Manual coding, code review, meetings |
| 12:00 - 12:30 PM | 30 min | Implementation: write core logic with Claude | — |
| 12:30 - 2:30 PM | 2 hrs | — | Lunch, non-coding work (9 AM tokens expiring) |
| 2:30 - 3:00 PM | 30 min | Debugging: fix issues from morning session | — |
| 3:00 - 5:00 PM | 2 hrs | — | Testing, documentation (12 PM tokens expiring) |
| 5:00 - 5:30 PM | 30 min | Polish: refactor, optimize, write docs with Claude | — |
The key insight: your best work with Claude happens in focused 30-minute bursts, not marathon sessions. Short bursts force you to think before you ask (better prompts = less waste), and the gaps between sessions let your quota recover naturally.
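If the bucket metaphor helps, here's a toy model of the rolling window. The numbers are illustrative only; this is not Anthropic's actual accounting:

```python
from collections import deque

class RollingQuota:
    """Toy model of a rolling-window quota: each spend returns to
    the bucket a fixed number of hours after it was used.
    Capacity and window are illustrative, not real limits."""

    def __init__(self, capacity=100, window_hours=5):
        self.capacity = capacity
        self.window = window_hours
        self.spends = deque()  # (timestamp_in_hours, amount)

    def available(self, now):
        # Spends older than the window have 'refilled': drop them.
        while self.spends and now - self.spends[0][0] >= self.window:
            self.spends.popleft()
        return self.capacity - sum(amount for _, amount in self.spends)

    def spend(self, now, amount):
        if amount > self.available(now):
            raise RuntimeError("locked out until older usage expires")
        self.spends.append((now, amount))

q = RollingQuota()
q.spend(9.0, 30)          # 9:00 AM session
q.spend(12.0, 25)         # noon session
print(q.available(14.0))  # at 2 PM the 9 AM spend has expired
```

The Wave Approach works because each burst's usage expires five hours later, right when you're ready for the next burst.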
9. Avoid Peak Hours for Heavy Tasks
Since March 2026, Anthropic applies a peak multiplier to token consumption. The same query consumes more of your quota during peak hours:
- Peak: Weekdays, 8 AM – 2 PM Eastern (8 PM – 2 AM Beijing time)
- Off-peak: Evenings, nights, and weekends
If you have a massive code refactor or deep research task, scheduling it for 7 PM instead of 10 AM can stretch your quota noticeably further — same work, less cost.
10. Enable Overage as a Safety Net
This doesn’t save tokens, but it prevents the worst-case scenario: getting locked out during a critical debugging session or right before a deadline.
Pro, Max 5x, and Max 20x subscribers can enable Overage in Settings → Usage. When your rolling quota runs out, Claude switches to pay-as-you-go API pricing instead of locking you out. Set a monthly cap ($5, $10, whatever you’re comfortable with) to avoid bill shock.
Think of it as insurance — you hope you never need it, but when you do, it’s worth every penny.
Part 2: Open-Source Tools for Claude Code Users
If you use Claude Code (the CLI), there’s a growing ecosystem of open-source tools that automate token savings at the system level. Here’s what each tool does, how it achieves compression, and when to use it.
Already familiar with RTK and Caveman Mode? We’ve covered those in dedicated reviews. The tools below are complementary options.
ClaudeSlim — Local Proxy Compression (60-85% Savings)
What it does: Sits between Claude Code and the Anthropic API as a local proxy on localhost:8086. Every API request passes through it and gets compressed before hitting Anthropic’s servers.
How the compression works:
ClaudeSlim targets four specific sources of bloat, each with a different strategy:
- System prompt hashing (95% reduction): Claude Code sends the same system prompt with every request — it’s huge and never changes mid-session. ClaudeSlim hashes it and sends only the hash after the first request.
- Tool definition compression (80% reduction): Claude Code registers dozens of tool definitions (file read, write, bash, etc.) on every API call. ClaudeSlim strips redundant schema fields, shortens descriptions, and deduplicates definitions.
- Message history compression (40% reduction): Old messages in the conversation get their whitespace stripped, code blocks summarized, and verbose tool results truncated.
- Tool call compression (50% reduction): Raw tool call/result payloads often contain full file contents that were already sent in previous turns. ClaudeSlim detects duplicates and replaces them with references.
Before: 7,094 tokens per request
After: 2,775 tokens per request
Saved: 60.9% — on a $20/month Pro plan, that's ~$110/month effective value
Install & use:
git clone https://github.com/apolloraines/ClaudeSlim
cd ClaudeSlim
pip install -r requirements.txt
python proxy.py # starts on localhost:8086
# Point Claude Code to the proxy
export ANTHROPIC_BASE_URL=http://localhost:8086
claude # use normally — compression is transparent
Best for: Pro plan users ($20/month) who want the biggest bang for their buck without changing workflows.
GitHub: apolloraines/ClaudeSlim
Pruner — Smart Context Pruning + Prompt Caching (20-70% Savings)
What it does: Instead of compressing content, Pruner removes what Claude doesn’t need. It trims old messages, caps oversized tool outputs, and automatically injects Anthropic’s prompt caching to avoid re-processing static content.
How the pruning works:
- Context pruning: Keeps only the last N messages (default: 20). The first message is always preserved, and when trimming, tool use/result pairs are kept together — you never get a tool call without its result, which would confuse the model.
- Prompt cache injection: Anthropic offers a `cache_control` API field that lets you mark content as cacheable. Pruner automatically adds `cache_control: { type: "ephemeral" }` to any system prompt over 1,024 tokens. This means Anthropic caches it server-side, and subsequent requests pay only 10% of the original token cost for that section.
- Output truncation: Large tool results (e.g., a `cat` on a 500-line file) get capped at 3,000 characters. The model rarely needs the full output — the relevant section is usually near the top or bottom.
- Smart summaries: When messages get pruned, Pruner generates a brief structural summary of what was removed, so the model knows “there was prior discussion about X” without needing the full transcript.
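The cache-injection step is simple enough to sketch. This is my own illustration of the pattern, not Pruner's actual code, and it uses a rough four-characters-per-token estimate:

```python
def inject_prompt_cache(request, min_tokens=1024):
    """Sketch of Pruner-style cache_control injection: wrap a long
    string system prompt in the structured form Anthropic's API
    accepts, marked as ephemeral-cacheable. The 4-chars-per-token
    estimate is an assumption for illustration."""
    system = request.get("system")
    if isinstance(system, str) and len(system) // 4 >= min_tokens:
        request["system"] = [{
            "type": "text",
            "text": system,
            "cache_control": {"type": "ephemeral"},
        }]
    return request

req = {"model": "claude-sonnet", "system": "x" * 8000, "messages": []}
out = inject_prompt_cache(req)
print(out["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

Short prompts pass through untouched, since Anthropic's caching has a minimum size threshold anyway.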
Verified savings: Unlike most tools that estimate compression, Pruner calls Anthropic’s /v1/messages/count_tokens API in parallel to get the exact before/after token count. What it reports matches your actual bill.
npx pruner@latest # install and start in one command
# Config in ~/.pruner/config.json:
{
"maxMessages": 20, // keep last 20 messages
"maxToolOutputChars": 3000, // truncate large outputs
"enablePromptCache": true, // auto-inject cache_control
"enableDedup": true, // deduplicate repeated file reads
"quiet": false // show per-request savings inline
}
Best for: Developers who want a zero-config solution with smart defaults. Great as a first tool — low risk, easy to understand.
Token Reducer — RAG-Based Compression (90-98% Savings)
What it does: The most aggressive tool in this list. Instead of pruning or proxying, Token Reducer builds a local search index of your entire codebase and sends only the relevant fragments to Claude — never the full file.
How the RAG pipeline works:
- AST-based chunking: Uses Tree-sitter to parse your code into semantic units — functions, classes, imports — not arbitrary line-based chunks. This means a 500-line file becomes 15-20 meaningful code blocks rather than random slices.
- Hybrid retrieval: When Claude needs context about a file, Token Reducer runs two parallel searches:
  - BM25 (keyword matching): Fast exact-match search using SQLite FTS5
  - Semantic vectors (ONNX embeddings): The Jina Code v2 model finds conceptually related code, even if the keywords don’t match

  Results from both are merged and re-ranked to get the most relevant chunks.
- TextRank compression: For remaining text content, a graph-based algorithm scores each sentence by importance (similar to how Google’s original PageRank worked) and keeps only the highest-scoring ones.
Example: Claude asks to read a 2,000-line utils.ts file
Without Token Reducer:
→ Sends all 2,000 lines (≈8,000 tokens)
With Token Reducer:
→ AST parses into 45 chunks
→ Query: "authentication middleware"
→ BM25 finds 3 keyword matches
→ Semantic search finds 2 conceptually related chunks
→ Sends 5 relevant chunks (≈400 tokens)
Savings: 95%
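The merge-and-re-rank step can be done several ways; reciprocal rank fusion is one common scheme for combining a keyword ranking with a vector ranking (whether Token Reducer uses exactly this is my assumption):

```python
def reciprocal_rank_fusion(keyword_hits, semantic_hits, k=60, top_n=5):
    """Merge two ranked lists of chunk IDs into one. A chunk that
    appears high in either list, or in both, scores well. The k=60
    damping constant is the conventional default."""
    scores = {}
    for ranking in (keyword_hits, semantic_hits):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical chunk IDs for the example query above
bm25 = ["auth.ts:verifyToken", "auth.ts:middleware", "routes.ts:login"]
vectors = ["auth.ts:middleware", "session.ts:refresh"]
print(reciprocal_rank_fusion(bm25, vectors))
```

The chunk found by both searches (`auth.ts:middleware`) wins, which is exactly the behavior you want from a hybrid retriever.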
pip install token-reducer
token-reducer index ./src # index your codebase (one-time)
token-reducer serve # start the local API
# Fully local — no API calls, no data leaves your machine
# Supports: Python, TypeScript, Go, Rust, Java, and more
Trade-off: More complex setup than proxy tools, and you need to re-index when files change. But the compression ratios are unmatched.
Best for: Large codebases where Claude wastes significant tokens reading irrelevant files.
GitHub: Madhan230205/token-reducer
Claude-Warden — Hook-Based Token Guards
What it does: Unlike the tools above, Warden doesn’t compress output — it prevents waste from happening in the first place. It installs Git-style hooks that intercept Claude Code’s tool calls at every stage.
How the hooks work:
Pre-tool hooks (before a command runs):
Claude wants to run: npm install express
Warden intercepts → rewrites to: npm install express --silent
Claude wants to run: cargo build
Warden intercepts → rewrites to: cargo build -q
Claude wants to read: dist/bundle.min.js (458 KB minified)
Warden intercepts → BLOCKED (binary/minified file, useless to the model)
Claude wants to run: grep -r "TODO" .
Warden intercepts → BLOCKED (recursive grep without depth limit)
Post-tool hooks (after a command runs):
Bash output is 25,000 characters:
→ Truncated to 10,000 (8KB head + 2KB tail — errors are usually
at the beginning or end)
Task/agent output is 8KB of verbose reasoning:
→ Compressed to structured bullets and headers
Read result for a 600-line file:
→ Structural signature extracted: imports, function names,
class definitions (subagents only see the skeleton)
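Head-plus-tail truncation is worth internalizing even if you never install Warden. Here's a sketch of the technique (character budgets taken from the description above; Warden's real implementation may differ):

```python
def truncate_head_tail(output, head=8192, tail=2048, limit=10240):
    """Keep the start and end of an oversized output, since errors
    usually appear at one of the two, and mark what was dropped."""
    if len(output) <= limit:
        return output  # short outputs pass through untouched
    dropped = len(output) - head - tail
    marker = f"\n... [{dropped} chars truncated] ...\n"
    return output[:head] + marker + output[-tail:]

log = "A" * 25_000
print(len(truncate_head_tail(log)) < len(log))  # True: compressed
```

A generic middle-cut like this loses the least signal per character saved, which is why both Warden and Token-Saver use variants of it as a fallback.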
Budget enforcement: You can set a per-session tool budget. At 75% usage, Warden shows a warning. At 90%, it becomes urgent. This prevents runaway subagents from eating your entire session budget.
git clone https://github.com/johnzfitch/claude-warden
cd claude-warden && ./install.sh
# Hooks are auto-registered — Claude Code picks them up immediately
# Check the live statusline for real-time stats:
# [sonnet-4] ctx:42% | in:12.3k out:2.1k | cache:89% | tools:24 | budget:67%
Best for: Developers who want guardrails against common token waste patterns. Pairs well with RTK — Warden blocks verbose commands, RTK compresses the output.
GitHub: johnzfitch/claude-warden
Token-Saver — Content-Aware Output Compression (60-99%)
What it does: Ships 21 specialized processors that understand the semantics of different command outputs. Instead of blind truncation, each processor knows what information matters.
How the content-aware processing works:
Every command output is routed to a domain-specific processor:
git diff output (1,200 lines):
Processor: git-diff
→ Keeps: actual changed lines, file names, conflict markers
→ Strips: identical context lines beyond ±3, permission changes,
rename-only diffs
→ Result: 180 lines (85% reduction)
pytest output (500 lines):
Processor: pytest
→ Keeps: FAILED tests, error messages, stack traces, summary line
→ Strips: PASSED tests (dozens of "test_xxx PASSED" lines),
progress dots, timing info
→ Result: 35 lines (93% reduction)
docker build output (800 lines):
Processor: docker
→ Keeps: error steps, final image ID, build warnings
→ Strips: layer download progress bars, cache hit messages,
intermediate step confirmations
→ Result: 12 lines (98% reduction)
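To show what "content-aware" means in practice, here's a minimal pytest-style processor of my own (illustrative only; Token-Saver's 21 real processors are certainly more thorough):

```python
def compress_pytest_output(text):
    """Keep failures, assertion details, and the summary line;
    drop PASSED lines and anything else carrying no signal."""
    keep = []
    for line in text.splitlines():
        if " PASSED" in line:
            continue  # passing tests tell the model nothing useful
        if ("FAILED" in line or "ERROR" in line or "assert" in line
                or line.startswith("=")):
            keep.append(line)
    return "\n".join(keep)

raw = "\n".join([
    "test_auth.py::test_login PASSED",
    "test_auth.py::test_expiry FAILED",
    "E   assert token.valid is True",
    "==== 1 failed, 1 passed in 0.42s ====",
])
print(compress_pytest_output(raw))
```

The failure, its assertion, and the summary survive; everything else is dropped before it ever reaches the model.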
Safety rails:
- Outputs under 200 characters are never modified (too short to compress meaningfully)
- Compression only kicks in if the reduction exceeds 10% (don’t touch it if there’s nothing to gain)
- Source code files (`cat *.py`) pass through unchanged — the model needs exact content
- `.env` file contents are redacted before reaching the model (security bonus)
# Install
curl -fsSL https://github.com/ppgranger/token-saver/releases/latest/download/install.sh | bash
# Works with both Claude Code and Gemini CLI
# Registers as a hook — transparent, no workflow changes
Best for: Developers who run lots of test suites, build commands, and git operations. The domain-specific processing is smarter than generic truncation.
Claudio — Preprocessing Pipeline (91-96% on Large Files)
What it does: A CLI layer that sits between you and Claude, preprocessing your prompts before they’re sent. It strips noise, compresses file content, and caches responses locally.
How the preprocessing pipeline works:
Every request goes through three stages:
- Strip: Removes comments, blank lines, trailing whitespace, and import noise from code files. A 200-line Python file might drop to 120 lines — same logic, less cruft.
- Compress: Converts Markdown-heavy prompts to XML format. Why? Claude parses XML natively (it’s the same format used for tool definitions), and XML tags cost ~2 tokens each vs. ~4-6 for Markdown headers. Across 50 requests, this saves ~200 tokens.
- Cache: Responses are stored locally with a content hash. If you ask the same question about the same file, Claudio returns the cached result instantly — zero tokens spent.
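The caching stage is the easiest win to replicate anywhere. Here's a sketch of the assumed mechanism (my own illustration, not Claudio's code; `fake_model` stands in for the real API call):

```python
import hashlib
import json

class LocalResponseCache:
    """Key responses by a hash of the prompt plus the file contents,
    so repeating a question about an unchanged file costs nothing.
    Any edit to the file changes the hash and forces a fresh call."""

    def __init__(self):
        self.store = {}

    def _key(self, prompt, file_text):
        blob = json.dumps([prompt, file_text]).encode()
        return hashlib.sha256(blob).hexdigest()

    def get_or_call(self, prompt, file_text, call_model):
        k = self._key(prompt, file_text)
        if k not in self.store:
            self.store[k] = call_model(prompt, file_text)  # real spend
        return self.store[k]

calls = []
def fake_model(prompt, file_text):  # hypothetical stand-in for the API
    calls.append(prompt)
    return "looks fine"

cache = LocalResponseCache()
cache.get_or_call("any security issues?", "def f(): pass", fake_model)
cache.get_or_call("any security issues?", "def f(): pass", fake_model)
print(len(calls))  # 1: the second ask was served from cache
```

Content hashing (rather than caching by filename) is the key design choice: it invalidates automatically the moment the file changes.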
# Analyze a file — 96% savings on a 2,844-token input
claudio build -r @filter.py "simplify this"
# Input: 2,844 tokens → After pipeline: 128 tokens (96% saved)
# Ask about code — 88% savings
claudio ask -rv @executor.py "any security issues?"
# Input: 650 tokens → After pipeline: 80 tokens (88% saved)
# Agentic mode — multi-task without re-ingesting context
claudio plan "refactor auth module" --tasks 5
# Each sub-task shares context from task 1 — no per-task re-ingest
Best for: Developers who want fine-grained control over exactly what Claude sees, and who frequently re-ask similar questions.
Context Optimizer — Analytics Before Optimization (30-50%)
What it does: Instead of compressing blindly, Context Optimizer measures first. It silently tracks every file read, edit, and search across your sessions, then tells you exactly where your tokens go — and where they’re wasted.
How the analytics work:
Context Optimizer builds a “coding profile” over time. After a few sessions, it knows:
/cco — Session heatmap (which files ate the most tokens)
src/utils/auth.ts ████████████████ 4,200 tokens (read 8x)
src/api/routes.ts ████████████ 3,100 tokens (read 6x)
src/types/index.ts ████████ 2,000 tokens (read 5x)
node_modules/express/ ██████████████ 3,500 tokens ← WASTE
package-lock.json ██████ 1,500 tokens ← WASTE
/cco-report — ROI analysis
Efficiency Score: 62/100 (Grade: C)
Token waste: 5,000 tokens/session on node_modules + lock files
Suggestion: Add to .claudeignore to save 35% per session
/cco-git — Smart file loading based on git diff
Files changed since last commit: 3
Suggested context: auth.ts, routes.ts (skip unchanged types/)
Estimated savings: 2,000 tokens vs loading all project files
The power isn’t in compression — it’s in knowing what to stop loading. After running Context Optimizer for a week, most developers find they’ve been feeding Claude 30-40% irrelevant files out of habit.
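The git-aware suggestion is simple to reproduce in your own workflow. Here's a sketch of the idea (the `/cco-git` internals are my assumption; the function takes the output of `git diff --name-only` rather than invoking git itself):

```python
def suggest_context(changed_files_output, all_project_files):
    """Suggest only the files that changed since the last commit,
    given the text output of `git diff --name-only`. Everything
    else is assumed stable and skippable for this session."""
    changed = set(changed_files_output.strip().splitlines())
    return [f for f in all_project_files if f in changed]

# Hypothetical diff output and project file list
diff_out = "src/utils/auth.ts\nsrc/api/routes.ts\n"
project = ["src/utils/auth.ts", "src/api/routes.ts", "src/types/index.ts"]
print(suggest_context(diff_out, project))
```

For the example session in the heatmap above, that's two files loaded instead of the whole project.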
npm install -g claude-context-optimizer
# Slash commands inside Claude Code:
/cco # session heatmap
/cco-report # full ROI report
/cco-budget set 50000 # set token budget with alerts
/cco-git # git-aware file suggestions
Best for: Teams that want data-driven decisions. Run it for a week before choosing a compression tool — you’ll know exactly where to focus.
GitHub: egorfedorov/claude-context-optimizer
Which Approach Should You Use?
| Your Situation | Start Here | Then Add |
|---|---|---|
| Using Claude web/desktop only | Part 1 habits (free, immediate) | — |
| New to Claude Code | Pruner (zero-config, safe defaults) | Warden for guardrails |
| Claude Code on Pro plan ($20/mo) | ClaudeSlim proxy (biggest ROI) | Token-Saver for command output |
| Large codebase, heavy sessions | Token Reducer (highest compression) | Context Optimizer to find waste |
| Want to prevent waste proactively | Claude-Warden hooks | RTK for command output |
| Already using RTK | Context Optimizer (find remaining waste) | Warden for complementary coverage |
| Not sure where to start | Context Optimizer for 1 week | Then pick based on the data |
The Bottom Line
Token management isn’t about being stingy — it’s about being intentional. The 10 conversation habits cost nothing and take effect immediately. The open-source tools can multiply your effective quota by 3-10×, depending on your workflow.
Start with the habits. Pick one tool that matches your setup. And remember the core principle: Claude charges for tokens, not messages. Once you internalize that, everything else follows.