Every developer using AI coding assistants has felt the sting: you ask a simple question about a React re-render bug, and the model responds with a 1,200-token essay. It opens with “Great question!”, fills the middle with transitional phrases nobody reads, and closes with “Let me know if you need anything else!”

That’s not just annoying, it’s expensive. Every one of those filler tokens costs real money.

Caveman, now at 52.3K stars on GitHub with 2.8K forks, takes a brutally simple approach to fixing this: it tells your AI to shut up and just answer. And since its initial viral moment, it’s evolved from a single Claude Code plugin into a full three-part ecosystem supporting over 30 AI coding agents.


The Problem: You’re Paying for Small Talk

Here’s the comparison from Caveman’s own README that makes the waste obvious. Same question, explaining why a React component re-renders due to reference changes:

Normal Claude (69 tokens)

“The reason your React component is re-rendering is likely because you’re creating a new object reference on each render cycle. When you pass an inline object as a prop, React’s shallow comparison sees it as a different object every time, which triggers a re-render. I’d recommend using useMemo to memoize the object.”

Caveman Claude (19 tokens)

“New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo.”

Same fix. 75% less words. Brain still big.
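The re-render mechanics behind both answers are easy to demonstrate in plain JavaScript. Here's a minimal sketch; `memoizeOne` is a hypothetical stand-in for the caching `useMemo` provides, not React's actual implementation:

```javascript
// React's shallow prop comparison checks reference equality, so an inline
// object literal is a "new" prop on every render:
const propsA = { style: { color: "red" } };
const propsB = { style: { color: "red" } };
console.log(propsA.style === propsB.style); // false -> child re-renders

// Memoization keeps the reference stable while the inputs are unchanged.
// memoizeOne is a toy stand-in for what useMemo does with its deps array:
function memoizeOne(factory) {
  let lastDeps = null;
  let cached;
  return (...deps) => {
    const same =
      lastDeps !== null &&
      lastDeps.length === deps.length &&
      lastDeps.every((d, i) => Object.is(d, deps[i]));
    if (!same) {
      lastDeps = deps;
      cached = factory(...deps);
    }
    return cached;
  };
}

const getStyle = memoizeOne((color) => ({ color }));
console.log(getStyle("red") === getStyle("red")); // true -> no re-render
```

Same reference back means React's shallow comparison passes and the child skips the render.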


The Caveman Ecosystem (New in 2026)

Caveman is no longer just one tool. It’s grown into three complementary projects:

| Project | Purpose | Key Benefit |
|---|---|---|
| caveman | Compress agent output | ~65% mean output token reduction |
| cavemem | Cross-agent persistent memory | Stores memories in compressed format (SQLite/FTS5) to reduce token bloat when recalling context |
| cavekit | Specification-driven build loops | Reduces agent “guessing” with structured autonomous workflows |

Plus two powerful sub-tools:

  • caveman-compress: cuts ~46% of input tokens from files like CLAUDE.md every session
  • caveman-shrink: an MCP proxy middleware for even deeper compression

Four Compression Levels

Caveman now offers four modes, including a 文言文 (Classical Chinese) mode:

| Level | Style | Example Output |
|---|---|---|
| Lite | Trims filler, keeps complete grammar | “Your component re-renders because you create a new object reference each render. Wrap it in useMemo.” |
| Full (default) | Drops articles, fragments sentences | “New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo.” |
| Ultra | Heavy abbreviation, telegram-style | “Inline obj prop → new ref → re-render. useMemo.” |
| 文言文 | Classical Chinese compression | “物出新參照,致重繪。useMemo Wrap之。” (roughly: “new object reference causes re-render; wrap in useMemo”) |

The core design principle remains unchanged: compress the language, never the code. Technical nouns, function names, file paths, and code blocks pass through completely untouched. Only natural-language fluff gets stripped.
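As a toy illustration of that principle (not Caveman's actual implementation), a compressor can split prose from fenced code blocks, strip filler only from the prose, and pass the code through verbatim. The filler list here is made up for the example:

```javascript
// Toy sketch of "compress the language, never the code".
// FILLER is an illustrative list, not Caveman's real rule set.
const FILLER = /(Great question!\s*|Let me know if you need anything else!\s*|\b(?:likely|basically)\s+)/g;

function compress(text) {
  return text
    // Capture fenced code blocks so they survive the round trip untouched.
    .split(/(```[\s\S]*?```)/)
    .map((chunk) => (chunk.startsWith("```") ? chunk : chunk.replace(FILLER, "")))
    .join("");
}

const input =
  "Great question! You should basically wrap it:\n```js\nconst x = useMemo(fn, []);\n```";
console.log(compress(input));
// Filler is gone from the prose; the code block is byte-for-byte identical.
```

The real tool works at the instruction level rather than post-processing text, but the invariant is the same: identifiers, paths, and code blocks are never rewritten.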


Official Benchmark Numbers

From Caveman’s own benchmarks, 65% mean output reduction across tasks, with a range of 22–87%:

┌────────────────────────────────────┐
│ TOKENS SAVED     ████████   65%    │
│ TECHNICAL ACC.   ████████   100%   │
│ SPEED INCREASE   ████████   ~3x    │
│ VIBES            ████████   OOG    │
└────────────────────────────────────┘

The savings claim is backed by peer-reviewed research showing that forcing concise output doesn’t reduce accuracy — it can actually improve it. The model stops performing “helpful assistant theater” and focuses compute on the actual problem.

Key benefits:

  • Faster responses — fewer tokens to generate = lower latency
  • Easier to read — no wall of text, just the answer
  • Same accuracy — all technical info preserved, only fluff dropped
  • Save money — 65% mean output reduction (range 22–87%)
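The money math is simple to sanity-check yourself. The per-token price below is a placeholder, not any provider's real rate; plug in your own pricing:

```javascript
// Back-of-envelope savings estimate. PRICE_PER_OUTPUT_TOKEN is an assumed
// placeholder ($15 per 1M output tokens), NOT a real provider rate.
const PRICE_PER_OUTPUT_TOKEN = 15 / 1_000_000;

function monthlySavings(outputTokensPerMonth, reduction) {
  // Tokens you no longer pay for, times the price per token.
  return outputTokensPerMonth * reduction * PRICE_PER_OUTPUT_TOKEN;
}

// At 10M output tokens/month and the benchmark's 65% mean reduction:
console.log(monthlySavings(10_000_000, 0.65).toFixed(2)); // "97.50"
```

Swap in the 22% and 87% endpoints of the reported range to bracket your own best and worst case.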

Setup: One Command, 30+ Agents

The installation has been dramatically simplified. A single command auto-detects every supported agent on your system:

# macOS / Linux / WSL / Git Bash
curl -fsSL https://raw.githubusercontent.com/JuliusBrussee/caveman/main/install.sh | bash

# Windows (PowerShell)
irm https://raw.githubusercontent.com/JuliusBrussee/caveman/main/install.ps1 | iex

This detects 30+ agents automatically: Claude Code, Gemini CLI, Codex, Cursor, Windsurf, Cline, GitHub Copilot, Continue, Kilo, Roo, Augment, Aider Desk, Amp, JetBrains Junie, Kiro CLI, Mistral Vibe, OpenHands, Qwen Code, Tabnine, Trae, Warp, Replit Agent, Antigravity, and more. It runs each agent’s native install, skips what you don’t have, and is safe to re-run.

Manual Install (Per Agent)

# Claude Code
claude plugin marketplace add JuliusBrussee/caveman && claude plugin install caveman@caveman

# Gemini CLI
gemini extensions install https://github.com/JuliusBrussee/caveman

# Cursor / Windsurf / Cline / Copilot
npx skills add JuliusBrussee/caveman -a <cursor|windsurf|cline|github-copilot>

Installer Flags

| Flag | Effect |
|---|---|
| --minimal | Skip extras, just install the plugin/extension |
| --all | Also drop per-repo rule files into the current directory |
| --with-hooks | Install Claude Code hooks + statusline + stats badge |
| --with-mcp-shrink | Register the caveman-shrink MCP proxy |
| --with-init | Write auto-start rules for Cursor, Windsurf, Cline, Copilot, etc. |
| --dry-run | Preview what would be installed |
| --list | Show all detected agents |

Using It: Simple Voice Commands

Once installed, control Caveman with natural language:

  • Activate: Type /caveman in your prompt (or $caveman in Codex)
  • Deactivate: Say “stop caveman” or “switch back to normal”
  • Change level: /caveman ultra or just say “switch to 文言文 mode”
  • Commit mode: caveman-commit for tight commit messages
  • Review mode: caveman-review for one-line code reviews
  • Compress input: caveman-compress to shrink context files

The transitions are seamless; no session restart is needed.


The Deeper Point: Are You Paying for AI’s “Emotional Labor”?

Caveman is a small project with a big insight: a significant chunk of what we pay AI models for is performative politeness: greetings, hedging, empathetic sign-offs. That’s fine in a customer-facing chatbot. In a development tool, it’s pure waste.

When you force a model to skip the social performance and focus entirely on the technical answer, something interesting happens: the answers actually get better. Published research confirms that concise output constraints can improve model reasoning accuracy. The model stops maintaining a “helpful assistant persona” and focuses its compute on the actual problem.

That might be the real value of Caveman: not just the cost savings, but the quality improvement that comes from letting your tools be tools.


Who Should Use This?

  • Anyone using AI coding assistants (it now supports 30+ agents, not just Claude Code)
  • Heavy API users spending significant monthly budgets on tokens
  • Teams where multiple developers share a token budget
  • Solo developers on free tiers who want to stretch their allocation
  • Anyone tired of scrolling past AI pleasantries to find the actual answer

Who might skip it:

  • If you’re writing documentation or client-facing content where natural language matters
  • If you’re new to coding and benefit from verbose explanations

My Take After Two Weeks

I started on full and kept it there for most tasks. The savings numbers in the benchmarks above — 65% average — are accurate for prose-heavy responses, but in a real coding session most of your tokens come from file reads and conversation history, not Claude’s explanations. My actual per-session savings were closer to 8-12%. Still worth it when you’re running 15+ sessions a day.

What I didn’t expect: I started sending fewer follow-up messages. Without the hedging and filler, Claude’s first attempt was more focused, and I stopped needing to say “just give me the fix” as a second prompt. That saved more tokens than the compression itself, honestly.

I turn it off for debugging sessions where I’m genuinely confused and want verbose reasoning. And architecture discussions — I actually want Claude to think out loud when I’m evaluating trade-offs. But for everyday refactoring and code reviews, full is staying on permanently. The ultra level is a novelty; lite is barely noticeable. full hits the sweet spot.


Final Verdict

What started as a viral joke about caveman-speak has grown into a serious, well-engineered ecosystem at 52.3K stars. With four compression levels, a universal one-command installer for 30+ agents, input compression via caveman-compress, persistent memory via cavemem, and real-world savings of 22–87% on output tokens, Caveman has become essential infrastructure for cost-conscious AI-assisted development.

The developer efficiency divide might not be about which model you use; it might be about whether you have the nerve to tell your AI to stop talking and start working.

Free & Open Source | GitHub Repository

