· 7 min read

What my own Claude Code was doing wrong

Monday morning I ran agent-ledger week --summary expecting the usual few-hundred-dollar figure. Got ~$20K.

I’m on Claude Max so I didn’t actually pay that — the $200/mo cap ate it. But 244× on a Max plan is real token usage, and when I went looking at where it went, most of it was spent on dumb shit. Retrying the same failing edit a dozen times. Running Opus for work Haiku could do in its sleep. A subagent that stayed on Bash for over fifty consecutive turns before I noticed.

So I wrote a second tool to figure out why. The first one, claude-agent-ledger, tells you how much each subagent / model / day / session cost. Useful, but it just gives you tables. Tables don’t tell you anything is wrong — they tell you what happened. This post is about the second tool, the one that tells you what’s wrong.

/cost shows you how much the current session has burned. That’s useful for about six seconds, which is how long it takes to realize you actually want to know which subagent is eating your life across the last 54 sessions. ledger gave me that per-subagent view, but staring at 54 rows of numbers I still couldn’t pick out the patterns fast enough. I needed a lint. A rule catalog that reads the same JSONL files and points at the patterns I probably want to fix, same discipline as eslint or shellcheck. I called it claude-agent-doctor. It has two commands:

doctor check             # scan recent sessions
doctor explain <CODE>    # mechanism + case study for one rule

v0.3 ships twelve rules across three categories: cost, loops, tools. Each rule has a code (MODEL_MONOCULTURE, BASH_STORM, that kind of thing), a detection signature, a mechanism paragraph, a real case from my own logs, and a copy-pasteable fix. The catalog is at PATHOLOGY.md if you want to skip the writeup and just read it.

Here’s what doctor printed on my 7-day window:

Doctor report  ·  54 sessions  ·  ~$20K total shadow spend
  15 high · 25 med · 1 low
  potential weekly savings: ~$2.9K

Forty-one findings. Three that made me sit up:

The first one was MODEL_MONOCULTURE. 99.6% of my week’s shadow cost was Opus. Sonnet got 0.4%. Haiku saw a grand total of $1.29. Haiku 4.5 is not a downgrade — it’s a different engine, tuned for short-context deterministic work. My Swift Developer subagent, which mostly does structural Swift edits and read-only reviews, had a median output of 574 tokens per turn. That’s Haiku territory. Haiku at maybe a tenth the cost, near-identical output quality. But I’d never configured it, so every Swift task was paying Opus rates for work that didn’t need them. The fix is one line in the subagent frontmatter, or an env var (CLAUDE_CODE_SUBAGENT_MODEL=haiku) if you want to try it for an afternoon and revert by unsetting the variable.

The second was RUNAWAY_SESSION. One single session consumed 31% of my week. Five thousand fifty-two turns. I hadn’t noticed because the session kept looking productive from the outside — ledger never highlights a session unless you go looking. When I finally ran agent-ledger explain on it, the top three most-expensive turns were each $26, back-to-back, all Opus, all writing nearly a megabyte of cache per turn, running some grep variant on repeat. The agent had gotten stuck probing for a pattern it couldn’t find, and I kept the session open because I was busy. The fix there is max_turns: 80 and max_cost_usd: 50 as per-subagent guardrails, plus a habit of manually inspecting any session that crosses a thousand turns. I would not have thought to set that limit before seeing the $6K invoice-that-wasn’t.

The third was LOOP_DEATH. This one catches the same tool appearing as the primary action for eight or more consecutive turns, which almost always means the agent is stuck. In my 14-day window doctor flagged seventeen of these. The worst had Claude stuck on Bash for over fifty consecutive turns, most of them minor variations on the same command. That’s an agent that’s not learning from the tool’s output, and you’re paying Opus to watch it not learn.

The other nine rules are smaller in blast radius but similar in shape: CACHE_TTL_MISMATCH catches short sessions paying the 2× rate for 1h cache when 5m would do; BASH_STORM flags sessions with more than 500 Bash calls (usually a grep loop); RETRY_THRASH catches one file being heavily read and heavily edited (typically an agent fighting a failing test); CONTEXT_BLOAT looks for turns over 150K tokens of input, which is the point where your context window is paying for a lot of bytes the model isn’t going to use. The full list and its mechanisms are in the catalog. The next four rules for v0.4 are already drafted in docs/drafts/ so anyone can PR them in.

The obvious next question — and the first thing HN will ask — is why not build a router. Detect the small task, auto-route to Haiku, save my week. I looked at it. The space is packed: musistudio/claude-code-router is at 32.9k stars, LiteLLM at 44.6k, Portkey at 11.4k. There’s no obvious hole for a fourth proxy. Anthropic has already eaten most of the value that routing would capture — the CLAUDE_CODE_SUBAGENT_MODEL env var does one-line routing to Haiku, and Claude Code’s built-in Explore subagent already defaults to Haiku. What’s left is a long tail of edge cases where a router could still help, but capturing that tail means running a persistent proxy in front of your API, and running a persistent proxy as a solo maintainer is a great way to spend a year of weekends playing compat whack-a-mole every time the API adds a field.

So doctor stays a static analyzer. Read-only. Its newest command, suggest-routing, reads your ~/.claude/agents/*.md files and proposes per-agent model changes, but it emits the changes as a plan you apply by hand (or as a unified diff via --export-patch). It guards hard — orchestrators, payment/subscription agents, safety/compliance auditors, and reality-checkers never get auto-downgraded, regardless of keyword hits. On my own seventeen agent files, the first pass of the suggestion engine returned twelve aggressive downgrades. After the guards fired, only one remained: my dingding server-monitor agent, whose description is “restart services, read logs, reply to commands.” Haiku handles that fine.

I went back and forth on whether “doesn’t auto-fix” is a feature or a cop-out. The honest answer is: I don’t trust an LLM to silently edit my agent config based on heuristics I wrote in an afternoon, and I don’t think you should either. If you disagree with a finding, you should see it, argue with it, and decide. An auto-fixer takes that away and ships a worse product faster. That’s the trade, and I landed on the read-only side.

A thing I didn’t expect when I started: the hardest part wasn’t the detectors. The detectors are short — each one is a few dozen lines of TypeScript running over a SessionStat struct. The hard part was the guards. My first run of suggest-routing cheerfully proposed downgrading my iOS factory orchestrator to Haiku because its description mentions “review” and “audit.” Downgrading the orchestrator breaks routing for every child subagent. The wrong save on that one agent would cost more than whatever I saved everywhere else combined.

That’s a weirdly instructive failure mode. The low-hanging fruit in agent infrastructure — swap a model, cut a prompt section, raise a threshold — isn’t actually low-hanging if the wrong call breaks a critical path. A lot of the value of a tool like doctor is just sorting the suggestions into “safe to try” and “don’t touch unless you understand exactly what happens next.”

I’m not making your agent smarter — that’s a rabbit hole I’m not qualified for this week. I’m making your own logs legible so you can be smarter about your agent. If that sounds unexciting, good. Last time I tried to build “self-improving agent infrastructure” I archived the repo the same afternoon, after a couple of friends red-teamed it and pointed out I was solving my own problem, not a real one.

Install:

npm install -g claude-agent-doctor
doctor check

Source is MIT. Contributions welcome — a new rule joins the catalog when it has a detection signature, a mechanism paragraph, a real case study, and a one-paragraph fix. That bar plus my own 219-session corpus is what the current twelve rules are built on.

The finding that surprised you about your own logs — please tell me. I’m collecting these.

Comments

Be the first to start a discussion. Reply with your thoughts on Bluesky and tag @jiexiang.dev — I'll link the thread back here.