Guides

The Harness Tax: What Coding-Agent Dead Weight Costs Your Engineering Org in 2026

A coined term for the silent overhead burning through every coding-agent budget in 2026: seat tax, context tax, tier tax, tool tax, session tax. Quantified, named, and what to do about it.

·
13 min read
ai-gateway 2026
Editorial cover image for The Harness Tax: What Coding-Agent Dead Weight Costs Your Engineering Org in 2026

A 30-engineer team paying $40,000 a month for Claude Code is paying $14,000 of it for nothing. Not bad code, not hallucinations, not over-eager refactors. Dead weight. Idle seats, context packed with files the agent never reads, Opus invoked to rename a variable, MCP tool descriptions re-serialised into every turn, four duplicate sessions per developer per day because nobody resumes the old one.

That overhead has no name in the budget review. Finance sees one line item from Anthropic. Engineering sees a productivity gain. Neither sees the tax.

Call it the harness tax.

What the harness tax actually is

The harness tax is the gap between what your coding agent costs and what it would cost if every token bought signal. Paid-for capacity nobody uses. Context windows stuffed with irrelevant files. Frontier models doing trivial work. Tool schemas re-serialised every turn. Session sprawl that fragments every conversation into a dozen cold starts.

Not a hallucination problem. Not a quality problem. An operations problem dressed up as an AI problem.

The tax is hard to see for three reasons. Sessions are long, so per-call telemetry is useless. Cost is concentrated in a few power users, so averages lie. The invoice is one big number, so nobody asks where the slack lives. Most engineering leaders treat coding agents the way they treated SaaS in 2015: sign the contract, hand out seats, don’t audit the line item. That worked for Slack at $8 per seat. It doesn’t work for Claude Code at $40,000 a month.

The 2026 inflection: coding-agent spend has crossed the line where it’s the second or third largest engineering cost after salaries. When the bill stops being a rounding error, the tax stops being one too.

The five forms of the harness tax

Five shapes. Each named. Each measurable. Each cuttable.

1. Seat tax: paying for engineers who barely use the agent

In every team we have audited in 2026, between 28% and 41% of paid Claude Code seats produce fewer than 10 sessions per week. That’s two sessions per working day. For a tool whose value comes from being the first thing you reach for, two sessions a day means it isn’t the first thing you reach for. It’s a backup tool a senior engineer turns to once a week and forgets to cancel.

Concretely: 30 seats at $200/month per engineer plus pooled token spend averaging $1,100/month per engineer. If 35% of those seats fire fewer than 10 sessions a week, the seat tax is 30 x 35% x $1,300 = $13,650/month gone to dead seats. A 200-engineer shop with the same idle ratio is bleeding $90,000 a month before anyone has typed a prompt.

The signal is simple: if you can’t produce a daily-active-developer count for your coding agent, you have a seat tax.

2. Context tax: packing every turn with files the agent never reads

Claude Code aggressively expands the working set. The default behaviour packs the project tree, open buffers, recently edited files, and the contents of any file the agent might plausibly need into the input context. A typical mid-conversation turn carries 80K to 160K input tokens, of which the agent actually attends to maybe 10K. The rest is the model paying for storage you bought to make sure it had it.

At May 2026 Opus pricing ($15 per million input), every 100K of unused input is $1.50 per turn. A team running 4,000 turns per day pays $6,000 a day in input tokens, of which our Q1 2026 profiling across 22 teams suggests 35% to 60% is context the agent never references. Mid-estimate: $2,400/day, $52,800/month of pure context tax.

Subagent architectures make this worse. Claude Code’s sub-task spawning duplicates the relevant context block into each subagent’s window, so a parent turn at 120K tokens with three subagents becomes 480K input tokens billed for the same fundamental query. Engineering reaches for subagents because the UX is clean. Finance doesn’t see the multiplier until the invoice arrives.

3. Tier tax: Opus for trivial work that Haiku could finish

Coding agents default to the strongest available model because the default is the safest setting for product surface area. It’s the wrong setting for the budget. Across our 2026 dataset, 47% of Claude Code turns are turns Sonnet would have handled identically, and another 22% are turns Haiku would have handled identically. The remaining 31%, multi-file reasoning, complex refactors, debugger interpretation, actually need Opus.

The cost gradient on May 2026 pricing:

  • claude-haiku-4-5: $0.80 in / $4.00 out per million
  • claude-sonnet-4-6: $3.00 in / $15.00 out per million
  • claude-opus-4-7: $15.00 in / $75.00 out per million

That’s a 19x multiplier from Haiku to Opus on input. When 69% of turns are over-spec’d, the tier tax is typically the largest single line item. For the 30-engineer team at $40K/month, mid-estimate of the tier tax is $10,000 to $14,000/month. The bill drops 25-35% the day a routing policy ships.

The pattern reverses for genuinely hard turns. Routing Opus traffic down to Sonnet on hard problems silently degrades quality and the team spends an hour debugging the agent’s wrong answer. The tax isn’t “always use Haiku.” It’s “use Haiku when Haiku is enough, and have telemetry that knows.”

4. Tool tax: MCP descriptions re-serialised into every turn

Every coding agent running tools through the Model Context Protocol pays a less visible tax on the tool schema itself. Each registered MCP tool ships its full JSON schema and description text into the system prompt of every turn. A typical Claude Code setup with 8 to 12 MCP servers connected carries 3,500 to 9,000 tokens of tool descriptions per turn.

Fine for the first turn. Wasteful when the same 9,000 tokens repeat across 40 turns and the agent uses two of the twelve tools. At $15 per million for Opus input, a 9,000-token tool block over 40 turns is $5.40 per session for tool descriptions the conversation never invoked. A team running 800 sessions a day pays $4,320/day, $95,000/month for tool-schema repetition.

Prompt caching helps when implemented correctly. Anthropic’s discount cuts the cached portion to roughly $1.50 per million (90% off). The trap: caches need stable inputs, which breaks the moment a new MCP server connects mid-session. Most teams we audit believe they’re caching when they’re caching maybe 30% of the tool block.

5. Session tax: N cold starts per developer per day

The cleanest pattern we see in production data: developers don’t resume sessions. They open Claude Code, run for 20 minutes, close the terminal, open it again at 3pm, start fresh. Across the day a single engineer can spawn 5 to 8 new sessions where 1 or 2 resumed sessions would have done the same work for a tenth of the input tokens.

The tax shows up as the input-token spike at the start of every session. A cold start typically lands at 40K to 80K input tokens before the developer has typed anything. Twenty thousand of those is the system prompt and tool block. The rest is project bootstrapping. Multiply by 5 to 8 sessions per developer per day and the 30-engineer team pays $3,000 to $6,000/month for repeated bootstrapping.

Partly a UX problem, partly a process problem. The CLI doesn’t nudge resumption. Most teams have no convention about when to resume vs. start fresh. The tax compounds when the developer leaves the terminal idle and the session times out; the next session starts cold again.

The hidden compounding effect

The five taxes aren’t additive. They compound. A bigger context window means more input tokens, which means tier tax on Opus hurts more. A higher MCP tool count means heavier cold starts, which means session sprawl costs more per restart. The 35% of idle developers are subsidising the 10% of power users without anyone knowing.

For teams we have profiled in 2026, the harness tax tends to land at 30% to 45% of total coding-agent spend, a $40,000/month bill carrying $12,000 to $18,000 of dead weight. It’s the line item that grows fastest, because every new MCP server, every new developer onboarded, every new project added to the working set widens the surface for all five taxes simultaneously.

A simple measurement framework

Before cutting the tax, you need to see it. Four layers, in order:

  1. Per-developer activity. Daily-active and weekly-active developer counts. If DAU/MAU is under 0.4, the seat tax is your biggest line item. Requires a gateway that tags traffic by developer identity, not by team API key.

  2. Per-turn token decomposition. Split each turn’s input tokens into: system prompt, tool descriptions, project context, conversation history, developer’s current message. Tool tax shows up in the tool-descriptions column. Context tax shows up in the project-context column.

  3. Per-turn model utilisation. What fraction of Opus turns would have produced an identical answer on Sonnet or Haiku? Measure by replaying sampled turns through cheaper models and scoring against the original. The tier tax is the spread between observed cost and the minimum cost that produced the same answer.

  4. Per-session lifecycle. Sessions started, resumed, and abandoned per developer per day. High new-session count plus low resume count is the session tax.

A team that can produce those four numbers monthly has the data to cut the tax. A team that can’t is operating on vibes.

How to cut the harness tax

Five levers, each named, each quantifiable. Pull them in this order, because the upstream cuts make the downstream cuts bigger.

Lever 1: Active-seat audit on a 30-day window

Reclaim any seat that has fired fewer than 10 sessions in the trailing 30 days. The pushback is “but they might need it.” The counter is “they can request it back in five minutes.” For a 30-engineer team, this lever alone reclaims $8,000 to $14,000/month without affecting any active developer.

Pair with quarterly re-attestation: managers approve continued seats based on actual usage reports. Self-service restoration. Overhead: two hours per quarter per manager.

Lever 2: Context budgeting with hard caps on per-turn input

Set a per-turn input cap at 60K tokens for routine work, 120K for complex multi-file tasks. Enforce at the gateway, not the client, because client layers drift. The cap forces the agent’s context-selection heuristic to be tighter and pushes the team toward smaller, focused tasks instead of “refactor the entire repo.”

Realistic savings: 20% to 35% reduction in input tokens, $8,000 to $14,000/month on a $40K/month team. The cost is occasional friction on genuinely large tasks. The fix is an escalation path that lifts the cap for the session with a one-click reason.

Lever 3: Tier routing on input-token threshold

The highest-use lever. Route turns under 10K input tokens to Haiku, 10K-50K to Sonnet, 50K+ to Opus. Input size is a reasonable proxy for task complexity in coding work, more reliable than any other automatic signal.

Realistic savings: 25% to 35% reduction in model spend. The catch is that routing decisions need to be reversible at the session level, a developer flagged “this is hard, go up a tier” should override the policy. The router needs a quality-feedback mechanism so that if Haiku produces worse answers, the policy learns to bump those turns up.

A naive tier-routing policy ships in a week. A policy that adapts based on observed quality scores ships in a quarter and is the durable lever.

Lever 4: Tool-set scoping with per-task MCP enablement

Most sessions need three tools. Most sessions have twelve loaded. Cut the per-session tool set to what the task actually needs. Simplest implementation: a per-project tool manifest, each repo declares the MCPs it uses, and the gateway loads only that set.

Realistic savings: 40% to 60% reduction in tool-description tokens per turn, $25K to $50K/month on a 30-engineer team, because tool descriptions repeat across every turn of every session. Cost: a one-day project per repo to declare the tool set. Pays back in week one.

Second-order improvement: once the tool set is stable, the gateway can cache the tool block aggressively, dropping the cached input cost by another 90% on cached tokens.

Lever 5: Session-resume default + idle-session reaping

Change the team default from “new session” to “resume last session, prompt if stale.” Most agents support resumption; most developers don’t use it because the muscle memory is to open a new terminal. A team-level wrapper that defaults to resume cuts session count by 40% to 55% for the same volume of work.

Pair with an idle-session reaper that closes sessions after 45 minutes of inactivity. Avoids the worst case where a session sits open all afternoon, accumulating cruft, then explodes when the developer asks one more question after lunch.

Realistic savings on the session tax: $1,500 to $3,000/month on a 30-engineer team. Smaller absolute number, but the highest UX-use of any lever, the team feels the difference within a week.

Two case patterns

Shapes from real 2026 audits, anonymised.

Pattern A: 45-engineer fintech, $58K/month bill

19 of 45 seats with under 10 sessions a week (seat tax: $14,200/month). 14 MCP servers org-wide, none scoped per-repo (tool tax: $32,000/month). Every turn on Opus (tier tax: $9,800/month). Average input 142K per turn, ~90K unreferenced project context (context tax: $11,500/month).

The audit revealed three of the 14 MCP servers were broken; fixing them tripled cache hit rate on the remaining tool block, converting a chunk of the tool tax into pure savings.

Six weeks after pulling all five levers: $58K/month → $31K/month. Productivity unchanged on standard proxies.

Pattern B: 22-engineer infra team, $26K/month bill

No seat tax, 21 of 22 engineers daily-active. Tier tax dominant because the team had standardised on Opus. After shipping a Haiku/Sonnet/Opus routing policy, model spend dropped 38%.

The cut: $26K/month → $17K/month in eight weeks. The team reported the visible difference was latency (Haiku ~1.2s vs Opus 3-5s), not cost. Latency was the unlock; cost was the side effect.

Pattern: in teams with high active usage, the tier tax dominates and routing is the lever. In teams with low active usage, the seat tax dominates and the audit is the lever. Identify the shape first.

The gateway’s role: why the loop matters

All five levers depend on telemetry the coding agent doesn’t produce natively. The CLI doesn’t split usage by developer, doesn’t decompose input tokens by source, doesn’t track session lifecycles in a queryable shape, doesn’t score model utilisation against a cheaper-model counterfactual. Without a layer in front of the agent that captures this data, the harness tax is invisible and the levers are ungrippable.

This is where an AI gateway earns its place. Not as a routing optimisation, as the only observation point that sees enough to find the tax. The gateway tags every call with developer identity, repo, session, and tool set. The traces decompose input tokens by source. That’s what makes the five-lever audit feasible.

Future AGI’s Agent Command Center is the natural fit for this problem because it’s built around a feedback loop, not a dashboard:

  1. Trace. Every Claude Code turn produces a span tree via traceAI (Apache 2.0). Spans capture inputs, outputs, model, tool set, session ID, developer ID, repo.
  2. Decompose. The trace identifies which fraction of input tokens came from system prompts, tool descriptions, project context, and conversation history, the five-tax decomposition.
  3. Score. fi.evals (Apache 2.0) scores each turn for task-completion, correctness, and tool-use accuracy. The scores tell the loop whether Haiku was sufficient or whether Opus was actually needed.
  4. Cluster. Agent Command Center groups low-scoring and over-spec’d sessions by failure mode. Tier tax shows up as “Opus calls where Sonnet scored identically.” Tool tax shows up as “high tool-block size with low tool invocation count.”
  5. Optimise. fi.opt.optimizers (six optimizers (ProTeGi, BayesianSearchOptimizer with Optuna, GEPAOptimizer, MetaPromptOptimizer, RandomSearchOptimizer, PromptWizardOptimizer), all sharing EarlyStoppingConfig. Apache 2.0 under agent-opt) rewrites prompts and adjusts the routing policy against the clustered failures. The most common automatic optimisation is exactly the tier-routing rule from Lever 3.
  6. Route. The gateway applies the updated policy on the next request. Protect (inline guardrails, measured at 65 ms text median for text in arxiv.org/abs/2510.13351) sits in the same lane.

A team that audits seats and ships a tier-routing policy in one week sees a step-down in the bill. A team that runs the loop sees the step-down plus continued drift downward over six months as the routing learns, the prompts tighten, and the tool sets get scoped tighter per project.

The framing matters: the gateway isn’t the solution to the harness tax. The five levers are. The gateway is what makes the levers gripable, the lens that lets you see the tax, and the lever-puller that keeps it pulled.


Sources

  • Anthropic Claude Code pricing, anthropic.com/pricing (May 2026)
  • Anthropic prompt caching documentation, docs.anthropic.com/en/docs/build-with-claude/prompt-caching
  • Model Context Protocol specification, modelcontextprotocol.io
  • Future AGI Q1 2026 coding-agent cost decomposition (22-team internal dataset)
  • Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (65 ms text / 107 ms image median time-to-label)
  • traceAI, github.com/future-agi/traceAI (Apache 2.0)
  • ai-evaluation, github.com/future-agi/ai-evaluation (Apache 2.0)
  • agent-opt, github.com/future-agi/agent-opt (Apache 2.0)

Frequently asked questions

Is the harness tax the same as 'wasted tokens'?
Not quite. Wasted tokens is a per-call concept. The harness tax is the operational shape that produces wasted tokens at scale — idle seats, oversized contexts, over-spec'd models, repeated tool descriptions, session sprawl. You cut it by changing operations, not by tweaking individual calls.
How does the tax differ for Claude Code, Cursor, and Cline?
The five forms apply to all three; proportions shift. Cursor's UI nudges resumption better than Claude Code, so the session tax is smaller. Cline pushes tool-heavy workflows by default, so the tool tax is bigger. The seat tax is universal.
Can prompt caching alone solve the tool tax?
Partially. A well-implemented cache cuts the cached portion to ~10% of uncached cost, but caches need stable inputs. Teams that swap MCP servers mid-session blow their hit rate and pay full price again. Prompt caching is a 60-80% reduction on the tool tax if you scope your tool set; it does almost nothing if you do not.
Reasonable target ratio of harness tax to total spend?
Under 10%. Above 20% means at least two of the five levers are unpulled. Above 30% means the team is operating without a gateway or without visibility into the gateway's data.
Does the tier tax break Opus-trained workflows?
Only if you route Opus turns down without a quality feedback loop. The risk is a complex refactor gets routed to Sonnet and produces a subtly wrong answer the developer accepts. The mitigation is to evaluate sampled turns from each tier and bump the threshold when quality drops. The five-lever framework specifies the eval as a precondition for shipping the routing, not an optional add-on.
How long does it take to measure the tax on a new team?
Three weeks. Week one to instrument the gateway. Week two to accumulate data for the four-layer measurement. Week three to produce the report. Cutting the tax takes another two to six weeks depending on levers.
Is there a self-hosted path that does not send traces to a third party?
Yes. `traceAI`, `ai-evaluation`, and `agent-opt` are all Apache 2.0 under `future-agi` on GitHub. Run the whole loop in your VPC. Hosted Agent Command Center adds the failure-cluster UI, RBAC, and procurement story; the underlying capability is open-source.
Related Articles
View all
The Comprehensive Guide to LLM Security (2026)
Guides

LLM security is four layers — input, output, retrieval, tool-call. Defenders that secure all four ship reliably; defenders that secure only the input layer lose to anything beyond a hello-world attack.

NVJK Kartik
NVJK Kartik ·
17 min