Guides

Evaluating Coding Agents in 2026: A Five-Dimension Eval for Your Codebase

Public SWE-bench scores don't transfer to your codebase. Evaluate coding agents on golden-PR replay, tool calls, multi-file coherence, plan, and rollback.

·
Updated
·
10 min read
coding-agents agent-evaluation swe-bench claude-code cursor cline aider 2026
Editorial cover image for Evaluating Coding Agents in 2026
Table of Contents

SWE-bench Verified is the starting line, not the finish. A 70 percent score on a public benchmark and a 30 percent pass rate on your codebase are both real numbers about the same agent. The eval that decides whether Cursor, Claude Code, Cline, Aider, Continue, or Replit ships against your repo is not a public leaderboard. It’s a golden-PR replay against your last 50 merged pull requests, scored on tool-call correctness, multi-file coherence, plan coherence, and rollback discipline. This post is the methodology, the FAGI primitives that ground it, and the parts you actually own as a buyer.

Why SWE-bench scores don’t transfer to your codebase

Public benchmarks are designed to be neutral. Your codebase is the opposite. SWE-bench Verified runs on a fixed set of GitHub Python issues across twelve open repos. Your repo has framework conventions the benchmark doesn’t know, internal libraries that aren’t on PyPI, a review bar that would reject half the diffs the public set accepts, and dependencies the agent has never seen. The same model that posts 67 percent on SWE-bench Verified will solve a different fraction of your tickets, and the gap is rarely small.

Four reasons the score doesn’t transfer:

Contamination risk. SWE-bench items pre-date most current model cutoffs. The agent may have seen the canonical patches during training, which inflates the score and tells you nothing about novel work. The LLM benchmarks vs production evals post walks through the contamination detection patterns and why you treat any pre-cutoff public score as diagnostic, not procurement-grade.

Surface mismatch. SWE-bench gives the agent a clean repo, a test command, and an issue. Your engineers give the agent a half-loaded VS Code workspace with a partial diff, a Slack thread, and a comment from a reviewer. The agent that wins the bench-rig loses the workflow rig and vice versa.

Idiom drift. Every codebase has conventions a public benchmark can’t model: your custom ORM wrapper, your in-house decorator, your test fixture style, your forbidden imports. Agents that score well on idiomatic-Python tasks can produce diffs your reviewers reject on first read.

Review-bar gap. SWE-bench passes a patch if it makes the failing test pass. Your team passes a patch if it makes the failing test pass and doesn’t break unrelated tests, doesn’t reformat the file, doesn’t add unrelated imports, and is small enough to review in 90 seconds. Most public benchmarks score the first condition. None score the rest.

Vendor benchmarks provide signals. The eval that matters runs on your code, with your review bar, on the tickets you actually ship. Pair this post with agent observability vs evaluation vs benchmarking for the full distinction.

The five-dimension eval

Five dimensions. Each scores a failure mode that public benchmarks miss. Run all five on every replay.

DimensionWhat it answersGrounded on
Golden-PR replayDoes the agent’s diff pass your review bar against a known-good PR?TaskCompletion + diff comparator
Tool-call correctnessDid search, edit, run-tests, and commit happen in the right order?EvaluateFunctionCalling over the trace
Multi-file coherenceDid the agent update every call site, import, and config that needed it?CustomLLMJudge walking the import graph
Plan coherenceDid the agent’s written plan match the user intent and stay stable across edits?CustomLLMJudge on plan vs final diff
Rollback disciplineDid the agent revert broken changes or pile on top of test failures?Trace-timeline analysis + edits-per-failure ratio

Dimensions one through three gate correctness. Plan coherence gates intent. Rollback discipline gates production safety. The next five sections show how each one composes from primitives the ai-evaluation SDK ships out of the box.

1. Golden-PR replay: your last 50 merges as a test set

The strongest signal you can build against a coding agent is the work you’ve already merged. Your last 50 PRs ship with everything a public benchmark can’t supply: real intent, reviewer feedback, a passing test suite, and a known-good diff. The replay is the contract.

The protocol:

  1. Pick the last 50 merged PRs, skipping version bumps and generated files. Keep the mix close to what you actually ship.
  2. For each PR, capture the parent commit, the issue text or commit message, the merged diff, and the test deltas.
  3. Revert the repo to the parent. Hand the issue or commit message to the candidate agent. Let it run unobserved against your sandbox.
  4. Diff the agent’s patch against the merged patch. Score by review-bar match, not exact-text match.

The judge is the part most teams get wrong. Exact-match diff scoring punishes agents for plausible variations a human reviewer would have merged. The cleaner pattern is TaskCompletion calibrated against the user intent in the PR description, paired with a coherence judge that compares the agent’s diff to the merged diff for behavioral equivalence.

from fi.evals import Evaluator, TestCase
from fi.evals.templates import TaskCompletion, CustomLLMJudge

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

replay_judge = CustomLLMJudge(
    name="golden_pr_replay",
    grading_criteria=(
        "Score 1.0 if the agent's patch would pass the same code review "
        "as the merged patch. Behavioral equivalence over textual match. "
        "Subtract 0.3 if the agent added unrelated reformats. Subtract "
        "0.5 if the agent's patch fails any test the merged patch passes. "
        "Subtract 0.2 if the diff is more than 2x the size of the merged "
        "diff for the same outcome."
    ),
)

result = evaluator.evaluate(
    eval_templates=[TaskCompletion(), replay_judge],
    inputs=[TestCase(
        input=pr.issue_text,
        output=agent_final_patch,
        context=pr.merged_diff,  # the known-good baseline
    )],
)

The set ages well. Refresh weekly by promoting one freshly merged PR into the golden set and dropping the oldest. For the broader design pattern, see LLM eval golden set design.

2. Tool-call correctness: read, edit, test, commit

A coding agent is a tool-using loop, not a string generator. The agent reads files, edits them, runs tests, reacts to failures, and commits the work. Tool-call correctness asks whether each call happened, in the right order, with the right arguments. Score it from the trace.

Four tool-call behaviors carry the weight:

  • Search before edit. Did the agent open the files it touched before it wrote to them? An agent that edits a file it never read is hallucinating against your code.
  • Edit on target. Did the patch land in the file the plan named? Mis-targeted edits are the single largest source of cross-file regressions.
  • Run tests after edit. Did the agent run the test runner after the edit, or did it commit blind? Agents that don’t test are not agents; they’re text editors with extra steps.
  • Commit only what changed. Did the commit boundary match the edit boundary, or did the agent stage unrelated files?

EvaluateFunctionCalling fits all four cleanly because the test runner result, the file targets, and the commit set are deterministic. Trace the session with traceAI, then score against the call sequence:

from fi.evals.templates import EvaluateFunctionCalling

tool_call_eval = EvaluateFunctionCalling(
    grading_criteria=(
        "For each tool call in the trace, score whether the call was "
        "necessary, correctly targeted, and ordered with respect to the "
        "rest of the loop. Score 1.0 if every file edit was preceded by "
        "a file read, the test runner was invoked after the final edit, "
        "and the commit contained only files the plan named."
    ),
)

The auto-instrumentors emit fi.span.kind as AGENT for the top-level plan, CHAIN for sub-plans, and TOOL for read, edit, run-tests, and commit. That tagging is what makes per-tool scoring composable across Cursor, Claude Code, Cline, and Aider on the same eval row. For the span shape, the Claude Code observability with OpenInference and OpenTelemetry post is the reference.

3. Multi-file coherence: walk the import graph

Single-file diffs hide the failure mode reviewers hate most. The agent renames a helper in module A and leaves seventeen broken call sites in modules B and C. The tests in module A pass. The repo is wrecked. Multi-file coherence catches it before merge.

The protocol is mechanical. Take the agent’s diff, walk the import graph from every changed file, and produce a list of touched-or-should-have-touched files. For each file in the list, ask a code-review judge whether it’s consistent with the diff.

coherence_judge = CustomLLMJudge(
    name="multi_file_coherence",
    grading_criteria=(
        "Walk the import graph from every changed file. For each "
        "downstream file, score consistency with the diff. Score 1.0 if "
        "all imports resolve, all symbols referenced exist, no public "
        "API surface is silently removed, and no call site of a renamed "
        "symbol is left stale. Subtract 0.4 per broken import. Subtract "
        "0.6 per stale call site of a renamed public symbol. Subtract "
        "0.3 per config or environment update that should have followed "
        "the code change but didn't."
    ),
)

Multi-file coherence is where Cursor’s Composer, Claude Code’s plan-first loop, Cline’s MCP-driven loop, and Aider’s repo-map approach show their real differences. None of these strategies is correct in the abstract; the eval tells you which one fits your codebase.

4. Plan coherence: did the plan survive the loop?

The 2026 agents emit a written plan before they touch the repo. Claude Code does it by default. Cursor’s agent mode does it on demand. Aider’s architect mode separates planning from editing. The plan is content. Score it.

Plan coherence asks two questions: does the plan match the user’s intent at the start, and does the final diff still match the plan at the end? Agents drift mid-loop. Sub-task three quietly contradicts the user’s request. The unit tests pass. The final patch is wrong. A rubric that scores intent against the final diff, and the final diff against the plan, catches drift early.

plan_judge = CustomLLMJudge(
    name="plan_coherence",
    grading_criteria=(
        "Score the agent's plan against the user intent. Then score the "
        "final diff against the plan. Score 1.0 if both alignments are "
        "exact. Subtract 0.3 if the plan dropped a requirement from the "
        "intent. Subtract 0.5 if the final diff added behavior the plan "
        "did not promise. Subtract 0.4 if the plan and diff diverged "
        "after a mid-loop tool failure."
    ),
)

This is where senior engineers earn their pay in code review, and it’s where the eval earns its keep. Plan coherence is also the strongest predictor of which agent your reviewers will trust. Power users describe a “plan-first workflow: measure 15 times, cut once,” and the agents that ship that pattern out of the box score consistently higher on this axis.

5. Rollback discipline: revert or pile on?

The hardest signal to measure from a chat rubric is whether the agent knows when to undo. Rollback discipline is the behavior at the moment tests fail after an edit. The disciplined agent reverts the broken change and tries a different approach. The undisciplined agent pours more code on top of the failure until the test suite is in pieces.

Score it from the trace timeline. Two numbers do the work: edits per test failure, and the ratio of reverts to retries. Pull both from the span sequence.

def rollback_score(trace):
    failures = [s for s in trace if s.kind == "TOOL"
                and s.name == "run_tests" and s.failed]
    edits_after_failure = []
    for f in failures:
        next_test = next((s for s in trace if s.kind == "TOOL"
                          and s.name == "run_tests"
                          and s.start > f.start), None)
        edits_between = [s for s in trace if s.kind == "TOOL"
                         and s.name in ("edit_file", "revert")
                         and f.start < s.start < (next_test.start
                         if next_test else float("inf"))]
        edits_after_failure.append(edits_between)
    avg_edits = sum(len(e) for e in edits_after_failure) / max(1, len(failures))
    reverts = sum(1 for e in edits_after_failure
                  if any(s.name == "revert" for s in e))
    revert_ratio = reverts / max(1, len(failures))
    return {"avg_edits_per_failure": avg_edits,
            "revert_ratio": revert_ratio}

Agents that average more than two edits per failure without a single revert are likely to ship regressions in production. The AI agent failure modes post catalogs the rollback failure mode at the cluster level, and the Error Feed surfaces it directly when you run replays at scale.

Production observability with traceAI

A 50-PR replay is offline. Production runs forever. The eval that survives the first month streams from the agent loop in real time, not from a notebook. traceAI auto-instruments the model calls every coding agent makes; the gateway emits cost, latency, and routing telemetry on every span.

from fi_instrumentation import register, ProjectType
from traceai_anthropic import AnthropicInstrumentor
from traceai_openai import OpenAIInstrumentor
from traceai_bedrock import BedrockInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="coding-agent-eval",
)

AnthropicInstrumentor().instrument(tracer_provider=trace_provider)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
BedrockInstrumentor().instrument(tracer_provider=trace_provider)

Once registered, every Claude Code, Cursor, Cline, or Aider session that flows through the process emits OpenInference spans with fi.span.kind set to AGENT, TOOL, or CHAIN. Route the agent’s provider traffic through gateway.futureagi.com/v1 and the spans gain cost telemetry directly:

export ANTHROPIC_BASE_URL=https://gateway.futureagi.com/v1
export ANTHROPIC_API_KEY=fagi_dev_<per-developer-virtual-key>

Every request emits headers the eval pipeline reads:

  • x-prism-cost — actual cost in cents after cache hits
  • x-prism-latency-ms — end-to-end including provider plus gateway hop
  • x-prism-model-used — what routing actually picked
  • x-prism-fallback-used — true if the primary provider was unavailable
  • x-prism-routing-strategy — cost, latency, or quality-weighted

The per-developer virtual key gives attribution without exposing the raw provider key, so “which engineer’s Cursor sessions cost the most last week” is one query, not a SIEM project. The cross-developer cache is the line item that scales hardest: when 30 engineers ask the gateway to explain the same function, one provider call serves all 30. For the cost math at scale, how to reduce Claude Code token costs 90 percent walks through the lever sizing.

Security regression: four scanners on every diff

Coding agents invent secrets and echo training-data shell patterns more often than buyers expect. Four code-relevant Scanners run sub-10ms each and slot into the same eval row:

from fi.evals import Protect
from fi.evals.guardrails.scanners import (
    SecretsScanner, CodeInjectionScanner,
    JailbreakScanner, RegexScanner,
)

protect = Protect(scanners=[
    SecretsScanner(),
    CodeInjectionScanner(),
    JailbreakScanner(),
    RegexScanner(patterns=[
        r"import\s+deprecated_module",
        r"eval\(",
        r"subprocess\.run\([^,]*shell=True",
    ]),
])

scan_result = protect.scan(text=agent_final_patch)

SecretsScanner catches credentials the agent invented. CodeInjectionScanner flags eval, exec, shell interpolation, and template injection. JailbreakScanner runs against comments and docstrings, since multi-turn sessions sometimes round-trip user content into generated text. RegexScanner enforces the repo’s banned patterns. The OWASP LLM Top 10 maps every scanner to a specific category.

Where Future AGI fits

Future AGI isn’t a coding agent. It’s the eval, observability, and gateway stack for teams that build them, operate them, or pick between them.

  • traceAI — Apache 2.0 OpenTelemetry SDK; 50+ AI surfaces across Python, TypeScript, Java; auto-instruments Anthropic, OpenAI, Bedrock, LangChain, and others; PII redaction on by default.
  • ai-evaluation — Apache 2.0 SDK with 50+ pre-built evaluators plus 20+ local heuristic metrics; the five-dimension eval above composes from TaskCompletion, EvaluateFunctionCalling, CustomLLMJudge, and four scanners.
  • Agent Command Center — OpenAI-compatible LLM gateway in a single Go binary (Apache 2.0); 100+ providers, 18+ built-in guardrail scanners + 15 third-party adapters, OTel-native observability. Self-host or use gateway.futureagi.com/v1.

Future AGI runs SOC 2 Type II, HIPAA, GDPR, and CCPA per futureagi.com/trust, with ISO/IEC 27001 in active audit. The Platform’s self-improving evaluators retune thresholds per-agent over time, so a Cursor judge and a Claude Code judge converge to different baselines without you maintaining two prompts by hand. The Error Feed clusters failures across replays with HDBSCAN over ClickHouse and a Sonnet 4.5 Judge that writes an immediate_fix per cluster.

The honest boundary: the agent vendor owns the prompt. You own the trace, the eval, the cost, and the routing. For the optimizer side, see agent metrics frameworks 2026.

The shortest version

A 70 percent SWE-bench Verified score doesn’t survive contact with your repo. Build a golden-PR replay from your last 50 merged PRs, instrument the agent with traceAI, and score the run on five dimensions: golden-PR replay, tool-call correctness, multi-file coherence, plan coherence, and rollback discipline. Ground each dimension on a primitive the ai-evaluation SDK ships out of the box. Route the agent’s provider traffic through the gateway so cost, latency, and routing land on the same eval row. Run the replay weekly. The agent vendor owns the prompt. You own everything else, and the everything-else is what decides whether the agent ships against your code.

Frequently asked questions

Why don't SWE-bench scores transfer to my codebase?
SWE-bench Verified runs on a fixed set of GitHub Python issues from twelve public repos. Your codebase has different framework conventions, custom decorators, internal libraries that aren't on PyPI, and review standards the public set doesn't model. A 70 percent SWE-bench score and a 30 percent pass rate on your repo are both real numbers about the same agent. The public benchmark tells you the agent can solve issues. It does not tell you it can solve yours. Run a golden-PR replay against your last 50 merged PRs before you commit to a vendor.
What is a golden-PR replay?
A golden-PR replay treats your last 50 merged pull requests as a labeled test set. For each PR you have a known-good outcome: the actual merged diff, the tests that ran, the reviewer comments, the files that changed. You revert the repo to the parent commit, hand the issue or commit message to a coding agent, let it run, then diff the agent's patch against the merged patch. The score is not exact match; it is whether the agent's diff would have passed your review bar. Build the set once and you can compare Cursor, Claude Code, Cline, Aider, Continue, and Replit on the same workload.
How do I evaluate tool-call correctness for a coding agent?
Capture every tool call the agent makes through an OpenTelemetry instrumentor, then score four behaviors. Did the agent search before it edited? Did the edit target the right file? Did it run tests after the edit? Did it commit only what it changed? Tool-call correctness is what separates an agent that solves the task from one that solves it the same way a senior engineer would. Code-aware judges read the call sequence and grade against your team's actual workflow, not against the model's confidence score.
What is multi-file coherence?
Multi-file coherence asks whether the agent understands cross-file dependencies. If a function signature changes in module A, did the agent update every call site in modules B and C? If a constant moves from config to environment, did the agent update the docker compose? Single-file diffs hide the failure mode that hurts most in code review. Score multi-file coherence by walking the import graph from the changed files and asking a code-review judge whether each touched-or-should-have-touched file is consistent with the diff.
Why is rollback discipline a separate axis?
A coding agent that knows when to undo is dramatically more useful than one that always pushes forward. Rollback discipline is the agent's behavior when tests fail after an edit: does it revert the broken change and try a different approach, or does it pile more code on top of the failure until the test suite is wrecked? Score this by looking at the trace timeline. Count the edits per test failure. Agents that average more than two edits per failure without a revert are likely to ship regressions in production.
How does Future AGI's stack fit a coding-agent eval?
traceAI auto-instruments Anthropic, OpenAI, and Bedrock calls, so every Claude Code, Cursor, or Cline session emits OpenInference spans with fi.span.kind set to AGENT, TOOL, or CHAIN. The ai-evaluation SDK runs the five dimensions as composable evaluators: TaskCompletion for golden-PR replay, EvaluateFunctionCalling for tool-call correctness, CustomLLMJudge for multi-file coherence and plan quality, and four code-relevant Scanners for security regression. The Agent Command Center gateway emits cost, latency, and routing headers so per-developer attribution lands on every span without a SIEM. You own the trace, the eval, and the cost telemetry; the agent vendor owns the prompt.
How big should my golden-PR set be?
Start with your last 50 merged PRs. Skew toward the mix you actually ship: bug fixes from the issue tracker, small refactors, dependency upgrades, and the occasional new feature. Avoid trivial commits like version bumps or generated files; those don't probe agent behavior. Fifty PRs is enough to catch first-order differences between agents. Grow to 200 once you've decided which agent makes the shortlist and you need to discriminate between similar candidates. Refresh weekly by promoting one freshly merged PR into the set and dropping the oldest one.
Related Articles
View all