What Is Code Execution as an Eval Metric?
An LLM-evaluation metric that runs generated code against tests or expected outputs to score whether the code actually works.
Code execution as an eval metric scores generated code by running it in a controlled runtime and checking tests, outputs, or runtime errors. It appears in eval pipelines for coding assistants, SQL generators, code interpreters, and agents that write scripts or tool calls. FutureAGI surfaces the detection side as eval:ContainsCode through the ContainsCode evaluator, which flags code-bearing responses; teams then attach sandbox pass rates, failure classes, and regression thresholds to those traces.
Why It Matters in Production LLM and Agent Systems
Text-only scoring fails on code because programs can look plausible and still fail at the first import. A model can produce Python that matches the reference structure but uses the wrong library version, mutates the wrong column, returns a string instead of a number, or passes the sample inputs while failing hidden edge cases. The visible failure modes are false-positive correctness and unsafe side effects: the eval says “good enough,” then production throws an exception or changes data that should have stayed read-only.
Developers feel the pain as flaky CI evals, rising syntax-error counts, assertion failures, missing dependency errors, and timeouts. SREs see sandbox queues backing up, p99 latency spikes on code-interpreter flows, and more retries after prompt or model changes. Product teams see users rerun the same coding task because the answer compiles only in the model’s imagined environment. Compliance teams care because generated SQL, shell, or Python can cross data-access boundaries when it is executed without a constrained runtime.
This matters more for 2026 multi-step agents than for single-turn code completion. Agents often generate code as an intermediate action: query a warehouse, transform a CSV, call an API, then summarize the result. If the code step is not measured, every downstream answer inherits an unverified computation.
How FutureAGI Handles Code Execution as an Eval Metric
FutureAGI’s approach is to split code detection from execution scoring. The specific FutureAGI surface for this entry is eval:ContainsCode, exposed as the ContainsCode class in fi.evals with the metric name contains_code. The evaluator inventory defines ContainsCode as a code-presence evaluator, not a sandbox runner, so the reliable workflow is two-stage: first detect that a response contains code, then join that signal to execution results from your own controlled harness.
Concrete example: a data-analysis agent writes SQL and Python for finance analysts. The team stores prompts, expected artifacts, fixture IDs, and allowed libraries in a FutureAGI Dataset. They attach ContainsCode through Dataset.add_evaluation() to mark rows where the answer includes executable content. Those rows go to a read-only sandbox that records tests_passed, exit_code, stderr_class, execution_seconds, and whether network or file access was attempted. A release fails if code presence rises in non-code cohorts, or if sandbox pass rate drops below the task threshold.
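A minimal sketch of the sandbox side of that workflow, recording the per-row fields named above. The names run_in_sandbox, SandboxResult, and classify_stderr, the resource limits, and the stderr rules are illustrative assumptions, not a FutureAGI API:

```python
import resource
import subprocess
import time
from dataclasses import dataclass


@dataclass
class SandboxResult:
    # Mirrors the per-row fields recorded by the harness described above.
    tests_passed: bool
    exit_code: int
    stderr_class: str
    execution_seconds: float


def classify_stderr(stderr: str, timed_out: bool) -> str:
    # Bucket raw stderr into coarse failure classes for reporting.
    if timed_out:
        return "timeout"
    if "SyntaxError" in stderr:
        return "syntax_error"
    if "ModuleNotFoundError" in stderr or "ImportError" in stderr:
        return "missing_dependency"
    if "AssertionError" in stderr:
        return "assertion_failure"
    if "PermissionError" in stderr:
        return "permission_denied"
    return "none" if not stderr else "other_runtime_error"


def run_in_sandbox(test_cmd: list[str], timeout_s: int = 30) -> SandboxResult:
    def apply_limits() -> None:
        # POSIX process limits only; a production sandbox would add container
        # or VM isolation, a read-only filesystem, and no network access.
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))
        resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20, 512 * 2**20))

    start = time.monotonic()
    timed_out, exit_code, stderr = False, -1, ""
    try:
        # test_cmd is e.g. ["pytest", "-q", "tests_for_row.py"]; exit code 0
        # means every required test passed.
        proc = subprocess.run(
            test_cmd, capture_output=True, text=True,
            timeout=timeout_s, preexec_fn=apply_limits,
        )
        exit_code, stderr = proc.returncode, proc.stderr
    except subprocess.TimeoutExpired:
        timed_out = True
    return SandboxResult(
        tests_passed=(exit_code == 0),
        exit_code=exit_code,
        stderr_class=classify_stderr(stderr, timed_out),
        execution_seconds=time.monotonic() - start,
    )
```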
For live agents instrumented with traceAI-langchain, the same result can be attached back to the relevant agent step and surfaced in Agent Command Center as a post-guardrail or model fallback trigger. Unlike HumanEval, which reports benchmark pass@k for coding problems, this workflow scores your actual tasks, prompts, dependencies, and trace context.
How to Measure or Detect It
Measure code execution as a two-stage signal: code presence first, execution result second.
- fi.evals.ContainsCode — reports whether an output contains code-like content; use it to route responses into code-specific eval paths.
- Sandbox pass rate — percent of code-bearing rows where all required tests pass under the pinned runtime, dependency lockfile, and resource limits.
- Runtime failure class — group failures into syntax error, assertion failure, timeout, missing dependency, permission denied, and unsafe operation.
- Eval-fail-rate by cohort — dashboard pass rate by model, prompt version, task type, language, and fixture set.
- User-feedback proxy — track thumbs-down, manual rerun, revert, and escalation rate for code-bearing answers.
Minimal wiring:
```python
from fi.evals import ContainsCode

# Stage 1: detect code-bearing responses and route them to the sandbox path.
code_detector = ContainsCode()

# Stage 2: gate releases on sandbox results joined back to those rows.
release_gate = {
    "presence_eval": code_detector.eval_name,
    "required_sandbox_pass_rate": 0.95,
    "failure_fields": ["exit_code", "stderr_class", "tests_passed"],
}
```
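Downstream of that gate, the per-row sandbox records can be rolled up into the cohort metrics listed above. The row shape and cohort key in this sketch are illustrative assumptions:

```python
from collections import Counter, defaultdict


def summarize_by_cohort(rows: list[dict]) -> dict:
    # Each row is assumed to carry: "cohort", "contains_code",
    # "tests_passed", and "stderr_class".
    by_cohort = defaultdict(list)
    for row in rows:
        by_cohort[row["cohort"]].append(row)

    summary = {}
    for cohort, cohort_rows in by_cohort.items():
        code_rows = [r for r in cohort_rows if r["contains_code"]]
        summary[cohort] = {
            "code_presence_rate": len(code_rows) / len(cohort_rows),
            "sandbox_pass_rate": (
                sum(r["tests_passed"] for r in code_rows) / len(code_rows)
                if code_rows else None
            ),
            "failure_classes": Counter(
                r["stderr_class"] for r in code_rows if not r["tests_passed"]
            ),
        }
    return summary


def passes_gate(summary: dict, gate: dict) -> bool:
    # Fail the release if any code-bearing cohort drops below the
    # required_sandbox_pass_rate from release_gate above.
    for stats in summary.values():
        rate = stats["sandbox_pass_rate"]
        if rate is not None and rate < gate["required_sandbox_pass_rate"]:
            return False
    return True
```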
Common Mistakes
- Counting code presence as correctness. ContainsCode tells you code is present; it does not prove the program runs.
- Testing only sample inputs. Public examples miss boundary cases, empty inputs, dependency drift, and hidden type assumptions.
- Running generated code without a read-only sandbox. Code evals need network, file, CPU, memory, and secret-access limits.
- Using one global pass rate. Track pass rate by language, task type, fixture, model, and prompt version.
- Ignoring non-code cohorts. A code-writing agent that emits scripts for plain explanations has a product-quality regression.
Frequently Asked Questions
What is code execution as an eval metric?
Code execution as an eval metric scores generated code by running it in a controlled runtime against tests, examples, or expected outputs. It is used when textual similarity is not enough to tell whether a program works.
How is code execution different from exact match?
Exact match compares generated text with a reference string. Code execution runs the generated program and checks behavior, so two different implementations can both pass if they produce the expected result.
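A toy illustration with hypothetical functions: the two implementations below differ textually, so exact match can credit at most one of them, while an execution check accepts both.

```python
def reference_impl(values):
    return sum(values) / len(values)


def generated_impl(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)


# Exact match on the source text fails, but the behavioral check passes.
assert generated_impl([1, 2, 3]) == reference_impl([1, 2, 3]) == 2.0
```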
How do you measure code execution as an eval metric?
Use FutureAGI's `ContainsCode` evaluator to route code-bearing responses into a sandbox harness. Track pass rate, failed tests, exit code, timeout rate, and regression thresholds by task cohort.