What Is HumanEval?
HumanEval is a Python code-generation benchmark scored by running generated functions against unit tests.
HumanEval is a code-generation benchmark released by OpenAI in 2021 that tests whether an LLM can complete Python functions so the result passes hidden unit tests. A prompt contains a function signature, a docstring, and sometimes example calls; the model emits a function body; an executor runs each candidate against a held-out test suite that never appears in the prompt and scores pass@1 or pass@k. HumanEval is a single point in a much larger 2026 code-evaluation landscape, and as of May 2026 it is saturated: every frontier coding model (GPT-5.x, Claude Opus 4.7, Gemini 3 Ultra, Llama 4) sits above 96% pass@1 on the 164 original problems, which means the benchmark no longer discriminates between top-tier systems. A FutureAGI evaluation workflow uses HumanEval as a tier filter and contamination canary, not as a release gate.
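To make the prompt, completion, and test flow concrete, here is a minimal sketch of a HumanEval-style check. The task, the tests, and the `passes` helper are hypothetical illustrations rather than items from the real dataset, and plain `exec` is used only for brevity; it is not a sandbox.

```python
# A hypothetical HumanEval-style task: prompt, body, and tests are illustrative
# and are not taken from the real dataset.
PROMPT = '''def running_max(numbers):
    """Return a list where element i is the max of numbers[:i + 1].

    >>> running_max([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
'''

# What a model might emit: only the function body, indented to fit the prompt.
CANDIDATE_BODY = '''    result, best = [], None
    for n in numbers:
        best = n if best is None or n > best else best
        result.append(best)
    return result
'''

# Unit tests that never appear in the prompt.
TESTS = '''
def check(candidate):
    assert candidate([]) == []
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([-2, -5, -1]) == [-2, -2, -1]
'''


def passes(prompt: str, body: str, tests: str, entry_point: str) -> bool:
    """Return True if prompt + body defines a function that passes the tests.

    exec() is NOT a sandbox; a real harness isolates untrusted code.
    """
    namespace: dict = {}
    try:
        exec(prompt + body, namespace)   # define the completed function
        exec(tests, namespace)           # define check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False


solved = passes(PROMPT, CANDIDATE_BODY, TESTS, "running_max")
print(solved)
```

A problem counts toward pass@1 only when the first sampled completion makes a check like this return `True`.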
The short rule for a senior engineer in May 2026: if a coding-model vendor leads its 2026 pitch with HumanEval pass@1, treat it the way you would treat a 2021 paper leading with BLEU on translation. Useful for continuity, irrelevant for ranking frontier systems, and dangerous as a production proxy for a coding agent.
Why HumanEval matters in production LLM and agent systems
HumanEval still catches a specific failure that broad chat benchmarks miss: the model can explain code fluently and still emit code that fails at runtime. In a coding assistant or coding agent, that failure shows up as syntax errors, wrong edge-case handling, missing imports, off-by-one logic, hallucinated standard-library calls, or functions that pass the visible example and fail hidden cases. Developers feel the pain first because they have to debug generated patches. SREs feel it later when an agent ships a faulty migration, retries a broken script, or burns tool budget rerunning a failing command. Product teams feel it when AI coding help looks impressive in demos and unreliable on repeated work.
HumanEval also has a second purpose in 2026 production stacks: it is a fast contamination probe. Because the dataset is public and small, frontier model cards report HumanEval as a sanity check; if a new model’s pass@1 is suspiciously close to 100% but its score on a fresh benchmark like LiveCodeBench is much lower, the gap is a contamination signal. FutureAGI’s view is that HumanEval is most valuable for what it fails to tell you in 2026: it cannot distinguish frontier-tier coding ability anymore, but it remains a useful trigger for “this model’s training data probably saw this dataset”.
The benchmark is especially relevant for 2026-era agentic pipelines because coding agents rarely make one model call. They inspect files, write patches, call tools, run tests, read failures, and edit again. A 96%+ HumanEval score says the base model can synthesize small Python functions from clean prompts. It does not say the agent can choose the right file, preserve project conventions, satisfy a typed API, or avoid a destructive tool call. Useful symptoms to log in production are unit-test pass rate, compile-error rate, average repair attempts, tool-call success rate, trajectory-level task-completion scores, and eval-fail-rate-by-cohort after a model or prompt change. Compared with MMLU, HumanEval is closer to execution; compared with SWE-Bench, it is smaller and less representative of whole-repository maintenance.
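As a rough sketch of how the production symptoms listed above could be derived from logged agent runs, the example below aggregates a hypothetical `AgentRun` record. The field names are assumptions made for illustration, not a FutureAGI or traceAI schema.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """One coding-agent task attempt (hypothetical record shape)."""
    compiled: bool        # did the final patch compile or import cleanly
    tests_passed: bool    # did the unit tests pass
    repair_attempts: int  # edit-and-retry cycles before the final result
    tool_calls: int       # total tool calls issued
    tool_calls_ok: int    # tool calls that returned successfully


def summarize(runs: list[AgentRun]) -> dict[str, float]:
    """Aggregate per-run records into rates; assumes a non-empty list."""
    n = len(runs)
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "unit_test_pass_rate": sum(r.tests_passed for r in runs) / n,
        "compile_error_rate": sum(not r.compiled for r in runs) / n,
        "avg_repair_attempts": sum(r.repair_attempts for r in runs) / n,
        "tool_call_success_rate": sum(r.tool_calls_ok for r in runs) / total_calls,
    }


runs = [
    AgentRun(True, True, 0, 4, 4),
    AgentRun(True, False, 2, 10, 8),
    AgentRun(False, False, 3, 6, 3),
    AgentRun(True, True, 1, 5, 5),
]
summary = summarize(runs)
print(summary)
```

Grouping the same records by model or prompt version before calling `summarize` gives a per-cohort view of the same rates.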
HumanEval vs. the 2026 coding-benchmark landscape
The honest table: HumanEval is one of the ten coding benchmarks below that a serious 2026 stack should track, and it carries the least signal of them.
| Benchmark (May 2026) | What it tests | Frontier ceiling | Use as |
|---|---|---|---|
| HumanEval | 164 Python function completions, hidden tests | 96–99% pass@1 | Tier filter, contamination probe |
| MBPP | 974 entry-level Python tasks | 90%+ | Tier filter |
| HumanEval+ | HumanEval with extra hidden tests | 90–94% | Contamination check |
| LiveCodeBench | Fresh competitive programming, filtered by date after model cutoff | 65–75% | Release gate for code generation |
| SWE-Bench Verified | 500 human-verified GitHub bug fixes; whole-repo edits | 70–78% | Release gate for coding agents |
| Aider Polyglot | Multi-language code editing with edit-and-test cycles | 65–80% | Release gate for editor-style assistants |
| BigCodeBench | Function calls with libraries; harder than HumanEval | 60–75% | Tier filter for library-aware coding |
| SciCode | Implement scientific algorithms from papers | 30–50% | Research-coding tier filter |
| MLE-Bench | 75 Kaggle-style ML engineering tasks | 20–40% | Agent benchmark for ML work |
| RE-Bench | Research-coding tasks, expert-graded | <30% | Frontier headroom probe |
The takeaway is that single-function completion no longer measures the bottleneck. The bottleneck is multi-file editing, tool use, and trajectory-level repair, exactly what SWE-Bench Verified, Aider Polyglot, and LiveCodeBench measure. HumanEval keeps its seat at the table because it is cheap, fast, and historically tied to model-card continuity, not because it discriminates.
How FutureAGI handles HumanEval
HumanEval has no dedicated FutureAGI evaluator class in the inventory, and it does not need one: execution against unit tests is the correct scoring method, and FutureAGI does not try to replace pytest with LLM-as-a-judge. FutureAGI’s approach is to treat HumanEval as a benchmark dataset that belongs inside a broader evaluation workflow, not as a standalone release gate.
A model team loads HumanEval-style prompts into fi.datasets.Dataset, stores the generated function body as the response, attaches the expected test outcome as the target, and uses Dataset.add_evaluation to track pass/fail results by model, prompt version, and sampling configuration. For production coding agents, the same team instruments the agent with traceAI-openai, traceAI-anthropic, or traceAI-claude-agent-sdk, and connects offline benchmark failures to live spans that include llm.token_count.prompt, gen_ai.request.model, tool calls, and test-run status. The result is one trace view that links “HumanEval pass@1 dropped” to “candidate model X’s agent runs in production fail twice as often on file-edit tool calls”, a connection a static benchmark report cannot make.
Concretely, the FutureAGI pattern is to run nightly HumanEval and LiveCodeBench regression on candidate models, record pass@1 and compile-error rate, alert when pass@1 drops more than five points for any cohort, and only then promote to the deeper SWE-Bench Verified-style evaluation that runs the agent end-to-end. The nearest FutureAGI evaluator classes (ContainsCode for code-presence sanity checks, GroundTruthMatch for deterministic expected-output tasks, JSONValidation for structured outputs, FunctionCallExactMatch for AST-based exact match, and ToolSelectionAccuracy for agent-level tool choice) are adjacent checks, not HumanEval replacements. They are useful when a coding agent workflow also emits structured answers, explanations, or tool-call metadata.
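The "alert when pass@1 drops more than five points for any cohort" rule can be sketched in a few lines of plain Python. The function name, the dictionary shape, and the sample scores below are assumptions for illustration; this is not a FutureAGI API.

```python
def regressed_cohorts(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop_points: float = 5.0,
) -> dict[str, float]:
    """Return cohorts whose pass@1 fell by more than max_drop_points.

    Scores are fractions in [0, 1]; drops are reported in percentage points.
    """
    alerts = {}
    for cohort, base_score in baseline.items():
        if cohort not in candidate:
            continue  # cohort was not evaluated for the candidate
        drop = (base_score - candidate[cohort]) * 100
        if drop > max_drop_points:
            alerts[cohort] = round(drop, 1)
    return alerts


# Illustrative nightly numbers, not measured results.
baseline = {"humaneval": 0.97, "livecodebench": 0.71}
candidate = {"humaneval": 0.96, "livecodebench": 0.64}

alerts = regressed_cohorts(baseline, candidate)
print(alerts)
```

In this made-up run the HumanEval cohort barely moves while the LiveCodeBench cohort trips the alert, which is the pattern a saturated canary tends to produce.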
If a deployment still passes HumanEval but live traces show rising test retries, route the release through Agent Command Center with the model fallback control or a stricter pre-deployment regression gate. We’ve found that the highest-signal correlation in 2026 coding-agent data is between offline LiveCodeBench scores and production tool-retry rates. HumanEval correlates weakly with both because it has saturated.
Where HumanEval still pays its rent
Three workflows in May 2026 where HumanEval remains useful:
- Contamination canary: a new fine-tuned model claims +6 points on a domain benchmark. Run HumanEval against the base and tuned variants; if pass@1 also jumps, the fine-tune likely overlapped with public coding data and the domain gain is partially memorisation.
- Prompt-regression sanity check: a prompt-template change should not crater HumanEval pass@1. If it does, the new template is fighting the model’s instruction-following defaults.
- Decoder-parameter sweep: when picking temperature, top-p, and max-tokens for a coding route, HumanEval pass@1 is a cheap, repeatable signal.
For everything else (agent quality, repository understanding, refactoring safety, library API correctness), use LiveCodeBench, SWE-Bench Verified, Aider Polyglot, and a domain golden dataset scored through FutureAGI evaluators.
A note on HumanEval+ and EvalPlus
HumanEval’s original test suite is small (an average of 7.7 tests per problem) and known to under-discriminate; some buggy completions pass the original tests but would fail under stricter checking. The HumanEval+ extension and the broader EvalPlus project add 80x more tests on average and catch substantially more silent failures. In our 2026 evals across frontier models, HumanEval+ exposes a 3–8 point drop versus HumanEval pass@1; the gap is the “test-suite tightness” tax. Any team relying on HumanEval as a contamination canary should pair it with HumanEval+: if a model’s HumanEval is 99% and its HumanEval+ is 89%, the model is over-fit to the original test pattern, not necessarily to the data. This is a different failure mode from training-set contamination and matters for production calibration.
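A toy illustration of the test-suite tightness effect: the completion below looks plausible and passes a small suite, but fails once edge cases are added. Both suites and the function are hypothetical and are not drawn from HumanEval or EvalPlus.

```python
def buggy_median(values):
    """Plausible-looking completion that forgets to average the two
    middle values when the input has an even length."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def run_suite(fn, cases) -> bool:
    """Return True only if fn matches the expected output on every case."""
    return all(fn(inputs) == expected for inputs, expected in cases)


# A small suite in the spirit of the original tests: odd-length inputs only.
base_suite = [([3, 1, 2], 2), ([5], 5)]

# A stricter suite that adds even-length inputs.
extended_suite = base_suite + [([1, 2, 3, 4], 2.5), ([4, 1], 2.5)]

passes_base = run_suite(buggy_median, base_suite)
passes_extended = run_suite(buggy_median, extended_suite)
print(passes_base, passes_extended)
```

The same completion is a pass under the small suite and a fail under the extended one, which is the kind of silent failure a stricter suite is meant to surface.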
The 2026 coding-agent rollout playbook
A clean rollout playbook for any 2026 coding-agent product, with HumanEval in its proper place:
1. Tier filter: run HumanEval and HumanEval+ to confirm the candidate model is in the right tier (above 90% pass@1 means it can write Python; below that, it cannot, and no agent scaffold will save it).
2. Discriminator: run LiveCodeBench on a fresh window after the model’s training cutoff to confirm capability is not memorisation. A 30+ point gap between HumanEval and LiveCodeBench is a contamination signal; investigate before promotion.
3. Agent benchmark: run SWE-Bench Verified and Aider Polyglot against the agent scaffold (not just the base model). This is the headline number for the actual product.
4. Domain golden dataset: run the team’s own golden dataset of repository-specific tasks through the TrajectoryScore, ToolSelectionAccuracy, TaskCompletion, and StepEfficiency evaluators.
5. Traffic mirroring: once the candidate passes steps 1–4, mirror 5–10% of production traffic via Agent Command Center, evaluate the mirrored runs, and only promote when no evaluator regresses.
6. Release gate: every prompt or model change runs the regression suite from steps 1–4 and fails the build on any threshold breach.
7. Continuous monitoring: sample 5% of live traces into the eval cohort daily; the annotation queue captures failures for the next golden-dataset refresh.
HumanEval is step 1: necessary, fast, cheap, and 100% inadequate as the sole signal for steps 2–7.
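The tier-filter and discriminator steps of the playbook reduce to two threshold checks. The sketch below uses the 90% floor and the 30-point gap quoted above; the function name, return shape, and sample scores are assumptions for illustration.

```python
def tier_and_gap_check(
    humaneval_pass1: float,
    livecodebench_pass1: float,
    tier_floor: float = 0.90,
    max_gap_points: float = 30.0,
) -> dict:
    """Apply the tier floor and the HumanEval vs LiveCodeBench gap rule.

    Scores are fractions in [0, 1]; the gap is in percentage points.
    """
    gap = (humaneval_pass1 - livecodebench_pass1) * 100
    return {
        "in_tier": humaneval_pass1 >= tier_floor,
        "gap_points": round(gap, 1),
        "contamination_suspected": gap >= max_gap_points,
    }


# Illustrative scores, not measured results.
suspicious = tier_and_gap_check(humaneval_pass1=0.98, livecodebench_pass1=0.62)
healthy = tier_and_gap_check(humaneval_pass1=0.97, livecodebench_pass1=0.72)
print(suspicious)
print(healthy)
```

A candidate would move on to the agent benchmarks only when `in_tier` is true and `contamination_suspected` is false; both thresholds are starting points to tune, not fixed constants.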
How to measure or detect HumanEval performance
HumanEval is measured by execution, so the core score comes from running generated code against tests, not from an LLM judge. Wire these into the eval pipeline:
- pass@1: percentage of problems solved by the first generated completion; the cleanest release-gate number and the only frontier-relevant cut.
- pass@k: probability that at least one of k sampled completions passes; useful for search-based coding systems and best-of-n agents.
- compile-error rate: fraction of generations that fail before tests run; often reveals prompt or decoding regressions.
- eval-fail-rate-by-cohort: FutureAGI dashboard signal grouped by model, prompt version, task type, or repository; the canonical regression alarm.
- GroundTruthMatch: adjacent FutureAGI evaluator class for deterministic expected-output checks around the benchmark harness.
- ContainsCode: sanity-check evaluator that confirms the response contains a code block, useful when wrapping HumanEval with chat-style prompts.
- FunctionCallExactMatch: AST-based exact match, useful when scoring tool-call arguments that wrap HumanEval-style functions.
- JSONValidation: when HumanEval is embedded inside a structured response, validate against schema before running tests.
- Trace fields: segment by gen_ai.request.model, gen_ai.request.temperature, latency p99, and token cost; HumanEval is sensitive to decoding parameters.
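```python
from fi.evals import GroundTruthMatch, ContainsCode

ground_truth = GroundTruthMatch()
contains_code = ContainsCode()  # instantiated for illustration; not called below

# Deterministic comparison of a response against its expected value.
result = ground_truth.evaluate(
    response="return n * (n + 1) // 2",
    expected_response="return n * (n + 1) // 2",
)
print(result.score)
```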
Do not use this snippet as a HumanEval substitute. It shows how to attach a nearby deterministic evaluator; HumanEval itself still requires executing candidate code in a sandbox against the hidden test suite. The recommended 2026 path is to run HumanEval and LiveCodeBench together (HumanEval as a fast canary, LiveCodeBench as the discriminator) and then promote candidates that pass both into a SWE-Bench Verified evaluation cycle.
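The pass@1 and pass@k numbers themselves are usually computed with the unbiased estimator from the original HumanEval paper: sample n completions per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). The sketch below is a plain-Python version; the per-problem counts are made up for illustration.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled, c: completions that passed, k: sampling budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem counts: (samples drawn, samples that passed).
per_problem = [(10, 10), (10, 3), (10, 0)]

score_k1 = benchmark_pass_at_k(per_problem, k=1)
score_k5 = benchmark_pass_at_k(per_problem, k=5)
print(f"pass@1={score_k1:.3f} pass@5={score_k5:.3f}")
```

With k=1 the estimator reduces to c / n, which is why pass@1 can be read directly as the fraction of first attempts that pass.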
A cohort-filtered regression eval runs over a versioned Dataset of HumanEval-style and LiveCodeBench rows, with a sandbox executor providing the pass/fail signal and trajectory evaluators layered on top for agent-mode runs:
```python
from fi.datasets import Dataset
from fi.evals import TrajectoryScore, ToolSelectionAccuracy, TaskCompletion, AggregatedMetric
from fi.executors import PythonSandbox

regression = Dataset.load("coding-regression-v9")
coding = regression.filter(
    cohort={"$in": ["humaneval", "livecodebench_2026_03"]}
)
# The 0.95 canary threshold only makes sense for the HumanEval cohort,
# so the sandbox pass rate is computed on that cohort alone.
canary = regression.filter(cohort={"$in": ["humaneval"]})

sandbox = PythonSandbox(timeout_seconds=10, isolate=True)
pass_rate = sandbox.run_dataset(canary, model="gpt-5.1", attempts=1)

# Weighted blend of trajectory-level evaluators for agent-mode runs.
agent_agg = AggregatedMetric(
    metrics=[TrajectoryScore(), ToolSelectionAccuracy(), TaskCompletion()],
    weights=[0.4, 0.3, 0.3],
)
agent_run = agent_agg.run_dataset(coding, agent="coding-assistant-v12")

print(f"pass@1={pass_rate.score:.3f} agent_score={agent_run.score:.3f}")
assert pass_rate.score >= 0.95, "HumanEval canary regressed"
assert agent_run.score >= 0.72, "Agent trajectory regressed"
```
Common mistakes
- Calling HumanEval production readiness. It scores small Python functions, not repository navigation, package APIs, code review quality, or safe tool behaviour. In May 2026 the gap between a 99% HumanEval and a 75% SWE-Bench Verified score is the entire difference between “the model can code” and “the agent can ship”.
- Treating HumanEval as a frontier discriminator. Every top model is above 96% pass@1 in 2026; the score does not distinguish them. Use LiveCodeBench, SWE-Bench Verified, or Aider Polyglot for that.
- Reporting only pass@k. pass@k can hide weak first-attempt quality if the product cannot sample many completions; production coding assistants usually run pass@1 with a repair loop.
- Ignoring sandbox failures. Timeouts, import errors, and syntax errors should be tracked separately from wrong-answer test failures; lumping them together hides regressions in decoding.
- Mixing benchmark prompts with agent traces. HumanEval prompts are clean, single-file, and stateless. Production coding requests include files, logs, policies, repo conventions, and state.
- Using an LLM judge instead of execution. HumanEval is valuable because code either passes tests or it does not; replacing the executor with a judge model defeats the point.
- Trusting a fine-tune’s HumanEval jump. Contamination is the default assumption for any public coding benchmark in 2026; pair HumanEval with HumanEval+ and a fresh LiveCodeBench window after the model’s training cutoff.
- Skipping the agent layer. A model that solves HumanEval at 99% can still call the wrong tool, edit the wrong file, or loop on a failing test. Use ToolSelectionAccuracy, TrajectoryScore, and TaskCompletion for the agent layer.
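On the point about tracking sandbox failures separately, one possible sketch is to run each candidate program in a child interpreter and bucket the outcome. The bucket names and the `classify` helper are assumptions; a child process gives a timeout and process isolation but is not a security sandbox.

```python
import subprocess
import sys


def classify(source: str, timeout_seconds: float = 5.0) -> str:
    """Run candidate code plus its tests in a child interpreter and
    return a failure bucket instead of a bare pass/fail."""
    try:
        compile(source, "<candidate>", "exec")  # catch syntax errors before running
    except SyntaxError:
        return "syntax_error"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode == 0:
        return "passed"
    stderr = proc.stderr.strip()
    last_line = stderr.splitlines()[-1] if stderr else ""
    if last_line.startswith(("ImportError", "ModuleNotFoundError")):
        return "import_error"
    if last_line.startswith("AssertionError"):
        return "wrong_answer"
    return "runtime_error"


outcomes = {
    "ok": classify("assert 1 + 1 == 2"),
    "wrong": classify("assert 1 + 1 == 3"),
    "missing_import": classify("import no_such_module_xyz_123"),
    "bad_syntax": classify("def f(:"),
    "hang": classify("while True: pass", timeout_seconds=1.0),
}
print(outcomes)
```

Counting these buckets per model and prompt version keeps decoding regressions (syntax errors, truncated output) from being averaged into ordinary wrong-answer failures.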
Why HumanEval is not enough for coding agents
Coding agents in 2026 are evaluated by τ-bench-style trajectory benchmarks and by SWE-Bench Verified, not by HumanEval. The reason is that a coding agent makes 10–60 model calls per task: read file, search repo, plan edit, write patch, run test, read failure, repair. HumanEval measures call number one only, and it measures it on a problem with no external state. SWE-Bench Verified measures the whole loop on real GitHub bugs and is the standard headline number on every 2026 coding-model card.
The FutureAGI workflow for a coding agent looks like: HumanEval as nightly canary → LiveCodeBench as weekly discriminator → SWE-Bench Verified as monthly deep eval → live trace scoring with TrajectoryScore, ToolSelectionAccuracy, TaskCompletion, and StepEfficiency → regression eval gate on every prompt or model change → annotation queue for failures sampled from production traffic. HumanEval sits in step one and only step one.
Decoding parameters that move HumanEval, and what they hide
HumanEval is unusually sensitive to decoding parameters, which is part of why model cards continue to report it. A temperature change from 0.0 to 0.4 can move pass@1 by 4–8 points on the same model; top-p and top-k similarly shift the score. The risk is that decoding-parameter tuning on HumanEval can look like model improvement when it is just sampling-policy tuning. We’ve found that the cleanest defence in 2026 is to lock decoding parameters at the route level inside Agent Command Center, record them on every trace span via gen_ai.request.temperature and gen_ai.request.top_p, and treat any HumanEval shift larger than two points as worth investigating before it becomes a release signal. Pair this with a prompt-management workflow that versions the system prompt, the user prompt template, and the decoding configuration together: a HumanEval pass@1 change that lacks a corresponding prompt-version diff is almost always a decoding artifact, not a real capability change.
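The triage rule in the paragraph above can be sketched as a comparison of two eval-run records. The `EvalRun` shape, the label strings, and the sample values are assumptions for illustration; only the two-point threshold and the prompt-version reasoning come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """One HumanEval run with the configuration that produced it (hypothetical)."""
    prompt_version: str
    temperature: float
    top_p: float
    pass_at_1: float  # fraction in [0, 1]


def explain_shift(prev: EvalRun, curr: EvalRun, threshold_points: float = 2.0) -> str:
    """Label a pass@1 change between two runs of the same model."""
    shift = abs(curr.pass_at_1 - prev.pass_at_1) * 100
    if shift <= threshold_points:
        return "within_noise"
    if curr.prompt_version != prev.prompt_version:
        return "investigate_prompt_change"
    if (curr.temperature, curr.top_p) != (prev.temperature, prev.top_p):
        return "likely_decoding_artifact"
    return "investigate_model_change"


# Illustrative runs, not measured results.
prev = EvalRun(prompt_version="v3", temperature=0.0, top_p=1.0, pass_at_1=0.97)
curr = EvalRun(prompt_version="v3", temperature=0.4, top_p=1.0, pass_at_1=0.91)

verdict = explain_shift(prev, curr)
print(verdict)
```

Versioning the prompt and decoding configuration together is what makes this comparison possible; without both fields on every run, a six-point drop like the one above cannot be attributed.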
Frequently Asked Questions
What is HumanEval?
HumanEval is a code-generation benchmark that tests whether an LLM can complete Python functions that pass hidden unit tests. It is useful as a baseline for coding ability, not as proof that a coding agent is production-ready.
How is HumanEval different from MMLU?
HumanEval measures executable Python function completion. MMLU measures multiple-choice knowledge across many academic subjects, so it does not test whether generated code actually runs.
How do you measure HumanEval performance?
Measure HumanEval with pass@1 or pass@k over executed unit tests, then track eval-fail-rate-by-cohort in FutureAGI. The nearest FutureAGI evaluator surfaces are GroundTruthMatch and ContainsCode, but HumanEval itself is a benchmark.