Evaluation

What Is the CodeContests Coding Benchmark?

A competitive-programming benchmark of 13,500+ problems from DeepMind, scraped from Codeforces and similar sites, that evaluates LLMs on algorithmic reasoning against adversarial hidden tests.

CodeContests is a coding benchmark released by DeepMind alongside the AlphaCode paper, comprising 13,500+ competitive-programming problems scraped from Codeforces and similar sites. Each problem ships with a natural-language statement, public sample tests, hidden adversarial tests, time and memory limits, and a difficulty rating. The model reads the statement, generates code (Python, C++, or Java), and the code is executed against the hidden tests. Performance is reported as pass@k: the probability that at least one of k sampled solutions passes every hidden test. Compared with HumanEval (164 problems) or APPS (10,000 problems), CodeContests is the harder, more representative benchmark for algorithmic reasoning.
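
For a concrete look at what each record contains, the test split can be pulled straight from HuggingFace. A minimal sketch; the field names used below (description, public_tests, difficulty) follow the deepmind/code_contests dataset card, so verify them against the card for the version you load.

from datasets import load_dataset

# Pull the held-out test split (165 problems) from HuggingFace.
cc_test = load_dataset("deepmind/code_contests", split="test")

problem = cc_test[0]
print(problem["description"][:300])         # natural-language statement
print(problem["public_tests"]["input"][0])  # first sample test input
print(problem["difficulty"])                # difficulty rating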

Why It Matters in Production LLM and Agent Systems

Coding benchmarks are no longer academic. Every team shipping a coding agent — Cursor, Cline, Claude Code, OpenAI Codex, an internal SWE bot — ships a benchmark contract with their users. CodeContests probes the part of the distribution that matters: algorithmic reasoning, edge cases, time-complexity awareness, and the ability to debug from a failing public test. HumanEval saturates fast (most frontier models exceed 90% pass@1 in 2026); CodeContests still discriminates between frontier models and is a much better predictor of real coding-agent reliability.

The pain shows up across roles. An ML engineer fine-tunes a coding model and sees HumanEval pass@1 climb from 84% to 89%; CodeContests pass@10 stays flat at 14% — the fine-tune memorised easy patterns without improving algorithmic reasoning. A product lead launches a coding-agent product that scores 92% on internal HumanEval-style tests; user complaints come in immediately because the agent fails on dynamic-programming and graph problems that look easy to humans. A platform engineer notes inference cost exploded after a prompt change that boosted CodeContests by 6 points but pushed average tokens-per-trace from 1.2K to 4.8K.

In 2026 agent stacks, coding agents form trajectories: read repo, propose plan, write code, run tests, debug. CodeContests probes the write-code step at adversarial difficulty; trajectory-level evaluation is still required end-to-end.

How FutureAGI Handles the CodeContests Benchmark

FutureAGI’s approach is to treat CodeContests as a Dataset you import, evaluate, and version like any other golden test set. The HuggingFace deepmind/code_contests dataset loads directly into fi.datasets.Dataset via Dataset.import_from_huggingface. From there, a CustomEvaluation wraps a sandboxed code-execution stage that runs the model’s output against the public and hidden tests; the evaluator returns 0/1 per test plus aggregate pass@k. Pair it with TaskCompletion for trajectory-level scoring when the model is operating as an agent that runs and debugs its own code.
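
A sketch of that wiring follows. The names Dataset.import_from_huggingface and CustomEvaluation come from the description above; the argument names and the scoring hook are illustrative assumptions, not exact SDK signatures.

from fi.datasets import Dataset

# Import the benchmark test split as a versioned golden set
# (argument names here are assumptions, not the exact SDK signature).
cc_dataset = Dataset.import_from_huggingface("deepmind/code_contests", split="test")

# Hypothetical scorer for the CustomEvaluation wrapper: run the model's program
# in a sandbox against public + hidden tests, return 1 only if every test passes.
def score_solution(problem: dict, candidate_code: str, run_in_sandbox) -> int:
    tests = zip(
        problem["public_tests"]["input"] + problem["private_tests"]["input"],
        problem["public_tests"]["output"] + problem["private_tests"]["output"],
    )
    for stdin_data, expected in tests:
        if run_in_sandbox(candidate_code, stdin_data).strip() != expected.strip():
            return 0
    return 1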

Concretely: a coding-agent team imports the CodeContests test split (165 problems) into a Dataset versioned at v1. Their agent runs as a multi-step trajectory — read problem, draft solution, run public tests, debug if failing, submit. They instrument with traceAI-langchain and emit an agent.trajectory.step span for each step. Every PR runs the eval suite over the dataset; the CI artifact reports pass@1, pass@10, and per-difficulty-band pass rates. Merge blocks if pass@10 drops more than 2 points or if any difficulty band regresses by more than 4. We’ve found that in our 2026 evals, the difficulty-band breakdown catches regressions that aggregate pass@k hides — a model can preserve pass@10 globally while losing 8 points on the hard tier.
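
The merge gate itself is simple to express. Here is a minimal sketch of the thresholds above; the report layout (a dict of scores per run) is an assumption rather than a FutureAGI artifact schema.

# Hypothetical CI gate: block the merge on the regression thresholds above.
def should_block_merge(baseline: dict, candidate: dict) -> bool:
    # Aggregate pass@10 may not drop by more than 2 points.
    if baseline["pass@10"] - candidate["pass@10"] > 2.0:
        return True
    # No difficulty band may regress by more than 4 points.
    for band in ("easy", "medium", "hard"):
        if baseline["bands"][band]["pass@10"] - candidate["bands"][band]["pass@10"] > 4.0:
            return True
    return False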

For regression eval against frontier base models, FutureAGI’s LLM-as-a-Judge setup pins the judge to a different model family from the candidate, avoiding the same-family score inflation that happens with self-evaluation.

How to Measure or Detect It

CodeContests grading uses execution-based metrics:

  • pass@k: probability that at least one of k sampled solutions passes every hidden test; the canonical CodeContests metric.
  • Per-difficulty-band pass rate: easy/medium/hard splits; aggregate pass@k can hide regressions on the hard tier.
  • TaskCompletion (fi.evals): trajectory-level score for coding agents that run and debug their own code.
  • Time-complexity violations: hidden tests time out when the solution is asymptotically wrong; track timeout rate as a separate signal.
  • Edit distance to reference: not used as a primary metric (correctness > similarity) but useful when debugging why a solution fails.

The TaskCompletion evaluator ships in fi.evals; a minimal single-problem call looks like this:

from fi.evals import TaskCompletion

# Trajectory-level scorer: did the agent actually complete the coding task?
tc = TaskCompletion()
result = tc.evaluate(
    input="Read N, output the longest increasing subsequence length.",
    output="def longest_increasing(nums):\n    ...",
)
print(result.score, result.reason)
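
The pass@k numbers above are typically computed with the unbiased estimator from the HumanEval paper: sample n candidate solutions per problem, count the c that pass every hidden test, then estimate the probability that a random draw of k contains at least one passing solution. A minimal implementation:

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them correct."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 100 samples, 7 pass all hidden tests -> pass@1 = 0.07, pass@10 ≈ 0.53
print(pass_at_k(100, 7, 1), pass_at_k(100, 7, 10))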

Common Mistakes

  • Reporting pass@1 only. Frontier models benefit from sampling; pass@10 or pass@100 is the meaningful comparison for production code-gen with retry.
  • Skipping the difficulty-band breakdown. Aggregate pass@k hides hard-tier regressions.
  • Running CodeContests without a sandbox. Executing model-generated code on your CI machine is a security risk; use Docker or gVisor (a minimal runner sketch follows this list).
  • Conflating CodeContests with HumanEval. A 90%+ HumanEval score predicts almost nothing about CodeContests pass@10.
  • Treating CodeContests as the only coding benchmark. Pair with APPS, MBPP, and SWE-bench for end-to-end agent eval.
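
On the sandboxing point, a minimal Docker-based runner is sketched below. It illustrates the isolation idea (no network, capped memory and CPU, a wall-clock timeout); it is not a hardened sandbox, and gVisor tightens it further. The helper name run_in_sandbox is hypothetical.

import subprocess
import tempfile
from pathlib import Path

def run_in_sandbox(candidate_code: str, stdin_data: str, time_limit_s: int = 5) -> str:
    """Run untrusted candidate code in an isolated container and return stdout."""
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "solution.py").write_text(candidate_code)
        cmd = [
            "docker", "run", "--rm", "-i", "--network=none",
            "--memory=256m", "--cpus=1",
            "-v", f"{workdir}:/sandbox:ro",  # read-only mount of the solution
            "python:3.12-slim",
            "timeout", str(time_limit_s),    # wall-clock limit via coreutils
            "python", "/sandbox/solution.py",
        ]
        proc = subprocess.run(cmd, input=stdin_data, capture_output=True, text=True)
        return proc.stdout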

Frequently Asked Questions

What is the CodeContests coding benchmark?

CodeContests is a benchmark from DeepMind containing 13,500+ competitive-programming problems from sites like Codeforces, with public and hidden tests, used to evaluate LLM coding ability via the pass@k metric.

How is CodeContests different from HumanEval?

HumanEval has 164 hand-written Python problems focused on basic functions; CodeContests has 13,500+ real competitive-programming problems with adversarial hidden tests covering algorithmic reasoning, dynamic programming, graph theory, and edge cases that HumanEval does not exercise.

How does FutureAGI integrate the CodeContests benchmark?

Load the CodeContests problems into a Dataset, generate solutions with your model, execute them against the hidden tests with a custom code-execution evaluator to score pass@k, and add TaskCompletion for trajectory-level scoring. Wire it into a regression eval that runs on every model or prompt change.