What Is GEPA?
GEPA evolves prompt candidates and selects Pareto-best versions across quality, cost, latency, and safety objectives.
GEPA (Genetic Pareto Prompt Optimization) is a prompt-optimization method that evolves prompt candidates and selects the Pareto-best versions across competing goals. It belongs to the prompt family of techniques and appears in eval-driven prompt workflows where a team wants higher task success without uncontrolled cost, latency, or safety regressions. In FutureAGI, the GEPAOptimizer surface runs candidates against an eval dataset, compares scores such as PromptAdherence and TaskCompletion, and keeps prompts whose tradeoffs are visible in traces.
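To make "Pareto-best" concrete, here is a minimal sketch of the dominance check this kind of selection rests on. It is plain Python with illustrative metric names, and it assumes every objective has been oriented so that higher is better (cost and latency negated, for example):

# Scores per candidate, e.g. {"task_completion": 0.91, "neg_prompt_tokens": -1350}.
def dominates(a: dict, b: dict) -> bool:
    # a dominates b if it is at least as good everywhere and strictly better somewhere.
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

def pareto_front(scored: list[dict]) -> list[dict]:
    # Keep every candidate that no other candidate dominates.
    return [c for c in scored if not any(dominates(o, c) for o in scored if o is not c)]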
Why It Matters in Production LLM and Agent Systems
Prompt optimization fails in production when a team optimizes for one headline score and ships a prompt that quietly worsens another constraint. A support-agent prompt may raise TaskCompletion on common refund questions while increasing hallucinated policy claims. A RAG answer prompt may improve tone while lowering Groundedness. A tool-calling prompt may reduce output length while causing the planner to skip a required verification tool.
GEPA matters because prompt quality is usually a tradeoff problem, not a single-number contest. Developers feel the pain when prompt changes pass a small notebook test but fail in a larger regression cohort. SREs see token-cost-per-trace climb because the winning prompt is 900 tokens longer. Compliance teams see new refusal gaps or policy-language drift. Product teams see users re-open tickets after an answer sounded polished but omitted a required step.
The symptoms show up in traces and eval dashboards: eval-fail-rate-by-cohort moves in opposite directions across slices, p99 latency rises after a prompt rollout, cost per successful task changes, and evaluator explanations cluster around one instruction block. In 2026's multi-step agent pipelines, the tradeoff compounds. A planner prompt, retrieval formatter, tool-selection prompt, and final-response prompt can each look acceptable alone, while the full trajectory loses reliability because one step optimizes away context needed by the next.
How FutureAGI Handles GEPA
FutureAGI’s approach is to keep GEPA tied to the same eval dataset and trace layer that production engineers already use for releases. The anchor surface is optimizer:GEPAOptimizer, implemented as the agent-opt GEPAOptimizer class: a genetic Pareto optimizer for complex prompt solution spaces. It mutates prompt candidates, scores them against multiple objectives, and keeps a Pareto frontier instead of flattening quality, cost, and latency into one opaque number.
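The loop itself fits in a short sketch. This illustrates the genetic-Pareto idea rather than the agent-opt GEPAOptimizer API; mutate_prompt and score_candidate are hypothetical stand-ins for an LLM-driven rewrite step and a multi-objective eval run, and dominates is the check from the sketch above:

import random

def mutate_prompt(prompt: str) -> str: ...            # hypothetical: LLM-driven rewrite
def score_candidate(prompt: str, rows) -> dict: ...   # hypothetical: multi-objective eval

def gepa_loop(seed_prompt, eval_rows, generations=10, children=8):
    # The frontier holds (prompt, scores) pairs that no other pair dominates.
    frontier = [(seed_prompt, score_candidate(seed_prompt, eval_rows))]
    for _ in range(generations):
        parent, _ = random.choice(frontier)
        pool = frontier + [
            (child, score_candidate(child, eval_rows))  # same rows for every child
            for child in (mutate_prompt(parent) for _ in range(children))
        ]
        frontier = [p for p in pool
                    if not any(dominates(q[1], p[1]) for q in pool if q is not p)]
    return frontier  # several tradeoff-distinct survivors, not one blended winner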
A typical workflow starts with a seed prompt for a claims-support agent and a 300-row regression dataset. The engineer selects objectives such as PromptAdherence, TaskCompletion, Groundedness, llm.token_count.prompt, and median latency. GEPAOptimizer generates candidate prompts, runs them against the same cohort, and records candidate id, prompt version, evaluator scores, and token usage. If the app is instrumented through traceAI-langchain, those candidate runs can be inspected beside the same LLM spans used in production debugging.
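Whatever runner produces those candidate runs, each one should land as a single record keyed consistently. A minimal shape, assuming these field names (llm.token_count.prompt is the trace attribute named above; the values are illustrative):

candidate_run = {
    "candidate_id": "gepa-007",                # hypothetical id
    "prompt_version": "claims-support/v14",
    "eval_cohort": "claims-regression-300",    # the same 300 rows for every candidate
    "scores": {
        "prompt_adherence": 0.93,
        "task_completion": 0.88,
        "groundedness": 0.86,
    },
    "llm.token_count.prompt": 1412,            # from the candidate's LLM spans
    "latency_ms_median": 820,
}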
Unlike ProTeGi, which follows textual gradients from error analysis, GEPA keeps several high-quality candidates alive when they represent different tradeoffs. One candidate may maximize TaskCompletion but add 18 percent token cost. Another may lose 1 point of task score but cut prompt tokens by 35 percent and reduce refusal drift. The engineer can set a release gate, for example PromptAdherence >= 0.92, Groundedness >= 0.85, and token-cost-per-successful-task below baseline, then commit the selected prompt version or send it through another regression eval.
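Once candidate runs carry a record shape like the one above, the release gate is a few lines. A sketch using those thresholds; total_token_cost, successful_tasks, and the baseline figure are assumed fields, not a FutureAGI API:

GATES = {"prompt_adherence": 0.92, "groundedness": 0.85}

def passes_release_gate(run: dict, baseline_cost_per_success: float) -> bool:
    # Quality floors first: a cheap prompt that fails adherence never ships.
    if any(run["scores"][m] < floor for m, floor in GATES.items()):
        return False
    # Then cost: token-cost-per-successful-task must beat the baseline.
    cost_per_success = run["total_token_cost"] / max(run["successful_tasks"], 1)
    return cost_per_success < baseline_cost_per_success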
How to Measure or Detect It
Measure GEPA by comparing candidate prompts on the same cohort, not by reading the prompt and guessing.
- PromptAdherence: scores whether the output followed the prompt’s instructions; track pass rate by GEPA candidate id.
- TaskCompletion and Groundedness: detect whether the optimized prompt still completes the task and stays grounded in provided context.
- Trace fields: compare llm.token_count.prompt, total token cost, latency p95, and model route for each candidate run.
- Pareto frontier health: review how many candidates beat baseline on one metric without violating gates on the others.
- User-feedback proxy: after rollout, monitor thumbs-down rate, escalation rate, and reopened-ticket rate by prompt version.
from fi.evals import PromptAdherence

metric = PromptAdherence()
# Run the same eval rows through this metric for every GEPA candidate.
# Store each score with candidate_id, prompt_version, and eval_cohort
# so every frontier comparison is apples-to-apples.
The key detection rule is consistency. If two candidates were scored on different rows, different model versions, or different retrieval settings, the Pareto frontier is not meaningful.
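One way to enforce that rule mechanically is to fingerprint each run's eval configuration and refuse to build a frontier across mismatched fingerprints. A sketch, assuming each run record carries its row ids, model version, and retrieval settings:

import hashlib, json

def eval_fingerprint(run: dict) -> str:
    # Hash everything that must match for a fair candidate comparison.
    config = {
        "row_ids": sorted(run["row_ids"]),
        "model_version": run["model_version"],
        "retrieval_settings": run["retrieval_settings"],
    }
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

def comparable(runs: list[dict]) -> bool:
    return len({eval_fingerprint(r) for r in runs}) == 1  # one config, or no frontier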
Common Mistakes
Most GEPA mistakes come from turning a multi-objective search into an unreviewed auto-deploy path.
- Optimizing only mean task score. The best average prompt can still fail compliance, cost, or long-tail cohorts.
- Using different eval rows per candidate. Genetic search needs a stable comparison set, or the frontier reflects sampling noise.
- Letting GEPA rewrite policy text freely. Lock regulatory clauses, refusal boundaries, and tool permissions before mutation.
- Selecting the cheapest prompt without gates. Cost is useful only after PromptAdherence, Groundedness, and task success clear minimum thresholds.
- Ignoring prompt-version rollback. Store the baseline, candidate id, and selected prompt version so incidents can be traced.
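For the rollback point in particular, a small selection manifest written at ship time is usually enough to trace an incident back to a prompt decision. The shape below is illustrative, not a FutureAGI format:

import datetime, json

manifest = {
    "baseline_prompt_version": "claims-support/v11",   # rollback target
    "selected_prompt_version": "claims-support/v14",
    "candidate_id": "gepa-007",
    "eval_cohort": "claims-regression-300",
    "selected_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}
with open("prompt_release_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)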
Frequently Asked Questions
What is GEPA?
GEPA is genetic Pareto prompt optimization: it evolves prompt variants and keeps candidates on the best tradeoff frontier across task score, faithfulness, cost, latency, and safety.
How is GEPA different from ProTeGi?
ProTeGi uses textual gradients from failure analysis to rewrite a prompt. GEPA treats prompt search as a multi-objective evolutionary process, so it can compare several good candidates instead of forcing one blended score.
How do you measure GEPA prompt optimization?
FutureAGI measures GEPAOptimizer runs by evaluator deltas such as PromptAdherence and TaskCompletion, plus trace fields like llm.token_count.prompt and cost per eval row.