Prompting

What Is ProTeGi?

A prompt-optimization algorithm that converts failures into textual gradients and searches over candidate prompt edits, keeping those that improve evaluation scores.

ProTeGi is a prompt-optimization algorithm that uses textual gradients: natural-language critiques of failed model outputs that point to prompt edits. It belongs to the prompting family of techniques and shows up in an optimization workflow after an eval run exposes errors in production traces or a regression dataset. FutureAGI uses ProTeGi as an agent-opt optimizer to propose candidate prompt versions, score them with evaluators, and promote only variants that improve task quality under cost and safety thresholds.
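
The loop behind that definition can be sketched in a few lines. The Python below is a conceptual sketch of one ProTeGi round, not FutureAGI's implementation: llm, score_row, and the instruction strings are hypothetical stand-ins for a model call, a per-row evaluator, and the actual gradient and rewrite prompts.

# Conceptual sketch of one ProTeGi round; llm, score_row, and the prompt
# wording are hypothetical stand-ins, not a FutureAGI API.
def protegi_round(seed_prompt, eval_rows, llm, score_row, beam_width=4, pass_threshold=0.7):
    # Textual gradients start from rows the current prompt fails on.
    failures = [row for row in eval_rows if score_row(seed_prompt, row) < pass_threshold]

    # Textual gradient: a natural-language critique of why the prompt failed.
    gradient = llm(
        "Here is a prompt and rows it failed on. In one sentence, describe "
        f"what the prompt is missing.\nPrompt: {seed_prompt}\nFailures: {failures[:5]}"
    )

    # Beam expansion: ask the model for edited prompts that address the critique.
    candidates = [
        llm(f"Rewrite the prompt to fix this problem.\nPrompt: {seed_prompt}\nProblem: {gradient}")
        for _ in range(beam_width)
    ]

    # Score every candidate on the same cohort and keep the best, falling back to the seed.
    def cohort_score(prompt):
        return sum(score_row(prompt, row) for row in eval_rows) / len(eval_rows)

    return max(candidates + [seed_prompt], key=cohort_score)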

Why It Matters in Production LLM and Agent Systems

ProTeGi matters because prompt fixes based on a few failures often trade one hidden failure mode for another. A support prompt tightened to avoid long answers can start refusing valid requests. A RAG prompt adjusted to cite sources can invent citations when retrieved context is thin. A tool-selection prompt that adds more constraints can cause an agent to skip the right tool. The failure looks like instruction drift, silent hallucination, schema validation failure, or lower task completion, but the root cause is a prompt edit with no optimizer-backed eval loop.

The pain is shared. Developers see failing regression rows but do not know what wording to change. SREs see p99 latency and token-cost-per-trace move after a prompt grows. Product teams see cohort-specific thumbs-down spikes. Compliance reviewers need to explain why a prompt variant produced a risky answer. End users only see inconsistent behavior.

For 2026 agentic systems, the effect compounds. A single task may call a planner, retriever, tool router, and final responder, each with its own prompt. Manual edits to one step can mask failures in another. ProTeGi helps when failures can be described in text, because the optimizer can turn error analysis into candidate prompt edits and test those edits against the same eval cohort.

How FutureAGI Handles ProTeGi

FutureAGI’s approach is to run optimizer:ProTeGi inside a traced optimization workflow. The engineer starts with a seed prompt, an eval dataset, and one or more evaluators such as PromptAdherence, TaskCompletion, Groundedness, or ToolSelectionAccuracy. The agent-opt surface calls ProTeGi, which analyzes failed rows, writes textual gradients such as “the prompt does not ask for account-status verification before tool use,” beam-searches prompt edits, and scores each candidate on the same cohort.
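
A minimal sketch of that scoring step, assuming evaluator objects that expose the evaluate(input=..., output=...) call and .score field shown in the snippet under “How to Measure or Detect ProTeGi” below; run_candidate, the row structure, and the aggregation are hypothetical.

# Sketch: score one candidate prompt on the same cohort with several evaluators.
# Assumes evaluators expose evaluate(input=..., output=...).score as in the
# PromptAdherence snippet below; run_candidate is a hypothetical helper that
# renders the candidate prompt for a row and calls the model.
def score_candidate(candidate_prompt, eval_rows, evaluators, run_candidate):
    scores = {type(e).__name__: [] for e in evaluators}
    for row in eval_rows:
        output = run_candidate(candidate_prompt, row)
        for e in evaluators:
            scores[type(e).__name__].append(
                e.evaluate(input=candidate_prompt, output=output).score
            )
    # Average per evaluator so a gain on one metric cannot hide a loss on another.
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}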

Example: a LangChain support agent has TaskCompletion 0.68 on billing-change requests and a rising escalation rate. Traces from traceAI-langchain show failures concentrated in the planner span, while llm.token_count.prompt is already near the budget. The engineer runs ProTeGi against 250 labeled traces with PromptAdherence and TaskCompletion. A candidate prompt adds a short decision rule for billing-intent routing, raises TaskCompletion to 0.78, and keeps prompt tokens within a 5% budget.
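
The promotion decision in this example can be expressed as a small check. The sketch below uses the scores from the example plus an assumed baseline prompt-token count; the dictionary fields, the 0.05 minimum gain, and the token numbers are illustrative, not FutureAGI defaults.

# Sketch of the selection rule from the example; field names, the minimum
# gain, and the token counts are illustrative placeholders.
def accept_candidate(baseline, candidate, min_gain=0.05, token_budget=1.05):
    improved = candidate["task_completion"] >= baseline["task_completion"] + min_gain
    within_budget = candidate["prompt_tokens"] <= baseline["prompt_tokens"] * token_budget
    return improved and within_budget

baseline = {"task_completion": 0.68, "prompt_tokens": 1800}
candidate = {"task_completion": 0.78, "prompt_tokens": 1860}
print(accept_candidate(baseline, candidate))  # True: +0.10 gain, ~3% more prompt tokens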

The next action is operational. The engineer commits the prompt version through fi.prompt.Prompt, sends a 10% rollout through Agent Command Center prompt-versioning, and sets an alert on eval-fail-rate-by-cohort. Unlike DSPy teleprompter experiments that stay in a notebook, the winning ProTeGi candidate remains linked to traces, prompt version, evaluator scores, and rollback policy.

How to Measure or Detect ProTeGi

Measure ProTeGi by comparing candidate prompts against a fixed baseline, not by reading the rewritten prompt and guessing:

  • PromptAdherence: returns a score for whether the model output followed the prompt instructions.
  • TaskCompletion: measures whether the new prompt improved the end-to-end job, not only wording.
  • Optimizer score curve: compare baseline score, each beam candidate, and the selected prompt after every optimization round.
  • Trace fields: split results by prompt id, prompt version, model, route, llm.token_count.prompt, and span status.
  • Production proxies: watch thumbs-down rate, escalation rate, manual override rate, and eval-fail-rate-by-cohort during rollout.

Minimal Python:

from fi.evals import PromptAdherence

# Illustrative inputs; in practice these come from a traced eval row.
prompt_text = "You are a billing support agent. Verify account status before calling tools."
model_output = "I updated the billing plan without checking the account status."

# Score whether the output followed the prompt's instructions.
evaluator = PromptAdherence()
result = evaluator.evaluate(
    input=prompt_text,
    output=model_output,
)
print(result.score)
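
To turn that single-row score into the cohort-level signals listed above, aggregate per-row pass/fail by cohort and compare prompt versions on the same rows. The sketch below is one way to do it; the row fields ("cohort", "input", per-version output keys) and the 0.5 pass threshold are assumptions, and it reuses the evaluator interface from the snippet above.

from collections import defaultdict

# Sketch: eval-fail-rate-by-cohort for a baseline vs. a ProTeGi candidate.
# Assumes rows are dicts with "cohort", "input", and per-version output keys;
# field names and the 0.5 threshold are illustrative.
def fail_rate_by_cohort(rows, evaluator, output_key, threshold=0.5):
    fails, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        score = evaluator.evaluate(input=row["input"], output=row[output_key]).score
        totals[row["cohort"]] += 1
        fails[row["cohort"]] += score < threshold
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}

# Compare both prompt versions on the same labeled rows:
# baseline_rates = fail_rate_by_cohort(rows, evaluator, "baseline_output")
# candidate_rates = fail_rate_by_cohort(rows, evaluator, "candidate_output")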

Common Mistakes

These mistakes make ProTeGi look noisy when the real issue is experiment design:

  • Treating textual gradients as truth. They are hypotheses from failure analysis; accept them only when candidate scores improve on a held-out cohort.
  • Optimizing against one evaluator. A prompt can raise TaskCompletion while lowering Groundedness or safety, especially in RAG and tool workflows.
  • Letting beam search grow prompts unchecked. Add llm.token_count.prompt, p99 latency, and cost budgets to the selection rule.
  • Running ProTeGi without stable traces. If prompt versions, inputs, model, and route are not logged, you cannot reproduce the win.
  • Shipping the top candidate directly to 100% traffic. Use canary deployment or traffic-mirroring before promotion; a gating sketch follows this list.
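
One way to gate that promotion is to compare eval-fail rates between the canary and the baseline per cohort, and promote only when no cohort regresses beyond a small margin. The sketch below is illustrative; the 0.02 regression margin is a placeholder, not a FutureAGI default.

# Sketch of a canary promotion gate; rate maps are cohort -> eval-fail rate
# observed during the canary window. The regression margin is a placeholder.
def promote(baseline_rates, canary_rates, max_regression=0.02):
    for cohort, baseline_rate in baseline_rates.items():
        canary_rate = canary_rates.get(cohort, 1.0)  # missing cohort counts as failing
        if canary_rate > baseline_rate + max_regression:
            return False  # roll back: this cohort regressed under the candidate
    return True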

Frequently Asked Questions

What is ProTeGi?

ProTeGi is a prompt-optimization algorithm that turns model failures into textual gradients, searches candidate prompt edits, and keeps the variants that improve evaluator scores.

How is ProTeGi different from prompt engineering?

Prompt engineering is the broader practice of writing and maintaining prompts. ProTeGi is an automated optimizer inside that practice: it uses failure analysis and beam search to propose measurable prompt edits.

How do you measure ProTeGi?

FutureAGI measures ProTeGi runs by comparing candidate prompt versions with evaluators such as PromptAdherence and TaskCompletion, plus cost, latency, and eval-fail-rate-by-cohort.