How is PromptWizard different from ProTeGi?

ProTeGi uses textual gradients from error analysis and beam-searched edits. PromptWizard uses a multi-stage mutate, critique, and refine loop, which fits pipeline prompts that need iterative correction.

How do you measure PromptWizard?

FutureAGI measures PromptWizard candidates with PromptAdherence, TaskCompletion, eval-fail-rate-by-cohort, llm.token_count.prompt, and latency before promoting a prompt version.

What Is PromptWizard? Definition, Examples & FutureAGI Guide (2026)

Q: What is PromptWizard?

PromptWizard is a prompt optimizer that mutates, critiques, and refines prompt candidates across eval rounds so teams can choose a better prompt by measured score, not taste.

What Is PromptWizard?

PromptWizard is a prompt-family optimizer that improves LLM prompts through multi-stage mutate, critique, and refine rounds. It shows up in production eval pipelines when a team needs candidate prompt versions scored against the same dataset, not guessed from one transcript. In FutureAGI, the PromptWizardOptimizer surface proposes prompt variants, evaluates them with metrics such as PromptAdherence and TaskCompletion, and gives engineers a measured winner they can commit, trace, roll back, or compare with another optimizer.

Why It Matters in Production LLM and Agent Systems

Prompt changes fail quietly. A support agent that sounded better in a demo can start ignoring refund policy, over-calling tools, or omitting required JSON fields once traffic reaches long-tail intents. The common failure mode is prompt overfit: the new wording fits the examples a human stared at and regresses the cases the team forgot to check. A second failure mode is prompt bloat: extra examples and caveats lift one metric while pushing llm.token_count.prompt, p99 latency, and token-cost-per-trace above budget.

Developers feel this first because every regression looks ambiguous: model issue, retrieval issue, prompt issue, or tool issue. SREs see latency and cost graphs move without an obvious deploy failure. Product teams see thumbs-down rate rise for one cohort. Compliance teams lose confidence when a prompt change is not tied to an approval trail. End users feel inconsistent answers from the same workflow.

PromptWizard matters more in agentic systems because one request may pass through planner, retriever, tool-selection, and final-answer prompts. Optimizing only the last prompt can hide a bad planner prompt that chose the wrong tool. A multi-round optimizer gives engineers a repeatable way to search the prompt space, then keep or reject candidates based on measured eval deltas rather than preference.

How FutureAGI Handles PromptWizard

FutureAGI’s approach is to treat PromptWizard as an optimizer surface, not a magic prompt writer. The specific anchor is optimizer:PromptWizardOptimizer, which starts from a seed prompt, creates mutations, critiques failures, refines candidates over N rounds, and scores each candidate against an eval dataset. The winning artifact is still a prompt version that must pass release gates.
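
The loop described above can be sketched in a few lines. This is a minimal illustration of a mutate, critique, refine, and score cycle, not the PromptWizardOptimizer implementation: `optimize_prompt` and the four `toy_*` callables are hypothetical stand-ins, and in a real run each of them would wrap an LLM call or an evaluator.

```python
from typing import Callable, List, Tuple


def optimize_prompt(
    seed: str,
    mutate: Callable[[str], List[str]],
    critique: Callable[[str], str],
    refine: Callable[[str, str], str],
    score: Callable[[str], float],
    rounds: int = 3,
) -> Tuple[str, List[float]]:
    """Return the best prompt found and the best score after each round."""
    best_prompt, best_score = seed, score(seed)
    curve: List[float] = []
    for _ in range(rounds):
        # Mutations are generated once per round from the current best prompt.
        for candidate in mutate(best_prompt):
            feedback = critique(candidate)
            refined = refine(candidate, feedback)
            refined_score = score(refined)
            if refined_score > best_score:
                best_prompt, best_score = refined, refined_score
        curve.append(best_score)
    return best_prompt, curve


# Toy stand-ins (assumptions for illustration only).
REQUIRED_TERMS = ("refund policy", "JSON")


def toy_mutate(prompt: str) -> List[str]:
    return [prompt + " Follow the refund policy.", prompt + " Reply in JSON."]


def toy_critique(prompt: str) -> str:
    # Lists the required terms the prompt still lacks.
    return ", ".join(t for t in REQUIRED_TERMS if t not in prompt)


def toy_refine(prompt: str, feedback: str) -> str:
    # A real refine step would rewrite the prompt using the feedback.
    return prompt


def toy_score(prompt: str) -> float:
    # Fraction of required terms present, standing in for an eval metric.
    return sum(t in prompt for t in REQUIRED_TERMS) / len(REQUIRED_TERMS)


best, curve = optimize_prompt(
    "You are a support agent.", toy_mutate, toy_critique, toy_refine, toy_score
)
print(best)
print(curve)
```

The per-round `curve` is the piece worth keeping from a sketch like this: a flat tail suggests that further rounds are spending evals without finding better candidates.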

A concrete workflow: a LangChain support agent has a planner prompt that often routes billing questions to the account-update tool. The engineer instruments the chain with traceAI-langchain, logs llm.token_count.prompt and the prompt version, then runs PromptWizardOptimizer over 250 labeled traces. The eval suite includes PromptAdherence for instruction following, TaskCompletion for end-to-end success, and ToolSelectionAccuracy when the candidate changes tool-routing instructions. The optimizer proposes five prompt versions per round, rejects variants that raise token cost by more than 8%, and promotes the candidate that improves task success without raising p99 latency.
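
The promotion rule in this workflow can be written as a small gate. The sketch below is an assumption about how such a gate might look, not FutureAGI code: `CandidateStats`, `pick_winner`, and all the numbers are hypothetical, and the 8% token budget is taken from the example above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateStats:
    name: str
    task_success: float    # mean task-success score on the eval cohort, 0..1
    prompt_tokens: float   # mean llm.token_count.prompt per trace
    p99_latency_ms: float


def pick_winner(
    baseline: CandidateStats,
    candidates: List[CandidateStats],
    max_token_increase: float = 0.08,
) -> Optional[CandidateStats]:
    """Return the best candidate that passes every gate, or None."""
    token_budget = baseline.prompt_tokens * (1 + max_token_increase)
    eligible = [
        c
        for c in candidates
        if c.prompt_tokens <= token_budget              # token cost gate
        and c.p99_latency_ms <= baseline.p99_latency_ms  # latency gate
        and c.task_success > baseline.task_success       # must beat baseline
    ]
    return max(eligible, key=lambda c: c.task_success, default=None)


# Made-up numbers for illustration.
baseline = CandidateStats("baseline", 0.71, 900, 2400)
candidates = [
    CandidateStats("v1", 0.80, 1100, 2300),  # over the token budget
    CandidateStats("v2", 0.78, 950, 2350),   # passes every gate
    CandidateStats("v3", 0.79, 940, 2600),   # raises p99 latency
]
winner = pick_winner(baseline, candidates)
print(winner.name if winner else "no candidate promoted")
```

Returning `None` when nothing passes keeps the baseline in place by default, which matches the idea that a winner must still clear release gates.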

Unlike a notebook-only DSPy teleprompter experiment, the FutureAGI workflow connects optimizer output to the same traces and regression cohorts used after release. The engineer commits the candidate with fi.prompt.Prompt, sends 10% of traffic through Agent Command Center prompt-versioning, and rolls back if eval-fail-rate-by-cohort crosses the agreed threshold.
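
The rollback condition can be sketched as a per-cohort check over eval results from the traffic slice. This is an illustrative assumption rather than Agent Command Center behavior: `fail_rate_by_cohort`, `should_roll_back`, the cohort names, and the threshold values are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def fail_rate_by_cohort(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """results holds (cohort, passed) pairs, one per evaluated trace."""
    totals: Dict[str, int] = defaultdict(int)
    fails: Dict[str, int] = defaultdict(int)
    for cohort, passed in results:
        totals[cohort] += 1
        if not passed:
            fails[cohort] += 1
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}


def should_roll_back(results: Iterable[Tuple[str, bool]], threshold: float) -> bool:
    """Roll back if any single cohort exceeds the agreed fail-rate threshold."""
    return any(rate > threshold for rate in fail_rate_by_cohort(results).values())


# Made-up eval results for the candidate's traffic slice.
canary_results = [
    ("billing", True), ("billing", False), ("billing", True), ("billing", True),
    ("account", True), ("account", True),
]
rates = fail_rate_by_cohort(canary_results)
print(rates)
print(should_roll_back(canary_results, threshold=0.2))
```

Checking each cohort separately, instead of one pooled rate, is the design choice that matters here: a pooled rate can stay under the threshold while one cohort fails badly.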

How to Measure or Detect PromptWizard

Measure PromptWizard by comparing candidates to a frozen baseline on the same cohort:

PromptAdherence: returns whether outputs followed the instructions the candidate prompt gave.
TaskCompletion: captures whether the workflow completed the intended job, not just improved wording.
Optimizer score curve: track best score per round, rejected candidates, and plateau after N rounds.
Trace cost fields: compare llm.token_count.prompt, completion tokens, token-cost-per-trace, and p99 latency by prompt version.
Release proxy: watch thumbs-down rate, escalation rate, and manual override rate for the traffic slice.

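from fi.evals import PromptAdherence

# Illustrative inputs; in practice these come from the trace under evaluation.
prompt_text = "Answer refund questions using the refund policy. Reply in JSON."
model_output = '{"answer": "See the refund policy for eligibility."}'

# Named `evaluator` so the built-in eval() is not shadowed.
evaluator = PromptAdherence()
result = evaluator.evaluate(
    input=prompt_text,
    output=model_output,
)
print(result.score)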

A candidate only wins when eval gains survive cohort splits: intent, locale, model, tool path, and customer tier. If PromptWizard improves the global average but regresses a high-risk cohort, keep the candidate in experiment state.
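
A cohort-split check along these lines can make that rule mechanical. The sketch is a hedged illustration: `regressed_cohorts`, the cohort keys, and the scores are hypothetical, and the "global" figure is a plain unweighted mean over cohorts, which a real pipeline would likely weight by traffic.

```python
from statistics import mean
from typing import Dict, List


def regressed_cohorts(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return cohorts where the candidate scores below baseline minus tolerance.

    A cohort missing from the candidate results is treated as a regression.
    """
    return sorted(
        cohort
        for cohort, base_score in baseline.items()
        if candidate.get(cohort, 0.0) < base_score - tolerance
    )


# Made-up per-cohort eval scores.
baseline_scores = {"billing/en": 0.70, "billing/de": 0.68, "enterprise": 0.90}
candidate_scores = {"billing/en": 0.82, "billing/de": 0.75, "enterprise": 0.84}

global_gain = mean(candidate_scores.values()) - mean(baseline_scores.values())
regressions = regressed_cohorts(baseline_scores, candidate_scores)

print(f"global gain: {global_gain:+.3f}")
print("regressed cohorts:", regressions)
print("promote" if global_gain > 0 and not regressions else "keep in experiment")
```

In this made-up data the average improves while the enterprise cohort drops, so the candidate stays in experiment state even though the headline number looks better.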

Common Mistakes

PromptWizard fails when teams make the optimizer responsible for unclear eval design. The mistakes are usually operational:

Optimizing without a fixed eval cohort. If the dataset changes between rounds, score deltas are optimizer noise, not prompt progress.
Rewarding style over task success. A polished answer can still miss the tool call, refusal rule, or schema field the workflow needed.
Letting prompt length grow unchecked. Multi-round refinement can add clauses that improve one case while raising llm.token_count.prompt for every trace.
Comparing PromptWizard against manual edits on different data. Use the same baseline, evaluator config, and random seed where possible.
Shipping the winner without a rollback key. Store prompt id and version in traces before routing real traffic to the candidate.
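
The first mistake, a drifting eval cohort, can be guarded against by freezing the cohort once and recording a fingerprint with every optimizer run. The sketch below is one possible approach under stated assumptions: `freeze_cohort`, the trace ids, and the seed are hypothetical, and it uses only the Python standard library.

```python
import hashlib
import json
import random
from typing import Iterable, List, Tuple


def freeze_cohort(
    trace_ids: Iterable[str], size: int, seed: int = 0
) -> Tuple[List[str], str]:
    """Pick a reproducible sample of trace ids and fingerprint it."""
    rng = random.Random(seed)
    # Sorting first makes the sample independent of the input order.
    cohort = sorted(rng.sample(sorted(trace_ids), size))
    fingerprint = hashlib.sha256(json.dumps(cohort).encode("utf-8")).hexdigest()[:12]
    return cohort, fingerprint


all_trace_ids = [f"trace-{i:04d}" for i in range(1000)]
cohort, fingerprint = freeze_cohort(all_trace_ids, size=250, seed=7)
print(len(cohort), fingerprint)
```

Logging the fingerprint next to each round's scores gives a quick way to confirm that two score deltas were measured on the same data before comparing them.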