What Is PromptWizard?

An optimizer that mutates, critiques, and refines prompt candidates across eval rounds to improve measured LLM or agent behavior.

PromptWizard is a prompt-family optimizer that improves LLM prompts through multi-stage mutate, critique, and refine rounds. It shows up in production eval pipelines when a team needs candidate prompt versions scored against the same dataset, not guessed from one transcript. In FutureAGI, the PromptWizardOptimizer surface proposes prompt variants, evaluates them with metrics such as PromptAdherence and TaskCompletion, and gives engineers a measured winner they can commit, trace, roll back, or compare with another optimizer.

Why It Matters in Production LLM and Agent Systems

Prompt changes fail quietly. A support agent that sounded better in a demo can start ignoring refund policy, over-calling tools, or omitting required JSON fields once traffic reaches long-tail intents. The common failure mode is prompt overfit: the new wording fits the examples a human stared at and regresses the cases the team forgot to check. A second failure mode is prompt bloat: extra examples and caveats lift one metric while pushing llm.token_count.prompt, p99 latency, and token-cost-per-trace above budget.

Developers feel this first because every regression looks ambiguous: model issue, retrieval issue, prompt issue, or tool issue. SREs see latency and cost graphs move without an obvious deploy failure. Product teams see thumbs-down rate rise for one cohort. Compliance teams lose confidence when a prompt change is not tied to an approval trail. End users feel inconsistent answers from the same workflow.

PromptWizard matters more in agentic systems because one request may pass through planner, retriever, tool-selection, and final-answer prompts. Optimizing only the last prompt can hide a bad planner prompt that chose the wrong tool. A multi-round optimizer gives engineers a repeatable way to search the prompt space, then keep or reject candidates based on measured eval deltas rather than preference.

How FutureAGI Handles PromptWizard

FutureAGI’s approach is to treat PromptWizard as an optimizer surface, not a magic prompt writer. The specific anchor is optimizer:PromptWizardOptimizer: it starts from a seed prompt, creates mutations, critiques failures, refines candidates over N rounds, and scores each candidate against an eval dataset. The winning artifact is still a prompt version that must pass release gates.
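
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PromptWizardOptimizer's implementation: `optimize_prompt` and the `toy_*` helpers are hypothetical names, and the scorer only counts required phrases where a real run would call evaluators over the eval dataset.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical rules the toy scorer looks for; a real run would score
# candidates with evaluators over a fixed eval dataset instead.
REQUIRED_RULES = [
    "cite the refund policy",
    "return valid JSON",
    "call at most one tool",
]


def toy_score(prompt: str) -> float:
    """Fraction of required rules the prompt mentions (stand-in for an eval suite)."""
    return sum(rule in prompt for rule in REQUIRED_RULES) / len(REQUIRED_RULES)


def toy_mutate(prompt: str, rng: random.Random) -> str:
    """Append one randomly chosen rule (stand-in for an LLM-written mutation)."""
    return f"{prompt} Always {rng.choice(REQUIRED_RULES)}."


def toy_critique(prompt: str) -> str:
    """Name the first rule the prompt still misses, or '' if none."""
    missing = [rule for rule in REQUIRED_RULES if rule not in prompt]
    return missing[0] if missing else ""


def toy_refine(prompt: str, feedback: str) -> str:
    """Patch the prompt with the critique's suggestion, if there is one."""
    return f"{prompt} Always {feedback}." if feedback else prompt


def optimize_prompt(
    seed_prompt: str,
    mutate: Callable[[str, random.Random], str],
    critique: Callable[[str], str],
    refine: Callable[[str, str], str],
    score: Callable[[str], float],
    rounds: int = 3,
    candidates_per_round: int = 5,
    rng_seed: int = 0,
) -> Tuple[str, float, List[float]]:
    """Return (best_prompt, best_score, best_score_after_each_round)."""
    rng = random.Random(rng_seed)
    best_prompt, best_score = seed_prompt, score(seed_prompt)
    curve: List[float] = []
    for _ in range(rounds):
        parent = best_prompt  # every candidate in a round starts from the same parent
        for _ in range(candidates_per_round):
            candidate = mutate(parent, rng)
            candidate = refine(candidate, critique(candidate))
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
        curve.append(best_score)
    return best_prompt, best_score, curve


best_prompt, best_score, curve = optimize_prompt(
    "You are a support agent.", toy_mutate, toy_critique, toy_refine, toy_score
)
print(best_score, curve)
```

Because the best score is only replaced by a strictly better one, the per-round curve never decreases, which makes a plateau easy to spot.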

A concrete workflow: a LangChain support agent has a planner prompt that often routes billing questions to the account-update tool. The engineer instruments the chain with traceAI-langchain, logs llm.token_count.prompt and prompt version, then runs PromptWizardOptimizer over 250 labeled traces. The eval suite includes PromptAdherence for instruction following, TaskCompletion for end-to-end success, and ToolSelectionAccuracy when the candidate changes tool-routing instructions. The optimizer proposes five prompt versions per round, rejects variants that raise token cost by more than 8%, and promotes the candidate that improves task success without raising p99 latency.
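
The promotion rule in this workflow can be written as a small gate. The sketch below is an assumption about how such a gate might look, not FutureAGI code: `CandidateResult` and `pick_winner` are hypothetical names, the numbers are invented, and the 8% token-cost ceiling is the only threshold taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateResult:
    name: str
    task_success: float          # fraction of traces where the task completed
    token_cost_per_trace: float  # mean cost per trace, in any consistent unit
    p99_latency_ms: float


def pick_winner(
    baseline: CandidateResult,
    candidates: List[CandidateResult],
    max_cost_increase: float = 0.08,
) -> Optional[CandidateResult]:
    """Return the best eligible candidate, or None if nothing beats the baseline."""
    cost_ceiling = baseline.token_cost_per_trace * (1 + max_cost_increase)
    eligible = [
        c
        for c in candidates
        if c.token_cost_per_trace <= cost_ceiling      # reject cost growth above 8%
        and c.p99_latency_ms <= baseline.p99_latency_ms  # must not raise p99 latency
        and c.task_success > baseline.task_success       # must improve task success
    ]
    return max(eligible, key=lambda c: c.task_success, default=None)


# Illustrative numbers only.
baseline = CandidateResult("baseline", 0.71, 0.0100, 2400)
candidates = [
    CandidateResult("v1", 0.80, 0.0120, 2300),  # over the cost ceiling
    CandidateResult("v2", 0.78, 0.0105, 2350),  # eligible
    CandidateResult("v3", 0.83, 0.0104, 2600),  # raises p99 latency
    CandidateResult("v4", 0.70, 0.0090, 2000),  # no task-success gain
]
winner = pick_winner(baseline, candidates)
print(winner.name if winner else "keep baseline")
```

Note that the highest-scoring variant (v3) is not the winner here: the gate filters on cost and latency first and ranks on task success second.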

Unlike a notebook-only DSPy teleprompter experiment, the FutureAGI workflow connects optimizer output to the same traces and regression cohorts used after release. The engineer commits the candidate with fi.prompt.Prompt, sends 10% of traffic through Agent Command Center prompt-versioning, and rolls back if eval-fail-rate-by-cohort crosses the agreed threshold.
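
A rollback trigger on eval-fail-rate-by-cohort could be computed along these lines. This is a sketch with hypothetical function names and an illustrative 10% threshold; how the pass/fail results are exported from traces is assumed, not taken from the product.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def fail_rate_by_cohort(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """results holds (cohort, passed) pairs, one per evaluated trace."""
    totals: Dict[str, int] = defaultdict(int)
    fails: Dict[str, int] = defaultdict(int)
    for cohort, passed in results:
        totals[cohort] += 1
        if not passed:
            fails[cohort] += 1
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}


def cohorts_over_threshold(rates: Dict[str, float], threshold: float) -> List[str]:
    """Cohorts whose fail rate crosses the agreed threshold, sorted by name."""
    return sorted(cohort for cohort, rate in rates.items() if rate > threshold)


# Illustrative canary results: 10 traces per cohort.
results = (
    [("billing", True)] * 8 + [("billing", False)] * 2
    + [("account", True)] * 10
)
rates = fail_rate_by_cohort(results)
breached = cohorts_over_threshold(rates, threshold=0.10)
should_roll_back = bool(breached)
print(rates, breached, should_roll_back)
```

In practice a minimum sample size per cohort is worth adding so that a single failed trace in a small cohort does not trigger a rollback.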

How to Measure or Detect PromptWizard

Measure PromptWizard by comparing candidates to a frozen baseline on the same cohort:

  • PromptAdherence: returns whether outputs followed the instructions the candidate prompt gave.
  • TaskCompletion: captures whether the workflow completed the intended job, not just improved wording.
  • Optimizer score curve: track best score per round, rejected candidates, and plateau after N rounds.
  • Trace cost fields: compare llm.token_count.prompt, completion tokens, token-cost-per-trace, and p99 latency by prompt version.
  • Release proxy: watch thumbs-down rate, escalation rate, and manual override rate for the traffic slice.
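
from fi.evals import PromptAdherence

# Placeholder inputs; in practice these come from the eval dataset or a trace.
prompt_text = "Answer in one sentence and cite the refund policy."
model_output = "Refunds are available within 30 days, per the refund policy."

# Named `evaluator` rather than `eval` to avoid shadowing Python's built-in eval().
evaluator = PromptAdherence()
result = evaluator.evaluate(
    input=prompt_text,
    output=model_output,
)
print(result.score)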

A candidate only wins when eval gains survive cohort splits: intent, locale, model, tool path, and customer tier. If PromptWizard improves global average but fails high-risk cohorts, keep it in experiment state.
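
The cohort-split rule can be checked mechanically. The sketch below is one possible reading of it, with hypothetical names and invented scores; it compares unweighted cohort means, whereas a real check would likely weight cohorts by traffic volume.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def survives_cohort_splits(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    high_risk: Iterable[str],
    tolerance: float = 0.0,
) -> Tuple[bool, List[str]]:
    """Return (promote, regressed_high_risk_cohorts).

    baseline and candidate map cohort name to mean eval score on that cohort.
    """
    regressed = sorted(
        cohort
        for cohort in high_risk
        if candidate[cohort] < baseline[cohort] - tolerance
    )
    improved_overall = mean(candidate.values()) > mean(baseline.values())
    return improved_overall and not regressed, regressed


# Illustrative scores: the global average improves, one high-risk cohort regresses.
baseline_scores = {"billing_en": 0.80, "account_en": 0.70, "refund_de": 0.90}
candidate_scores = {"billing_en": 0.90, "account_en": 0.85, "refund_de": 0.82}

promote, regressed = survives_cohort_splits(
    baseline_scores, candidate_scores, high_risk=["refund_de"]
)
print(promote, regressed)
```

Here the candidate stays in experiment state even though its average is higher, because the high-risk cohort lost ground.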

Optimizer landscape, May 2026

The 2026 prompt-optimizer field is wider than it was a year ago. PromptWizard, ProTeGi, GEPA, DSPy’s MIPROv2, and bandit-based Bayesian prompt search are all in use on production teams. Each has a different fit:

| Optimizer | Strength | When to reach for it |
| --- | --- | --- |
| PromptWizard | Multi-round mutate, critique, refine | Pipeline prompts with structured failures and an existing eval cohort |
| ProTeGi | Textual gradients from error analysis | Single high-stakes prompt where you have clear failure transcripts |
| GEPA | Genetic Pareto search across competing objectives | Trade-offs across cost, latency, task completion, and tone |
| MIPROv2 (DSPy) | Joint instruction + demonstration search | DSPy pipelines with multi-stage LLM agents |
| Bayesian prompt search | Probabilistic exploration with priors | Few-shot example selection across a large candidate pool |
| Random prompt search | Baseline | Always run as a control before claiming an optimizer beat it |

In our 2026 evals, the strongest pattern is to layer optimizers: a 2-round PromptWizard pass to fix obvious failures, then a GEPA Pareto run to balance cost, latency, and quality across GPT-5.1, Claude Opus 4.7, Claude Sonnet 4.6, and a self-hosted Llama 4 70B route. Unlike a single LangSmith experiment with one variant, the FutureAGI workflow keeps every candidate’s PromptAdherence score, TaskCompletion score, Groundedness score, prompt-token count, and trace anchor in the same view, so the winner is defensible, not anecdotal.
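
To make the Pareto idea concrete, the sketch below filters scored candidates down to the non-dominated set across quality, cost, and latency. It illustrates plain Pareto filtering only, not GEPA's genetic search; the `Candidate` type, the function names, and the numbers are all hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    quality: float     # higher is better (e.g. a task-completion score)
    cost: float        # lower is better (token cost per trace)
    latency_ms: float  # lower is better (p99 latency)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and better on at least one."""
    no_worse = (
        a.quality >= b.quality and a.cost <= b.cost and a.latency_ms <= b.latency_ms
    )
    strictly_better = (
        a.quality > b.quality or a.cost < b.cost or a.latency_ms < b.latency_ms
    )
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [
        c for c in candidates if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative candidates.
pool = [
    Candidate("A", quality=0.90, cost=0.020, latency_ms=3000),
    Candidate("B", quality=0.85, cost=0.010, latency_ms=2000),
    Candidate("C", quality=0.84, cost=0.012, latency_ms=2100),  # dominated by B
    Candidate("D", quality=0.70, cost=0.004, latency_ms=900),
]
front = pareto_front(pool)
print([c.name for c in front])
```

The front is a set of defensible trade-offs, not a single winner; choosing among A, B, and D still requires a stated budget or priority.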

Common Mistakes

PromptWizard fails when teams make the optimizer responsible for unclear eval design. The mistakes are usually operational:

  • Optimizing without a fixed eval cohort. If the dataset changes between rounds, score deltas are optimizer noise, not prompt progress.
  • Rewarding style over task success. A polished answer can still miss the tool call, refusal rule, or schema field the workflow needed.
  • Letting prompt length grow unchecked. Multi-round refinement can add clauses that improve one case while raising llm.token_count.prompt for every trace.
  • Comparing PromptWizard against manual edits on different data. Use the same baseline, evaluator config, and random seed where possible.
  • Shipping the winner without a rollback key. Store prompt id and version in traces before routing real traffic to the candidate.
  • Treating one optimizer as a silver bullet. PromptWizard, GEPA, and ProTeGi explore the prompt space differently; layer them rather than choose one.
  • Skipping cohort cuts on agent prompts. A planner-prompt win on global average can hide a regression on tool-heavy LLM agent workflows that need ToolSelectionAccuracy and TaskCompletion on regulated cohorts.
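
Two of the mistakes above, a drifting eval cohort and a missing rollback key, can be guarded with a small amount of bookkeeping. The following is a sketch under assumptions: the function names, the example records, and the `run_record` field names are hypothetical, and only Python's standard library is used.

```python
import hashlib
import json
from typing import Dict, List


def cohort_fingerprint(examples: List[Dict]) -> str:
    """Order-independent SHA-256 over the eval examples (each needs an 'id')."""
    ordered = sorted(examples, key=lambda ex: ex["id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_same_cohort(expected: str, examples: List[Dict]) -> None:
    """Raise if the cohort differs from the one frozen at the start of the run."""
    actual = cohort_fingerprint(examples)
    if actual != expected:
        raise ValueError(
            f"eval cohort changed: expected {expected[:12]}, got {actual[:12]}"
        )


# Illustrative cohort with two labeled traces.
cohort = [
    {"id": "t-001", "input": "Where is my refund?", "expected_tool": "billing"},
    {"id": "t-002", "input": "Change my email", "expected_tool": "account_update"},
]
frozen = cohort_fingerprint(cohort)

# Reordering the same examples does not change the fingerprint.
assert_same_cohort(frozen, list(reversed(cohort)))

# Hypothetical record stored with each optimizer run and each routed trace,
# so a rollback can name the exact prompt version and eval cohort.
run_record = {"prompt_id": "planner", "prompt_version": "v7", "cohort_sha256": frozen}
print(run_record)
```

Calling `assert_same_cohort` at the start of every round turns a silent dataset change into a hard failure instead of an unexplained score delta.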

PromptWizard with frontier 2026 models

In our 2026 evals, the optimizer-fit question shifts when the target model is Claude Opus 4.7, GPT-5.1, Gemini 3 Pro, or Llama 4 70B. A PromptWizard run that finds a good prompt for GPT-5.1 may overfit to its instruction-following style and fail on Claude Opus 4.7, which is more literal about system messages. The Microsoft Research PromptWizard paper (2024) reported 5-20 point gains over manually written baselines on GSM8K (where frontier models now score above 95%), BBH (BIG-Bench Hard, 23 tasks), and MMLU-Pro (the harder successor to MMLU, roughly 12K questions). That range is a useful upper bound when budgeting how much lift to expect from multi-round refinement before the eval cohort itself becomes the ceiling. The recommended pattern is to run the optimizer against each target model with the same dataset and the same evaluator suite, then commit the winning prompt per route through Agent Command Center prompt versioning. Unlike a single LangSmith experiment that ships one prompt, this approach surfaces per-model trade-offs on PromptAdherence, TaskCompletion, and token-cost-per-trace before users see the difference.
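
The per-route pattern reduces to picking a winner per model from one shared score table and measuring what is lost when a prompt tuned for one model is reused on another. The sketch below uses hypothetical function names, placeholder model labels, and invented scores.

```python
from typing import Dict

# scores[model][prompt_version] -> eval score on the shared dataset
ScoreTable = Dict[str, Dict[str, float]]


def winners_per_route(scores: ScoreTable) -> Dict[str, str]:
    """Pick the highest-scoring prompt version for each target model."""
    return {
        model: max(by_version, key=by_version.get)
        for model, by_version in scores.items()
    }


def transfer_gap(scores: ScoreTable, source_model: str, target_model: str) -> float:
    """Score the target model loses by reusing the source model's winning prompt."""
    winners = winners_per_route(scores)
    target_scores = scores[target_model]
    return target_scores[winners[target_model]] - target_scores[winners[source_model]]


# Illustrative scores for two routes evaluated on the same dataset.
scores = {
    "model_a": {"v1": 0.72, "v2": 0.86, "v3": 0.80},
    "model_b": {"v1": 0.70, "v2": 0.74, "v3": 0.83},
}
winners = winners_per_route(scores)
gap = transfer_gap(scores, "model_a", "model_b")
print(winners, round(gap, 2))
```

A non-zero gap is the signal that one prompt should not be shipped to every route.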

Frequently Asked Questions

What is PromptWizard?

PromptWizard is a prompt optimizer that mutates, critiques, and refines prompt candidates across eval rounds so teams can choose a better prompt by measured score, not taste.

How is PromptWizard different from ProTeGi?

ProTeGi uses textual gradients from error analysis and beam-searched edits. PromptWizard uses a multi-stage mutate, critique, and refine loop, which fits pipeline prompts that need iterative correction.

How do you measure PromptWizard?

FutureAGI measures PromptWizard candidates with PromptAdherence, TaskCompletion, eval-fail-rate-by-cohort, llm.token_count.prompt, and latency before promoting a prompt version.