Agents that optimize
themselves - automatically

Tune every hyperparameter in your agent stack - prompts, model selection, temperature, retrieval config, tool descriptions, few-shot examples - using evolutionary algorithms. GEPA (ICLR 2026) outperforms RL by 6% with 35x fewer rollouts. No fine-tuning. No weight changes.

Agent Optimization

Converged
billing-support-agent | GEPA · 8 generations
Overall Score: 0.91 (from 0.72, +26%)
Evaluations: 342 rollouts
Params Tuned: 12 hyperparams
Parameter | Before | After | Impact
System Prompt (Instruction · Persona) | v1 · 847 tokens | v8 · 623 tokens | +19%
Temperature (Sampling · Creativity) | 0.7 | 0.42 | +8%
Few-shot Examples (Selection · Ordering) | 3 static | 5 optimized | +11%
Retrieval top_k (RAG · Context Window) | 10 | 6 | +5%
Tool Descriptions (Function Calling · Args) | 4 tools · manual | 4 tools · refined | +7%
Model (Provider · Cost) | gpt-4o | gpt-4o-mini | −$$$

GEPA reflection - Generation 6 → 7

Reflection: context noise (mutated config is Pareto-dominant)
"top_k=10 retrieves 4 irrelevant chunks that dilute answer quality. Reducing to 6 with reranker threshold 0.7 removes noise without losing relevant context."
Mutation: top_k 10→6, reranker_threshold 0.5→0.7 - faithfulness +0.05, relevancy +0.03

Reflection: temperature interaction (+0.08 composite)
"Lower temperature (0.42) works synergistically with tighter retrieval - fewer retrieved chunks + less sampling randomness = more grounded answers."
12 params optimized | 342 rollouts · 8 gen | Pareto: 3 solutions
Core Features

Self-optimizing agents -
not manual prompt tweaking

GEPA Engine
evolving
GEPA - Genetic-Pareto evolutionary optimization (ICLR 2026). Generation evolution: G0 (seed) → G1 → G2 → ··· → G8 (best, +6% vs GRPO). Reflection cycle: "The prompt lacks specificity in edge cases - add guard clauses for ambiguous input" → mutation: "When user input is ambiguous, request clarification first." Multi-objective Pareto frontier over accuracy and safety. 9 generations · 14 candidates · multi-objective Pareto selection.
Algorithm Selection
6 strategies
Optimization algorithms - select strategy:
Random - quick baseline sampling (basic)
Bayesian - intelligent surrogate search (efficient)
ProTeGi - textual gradient descent (gradient)
Meta-Prompt - LLM self-improvement loop (meta)
PromptWizard - critique + synthesis chain (synthetic)
GEPA - evolutionary + reflection (state of the art)
Convergence comparison: score vs. iterations across all six. 6 algorithms · configurable per task · auto-selection available.
LLM Integration
universal
Any LLM - zero weight changes. Supported providers: GPT-4o, Claude, Gemini, Llama, Mistral + more. Optimization pipeline: input prompt ("You are a helpful assistant that...", v1.0 baseline) → agent-opt (select algorithm, run evaluation loop, evolve & converge; API-only, no fine-tuning) → output prompt ("You are a precise AI that validates...", v8.0 optimized). Key properties: no weight changes, prompt-level only. $ pip install agent-opt - Python SDK · CLI · REST API. Powered by LiteLLM proxy: streaming, caching, fallbacks. 5+ providers · 100+ models.
Evaluation Engine
scoring
Evaluation-driven optimization. Scoring functions: Heuristic (BLEU, ROUGE, embedding - token-level metrics), LLM-as-Judge (custom rubric, pairwise, Likert - rubric-based scoring), Platform (50+ pre-built templates: faithfulness, toxicity, relevance, fluency, coherence + 45 more). Objective composition: faithfulness (w=0.4) + toxicity (w=0.3) + relevance (w=0.3) → objective(x); optimize(prompt) to maximize the objective. Score trajectory: G0 0.58 → G1 0.64 → G2 0.69 → G3 0.72 → G4 0.74 → G5 0.79 → G6 0.83 → G7 0.87 → G8 0.91. 3 scorer types · 50+ templates · weighted composition · auto-converge.

System prompts, few-shot examples, tool descriptions, temperature, top_p, model selection, retrieval parameters (chunk size, top_k, reranker config), routing logic, output format - every hyperparameter that shapes your agent's behavior. The optimizer explores the full configuration space, not a single text field. One run tunes the whole agent stack.
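In code, a config space like this is just a mapping from parameter names to candidate values. A minimal illustrative sketch in plain Python (not the agent-opt SDK API; all names and values are hypothetical):

```python
import random

# Illustrative config space: each tunable agent hyperparameter maps to
# its candidate values. Real spaces can also hold continuous ranges.
CONFIG_SPACE = {
    "system_prompt":   ["v1_verbose", "v2_concise"],   # prompt variants
    "temperature":     [0.2, 0.42, 0.7],               # sampling temperature
    "model":           ["gpt-4o", "gpt-4o-mini"],      # model selection
    "retrieval_top_k": [4, 6, 10],                     # RAG chunks retrieved
    "num_fewshot":     [3, 5],                         # few-shot examples
}

def sample_config(space, rng):
    """Draw one complete agent configuration from the space."""
    return {name: rng.choice(values) for name, values in space.items()}

rng = random.Random(0)
config = sample_config(CONFIG_SPACE, rng)
```

Every candidate the optimizer evaluates is one such complete configuration, so a single run explores prompts, sampling, retrieval, and model choice together rather than one text field in isolation.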

See agent config space

Genetic-Pareto evolutionary optimization (GEPA) uses natural language reflection - not scalar reward gradients - to evolve agent configurations over generations. Accepted at ICLR 2026 as an oral presentation, it outperforms GRPO by 6% on average (up to 19pp) while using 35x fewer rollouts. A multi-objective Pareto frontier balances accuracy, safety, cost, and latency simultaneously.
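The Pareto-frontier selection GEPA relies on can be sketched in a few lines: a candidate survives if no other candidate beats it on every objective at once. A minimal illustration over (accuracy, safety) score tuples (the values are illustrative, not taken from the paper):

```python
def dominates(a, b):
    """a Pareto-dominates b if a >= b on every objective and > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

scores = [(0.91, 0.80), (0.85, 0.92), (0.88, 0.88), (0.70, 0.75)]
frontier = pareto_frontier(scores)
# (0.70, 0.75) is beaten on both objectives by (0.88, 0.88), so it is dropped;
# the other three each win on at least one objective and all survive.
```

Selecting from the frontier rather than a single scalar score is what lets the optimizer hold accuracy, safety, cost, and latency in tension instead of collapsing them prematurely.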

Read the GEPA paper

Random Search for quick baselines. Bayesian Search (Optuna-backed) for intelligent hyperparameter tuning. ProTeGi for textual gradients that identify failure patterns. Meta-Prompt for teacher-model rewriting. PromptWizard for mutation-critique-refinement pipelines. GEPA for multi-objective evolutionary search across the full config space. From 10 trials to 500 - you control the compute.
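As a reference point for the simplest of these strategies, here is a minimal random-search baseline in plain Python (the scorer is a stub standing in for a real evaluation run; none of this is the agent-opt API):

```python
import random

def random_search(space, score_fn, n_trials=50, seed=0):
    """Baseline optimizer: sample n_trials configs, keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {k: rng.choice(v) for k, v in space.items()}
        s = score_fn(config)
        if s > best_score:
            best_config, best_score = config, s
    return best_config, best_score

# Stub scorer standing in for a real evaluation loop: rewards a temperature
# near 0.4 and a top_k of 6. With 50 trials over this 9-point grid the
# optimum is found with high probability.
def score_fn(config):
    return (1.0 - abs(config["temperature"] - 0.4)) + (0.1 if config["top_k"] == 6 else 0.0)

space = {"temperature": [0.2, 0.4, 0.7], "top_k": [4, 6, 10]}
best, score = random_search(space, score_fn)
```

The smarter strategies listed above differ only in how the next candidate is proposed - surrogate models (Bayesian), textual gradients (ProTeGi), or reflective mutation (GEPA) - while the evaluate-and-select loop stays the same.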

Compare optimizers

Every candidate configuration is scored by real evaluation metrics - faithfulness, toxicity, function call accuracy, instruction adherence, BLEU, ROUGE, embedding similarity, LLM-as-judge with custom rubrics, or 50+ pre-built templates. Compose multiple metrics into a single optimization objective. The optimizer improves what you measure, not what you hope for.
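Composing metrics into one objective is a weighted sum, with penalty-style metrics such as toxicity inverted so that lower raw toxicity scores higher. A minimal sketch using the example weights from the card above (w=0.4 / 0.3 / 0.3; the inversion rule is an assumption, real composers may handle penalties differently):

```python
WEIGHTS = {"faithfulness": 0.4, "toxicity": 0.3, "relevance": 0.3}

def objective(metrics, weights=WEIGHTS):
    """Weighted composite score; toxicity is inverted (lower toxicity is better)."""
    return (weights["faithfulness"] * metrics["faithfulness"]
            + weights["toxicity"] * (1.0 - metrics["toxicity"])
            + weights["relevance"] * metrics["relevance"])

score = objective({"faithfulness": 0.9, "toxicity": 0.05, "relevance": 0.8})
# 0.4*0.9 + 0.3*0.95 + 0.3*0.8 ≈ 0.885
```

Because the optimizer maximizes exactly this function, the weights are where you encode what "better" means for your agent.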

Explore evaluation templates
Use Cases

Optimize the full agent
for any use case

RAG pipeline → feedback-driven optimization: Query → Retriever → Context → LLM → Answer, with an optimization feedback loop; faithfulness 0.71 → 0.93.

Tune a RAG agent end-to-end

Optimize the system prompt, chunk size, top_k retrieval count, reranker threshold, and temperature together. The optimizer finds configurations where retrieval and generation work in concert - not prompt-only fixes that ignore retrieval quality.

GEPA · Faithfulness · RAG Config
Refine tool descriptions → higher accuracy: tool definition (name: search_db; params: query: string, limit: int) → optimize → refined description (+ context, constraints); accuracy 62% → 94%.

Optimize tool-calling agents

Tool descriptions, argument schemas, routing logic, and the system prompt that orchestrates them - optimize the full tool-calling stack. Score with function call accuracy and argument correctness across your entire tool suite.
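A function-call accuracy scorer can be as simple as exact-match over tool name and arguments. An illustrative sketch (the `search_db` schema follows the diagram above; the strict matching rule is an assumption - production scorers may compare arguments more leniently):

```python
def call_accuracy(predicted, expected):
    """Fraction of expected tool calls matched exactly (name + arguments)."""
    if not expected:
        return 1.0  # nothing to call and nothing called: perfect score
    hits = sum(1 for p, e in zip(predicted, expected)
               if p["name"] == e["name"] and p["args"] == e["args"])
    return hits / len(expected)

expected  = [{"name": "search_db", "args": {"query": "refund policy", "limit": 5}}]
predicted = [{"name": "search_db", "args": {"query": "refund policy", "limit": 5}}]
acc = call_accuracy(predicted, expected)  # 1.0
```

Scoring every candidate config against a suite of such expected calls is what lets the optimizer tune tool descriptions and schemas toward measurably correct invocations.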

ProTeGi · Function Call Accuracy
Same quality, fraction of the cost: GPT-4o ($$$ / 1K tokens) → optimize → GPT-4o-mini ($ / 1K tokens); quality gap closed before → after.

Same quality, cheaper model

Jointly optimize the prompt + temperature + few-shot selection so GPT-4o-mini matches GPT-4o quality. The optimizer searches for configurations that close the gap - not just prompt tweaks, but the full parameter set that makes the cheaper model perform.

Bayesian Search · Cost + Quality
Pareto frontier: safety vs. helpfulness - optimal frontier across the helpfulness/safety plane.

Balance safety and helpfulness

Safety guardrails often hurt helpfulness. GEPA's multi-objective Pareto optimization finds agent configurations that maximize both simultaneously - safety constraints, temperature, response length, and prompt tone tuned together. The Pareto frontier shows you exactly where the trade-offs are.

GEPA · Pareto Frontier
Optimize prompt → maximize conversion rate: prompt variant v3 (optimized); conversion rate 2.1% (iter 1) → 3.4% (iter 2) → 4.8% (iter 3); revenue uplift across 3 iterations.

Maximize conversion and CSAT

Sales and support agents need to convert - but you can't fine-tune GPT-4o. Optimize the full agent config for conversion rate, CSAT, or deal value. Improvement ships as a config change - model selection, temperature, prompt, and few-shot examples all tuned together.

Meta-Prompt · Custom Rewards
Per-language prompt optimization: independent optimization per locale - EN 0.94, ES 0.88, DE 0.85, JA 0.82, each tuned separately.

Per-language agent optimization

Multilingual agents need different configurations per language - not just different prompts, but different temperature, few-shot examples, and even model selection. Run independent optimization per locale with language-specific evaluation datasets.

PromptWizard · Datasets
How It Works

From baseline agent to
optimized config in three steps

01

Define your agent config space

Specify which parameters to optimize - system prompt, few-shot examples, temperature, top_p, model, tool descriptions, retrieval settings, or any custom hyperparameter. Choose an optimizer (Random, Bayesian, ProTeGi, Meta-Prompt, PromptWizard, GEPA) and set your scoring metrics.

Configure Optimizer
Optimization algorithm: Random · Bayesian · ProTeGi · Meta-Prompt · PromptWizard · GEPA
Scoring functions: Faithfulness · Toxicity · Relevance · Instruction Adherence · Safety (multi-objective Pareto optimization)
Selected: 3 metrics · GEPA · multi-objective

config:
  algorithm: GEPA
  scoring: [faithfulness, relevance, adherence]
  mode: pareto_multi_objective
02

Run optimization on your dataset

The optimizer explores your config space over generations - mutating, reflecting, critiquing, and selecting the best agent configurations. GEPA converges in 100-500 evaluations, not 25,000. Multi-objective Pareto search balances competing goals automatically.
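A typical convergence rule stops the run once generation-over-generation gains stay below a threshold. A minimal sketch against the score trajectory shown in the card below (the patience/threshold rule is illustrative, not the optimizer's actual stopping criterion):

```python
def converged(scores, patience=2, min_delta=0.015):
    """Stop when the last `patience` generation-over-generation gains
    are each below min_delta."""
    if len(scores) <= patience:
        return False
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    return all(d < min_delta for d in deltas[-patience:])

# Per-generation best scores, G0 through G8 (from the card below's trajectory).
trajectory = [0.74, 0.77, 0.80, 0.83, 0.85, 0.87, 0.89, 0.90, 0.91]
done = converged(trajectory)  # the last two gains are ~0.01 each → True
```

Stopping on a plateau like this is how a run settles at a few hundred rollouts instead of burning the full evaluation budget.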

Optimization Running
Generation 8/8 - Converged
Generation progress (convergence): G0 0.74 → G1 0.77 → G2 0.80 → G3 0.83 → G4 0.85 → G5 0.87 → G6 0.89 → G7 0.90 → G8 0.91. Total rollouts: 342. Reflection (G5 → G6): "Add explicit chain-of-thought before answering to improve faithfulness." Status: converged.
03

Deploy the optimized agent

The result is a better agent config - not a model checkpoint. Export as JSON/YAML, push to your deployment pipeline, or A/B test against the original in Experiments. No weight changes - works with any LLM provider, any model.
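Exporting the result is just serializing the winning config. A sketch using values from the dashboard above (the JSON schema is illustrative, not a fixed export format):

```python
import json

# Optimized agent config assembled from the winning candidate
# (values mirror the dashboard example above; schema is illustrative).
optimized = {
    "version": "v8",
    "model": "gpt-4o-mini",
    "temperature": 0.42,
    "retrieval": {"top_k": 6, "reranker_threshold": 0.7},
    "system_prompt": "You are a precise research assistant. ...",
}

payload = json.dumps(optimized, indent=2, sort_keys=True)
restored = json.loads(payload)  # round-trips cleanly for a deployment pipeline
```

Because the artifact is plain data rather than a model checkpoint, it drops into version control, CI, and A/B experiments like any other config change.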

Deploy Optimized Prompt
Prompt comparison:
Before (score 0.74): "You are a helpful assistant. Answer the user's question based on the provided context. Be concise and accurate."
After, optimized (score 0.91): "You are a precise research assistant. First, reason step-by-step about the query. Then provide a faithful answer using only the provided context. Cite sources inline."
Deployment options: copy to codebase (paste into your app code) · push to prompt manager (version-controlled prompts) · A/B test in Experiments. Works with any LLM: OpenAI, Anthropic, Google, Meta, Mistral, ...

Powering teams from
prototype to production

From ambitious startups to global enterprises, teams trust Future AGI to ship AI agents confidently.