
LLM Prompt Format in 2026: 9 Patterns, Tested Templates, and the Eval Loop That Catches Regressions

Nine prompt-format patterns for GPT-5, Claude Opus 4.7, and Gemini 3 workflows in 2026. Templates, eval loop, and the mistakes to avoid in production.


TL;DR: What changed in prompt format between 2025 and May 2026

  • Biggest format lift in 2026? Instruction-first ordering; clear, measurable drift reduction on reasoning models.
  • Best delimiter style? XML tags on Claude, JSON schema for GPT-5 tool calls, markdown for everything else. Do not mix.
  • Few-shot example count? 2-4 for classification or extraction, 0-1 for open-ended generation.
  • Should I add “think step by step”? No, not on reasoning models. It slows responses and rarely improves accuracy.
  • Where do hard rules go? Top of prompt plus system message. Mid-prompt rules get dropped more than top or bottom.
  • How do I prove a change is better? Score old and new prompts on the same labelled set with at least 30 examples per cell.

Why prompt format is still one of the highest-impact controls

Most teams reach for fine-tuning, RAG, or a bigger model first. Prompt format is cheaper, faster, and reversible. In one Future AGI internal test on a labelled extraction set of 500 customer-support tickets, swapping a 2025-style chat prompt for a 2026 instruction-first, schema-constrained prompt on the same GPT-5 mini model materially improved exact-match accuracy without changing the model or adding retrieval. Your numbers will vary with task and data, but the pattern is consistent.

The lift compounds. Better format means fewer retries, fewer guardrail catches, lower latency, and cheaper bills. For workloads doing millions of inferences a month, even a few accuracy points usually pays for the prompt-tuning effort many times over.

The 9 patterns: what to apply, in what order

1. Instruction-first ordering

Put the task instruction on line 1. Put long context (RAG chunks, conversation history, document blobs) after it.

Why it works: reasoning models decide on a plan before scanning context. An instruction buried at line 2,400 of a 4,000-line prompt is a coin flip on whether the planner uses it.

Template:

You are a customer-support classifier. Read the ticket below and return the
category as one of: billing, technical, account, other.

<ticket>
{ticket_body}
</ticket>

Output only the category string. No explanation.

The instruction comes first. The context (the ticket) sits inside a delimited block. The output rule comes last. Three blocks, three jobs.

2. Delimiter-locked sections

Wrap each logical section in the same delimiter style end to end.

  • XML tags on Claude Opus 4.7. Anthropic recommends XML tags like <context>, <example>, and <thinking> for structuring prompts. See the Anthropic prompt engineering guide.
  • JSON schema for tool calls on GPT-5. Pass response_format={"type": "json_schema", "json_schema": {...}} per the OpenAI structured outputs reference.
  • Markdown for everything else. Headings, lists, code fences.

Do not mix. Smaller models like GPT-5 nano and Gemini 3 Flash get confused when XML, JSON, and markdown all show up in the same prompt.
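
For example, a delimiter-locked version of the pattern-1 classifier for Claude, XML end to end; the tag names are illustrative, not an Anthropic requirement:

<instructions>
Classify the ticket in <ticket> as one of: billing, technical, account, other.
Return only the category string.
</instructions>

<ticket>
{ticket_body}
</ticket>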

3. Explicit output schema

Describing the output format in prose works most of the time but fails on edge cases. A request-time JSON schema or grammar reliably enforces structure and is what every major provider now recommends.

{
  "model": "gpt-5-mini",
  "messages": [
    {"role": "user", "content": "<the ticket body goes here>"}
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "classify",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "category": {"type": "string", "enum": ["billing", "technical", "account", "other"]},
          "confidence": {"type": "number", "minimum": 0, "maximum": 1}
        },
        "required": ["category", "confidence"],
        "additionalProperties": false
      }
    }
  }
}

Anthropic and Google offer the same primitive under different names: tools with input schemas on Claude, response_schema on Gemini.
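
For Claude, a minimal sketch of the same classifier as a forced tool call via the anthropic Python SDK; the model string is illustrative:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",  # illustrative model name
    max_tokens=256,
    # Forcing this tool makes Claude emit arguments that match input_schema.
    tools=[{
        "name": "classify",
        "description": "Classify a customer-support ticket.",
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["billing", "technical", "account", "other"],
                },
                "confidence": {"type": "number", "minimum": 0, "maximum": 1},
            },
            "required": ["category", "confidence"],
        },
    }],
    tool_choice={"type": "tool", "name": "classify"},
    messages=[{"role": "user", "content": "<ticket>{ticket_body}</ticket>"}],
)
print(response.content[0].input)  # the schema-conforming arguments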

4. Constraint pinning

Models attend to the start and end of a prompt more than the middle. This is the “lost in the middle” finding from Liu et al. 2023, still measurable in 2026 on long-context evals.

Practical consequence: pin hard rules (privacy, brand tone, banned topics) to the top of the prompt, then repeat them in the system message. Do not bury them at line 800.
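
A minimal layout sketch of the pinning pattern; the PII rule is invented for illustration:

System message:

You are a support-ticket summariser. Hard rule: never include customer PII in any output.

User message:

Hard rule (repeated): never include customer PII in the output.

Summarise the ticket below in two sentences.

<ticket>
{ticket_body}
</ticket>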

5. Two to four few-shot examples

For classification or extraction, 2-4 examples typically beat 0 examples and beat 10 examples.

Beyond four, GPT-5 and Claude 4.7 start over-fitting to the example surface form: matching example phrasing, example length, example word choice. Generality drops.

For open-ended generation (drafting, summarising, brainstorming), zero or one example is usually right. Multiple examples push the model toward the example tone instead of the requested tone.
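
A sketch of the few-shot shape for the ticket classifier; all three examples are invented for illustration:

Classify each ticket as one of: billing, technical, account, other.

<example>
Ticket: I was charged twice for March.
Category: billing
</example>

<example>
Ticket: The app crashes when I upload a PDF.
Category: technical
</example>

<example>
Ticket: How do I change the email on my profile?
Category: account
</example>

<ticket>
{ticket_body}
</ticket>

Output only the category string.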

6. Negative examples, sparingly

Show one or two “do not produce this” examples for high-risk formats. PII leaks, banned URL patterns, harmful tone.

More than two negative examples can prime the model to mimic the bad pattern. The fix is a guardrail layer at runtime, not more negative examples in the prompt.
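
As a sketch, one negative example appended to the classifier prompt; the leaked fields are invented for illustration:

<bad_example>
Ticket: My card was declined.
Output: {"category": "billing", "note": "John Smith, card ending 4242"}
Why it is wrong: the note field leaks PII and is not in the schema.
</bad_example>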

7. Persona via system or developer message

OpenAI’s newer Responses-style APIs introduce a developer role for durable instruction messages above the per-turn user role, while the platform-level system message stays as the top-level authority where supported. Chat Completions and Anthropic’s Messages API both continue to use system. Either way, the message carries durable, session-level instructions, though exact precedence differs by provider, so check the current API reference for the SDK you target.

Put the persona, tone, language, and durable rules in the system or developer message. Keep the user turn lean and focused on the current task. This gives the persona persistence across turns and keeps user messages compact for logging and replay.
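
A minimal sketch using the OpenAI Responses API; the model name follows the article’s earlier examples and the persona text is illustrative:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5-mini",  # model name reused from the schema example above
    input=[
        # Durable persona and rules live in the developer message and
        # persist across turns without restating them.
        {
            "role": "developer",
            "content": "You are a concise support classifier. Respond in English only.",
        },
        # The user turn carries only the current task.
        {"role": "user", "content": "<ticket>{ticket_body}</ticket>"},
    ],
)
print(response.output_text)

On Chat Completions and Anthropic, the same text goes in the system message instead.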

8. No chain-of-thought on reasoning models

“Let’s think step by step” was a 2022 breakthrough on davinci. On 2026 reasoning models it is a regression.

GPT-5, Claude Opus 4.7, and Gemini 3 Pro all run an internal scratchpad. Telling them to think step by step:

  • Adds tens to a few hundred tokens of redundant output (real cost).
  • Increases time-to-final-token, often noticeably.
  • On math and code tasks, internal tests at Future AGI have shown no consistent accuracy lift on reasoning models, and OpenAI specifically discourages explicit chain-of-thought for o-series and GPT-5 reasoning models.

Drop the phrase on reasoning models. Keep it for Gemini 3 Flash, Llama 4, and GPT-5 nano where it still helps.

9. The eval loop, with Future AGI evaluate

A prompt change without an eval is a vibes change. Run the old prompt and new prompt against the same labelled set, score both, require a meaningful delta before shipping.

Future AGI’s evaluation library exposes 50 plus built-in evaluators: groundedness, faithfulness, toxicity, format checks, custom LLM-as-judge, and more. The library is open source under Apache 2.0 at github.com/future-agi/ai-evaluation.

import os
from fi.evals import evaluate, Evaluator

os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

# Score one (input, output) pair against the chosen evaluator.
score = evaluate(
    evaluator=Evaluator.GROUNDEDNESS,
    input="Summarise the ticket: customer cannot log in after password reset.",
    output="The customer is locked out post-reset; advise SSO sync.",
    context=["Ticket #4321 logs: ...", "Auth provider runbook: ..."],
)
print(score)

Loop the same call across your labelled prompts and aggregate the scores per prompt variant to compare versions. Set FI_API_KEY and FI_SECRET_KEY to log results to the Future AGI dashboard.
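
A minimal aggregation sketch, assuming each variant produced a list of row dicts with input, output, and context keys, and that evaluate returns a numeric score (both are assumptions, not part of the snippet above):

from statistics import mean

from fi.evals import evaluate, Evaluator

def score_variant(rows):
    # rows: hypothetical list of dicts, one labelled example per entry.
    scores = []
    for row in rows:
        result = evaluate(
            evaluator=Evaluator.GROUNDEDNESS,
            input=row["input"],
            output=row["output"],
            context=row["context"],
        )
        scores.append(result)  # assumes a numeric score comes back
    return mean(scores)

# Same labelled set, two prompt variants, one comparable delta:
# delta = score_variant(new_prompt_rows) - score_variant(old_prompt_rows)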

The eight common mistakes

  • Burying the task at line 2,400: reasoning models miss it. Fix: move the instruction to line 1.
  • Mixing XML + JSON + markdown: smaller models output garbage. Fix: pick one delimiter style.
  • Describing JSON in prose: high format-error rate. Fix: use a strict JSON schema.
  • Hard rule at line 800 of 1,000: the rule is ignored more often than at top or bottom. Fix: pin to top and repeat in system.
  • 10 few-shot examples: over-fit to example surface form. Fix: cap at 2-4.
  • “Think step by step” on GPT-5: slower, rarely more accurate. Fix: drop on reasoning models.
  • Persona in every user turn: bloated logs, drift across turns. Fix: persona in system message.
  • Shipping without an eval set: cannot tell better from worse. Fix: score 30 plus examples per cell.

Instruction-based, few-shot, and zero-shot: when to use which in 2026

  • Zero-shot. Wins: reasoning models on tasks the model was trained on; most 2026 work. Loses: niche jargon, custom labels, structured extraction.
  • Few-shot (2-4). Wins: custom labels, brand-specific tone, narrow extraction schema. Loses: open-ended generation where the example tone leaks into the output.
  • Instruction + schema. Wins: anything that returns JSON, key-value pairs, or a fixed enum. Loses: pure free-form text (essays, drafts).
  • Tool use. Wins: anything that needs to look up live data or call deterministic code. Loses: tasks that fit in pure text and would just add a round trip.

Example: rewriting a 2025 customer-support prompt for 2026

2025 version (chat-style, prose format):

You are a helpful assistant. Given a customer support ticket, please carefully read
it and think step by step about what category it belongs to. The possible categories
are billing, technical, account, or other. Please return your answer in JSON with
fields "category" and "confidence" (a number from 0 to 1). Be polite.

Here is the ticket: {ticket_body}

Issues: the persona occupies the user prompt instead of the system message, and the actual instruction does not land until line 3. “Think step by step” hurts on GPT-5. The JSON format is described in prose. The politeness instruction does nothing for a classifier.

2026 version (instruction-first, schema-constrained):

System message:

You are a customer-support ticket classifier. Output a JSON object matching the
provided schema. Never output any other text. Never include PII from the ticket
in the confidence reasoning.

User message:

<ticket>{ticket_body}</ticket>

Plus a strict json_schema constraint on the response. In one Future AGI internal test on a 500-ticket labelled eval, this version produced a clear lift in exact-match accuracy on GPT-5 mini, with lower latency and fewer output tokens. Run your own eval to size the lift for your data.

Iterate, test, stay context-aware: the loop

  1. Write v1. Apply patterns 1, 2, 3, 4 from above. Ship to a small staging eval set, 50 examples minimum.
  2. Score with fi.evals.evaluate. Look for failure clusters, not just aggregate accuracy; a counting sketch follows this list.
  3. Patch the largest cluster. If 12 of the 50 failures are format errors, tighten the schema. If 8 are missing constraints, pin the constraint higher.
  4. Re-score. Require a 2-3 point lift before shipping. Lower than that is noise unless the eval set is large.
  5. Promote. Push v2 to production. Log every call. Pull a fresh 500-example eval set from production logs weekly. Repeat.
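
A counting sketch for step 2, assuming each failed example has already been labelled with a failure reason (the records below are invented for illustration):

from collections import Counter

# Hypothetical failure records from one eval run, one per failed example.
failures = [
    {"id": 17, "reason": "format_error"},
    {"id": 23, "reason": "missing_constraint"},
    {"id": 31, "reason": "format_error"},
]

clusters = Counter(f["reason"] for f in failures)
print(clusters.most_common())  # patch the largest cluster first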

This is the loop that turns prompt engineering from a vibes practice into a measurable engineering discipline.

How Future AGI helps teams test and ship prompt formats

Future AGI ships Apache 2.0 open-source libraries (ai-evaluation, traceAI) alongside a commercial managed platform. Three pieces matter for prompt-format work:

  • Evaluate runs accuracy, groundedness, format compliance, toxicity, and 50 plus built-in metrics in parallel across prompts and models. One call, one dashboard, per-prompt deltas with confidence intervals.
  • Prompt management in the Future AGI dashboard lets you store, version, and compare prompts alongside their eval results. See the Future AGI prompts docs for current capabilities.
  • traceAI is the Apache 2.0 OpenTelemetry instrumentation that captures every prompt, response, latency, and token count per span. Source at github.com/future-agi/traceAI.

Start free: futureagi.com/pricing.


Frequently asked questions

What is the highest-impact prompt-format change to make in 2026?
Move the task instruction to the very first line and put long context after it. On GPT-5 and Claude Opus 4.7, instruction-first prompts noticeably reduce output drift versus context-first prompts at the same token count. Reasoning models commit to a plan early, so an instruction buried at line 2,400 of a long prompt may not influence the plan. Anthropic's prompt-engineering guide and OpenAI's long-context guidance both recommend putting durable instructions early.
Should I use XML tags, JSON, or markdown to delimit sections of my prompt?
XML tags work well on Claude Opus 4.7 (Anthropic's prompt-engineering guide recommends them), JSON-schema constraints win for tool-calling on GPT-5, and markdown is the safe default everywhere else. Mixing all three in one prompt confuses smaller models like GPT-5 nano and Gemini 3 Flash. Pick one delimiter style per prompt and keep it consistent end to end.
How many few-shot examples should I include in a 2026 prompt?
Two to four examples for classification or extraction, one or zero for open-ended generation. Beyond four examples, GPT-5 and Claude 4.7 start over-fitting to the example surface form and lose generality. If you need more than four examples to land the behaviour, the task probably needs a tool, a structured output schema, or fine-tuning, not a longer prompt.
Do reasoning models like GPT-5 and Claude 4.7 still need explicit chain-of-thought instructions?
No, and adding them can hurt. Reasoning models already produce internal scratchpads under the hood. Adding 'think step by step' to a 2026 reasoning model burns tokens, slows responses, and in internal tests has shown no accuracy lift on Claude Opus 4.7 math tasks. OpenAI's own guidance for o-series and GPT-5 reasoning models specifically discourages explicit chain-of-thought prompting. Use the hint only on non-reasoning models like Gemini 3 Flash or Llama 4.
How do I measure whether a prompt change is actually better?
Run the new and old prompt against the same labelled eval set, score both on a metric that matches your task (exact match for extraction, LLM-as-judge for open-ended, deterministic regex for format compliance), and require at least 30 examples per cell to declare a statistically meaningful win. Future AGI's evaluate runs 50 plus built-in metrics in parallel and surfaces per-prompt deltas in one dashboard.
What is the most common prompt-format mistake teams ship to production in 2026?
Putting the most important constraint in the middle of a long prompt. Models attend to the start and end of a prompt more than the middle (the 'lost in the middle' result from Liu et al. 2023, still observable in 2026 on long-context evals), so a critical rule placed mid-prompt is dropped more often than one pinned to the top. Pin hard rules to the top, repeat them in the system prompt, and add a guardrail layer that catches violations at runtime.
When should I switch from prompt tuning to fine-tuning a model?
When your prompt is over 2,000 tokens, you need the same behaviour on every call, and you have at least 500 labelled examples. Fine-tuning a small model like GPT-5 mini on those examples can reduce cost and latency on narrow tasks once you validate accuracy on a held-out set.
What is the difference between a system prompt and a developer message in 2026?
Both system and developer messages carry durable, session-level instructions, but the exact precedence depends on the provider and API. OpenAI's newer Responses-style APIs introduce a 'developer' role that sits as a durable instruction message above the per-turn user role; the platform-level 'system' message remains the top-level authority where it is supported. Anthropic continues to use 'system'. In practice both serve as the place to pin guardrails and persona, with user messages handling the per-turn task. Always check the provider's API reference for current precedence rules.