
Hard Prompt vs Soft Prompt in 2026: Differences, Code, and When to Use Each

Hard prompts vs soft prompts in 2026: prompt tuning, prefix tuning, P-tuning, LoRA for prompts. Decision guide, code, and benchmarks for production teams.


TL;DR: Hard Prompt vs Soft Prompt in 2026

| Dimension | Hard Prompt | Soft Prompt |
| --- | --- | --- |
| Representation | Natural-language text tokens | Continuous embedding vectors |
| Human-readable | Yes | No |
| Works with closed APIs (GPT-5, Claude, Gemini) | Yes | No |
| Training cost | Zero, just iteration | GPU hours per task |
| Inference cost | Tokens in the prompt | A few extra embedding vectors |
| Best at | General reasoning, prototypes, RAG, agents | Narrow tasks with labelled data |
| Common 2026 methods | Few-shot, CoT, structured prompts, DSPy, APE | Prompt tuning, prefix tuning, P-tuning v2 |
| Portability | Move freely across models | Tied to one base model checkpoint |

If you only have time for one move in 2026: ship hard prompts, evaluate them with a held-out test set and judge-based evaluators, and only reach for soft prompts when you self-host an open-source base and have labels.

What Are Hard Prompts in 2026

A hard prompt is the text you send to a model. It is composed of regular tokens that the tokenizer recognizes and the model attends to, exactly the same as user input. In a 2026 production stack, hard prompts include:

  • The system prompt that scopes the assistant’s role.
  • Tool and function definitions in JSON or YAML.
  • Retrieved chunks from a vector store, knowledge graph, or hybrid retriever.
  • Few-shot demonstrations.
  • The user’s actual request.
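
These components can be assembled into a single string before the API call. A minimal sketch; the section tags, helper name, and sample values are illustrative, not a required format:

```python
def build_hard_prompt(system: str, tools: str, chunks: list[str],
                      shots: list[str], user: str) -> str:
    """Assemble a structured hard prompt from its typical production parts."""
    context = "\n\n".join(chunks)
    demos = "\n\n".join(shots)
    return (
        f"<system>\n{system}\n</system>\n"
        f"<tools>\n{tools}\n</tools>\n"
        f"<context>\n{context}\n</context>\n"
        f"<examples>\n{demos}\n</examples>\n"
        f"<user>\n{user}\n</user>"
    )

prompt = build_hard_prompt(
    system="You are a support assistant. Answer only from the context.",
    tools='{"name": "lookup_order", "parameters": {"order_id": "string"}}',
    chunks=["Refunds are processed within 5 business days."],
    shots=["Q: How long do refunds take?\nA: 5 business days."],
    user="Where is my refund?",
)
print(prompt)
```

Every character here is billed as input tokens, which is exactly the trade soft prompts avoid.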

Hard prompts are the only prompt representation supported by closed APIs like the gpt-5-2025-08-07 endpoint, Anthropic’s claude-opus-4-7, and Google’s gemini-3-pro. Optimizing them is a search problem over wording, structure, ordering of demonstrations, and few-shot example selection.

Characteristics

  • Token-level. Each part of the prompt costs tokens at inference time.
  • Portable. A hard prompt that works on Claude often transfers with light edits to GPT-5 or Gemini.
  • Inspectable. You can log the exact prompt that produced each response, which makes failure analysis tractable.
  • Engineering surface. Few-shot selection, chain-of-thought scaffolding, role conditioning, and output schemas are all hard-prompt levers.

Modern hard-prompt patterns

  • Structured prompting. XML, JSON, or Markdown sections to separate system rules, context, and user input.
  • Few-shot retrieval. Pull demonstrations from a labelled pool by semantic similarity rather than hardcoding them.
  • Chain-of-thought and reasoning. Reasoning models like the o-series, and thinking modes such as Gemini’s, have made CoT partially native, so explicit reasoning scaffolds matter less than they once did.
  • Prompt optimization. DSPy, APE, OPRO, and EvoPrompt search the hard-prompt space against an evaluator. Future AGI’s prompt-opt loop wraps this pattern with an evaluation suite.
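
Few-shot retrieval from a labelled pool can be sketched without any vector store. The bag-of-words cosine similarity below is a toy stand-in for the embedding model a production retriever would use:

```python
from collections import Counter
from math import sqrt

def similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts, a stand-in for embeddings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def select_demos(query: str, pool: list[tuple[str, str]], k: int = 2) -> list[tuple[str, str]]:
    """Pick the k labelled examples most similar to the incoming query."""
    return sorted(pool, key=lambda ex: similarity(query, ex[0]), reverse=True)[:k]

pool = [
    ("reset my password", "account_access"),
    ("refund for a duplicate charge", "billing"),
    ("app crashes on startup", "bug_report"),
]
demos = select_demos("I was charged twice, need a refund", pool, k=1)
print(demos)  # the billing example ranks first
```

The selected pairs then become the few-shot demonstrations in the hard prompt.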

What Are Soft Prompts in 2026

A soft prompt is a sequence of continuous vectors prepended to the model’s input embeddings. These vectors do not correspond to any token. They live in the same dimensional space as token embeddings and are trained by backpropagation against a downstream task loss while the base model weights stay frozen.

You cannot type a soft prompt. You learn one with a small dataset, save it as a tensor, and load it at inference time. Closed APIs do not accept soft prompts, so soft prompting is exclusively an open-weights technique used on self-hosted models like Llama 4, Qwen 3, Mistral, and Gemma 3.
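
Conceptually, a soft prompt is just extra rows concatenated in front of the token-embedding matrix before the transformer runs. A framework-free sketch with toy numbers; real systems use tensors and train the virtual rows by backpropagation:

```python
import random

random.seed(0)
embed_dim = 4            # toy embedding width; real models use thousands
num_virtual_tokens = 3   # length of the learned soft prompt

# Learned soft prompt: trainable vectors, not tied to any vocabulary entry.
soft_prompt = [[random.gauss(0, 0.02) for _ in range(embed_dim)]
               for _ in range(num_virtual_tokens)]

# Frozen embeddings for the actual text tokens of the request.
token_embeddings = [[0.1 * i] * embed_dim for i in range(5)]  # 5 text tokens

# The model attends over the virtual positions first, then the text.
model_input = soft_prompt + token_embeddings
print(len(model_input))  # 8 positions: 3 virtual + 5 text
```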

The soft-prompt family of methods

  • Prompt tuning (Lester et al., 2021). Learn a small set of embedding vectors prepended only at the input layer. Strong on large bases, weaker on smaller models.
  • Prefix tuning (Li and Liang, 2021). Learn key and value prefixes at every transformer layer, giving the model deeper conditioning. More expressive than prompt tuning at the cost of more trainable parameters.
  • P-tuning (Liu et al., 2021) and P-tuning v2. P-tuning adds a small encoder over the learned embeddings. P-tuning v2 effectively brings prefix tuning to mid-size models and matches fine-tuning on many NLU tasks.
  • LoRA and adapters as adjacent techniques. Not soft prompts, but parameter-efficient. LoRA modifies attention weight projections via low-rank updates. People sometimes describe combined LoRA plus prefix-tuning recipes as “LoRA for prompts.”

Characteristics

  • No text-token cost. A 20-vector soft prompt adds 20 virtual positions to attention rather than 20 visible billable tokens, though attention compute still scales with those extra positions.
  • Model-specific. A soft prompt trained on Llama-4-70B does not transfer to Qwen-3.
  • Requires gradients. You need labelled data and GPU access to train.
  • Opaque. You cannot read what a soft prompt encodes, only test its effect on evals.

Hard Prompt vs Soft Prompt: Side-by-Side

| Property | Hard Prompt | Soft Prompt |
| --- | --- | --- |
| Lives at | Token layer | Embedding layer |
| Created by | Humans, or hard-prompt optimizers like DSPy / APE / OPRO | Gradient descent on a labelled task |
| Needs GPUs | No, only the inference API | Yes, for training |
| Needs labels | Optional, helpful for evals | Required, typically 1,000 to 100,000 examples |
| Works on hosted frontier APIs | Yes | No |
| Cost at inference | Token cost of the prompt | Tiny, a few extra vectors |
| Storage size | A string | A few hundred KB to a few MB tensor |
| Debuggability | High, exact text is logged | Low, only via behaviour |
| Best use cases | Agents, RAG, prototypes, broad tasks | Narrow classification, structured extraction, domain-specific style |

When to Use Hard Prompts in 2026

Choose hard prompts when any of the following is true:

  • You are calling a hosted model (GPT-5, Claude Opus 4.7, Gemini 3, Mistral Large via API, etc.).
  • The task is broad or open-ended, like an agent, multi-turn assistant, or RAG product.
  • You need cross-model portability for procurement or fallback reasons.
  • You need to log, audit, and reproduce the exact instruction.
  • You have fewer than a few hundred labels but high developer iteration speed.

A 2026 production checklist for hard prompts:

  1. Pin to a dated model snapshot (gpt-5-2025-08-07, claude-opus-4-7) in production, with a canary on the “latest” alias.
  2. Define a small but representative held-out eval set.
  3. Run a hard-prompt optimizer or manual search to find a strong baseline.
  4. Track regressions with replay and online evals before promoting.
  5. Use online evaluators at the gateway to catch model-side drift.
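
Step 4's "compare, don't eyeball" can be as simple as a paired sign test over per-example judge scores from the held-out set. A stdlib-only sketch with made-up scores:

```python
from math import comb

def sign_test_p(baseline: list[float], candidate: list[float]) -> float:
    """Two-sided sign test: p-value for the candidate differing from baseline."""
    wins = sum(c > b for b, c in zip(baseline, candidate))
    losses = sum(c < b for b, c in zip(baseline, candidate))
    n = wins + losses  # ties are dropped
    k = min(wins, losses)
    return min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)

# Judge scores per eval example for the production and candidate prompts.
prod = [0.6, 0.7, 0.5, 0.8, 0.6, 0.7, 0.5, 0.6]
cand = [0.8, 0.9, 0.7, 0.8, 0.7, 0.9, 0.8, 0.7]
p = sign_test_p(prod, cand)
print(f"p = {p:.3f}")  # promote only if the improvement is unlikely to be noise
```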

When to Use Soft Prompts in 2026

Reach for soft prompts when all of the following hold:

  • You self-host an open-source base model (Llama 4, Qwen 3, Mistral, Gemma 3).
  • The task is narrow and well-specified, like single-label classification, schema extraction, code intent labeling, or domain-specific style transfer.
  • You have a labelled dataset of roughly 1,000 to 100,000 examples.
  • Per-request latency and token cost matter enough to want a near-zero prompt overhead.
  • You can train and version embedding tensors alongside your model.

If you only have a few hundred labels, prefer few-shot hard prompts plus retrieval. If you have hundreds of thousands of labels, consider full or LoRA fine-tuning instead of pure soft prompting.

Hybrid Pattern: Hard Prompt Shell + Soft Prompt Adapter

A common 2026 pattern on open-source bases:

  1. A short hard prompt scopes the task and provides the output schema.
  2. A trained soft prompt or LoRA adapter conditions the model on domain-specific style and entities.
  3. An evaluator scores faithfulness and structure on each response.

This combines the auditability of hard prompts with the parameter efficiency of soft prompting. It is most common in regulated domains (healthcare, legal, finance) where the base model is open-source for data residency reasons but the task is narrow enough to benefit from learned prompts.
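
The three steps can be sketched as a pipeline. Here `generate` is a stub standing in for a self-hosted model already conditioned by a loaded soft prompt or LoRA adapter, and the schema check plays the evaluator role; all names and sample data are illustrative:

```python
import json

SHELL = (
    "Extract the entities from the note below. "
    'Return JSON with keys "drug" and "dose".\n\nNote: {note}'
)

def generate(prompt: str) -> str:
    """Stand-in for a soft-prompt-conditioned open-source model."""
    return '{"drug": "amoxicillin", "dose": "500 mg"}'

def evaluate_structure(output: str) -> bool:
    """Minimal evaluator: the response must parse and carry both keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and {"drug", "dose"} <= parsed.keys()

output = generate(SHELL.format(note="Prescribed amoxicillin 500 mg TID."))
print(evaluate_structure(output))  # True
```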

Code: Hard-Prompt Evaluation Loop with Future AGI

from fi.evals import evaluate

def call_model(system: str, user: str) -> str:
    """Replace with your LLM client (OpenAI, Anthropic, LiteLLM, etc.)."""
    raise NotImplementedError

system_prompt = "You are a clinical-summary assistant. Output JSON with keys: chief_complaint, history, plan."
user = "Summarize: patient presents with cough, fever, and shortness of breath for 3 days..."
output = call_model(system_prompt, user)

score = evaluate(
    "faithfulness",
    output=output,
    context=user,
)
print(score)

For a custom LLM-judge against a soft-prompt-conditioned open-source model:

from fi.opt.base import Evaluator
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

judge = Evaluator(
    metric=CustomLLMJudge(
        prompt="Rate clinical-summary correctness from 0 to 1.",
        provider=LiteLLMProvider(model="gpt-5-2025-08-07"),
    ),
)
score = judge.evaluate(input=user, output=output)

Code: Soft-Prompt Training with PEFT

A minimal prompt-tuning recipe on an open-source base, adapted from the Hugging Face PEFT prompt-tuning docs:

from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-4-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify the clinical note.",
    num_virtual_tokens=20,
    tokenizer_name_or_path=base,
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()

Train with standard Hugging Face Trainer or Accelerate, then save the adapter and load it at inference. Evaluate with the same Future AGI evaluators you use on the hard-prompt side, so the two approaches are scored on the same ruler.

Prompt Optimization in 2026

Modern hard-prompt optimization tooling that operates at the text level:

  • DSPy. Compile prompts and few-shot demos against metrics, with structured signatures.
  • APE. Automatic Prompt Engineer, sampler-based candidate generation and scoring.
  • OPRO. Optimization by PROmpting, uses an LLM as the optimizer over the prompt space.
  • EvoPrompt. Evolutionary search over prompt templates.
  • Future AGI prompt-opt. Wraps the above patterns inside an evaluation suite with offline replay, statistical comparison, and online evaluators at deploy.

All of these operate on hard prompts because that is what ships to hosted APIs. Soft-prompt optimization remains a gradient-descent training problem rather than a search problem.
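
All of these optimizers share the same skeleton: propose text candidates, score them with an evaluator, keep the best. A toy greedy search over template edits; the scorer is a deliberately trivial stand-in for a real eval suite:

```python
def score(prompt: str) -> float:
    """Stand-in evaluator; a real loop would score the prompt on an eval set."""
    return min(prompt.count("JSON"), 1) + 0.5 * ("step by step" in prompt)

EDITS = [
    lambda p: p + " Respond in JSON.",
    lambda p: p + " Think step by step.",
    lambda p: p.replace("Summarize", "Carefully summarize"),
]

def optimize(prompt: str, rounds: int = 3) -> tuple[str, float]:
    """Greedy text-level search: keep any edit that improves the eval score."""
    best, best_score = prompt, score(prompt)
    for _ in range(rounds):
        for edit in EDITS:
            candidate = edit(best)
            if score(candidate) > best_score:
                best, best_score = candidate, score(candidate)
    return best, best_score

best, best_score = optimize("Summarize the ticket.")
print(best_score, "|", best)
```

DSPy, APE, OPRO, and EvoPrompt differ mainly in how candidates are proposed (compiled signatures, LLM sampling, LLM-as-optimizer, evolutionary operators), not in this outer loop.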

How to Choose: Decision Tree

  1. Are you calling a hosted frontier API? If yes, hard prompts only.
  2. Do you self-host an open-source base and have a narrow task plus labels? Consider soft prompts (prompt tuning, prefix tuning, P-tuning v2) or LoRA.
  3. Do you have more than 50,000 high-quality labels and care about per-token cost? Full or LoRA fine-tuning often beats soft prompting.
  4. Do you need full audit logs of the instruction sent to the model? Hard prompts win, because soft prompts cannot be inspected.
  5. Do you need to ship a change today? Hard prompts plus a prompt-opt loop is the lowest-latency path.
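
The decision tree reads naturally as a function. The thresholds mirror the numbers above and are rules of thumb, not hard limits:

```python
def choose_prompt_strategy(hosted_api: bool, narrow_task: bool, num_labels: int) -> str:
    """Map the decision tree onto a recommendation string."""
    if hosted_api:
        return "hard prompts"                 # closed APIs accept text only
    if num_labels > 50_000:
        return "full or LoRA fine-tuning"     # enough labels to move weights
    if narrow_task and num_labels >= 1_000:
        return "soft prompts"                 # prompt/prefix tuning, P-tuning v2
    return "hard prompts"                     # default: iterate and evaluate

print(choose_prompt_strategy(hosted_api=True, narrow_task=True, num_labels=5_000))
```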

For most 2026 product teams the answer is hard prompts, refined inside an evaluation suite, with soft prompts reserved for the narrow self-hosted slice of the stack.

Frequently asked questions

What is the difference between a hard prompt and a soft prompt?
A hard prompt is a natural-language string a human writes and sends to a model, like 'Summarize this report in 3 bullets.' A soft prompt is a sequence of continuous embedding vectors prepended to the model's input embeddings, learned during training and carrying no token-level interpretation. Hard prompts are interpretable and portable across models. Soft prompts are model-specific, learned from gradient updates on a frozen base model, and usually outperform hard prompts on narrow tasks once trained.
Are hard prompts and soft prompts the same as prompt tuning and prefix tuning?
No. Hard prompt and soft prompt describe the prompt representation: text tokens vs continuous embeddings. Prompt tuning and prefix tuning are specific training methods for learning soft prompts. Prompt tuning learns embeddings at the input layer only. Prefix tuning learns key-value prefixes at every transformer layer. P-tuning v2 generalizes prefix tuning. All three are parameter-efficient fine-tuning (PEFT) methods that operate on soft prompts.
When should I use a hard prompt vs a soft prompt in production in 2026?
Use hard prompts for nearly every shipped agent and RAG product in 2026, because most teams call hosted models like gpt-5-2025-08-07, claude-opus-4-7, or gemini-3-pro through an API that does not accept soft prompts. Use soft prompts when you have GPU access to a self-hosted open-source base model, a narrow task with a labelled dataset above roughly 1,000 examples, and a need for the lowest per-token cost at inference time.
Do soft prompts outperform hard prompts?
Soft prompts typically match or exceed careful hard prompting on narrow tasks once trained on a few hundred to a few thousand labelled examples, while using roughly 0.01% of the parameters of full fine-tuning. On broad reasoning or instruction-following tasks, well-engineered hard prompts plus retrieval still dominate. The relevant axis in 2026 is task narrowness and label budget rather than absolute capability.
Can I use soft prompts with closed models like GPT-5 or Claude?
No. OpenAI, Anthropic, and Google do not currently expose soft-prompt or prefix-tuning APIs on their hosted frontier models. They expose hard prompts, system prompts, structured outputs, and supervised fine-tuning via labelled examples. Soft prompts are available on open-source bases like Llama 4, Qwen 3, and Mistral derivatives through Hugging Face PEFT or NeMo.
What is prompt optimization and does it work on hard or soft prompts?
Prompt optimization is the automated search for better prompt wording, structure, or few-shot examples against an evaluation metric. Methods like DSPy, APE, OPRO, and Future AGI's prompt-opt operate on hard prompts because they need to ship to hosted APIs. Soft-prompt training is a different optimization problem solved via gradient descent on a model you control. Both can run inside the same evaluation harness.
How do I evaluate a hard prompt change before shipping?
Run the new prompt on a held-out eval set with reference answers or judge-based evaluators for faithfulness, correctness, and task success. Compare against the production prompt on the same examples. Use a sequential test or fixed-N statistical test rather than eyeballing two scores. Future AGI's evaluation library (`fi.evals`) and prompt-opt loop wrap this pattern with offline replay and online evaluators.
What about LoRA for prompts?
LoRA is parameter-efficient fine-tuning of attention weight matrices, not a prompt technique per se. People sometimes call combined LoRA plus prefix-tuning setups 'LoRA for prompts' because the LoRA adapter conditions the model's behaviour similarly to a learned soft prompt. In practice LoRA tunes a small fraction of weights, while prompt tuning and prefix tuning leave all base weights frozen and only learn prepended vectors.