
Hard Prompt vs Soft Prompt in 2026: Differences, Code, and When to Use Each

Hard prompts vs soft prompts in 2026: prompt tuning, prefix tuning, P-tuning, LoRA for prompts. Decision guide, code, and benchmarks for production teams.


TL;DR: Hard Prompt vs Soft Prompt in 2026

| Dimension | Hard Prompt | Soft Prompt |
| --- | --- | --- |
| Representation | Natural-language text tokens | Continuous embedding vectors |
| Human-readable | Yes | No |
| Works with closed APIs (GPT-5, Claude, Gemini) | Yes | No |
| Training cost | Zero, just iteration | GPU hours per task |
| Inference cost | Tokens in the prompt | A few extra embedding vectors |
| Best at | General reasoning, prototypes, RAG, agents | Narrow tasks with labelled data |
| Common 2026 methods | Few-shot, CoT, structured prompts, DSPy, APE | Prompt tuning, prefix tuning, P-tuning v2 |
| Portability | Move freely across models | Tied to one base model checkpoint |

If you only have time for one move in 2026: ship hard prompts, evaluate them with a held-out test set and judge-based evaluators, and only reach for soft prompts when you self-host an open-source base and have labels.

What Are Hard Prompts in 2026

A hard prompt is the text you send to a model. It is composed of regular tokens that the tokenizer recognizes and the model attends to, exactly the same as user input. In a 2026 production stack, hard prompts include:

  • The system prompt that scopes the assistant’s role.
  • Tool and function definitions in JSON or YAML.
  • Retrieved chunks from a vector store, knowledge graph, or hybrid retriever.
  • Few-shot demonstrations.
  • The user’s actual request.
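
These components can be assembled into a single string before the API call. A minimal sketch; the section tags, helper name, and sample values are illustrative, not a required format:

```python
def build_hard_prompt(system: str, tools: str, chunks: list[str],
                      shots: list[str], user: str) -> str:
    """Assemble a structured hard prompt from its typical production parts."""
    context = "\n\n".join(chunks)
    demos = "\n\n".join(shots)
    return (
        f"<system>\n{system}\n</system>\n"
        f"<tools>\n{tools}\n</tools>\n"
        f"<context>\n{context}\n</context>\n"
        f"<examples>\n{demos}\n</examples>\n"
        f"<user>\n{user}\n</user>"
    )

prompt = build_hard_prompt(
    system="You are a support assistant. Answer only from the context.",
    tools='{"name": "lookup_order", "parameters": {"order_id": "string"}}',
    chunks=["Refunds are processed within 5 business days."],
    shots=["Q: How long do refunds take?\nA: 5 business days."],
    user="Where is my refund?",
)
print(prompt)
```

Every character here is billed as input tokens, which is exactly the trade soft prompts avoid.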

Hard prompts are the only prompt representation supported by closed APIs like the gpt-5-2025-08-07 endpoint, Anthropic’s claude-opus-4-7, and Google’s gemini-3-pro. Optimizing them is a search problem over wording, structure, ordering of demonstrations, and few-shot example selection.

Characteristics

  • Token-level. Each part of the prompt costs tokens at inference time.
  • Portable. A hard prompt that works on Claude often transfers with light edits to GPT-5 or Gemini.
  • Inspectable. You can log the exact prompt that produced each response, which makes failure analysis tractable.
  • Engineering surface. Few-shot selection, chain-of-thought scaffolding, role conditioning, and output schemas are all hard-prompt levers.

Modern hard-prompt patterns

  • Structured prompting. XML, JSON, or Markdown sections to separate system rules, context, and user input.
  • Few-shot retrieval. Pull demonstrations from a labelled pool by semantic similarity rather than hardcoding them.
  • Chain-of-thought and reasoning. Reasoning models like the o-series, and thinking modes such as Gemini’s, have made CoT partially native, so explicit reasoning scaffolds matter less than they once did.
  • Prompt optimization. DSPy, APE, OPRO, and EvoPrompt search the hard-prompt space against an evaluator. Future AGI’s prompt-opt loop wraps this pattern with an evaluation suite.
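
Few-shot retrieval from a labelled pool can be sketched without any vector store. The bag-of-words cosine similarity below is a toy stand-in for the embedding model a production retriever would use:

```python
from collections import Counter
from math import sqrt

def similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts, a stand-in for embeddings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def select_demos(query: str, pool: list[tuple[str, str]], k: int = 2) -> list[tuple[str, str]]:
    """Pick the k labelled examples most similar to the incoming query."""
    return sorted(pool, key=lambda ex: similarity(query, ex[0]), reverse=True)[:k]

pool = [
    ("reset my password", "account_access"),
    ("refund for a duplicate charge", "billing"),
    ("app crashes on startup", "bug_report"),
]
demos = select_demos("I was charged twice, need a refund", pool, k=1)
print(demos)  # the billing example ranks first
```

The selected pairs then become the few-shot demonstrations in the hard prompt.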

What Are Soft Prompts in 2026

A soft prompt is a sequence of continuous vectors prepended to the model’s input embeddings. These vectors do not correspond to any token. They live in the same dimensional space as token embeddings and are trained by backpropagation against a downstream task loss while the base model weights stay frozen.

You cannot type a soft prompt. You learn one with a small dataset, save it as a tensor, and load it at inference time. Closed APIs do not accept soft prompts, so soft prompting is exclusively an open-weights technique used on self-hosted models like Llama 4, Qwen 3, Mistral, and Gemma 3.
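
Conceptually, a soft prompt is just extra rows concatenated in front of the token-embedding matrix before the transformer runs. A framework-free sketch with toy numbers; real systems use tensors and train the virtual rows by backpropagation:

```python
import random

random.seed(0)
embed_dim = 4            # toy embedding width; real models use thousands
num_virtual_tokens = 3   # length of the learned soft prompt

# Learned soft prompt: trainable vectors, not tied to any vocabulary entry.
soft_prompt = [[random.gauss(0, 0.02) for _ in range(embed_dim)]
               for _ in range(num_virtual_tokens)]

# Frozen embeddings for the actual text tokens of the request.
token_embeddings = [[0.1 * i] * embed_dim for i in range(5)]  # 5 text tokens

# The model attends over the virtual positions first, then the text.
model_input = soft_prompt + token_embeddings
print(len(model_input))  # 8 positions: 3 virtual + 5 text
```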

The soft-prompt family of methods

  • Prompt tuning (Lester et al., 2021). Learn a small set of embedding vectors prepended only at the input layer. Strong on large bases, weaker on smaller models.
  • Prefix tuning (Li and Liang, 2021). Learn key and value prefixes at every transformer layer, giving the model deeper conditioning. More expressive than prompt tuning at the cost of more trainable parameters.
  • P-tuning (Liu et al., 2021) and P-tuning v2. P-tuning adds a small encoder over the learned embeddings. P-tuning v2 effectively brings prefix tuning to mid-size models and matches fine-tuning on many NLU tasks.
  • LoRA and adapters as adjacent techniques. Not soft prompts, but parameter-efficient. LoRA modifies attention weight projections via low-rank updates. People sometimes describe combined LoRA plus prefix-tuning recipes as “LoRA for prompts.”

Characteristics

  • No text-token cost. A 20-vector soft prompt adds 20 virtual positions to attention rather than 20 visible billable tokens, though attention compute still scales with those extra positions.
  • Model-specific. A soft prompt trained on Llama-4-70B does not transfer to Qwen-3.
  • Requires gradients. You need labelled data and GPU access to train.
  • Opaque. You cannot read what a soft prompt encodes, only test its effect on evals.

Hard Prompt vs Soft Prompt: Side-by-Side

| Property | Hard Prompt | Soft Prompt |
| --- | --- | --- |
| Lives at | Token layer | Embedding layer |
| Created by | Humans, or hard-prompt optimizers like DSPy / APE / OPRO | Gradient descent on a labelled task |
| Needs GPUs | No, only the inference API | Yes, for training |
| Needs labels | Optional, helpful for evals | Required, typically 1,000 to 100,000 examples |
| Works on hosted frontier APIs | Yes | No |
| Cost at inference | Token cost of the prompt | Tiny, a few extra vectors |
| Storage size | A string | A few hundred KB to a few MB tensor |
| Debuggability | High, exact text is logged | Low, only via behaviour |
| Best use cases | Agents, RAG, prototypes, broad tasks | Narrow classification, structured extraction, domain-specific style |

When to Use Hard Prompts in 2026

Choose hard prompts when any of the following is true:

  • You are calling a hosted model (GPT-5, Claude Opus 4.7, Gemini 3, Mistral Large via API, etc.).
  • The task is broad or open-ended, like an agent, multi-turn assistant, or RAG product.
  • You need cross-model portability for procurement or fallback reasons.
  • You need to log, audit, and reproduce the exact instruction.
  • You have fewer than a few hundred labels but high developer iteration speed.

A 2026 production checklist for hard prompts:

  1. Pin to a dated model snapshot (gpt-5-2025-08-07, claude-opus-4-7) in production, with a canary on the “latest” alias.
  2. Define a small but representative held-out eval set.
  3. Run a hard-prompt optimizer or manual search to find a strong baseline.
  4. Track regressions with replay and online evals before promoting.
  5. Use online evaluators at the gateway to catch model-side drift.
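
Step 4's "compare, don't eyeball" can be as simple as a paired sign test over per-example judge scores from the held-out set. A stdlib-only sketch with made-up scores:

```python
from math import comb

def sign_test_p(baseline: list[float], candidate: list[float]) -> float:
    """Two-sided sign test: p-value for the candidate differing from baseline."""
    wins = sum(c > b for b, c in zip(baseline, candidate))
    losses = sum(c < b for b, c in zip(baseline, candidate))
    n = wins + losses  # ties are dropped
    k = min(wins, losses)
    return min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)

# Judge scores per eval example for the production and candidate prompts.
prod = [0.6, 0.7, 0.5, 0.8, 0.6, 0.7, 0.5, 0.6]
cand = [0.8, 0.9, 0.7, 0.8, 0.7, 0.9, 0.8, 0.7]
p = sign_test_p(prod, cand)
print(f"p = {p:.3f}")  # promote only if the improvement is unlikely to be noise
```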

When to Use Soft Prompts in 2026

Reach for soft prompts when all of the following hold:

  • You self-host an open-source base model (Llama 4, Qwen 3, Mistral, Gemma 3).
  • The task is narrow and well-specified, like single-label classification, schema extraction, code intent labeling, or domain-specific style transfer.
  • You have a labelled dataset of roughly 1,000 to 100,000 examples.
  • Per-request latency and token cost matter enough to want a near-zero prompt overhead.
  • You can train and version embedding tensors alongside your model.

If you only have a few hundred labels, prefer few-shot hard prompts plus retrieval. If you have hundreds of thousands of labels, consider full or LoRA fine-tuning instead of pure soft prompting.

Hybrid Pattern: Hard Prompt Shell + Soft Prompt Adapter

A common 2026 pattern on open-source bases:

  1. A short hard prompt scopes the task and provides the output schema.
  2. A trained soft prompt or LoRA adapter conditions the model on domain-specific style and entities.
  3. An evaluator scores faithfulness and structure on each response.

This combines the auditability of hard prompts with the parameter efficiency of soft prompting. It is most common in regulated domains (healthcare, legal, finance) where the base model is open-source for data residency reasons but the task is narrow enough to benefit from learned prompts.
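
The three steps can be sketched as a pipeline. Here `generate` is a stub standing in for a self-hosted model already conditioned by a loaded soft prompt or LoRA adapter, and the schema check plays the evaluator role; all names and sample data are illustrative:

```python
import json

SHELL = (
    "Extract the entities from the note below. "
    'Return JSON with keys "drug" and "dose".\n\nNote: {note}'
)

def generate(prompt: str) -> str:
    """Stand-in for a soft-prompt-conditioned open-source model."""
    return '{"drug": "amoxicillin", "dose": "500 mg"}'

def evaluate_structure(output: str) -> bool:
    """Minimal evaluator: the response must parse and carry both keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and {"drug", "dose"} <= parsed.keys()

output = generate(SHELL.format(note="Prescribed amoxicillin 500 mg TID."))
print(evaluate_structure(output))  # True
```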

Code: Hard-Prompt Evaluation Loop with Future AGI

from fi.evals import evaluate

def call_model(system: str, user: str) -> str:
    """Replace with your LLM client (OpenAI, Anthropic, LiteLLM, etc.)."""
    raise NotImplementedError

system_prompt = "You are a clinical-summary assistant. Output JSON with keys: chief_complaint, history, plan."
user = "Summarize: patient presents with cough, fever, and shortness of breath for 3 days..."
output = call_model(system_prompt, user)

score = evaluate(
    "faithfulness",
    output=output,
    context=user,
)
print(score)

For a custom LLM-judge against a soft-prompt-conditioned open-source model:

from fi.opt.base import Evaluator
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

judge = Evaluator(
    metric=CustomLLMJudge(
        prompt="Rate clinical-summary correctness from 0 to 1.",
        provider=LiteLLMProvider(model="gpt-5-2025-08-07"),
    ),
)
score = judge.evaluate(input=user, output=output)

Code: Soft-Prompt Training with PEFT

A minimal prompt-tuning recipe on an open-source base, adapted from the Hugging Face PEFT prompt-tuning docs:

from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-4-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify the clinical note.",
    num_virtual_tokens=20,
    tokenizer_name_or_path=base,
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()

Train with standard Hugging Face Trainer or Accelerate, then save the adapter and load it at inference. Evaluate with the same Future AGI evaluators you use on the hard-prompt side, so the two approaches are scored on the same ruler.

Prompt Optimization in 2026

Modern hard-prompt optimization tooling that operates at the text level:

  • DSPy. Compile prompts and few-shot demos against metrics, with structured signatures.
  • APE. Automatic Prompt Engineer, sampler-based candidate generation and scoring.
  • OPRO. Optimization by PROmpting, uses an LLM as the optimizer over the prompt space.
  • EvoPrompt. Evolutionary search over prompt templates.
  • Future AGI prompt-opt. Wraps the above patterns inside an evaluation suite with offline replay, statistical comparison, and online evaluators at deploy.

All of these operate on hard prompts because that is what ships to hosted APIs. Soft-prompt optimization remains a gradient-descent training problem rather than a search problem.
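
All of these optimizers share the same skeleton: propose text candidates, score them with an evaluator, keep the best. A toy greedy search over template edits; the scorer is a deliberately trivial stand-in for a real eval suite:

```python
def score(prompt: str) -> float:
    """Stand-in evaluator; a real loop would score the prompt on an eval set."""
    return min(prompt.count("JSON"), 1) + 0.5 * ("step by step" in prompt)

EDITS = [
    lambda p: p + " Respond in JSON.",
    lambda p: p + " Think step by step.",
    lambda p: p.replace("Summarize", "Carefully summarize"),
]

def optimize(prompt: str, rounds: int = 3) -> tuple[str, float]:
    """Greedy text-level search: keep any edit that improves the eval score."""
    best, best_score = prompt, score(prompt)
    for _ in range(rounds):
        for edit in EDITS:
            candidate = edit(best)
            if score(candidate) > best_score:
                best, best_score = candidate, score(candidate)
    return best, best_score

best, best_score = optimize("Summarize the ticket.")
print(best_score, "|", best)
```

DSPy, APE, OPRO, and EvoPrompt differ mainly in how candidates are proposed (compiled signatures, LLM sampling, LLM-as-optimizer, evolutionary operators), not in this outer loop.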

How to Choose: Decision Tree

  1. Are you calling a hosted frontier API? If yes, hard prompts only.
  2. Do you self-host an open-source base and have a narrow task plus labels? Consider soft prompts (prompt tuning, prefix tuning, P-tuning v2) or LoRA.
  3. Do you have more than 50,000 high-quality labels and care about per-token cost? Full or LoRA fine-tuning often beats soft prompting.
  4. Do you need full audit logs of the instruction sent to the model? Hard prompts win, because soft prompts cannot be inspected.
  5. Do you need to ship a change today? Hard prompts plus a prompt-opt loop is the lowest-latency path.
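
The decision tree reads naturally as a function. The thresholds mirror the numbers above and are rules of thumb, not hard limits:

```python
def choose_prompt_strategy(hosted_api: bool, narrow_task: bool, num_labels: int) -> str:
    """Map the decision tree onto a recommendation string."""
    if hosted_api:
        return "hard prompts"                 # closed APIs accept text only
    if num_labels > 50_000:
        return "full or LoRA fine-tuning"     # enough labels to move weights
    if narrow_task and num_labels >= 1_000:
        return "soft prompts"                 # prompt/prefix tuning, P-tuning v2
    return "hard prompts"                     # default: iterate and evaluate

print(choose_prompt_strategy(hosted_api=True, narrow_task=True, num_labels=5_000))
```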

For most 2026 product teams the answer is hard prompts, refined inside an evaluation suite, with soft prompts reserved for the narrow self-hosted slice of the stack.

Frequently asked questions

What is the difference between a hard prompt and a soft prompt?
A hard prompt is a natural-language string a human writes and sends to a model, like 'Summarize this report in 3 bullets.' A soft prompt is a sequence of continuous embedding vectors prepended to the model's input embeddings, learned during training and carrying no token-level interpretation. Hard prompts are interpretable and portable across models. Soft prompts are model-specific, learned from gradient updates on a frozen base model, and usually outperform hard prompts on narrow tasks once trained.
Are hard prompts and soft prompts the same as prompt tuning and prefix tuning?
No. Hard prompt and soft prompt describe the prompt representation: text tokens vs continuous embeddings. Prompt tuning and prefix tuning are specific training methods for learning soft prompts. Prompt tuning learns embeddings at the input layer only. Prefix tuning learns key-value prefixes at every transformer layer. P-tuning v2 generalizes prefix tuning. All three are parameter-efficient fine-tuning (PEFT) methods that operate on soft prompts.
When should I use a hard prompt vs a soft prompt in production in 2026?
Use hard prompts for nearly every shipped agent and RAG product in 2026, because most teams call hosted models like gpt-5-2025-08-07, claude-opus-4-7, or gemini-3-pro through an API that does not accept soft prompts. Use soft prompts when you have GPU access to a self-hosted open-source base model, a narrow task with a labelled dataset above roughly 1,000 examples, and a need for the lowest per-token cost at inference time.
Do soft prompts outperform hard prompts?
Soft prompts typically match or exceed careful hard prompting on narrow tasks once trained on a few hundred to a few thousand labelled examples, while using roughly 0.01% of the parameters of full fine-tuning. On broad reasoning or instruction-following tasks, well-engineered hard prompts plus retrieval still dominate. The relevant axis in 2026 is task narrowness and label budget rather than absolute capability.
Can I use soft prompts with closed models like GPT-5 or Claude?
No. OpenAI, Anthropic, and Google do not currently expose soft-prompt or prefix-tuning APIs on their hosted frontier models. They expose hard prompts, system prompts, structured outputs, and supervised fine-tuning via labelled examples. Soft prompts are available on open-source bases like Llama 4, Qwen 3, and Mistral derivatives through Hugging Face PEFT or NeMo.
What is prompt optimization and does it work on hard or soft prompts?
Prompt optimization is the automated search for better prompt wording, structure, or few-shot examples against an evaluation metric. Methods like DSPy, APE, OPRO, and Future AGI's prompt-opt operate on hard prompts because they need to ship to hosted APIs. Soft-prompt training is a different optimization problem solved via gradient descent on a model you control. Both can run inside the same evaluation harness.
How do I evaluate a hard prompt change before shipping?
Run the new prompt on a held-out eval set with reference answers or judge-based evaluators for faithfulness, correctness, and task success. Compare against the production prompt on the same examples. Use a sequential test or fixed-N statistical test rather than eyeballing two scores. Future AGI's evaluation library (`fi.evals`) and prompt-opt loop wrap this pattern with offline replay and online evaluators.
What about LoRA for prompts?
LoRA is parameter-efficient fine-tuning of attention weight matrices, not a prompt technique per se. People sometimes call combined LoRA plus prefix-tuning setups 'LoRA for prompts' because the LoRA adapter conditions the model's behaviour similarly to a learned soft prompt. In practice LoRA tunes a small fraction of weights, while prompt tuning and prefix tuning leave all base weights frozen and only learn prepended vectors.