Prompting

What Is In-Context Learning?

A prompt-time LLM behavior where examples or facts in the context window steer outputs without changing model weights.

What is In-Context Learning?

In-context learning is a prompt-time LLM behavior where the model adapts to the current task from instructions, examples, retrieved facts, memory, or tool outputs placed in its context window, without changing model weights. It is a prompting pattern that appears in production traces as few-shot examples, RAG context, prior turns, or tool observations. FutureAGI treats it as a reliability surface: measure whether the model uses the supplied context, ignores irrelevant context, and preserves task quality as prompts change.

Why It Matters in Production LLM and Agent Systems

Failures start when the prompt context teaches the wrong behavior. A support agent may copy a stale few-shot refund example into a billing answer; a RAG assistant may overfit to a retrieved chunk that is semantically close but policy-incompatible; a workflow agent may treat a previous tool observation as current state. The user sees a confident wrong answer, while logs only show a normal 200 response.

The pain is spread. Developers debug a prompt that works with one example ordering and fails when a retriever adds two paragraphs. SREs see p99 latency and token cost climb because every route adds “just one more” example. Product owners see unstable task completion across cohorts. Compliance reviewers ask why a model followed a demo example instead of the policy text in the same context.

In 2026 multi-step pipelines, in-context learning is less predictable than a single chat completion. The planner prompt, retriever prompt, tool schema, and final answer prompt all teach the model. Unlike fine-tuning, the behavior can change on every request because the context changes on every request. Symptoms include rising prompt-token counts, answer drift after retrieval changes, higher eval-fail-rate-by-cohort, and traces where the final answer cites an example instead of the live source.
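One of these symptoms, rising prompt-token counts, can be caught with a simple rolling-mean check over `llm.token_count.prompt` values per trace. The window and threshold below are illustrative assumptions, not FutureAGI defaults:

```python
# Sketch of a prompt-token drift check: compare the rolling mean of
# llm.token_count.prompt over the most recent traces against the
# window just before it. Window size and threshold are illustrative.

def token_drift(counts, window=50, threshold=1.2):
    # counts: chronological llm.token_count.prompt values, one per trace.
    if len(counts) < 2 * window:
        return False  # not enough history to compare two windows
    before = sum(counts[-2 * window:-window]) / window
    after = sum(counts[-window:]) / window
    return after > threshold * before  # alert on a >20% jump

drifted = token_drift([100] * 50 + [150] * 50)
stable = token_drift([100] * 100)
```

A check like this is cheap enough to run on every sampled trace batch and pairs naturally with the eval-fail-rate-by-cohort alerts described later.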

How FutureAGI Handles In-Context Learning

FutureAGI’s approach is to connect in-context learning to prompt optimization, not treat it as prompt folklore. In a FutureAGI workflow, the engineer starts with a fi.prompt.Prompt template that contains the system instruction, optional few-shot examples, and variables for retrieved context. A traceAI-langchain or OpenTelemetry trace records the prompt version, llm.token_count.prompt, response, and eval metadata for each run.
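The assembly described above (system instruction, optional few-shot examples, variables for retrieved context) can be sketched library-agnostically. The function below is illustrative only; it is not the fi.prompt.Prompt API:

```python
# Library-agnostic sketch of assembling an in-context-learning prompt.
# The structure (system instruction + few-shot examples + retrieved
# context + live question) mirrors what a prompt template holds; the
# names here are illustrative, not the FutureAGI API.

def build_prompt(system: str, examples: list[tuple[str, str]],
                 retrieved: list[str], question: str) -> str:
    parts = [system]
    for q, a in examples:  # few-shot demonstrations
        parts.append(f"Q: {q}\nA: {a}")
    if retrieved:  # retrieved facts get their own labeled section
        parts.append("Retrieved policy:\n" + "\n".join(retrieved))
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_prompt(
    system="Answer using the retrieved policy; prefer it over examples.",
    examples=[("Can I get a refund?", "Refunds are allowed within 30 days.")],
    retrieved=["Policy 4.2: refunds require a receipt."],
    question="Do I need a receipt for a refund?",
)
```

Keeping each context source in its own labeled section is what makes the later per-span tracing (example ids, retriever document ids) possible.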

For prompt optimization, the concrete surface is agent-opt. The engineer can use GEPAOptimizer as the scoring loop over candidate prompts that vary example choice, example order, and context-instruction wording while optimizing multiple objectives: TaskCompletion, ContextUtilization, Groundedness, and prompt-token cost. ProTeGi is a better fit when failures can be explained as textual edits; PromptWizardOptimizer suits multi-stage prompt pipelines that need mutate-critique-refine cycles.
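The scoring loop behind an optimizer run can be sketched as weighted multi-objective selection with a token-cost penalty. The metric names, weights, and penalty below are illustrative stand-ins, not the GEPAOptimizer interface:

```python
# Minimal sketch of a multi-objective prompt-scoring loop: the idea
# behind scoring candidate prompts on several quality metrics while
# penalizing prompt-token cost. All names and weights are illustrative.

def score_candidate(metrics: dict[str, float], token_cost: float,
                    weights: dict[str, float], cost_weight: float = 0.1) -> float:
    # Weighted sum of quality metrics minus a prompt-token penalty.
    quality = sum(weights[name] * metrics[name] for name in weights)
    return quality - cost_weight * token_cost

def pick_best(candidates, weights):
    # candidates: list of (prompt_text, metrics, normalized_token_cost).
    return max(candidates,
               key=lambda c: score_candidate(c[1], c[2], weights))[0]

best = pick_best(
    [
        ("prompt-A", {"task_completion": 0.73, "context_utilization": 0.55,
                      "groundedness": 0.92}, 0.8),
        ("prompt-B", {"task_completion": 0.78, "context_utilization": 0.81,
                      "groundedness": 0.93}, 0.9),
    ],
    weights={"task_completion": 0.4, "context_utilization": 0.3,
             "groundedness": 0.3},
)
```

A scalarized score like this is the simplest formulation; the real optimizers described here can also treat the objectives as hard constraints, as the example below shows.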

Example: a loan-servicing assistant uses three few-shot examples plus retrieved policy snippets. Offline evals show TaskCompletion at 0.73, but ContextUtilization flags that failed answers copy the second example even when the retrieved policy disagrees. The engineer runs GEPA over a 300-row cohort, pins Groundedness above 0.90, and rejects candidates that add more than 15% prompt tokens. The winning prompt changes example order, adds a “prefer retrieved policy over examples” clause, and ships as a new prompt version. FutureAGI then alerts if eval-fail-rate-by-cohort rises after release.
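The acceptance constraints in this example (Groundedness pinned above 0.90, prompt-token growth capped at 15%) amount to a simple gate over candidates. A minimal sketch, with illustrative field names:

```python
# Sketch of the acceptance gate from the example above: a candidate
# prompt must keep Groundedness above a floor and prompt-token growth
# under a cap relative to the shipping baseline. Field names are
# illustrative, not a FutureAGI schema.

def passes_gate(candidate: dict, baseline_tokens: int,
                min_groundedness: float = 0.90,
                max_token_growth: float = 0.15) -> bool:
    if candidate["groundedness"] < min_groundedness:
        return False  # reject: not grounded enough
    growth = (candidate["prompt_tokens"] - baseline_tokens) / baseline_tokens
    return growth <= max_token_growth  # reject oversized prompts

ok = passes_gate({"groundedness": 0.93, "prompt_tokens": 1100}, 1000)
```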

How to Measure or Detect It

Measure it by comparing the same task under different context payloads, not by inspecting prompt text alone.

  • Context-use score — ContextUtilization measures whether the model actually uses the supplied context instead of ignoring it or copying an irrelevant example.
  • Task delta — compare TaskCompletion or an exact business metric with no examples, one example, few-shot examples, and retrieved context.
  • Grounding safety — Groundedness and Faithfulness catch answers that go beyond the supplied facts.
  • Trace fields — track llm.token_count.prompt, prompt version, retriever document ids, and example ids per span.
  • Dashboard signal — alert on eval-fail-rate-by-cohort, token-cost-per-trace, and thumbs-down rate after a prompt-context change.

```python
from fi.evals import ContextUtilization

score = ContextUtilization().evaluate(
    input=user_query,
    context=prompt_context,
    output=model_response,
)
print(score)
```

Run this on a fixed eval cohort before release and on sampled production traces after release. A high task score with low ContextUtilization means the model may be succeeding from priors, which will break when the task distribution shifts.
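That check can be automated over a cohort. The sketch below flags rows that score high on the task but low on ContextUtilization; the thresholds and field names are illustrative assumptions:

```python
# Sketch of a pre-release check: flag eval rows where the task score is
# high but ContextUtilization is low, i.e. the model may be succeeding
# from priors rather than the supplied context. Thresholds and field
# names are illustrative.

def flag_prior_reliance(rows, task_min=0.8, ctx_max=0.5):
    # rows: list of {"id": str, "task": float, "context_utilization": float}
    return [r["id"] for r in rows
            if r["task"] >= task_min and r["context_utilization"] <= ctx_max]

flagged = flag_prior_reliance([
    {"id": "t1", "task": 0.9, "context_utilization": 0.3},  # suspicious
    {"id": "t2", "task": 0.9, "context_utilization": 0.8},  # healthy
    {"id": "t3", "task": 0.6, "context_utilization": 0.2},  # plain failure
])
```

Rows flagged this way are the ones most likely to regress when the task distribution shifts, so they make a good manual-review queue.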

Common Mistakes

Common failures are almost always context-control failures, not mysterious model behavior:

  • Treating few-shot examples as harmless decoration. Models often copy format, tone, and hidden assumptions from examples before reading retrieved facts.
  • Adding examples until accuracy improves once. Prompt-token growth can hide latency and cost regressions until traffic spikes.
  • Mixing policy snippets, memory, and tool outputs without priority rules. The model needs a tie-breaker when sources conflict.
  • Testing only the final answer. Inspect the trace to see which example, retrieved chunk, or prior turn drove the response.
  • Calling it fine-tuning. In-context learning is request-local; a deployment can regress because one upstream retriever changed.
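The missing-priority-rules mistake above has a simple mitigation: an explicit precedence order applied when sources conflict on the same fact. A minimal sketch, with illustrative source names and priorities:

```python
# Sketch of a tie-breaker for conflicting context sources: when policy,
# tool outputs, retrieved chunks, and memory disagree on the same key,
# the highest-priority source wins. Source names and the priority order
# are illustrative choices, not a FutureAGI convention.

PRIORITY = {"policy": 0, "tool_output": 1, "retrieved": 2, "memory": 3}

def resolve(facts):
    # facts: list of (source, key, value); a lower priority number wins.
    resolved = {}
    # Visit lowest-priority sources first so higher-priority ones
    # overwrite them last.
    for source, key, value in sorted(facts, key=lambda f: PRIORITY[f[0]],
                                     reverse=True):
        resolved[key] = (source, value)
    return resolved

conflict = resolve([
    ("memory", "refund_window", "60 days"),
    ("policy", "refund_window", "30 days"),
])
```

Making the precedence explicit in code (or in the prompt, as the "prefer retrieved policy over examples" clause does) gives the model and the reviewers the same tie-breaker.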

Frequently Asked Questions

What is in-context learning?

In-context learning is an LLM behavior where examples or context included in the prompt steer the model on the current task without updating model weights.

How is in-context learning different from fine-tuning?

Fine-tuning changes model weights through training. In-context learning keeps the model fixed and changes only the request context, such as examples, retrieved passages, memory, or tool results.

How do you measure in-context learning?

In FutureAGI, compare task scores across prompt contexts and use ContextUtilization to check whether the output used the supplied examples or facts. Trace `llm.token_count.prompt` to catch expensive context expansion.