What Is Chain-of-Draft?
Chain-of-Draft is a model reasoning technique that prompts an LLM to keep only terse draft steps before giving the final answer, rather than writing full chain-of-thought. It is part of inference-time model behavior: the weights stay fixed, but the trace changes through shorter completion tokens, lower latency, and fewer verbose intermediate claims. FutureAGI evaluates Chain-of-Draft by comparing answer quality, ReasoningQuality, and token cost on the same production or regression cohort.
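The change is purely prompt-side. Below is a minimal sketch of the two instruction styles against a generic chat-completion message format; the exact wording and the `'####'` answer delimiter are illustrative assumptions, not a FutureAGI-provided template:

```python
# Illustrative instruction wording only; these strings are assumptions,
# not FutureAGI-provided templates. The model weights do not change.
FULL_COT_INSTRUCTION = (
    "Think through the problem step by step in full sentences, "
    "then give the final answer."
)
CHAIN_OF_DRAFT_INSTRUCTION = (
    "Use draft notes only: symbols, short labels, no prose. "
    "Keep each step to a few words, then give the final answer after '####'."
)

def build_messages(instruction: str, task: str) -> list[dict]:
    """Assemble a chat-style request; works with any chat-completion client."""
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": task},
    ]
```

Swapping one instruction for the other is the entire intervention; everything else in this entry is about verifying that the swap is safe.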
Why Chain-of-Draft matters in production LLM and agent systems
Full chain-of-thought can improve hard tasks, but it also creates operational debt. The model writes long intermediate prose, the gateway pays for every completion token, and trace storage keeps reasoning text that may contain sensitive business logic or user data. If a team removes reasoning entirely, math, planning, coding, and tool-use quality often drop. Chain-of-Draft is the middle path: ask for reasoning, but make the reasoning compact enough to run in production.
The failure modes are practical. A support agent that uses full chain-of-thought on every refund request may show a 2x completion-token increase, higher p99 latency, and a higher timeout rate during traffic spikes. A developer sees clean final answers in tests, but SREs see llm.token_count.completion and cost-per-trace rising after the prompt change. A compliance reviewer worries that saved traces now include unnecessary internal reasoning. End users feel it as slower responses, inconsistent answers on multi-step questions, or terse drafts leaking into the UI.
Agentic systems make the tradeoff sharper. A planner may need a compact scratchpad for each tool decision, but a five-step agent cannot afford a full essay at every step. Symptoms include repeated tool calls, longer agent.trajectory.step spans, higher fallback rate, and eval failures clustered on tasks that require arithmetic, branching, or evidence comparison.
How FutureAGI Optimizes Chain-of-Draft Prompts
FutureAGI’s approach is to treat Chain-of-Draft as an optimization candidate, not as a free replacement for reasoning evaluation. The concrete FutureAGI Evaluate surface for this term is agent-opt: ProTeGi, GEPAOptimizer, and PromptWizardOptimizer can search for prompts that keep draft reasoning short while preserving task success. A seed instruction might say: “Use draft notes only: symbols, short labels, no prose. Then answer.” The optimizer tests variants against the same eval cohort instead of relying on one hand-written prompt.
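The selection logic those optimizers apply can be sketched as a loop over candidates scored on a fixed cohort. In this sketch, `generate_variants` and `score_on_cohort` are hypothetical stand-ins, not the real agent-opt API:

```python
# Hypothetical selection loop; generate_variants and score_on_cohort stand in
# for the agent-opt optimizers (ProTeGi / GEPAOptimizer), whose real
# signatures are not shown in this entry.
SEED = "Use draft notes only: symbols, short labels, no prose. Then answer."

def pick_prompt(generate_variants, score_on_cohort, cohort):
    best, best_score = SEED, score_on_cohort(SEED, cohort)
    for variant in generate_variants(SEED):        # e.g. fixes proposed from failed rows
        score = score_on_cohort(variant, cohort)   # identical rows for every candidate
        # Keep a variant only if quality holds while draft cost drops.
        if (score["task_completion"] >= best_score["task_completion"]
                and score["completion_tokens"] < best_score["completion_tokens"]):
            best, best_score = variant, score
    return best
```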
Example: a claims-processing agent currently uses full chain-of-thought to decide whether to approve, deny, or escalate a case. The team wants lower latency without losing correct routing. They store the baseline prompt in fi.prompt.Prompt, replay 500 production-like rows, and score each candidate with TaskCompletion, ReasoningQuality, StepEfficiency, and llm.token_count.completion. ProTeGi uses failed rows to propose textual fixes, while GEPAOptimizer keeps multiple objectives in view: answer quality, draft length, latency p99, and safety threshold.
If the winning prompt cuts completion tokens by 38% but drops TaskCompletion on medical edge cases, the engineer does not ship it broadly. They either add a cohort-specific instruction, keep full chain-of-thought for high-risk routes through Agent Command Center prompt-versioning, or set a model fallback path for failures. Unlike Ragas-style faithfulness checks, this workflow does not only ask whether the final answer is supported; it asks whether compact reasoning still preserves the steps needed to reach the right answer.
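That shipping decision is effectively a per-cohort release gate. A minimal sketch, assuming baseline and candidate metrics have already been aggregated per cohort (the dict shapes here are assumptions for illustration):

```python
# Hypothetical per-cohort release gate; the metric names mirror the evals
# above, and the data shapes are assumptions for illustration.
def ship_decision(baseline: dict, candidate: dict, high_risk: set) -> dict:
    """Ship the draft prompt only on cohorts where quality holds."""
    decisions = {}
    for cohort, base in baseline.items():
        cand = candidate[cohort]
        saves_tokens = cand["completion_tokens"] < base["completion_tokens"]
        holds_quality = cand["task_completion"] >= base["task_completion"]
        if cohort in high_risk and not holds_quality:
            decisions[cohort] = "keep_full_cot"    # e.g. medical edge cases
        elif saves_tokens and holds_quality:
            decisions[cohort] = "ship_draft"
        else:
            decisions[cohort] = "hold"
    return decisions
```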
How to measure or detect Chain-of-Draft
Measure Chain-of-Draft by replaying the same cohort through three arms: the Chain-of-Draft prompt, the full chain-of-thought prompt, and the no-reasoning prompt. The target is not maximum brevity; it is the best point on the cost-quality frontier.
- `ReasoningQuality`: scores whether the draft steps form a valid reasoning path. Calibrate thresholds separately from full chain-of-thought because drafts are intentionally shorter.
- `TaskCompletion`: checks whether the final workflow still succeeds after the draft format changes.
- `StepEfficiency`: catches agents that save text tokens but add redundant tool calls or failed actions.
- Trace signals: compare `llm.token_count.completion`, latency p99, cost-per-trace, fallback rate, and eval-fail-rate-by-cohort.
- User proxies: watch thumbs-down rate, escalation rate, and reviewer override rate by prompt version.
```python
from fi.evals import ReasoningQuality

# Placeholder inputs; in practice these come from a replayed trace.
user_task = "Approve, deny, or escalate this claim?"
draft_reasoning_and_answer = "policy active; damage covered; under limit -> approve"
trace_steps = ["lookup_policy", "check_coverage", "compare_limit"]

metric = ReasoningQuality()
result = metric.evaluate(
    input=user_task,
    output=draft_reasoning_and_answer,
    trajectory=trace_steps,
)
print(result.score)  # calibrate the pass threshold separately for draft-style traces
```
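To compare the Chain-of-Draft arm against a baseline arm, aggregate per-row results into the frontier view described above. A minimal sketch with placeholder rows standing in for replayed traces (field names mirror the signals listed above):

```python
# Placeholder per-row results; real rows would come from the replayed cohort.
cot_rows = [{"completion_tokens": 410, "passed": True},
            {"completion_tokens": 395, "passed": True}]
cod_rows = [{"completion_tokens": 250, "passed": True},
            {"completion_tokens": 240, "passed": False}]

def summarize(rows):
    n = len(rows)
    return {
        "mean_tokens": sum(r["completion_tokens"] for r in rows) / n,
        "fail_rate": sum(1 for r in rows if not r["passed"]) / n,
    }

cot, cod = summarize(cot_rows), summarize(cod_rows)
token_savings = 1 - cod["mean_tokens"] / cot["mean_tokens"]
print(f"tokens saved: {token_savings:.0%}, "
      f"fail-rate delta: {cod['fail_rate'] - cot['fail_rate']:+.3f}")
```

A large token saving with a small fail-rate delta sits on the frontier; a large saving with a large delta is the case the release gate above should catch.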
Common mistakes
Most of these failures are measurement errors: treat Chain-of-Draft as a model-behavior change with release gates, not as a formatting preference.
- Treating brevity as correctness. A four-word draft can be cheap and still skip the key constraint.
- Comparing token savings on different cohorts. Use the same rows, model, tools, and prompt version when measuring the delta.
- Forcing every task into Chain-of-Draft. Easy classification may need no reasoning; regulated decisions may need fuller trace evidence.
- Letting draft notes reach the UI. Keep draft text in trace-only fields or strip it before user delivery.
- Ignoring agent path quality. A shorter answer is worse if `StepEfficiency` drops because tools repeat or fail.
Frequently Asked Questions
What is Chain-of-Draft?
Chain-of-Draft is a model prompting technique that asks an LLM to keep terse draft reasoning notes before the final answer. It aims to preserve useful intermediate structure while reducing tokens, latency, and exposed reasoning text.
How is Chain-of-Draft different from Chain-of-Thought?
Chain-of-thought usually writes full reasoning prose. Chain-of-Draft compresses those steps into short notes, equations, or labels, so the model still plans but spends fewer completion tokens.
How do you measure Chain-of-Draft?
FutureAGI measures it by replaying the same cohort through full chain-of-thought and Chain-of-Draft prompts, then comparing `ReasoningQuality`, `TaskCompletion`, `StepEfficiency`, `llm.token_count.completion`, p99 latency, and eval-fail-rate-by-cohort.