Chain of Draft Prompting in 2026: Cut Tokens 80%, Match CoT Accuracy

Chain of Draft (Xu et al. 2025) cuts reasoning tokens by ~80% while matching Chain of Thought accuracy on math, symbolic, and commonsense benchmarks.

TL;DR: Chain of Draft in One Table

Question                 Answer
Who introduced CoD?      Xu et al., February 2025, arXiv:2502.18600
What it does             Forces the model to emit short, dense reasoning steps (~5 words each) instead of full CoT prose
Token savings            Roughly 80% fewer reasoning tokens vs CoT in the paper
Accuracy                 Matches or beats CoT on GSM8k, symbolic, and commonsense reasoning in the paper
Where it wins in 2026    General-purpose LLMs (gpt-4o, Claude Opus 4.7, Llama 4) where you control the prompt
Where it loses           Small models under 3B params, zero-shot, tasks needing nuanced NL reasoning
How to use it            System prompt: “minimum draft per step, 5 words at most” + a separator before the answer

What Chain of Draft Actually Is

Chain of Draft (CoD) is a prompting technique introduced by Silei Xu and colleagues at Zoom Communications in arXiv:2502.18600. The core idea: instead of asking the model to think step by step in natural language (Chain of Thought), instruct it to emit only short, information-dense draft steps before the answer. The paper caps each step at 5 words; the result is roughly 80 percent fewer reasoning tokens with accuracy close to CoT, and sometimes better, on the benchmarks tested (GSM8k, symbolic reasoning, commonsense).

The paper’s evaluations cover gpt-4o and Claude 3.5 Sonnet. The technique is prompt-only, so practitioners report applying the same template to newer general-purpose LLMs; run your own A/B before assuming the paper’s numbers transfer to a different model.

The intuition is human. When a person solves a math problem on paper, they do not write “First I will subtract the number of eaten apples from the total number of apples to find how many apples are remaining.” They write “4 - 2 = 2.” CoD pushes the model toward that compressed scratchpad style.

The Paper’s Prompt, Verbatim

The system prompt template the authors use:

Think step by step, but only keep a minimum draft for each
thinking step, with 5 words at most. Return the answer at the
end of the response after a separator ####.

That is the entire intervention. No fine-tuning. No new model. No tools.

A worked example from the GSM8k category:

Q: Jason had 20 lollipops. He gave Denny some lollipops. Now
Jason has 12 lollipops. How many lollipops did Jason give to
Denny?

CoD: 20 - x = 12; x = 20 - 12 = 8.
#### 8

The CoT version of the same answer would be three to five full sentences. The CoD version is one line plus the answer.
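For contrast, a plausible CoT trace for the same question (illustrative, not copied from the paper):

CoT: Jason started with 20 lollipops and now has 12. The number
given to Denny is the difference between the two counts, so
20 - 12 = 8. Jason gave Denny 8 lollipops.
#### 8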

Why CoD Beats CoT on Cost and Latency in 2026

Output tokens on most frontier APIs are priced higher than input tokens (typically several times higher; check current vendor pricing). For a fixed input, a 200-token CoT trace costs more than a 30-token CoD trace. Latency follows the same shape because output decoding is sequential.

The economics can get sharper at scale. On reasoning workloads where the CoD paper’s accuracy held, a production agent running CoD with self-consistency at N=5 may still spend fewer output tokens than a single-shot CoT call and may recover some accuracy through majority vote. The exact tradeoff is workload-specific (task, model, prompt and answer length), so measure on your own data.
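As a back-of-envelope check, the comparison fits in a few lines of Python. The token counts mirror the worked example above; the price is a placeholder, not any vendor's actual rate:

PRICE_PER_1M_OUTPUT = 10.00  # USD; placeholder, check current vendor pricing
COT_TOKENS = 200             # typical verbose CoT trace
COD_TOKENS = 30              # typical compressed CoD trace
N = 5                        # self-consistency samples

cot_cost = COT_TOKENS / 1e6 * PRICE_PER_1M_OUTPUT
cod_sc_cost = N * COD_TOKENS / 1e6 * PRICE_PER_1M_OUTPUT
print(f"CoT single-shot: ${cot_cost:.6f}, CoD x{N}: ${cod_sc_cost:.6f}")
# Five CoD samples (150 tokens) still undercut one CoT trace (200 tokens).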

This is not magic; it is the same model emitting fewer tokens. The win comes from the prompt forcing the model to skip the natural-language padding that CoT generates by default.

Implementing Chain of Draft in Your Stack

The intervention itself is a single system string; everything else is plumbing. Here is a minimal LangChain implementation:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# The paper's system prompt, verbatim.
COD_SYSTEM = (
    "Think step by step, but only keep a minimum draft for each "
    "thinking step, with 5 words at most. Return the answer at the "
    "end of the response after a separator ####."
)

prompt = ChatPromptTemplate.from_messages([
    ("system", COD_SYSTEM),
    ("human", "{question}"),
])

llm = ChatOpenAI(model="gpt-4o", temperature=0)  # deterministic single shot
chain = prompt | llm

response = chain.invoke({"question": "Jason had 20 lollipops..."})
text = response.content
# rpartition splits on the last ####, in case a draft step emits one early.
draft, _, answer = text.rpartition("####")
print({"draft": draft.strip(), "answer": answer.strip()})

For self-consistency at N=5, instantiate a separate sampling chain at temperature 0.7:

import collections

# Sample at temperature 0.7 so the five drafts actually diverge.
sampling_llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
sampling_chain = prompt | sampling_llm

q = "Jason had 20 lollipops..."
samples = [sampling_chain.invoke({"question": q}) for _ in range(5)]
answers = [s.content.rpartition("####")[-1].strip() for s in samples]
final = collections.Counter(answers).most_common(1)[0][0]  # majority vote

Parse the separator deterministically. Do not let the model decide its own delimiter.
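A defensive version of that parse, as a sketch (the fallback policy is an assumption; adapt it to your pipeline):

def parse_cod(text: str, sep: str = "####") -> tuple[str, str]:
    """Split a CoD response into (draft, answer)."""
    draft, found, answer = text.rpartition(sep)
    if not found:
        # The model ignored the separator: return an empty draft and the
        # whole output as the answer so downstream code can flag it.
        return "", text.strip()
    return draft.strip(), answer.strip()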

Evaluating CoD on Your Workload

The paper’s accuracy numbers are on public benchmarks. Your task distribution is different. Before swapping CoT for CoD in production, run an A/B:

  1. Hold out 100 to 500 examples from your task distribution.
  2. Run the same model with CoT and CoD prompts.
  3. Score with task-specific metrics (exact match, ROUGE, programmatic checks) plus an LLM-as-judge on reasoning quality.
  4. Track average output tokens, p95 latency, and accuracy.

A reproducible evaluation pattern with FAGI’s ai-evaluation library:

import os
from fi.evals import evaluate

os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

# Placeholders: one held-out example, scored under both prompts.
question = "..."
gold_answer = "8"
cot_output = "..."   # full CoT response from the model
cod_output = "..."   # CoD response from the model

cot_score = evaluate(
    "answer_correctness",
    output=cot_output,
    context=gold_answer,
)
cod_score = evaluate(
    "answer_correctness",
    output=cod_output,
    context=gold_answer,
)
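Step 4's bookkeeping needs no eval library at all. A minimal sketch of the aggregation, assuming you record one dict per example and variant during the A/B run (the records below are illustrative):

import statistics

# Illustrative records; in practice, append one per example and variant.
runs = [
    {"variant": "cot", "tokens": 212, "latency_s": 3.1, "correct": True},
    {"variant": "cod", "tokens": 34, "latency_s": 0.9, "correct": True},
]

def summarize(variant: str) -> dict:
    rows = [r for r in runs if r["variant"] == variant]
    latencies = sorted(r["latency_s"] for r in rows)
    return {
        "avg_tokens": statistics.mean(r["tokens"] for r in rows),
        "p95_latency_s": latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))],
        "accuracy": sum(r["correct"] for r in rows) / len(rows),
    }

print(summarize("cot"), summarize("cod"))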

For span-level traces (so you can see prompt, intermediate reasoning, and final answer attached to a single trace ID), use traceAI (Apache 2.0) to instrument the OpenAI, Anthropic, or LangChain client. The trace tree shows token counts on each span, which is exactly the comparison you need.

When Chain of Draft Loses

Three failure modes called out in the paper:

  1. Zero-shot underperformance. Without few-shot examples or a system instruction, the model often ignores the brevity constraint. Always provide either a system prompt or two CoD demonstrations (see the sketch after this list).
  2. Small models lose accuracy. Models under roughly 3B parameters drop more accuracy with CoD than larger ones, likely because compressed chains skip steps the small model needs.
  3. Tasks needing nuance. Legal opinions, policy analysis, open-ended writing, and creative synthesis still benefit from CoT or extended thinking. CoD is for reasoning where the intermediate steps have a deterministic shape.
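A minimal few-shot framing in the OpenAI chat-message format (the demonstration is ours, not from the paper; COD_SYSTEM is the string defined earlier):

FEW_SHOT_COD = [
    {"role": "system", "content": COD_SYSTEM},
    # One worked demonstration teaches the brevity constraint by example.
    {"role": "user", "content": "A tray holds 4 apples; 2 are eaten. How many remain?"},
    {"role": "assistant", "content": "4 - 2 = 2.\n#### 2"},
    {"role": "user", "content": "<your question>"},
]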

On dedicated reasoning models like OpenAI o3, DeepSeek-R1, and Gemini 2.5 thinking, CoD has less effect because the model is already running its own compressed internal chain. CoD shines on general-purpose LLMs where you, the developer, control the prompt.

Where Future AGI Fits in a CoD Pipeline

FAGI is the evaluation companion for any reasoning prompt change. The full pattern:

  • Score CoD vs CoT outputs with fi.evals.evaluate on faithfulness, answer correctness, and any custom judge.
  • Trace every prompt variant with traceAI OpenTelemetry instrumentation; per-span token counts and latency are exactly the diff you need.
  • Compare on the same prompt set with replay so the A/B is apples to apples.

For more on reasoning evaluation, see LLM Evaluation Frameworks, Metrics, and Best Practices, Top 5 LLM Evaluation Tools 2025, and LLM Prompts Best Practices 2025.

References

  1. Xu, S., Xie, W., Zhao, L., He, P. Chain of Draft: Thinking Faster by Writing Less. arXiv:2502.18600, February 2025. https://arxiv.org/abs/2502.18600
  2. Wei, J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903
  3. Wang, X. et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023. arXiv:2203.11171

Frequently Asked Questions

What is Chain of Draft (CoD)?
Chain of Draft is a prompting technique introduced by Xu et al. in February 2025 (arXiv:2502.18600). Instead of asking the LLM to think step by step in full prose, CoD instructs the model to emit only short, dense draft steps (often capped at 5 words per step) before the final answer. The paper reports it matches or beats Chain of Thought on math (GSM8k), symbolic reasoning, and commonsense benchmarks while cutting reasoning tokens by roughly 80 percent on gpt-4o and Claude 3.5 Sonnet, with corresponding latency and cost savings.
How does CoD differ from Chain of Thought (CoT)?
CoT prompts the model to explain its reasoning in natural language sentences. CoD prompts the model to emit only the minimum tokens needed per step. A CoT trace for an arithmetic problem might be 200 tokens; the CoD trace for the same problem might be 30. Accuracy stays close to CoT on the original paper's benchmarks. The intuition is that humans solving math do not write full sentences either; they write the next number, the next operation, then the answer.
When should I use Chain of Draft instead of Chain of Thought?
Use CoD when you have a clear reasoning task with deterministic intermediate steps (math, symbolic logic, multi-hop retrieval, schema extraction), when you are paying per output token, and when end-to-end latency matters. Stick with CoT or extended thinking when the task needs nuanced natural language reasoning, when explainability for end users matters, or when the model is small (under 3B parameters) and benefits from longer chains.
Does Chain of Draft work on reasoning models like o3 or DeepSeek-R1?
On dedicated reasoning models that already emit hidden chains of thought (OpenAI o-series, DeepSeek-R1, Gemini thinking modes), CoD has less to do because the model is already using compressed internal reasoning. Where CoD shines is on general-purpose chat models where you control the visible prompt and want CoT-level accuracy at a fraction of the output cost; the paper's evaluations are on gpt-4o and Claude 3.5 Sonnet, so run an A/B on your own model before assuming the result transfers.
What is the example CoD prompt from the paper?
The paper uses a system instruction along the lines of: 'Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator ####.' The model then emits compressed steps like '4 apples, 2 eaten, 2 left' instead of full sentences. The separator pattern makes parsing the final answer trivial.
How do I evaluate whether CoD is actually working for my task?
Run an A/B eval. Hold out 100 to 500 examples from your task distribution, score CoT and CoD outputs with task-specific metrics (exact match, ROUGE, programmatic checks) plus an LLM-as-judge on reasoning quality, and compare average output tokens, latency, and accuracy. FAGI's ai-evaluation library and traceAI tracing make this comparison reproducible at the span level.
What are the limitations of Chain of Draft?
Three limitations from the paper and follow-up work. First, CoD underperforms on zero-shot prompting; few-shot or system-instruction examples are needed. Second, very small models (under 3B parameters) lose accuracy because their compressed chains skip too much. Third, on tasks that genuinely require long natural-language reasoning (legal analysis, open-ended philosophy), CoT or extended thinking still wins.
Can I combine CoD with self-consistency or majority vote?
Yes. Sampling N CoD traces at temperature 0.7 and taking the majority answer is the CoD analog of CoT self-consistency. Because each CoD trace is roughly 1/5 the tokens, you can afford a higher N within the same cost budget, which may recover some of the accuracy gap on hard math benchmarks. A practical option to test in cost-sensitive reasoning workloads is CoD with N=5 majority vote; measure on your data.