Chain of Draft Prompting in 2026: Cut Tokens 80%, Match CoT Accuracy
Chain of Draft (Xu et al. 2025) cuts reasoning tokens by ~80% while matching Chain of Thought accuracy on math, symbolic, and commonsense benchmarks.
TL;DR: Chain of Draft in One Table
| Question | Answer |
|---|---|
| Who introduced CoD? | Xu et al., February 2025, arXiv:2502.18600 |
| What it does | Forces the model to emit short, dense reasoning steps (~5 words each) instead of full CoT prose |
| Token savings | Roughly 80% fewer reasoning tokens vs CoT in the paper |
| Accuracy | Matches or beats CoT on GSM8k, symbolic, and commonsense reasoning in the paper |
| Where it wins in 2026 | General-purpose LLMs (gpt-4o, Claude Opus 4.7, Llama 4) where you control the prompt |
| Where it loses | Small models under 3B params, zero-shot, tasks needing nuanced NL reasoning |
| How to use it | System prompt: “minimum draft per step, 5 words at most” + a separator before the answer |
What Chain of Draft Actually Is
Chain of Draft (CoD) is a prompting technique introduced by Silei Xu and colleagues at Zoom Communications in arXiv:2502.18600. The core idea: instead of asking the model to think step by step in natural language (Chain of Thought), instruct it to emit only short, information-dense draft steps before the answer. The paper caps each step at 5 words; the result is roughly 80 percent fewer reasoning tokens with accuracy close to CoT, and sometimes better, on the benchmarks tested (GSM8k, symbolic reasoning, commonsense).
The paper’s evaluations cover gpt-4o and Claude 3.5 Sonnet. The technique is prompt-only, so practitioners report applying the same template to newer general-purpose LLMs; run your own A/B before assuming the paper’s numbers transfer to a different model.
The intuition is human. When a person solves a math problem on paper, they do not write “First I will subtract the number of eaten apples from the total number of apples to find how many apples are remaining.” They write “4 - 2 = 2.” CoD pushes the model toward that compressed scratchpad style.
The Paper’s Prompt, Verbatim
The system prompt template the authors use:
```
Think step by step, but only keep a minimum draft for each
thinking step, with 5 words at most. Return the answer at the
end of the response after a separator ####.
```
That is the entire intervention. No fine-tuning. No new model. No tools.
A worked example from the GSM8k category:
```
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now
Jason has 12 lollipops. How many lollipops did Jason give to
Denny?

CoD: 20 - x = 12; x = 20 - 12 = 8.
#### 8
```
The CoT version of the same answer would be three to five full sentences. The CoD version is one line plus the answer.
Why CoD Beats CoT on Cost and Latency in 2026
Output tokens on most frontier APIs are priced higher than input tokens (typically several times higher; check current vendor pricing). For a fixed input, a 200-token CoT trace costs more than a 30-token CoD trace. Latency follows the same shape because output decoding is sequential.
The economics can get sharper at scale. On reasoning workloads where the CoD paper’s accuracy held, a production agent running CoD with self-consistency at N=5 may still spend fewer output tokens than a single-shot CoT call and may recover some accuracy through majority vote. The exact tradeoff is workload-specific (task, model, prompt and answer length), so measure on your own data.
This is not magic; it is the same model emitting fewer tokens. The win comes from the prompt forcing the model to skip the natural-language padding that CoT generates by default.
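The arithmetic above is worth making concrete. The sketch below compares a ~200-token CoT trace against five ~30-token CoD samples; the per-token price is a placeholder, not any vendor's real rate, so substitute your provider's current pricing:

```python
# Back-of-envelope output-cost comparison for CoT vs CoD with self-consistency.
# PRICE_PER_OUTPUT_TOKEN is a hypothetical rate, not real vendor pricing.
PRICE_PER_OUTPUT_TOKEN = 10.00 / 1_000_000  # e.g. $10 per 1M output tokens

def output_cost(reasoning_tokens: int, n_samples: int = 1) -> float:
    """Cost attributable to output tokens across n_samples completions."""
    return reasoning_tokens * n_samples * PRICE_PER_OUTPUT_TOKEN

cot_single = output_cost(200)            # one CoT trace, ~200 output tokens
cod_sc5 = output_cost(30, n_samples=5)   # five CoD samples, ~30 tokens each

# Even at N=5, the CoD ensemble emits 150 output tokens vs 200 for one CoT call.
print(f"CoT x1: ${cot_single:.6f}")
print(f"CoD x5: ${cod_sc5:.6f}")
```

The same comparison holds for latency only if the CoD samples run in parallel; sequential sampling trades the latency win for the accuracy of majority vote.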
Implementing Chain of Draft in Your Stack
The full recipe is a system prompt plus a few lines of glue code. Here is a minimal LangChain implementation:

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

COD_SYSTEM = (
    "Think step by step, but only keep a minimum draft for each "
    "thinking step, with 5 words at most. Return the answer at the "
    "end of the response after a separator ####."
)

prompt = ChatPromptTemplate.from_messages([
    ("system", COD_SYSTEM),
    ("human", "{question}"),
])

llm = ChatOpenAI(model="gpt-4o", temperature=0)
chain = prompt | llm

response = chain.invoke({"question": "Jason had 20 lollipops..."})
text = response.content
draft, _, answer = text.rpartition("####")
print({"draft": draft.strip(), "answer": answer.strip()})
```
For self-consistency at N=5, instantiate a separate sampling chain at temperature 0.7:

```python
import collections

sampling_llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
sampling_chain = prompt | sampling_llm

q = "Jason had 20 lollipops..."
samples = [sampling_chain.invoke({"question": q}) for _ in range(5)]
answers = [s.content.rpartition("####")[-1].strip() for s in samples]
final = collections.Counter(answers).most_common(1)[0][0]
```
Parse the separator deterministically. Do not let the model decide its own delimiter.
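A defensive parser for that separator might look like the sketch below. The fallback behavior when the model omits `####` is my choice, not something the paper specifies:

```python
def parse_cod(text: str, sep: str = "####") -> tuple[str, str]:
    """Split a CoD completion into (draft, answer) on the LAST separator.

    rpartition handles the case where '####' appears inside the draft.
    If the model omitted the separator entirely, treat the whole
    completion as the answer rather than guessing at a split point.
    """
    draft, found, answer = text.rpartition(sep)
    if not found:
        return "", text.strip()
    return draft.strip(), answer.strip()


draft, answer = parse_cod("20 - x = 12; x = 20 - 12 = 8.\n#### 8")
# draft == "20 - x = 12; x = 20 - 12 = 8.", answer == "8"
```

In production, log the no-separator case: a rising rate usually means the prompt drifted or the model changed underneath you.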
Evaluating CoD on Your Workload
The paper’s accuracy numbers are on public benchmarks. Your task distribution is different. Before swapping CoT for CoD in production, run an A/B:
- Hold out 100 to 500 examples from your task distribution.
- Run the same model with CoT and CoD prompts.
- Score with task-specific metrics (exact match, ROUGE, programmatic checks) plus an LLM-as-judge on reasoning quality.
- Track average output tokens, p95 latency, and accuracy.
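The steps above reduce to a small harness. Everything here is illustrative scaffolding: `call_model` and `exact_match` are placeholders you plug your own client and metric into:

```python
import statistics
from dataclasses import dataclass

@dataclass
class ABResult:
    accuracy: float
    avg_output_tokens: float

def run_ab(examples, call_model, exact_match) -> ABResult:
    """Run one prompt variant (CoT or CoD) over held-out examples.

    call_model(question) -> (answer_text, output_token_count)
    exact_match(pred, gold) -> bool
    """
    hits, tokens = [], []
    for question, gold in examples:
        answer, n_tokens = call_model(question)
        hits.append(exact_match(answer, gold))
        tokens.append(n_tokens)
    return ABResult(
        accuracy=sum(hits) / len(hits),
        avg_output_tokens=statistics.mean(tokens),
    )

# Usage: run the same examples through a CoT-prompted call_model and a
# CoD-prompted one, then compare accuracy against token savings.
```

Keep the example set fixed between the two runs; otherwise the comparison confounds prompt effects with sampling noise.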
A reproducible evaluation pattern with FAGI’s ai-evaluation library:
```python
import os
from fi.evals import evaluate

os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

question = "..."
gold_answer = "8"
cot_output = "..."  # full CoT response from the model
cod_output = "..."  # CoD response from the model

cot_score = evaluate(
    "answer_correctness",
    output=cot_output,
    context=gold_answer,
)
cod_score = evaluate(
    "answer_correctness",
    output=cod_output,
    context=gold_answer,
)
```
For span-level traces (so you can see prompt, intermediate reasoning, and final answer attached to a single trace ID), use traceAI (Apache 2.0) to instrument the OpenAI, Anthropic, or LangChain client. The trace tree shows token counts on each span, which is exactly the comparison you need.
When Chain of Draft Loses
Three failure modes called out in the paper:
- Zero-shot underperformance. Without few-shot examples or a system instruction, the model often ignores the brevity constraint. Always provide either a system prompt or two CoD demonstrations.
- Small models lose accuracy. Models under roughly 3B parameters drop more accuracy with CoD than larger ones, likely because compressed chains skip steps the small model needs.
- Tasks needing nuance. Legal opinions, policy analysis, open-ended writing, and creative synthesis still benefit from CoT or extended thinking. CoD is for reasoning where the intermediate steps have a deterministic shape.
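The zero-shot failure in the first bullet is usually fixed by prepending demonstrations. A sketch of a few-shot message builder, using the paper's system prompt and lollipop example; the second demonstration is an invented illustration, not from the paper:

```python
COD_SYSTEM = (
    "Think step by step, but only keep a minimum draft for each "
    "thinking step, with 5 words at most. Return the answer at the "
    "end of the response after a separator ####."
)

# Two demonstrations in the compressed CoD style. The first is the
# paper's GSM8k lollipop example; the second is a made-up analogue.
FEW_SHOT = [
    ("Jason had 20 lollipops. He gave Denny some lollipops. Now Jason "
     "has 12 lollipops. How many lollipops did Jason give to Denny?",
     "20 - x = 12; x = 20 - 12 = 8.\n#### 8"),
    ("A train travels 60 miles in 1.5 hours. What is its speed in mph?",
     "60 / 1.5 = 40.\n#### 40"),
]

def build_messages(question: str) -> list[dict]:
    """Assemble a chat payload: system prompt, demos, then the real question."""
    messages = [{"role": "system", "content": COD_SYSTEM}]
    for q, a in FEW_SHOT:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": question})
    return messages
```

For sub-3B models, even this scaffolding may not recover the accuracy gap; there CoT remains the safer default.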
On dedicated reasoning models like OpenAI o3, DeepSeek-R1, and Gemini 2.5 thinking, CoD has less effect because the model is already running its own compressed internal chain. CoD shines on general-purpose LLMs where you, the developer, control the prompt.
Where Future AGI Fits in a CoD Pipeline
FAGI is the evaluation companion for any reasoning prompt change. The full pattern:
- Score CoD vs CoT outputs with `fi.evals.evaluate` on faithfulness, answer correctness, and any custom judge.
- Trace every prompt variant with traceAI OpenTelemetry instrumentation; per-span token counts and latency are exactly the diff you need.
- Compare on the same prompt set with replay so the A/B is apples to apples.
For more on reasoning evaluation, see LLM Evaluation Frameworks, Metrics, and Best Practices, Top 5 LLM Evaluation Tools 2025, and LLM Prompts Best Practices 2025.
References
- Xu, S., Xie, W., Zhao, L., He, P. Chain of Draft: Thinking Faster by Writing Less. arXiv:2502.18600, February 2025. https://arxiv.org/abs/2502.18600
- Wei, J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903
- Wang, X. et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023. arXiv:2203.11171