What Is Skeleton-of-Thought?

A prompting pattern that asks an LLM for a compact outline before expanding each point into the final answer.

Skeleton-of-Thought (SoT) is a prompt pattern that asks an LLM to first produce a compact answer skeleton, then expand each skeleton point into the final response. It belongs to the prompt-engineering family and appears in production traces as a two-phase generation plan: outline first, expansion second. Teams use it when long answers need lower perceived latency, clearer structure, or parallel expansion. In FutureAGI, SoT is evaluated like any prompt variant, using task-quality, latency, and token-cost signals before rollout.
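The two-phase plan can be sketched in a few lines. This is a minimal illustration, not FutureAGI code: `call_llm` is a stand-in stub for any chat-completion client, and the prompts are placeholders.

```python
def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call a model here.
    if prompt.startswith("Outline"):
        return "1. Define the term\n2. Give one example"
    return f"Expanded: {prompt.splitlines()[-1]}"

def skeleton_of_thought(question: str) -> str:
    # Phase 1: ask for a compact skeleton (short numbered bullets).
    skeleton = call_llm(f"Outline a short answer to: {question}")
    points = [p.strip() for p in skeleton.splitlines() if p.strip()]
    # Phase 2: expand each skeleton point into a full section.
    sections = [call_llm(f"Expand this outline point fully:\n{p}") for p in points]
    return "\n\n".join(sections)

answer = skeleton_of_thought("What is Skeleton-of-Thought?")
```

In production the two phases become two traced spans per point, which is what makes the pattern observable in the first place.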

Why Skeleton-of-Thought matters in production LLM and agent systems

Skeleton-of-Thought changes the failure shape of long-form generation. Instead of one model call producing one answer, the system now has an outline step that can constrain every later expansion. If the skeleton omits a requirement, selects the wrong ordering, or frames the answer around a false assumption, the expansion stage can make the mistake look deliberate and polished. That is outline anchoring: the first compact plan becomes the path the rest of the response follows.

The second production failure mode is expansion inconsistency. Many SoT implementations expand skeleton bullets independently to reduce wall-clock latency. That can work, but without a merge check the sections may repeat claims, contradict each other, cite different assumptions, or vary tone across one answer. Developers feel this as harder prompt debugging because the bug may live in the skeleton prompt, one expansion prompt, or the coordinator. SREs see extra spans, wider p99 latency variance, and higher completion-token volume. Product teams see complaints that answers look organized but miss the user’s real goal.
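A merge pass after parallel expansion can be as simple as a consistency predicate over the finished sections. The sketch below assumes a stubbed `expand` function and a caller-supplied `merge_check`; a real check might also compare assumptions and tone, not just verbatim repeats.

```python
from concurrent.futures import ThreadPoolExecutor

def expand(point: str) -> str:
    # Stub expansion; a real system calls the model once per bullet.
    return f"Section on {point}."

def expand_parallel(points, merge_check):
    # Expand skeleton bullets concurrently to cut wall-clock latency.
    with ThreadPoolExecutor() as pool:
        sections = list(pool.map(expand, points))  # map preserves order
    # Merge pass: reject drafts that fail the consistency check.
    if not merge_check(sections):
        raise ValueError("merge check failed: inconsistent sections")
    return "\n\n".join(sections)

def no_duplicates(sections):
    # Minimal consistency rule: no two sections are identical.
    return len(set(sections)) == len(sections)

draft = expand_parallel(["latency", "cost"], no_duplicates)
```

Keeping the merge check as its own step also gives the coordinator a natural span to trace and blame when sections disagree.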

Unlike Chain-of-Thought, which exposes intermediate reasoning, Skeleton-of-Thought primarily controls answer structure and generation strategy. That distinction matters in 2026 agent pipelines: an agent may use SoT to draft a research memo, plan a tool sequence, or stream a report section by section. The trace should show whether the skeleton improved quality enough to justify the extra calls.

How FutureAGI handles Skeleton-of-Thought

FutureAGI treats Skeleton-of-Thought as an optimization candidate, not an automatic quality improvement. The relevant optimizer surface is optimizer:PromptWizardOptimizer, the agent-opt class built for multi-stage prompt pipelines. It can mutate, critique, and refine prompts across rounds, which fits SoT because the pattern has at least two prompt surfaces: “make the skeleton” and “expand this point.”

A practical workflow starts with a baseline long-answer prompt stored through fi.prompt.Prompt, an eval dataset, and a control route that uses the current single-call prompt. The engineer creates a SoT candidate with a fixed skeleton schema: bullet id, claim, source requirement, and expansion instruction. PromptWizardOptimizer proposes variants for the skeleton and expansion prompts, then FutureAGI evaluates each candidate with PromptAdherence, TaskCompletion, token cost, and latency. The exact signals to inspect are the prompt version, eval score by dataset row, llm.token_count.prompt, llm.token_count.completion, and p99 end-to-end latency for the SoT trace.
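The fixed skeleton schema from the workflow above can be pinned down as a small data type with a validator. The field names and the bullet cap below are illustrative choices, not an SDK contract.

```python
from dataclasses import dataclass

@dataclass
class SkeletonBullet:
    # Fields mirror the fixed schema described above; names are illustrative.
    bullet_id: str
    claim: str
    source_requirement: str      # e.g. "must cite a retrieved doc"
    expansion_instruction: str   # per-bullet prompt for the expansion stage

def validate_skeleton(bullets, max_bullets=6):
    # Reject skeletons that exceed the cap or leave required fields empty.
    if len(bullets) > max_bullets:
        return False
    return all(b.claim and b.expansion_instruction for b in bullets)

ok = validate_skeleton([SkeletonBullet("b1", "SoT is a prompt pattern",
                                       "none", "Define SoT in two sentences")])
```

Validating the skeleton before expansion is what lets PromptAdherence failures point at the schema rather than at twelve different expansions.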

If the candidate raises TaskCompletion but doubles p99 latency, the engineer can cap skeleton length, expand fewer bullets in parallel, or reject the variant. If PromptAdherence fails, the skeleton schema is too loose. Unlike Tree-of-Thoughts, which searches multiple reasoning paths, SoT is usually a single structure-first path. FutureAGI keeps that distinction visible so teams do not confuse cleaner formatting with better reasoning.
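The accept/reject decision above can be encoded as an explicit rollout gate. This is a sketch under stated assumptions: the metric dicts are illustrative aggregates per cohort, and the 1.5x p99 budget is an arbitrary example threshold, not a FutureAGI default.

```python
def accept_candidate(candidate, control, max_p99_ratio=1.5):
    # Gate: the SoT candidate must improve TaskCompletion without
    # blowing the p99 latency budget relative to the control prompt.
    better_quality = candidate["task_completion"] > control["task_completion"]
    within_budget = candidate["p99_ms"] <= control["p99_ms"] * max_p99_ratio
    return better_quality and within_budget

decision = accept_candidate(
    {"task_completion": 0.84, "p99_ms": 2600},
    {"task_completion": 0.79, "p99_ms": 2000},
)
```

A gate like this makes "raises TaskCompletion but doubles p99" an automatic rejection instead of a judgment call made at rollout time.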

How to measure or detect Skeleton-of-Thought

Measure Skeleton-of-Thought by comparing a SoT cohort with a control prompt on the same dataset and traffic slice.

  • PromptAdherence: checks whether the final response followed the skeleton contract, section labels, and required constraints.
  • TaskCompletion: checks whether the structured answer still completed the user’s goal, not just whether it looked organized.
  • Trace shape: expect one skeleton span plus N expansion spans; missing or repeated expansion spans indicate coordinator bugs.
  • Token and latency metrics: track llm.token_count.prompt, llm.token_count.completion, p99 latency, and token-cost-per-trace.
  • User-feedback proxy: compare thumbs-down rate and escalation rate for SoT traffic against the control prompt.
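The trace-shape check from the list above is mechanical enough to automate. The span dicts below are a simplified stand-in for real trace spans; only the `kind` field matters for this check.

```python
def check_trace_shape(spans, expected_points):
    # A healthy SoT trace has exactly one skeleton span and one
    # expansion span per skeleton point; anything else suggests a
    # coordinator bug (dropped or duplicated expansions).
    skeletons = [s for s in spans if s["kind"] == "skeleton"]
    expansions = [s for s in spans if s["kind"] == "expansion"]
    return len(skeletons) == 1 and len(expansions) == expected_points

trace = [{"kind": "skeleton"}, {"kind": "expansion"}, {"kind": "expansion"}]
shape_ok = check_trace_shape(trace, expected_points=2)
```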

Do not let the aggregate score hide the split between structure and usefulness. A good dashboard keeps rows for skeleton prompt version, expansion prompt version, model, dataset slice, route, and evaluator result. Alert when SoT beats the control on PromptAdherence but loses on TaskCompletion or cost. That usually means the outline improved readability, not reliability.
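The alert rule above ("adherence up, usefulness or cost down") can be written directly against cohort aggregates. Metric names here are illustrative; wire them to whatever your dashboard already exports per cohort.

```python
def sot_alerts(sot, control):
    # Fire when SoT wins on structure (PromptAdherence) but loses on
    # usefulness (TaskCompletion) or on token cost per trace.
    alerts = []
    if sot["prompt_adherence"] > control["prompt_adherence"]:
        if sot["task_completion"] < control["task_completion"]:
            alerts.append("adherence up, task completion down")
        if sot["tokens_per_trace"] > control["tokens_per_trace"]:
            alerts.append("adherence up, token cost up")
    return alerts

fired = sot_alerts(
    {"prompt_adherence": 0.92, "task_completion": 0.70, "tokens_per_trace": 3100},
    {"prompt_adherence": 0.85, "task_completion": 0.78, "tokens_per_trace": 2400},
)
```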

A spot check with the evaluator looks like this (signature as shown; verify the exact interface against the current fi.evals docs):

from fi.evals import PromptAdherence

# Score how closely the final answer followed the SoT prompt contract.
evaluator = PromptAdherence()
result = evaluator.evaluate(
    input=user_request,   # the original user question
    output=final_answer,  # the merged SoT response
    prompt=sot_prompt,    # the skeleton + expansion prompt under test
)
print(result.score)

Common mistakes with Skeleton-of-Thought

Most SoT incidents come from treating the outline as harmless scaffolding. It is actually a production decision point that shapes the final answer.

  • Treating the skeleton as ground truth; if the outline is wrong, every expansion reinforces the wrong frame.
  • Expanding bullets in parallel without a merge pass; sections may repeat facts, contradict policy, or use different assumptions.
  • Scoring only the final answer; SoT can hide an invalid outline behind polished prose.
  • Using SoT for short answers; extra calls and tokens can lose to a direct prompt.
  • Letting the model choose arbitrary skeleton length; cap bullets and require stable section labels.

A useful review asks whether each skeleton field has an owner: prompt, coordinator, evaluator, or UI. If nobody owns the field, it will drift when the prompt changes.

Frequently Asked Questions

What is Skeleton-of-Thought?

Skeleton-of-Thought is a prompt pattern where an LLM first creates a compact outline, then expands each point into the final response. It is mainly used for long answers, agent plans, and structured generation.

How is Skeleton-of-Thought different from chain-of-thought?

Chain-of-thought asks for intermediate reasoning. Skeleton-of-Thought asks for the structure of the final answer first, so expansion can be checked, streamed, or run in parallel.

How do you measure Skeleton-of-Thought?

Use FutureAGI PromptAdherence and TaskCompletion on SoT and non-SoT prompt variants, then compare `llm.token_count.prompt`, `llm.token_count.completion`, p99 latency, and eval-fail-rate-by-cohort.