What Is Multi-Hop Reasoning (Eval)?
An eval metric for checking whether a model or agent correctly connects multiple evidence hops before producing a final answer.
Multi-hop reasoning is an LLM-evaluation metric that checks whether a model or agent correctly connects two or more evidence steps before producing an answer. It matters in eval pipelines, RAG traces, and tool-using agents where the final answer depends on intermediate facts, not a single retrieved chunk. In FutureAGI, the `eval:MultiHopReasoning` surface maps to the `MultiHopReasoning` evaluator, which helps teams catch bridge hallucinations, missing evidence hops, and correct-looking answers built from incomplete reasoning.
Why It Matters in Production LLM and Agent Systems
Single-hop evals can pass an answer that is locally supported but globally wrong. A RAG system might retrieve a vendor’s SOC 2 renewal and a separate EU data-residency policy, then answer that the vendor is eligible without connecting the policy condition to the vendor record. The visible symptom is a confident answer with citations, but the hidden failure mode is a broken bridge between facts.
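To make the broken bridge concrete, here is that failure written as plain data; the field names are invented for illustration, not FutureAGI output. Each context chunk is true on its own, but nothing retrieved joins the policy condition to the vendor record, so a per-sentence support check can still pass.

```python
# Hypothetical failure case, written as plain data for illustration.
# Each context chunk is individually true, but no chunk states the join
# between the policy condition and the vendor record.
failure_case = {
    "question": "Is Vendor A eligible to process EU data?",
    "contexts": [
        "Vendor A renewed its SOC 2 certification in March.",  # vendor fact
        "EU data residency requires a current SOC 2 report.",  # policy condition
    ],
    "answer": "Vendor A is eligible to process EU data.",
    # Missing bridge: nothing retrieved says the March renewal satisfies
    # the policy for Vendor A specifically, so the conclusion is unearned.
}
```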
This hurts different teams in different ways. Developers see flaky regression cases where `Groundedness` is high but the answer fails a gold label. SREs see retries, long traces, and rising token cost because the agent keeps searching for a missing intermediate fact. Product teams see users reporting “the facts are right, but the conclusion is wrong.” Compliance teams lose the ability to prove how a regulated decision moved from evidence to outcome.
The problem is sharper in 2026-era agentic systems because the reasoning path is often split across retrievers, tools, memory, and planner steps. A support agent can select the right customer, call the right billing tool, and still issue the wrong refund if it skips the policy hop that ties the tool result to the action. Logs often show this as a normal-looking `agent.trajectory.step` sequence with one absent or irrelevant retrieval span. Multi-hop reasoning evaluation gives teams a metric for that missing join, instead of treating every failure as generic hallucination.
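A minimal sketch of that check, assuming a hypothetical span schema: the `name` field and the hop names below are invented, so adapt them to however your tracing backend serializes `agent.trajectory.step` events.

```python
# Sketch: flag trajectories that skip a required hop before acting.
# The span schema and hop names here are hypothetical placeholders.
REQUIRED_HOPS = ["retrieve_customer", "call_billing_tool", "retrieve_policy"]

def missing_hops(trajectory: list[dict]) -> list[str]:
    """Return required hop names absent from a trajectory's step spans."""
    seen = {step.get("name") for step in trajectory}
    return [hop for hop in REQUIRED_HOPS if hop not in seen]

trajectory = [
    {"name": "retrieve_customer", "status": "ok"},
    {"name": "call_billing_tool", "status": "ok"},
    # No "retrieve_policy" span: the action below is unjustified.
    {"name": "issue_refund", "status": "ok"},
]
print(missing_hops(trajectory))  # ['retrieve_policy']
```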
How FutureAGI Handles Multi-Hop Reasoning
FutureAGI anchors this term at `eval:MultiHopReasoning`. In a dataset eval, an engineer passes the user question, final response, retrieved contexts, and optional expected answer into `fi.evals.MultiHopReasoning`. The metric is designed for RAG responses where the answer should depend on multiple pieces of context rather than one quoted sentence.
Consider an enterprise-search agent answering: “Can Vendor A process EU healthcare data after its latest audit?” A correct answer must connect at least three hops: the vendor identity, the audit status, and the policy that maps audit status to EU healthcare processing. FutureAGI records the eval result beside the trace, so the engineer can inspect the retriever span, the policy-tool call, and the final synthesis together. If `MultiHopReasoning` fails while `ContextRecall` passes, the retriever probably found enough evidence and the synthesis step missed the bridge. If both fail, the retrieval corpus or query rewrite needs work.
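One way to act on that split is a small triage rule. The sketch below assumes both evaluators return a numeric score in [0, 1] and uses an arbitrary 0.5 threshold; calibrate both assumptions on labeled failures.

```python
def triage(multi_hop_score: float, context_recall_score: float,
           threshold: float = 0.5) -> str:
    """Route a failing case to the component most likely at fault.

    Assumes both scores are floats in [0, 1]; the 0.5 threshold is
    arbitrary and should be calibrated on labeled failures.
    """
    if multi_hop_score >= threshold:
        return "pass"
    if context_recall_score >= threshold:
        # Evidence was retrieved but the synthesis step missed the bridge.
        return "synthesis_miss"
    # Evidence never arrived: fix the corpus or the query rewrite first.
    return "retrieval_miss"

print(triage(multi_hop_score=0.2, context_recall_score=0.9))  # synthesis_miss
```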
FutureAGI’s approach is to score the reasoning chain at the eval surface and keep the production trace next to it, so the failed hop remains tied to the component that created it. Unlike running only Ragas faithfulness, which mainly asks whether the answer is supported by supplied context, this workflow asks whether the answer uses the required sequence of evidence. Engineers usually turn the result into a release gate: alert when the multi-hop fail rate rises by dataset slice, add the failed case to a regression eval, and compare prompt or retriever changes before rollout.
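A plain-Python sketch of such a gate: the record shape and the 2% tolerance are assumptions to adapt to your own eval export, not FutureAGI output.

```python
from collections import defaultdict

def fail_rate_by_slice(results: list[dict]) -> dict[str, float]:
    """Compute multi-hop fail rate per dataset slice.

    Each result is a hypothetical record like
    {"slice": "enterprise", "passed": False}; adapt to your eval export.
    """
    totals, fails = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["slice"]] += 1
        fails[r["slice"]] += 0 if r["passed"] else 1
    return {s: fails[s] / totals[s] for s in totals}

def gate(baseline: dict[str, float], candidate: dict[str, float],
         tolerance: float = 0.02) -> bool:
    """Block rollout if any slice's fail rate rises beyond the tolerance."""
    return all(candidate.get(s, 0.0) <= baseline.get(s, 0.0) + tolerance
               for s in candidate)

old = fail_rate_by_slice([{"slice": "enterprise", "passed": True},
                          {"slice": "enterprise", "passed": False}])
new = fail_rate_by_slice([{"slice": "enterprise", "passed": False},
                          {"slice": "enterprise", "passed": False}])
print(gate(old, new))  # False: enterprise fail rate rose from 0.5 to 1.0
```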
How to Measure or Detect It
Useful signals for multi-hop reasoning are:
- `fi.evals.MultiHopReasoning` — returns a score for whether the response connects required evidence hops across the supplied context.
- `ContextRecall` and `ContextRelevance` — separate retriever failures from synthesis failures when the multi-hop score drops.
- `agent.trajectory.step` traces — show whether the agent retrieved, tool-called, or skipped the intermediate step needed for the conclusion.
- `multi-hop-fail-rate-by-cohort` dashboard — catches regressions by customer segment, prompt version, retriever version, or model route.
- User feedback proxy — track escalations where users mark the facts correct but the conclusion wrong.
Minimal Python:
```python
from fi.evals import MultiHopReasoning

# Score whether the response connects the two evidence hops:
# the vendor's renewal and the policy that requires it.
metric = MultiHopReasoning()
result = metric.evaluate(
    question="Which vendor can serve EU data after SOC 2 renewal?",
    response="Vendor A can serve EU data after renewing SOC 2.",
    contexts=["Vendor A renewed SOC 2.", "EU data requires renewed SOC 2."],
)
print(result.score)
```
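To extend that single call into a small regression eval, the sketch below loops the metric over a case list and computes a fail rate. It assumes `result.score` is a float in [0, 1] and treats 0.5 as the pass bar; both are assumptions to calibrate, not documented behavior.

```python
from fi.evals import MultiHopReasoning

# Hypothetical regression set; each case carries its bridge facts in context.
cases = [
    {
        "question": "Which vendor can serve EU data after SOC 2 renewal?",
        "response": "Vendor A can serve EU data after renewing SOC 2.",
        "contexts": ["Vendor A renewed SOC 2.",
                     "EU data requires renewed SOC 2."],
    },
    # Add failed production cases here as they are triaged.
]

metric = MultiHopReasoning()
# Assumes result.score is a float in [0, 1]; 0.5 is an arbitrary pass bar.
scores = [metric.evaluate(**case).score for case in cases]
fail_rate = sum(s < 0.5 for s in scores) / len(scores)
print(f"multi-hop fail rate: {fail_rate:.0%}")
```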
Common Mistakes
- Scoring only final-answer correctness. A correct final answer can hide a missing hop that will fail on the next similar case.
- Confusing context recall with reasoning. Retrieved evidence is necessary, but the model must still connect the evidence to the conclusion.
- Using one-hop gold questions. Multi-hop evals need gold cases with explicit bridge facts, not simple lookup questions; see the gold-case sketch after this list.
- Treating every failure as hallucination. Some failures are retriever misses, some are synthesis misses, and some are tool-order mistakes.
- Ignoring trace shape. A low score without the `agent.trajectory.step` timeline is harder to debug and easier to misroute.
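To make “explicit bridge facts” concrete, here is a hypothetical pair of gold cases in the same shape as the minimal example above; the field names are illustrative, not a FutureAGI dataset schema. Only the second case forces the evaluator to check a join between hops.

```python
# One-hop lookup: the answer is a single quoted fact, so it cannot
# exercise the bridge between evidence hops.
one_hop_case = {
    "question": "When did Vendor A renew SOC 2?",
    "contexts": ["Vendor A renewed SOC 2 in March."],
    "expected": "March",
}

# Multi-hop gold case: the expected answer is only reachable by joining
# the vendor fact (hop 1) to the policy condition (hop 2) via the
# explicit bridge fact (hop 3).
multi_hop_case = {
    "question": "Can Vendor A process EU healthcare data?",
    "contexts": [
        "Vendor A renewed SOC 2 in March.",                    # hop 1
        "EU healthcare processing requires a current SOC 2.",  # hop 2
        "A renewal within 12 months counts as current.",       # hop 3, bridge
    ],
    "expected": "Yes, because the March renewal keeps SOC 2 current.",
}
```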
Frequently Asked Questions
What is multi-hop reasoning in evals?
Multi-hop reasoning is an eval metric that checks whether a model or agent connects two or more evidence steps before answering. FutureAGI maps it to the `MultiHopReasoning` evaluator for RAG and agent workflows.
How is multi-hop reasoning different from reasoning quality?
Multi-hop reasoning checks whether the required facts are connected to support the answer. Reasoning quality scores the coherence of the model or agent's reasoning path, even when no external evidence is required.
How do you measure multi-hop reasoning?
Use FutureAGI's `fi.evals.MultiHopReasoning` with the question, response, retrieved contexts, and optional expected answer. Pair it with `ContextRecall` or `ContextRelevance` when you need to separate retriever misses from synthesis failures.