What Is Formal Logic?
The mathematical study of inference where arguments are evaluated by their structure, including propositional, first-order, modal, and higher-order systems.
Formal logic is the branch of logic that studies inference using a fixed grammar of symbols, well-defined semantics, and explicit rules of derivation. The major systems are propositional logic (whole statements with connectives ∧, ∨, ¬, →), first-order logic (predicates, variables, quantifiers ∀, ∃), modal logic (necessity □, possibility ◇), and higher-order logic (quantifying over predicates and functions). It powers automated theorem proving, programming-language semantics, knowledge representation, and software verification. For production LLM systems, FutureAGI uses that vocabulary to separate valid reasoning from fluent but unsupported text.
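The quantifier and modal vocabulary above maps directly onto a few lines of Python: over a finite domain, `all()` plays ∀ and `any()` plays ∃, and the same pattern reads □ and ◇ over a set of accessible worlds. A minimal sketch; the domain, predicate, and world names are illustrative.

```python
# First-order quantifiers over a finite domain:
# all() is ∀, any() is ∃.
domain = [1, 2, 3, 4]

def even(x):
    return x % 2 == 0

forall_even = all(even(x) for x in domain)  # ∀x. even(x) -- fails at x=1
exists_even = any(even(x) for x in domain)  # ∃x. even(x) -- holds at x=2

# Modal operators read the same way over accessible worlds:
# □p is all(), ◇p is any().
worlds = {"w1": {"raining": True}, "w2": {"raining": False}}
necessarily_raining = all(w["raining"] for w in worlds.values())  # □ -- False
possibly_raining = any(w["raining"] for w in worlds.values())     # ◇ -- True

print(forall_even, exists_even, necessarily_raining, possibly_raining)
```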
Why Formal Logic Matters in Production LLM and Agent Systems
LLMs produce text that looks like reasoning. Whether the text actually is reasoning is a different question, and formal logic is how engineers can ask it precisely. A chain-of-thought may use the right words while smuggling in invalid inferences: affirming the consequent, denying the antecedent, equivocating on a quantifier, or treating a counter-example as a refutation when it is actually a special case. Without a formal-logic vocabulary, debugging “the agent sounds confident but the answer is wrong” is guesswork.
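Those fallacies are mechanically checkable: an argument form is valid iff no truth assignment makes every premise true and the conclusion false. A minimal brute-force sketch over two atoms, with argument forms encoded as lambdas, shows why modus ponens passes while affirming the consequent fails.

```python
from itertools import product

def is_valid(premises, conclusion):
    """Valid iff no assignment makes all premises true and the
    conclusion false."""
    for p, q in product([True, False], repeat=2):
        if all(f(p, q) for f in premises) and not conclusion(p, q):
            return False
    return True

def implies(a, b):
    return (not a) or b

# Modus ponens: P -> Q, P, therefore Q (valid)
mp = is_valid([lambda p, q: implies(p, q), lambda p, q: p],
              lambda p, q: q)

# Affirming the consequent: P -> Q, Q, therefore P (invalid:
# the assignment P=False, Q=True is a counter-model)
ac = is_valid([lambda p, q: implies(p, q), lambda p, q: q],
              lambda p, q: p)

print(mp, ac)  # True False
```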
The pain shows up wherever the cost of an invalid conclusion is high. A legal agent reads a contract clause and confidently states an obligation that does not actually follow from the text. A compliance bot answers a regulatory question by paraphrasing a “may” as a “must.” A planning agent infers preconditions for a tool call from premises that were never established. Each is a logic failure. SREs see them as user complaints; product engineers see them as one-off “the model was wrong” reports; compliance leads see them as audit risk.
Unlike BLEU or exact-match accuracy, formal-logic review asks whether the conclusion follows from the premises, even when the wording changes. In 2026 stacks pairing LLMs with retrieval, structured knowledge bases, and multi-step planners, formal-logic competence — or at least, evaluator coverage of logical validity — is increasingly the difference between a demo that wins a deal and an agent that ships.
How FutureAGI Handles Logical Validity
FutureAGI does not embed a theorem prover; we evaluate the LLM-relevant slice of formal logic — chain validity, claim support, contradiction detection, and entailment. The anchor surfaces are ReasoningQuality, ClaimSupport, ContradictionDetection, and FactualConsistency.
Concretely: a contract-analysis team running a legal agent on the langchain traceAI integration instruments every reasoning step. ReasoningQuality evaluates whether the agent’s chain-of-thought is logically valid given its observations, returning a score and a reason that names the failure when one exists (“equivocation between ‘may’ and ‘shall’”). ClaimSupport checks each claim against the cited evidence, while ContradictionDetection catches a response that conflicts with the source text. FactualConsistency uses NLI to check entailment: if the agent claims clause 3 implies obligation X, NLI verifies whether the source text actually entails X.
For datasets, the same evaluators run offline against a curated suite of logic-shaped problems and domain-specific entailment pairs. Scores feed a regression evaluation run, so a prompt change that nudges fluency up but logical validity down is caught before release. FutureAGI’s approach is to treat logical correctness as a peer of fluency and helpfulness — three orthogonal axes that production agents must satisfy together rather than trade off silently.
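A release gate of this shape is small to sketch. The score dictionaries and threshold below are illustrative stand-ins, not FutureAGI's actual evaluator output format: the point is that the gate compares axes independently, so a fluency gain cannot mask a validity regression.

```python
# Illustrative baseline vs. candidate scores for the same eval suite.
BASELINE = {"fluency": 0.88, "reasoning_validity": 0.93}
CANDIDATE = {"fluency": 0.95, "reasoning_validity": 0.81}

MAX_VALIDITY_DROP = 0.02  # tolerated regression on the logic axis

def gate(baseline, candidate):
    """Block the release if logical validity regressed, regardless
    of how the other axes moved."""
    drop = baseline["reasoning_validity"] - candidate["reasoning_validity"]
    if drop > MAX_VALIDITY_DROP:
        return False, f"reasoning_validity regressed by {drop:.2f}"
    return True, "ok"

ok, reason = gate(BASELINE, CANDIDATE)
print(ok, reason)  # False reasoning_validity regressed by 0.12
```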
How to Measure or Detect Formal Logic Failures
Logical validity is graded step-by-step and aggregated across trajectories:
- ReasoningQuality — returns 0–1 plus a reason for the validity of an agent’s reasoning chain through the trajectory.
- ClaimSupport — checks whether individual claims are supported by the supplied reference text.
- ContradictionDetection — flags generated statements that conflict with the retrieved or provided context.
- FactualConsistency — NLI between a generated claim and provided context; the entailment-validity check.
- Per-step pass rate (dashboard signal) — track each reasoning step independently; long chains fail more often than aggregate scores suggest.
- Logic test set — a small held-out suite of FOL- and modal-logic-shaped problems; sharp regression detector when a model swap breaks logical structure preservation.
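A logic test set can be as simple as prompt/expected-verdict pairs plus a pass-rate loop. A sketch with an illustrative four-case suite and a stub in place of the real model call:

```python
# Tiny held-out logic suite; cases and wording are illustrative.
SUITE = [
    {"prompt": "If P then Q. P. Does Q follow?",         "expected": "yes"},
    {"prompt": "If P then Q. Q. Does P follow?",         "expected": "no"},
    {"prompt": "All A are B. x is A. Is x B?",           "expected": "yes"},
    {"prompt": "It is possible that R. Must R be true?", "expected": "no"},
]

def suite_pass_rate(answer_fn):
    """Fraction of cases where the model's verdict matches the
    structurally correct one."""
    hits = sum(answer_fn(c["prompt"]) == c["expected"] for c in SUITE)
    return hits / len(SUITE)

# Stub standing in for a real LLM call: only recognizes modus ponens.
def stub(prompt):
    return "yes" if "P. Does Q follow" in prompt else "no"

print(suite_pass_rate(stub))  # 0.75
```

A model swap that breaks logical structure preservation shows up as a sharp drop in this pass rate, which is exactly the regression signal the suite exists to provide.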
```python
from fi.evals import ReasoningQuality

rq = ReasoningQuality()
result = rq.evaluate(
    input="If P, then Q. P is true.",
    output="Therefore Q.",
)
print(result.score, result.reason)
```
Common mistakes
- Treating fluent chain-of-thought as valid. Most fluent chains contain at least one logical sleight-of-hand; check, don’t assume.
- Using the same model as judge and generator. Self-policing inflates pass rates on logic; pin the judge to a different family.
- Aggregating logical-validity into one number. A 0.9 average can hide a step that fails 30% of the time on a critical case.
- Treating chain-of-thought as proof. The chain is evidence; an evaluator is the verifier. Grade the chain, don’t trust it.
- Ignoring modal scope. “Must,” “may,” and “should” change everything in policy/legal contexts; ensure the eval set covers modal cases.
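The aggregation pitfall above is easy to reproduce: score the same runs two ways. The step names and numbers are illustrative; ten trajectories with one step failing 30% of the time still average out to a healthy-looking 0.90.

```python
from collections import defaultdict

# Ten trajectories, three reasoning steps each; 1 = step passed
# the validity check. The "infer" step fails on 3 of 10 runs.
runs = [{"retrieve": 1, "infer": 0 if i < 3 else 1, "draft": 1}
        for i in range(10)]

# Aggregate view: one number across all steps and runs.
overall = sum(sum(r.values()) for r in runs) / (len(runs) * 3)
print(f"overall: {overall:.2f}")  # overall: 0.90 -- looks healthy

# Per-step view: the failing step stands out immediately.
per_step = defaultdict(list)
for r in runs:
    for step, score in r.items():
        per_step[step].append(score)

for step, scores in per_step.items():
    print(step, sum(scores) / len(scores))  # infer passes only 70%
```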
Frequently Asked Questions
What is formal logic?
Formal logic is the mathematical study of inference, where arguments are evaluated by structure rather than content. It includes propositional, first-order, modal, and higher-order systems, each with its own syntax, semantics, and proof rules.
How is formal logic different from informal logic?
Informal logic studies arguments in natural language, including rhetoric and fallacies. Formal logic abstracts away the language and grades validity from a fixed grammar and inference rules — every step is mechanically checkable.
How does FutureAGI use formal logic to evaluate LLMs?
FutureAGI does not embed a theorem prover, but evaluators like ReasoningQuality, ClaimSupport, and FactualConsistency grade chain-of-thought validity and entailment, the LLM-relevant slice of formal logic.