What Are Sequence-to-Sequence Models?
Neural networks that map variable-length input sequences to variable-length output sequences, used for translation, summarisation, and structured generation.
Sequence-to-sequence (seq2seq) models are neural networks that take a variable-length input sequence and produce a variable-length output sequence. The pattern was introduced for neural machine translation in 2014 by Sutskever, Vinyals, and Le, originally with RNN encoder-decoder pairs and quickly extended with attention. The 2026 production stack uses encoder-decoder transformers — T5, BART, mT5 — for the same set of tasks: translation, summarisation, code generation, structured-output extraction, and the encoder side of many retrieval and reranker pipelines. They live in production alongside decoder-only LLMs and bring their own evaluation needs.
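As a concrete sketch, an encoder-decoder model like T5 can be loaded through the Hugging Face transformers pipeline; the model name and input sentence here are illustrative:

```python
from transformers import pipeline

# Load a small encoder-decoder (seq2seq) model; "t5-small" is illustrative.
translator = pipeline("translation_en_to_de", model="t5-small")

# Variable-length input in, variable-length output out.
print(translator("The house is wonderful.")[0]["translation_text"])
```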
Why It Matters in Production LLM and Agent Systems
Sequence-to-sequence models are still the right architecture when the input/output asymmetry is sharp, the training data is plentiful, and the per-inference cost has to stay low. Translation services running at hundreds of millions of requests per day cannot afford a frontier decoder-only LLM at every request — fine-tuned T5 or mT5 hits the latency and cost target with task-specific quality. Customer-support summarisation pipelines processing every closed ticket use BART-class models for the same reason.
The pain shows up in two specific places. First, surface metrics lie. BLEU and ROUGE were designed for these models, but they grade meaning-preserving paraphrases as failures and verbatim n-gram copies as wins. Engineers ship “regressions” that are not regressions and miss real ones. Second, hallucination is real. Encoder-decoder summarisers fluently introduce facts that aren’t in the source; without Faithfulness-style evaluation the hallucination rate is invisible.
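The paraphrase failure mode is easy to reproduce with the sacrebleu package; the sentences below are illustrative:

```python
import sacrebleu

reference = ["The physician recommended rest."]

# Meaning-preserving paraphrase: almost no n-gram overlap, so BLEU is near zero.
paraphrase = "The doctor advised taking a break."
# Verbatim-copy style output with degraded meaning: heavy n-gram overlap.
copyish = "The physician recommended rest and rest and rest."

print(sacrebleu.sentence_bleu(paraphrase, reference).score)  # low
print(sacrebleu.sentence_bleu(copyish, reference).score)     # substantially higher
```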
The role-by-role pain is concrete. ML platform leads are asked why a 40% cost cut from swapping to a smaller seq2seq model came with a 12-point CSAT drop on summarised tickets. Localisation managers see machine-translation BLEU climb while human-review reject rates climb in parallel. Compliance leads have no documented evaluation of summarisation faithfulness for legal documents.
In 2026-era stacks, sequence-to-sequence models also reappear inside larger LLM pipelines as input classifiers, output formatters, and rerankers. Treating “the LLM is the only thing to evaluate” as a settled question is how a quiet seq2seq regression takes down the headline metric.
How FutureAGI Handles Sequence-to-Sequence Models
FutureAGI’s approach is to evaluate seq2seq outputs by task, not by surface n-gram overlap, while keeping classical metrics for trend continuity. fi.evals.TranslationAccuracy runs a meaning-preservation rubric for translation tasks, alongside fi.evals.BLEUScore for backwards-compatible trend lines. fi.evals.SummaryQuality and fi.evals.IsGoodSummary score summarisation against a rubric. fi.evals.Faithfulness and fi.evals.RAGFaithfulness catch hallucinated facts that surface-overlap metrics miss completely.
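A sketch of pairing a rubric evaluator with a surface metric, mirroring the evaluate(input=..., output=...) pattern in the Minimal Python snippet below; the variable names are placeholders, and the exact argument BLEUScore expects for the reference is an assumption:

```python
from fi.evals import TranslationAccuracy, BLEUScore

source_text = "..."            # original-language input
model_translation = "..."      # seq2seq output under test
reference_translation = "..."  # human reference, for the surface metric

# Rubric-based meaning preservation: the gating signal.
accuracy = TranslationAccuracy().evaluate(input=source_text, output=model_translation)

# Classical surface overlap: kept for trend continuity, not for gating.
# (Passing the reference as `input` is an assumption about the call signature.)
bleu = BLEUScore().evaluate(input=reference_translation, output=model_translation)

print(accuracy.score, bleu.score)
```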
For traceability, traceAI-huggingface instruments locally-hosted seq2seq inference, while traceAI-bedrock, traceAI-vertexai, and traceAI-vllm cover managed and self-hosted inference paths. Each emits the standard span attributes — llm.token_count.prompt, llm.token_count.completion, latency — for cost attribution. For seq2seq components embedded inside larger pipelines (e.g., a reranker between retrieval and a generator LLM), Dataset.add_evaluation() attaches the appropriate per-component evaluator to its labeled dataset.
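For cost attribution, the token-count attributes can be rolled up per request. A minimal sketch with illustrative per-token rates; the attribute names are the ones listed above, while the rates and helper function are hypothetical:

```python
# Illustrative per-1K-token rates; substitute your model's actual pricing.
PROMPT_RATE = 0.0004
COMPLETION_RATE = 0.0016

def span_cost(attrs: dict) -> float:
    """Estimate per-request cost from the standard llm.token_count.* span attributes."""
    prompt_tokens = attrs.get("llm.token_count.prompt", 0)
    completion_tokens = attrs.get("llm.token_count.completion", 0)
    return (prompt_tokens / 1000) * PROMPT_RATE + (completion_tokens / 1000) * COMPLETION_RATE
```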
Concretely: a media company running a multilingual summarisation service on a fine-tuned mT5 model instruments inference with traceAI-huggingface and runs SummaryQuality and Faithfulness on every production trace. When an A/B test of a new model variant shows BLEU +0.8 but Faithfulness -6 points, the team blocks the rollout — the new variant is more fluent but introduces facts absent from source documents. Reference-only metrics would have green-lit the regression.
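That rollout decision reduces to a gate that refuses to trade faithfulness for fluency. A hypothetical sketch, where the metric dict shape and threshold are assumptions:

```python
def gate_rollout(baseline: dict, candidate: dict, max_faithfulness_drop: float = 2.0) -> bool:
    """Approve only if faithfulness holds, regardless of BLEU movement."""
    faith_delta = candidate["faithfulness"] - baseline["faithfulness"]
    if faith_delta < -max_faithfulness_drop:
        return False  # e.g. BLEU +0.8 but Faithfulness -6: blocked
    return True

# From the example above: BLEU improved, faithfulness regressed 6 points.
print(gate_rollout({"bleu": 31.2, "faithfulness": 90.0},
                   {"bleu": 32.0, "faithfulness": 84.0}))  # False
```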
How to Measure or Detect It
Use task-specific evaluators and reserve surface metrics for trend continuity:
- TranslationAccuracy: meaning-preservation rubric for machine-translation seq2seq.
- SummaryQuality + IsGoodSummary: rubric-based summarisation evaluators.
- Faithfulness: detects hallucinated facts in seq2seq outputs against the source.
- BLEUScore + ROUGEScore: classical surface metrics; trend continuity only, not gating.
- EmbeddingSimilarity: semantic floor that protects against false-fails on paraphrase.
- llm.token_count.completion (OTel attribute): cost-attribution signal for high-volume workloads.
Minimal Python (the placeholder strings stand in for real documents):

```python
from fi.evals import SummaryQuality, Faithfulness

source_document = "..."    # the document the model summarised
generated_summary = "..."  # the seq2seq model's output

quality = SummaryQuality()
faith = Faithfulness()

# Score the summary against its source for hallucinated facts.
result = faith.evaluate(
    input=source_document,
    output=generated_summary,
)
# Same call pattern assumed for the rubric-based quality evaluator.
quality_result = quality.evaluate(input=source_document, output=generated_summary)

print(result.score, result.reason)
```
Common Mistakes
- Reporting BLEU/ROUGE as the headline metric. They are surface metrics; meaning-preserving paraphrases can score near zero.
- Skipping Faithfulness on summarisation models. Hallucinated facts are common in fluent encoder-decoder outputs.
- Treating decoder-only and encoder-decoder models as interchangeable. Different inductive biases, different failure modes; evaluate accordingly.
- No per-language slicing on multilingual seq2seq. A 0.84 mean across languages can hide a 0.55 score on the rare language that just signed an enterprise contract.
- Ignoring length-distribution shift after fine-tuning. Average output length can move 20–30% silently and break downstream pipelines. (Both checks are cheap; see the sketch after this list.)
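The last two mistakes can be caught with a slicing pass over per-trace records. A minimal sketch, where the record shape is hypothetical:

```python
from collections import defaultdict
from statistics import mean

def slice_scores(records):
    """Per-language mean scores, so a 0.55 language can't hide behind a 0.84 mean."""
    by_lang = defaultdict(list)
    for r in records:  # r: {"lang": "de", "score": 0.83, "output_tokens": 120}
        by_lang[r["lang"]].append(r["score"])
    return {lang: mean(scores) for lang, scores in by_lang.items()}

def length_shift(before, after):
    """Relative change in mean output length; a silent 20-30% move shows up here."""
    b = mean(r["output_tokens"] for r in before)
    a = mean(r["output_tokens"] for r in after)
    return (a - b) / b
```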
Frequently Asked Questions
What are sequence-to-sequence models?
Neural networks that map a variable-length input sequence to a variable-length output sequence — used for translation, summarisation, code generation, and structured extraction.
What's the difference between sequence-to-sequence models and decoder-only LLMs?
Seq2seq models have a separate encoder that compresses the input and a decoder that generates the output. Decoder-only LLMs use a single autoregressive stack — they're a special case of the general seq2seq idea.
How do you evaluate sequence-to-sequence models in production?
Pair task-specific evaluators (TranslationAccuracy, SummaryQuality) with reference-free signals like Faithfulness, attached to a Dataset for cohort-level regression eval.