What Is a Seq2Seq Model?
A neural architecture that maps a variable-length input sequence to a variable-length output sequence, originally RNN-based and now mostly built as encoder-decoder transformers.
A seq2seq (sequence-to-sequence) model is a neural architecture that maps a variable-length input sequence to a variable-length output sequence. The original 2014 design from Sutskever et al. used an RNN encoder to compress the input into a fixed vector and an RNN decoder to generate the output token by token. Modern seq2seq is dominated by encoder-decoder transformers like T5 and BART, which keep the same input-output framing but replace recurrence with attention. Seq2seq powers translation, summarisation, code generation, and the encoder side of many retrieval-augmented generation pipelines.
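The original framing is small enough to sketch in toy Python: a variable-length input is compressed into one fixed-size vector, and the output is generated token by token until an end-of-sequence marker appears. The vocabulary, embeddings, and decode rule below are invented purely for illustration; real models learn these functions.

```python
# Toy sketch of the classical seq2seq framing. An "encoder" compresses a
# variable-length input into one fixed-size context vector; a "decoder"
# then emits output tokens one at a time until it produces "</s>".

def encode(tokens, embed):
    """Mean-pool token embeddings into a single fixed-size context vector."""
    dim = len(next(iter(embed.values())))
    ctx = [0.0] * dim
    for tok in tokens:
        for i, v in enumerate(embed[tok]):
            ctx[i] += v
    return [v / len(tokens) for v in ctx]

def decode(ctx, next_token, max_len=10):
    """Greedy token-by-token generation conditioned on the context vector."""
    out, prev = [], "<s>"
    for _ in range(max_len):
        prev = next_token(ctx, prev)  # next token from context + previous token
        if prev == "</s>":
            break
        out.append(prev)
    return out

# Usage with a hand-written stand-in "model": "hello world" -> "hola mundo".
embed = {"hello": [1.0, 0.0], "world": [0.0, 1.0]}
table = {"<s>": "hola", "hola": "mundo", "mundo": "</s>"}
ctx = encode(["hello", "world"], embed)
print(decode(ctx, lambda c, prev: table[prev]))  # ['hola', 'mundo']
```

The point of the sketch is the contract, not the arithmetic: input and output lengths are decoupled, which is exactly the framing T5 and BART keep while replacing the pooling and lookup with attention.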
Why It Matters in Production LLM and Agent Systems
Seq2seq models still ship in production for tasks where the input/output asymmetry is sharp and the training data is plentiful. Translation pipelines remain seq2seq because the contract — one source sentence, one target sentence — fits the architecture perfectly. Document-to-summary pipelines at scale (news, customer-support tickets, legal documents) often use fine-tuned T5 or BART because the parameter budget per inference is much smaller than a frontier decoder-only LLM, and the quality is acceptable for the task.
The pain shows up in two places. First, evaluation: classical seq2seq metrics like BLEU and ROUGE were designed for these models, and they are notoriously bad at scoring open-ended generation. A summarisation model can produce a perfect summary that scores 0.2 BLEU because the wording differs from the reference. Engineering teams that rely on these metrics ship “regressions” that are nothing of the sort. Second, hallucination: encoder-decoder models hallucinate too, often more confidently than decoder-only LLMs because their pretraining objectives encourage fluency.
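The BLEU failure mode is easy to reproduce with a simplified scorer. The sketch below implements clipped n-gram precision only (real BLEU adds a brevity penalty, smoothing, and multi-reference support), which is enough to show a faithful paraphrase scoring zero:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions (illustration only)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clip each n-gram's count by its count in the reference.
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    return exp(sum(log(p) for p in precisions) / max_n)

reference = "the cat sat on the mat".split()
paraphrase = "a feline rested upon the rug".split()
print(simple_bleu(paraphrase, reference))  # 0.0, despite preserving meaning
```

No bigram of the paraphrase appears in the reference, so the score collapses to zero even though a human judge would accept it, which is why surface metrics belong in trend dashboards rather than gating decisions.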
In 2026-era stacks, seq2seq models also reappear inside larger LLM pipelines. Embedding models are encoders. Reranker models are encoder-style scorers. Some agent architectures use a smaller seq2seq model for input-classification or output-formatting steps before/after a larger decoder-only LLM. Treating the main LLM as the only component worth evaluating is how these seq2seq pieces quietly become the regression source.
How FutureAGI Handles Seq2Seq Models
FutureAGI’s approach is to evaluate seq2seq outputs by what the model is for rather than by surface metrics. For translation, fi.evals.TranslationAccuracy runs a judge-model rubric that checks meaning preservation, fluency, and edge cases like named entities, alongside the classical BLEUScore for trend continuity with prior baselines. For summarisation, fi.evals.SummaryQuality and fi.evals.IsGoodSummary give task-level rubric scores; fi.evals.Faithfulness checks that the summary doesn’t introduce facts absent from the source. For seq2seq components inside larger pipelines, the standard Dataset.add_evaluation() workflow attaches whichever evaluators match the component’s role.
For traceability, traceAI-huggingface and traceAI-vllm instrument seq2seq inference, emitting input/output spans plus latency and token-count attributes that feed cost-attribution dashboards. If the model runs on managed inference (traceAI-bedrock, traceAI-vertexai), those integrations capture the same span shape.
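The span shape can be pictured as a record like the one below. Only the llm.token_count.completion attribute is named above; the other field names, values, and the per-token price are illustrative stand-ins, and real traceAI integrations emit spans through OpenTelemetry rather than plain dicts.

```python
# Hypothetical span payload for one seq2seq inference (values invented).
span = {
    "name": "seq2seq.translate",
    "attributes": {
        "input.value": "Hello, how are you?",
        "output.value": "Hola, ¿cómo estás?",
        "llm.token_count.prompt": 7,
        "llm.token_count.completion": 9,
        "latency_ms": 112,
    },
}

# Cost attribution: dashboards aggregate completion-token counts per trace
# and multiply by the deployment's price (hypothetical rate below).
PRICE_PER_1K_COMPLETION_TOKENS = 0.0004
cost = (span["attributes"]["llm.token_count.completion"] / 1000
        * PRICE_PER_1K_COMPLETION_TOKENS)
print(f"{cost:.7f} per request")
```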
Concretely: an enterprise translation team serving a fine-tuned T5 model instrumented with traceAI-huggingface runs every translation through TranslationAccuracy plus BLEUScore against the canonical reference set. A model-version swap that improves BLEU by 1.2 points but drops TranslationAccuracy by 4 points reveals the new model is gaming surface n-gram overlap while losing meaning preservation — the team rolls back before the regression hits localisation QA. Without the layered eval, the team would have shipped on the BLEU win.
How to Measure or Detect It
Pick task-specific evaluators; reserve surface metrics for trend continuity, not gating:
- TranslationAccuracy: judge-model rubric for meaning preservation, fluency, named-entity handling.
- SummaryQuality + IsGoodSummary: task-level rubric scores for summarisation seq2seq.
- Faithfulness: scores whether the output introduces facts absent from the source — the canonical hallucination signal for summarisation.
- BLEUScore + ROUGEScore: classical n-gram-overlap metrics; useful for trend continuity, not for gating decisions.
- EmbeddingSimilarity: semantic-similarity floor; catches summaries that paraphrase correctly but score zero on BLEU.
- llm.token_count.completion (OTel attribute): tracked per-trace for cost attribution on high-volume seq2seq workloads.
Minimal Python:
from fi.evals import TranslationAccuracy, BLEUScore

acc = TranslationAccuracy()
bleu = BLEUScore()  # scored the same way, kept for trend continuity alongside the rubric

result = acc.evaluate(
    input="Hello, how are you?",
    output="Hola, ¿cómo estás?",
    expected_response="Hola, ¿cómo estás?",
)
print(result.score, result.reason)
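Pairing the two scores into a release policy might look like the sketch below. The release_check helper and its thresholds are hypothetical, but the shape (rubric gates, BLEU only trends) follows the guidance above:

```python
def release_check(rubric_score, bleu_delta, rubric_floor=0.8, bleu_alert=-2.0):
    """Gate on the task-level rubric; treat BLEU movement as a trend signal.

    rubric_score: e.g. TranslationAccuracy on the candidate model (0-1).
    bleu_delta:   BLEU change vs. the previous model version, in points.
    Thresholds are illustrative, not recommended defaults.
    """
    decision = "ship" if rubric_score >= rubric_floor else "block"
    # BLEU never blocks on its own, but a large drop is worth investigating.
    notes = ["investigate BLEU drop"] if bleu_delta <= bleu_alert else []
    return decision, notes

print(release_check(0.91, -2.5))  # ('ship', ['investigate BLEU drop'])
print(release_check(0.74, +1.2))  # ('block', [])
```

Note the asymmetry: a rubric regression blocks the release outright, while a BLEU regression only opens an investigation, which is exactly the inversion of how most legacy translation pipelines gate today.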
Common Mistakes
- Using BLEU as the only quality metric. BLEU is a surface metric — meaning-preserving paraphrases score zero.
- Skipping faithfulness evaluation on summarisation. Encoder-decoder summarisers hallucinate; reference-free Faithfulness catches what reference metrics miss.
- Comparing seq2seq and decoder-only LLMs by perplexity. The architectures aren’t comparable on perplexity; use task-level rubrics.
- Not evaluating embedded seq2seq components. A reranker or classifier inside a larger pipeline is a seq2seq surface that needs its own evals.
- Ignoring length distribution shift. A new seq2seq variant can change average output length by 30%, which silently breaks downstream parsers.
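The length-shift failure in the last bullet can be caught with a one-function check over outputs from the old and new model versions. The 30% threshold mirrors the example above and is adjustable; the sample outputs are invented:

```python
from statistics import mean

def length_shift(old_outputs, new_outputs, threshold=0.30):
    """Flag when a new seq2seq variant moves average output length by more
    than `threshold` (relative). Lengths counted in whitespace tokens."""
    old_len = mean(len(o.split()) for o in old_outputs)
    new_len = mean(len(o.split()) for o in new_outputs)
    shift = abs(new_len - old_len) / old_len
    return shift > threshold, round(shift, 3)

# Invented outputs: the new variant summarises far more tersely.
old = ["the report covers q3 revenue and churn", "ticket resolved by reset"]
new = ["covers q3 revenue", "reset fixed it"]
print(length_shift(old, new))  # (True, 0.455)
```

Run as a pre-deploy check on a held-out prompt set, this catches the silent downstream-parser breakage before any quality metric moves.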
Frequently Asked Questions
What is a seq2seq model?
A seq2seq model is a neural architecture that maps a variable-length input sequence to a variable-length output sequence — used for translation, summarisation, and code generation.
Are modern LLMs seq2seq models?
GPT-style models are decoder-only autoregressive transformers, technically a special case of seq2seq in which the prompt and the generated output share a single sequence. Classical seq2seq has separate encoder and decoder; T5 and BART are the canonical encoder-decoder transformer seq2seq examples.
How do you evaluate a seq2seq model in FutureAGI?
Pair task-specific evaluators (TranslationAccuracy, SummaryQuality, BLEUScore) with reference-free signals like Faithfulness, all attached to a Dataset for regression tracking.