
How Multimodal LLMs Work in 2026: Vision Encoders, Fusion, and Cross-Attention Explained

Multimodal LLM internals in 2026. Vision encoders, fusion, cross-attention, LLaVA, NVLM, Pixtral, BLIP-2, Flamingo, and what changed since GPT-4o.

TL;DR

| Layer | Job | Common implementations |
| --- | --- | --- |
| Vision encoder | Turn an image into patch embeddings | CLIP ViT-L/14, SigLIP, InternViT-6B, Pixtral-ViT |
| Connector | Map visual features into LLM token space | Projection-based (LLaVA, NVLM, Pixtral), Q-Former (BLIP-2), cross-attention layers (Flamingo) |
| Language decoder | Generate text conditioned on combined tokens | Llama 3 or 4, Vicuna, Qwen2 / Qwen3, MPT, Mistral |
| Output | Text (and sometimes image, audio) | Standard autoregressive decoding |

This guide explains how a multimodal large language model takes an image plus a question and produces a text answer in 2026. It covers vision encoders, the three dominant fusion strategies (projection-based connectors, Q-Former, cross-attention), and the training recipes that keep text-only performance intact after multimodal training. The closing section covers how to evaluate vision-LLMs in production.

What Multimodality Means in 2026

A “modality” is a kind of input or output: text, image, audio, video, code. A multimodal LLM accepts at least two of these as input. In May 2026, that almost always means text plus image (vision-LLMs), with text plus audio (Gemini 3, GPT-5 voice) close behind, and text plus video catching up.

The core architectural pattern is the same across all of them:

  1. Encode each non-text modality with a domain-specific encoder.
  2. Project the encoder’s features into the language model’s token space.
  3. Decode with a text-only LLM that sees both the projected tokens and the user’s text prompt.

The interesting design choices live in step 2 and in how the three components are trained together.
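
As a rough, runnable sketch of this three-step flow (the dimensions and modules below are placeholders for illustration, not any particular model's implementation):

import torch
import torch.nn as nn

# Hypothetical sizes: 1024-d patch embeddings, 4096-d LLM hidden size, 256 patches.
VISION_DIM, LLM_DIM, NUM_PATCHES = 1024, 4096, 256

vision_encoder = nn.Linear(3 * 14 * 14, VISION_DIM)            # step 1: stand-in for a ViT
connector = nn.Sequential(                                      # step 2: project into token space
    nn.Linear(VISION_DIM, LLM_DIM), nn.GELU(), nn.Linear(LLM_DIM, LLM_DIM)
)

patches = torch.randn(1, NUM_PATCHES, 3 * 14 * 14)              # flattened image patches
visual_tokens = connector(vision_encoder(patches))              # (1, 256, 4096)
text_tokens = torch.randn(1, 32, LLM_DIM)                       # embedded user prompt
decoder_input = torch.cat([visual_tokens, text_tokens], dim=1)  # step 3: decode over both
print(decoder_input.shape)                                      # torch.Size([1, 288, 4096])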

Step 1: Vision Encoders

The vision encoder is a Vision Transformer (ViT) or a close variant. It takes an image, splits it into patches (typically 14x14 or 16x16 pixels), and produces a sequence of patch embeddings. The dominant encoders in 2026 are:

  • CLIP ViT-L/14 (Radford et al. 2021). 24 transformer layers, 224 or 336 px inputs. Still common in LLaVA-family models.
  • SigLIP (Zhai et al. 2023). Same idea as CLIP but with a sigmoid loss; better data efficiency.
  • InternViT-6B-448px (Chen et al. 2023). 6B parameter encoder used by NVLM and InternVL.
  • Pixtral-ViT (Mistral). 1B parameter encoder built specifically for variable-resolution inputs.

The output of the encoder is a sequence of vectors, one per patch. A 224x224 image with 14x14 patches produces 256 patch tokens.
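
You can verify that arithmetic directly; a minimal sketch, assuming the Hugging Face transformers library and the openai/clip-vit-large-patch14 checkpoint (224 px input, 14x14 patches):

import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

image = Image.new("RGB", (224, 224))                   # placeholder image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = encoder(**inputs)

# (224 / 14)^2 = 256 patch tokens, plus one CLS token, each 1024-dimensional.
print(out.last_hidden_state.shape)                     # torch.Size([1, 257, 1024])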

Step 2: Fusion / Connector Strategies

This is where the design decisions matter. The connector turns visual patch embeddings into something the LLM can read. Three options dominate 2026 architectures.

Option A: Projection-based connector (LLaVA, NVLM, Pixtral)

A small feed-forward network maps each patch embedding into the LLM's embedding space, producing one visual token per patch. The most common choice is a two-layer MLP with a GELU or ReLU non-linearity (LLaVA, NVLM). Pixtral Large uses a lighter linear projection plus image-token markers. Either way, the projection is small, fast, and easy to train.

  • Token count equals patch count. 256 patches means 256 extra tokens in the context.
  • Used by LLaVA 1.5+, NVLM 1.0, Pixtral, and most newer open-weight models.
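
A minimal sketch of such a projector in PyTorch (the dimensions are assumptions chosen to match a CLIP ViT-L/14 encoder and a ~7B decoder, not any specific model's configuration):

import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP connector in the LLaVA/NVLM style (illustrative)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_embeddings)

projector = MLPProjector()
visual_tokens = projector(torch.randn(1, 256, 1024))
print(visual_tokens.shape)  # torch.Size([1, 256, 4096]); one LLM token per patch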

Option B: Q-Former (BLIP-2, InstructBLIP)

A learnable transformer compresses N patch embeddings into K query tokens (K is much smaller than N). For BLIP-2, K=32. The Q-Former is itself a 12-layer transformer trained in two phases.

  • Substantially reduces the visual token count in the LLM context (for example, 256 patches become 32 query tokens).
  • Adds parameters and a separate training phase.
  • Used by BLIP-2, InstructBLIP, and several enterprise vision pipelines.

Reference: Li et al. 2023.
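
The compression step can be sketched with a single cross-attention layer; the real Q-Former stacks 12 transformer layers and also applies self-attention among the queries, so treat this as an illustration of the idea only:

import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """Toy Q-Former-style compressor: K learned queries attend to N patch embeddings."""
    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        batch = patch_embeddings.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)        # (batch, 32, dim)
        out, _ = self.cross_attn(q, patch_embeddings, patch_embeddings)
        return out                                                  # (batch, 32, dim)

compressor = QueryCompressor()
compressed = compressor(torch.randn(1, 256, 768))                   # 256 patches -> 32 queries
print(compressed.shape)                                             # torch.Size([1, 32, 768])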

Option C: Cross-attention adapter layers (Flamingo, OpenFlamingo)

Insert new cross-attention layers between the existing self-attention layers of a frozen LLM. Each cross-attention layer takes language tokens as queries and visual tokens as keys and values.

  • Keeps the LLM frozen entirely, so text-only performance is preserved by construction.
  • Adds parameters per inserted layer.
  • Used by DeepMind Flamingo and the open-source OpenFlamingo.

Reference: Alayrac et al. 2022.
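
A minimal sketch of one such adapter layer (Flamingo additionally interleaves these blocks with feed-forward layers; the dimensions here are placeholders):

import torch
import torch.nn as nn

class GatedCrossAttentionAdapter(nn.Module):
    """Language tokens (queries) attend to visual tokens (keys/values). A tanh gate
    initialized at zero leaves the frozen LLM's behavior unchanged at the start of
    training; the gate opens as the adapter learns."""
    def __init__(self, dim: int = 4096, num_heads: int = 32):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))                     # starts closed

    def forward(self, lang_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(lang_tokens, visual_tokens, visual_tokens)
        return lang_tokens + torch.tanh(self.gate) * attended        # gated residual

adapter = GatedCrossAttentionAdapter()
out = adapter(torch.randn(1, 32, 4096), torch.randn(1, 256, 4096))
print(out.shape)                                                     # torch.Size([1, 32, 4096])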

Step 3: The Language Decoder

The decoder is a standard autoregressive text LLM. In 2026, the dominant open-weight choices are:

  • Llama 3 / Llama 4 family
  • Qwen2 / Qwen3 family
  • Mistral / Mixtral family
  • Vicuna (LLaMA-derived instruction-tuned)

The decoder reads the concatenated sequence of projected visual tokens and the user’s text tokens, then generates a response one token at a time. For Flamingo-style cross-attention, the LLM is frozen and unmodified; the cross-attention layers do the integration.
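
At the level of pseudocode, the decode step for a projection-based model looks like this (a sketch assuming a Hugging Face-style causal LM that accepts inputs_embeds; the helper names are placeholders):

import torch

def answer(image_patch_embeddings, prompt_ids, llm, projector):
    """Concatenate projected visual tokens with embedded prompt tokens, then decode."""
    visual_tokens = projector(image_patch_embeddings)                # (1, N, hidden)
    text_embeds = llm.get_input_embeddings()(prompt_ids)             # (1, T, hidden)
    inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)   # (1, N + T, hidden)
    # Standard autoregressive decoding over the combined sequence.
    return llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=128)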

Reference Architectures

LLaVA (LLaVA-NeXT in 2026)

LLaVA is the canonical open-source vision-language model.

  • Vision encoder: CLIP ViT-L/14 (336 px in LLaVA-NeXT, with AnyRes for higher resolutions)
  • Connector: Two-layer MLP projector
  • Decoder: Vicuna 7B or 13B (originally), with newer LLaVA-NeXT variants on Llama 3 backbones
  • Training: Two-stage. Stage 1 trains the projector on CC3M-style image-text pairs. Stage 2 fine-tunes on GPT-generated multimodal instruction data, with both the projector and the LLM unfrozen.

LLaVA’s key contribution was showing that high-quality visual instruction tuning data plus a tiny connector can match much larger proprietary systems on academic benchmarks. The recipe scales to 70B+ backbones.
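
For hands-on experimentation, LLaVA-NeXT checkpoints are available through Hugging Face transformers; a minimal inference sketch, assuming a recent transformers version and the llava-hf/llava-v1.6-mistral-7b-hf checkpoint:

import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png")                                   # placeholder path
prompt = "[INST] <image>\nWhat does this chart show? [/INST]"     # Mistral-style template
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))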

NVIDIA NVLM 1.0

NVLM 1.0 is NVIDIA’s open-source vision-language model.

  • Vision encoder: InternViT-6B-448px-V1-5
  • Connector: MLP projector with a tile-tagging mechanism that splits high-resolution images into tiles and tags each tile with positional text
  • Decoder: Qwen2-72B-Instruct
  • Training: Two-stage. Pretrain the projector on diverse VQA, OCR, and reasoning data. SFT unfreezes the LLM while keeping the vision encoder frozen, blending multimodal and high-quality text-only data.

NVLM is notable for reporting an average 4.3-point gain on text-only math and coding benchmarks after multimodal training, addressing the catastrophic-forgetting problem. Reference: Dai et al. 2024.
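
The tile-tagging idea can be illustrated in a few lines of Python (a sketch of the concept, not NVLM's implementation; the tag format is hypothetical):

from PIL import Image

def split_into_tagged_tiles(image: Image.Image, tile: int = 448):
    """Split a high-resolution image into fixed-size tiles and pair each tile with
    a text tag so the decoder knows which region each tile came from."""
    cols, rows = max(1, image.width // tile), max(1, image.height // tile)
    tiles = []
    for r in range(rows):
        for c in range(cols):
            crop = image.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
            tiles.append((f"<tile_{r}_{c}>", crop))                 # hypothetical tag format
    return tiles

tiles = split_into_tagged_tiles(Image.new("RGB", (1344, 896)))
print([tag for tag, _ in tiles])                                    # 6 tags for a 3x2 tile grid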

Mistral Pixtral Large

Pixtral Large is Mistral’s 124B parameter open-weight vision-language model.

  • Vision encoder: Pixtral-ViT (1B parameters), designed for variable resolution
  • Connector: A linear projection from the Pixtral-ViT output into the text decoder, paired with image-token markers; the decoder then treats visual and text tokens in a unified attention layout
  • Decoder: 123B Mistral text decoder
  • Key tricks: Block-diagonal attention masks and 2D Rotary Position Embedding (RoPE-2D) handle variable image sizes; context length up to 128k tokens.

Pixtral Large is currently among the strongest open-weight multimodal models on MathVista, DocVQA, and ChartQA.
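
A minimal sketch of the block-diagonal masking idea (illustrative only; Pixtral's actual implementation lives in its attention kernels):

import torch

def block_diagonal_mask(image_token_counts: list[int]) -> torch.Tensor:
    """Patch tokens may attend only to patch tokens from the same image, so images of
    different sizes can be packed into one sequence without cross-image attention."""
    total = sum(image_token_counts)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in image_token_counts:
        mask[start:start + n, start:start + n] = True
        start += n
    return mask

# Two images of different resolutions: 256 and 400 patch tokens.
mask = block_diagonal_mask([256, 400])
print(mask.shape, mask[0, 255].item(), mask[0, 256].item())  # torch.Size([656, 656]) True False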

BLIP-2 / InstructBLIP

BLIP-2 is the reference for Q-Former-based architectures.

  • Vision encoder: Frozen pretrained ViT-L/14 (or EVA-ViT)
  • Connector: 12-layer Q-Former that compresses patches into 32 query tokens
  • Decoder: Frozen Flan-T5 or OPT, depending on the BLIP-2 variant
  • Training: Two-phase. Phase 1 trains the Q-Former on image-text representation tasks against the frozen vision encoder. Phase 2 connects the Q-Former to the frozen LLM and trains for vision-to-language generation.

Both the encoder and the decoder are frozen throughout. Only the Q-Former and a few small fusion layers are trained. This gives BLIP-2 strong sample efficiency.

OpenFlamingo

OpenFlamingo is the open-source replication of DeepMind Flamingo.

  • Vision encoder: CLIP ViT-L/14
  • Connector: Cross-attention adapter layers inserted between self-attention layers in a frozen LLM
  • Decoder: Llama, MPT, or other frozen autoregressive LLMs
  • Training: Fine-tune the cross-attention layers on interleaved image-text web data (Multimodal C4, LAION-2B).

OpenFlamingo’s freezing strategy preserves the underlying LLM’s text-only capabilities by construction, at the cost of a more complex parameter layout.

Benchmark Snapshot (May 2026)

| Model | VQAv2 | DocVQA | ChartQA | MathVista | MMMU | OCRBench |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA-NeXT 34B | ~83% | ~84% | ~70% | ~46% | ~52% | ~57% |
| NVLM 1.0 (D-72B) | ~80% | ~85% | ~76% | ~65% | ~78% | ~92% |
| Pixtral Large | n/a | ~90% | ~88% | ~85% | ~74% | n/a |
| BLIP-2 (Flan-T5 XXL) | ~65% | n/a | n/a | n/a | n/a | n/a |
| OpenFlamingo-9B | ~52% | n/a | n/a | n/a | n/a | n/a |

Numbers come from each model’s paper or release notes; check vendor sources for the exact configuration and prompting protocol. Cross-model comparisons here are directional only.

How Multimodal Training Avoids Catastrophic Forgetting

The biggest failure mode of naive multimodal fine-tuning is that the LLM loses text-only ability. Two techniques mitigate this in 2026:

  1. Selective unfreezing. Freeze the vision encoder during SFT; only unfreeze the connector and the LLM. NVLM and LLaVA both follow this.
  2. Mixed data. Blend high-quality text-only fine-tuning data into the multimodal SFT mix. NVLM reports a 4.3-point gain on text-only math and coding after multimodal training using this recipe.

A third option, used by Flamingo and OpenFlamingo, is to never unfreeze the LLM at all and only train the cross-attention adapters. That preserves text-only performance by construction but limits the model’s ability to learn fundamentally new behaviors. Other recent open-weight releases (Pixtral Large, Qwen-VL, LLaVA-NeXT) document their own variants of the mixed-data approach in their respective papers.
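
The mixed-data recipe is easy to sketch; the ratio below is an assumption chosen for illustration, not any model's published mix:

import random

def build_sft_mix(multimodal_examples, text_only_examples, text_fraction=0.25, seed=0):
    """Blend high-quality text-only examples into the multimodal SFT stream so the
    decoder keeps seeing pure-text supervision during multimodal fine-tuning."""
    rng = random.Random(seed)
    n_text = int(len(multimodal_examples) * text_fraction / (1 - text_fraction))
    sampled = rng.sample(text_only_examples, min(n_text, len(text_only_examples)))
    mix = list(multimodal_examples) + sampled
    rng.shuffle(mix)
    return mix

mix = build_sft_mix([{"image": i} for i in range(90)], [{"text": i} for i in range(1000)])
print(len(mix))  # 120 examples: 90 multimodal + 30 text-only (25% text)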

Evaluating Multimodal LLMs in Production

Multimodal models are harder to evaluate than text-only models because:

  • The input is high-dimensional and not easily diffable
  • Hallucinations can be visual (the model invents content not in the image) or textual (the model misreads chart values)
  • Standard heuristic metrics (BLEU, ROUGE) miss most failure modes

A practical 2026 evaluation stack:

  1. Public benchmarks as a starting baseline: MMMU, MathVista, DocVQA, ChartQA, OCRBench, VQAv2.
  2. Task-specific deterministic checks on production data: JSON-schema validation, output length caps, citation regex.
  3. LLM-judge metrics for the open-ended failures: faithfulness to the image, instruction adherence, tone, safety.
  4. Tracing so every score is tied to a replayable run.

Future AGI fits at steps 3 and 4 as the eval and observability companion. The evaluate() function in ai-evaluation ships built-in metrics like faithfulness and instruction_adherence that work on multimodal outputs, and traceAI captures the prompt, image, output, and score as a span for replay.

from fi.evals import evaluate

# Score a single multimodal output with a built-in metric.
result = evaluate(
    "instruction_adherence",
    output="<the model's text answer>",
    input="<the user's text prompt and a description of the image>",
)

print(result.score, result.explanation)

For custom rubrics (for example, “score whether the model correctly read the chart legend”), use a custom LLM-judge:

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
from fi.opt.base import Evaluator

# LLM provider that backs the custom judge.
provider = LiteLLMProvider(model="gpt-5-2025-08-07")

judge = CustomLLMJudge(
    name="chart_legend_read",
    prompt=(
        "Given a chart description and a model answer, score 0 or 1 "
        "whether the model correctly identified the legend labels. "
        "Respond with JSON: {\"score\": int, \"reason\": string}."
    ),
    provider=provider,
)

evaluator = Evaluator(judge)
score = evaluator.evaluate(output="<the model's answer>")
print(score)

The turing_flash hosted judge returns in about 1 to 2 seconds, turing_small in 2 to 3 seconds, and turing_large in 3 to 5 seconds. Configure with FI_API_KEY and FI_SECRET_KEY, and open the runs at /platform/monitor/command-center.

Open-Source Multimodal LLMs and Why They Matter

Open-weight multimodal models matter because:

  • They let researchers and engineers reproduce results without relying on a closed API.
  • They allow domain-specific fine-tuning (medical imaging, satellite imagery, manufacturing defect detection) without sending images to a third party.
  • They are the only option for air-gapped or compliance-bound environments.

The trade-off is engineering overhead: serving a 72B or 124B vision-language model requires multi-GPU inference and careful batching. The frontier API products (GPT-5, Claude Opus 4.7, Gemini 3) are easier to start with; open-weight models become attractive once you have a steady production workload to amortize the serving cost.

Frequently asked questions

What is a multimodal large language model?
A multimodal LLM is a model that accepts inputs from more than one modality (text, image, audio, sometimes video) and produces text or another modality as output. The standard 2026 design pairs a vision (or audio) encoder with a text decoder and a connector that maps the encoder's features into the decoder's token space. Examples include GPT-5 with vision, Claude Opus 4.7 with vision, Gemini 3, Pixtral Large, NVLM 1.0, and LLaVA-NeXT.
How do vision encoders connect to a language model?
A vision encoder (typically a Vision Transformer such as CLIP ViT-L/14, SigLIP, or InternViT) turns an image into a sequence of patch embeddings. A connector then maps those embeddings into the LLM's token space. The two dominant connectors are a projection-based connector that maps patch embeddings one-to-one into LLM tokens (LLaVA, NVLM, Pixtral) and a Q-Former that compresses many patches into a small number of learned query tokens (BLIP-2). Cross-attention adapters inserted into the LLM (Flamingo style) are a third option.
What is cross-attention in a multimodal model?
Cross-attention lets one stream of tokens attend to another. In a Flamingo-style architecture, new cross-attention layers are inserted between the existing self-attention layers of a frozen language model. Each cross-attention layer takes language tokens as queries and visual tokens as keys and values, so the language model can pull in visual context while still using its existing language weights.
What is the difference between MLP projectors and Q-Formers?
An MLP projector is a small feed-forward network that maps each patch embedding to a token in the LLM's space. It is simple, fast, and used by LLaVA and NVLM. A linear projection variant (Pixtral) is even lighter. A Q-Former is a trainable transformer that compresses N visual patches into K learned query tokens (K is much smaller than N). It saves context tokens but adds parameters and a separate training phase. BLIP-2 and InstructBLIP use Q-Formers.
How are multimodal LLMs trained?
Two-stage is standard. Stage one (alignment): freeze the vision encoder and the LLM, train only the connector (MLP, Q-Former, or cross-attention layers) on image-text pairs so visual features land in the right region of the token space. Stage two (instruction tuning): unfreeze the connector and often the LLM, train on multimodal instruction data so the model can follow complex visual prompts. NVLM and LLaVA both follow this pattern.
Why do some multimodal models lose text-only performance after vision training?
If you unfreeze the LLM during multimodal fine-tuning and train only on multimodal data, the model can forget text-only capabilities (catastrophic forgetting). NVLM 1.0 addresses this by blending high-quality text-only data into the supervised fine-tuning mix, and reports a 4.3-point average improvement on text-only math and coding benchmarks after multimodal training. Other recent open-weight releases use related mixed-data approaches; check each model's paper for the exact recipe.
How do I evaluate a multimodal LLM in 2026?
Use a mix of academic benchmarks (MMMU, MathVista, DocVQA, ChartQA, OCRBench, VQAv2) and task-specific evaluations on real production inputs. For production use, add an LLM-judge that scores faithfulness to the image content and instruction adherence. A tracing layer such as Future AGI traceAI captures the prompt, the image, the output, and the score for every run so regressions are visible across releases.
What changed in multimodal LLMs between 2024 and 2026?
Three things. First, native multimodal frontier models (GPT-5, Claude Opus 4.7, Gemini 3) ship with vision and often audio as first-class inputs rather than bolted-on features. Second, open-weight multimodal models (Pixtral Large, NVLM 1.0, LLaVA-NeXT, Qwen-VL) closed most of the quality gap to closed models. Third, high-resolution support via tile-tagging or dynamic resolution became standard, so DocVQA and ChartQA results improved significantly.