How Multimodal LLMs Work in 2026: Vision Encoders, Fusion, and Cross-Attention Explained
Multimodal LLM internals in 2026. Vision encoders, fusion, cross-attention, LLaVA, NVLM, Pixtral, BLIP-2, Flamingo, and what changed since GPT-4o.
TL;DR
| Layer | Job | Common implementations |
|---|---|---|
| Vision encoder | Turn an image into patch embeddings | CLIP ViT-L/14, SigLIP, InternViT-6B, Pixtral-ViT |
| Connector | Map visual features into LLM token space | Projection-based (LLaVA, NVLM, Pixtral), Q-Former (BLIP-2), cross-attention layers (Flamingo) |
| Language decoder | Generate text conditioned on combined tokens | Llama 3 or 4, Vicuna, Qwen2 / Qwen3, MPT, Mistral |
| Output | Text (and sometimes image, audio) | Standard autoregressive decoding |
This guide explains how a multimodal large language model takes an image plus a question and produces a text answer in 2026. It covers vision encoders, the three dominant fusion strategies (projection-based connectors, Q-Former, cross-attention), and the training recipes that keep text-only performance intact after multimodal training. The closing section covers how to evaluate vision-LLMs in production.
What Multimodality Means in 2026
A “modality” is a kind of input or output: text, image, audio, video, code. A multimodal LLM accepts at least two of these as input. In May 2026, that almost always means text plus image (vision-LLMs), with text plus audio (Gemini 3, GPT-5 voice) close behind, and text plus video catching up.
The core architectural pattern is the same across all of them:
1. Encode each non-text modality with a domain-specific encoder.
2. Project the encoder’s features into the language model’s token space.
3. Decode with a text-only LLM that sees both the projected tokens and the user’s text prompt.
The interesting design choices live in step 2 and in how the three components are trained together.
Step 1: Vision Encoders
The vision encoder is a Vision Transformer (ViT) or a close variant. It takes an image, splits it into patches (typically 14x14 or 16x16 pixels), and produces a sequence of patch embeddings. The dominant encoders in 2026 are:
- CLIP ViT-L/14 (Radford et al. 2021). 24 transformer layers, 224 or 336 px inputs. Still common in LLaVA-family models.
- SigLIP (Zhai et al. 2023). Same idea as CLIP but with a sigmoid loss; better data efficiency.
- InternViT-6B-448px (Chen et al. 2023). 6B parameter encoder used by NVLM and InternVL.
- Pixtral-ViT (Mistral). 1B parameter encoder built specifically for variable-resolution inputs.
The output of the encoder is a sequence of vectors, one per patch. A 224x224 image with 14x14 patches produces 256 patch tokens.
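Here is a shape-level sketch in PyTorch of how that patch count comes about. The 1024-dimensional hidden size is illustrative (it matches ViT-L); real encoders also add positional embeddings, an optional CLS token, and a deep transformer stack on top.

```python
import torch

image = torch.randn(1, 3, 224, 224)      # one RGB image, 224 x 224 px
patch_size = 14
num_patches = (224 // patch_size) ** 2   # 16 x 16 = 256 patches

# "Patchify" with a strided convolution, the standard ViT embedding layer.
embed = torch.nn.Conv2d(3, 1024, kernel_size=patch_size, stride=patch_size)
patch_tokens = embed(image).flatten(2).transpose(1, 2)
print(patch_tokens.shape)                # torch.Size([1, 256, 1024])
```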
Step 2: Fusion / Connector Strategies
This is where the design decisions matter. The connector turns visual patch embeddings into something the LLM can read. Three options dominate 2026 architectures.
Option A: Projection-based connector (LLaVA, NVLM, Pixtral)
A small feed-forward network maps each patch embedding into the LLM’s input embedding space, so every patch becomes one soft token. The most common choice is a two-layer MLP with a GELU or ReLU non-linearity (LLaVA, NVLM). Pixtral Large uses a lighter linear projection plus image-token markers. Either way, the projection is small, fast, and easy to train.
- Token count equals patch count. 256 patches means 256 extra tokens in the context.
- Used by LLaVA 1.5+, NVLM 1.0, Pixtral, and most newer open-weight models.
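As a rough sketch of what such a connector looks like (dimensions are illustrative: 1024 for a ViT-L encoder, 4096 for a 7B-class decoder), not the exact code of any of these models:

```python
import torch.nn as nn

class MLPProjector(nn.Module):
    """LLaVA-style two-layer MLP connector (illustrative dimensions)."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings):    # (batch, 256, 1024)
        return self.proj(patch_embeddings)  # (batch, 256, 4096): one soft token per patch
```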
Option B: Q-Former (BLIP-2, InstructBLIP)
A learnable transformer compresses N patch embeddings into K query tokens (K is much smaller than N). For BLIP-2, K=32. The Q-Former is itself a 12-layer transformer trained in two phases.
- Sharply reduces the number of context tokens each image consumes (256 patches become 32 query tokens).
- Adds parameters and a separate training phase.
- Used by BLIP-2, InstructBLIP, and several enterprise vision pipelines.
Reference: Li et al. 2023.
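A minimal sketch of the core idea, a fixed set of learnable queries cross-attending to the patch embeddings. The real Q-Former is a 12-layer BERT-style transformer with self-attention among the queries as well, so treat this as an illustration only:

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """Single cross-attention step of the Q-Former idea (not BLIP-2's real code)."""
    def __init__(self, num_queries=32, dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)

    def forward(self, patch_embeddings):              # (batch, N, dim)
        q = self.queries.expand(patch_embeddings.size(0), -1, -1)
        compressed, _ = self.cross_attn(q, patch_embeddings, patch_embeddings)
        return compressed                              # (batch, 32, dim): N patches -> 32 tokens
```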
Option C: Cross-attention adapter layers (Flamingo, OpenFlamingo)
Insert new cross-attention layers between the existing self-attention layers of a frozen LLM. Each cross-attention layer takes language tokens as queries and visual tokens as keys and values.
- Keeps the LLM frozen entirely, so text-only performance is preserved by construction.
- Adds new parameters for every inserted layer, all of which must be trained from scratch.
- Used by DeepMind Flamingo and the open-source OpenFlamingo.
Reference: Alayrac et al. 2022.
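A minimal sketch of one such block, assuming PyTorch and illustrative dimensions. Flamingo’s actual blocks also include a gated feed-forward sublayer and are interleaved with the frozen LLM’s own layers:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style gated cross-attention (sketch, illustrative dimensions)."""
    def __init__(self, dim=4096, num_heads=32):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0, so the frozen LLM is untouched at init

    def forward(self, text_hidden, visual_tokens):
        # Language tokens are the queries; visual tokens are the keys and values.
        attended, _ = self.cross_attn(text_hidden, visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended
```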
Step 3: The Language Decoder
The decoder is a standard autoregressive text LLM. In 2026, the dominant open-weight choices are:
- Llama 3 / Llama 4 family
- Qwen2 / Qwen3 family
- Mistral / Mixtral family
- Vicuna (an instruction-tuned LLaMA derivative)
The decoder reads the concatenated sequence of projected visual tokens and the user’s text tokens, then generates a response one token at a time. For Flamingo-style cross-attention, the LLM is frozen and unmodified; the cross-attention layers do the integration.
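For the projection-based case, the decoder’s input is just a concatenation. A shape-level sketch follows; many models splice the visual tokens at an image-placeholder position inside the prompt rather than prepending them, but the principle is the same:

```python
import torch

visual_tokens = torch.randn(1, 256, 4096)  # connector output: 256 soft tokens
text_tokens = torch.randn(1, 48, 4096)     # embedded prompt tokens

# The decoder sees one combined sequence and autoregresses over it as usual.
decoder_input = torch.cat([visual_tokens, text_tokens], dim=1)
print(decoder_input.shape)                 # torch.Size([1, 304, 4096])
```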
Reference Architectures
LLaVA (LLaVA-NeXT in 2026)
LLaVA is the canonical open-source vision-language model.
- Vision encoder: CLIP ViT-L/14 (336 px in LLaVA-NeXT, with AnyRes for higher resolutions)
- Connector: Two-layer MLP projector
- Decoder: Vicuna 7B or 13B (originally), with newer LLaVA-NeXT variants on Llama 3 backbones
- Training: Two-stage. Stage 1 trains the projector on CC3M-style image-text pairs. Stage 2 fine-tunes on GPT-generated multimodal instruction data, with both the projector and the LLM unfrozen.
LLaVA’s key contribution was showing that high-quality visual instruction tuning data plus a tiny connector can match much larger proprietary systems on academic benchmarks. The recipe scales to 70B+ backbones.
NVIDIA NVLM 1.0
NVLM 1.0 is NVIDIA’s open-source vision-language model.
- Vision encoder: InternViT-6B-448px-V1-5
- Connector: MLP projector with a tile-tagging mechanism that splits high-resolution images into tiles and tags each tile with positional text (sketched after this list)
- Decoder: Qwen2-72B-Instruct
- Training: Two-stage. Pretrain the projector on diverse VQA, OCR, and reasoning data. SFT unfreezes the LLM while keeping the vision encoder frozen, blending multimodal and high-quality text-only data.
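To make the tile-tagging idea concrete, here is a rough sketch. The tag strings and the 448 px tile size are illustrative, not the exact format NVLM uses; see Dai et al. 2024 for the details.

```python
from PIL import Image

def tile_image(image: Image.Image, tile_px: int = 448):
    """Cut a high-resolution image into tiles, pairing each with a positional text tag."""
    cols, rows = image.width // tile_px, image.height // tile_px
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_px, r * tile_px, (c + 1) * tile_px, (r + 1) * tile_px)
            tag = f"<tile_{r * cols + c + 1}>"   # hypothetical tag string
            tiles.append((tag, image.crop(box)))
    # Each tile is encoded separately; its tag tells the decoder where the tile sits.
    return tiles
```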
NVLM is notable for reporting an average 4.3-point gain on text-only math and coding benchmarks after multimodal training, addressing the catastrophic-forgetting problem. Reference: Dai et al. 2024.
Mistral Pixtral Large
Pixtral Large is Mistral’s 124B parameter open-weight vision-language model.
- Vision encoder: Pixtral-ViT (1B parameters), designed for variable resolution
- Connector: A linear projection from the Pixtral-ViT output into the text decoder, paired with image-token markers; the decoder then treats visual and text tokens in a unified attention layout
- Decoder: 123B Mistral text decoder
- Key tricks: Block-diagonal attention masks and 2D Rotary Position Embedding (RoPE-2D) handle variable image sizes; context length up to 128k tokens.
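To illustrate the block-diagonal mask: when patches from several images (or tiles) are packed into one sequence, each patch should attend only to patches from the same image. A small sketch of how such a boolean mask could be built; Pixtral’s actual implementation differs in detail:

```python
import torch

def block_diagonal_mask(patch_counts):
    """True = allowed to attend. One block per image's patches (sketch only)."""
    total = sum(patch_counts)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in patch_counts:
        mask[start:start + n, start:start + n] = True
        start += n
    return mask

print(block_diagonal_mask([4, 6]).int())   # two images with 4 and 6 patches
```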
Pixtral Large is currently among the strongest open-weight multimodal models on MathVista, DocVQA, and ChartQA.
BLIP-2 / InstructBLIP
BLIP-2 is the reference for Q-Former-based architectures.
- Vision encoder: Frozen pretrained ViT-L/14 (or EVA-ViT)
- Connector: 12-layer Q-Former that compresses patches into 32 query tokens
- Decoder: Frozen Flan-T5 or OPT, depending on the BLIP-2 variant
- Training: Two-phase. Phase 1 trains the Q-Former on image-text representation tasks against the frozen vision encoder. Phase 2 connects the Q-Former to the frozen LLM and trains for vision-to-language generation.
Both the encoder and the decoder are frozen throughout. Only the Q-Former and a few small fusion layers are trained. This gives BLIP-2 strong sample efficiency.
OpenFlamingo
OpenFlamingo is the open-source replication of DeepMind Flamingo.
- Vision encoder: CLIP ViT-L/14
- Connector: Cross-attention adapter layers inserted between self-attention layers in a frozen LLM
- Decoder: Llama, MPT, or other frozen autoregressive LLMs
- Training: Fine-tune the cross-attention layers on interleaved image-text web data (Multimodal C4, LAION-2B).
OpenFlamingo’s freezing strategy preserves the underlying LLM’s text-only capabilities by construction, at the cost of a more complex parameter layout.
Benchmark Snapshot (May 2026)
| Model | VQAv2 | DocVQA | ChartQA | MathVista | MMMU | OCRBench |
|---|---|---|---|---|---|---|
| LLaVA-NeXT 34B | ~83% | ~84% | ~70% | ~46% | ~52% | ~57% |
| NVLM 1.0 (D-72B) | ~80% | ~85% | ~76% | ~65% | ~78% | ~92% |
| Pixtral Large | n/a | ~90% | ~88% | ~85% | ~74% | n/a |
| BLIP-2 (Flan-T5 XXL) | ~65% | n/a | n/a | n/a | n/a | n/a |
| OpenFlamingo-9B | ~52% | n/a | n/a | n/a | n/a | n/a |
Numbers come from each model’s paper or release notes; check vendor sources for the exact configuration and prompting protocol. Cross-model comparisons here are directional only.
How Multimodal Training Avoids Catastrophic Forgetting
The biggest failure mode of naive multimodal fine-tuning is that the LLM loses text-only ability. Two techniques mitigate this in 2026:
- Selective unfreezing. Freeze the vision encoder during SFT; only unfreeze the connector and the LLM. NVLM and LLaVA both follow this.
- Mixed data. Blend high-quality text-only fine-tuning data into the multimodal SFT mix. NVLM reports a 4.3-point gain on text-only math and coding after multimodal training using this recipe.
A third option, used by Flamingo and OpenFlamingo, is to never unfreeze the LLM at all and only train the cross-attention adapters. That preserves text-only performance by construction but limits the model’s ability to learn fundamentally new behaviors. Other recent open-weight releases (Pixtral Large, Qwen-VL, LLaVA-NeXT) document their own variants of the mixed-data approach in their respective papers.
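A schematic sketch of the first two techniques; the submodule names (vision_encoder, connector, llm) are hypothetical and not tied to any specific library:

```python
import random

def configure_sft(model):
    """Selective unfreezing: freeze the vision encoder, train the connector and the LLM."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.connector.parameters():
        p.requires_grad = True
    for p in model.llm.parameters():
        p.requires_grad = True

def sample_example(multimodal_data, text_only_data, text_ratio=0.2):
    """Mixed data: blend high-quality text-only examples into the multimodal SFT mix."""
    source = text_only_data if random.random() < text_ratio else multimodal_data
    return random.choice(source)
```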
Evaluating Multimodal LLMs in Production
Multimodal models are harder to evaluate than text-only models because:
- The input is high-dimensional and not easily diffable
- Hallucinations can be visual (the model invents content not in the image) or textual (the model misreads chart values)
- Standard heuristic metrics (BLEU, ROUGE) miss most failure modes
A practical 2026 evaluation stack:
1. Public benchmarks as a starting baseline: MMMU, MathVista, DocVQA, ChartQA, OCRBench, VQAv2.
2. Task-specific deterministic checks on production data: JSON-schema validation, output length caps, citation regex (a minimal example follows this list).
3. LLM-judge metrics for the open-ended failures: faithfulness to the image, instruction adherence, tone, safety.
4. Tracing so every score is tied to a replayable run.
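The deterministic checks in step 2 are cheap to run before any LLM judge. A minimal sketch of a JSON-schema check with an output length cap; the schema and limits are illustrative, not from any particular pipeline:

```python
import json
from jsonschema import ValidationError, validate

ANSWER_SCHEMA = {
    "type": "object",
    "properties": {"answer": {"type": "string"}, "confidence": {"type": "number"}},
    "required": ["answer"],
}

def deterministic_checks(raw_output: str, max_chars: int = 2000) -> bool:
    if len(raw_output) > max_chars:                      # output length cap
        return False
    try:
        validate(json.loads(raw_output), ANSWER_SCHEMA)  # JSON-schema validation
    except (json.JSONDecodeError, ValidationError):
        return False
    return True
```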
Future AGI fits at steps 3 and 4 as the eval and observability companion. The evaluate() function in ai-evaluation ships built-in metrics like faithfulness and instruction_adherence that work on multimodal outputs, and traceAI captures the prompt, image, output, and score as a span for replay.
```python
from fi.evals import evaluate

# Run the built-in instruction_adherence metric on a single multimodal turn.
result = evaluate(
    "instruction_adherence",
    output="<the model's text answer>",
    input="<the user's text prompt and a description of the image>",
)
print(result.score, result.explanation)
```
For custom rubrics (for example, “score whether the model correctly read the chart legend”), use a custom LLM-judge:
```python
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
from fi.opt.base import Evaluator

# The judge prompt defines the rubric; the provider supplies the judge model.
provider = LiteLLMProvider(model="gpt-5-2025-08-07")
judge = CustomLLMJudge(
    name="chart_legend_read",
    prompt=(
        "Given a chart description and a model answer, score 0 or 1 "
        "whether the model correctly identified the legend labels. "
        "Respond with JSON: {\"score\": int, \"reason\": string}."
    ),
    provider=provider,
)
evaluator = Evaluator(judge)
score = evaluator.evaluate(output="<the model's answer>")
print(score)
```
The turing_flash hosted judge returns in about 1 to 2 seconds, turing_small in 2 to 3 seconds, and turing_large in 3 to 5 seconds. Configure with FI_API_KEY and FI_SECRET_KEY, and open the runs at /platform/monitor/command-center.
Open-Source Multimodal LLMs and Why They Matter
Open-weight multimodal models matter because:
- They let researchers and engineers reproduce results without relying on a closed API.
- They allow domain-specific fine-tuning (medical imaging, satellite imagery, manufacturing defect detection) without sending images to a third party.
- They are the only option for air-gapped or compliance-bound environments.
The trade-off is engineering overhead: serving a 72B or 124B vision-language model requires multi-GPU inference and careful batching. The frontier API products (GPT-5, Claude Opus 4.7, Gemini 3) are easier to start with; open-weight models become attractive once you have a steady production workload to amortize the serving cost.
Related reading
- Gemini 2.5 Pro in May 2026: pricing, benchmarks, retirement status, and whether to upgrade to Gemini 3.1 Pro for new builds. With migration checklist.
- OpenAI AgentKit (Oct 2025) + Future AGI in 2026: visual builder, traceAI auto-instrumentation, fi.evals scoring, BYOK gateway. Real code, real APIs, no hype.
- Cut LLM costs 30% in 90 days. 2026 playbook on model routing, caching, BYOK gateways, cost tracking. Includes best LLM cost-tracking tools.
Frequently asked questions
What is a multimodal large language model?
How do vision encoders connect to a language model?
What is cross-attention in a multimodal model?
What is the difference between MLP projectors and Q-Formers?
How are multimodal LLMs trained?
Why do some multimodal models lose text-only performance after vision training?
How do I evaluate a multimodal LLM in 2026?
What changed in multimodal LLMs between 2024 and 2026?