
Mistral Small 3.1 in 2026: Benchmarks, Lineup, and How It Compares to GPT-5, Claude Opus 4.7, and Gemini 2.5

Mistral Small 3.1 in May 2026: 128k context, vision, 80.6% MMLU, Apache 2.0. Plus where Small 3.2, Medium 3, and Mistral Large 2 fit into the lineup.


Mistral Small 3.1 in May 2026: TL;DR

Field | Value
--- | ---
Model | Mistral Small 3.1 (24B parameters, instruct + base)
Released | March 17, 2025 (Small 3.2 update June 2025)
License | Apache 2.0 (weights on Hugging Face)
Context window | 128,000 tokens
Modalities | Text + image input, text output
Headline benchmarks | MMLU 80.6%, HumanEval ~88%, GSM8K ~69%
Hardware floor | Single RTX 4090 (24 GB) with quantization, or 32 GB Mac
Best fit in May 2026 | On-prem RAG, chat, function calling, edge deployment
Where it slots in the lineup | Below Mistral Medium 3 and Mistral Large 2; Devstral 2507 reuses the Small 3.1 base for code

Mistral Small 3.1 shipped in March 2025 and is still, in May 2026, the open-weight model of choice for teams that want strong multilingual + vision capabilities under Apache 2.0 on a single GPU. This guide covers the lineup as it stands today, fresh benchmarks, hardware setup, and how Small 3.1 stacks up against frontier 2026 models.

What is new in Mistral Small 3.1

Mistral Small 3.1 is the 24-billion-parameter instruct model released on Hugging Face under Apache 2.0. Compared with Small 3 (24B, January 2025) it adds:

  • Multimodal vision input. A Pixtral-style encoder is fused into the decoder. The model reads images and answers questions about them, including OCR, document understanding, and chart reading.
  • 128,000-token context window. Up from 32k in Small 3. Matches GPT-4o Mini and is around two-thirds of Claude Sonnet 4’s 200k window.
  • Stronger English and multilingual reasoning. The Mistral blog reports 80.6% on MMLU (5-shot), up from Small 3’s 79.0%, and improved coverage on French, German, Spanish, Italian, Portuguese, Russian, Chinese, Japanese, Korean, Hindi, and Arabic.
  • Native function calling and JSON mode. The model accepts a function schema and emits structured tool calls, which makes it agent-ready in LangGraph, CrewAI, and the OpenAI Agents SDK (a tool-call sketch follows this list).
  • Faster inference. Mistral claims around 150 tokens per second on optimized inference stacks at the 24B size, broadly competitive with Gemma 3 27B and GPT-4o Mini.
  • Apache 2.0 weights. Free commercial use, modification, and redistribution. This is the single biggest reason teams choose Small 3.1 over GPT-4o Mini or Claude 3.5 Haiku for self-hosted production.
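
Here is a minimal sketch of the tool-calling flow, assuming Small 3.1 is served behind an OpenAI-compatible endpoint (for example vLLM on localhost); the endpoint URL and the get_weather tool are illustrative assumptions, not part of the release:

from openai import OpenAI

# Point the standard OpenAI client at a local OpenAI-compatible server (assumed URL).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Hypothetical tool schema, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
)

# The model emits a structured tool call instead of free text.
print(response.choices[0].message.tool_calls)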

For the canonical release post, see Mistral’s Small 3.1 announcement and the Hugging Face model card.

How Mistral Small 3.1 differs from earlier Mistral models

Feature | Mistral 7B (2023) | Mixtral 8x7B (2023) | Mistral Small 3.1 (2025)
--- | --- | --- | ---
Parameters | 7B dense | 47B MoE (12.9B active) | 24B dense
Context | 8k (sliding window) | 32k | 128k
Modality | Text | Text | Text + vision
License | Apache 2.0 | Apache 2.0 | Apache 2.0
Hardware floor | 1x 16 GB GPU | 2x A100 or quantized | 1x RTX 4090 quantized
Function calling | No | Community add-on | Native

Mistral 7B is still useful for tight memory budgets, but Small 3.1 is the right open-weight upgrade for anything that needs vision, long context, function calling, or improved multilingual coverage.

Mistral Small 3.1 vs GPT-5, Claude Opus 4.7, Gemini 2.5 Pro, and Llama 4

The frontier closed models in May 2026 are GPT-5 (released August 2025), Claude Opus 4.7 (Anthropic’s flagship), and Gemini 2.5 Pro (Google’s reasoning model). Meta’s Llama 4 family is the strongest open-weight competitor.

Model | Type | Context | Vision | Best benchmark area | Pricing direction
--- | --- | --- | --- | --- | ---
Mistral Small 3.1 | Open weights, Apache 2.0 | 128k | Yes | Multilingual + on-prem RAG | Free weights, you pay for hardware
Llama 4 Maverick | Open weights, Llama Community License | 1M | Yes | Long-context reasoning | Free weights, you pay for hardware
Mistral Medium 3 | Closed weights, API | 128k | Yes | Enterprise reasoning | Mid-tier API pricing
GPT-5 | Closed weights, API | 400k input | Yes | Reasoning, agents, code | Frontier-tier pricing
Claude Opus 4.7 | Closed weights, API | 200k | Yes | Coding (SWE-bench), tool use | Frontier-tier pricing
Gemini 2.5 Pro | Closed weights, API | 1M+ | Yes | Long-context QA, math | Mid-to-high tier

What this means in practice:

  • Total cost of ownership. Small 3.1 on your own GPU is the cheapest option once you hit roughly 50 to 100 million tokens per month. Below that, the closed APIs are usually cheaper after engineering and on-call cost.
  • Reasoning ceiling. On GPQA, AIME 2025, and SWE-bench Verified, GPT-5 and Claude Opus 4.7 are clearly ahead. Small 3.1 is competitive on MMLU, HumanEval, and IFEval but falls behind on multi-step reasoning.
  • Vision quality. All of these models read images. Closed frontier models are stronger on dense document VQA and chart reading; Small 3.1 covers the common cases.
  • Latency. Small 3.1 has the lowest time-to-first-token on local hardware. Frontier APIs vary by region and load.
  • Data residency. Small 3.1 is the only option here that lets you keep all data inside a VPC with no third-party API call.

Hardware and runtime setup

Single GPU local

Mistral states Small 3.1 fits on a single NVIDIA RTX 4090 (24 GB VRAM) with 4-bit or 5-bit quantization through GGUF, AWQ, or GPTQ. Real-world memory usage for a 128k-context request will be higher than the base weight footprint; budget 22 to 23 GB at long context. Reference: the Mistral-Small-3.1-24B-Instruct-2503 model card on Hugging Face.
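
As a sketch of the single-GPU path, this loads a 4-bit GGUF build with llama-cpp-python; the file name and quant level are assumptions, and any community GGUF conversion of Small 3.1 loads the same way:

from llama_cpp import Llama

# Model path and quant level are assumptions; substitute your local GGUF file.
llm = Llama(
    model_path="./Mistral-Small-3.1-24B-Instruct-2503-Q4_K_M.gguf",
    n_ctx=32768,      # raise toward 128k only if VRAM allows the KV cache
    n_gpu_layers=-1,  # offload all layers to the GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize Apache 2.0 in one sentence."}]
)
print(out["choices"][0]["message"]["content"])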

Apple Silicon

On a 32 GB Mac (M2 Max, M3 Max, M4 Max) the MLX and llama.cpp builds run Small 3.1 at usable speeds. Mistral has tested the 32 GB MacBook configuration. Expect 20 to 35 tokens per second on M3 Max with 4-bit quantization.
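
A minimal sketch of the MLX path via the mlx-lm package; the community 4-bit conversion name is an assumption, so substitute whichever MLX quant you actually pull:

from mlx_lm import load, generate

# Repo name is an assumption; any 4-bit MLX conversion of Small 3.1 works.
model, tokenizer = load("mlx-community/Mistral-Small-3.1-24B-Instruct-2503-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Give me three RAG chunking tips."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))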

Multi-GPU and production serving

For full BF16 you need ~50 GB of GPU memory. Common configurations:

  • Two A100 40 GB or two L40S 48 GB with tensor parallel.
  • One H100 80 GB with room for batched serving.
  • Hugging Face text-generation-inference (TGI) for batched HTTP serving with quantization, paged attention, and OpenAI-compatible endpoints.
  • vLLM for the highest throughput with paged KV cache and continuous batching. Mistral officially recommends vLLM for Small 3.1 (a minimal offline-inference sketch follows this list).
  • DeepSpeed ZeRO-Inference for memory-efficient sharding when you cannot fit the model on available accelerators.
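
A minimal vLLM sketch for batched offline inference; the Mistral-format flags follow the Hugging Face model card's recommendation, but verify the exact arguments against your installed vLLM version:

from vllm import LLM, SamplingParams

# Mistral-format flags per the model card; check them against your vLLM version.
llm = LLM(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    tokenizer_mode="mistral",
    config_format="mistral",
    load_format="mistral",
)

params = SamplingParams(max_tokens=256, temperature=0.2)
outputs = llm.chat(
    [{"role": "user", "content": "Explain paged KV cache in two sentences."}],
    params,
)
print(outputs[0].outputs[0].text)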

Inference cost benchmark

On a single H100 80 GB with vLLM and BF16, Mistral Small 3.1 serves roughly 1,500 to 2,500 input tokens per second and 100 to 200 generated tokens per second per request, depending on context length and batch size. At common single-H100 rental rates (around $2 to $3 per hour on Lambda Labs, RunPod, or CoreWeave at the time of writing) this typically works out to a small fraction of a dollar per million generated tokens at high batch utilization. Compare with GPT-4o Mini (commonly $0.15 per million input, $0.60 per million output) and Claude Haiku 3.5 (similar) for break-even sizing.
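
The break-even arithmetic is easy to sanity-check; the hourly rate and aggregate throughput below are assumptions picked from the ranges above:

# Back-of-envelope cost per million generated tokens on a rented H100.
# hourly_rate and throughput are assumptions drawn from the ranges quoted above.
hourly_rate = 2.50            # USD per H100 hour
gen_tokens_per_sec = 1_000    # aggregate across a well-utilized batch

tokens_per_hour = gen_tokens_per_sec * 3600            # 3.6M tokens per hour
cost_per_million = hourly_rate * 1_000_000 / tokens_per_hour
print(f"~${cost_per_million:.2f} per million generated tokens")  # ~$0.69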

Real-world performance across six task areas

Knowledge QA and reasoning

Small 3.1 scores 80.6% on 5-shot MMLU per Mistral’s release notes. That puts it close to GPT-4o Mini and Claude Haiku 3.5 on textbook QA. On harder reasoning benchmarks (GPQA Diamond, AIME 2025) the model lags GPT-5 and Claude Opus 4.7 by a wide margin.

Mathematical problem solving

Around 69% on GSM8K, per public reporting. Useful for grade-school and basic word problems. For Olympiad-level math (AIME, HMMT, Putnam) the model is not the right tool; route those through a frontier model or a reasoning-specialist sibling like Magistral Small.

Coding and code generation

Roughly 88% on HumanEval (Python). For real engineering tasks measured on SWE-bench Verified the gap to Claude Opus 4.7 and GPT-5 is large. For routine code completion, lint fixes, and snippet generation across Python, JavaScript, C++, and Go, Small 3.1 is a solid free option. Devstral 2507 is the dedicated agentic-coding sibling tuned on the same base.

Natural language generation

Summarization, drafting, emails, and story writing all work well. The instruction-following improvements in Small 3.2 are noticeable here.

Multimodal tasks

Small 3.1 is one of the strongest open-weight vision-language models. It handles document OCR, chart reading, screen UI extraction, and basic visual QA. For research-grade document understanding (DocVQA, ChartQA at the frontier) closed frontier models are still ahead.

Multilingual applications

Around 71% average accuracy across non-English benchmarks, including French, German, Spanish, Italian, Portuguese, Russian, Chinese, Japanese, Korean, Hindi, and Arabic. Strong choice for non-English on-prem workloads.

Best use cases for Mistral Small 3.1 in 2026

Enterprise knowledge assistants and RAG

128k context plus Apache 2.0 weights make Small 3.1 a strong RAG backbone for internal corporate assistants. Pair with a dense embedding model (e.g., bge-large or jina-embeddings-v3) and a vector store inside your VPC.
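
As a compact sketch of that pairing: a dense embedder plus in-memory cosine search feeding a locally served Small 3.1. The bge model and local endpoint are assumptions, and a real deployment would swap the list for a vector store:

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# Embed documents once; normalized vectors make dot product = cosine similarity.
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
docs = ["Policy A: laptops are encrypted.", "Policy B: logs are kept 90 days."]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

query = "How long are logs retained?"
q_vec = embedder.encode([query], normalize_embeddings=True)[0]
top_doc = docs[int(np.argmax(doc_vecs @ q_vec))]

# Generate with Small 3.1 behind a local OpenAI-compatible endpoint (assumed URL).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
answer = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    messages=[{"role": "user",
               "content": f"Context:\n{top_doc}\n\nQuestion: {query}"}],
)
print(answer.choices[0].message.content)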

Multimodal applications

Document review, invoice extraction, screenshot-to-text, chart reading. Small 3.1 is the default open-weight choice here.

Software development copilot

Use Small 3.1 or Devstral 2507 inside your IDE for line and block completion, code explanation, and refactoring suggestions. Pair with a higher-tier model (GPT-5 or Claude Opus 4.7) for the agentic SWE-bench-style tasks.

Virtual assistants and chatbots

Customer support deflection, internal helpdesk, personal-organizer bots. The Apache 2.0 license matters here for regulated workloads.

Reasoning engines for AI agents

Native function calling and JSON output make Small 3.1 usable as the planner inside LangGraph, CrewAI, the OpenAI Agents SDK, the Microsoft Agent Framework, and Mastra. For deeper multi-step reasoning, run Small 3.1 as the cheap planner and route hard sub-tasks to a frontier model. See our agentic AI frameworks guide for orchestration choices.
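
One way to wire that split, sketched with plain OpenAI-compatible clients; the EASY/HARD heuristic, endpoints, and model names are assumptions, and production routers usually let the planner pick a route via a tool call instead:

from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
frontier = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(task: str) -> str:
    # Small 3.1 triages and handles the easy cases itself.
    draft = local.chat.completions.create(
        model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
        messages=[{"role": "user", "content":
                   f"Reply 'HARD' if this needs multi-step reasoning, "
                   f"otherwise answer directly:\n{task}"}],
    ).choices[0].message.content
    if draft.strip().startswith("HARD"):
        # Escalate multi-step reasoning to a frontier model (name is a placeholder).
        draft = frontier.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": task}],
        ).choices[0].message.content
    return draft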

On-device AI

The 4-bit quantized Small 3.1 runs on consumer laptops with 16 to 32 GB RAM. This is the most interesting open option for offline, privacy-bound deployments on developer workstations.

Cost-conscious alternative to closed APIs

Once your monthly call volume crosses ~50 to 100 million tokens, self-hosted Small 3.1 on rented or owned GPUs is often the cheapest option. Below that, GPT-4o Mini and Claude Haiku 3.5 are usually cheaper after engineering and on-call.

How to evaluate Mistral Small 3.1 against frontier models on your own data

Public benchmarks correlate weakly with your specific workload. The disciplined approach in May 2026 is:

  1. Curate 200 to 500 representative examples from production logs, anonymized.
  2. Define metrics. For each example, decide if you score with exact match, JSON-schema validity, regex match, structural correctness, or LLM-as-judge for open-ended quality.
  3. Run the same dataset through all candidate models (Small 3.1, Small 3.2, Mistral Medium 3, GPT-5, Claude Opus 4.7, Gemini 2.5 Pro); a minimal harness loop is sketched after this list.
  4. Score with a consistent grader. Using a single frontier model as the LLM judge keeps open-ended scores comparable across candidates; deterministic checks cover the structured parts.
  5. Break down by intent or domain, not just overall accuracy. Small 3.1 may match GPT-5 on routine RAG but fall apart on tool-call orchestration.
  6. Track regressions over time. Vendors update closed APIs silently; pinning weights for Small 3.1 is the only way to lock baseline performance.
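
A minimal harness loop for steps 3 and 4, assuming every candidate is reachable through an OpenAI-compatible endpoint (true for vLLM-served Small 3.1 and most gateway setups); the endpoint map and model names are placeholders:

from openai import OpenAI

# Candidate name -> (client, model id). Endpoints and names are assumptions;
# any OpenAI-compatible gateway slots in the same way.
candidates = {
    "small-3.1": (
        OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed"),
        "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    ),
    "gpt-5": (OpenAI(), "gpt-5"),  # placeholder frontier candidate
}

examples = [{"id": 1, "prompt": "..."}]  # your 200 to 500 curated examples

results = {}
for name, (client, model_id) in candidates.items():
    results[name] = [
        client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": ex["prompt"]}],
        ).choices[0].message.content
        for ex in examples
    ]
# Hand `results` to deterministic checks plus a single LLM judge for scoring.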

Future AGI’s open-source ai-evaluation library (Apache 2.0) and the cloud evals API ship faithfulness, groundedness, instruction following, and custom LLM-judge metrics that work across all providers, so the same eval harness scores Small 3.1, Mistral Medium 3, GPT-5, and Claude Opus 4.7 on identical examples.

from fi.evals import evaluate

# Score an answer against retrieved context for faithfulness.
result = evaluate(
    "faithfulness",
    output="Mistral Small 3.1 supports a 128,000 token context window.",
    context="Mistral Small 3.1 ships with a 128k context window and native vision.",
)
print(result.score, result.reason)

For trace-level observability of Small 3.1 calls in agentic stacks, the Apache 2.0 traceAI library (OpenTelemetry-compatible) is the companion you wire in once and reuse across LangGraph, CrewAI, Mistral Medium 3, and the OpenAI Agents SDK.

from fi_instrumentation import register, FITracer

tracer_provider = register(project_name="mistral-small-rag-prod")
tracer = FITracer(tracer_provider)

with tracer.start_as_current_span("rag.query") as span:
    span.set_attribute("model", "mistral-small-3.1-24b-instruct")
    # ... your retrieval + generation logic

For BYOK gateway routing, prompt versioning, and live guardrails across Mistral, OpenAI, Anthropic, and Google, the Future AGI Agent Command Center sits in front of all four providers, configured with your Future AGI credentials (FI_API_KEY and FI_SECRET_KEY) plus your existing provider keys.

How Mistral Small 3.1 fits in the 2026 open-weight stack

Small 3.1 sits alongside the other major open-weight model families worth comparing in 2026:

  • Mistral Small 3.1 / 3.2: Apache 2.0, 24B, vision, 128k context, broad multilingual coverage.
  • Meta Llama 3.3 70B and Llama 4 family: Llama Community License, stronger reasoning at the cost of higher memory.
  • Qwen 3 (Alibaba): Tongyi Qianwen License (close to Apache 2.0 for most uses), leading on Chinese and bilingual benchmarks, strong on math.

For a head-to-head with the rest of the 2026 open-weight landscape, see our open-source LLMs guide. For frontier closed-model comparisons in May 2026, see best LLMs in May 2026.

Bottom line

Mistral Small 3.1 remains, in May 2026, the open-weight 24B model worth running when you need vision, 128k context, function calling, and Apache 2.0 licensing on a single GPU. Small 3.2 is a drop-in instruct upgrade. Above it, Mistral Medium 3 and Mistral Large 2 cover the closed-weight enterprise tier, while GPT-5, Claude Opus 4.7, and Gemini 2.5 Pro hold the reasoning frontier. The right comparison is not “is Small 3.1 the best model in 2026” but “what is the cheapest model that meets my quality bar on my own data,” and that question has to be answered with evaluation, not vendor blog posts.

Frequently asked questions

Is Mistral Small 3.1 still relevant in May 2026?
Yes. Mistral Small 3.1 (24B, March 2025) remains the most-downloaded model in Mistral's open lineup on Hugging Face and is still actively used for on-prem chatbots, RAG, and agent reasoning. The June 2025 Small 3.2 update added better instruction following at the same model size, and Devstral 2507 reused the Small 3.1 base for code. For brand-new deployments in 2026, evaluate Small 3.2 first, but the 3.1 release notes, benchmarks, and recipes you find online still mostly apply.
What hardware do you need to run Mistral Small 3.1 locally?
Mistral states the 24B Small 3.1 model fits on a single NVIDIA RTX 4090 (24 GB VRAM) with quantization, or on a 32 GB RAM MacBook once quantized to GGUF or MLX formats. For full-precision BF16 you need ~50 GB of GPU memory, typically two A100 40 GB cards or one H100 80 GB. vLLM, llama.cpp, and Ollama all ship working configurations.
Is Mistral Small 3.1 open source, and what license applies?
Yes. Mistral Small 3.1 base and instruct weights are released on Hugging Face under the Apache 2.0 license, which permits free commercial use, modification, and redistribution with no royalty. This matches the Small 3 (24B) and Small 3.2 releases. Mistral Medium 3 and Mistral Large 2 (123B) are NOT Apache 2.0: they ship under the Mistral Research License or a commercial license and require a paid agreement for production use.
How does Mistral Small 3.1 compare to GPT-5, Claude Opus 4.7, and Gemini 2.5 Pro?
Mistral Small 3.1 is a 24B open-weight model that lands well below frontier closed models on reasoning benchmarks (GPQA, AIME, SWE-bench) but stays close on MMLU (80.6%), HumanEval (~88%), and instruction following. The trade-off is total cost of ownership: a single RTX 4090 or A100 runs Small 3.1 at low marginal cost, while GPT-5 and Claude Opus 4.7 charge per-token. Use Small 3.1 for high-volume on-prem workloads with bounded difficulty; use frontier models when answer quality matters more than throughput.
What is the difference between Mistral Small 3, 3.1, and 3.2?
Mistral Small 3 (Jan 2025, 24B) introduced the 24B Apache 2.0 base with 32k context. Small 3.1 (March 2025) added vision input, 128k context, multilingual coverage, function calling, and pushed MMLU to 80.6%. Small 3.2 (June 2025) is an instruct-only update on the Small 3.1 base with better instruction following, less repetition, and more reliable tool calls, but the same architecture and same hardware footprint.
Does Mistral Small 3.1 support vision and multimodal inputs?
Yes. Small 3.1 is the first Mistral model to ship native vision-language support. You can pass an image plus text in the same prompt for OCR, document understanding, chart reading, and visual QA. The vision pathway uses a Pixtral-style encoder fused into the 24B decoder, and the model returns text only (no image generation).
How do you evaluate Mistral Small 3.1 against frontier models on your own data?
Don't trust public benchmarks alone. Run a controlled side-by-side on 200 to 500 of your own examples, score with deterministic checks (exact match, regex, JSON validity) plus an LLM-as-judge for open-ended quality, and break down by intent or domain. Future AGI's Apache 2.0 ai-evaluation library and the cloud evals (faithfulness, groundedness, instruction following) give you a comparable score across Small 3.1, GPT-5, Claude Opus 4.7, and Gemini 2.5 Pro on the same dataset.
Should I deploy Mistral Small 3.1, Mistral Medium 3, or Mistral Large 2 for production?
Pick Small 3.1 (or 3.2) when latency, cost, and data residency dominate: it is Apache 2.0, fits one GPU, and handles most chat, RAG, and tool-use workloads. Pick Mistral Medium 3 (May 2025, closed-weight via Mistral API and partners) when you need stronger reasoning and 128k context without operating the model yourself. Pick Mistral Large 2 (123B, July 2024 release, refreshed July 2025) when frontier-tier reasoning matters and the Mistral Research License or a commercial agreement covers your use case.