What Is the Phare Safety Benchmark?
A multilingual LLM safety benchmark covering harmful generation, bias, factual reliability, and instruction-following on realistic prompts.
What Is the Phare Safety Benchmark?
Phare is a multilingual safety benchmark for large language models, released by Giskard and collaborators. It evaluates four safety dimensions on realistic, user-style prompts: harmful generation, bias, factual reliability, and instruction-following. The design goal is to reflect what users actually send rather than only adversarial probes, so a passing score is closer to a real “is this model safe to ship” signal. It belongs to the compliance and safety family. FutureAGI does not host Phare, but exposes runtime evaluators that score the same dimensions on production traces.
Why Phare Matters in Production LLM and Agent Systems
A team picking a base model in 2026 has dozens of options across closed and open weights. Comparing them on raw capability benchmarks (MMLU, GSM8K) misses the safety side. Phare gives an apples-to-apples view of how models behave on realistic safety-sensitive prompts in multiple languages — important for any product shipping outside English-only markets.
The pain shows up when a team chooses on capability alone. A model that aces coding benchmarks may still produce biased recommendations in non-English prompts, hallucinate factual answers about regulated topics, or fail to follow refusal instructions when asked politely in a roleplay framing. Engineers see this as inconsistent behavior across user segments. Compliance teams cannot certify a model on capability scores alone — they need a safety scorecard.
In 2026-era agentic stacks, the safety surface widens. A multi-step agent may use one model for planning and another for response generation; each must pass safety checks. Phare’s multilingual coverage matters more for agents that handle voice, search, and tool-calling across languages. Skipping the safety benchmark means the team rediscovers known failure modes on their own users.
How FutureAGI Works With Phare
Phare is a static benchmark; FutureAGI is the runtime that takes the same prompts into production. The honest connection is to load Phare prompts into fi.datasets.Dataset and run them against the candidate model with FutureAGI’s safety evaluators attached. The named anchors are Toxicity, BiasDetection, ContentSafety, and FactualAccuracy — all available through fi.evals. A pre-guardrail configured in Agent Command Center can use the same evaluators inline at runtime.
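A minimal sketch of that loop, assuming the Phare prompts have been exported to a local JSONL file (`phare_prompts.jsonl` is a placeholder path) and that `get_model_response` stands in for whatever client serves the candidate model; the `evaluate(output=...)` pattern mirrors the snippet later on this page, and some evaluators such as `FactualAccuracy` may also expect a reference or context input:

```python
import json

from fi.evals import Toxicity, BiasDetection, ContentSafety, FactualAccuracy

# Placeholder path: a local export of the Phare prompt set, one JSON object
# per line with at least "prompt" and "language" fields.
PHARE_PROMPTS = "phare_prompts.jsonl"


def get_model_response(prompt: str) -> str:
    """Stand-in for the candidate model's client; not part of the fi SDK."""
    raise NotImplementedError("call your candidate model here")


# One FutureAGI evaluator per Phare dimension named above.
evaluators = {
    "toxicity": Toxicity(),
    "bias": BiasDetection(),
    "content_safety": ContentSafety(),
    "factual_accuracy": FactualAccuracy(),
}

results = []
with open(PHARE_PROMPTS) as f:
    for line in f:
        row = json.loads(line)
        response = get_model_response(row["prompt"])
        scores = {name: ev.evaluate(output=response).score for name, ev in evaluators.items()}
        results.append({"prompt": row["prompt"], "language": row["language"], **scores})
```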
Real example: a multilingual support team is comparing Llama 3.1 70B, Mistral Large, and Claude Sonnet 4 for a regulated customer-service deployment. They load the Phare prompt set into a Dataset, attach Toxicity, BiasDetection, ContentSafety, and IsHarmfulAdvice, and run all three models against it. FutureAGI returns per-model, per-dimension, per-language scorecards. The Llama variant performs well in English but trails on Spanish bias; the Claude variant is consistent. The team picks Claude for the Spanish-speaking cohort and Llama for English (with stricter pre-guardrails). Compared with reading the Phare leaderboard alone, this approach ties the benchmark to the team’s actual eval pipeline.
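The scorecard step is ordinary aggregation rather than a FutureAGI API. A sketch, assuming each candidate has been run through a loop like the one above and its rows carry a language plus one column per evaluator score (pandas here is just one convenient way to pivot them):

```python
import pandas as pd


def build_scorecard(results_by_model: dict[str, list[dict]]) -> pd.DataFrame:
    """results_by_model maps a candidate name to the `results` list from the loop above."""
    frames = []
    for model_name, rows in results_by_model.items():
        df = pd.DataFrame(rows)
        df["model"] = model_name
        frames.append(df)
    # Mean evaluator score per model and language; each numeric column is one
    # Phare dimension (toxicity, bias, content safety, factual accuracy).
    return pd.concat(frames).groupby(["model", "language"]).mean(numeric_only=True)
```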
After release, the same evaluators run on production traces, so a bias regression surfaces in the same dashboard as the benchmark scorecard.
How to Measure or Detect It
Treat Phare as a release-gate signal plus a regression seed.
- Per-dimension score — harm, bias, factuality, instruction-following — scored with `Toxicity`, `BiasDetection`, `FactualAccuracy`, and `PromptAdherence`.
- Per-language score — break results down by language; multilingual coverage is the point of Phare.
- Failure cohort — build a regression dataset from failed Phare prompts so they re-run on every release (a sketch follows the snippet below).
- Trace fields — log `model.version`, `prompt.version`, `language`, and `evaluator.score` per span, so live regressions tie back to the benchmark.
- Guardrail trigger rate — how often a `pre-guardrail` blocks a Phare-style prompt at runtime; a rising rate may indicate prompt-set drift.
A minimal per-response check with two of those evaluators:

```python
from fi.evals import Toxicity, BiasDetection

# model_response is the candidate model's answer to a single Phare prompt.
model_response = "..."  # placeholder; supply the output you want to score

tox = Toxicity().evaluate(output=model_response)
bias = BiasDetection().evaluate(output=model_response)
print(tox.score, bias.score)
```
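To seed the failure cohort and trace fields described above, one minimal approach, assuming a `results` list shaped like the loop earlier on this page and a pass threshold you pick yourself (0.7 below is arbitrary), is to pin the failing rows together with version tags so they re-run on every release:

```python
import json

PASS_THRESHOLD = 0.7  # arbitrary cut-off; tune per evaluator and per policy

META_FIELDS = {"prompt", "language"}

# Keep every Phare row where any safety dimension scored below the threshold,
# along with the fields needed to tie live regressions back to the benchmark.
failures = []
for row in results:
    scores = {k: v for k, v in row.items() if k not in META_FIELDS}
    if any(v < PASS_THRESHOLD for v in scores.values()):
        failures.append({
            "prompt": row["prompt"],
            "language": row["language"],
            "model.version": "candidate-2026-01",  # placeholder version tag
            "prompt.version": "phare-export-1",    # placeholder prompt-set tag
            "evaluator.score": scores,
        })

# Persist as a versioned regression seed; re-run this file on every release.
with open("phare_regression_seed.jsonl", "w") as f:
    for item in failures:
        f.write(json.dumps(item) + "\n")
```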
Common Mistakes
- Reading the leaderboard once. Models update; rerun the benchmark on each model version.
- Only running English prompts. Phare’s multilingual coverage is its differentiator; stripping it loses the value.
- Treating Phare as exhaustive. It complements HarmBench and AgentHarm; it does not replace adversarial testing.
- Skipping the regression dataset. Phare prompts that fail today should re-run on every release until they pass, with the failing rows pinned in a versioned `Dataset`.
- Ignoring instruction-following. Refusal and policy compliance are part of the benchmark for a reason; capability scores alone do not certify a model.
Frequently Asked Questions
What is the Phare safety benchmark?
Phare is a multilingual LLM safety benchmark that evaluates harmful generation, bias, factual reliability, and instruction-following on realistic prompts, designed to be closer to real user behavior than purely adversarial test sets.
How is Phare different from HarmBench or SafetyBench?
HarmBench focuses on adversarial jailbreak attempts; SafetyBench is a multiple-choice safety knowledge benchmark. Phare emphasizes realistic, multilingual prompts spanning generation harm, bias, factuality, and instruction-following.
How do you use Phare with FutureAGI?
Load Phare prompts into `fi.datasets.Dataset`, attach `Toxicity`, `BiasDetection`, `ContentSafety`, and `FactualAccuracy`, and compare per-model scores. Use the failures as regression seeds for runtime guardrails.