Frontier Model Safety Analysis (2026): How Anthropic, OpenAI, and DeepMind Frame the Risk, and Where Enterprises Actually Live

Anthropic RSP, OpenAI Preparedness, DeepMind FSF: where the three frontier-lab safety frameworks converge, where they diverge, and how enterprises operationalize them at the app layer.


Anthropic published the Responsible Scaling Policy. OpenAI published the Preparedness Framework. Google DeepMind published the Frontier Safety Framework. All three are voluntary, lab-specific commitments. They govern what the labs ship to you, not what you ship from there. Your DPO reads them; your auditors map them; your application still has to decide what controls run when a customer types a prompt at 3 a.m.

This is a practitioner’s reading as of mid-2026: what each framework says, where they converge, where they diverge, and the gap between the labs’ frame and the layer you operate in. The point is not to score “which framework is best.” The point is to show which exit criteria map best to the controls you would actually run, and where your eval stack picks up after the lab’s framework stops.

The era shift: from refusal rates to threshold-and-response

Through most of 2023, “safety” was a refusal-rate number against a held-out harmful-prompt set. Anthropic’s RSP (first published September 2023, now v3.0 effective February 24, 2026) introduced the structural move the other labs followed: define a capability threshold, commit in advance to evaluations that detect when you approach it, and commit to a response when you cross it. OpenAI’s Preparedness Framework (v2, April 15, 2025) and Google DeepMind’s Frontier Safety Framework (v3.0, April 17, 2026) now share the same shape.

Get this shape, and “which lab is safest” stops being the useful question. All three frameworks assume capability will keep climbing. All three pre-commit a response when a model crosses a threshold. The thresholds differ. The taxonomy differs. The response differs. The shape does not.

The shape is also useful at the application layer. The analogue is exact: define what a dangerous output looks like for your domain, commit to evaluations that detect it, commit to a runtime response (block, mask, route, escalate) when the eval fires. Labs run this loop at the model-capability level. You run it at the deployed-application level. The frameworks teach the discipline; the stack runs it on your traffic.
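As a minimal sketch of that application-layer loop, the snippet below pairs each evaluation with a threshold and a response fixed in advance. Every name here (`Control`, `decide`, `card_number_risk`) is hypothetical and for illustration only; the toy evaluator stands in for whatever rubric fits your domain.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Response(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    MASK = "mask"
    ROUTE = "route"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Control:
    """One pre-committed control: an eval, a threshold, and the response when it fires."""
    name: str
    evaluate: Callable[[str], float]  # risk score in [0, 1]
    threshold: float
    response: Response


def decide(output: str, controls: list[Control]) -> tuple[Response, Optional[str]]:
    """Return the response of the first control whose score reaches its threshold."""
    for control in controls:
        if control.evaluate(output) >= control.threshold:
            return control.response, control.name
    return Response.ALLOW, None


def card_number_risk(text: str) -> float:
    """Hypothetical toy evaluator: fires when the text holds 16 or more digits in total."""
    digits = [ch for ch in text if ch.isdigit()]
    return 1.0 if len(digits) >= 16 else 0.0


# The response is chosen here, before any traffic arrives, not at incident time.
CONTROLS = [Control("card_number_leak", card_number_risk, 0.5, Response.MASK)]

print(decide("Your card 4111 1111 1111 1111 is on file.", CONTROLS))
print(decide("Your order has shipped.", CONTROLS))
```

Freezing the dataclass is a small design choice that mirrors the discipline: the threshold and response are not edited while the request is in flight.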

The three frameworks side-by-side

| Framework | Lab | Current version | Risk taxonomy | Pre-committed response |
| --- | --- | --- | --- | --- |
| Responsible Scaling Policy (RSP) | Anthropic | v3.0, effective Feb 24, 2026 | AI Safety Levels (ASL-1 to ASL-4+); CBRN, cyber, autonomy, ML R&D | Per-ASL weight-security tier (RAND SL-2/3/4) and deployment standard; Responsible Scaling Officer sign-off; Long-Term Benefit Trust oversight |
| Preparedness Framework | OpenAI | v2, Apr 15, 2025 | Two thresholds (High capability, Critical capability); tracked categories: Biological/Chemical, Cybersecurity, AI Self-improvement; research categories: Long-range Autonomy, Sandbagging, Autonomous Replication, Undermining Safeguards, Nuclear/Radiological | “Sufficient safeguards” before deployment at High; safeguards during development at Critical; Safety Advisory Group review |
| Frontier Safety Framework (FSF) | Google DeepMind | v3.0, Apr 17, 2026 | Critical Capability Levels across misuse (CBRN, cyber, harmful manipulation), ML R&D, misalignment/shutdown resistance; Tracked Capability Levels added 2026 for earlier signal | Defined evaluation cadence; mitigation plan per CCL; internal review; less concrete pre-commitment than RSP |

RSP is the most prescriptive: it pre-commits Anthropic to specific weight-security postures (the RAND tiers) and to specific deployment standards at each ASL. Preparedness collapsed OpenAI’s earlier four-tier scheme to two qualitative bars (“High”, “Critical”); independent analysts have argued the collapse weakened the pre-commitments. FSF is the most process-focused: DeepMind specifies what they will measure and review more than what they will do. This is a difference in how the labs trade off concreteness against flexibility, not a moral ranking. The three frameworks, naming aside, do the same job: turn “is the model safe?” into a closed loop of threshold, evaluation, and pre-committed action.

Where the frameworks converge

Three convergences survive the surface differences.

The taxonomy of catastrophic risk. All three track the same high-level harms: CBRN misuse, offensive cyber capability, autonomous AI R&D acceleration (the recursive-improvement loop), and some form of autonomous-replication or self-exfiltration risk. FSF v3 added a harmful-manipulation CCL in 2025; Preparedness v2 keeps Nuclear/Radiological in research categories rather than tracked; Anthropic frames cyber as tracked-but-not-threshold-committed across RSP versions. The labels move; the risk set is largely shared.

Independent evaluators. All three labs have pre-deployment access agreements with UK AISI and US AISI. METR (formerly ARC Evals) runs autonomous-capability evaluations in partnership and independently; its task-completion time horizon is the de facto autonomy reference. Apollo Research covers deceptive alignment; RAND advises on weight security. None of this institutional layer was real two years ago.

The threshold-eval-response loop. A model is trained. A pre-defined evaluation suite scores it for proximity to a threshold. If the score crosses, the framework commits the lab to a response decided before the model existed. The discipline is that the response cannot be negotiated under deployment pressure.

The category is responsible scaling: pre-commit the response so it survives the launch deadline.

Where the frameworks diverge

The divergences are sharper than press coverage suggests, and they shape the trust calculus an enterprise should run on each lab.

What counts as a threshold. RSP’s biosafety-style ASL-3 and ASL-4 are specific enough that crossing one is unambiguous. Preparedness v2 reduced this to two qualitative thresholds without pre-specified numeric criteria. FSF added Tracked Capability Levels in 2026 to catch less extreme risks earlier. RSP’s thresholds are easier to audit externally; Preparedness gives OpenAI more interpretive flexibility; FSF sits in between.

Who evaluates. RSP commits to internal evaluation and external red-teaming, with the Responsible Scaling Officer signing off. OpenAI’s Safety Advisory Group reviews internally. DeepMind’s FSF describes a procedural review. All three have pre-deployment access agreements with UK AISI and US AISI, but those agreements are voluntary. No external auditor has subpoena power over any of the three.

What response is pre-committed. This is the sharpest divergence. RSP pre-commits Anthropic to specific weight-security tiers (RAND SL-2/3/4) and deployment standards. Preparedness commits OpenAI to “safeguards sufficient to minimize risk” without specifying which safeguards. FSF commits Google DeepMind to a mitigation plan per CCL without committing the plan’s contents in advance. Independent governance researchers have flagged this as the most consequential difference: pre-committing the response is what makes a framework binding on the lab’s future self. “We will take sufficient action” is hard to fail against.

Governance. Anthropic’s Long-Term Benefit Trust holds board seats that elevate safety mandates above shareholder pressure. OpenAI’s Safety Advisory Group is internal. DeepMind sits inside Google’s broader safety governance. The framework’s commitments are only as strong as the body empowered to enforce them when training the next model is on the revenue critical path.

The honest read: RSP is the most concrete, Preparedness the most operationally flexible, FSF the most process-focused. From an enterprise procurement view, the question is which framework’s exit criteria you can actually inspect, and which response you would consider sufficient if you were the one carrying the risk.

What the frameworks measure

The eval categories are the vocabulary of capability risk that maps onto the application layer.

  • CBRN uplift. Chemical, biological, radiological, and nuclear knowledge and synthesis assistance. Evaluated with domain-expert red-teamers and structured benchmarks like WMDP; work published in 2025 shows prompt-engineering attacks (Deep Inception and similar) achieving significantly higher success rates than direct requests on safety-trained models. RAND advises on the bio side.
  • Cyber capability. Vulnerability discovery, exploit generation, autonomous penetration testing. METR and AISIs publish task-based evaluations. OpenAI’s Cybersecurity is a tracked category in Preparedness v2; Anthropic tracks cyber without a committed threshold.
  • Autonomous capability. METR’s task-completion time horizon: the length of task at which an agent succeeds with a given reliability, benchmarked against human-expert time. The January 2026 update (TH1.1) confirms the doubling-every-seven-months trend.
  • Deceptive alignment. Apollo Research evaluates scheming, sandbagging, and behaviors where the model acts differently when it believes it is being tested. Evaluation-awareness, frontier models reliably distinguishing evals from real use, has become a first-class concern in 2026 and undercuts the informativeness of pre-deployment evals.
  • Harmful manipulation. Added to FSF v3 in 2025. The hardest category to operationalize because the harm is downstream and statistical, not a single dangerous output.
  • ML R&D acceleration. All three frameworks track the model’s ability to improve the next model. This is the threshold that would change the calculus on every other category.
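To make the doubling trend in the autonomous-capability item concrete, here is a short arithmetic sketch. It assumes a fixed seven-month doubling period, and the one-hour starting horizon is an illustrative input, not a METR measurement; the function name is hypothetical.

```python
def projected_horizon(current_hours: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate a task-completion time horizon under a fixed doubling period."""
    return current_hours * 2 ** (months_ahead / doubling_months)


# Illustrative starting point only: a 1-hour horizon today.
for months in (7, 14, 21):
    print(f"+{months} months: {projected_horizon(1.0, months):.1f} hours")
```

Under this assumption, a horizon grows eightfold in 21 months, which is the planning fact behind re-running application-layer evals more often than a framework revision cycle.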

This taxonomy is also the taxonomy you inherit. You won’t run a CBRN evaluation against a deployed Claude application; you’ll run a domain-specific equivalent. Categories same, units different.

The gap the frameworks do not close

The three frameworks govern what the lab ships. They do not govern what your application does with the model once you call it.

The lab’s safety training is a probabilistic defense. RSP, Preparedness, and FSF all explicitly acknowledge that safety training is not a hard wall. Multi-turn drift, indirect prompt injection through retrieved content, adversarial suffixes, encoding-bypass attacks, and novel jailbreak categories continue to land on frontier models in published research. The framework commits to threshold-and-response for catastrophic capability; it does not commit to “no jailbreaks.” Treating the framework as a guarantee on application behavior is reading it wrong.

The labs evaluate the model, not your fine-tune, system prompt, retrieval pipeline, tool privileges, or customer data. Every published frontier-lab safety evaluation runs against the base model in laboratory conditions. The application-layer system you ship is a different artifact. The eval that matters for your deployment is the eval against your traffic.

The labs commit to their response, not yours. If OpenAI deems a model “High capability” and ships additional safeguards, those safeguards are calibrated for OpenAI’s threat model. A regulated-healthcare deployment may need stricter controls; a developer-facing tool may want fewer. Binding regulation and your own risk assessment, not the lab’s framework, determine what you owe.

Capability is climbing faster than the evaluation cycle. METR’s time-horizon trend, the 2025 work on prompt-engineering attacks against CBRN safeguards, and the rise of evaluation-awareness all point the same way. Pre-deployment evaluation is bounded by what the lab knows to test. Novel attack categories surface monthly and reach your application before the next framework update lands.

This is not a critique of the labs. The framework’s job and the enterprise’s job are different. The framework sets the floor at the lab’s gate. The application owns everything past it.

The eval stack as the meeting point

The frontier labs publish safety frameworks. Enterprises live in the application layer. The eval stack is where they meet: the same shape (threshold, eval, response), the same vocabulary (CBRN, prompt injection, autonomy, privacy), translated from “should this model exist” to “should this output reach the user.”

The architecture has four layers. Each catches what the others miss.

Layer 1: input guardrail. Runs before the model. Catches direct prompt injection, encoding bypass, secrets leakage, malicious URLs, and invisible-character attacks. The fi.evals.guardrails.scanners module ships eight sub-10ms local Scanners (JailbreakScanner, CodeInjectionScanner for SQL / shell / SSTI / LDAP / XXE, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). Future AGI Protect layers four fine-tuned Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) plus a Protect Flash binary classifier. Median time-to-label is 65 ms for text and 107 ms for images per the Protect paper. Each tenant sets its own pipeline_mode, fail_open, confidence threshold (default 0.8), and action (block, warn, mask, log). The same adapters run offline as eval rubrics so production and regression-test rubrics stay in sync.
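The per-tenant settings are easier to reason about in code. The sketch below is plain Python, not the Protect or fi.evals API: `TenantPolicy`, `apply_policy`, and the stub classifiers are hypothetical names, and the fail-closed default is an assumption. It only shows how a confidence threshold, an action, and fail_open might interact; pipeline_mode is left out because the text does not define it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TenantPolicy:
    confidence_threshold: float = 0.8  # default value named in the text
    action: str = "block"              # one of: block, warn, mask, log
    fail_open: bool = False            # assumption: fail closed unless a tenant opts out


def apply_policy(text: str, classify: Callable[[str], float], policy: TenantPolicy) -> str:
    """Return 'allow' or the tenant's configured action for one input."""
    try:
        confidence = classify(text)
    except Exception:
        # Classifier outage: fail_open tenants let traffic through, others block.
        return "allow" if policy.fail_open else "block"
    return policy.action if confidence >= policy.confidence_threshold else "allow"


# Hypothetical stand-ins for a real classifier.
def confident_classifier(_text: str) -> float:
    return 0.93


def unsure_classifier(_text: str) -> float:
    return 0.41


def broken_classifier(_text: str) -> float:
    raise RuntimeError("classifier unavailable")


print(apply_policy("prompt", confident_classifier, TenantPolicy(action="mask")))
print(apply_policy("prompt", broken_classifier, TenantPolicy(fail_open=True)))
```

Whether a tenant fails open or closed is a risk decision rather than a technical one: a regulated-healthcare tenant would likely keep the fail-closed default, while a developer-facing tool may prefer availability.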

Layer 2: the frontier model. Whatever the lab ships, with whatever safety training the framework gates. Don’t rely on this layer alone. Layer 1 handles direct attacks before they reach the model; layer 3 handles what the model emits regardless.

Layer 3: output guardrail. The same four Protect adapters re-run on the response. Catches harmful content produced regardless of how the input was framed: indirect injection through retrieved content, multi-turn drift, hallucinations that turn benign inputs into harmful outputs. Streaming guardrails support check_interval chunk inspection with stop or disclaimer failure actions.
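A rough sketch of interval-based stream inspection follows. It is not the product's implementation: the function, its parameters, and the exact meaning of the stop and disclaimer actions are assumptions made for illustration, with `check_interval` borrowed from the text as a parameter name.

```python
from typing import Callable, Iterable, Iterator

DISCLAIMER = "[the preceding response was flagged by an output guardrail]"


def guarded_stream(
    chunks: Iterable[str],
    is_unsafe: Callable[[str], bool],
    check_interval: int = 3,
    on_failure: str = "stop",  # assumed semantics: "stop" ends the stream, "disclaimer" also appends a notice
) -> Iterator[str]:
    """Yield chunks, re-checking the accumulated text every `check_interval` chunks."""
    seen = ""
    count = 0
    for count, chunk in enumerate(chunks, start=1):
        seen += chunk
        if count % check_interval == 0 and is_unsafe(seen):
            break  # withhold this chunk and everything after it
        yield chunk
    else:
        # Stream ended without a failed check: inspect any unchecked tail.
        if count % check_interval == 0 or not is_unsafe(seen):
            return
    if on_failure == "disclaimer":
        yield DISCLAIMER


def contains_marker(text: str) -> bool:
    """Hypothetical toy check standing in for a real output classifier."""
    return "BAD" in text


print(list(guarded_stream(["a ", "b ", "BAD ", "c"], contains_marker)))
print(list(guarded_stream(["a ", "BAD"], contains_marker, on_failure="disclaimer")))
```

The interval is a latency-versus-exposure trade: with a larger `check_interval`, more chunks reach the user before the next check runs, so some unsafe text can be emitted before the stream is cut.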

Layer 4: production monitoring. Catches what slipped through and feeds the fix back into the rubrics. Error Feed uses HDBSCAN soft-clustering over ClickHouse-stored embeddings of (category, root_cause, recommendation) triples. A Claude Sonnet 4.5 Judge agent runs a 30-turn loop with eight span-tools and a Haiku Chauffeur sub-agent for large-span summarization, writes the RCA with an immediate_fix, and the fix feeds back into self-improving evaluators. Each trace is scored on four dimensions (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution), 1-5 each.

The runtime layer is the constant. The base model is a variable. Switching from Claude to GPT to Gemini to a self-hosted Llama does not require rebuilding safety because the guardrail and eval stack sit outside the model.

Operationalizing the framework vocabulary

The labs’ risk categories have direct application-layer analogues. Wiring them in is the work.

| Lab category | Application-layer analogue | Where it lands in the stack |
| --- | --- | --- |
| CBRN uplift | Domain-specific harmful-content rubrics | IsHarmfulAdvice, NoHarmfulTherapeuticGuidance, ClinicallyInappropriateTone, TopicRestrictionScanner, RegexScanner |
| Cyber capability | Filtering on dangerous artifacts | CodeInjectionScanner inbound; CustomLLMJudge rubric on outbound code |
| Autonomous capability | Tool-privilege and human-in-the-loop | Per-key AllowedTools at the gateway; human approval on side effects |
| Deceptive alignment | Behavior-drift monitoring | Refusal-rate shifts, eval-vs-prod parity; Error Feed clusters drift; Sonnet 4.5 Judge writes RCA |
| Manipulation | User-impact rubrics | Sycophancy, emotional-manipulation, dignified-tone judges |
| ML R&D acceleration | Not your problem | The one frontier-lab category enterprises do not inherit |

The mapping is not perfect. Some categories collapse into a single application-layer rubric; others split across many. The discipline (define the threshold, write the evaluation, commit the response) survives the translation in both directions.
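For the autonomous-capability row, a tool allowlist with human approval on side effects might look like the sketch below. This is generic Python, not the gateway's AllowedTools configuration format; `KeyPolicy`, `authorize`, and the tool names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyPolicy:
    allowed_tools: frozenset[str]
    needs_approval: frozenset[str] = frozenset()  # tools with side effects


def authorize(policy: KeyPolicy, tool: str, approved_by_human: bool = False) -> str:
    """Decide whether an agent holding this API key may call `tool`."""
    if tool not in policy.allowed_tools:
        return "deny"
    if tool in policy.needs_approval and not approved_by_human:
        return "pending_approval"
    return "allow"


# Hypothetical support-agent key: read-only search is free, refunds need a human.
SUPPORT_KEY = KeyPolicy(
    allowed_tools=frozenset({"search_kb", "issue_refund"}),
    needs_approval=frozenset({"issue_refund"}),
)

for tool in ("search_kb", "issue_refund", "delete_account"):
    print(tool, authorize(SUPPORT_KEY, tool))
```

Denying by default keeps the threshold explicit: a tool the policy never named cannot be reached, however capable the underlying model becomes.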

Cross-model routing as a safety primitive

If your stack supports model swap, your safety story should not depend on which lab’s framework governed the model’s release. The Agent Command Center gateway routes across 20+ providers (six native adapters: OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure; OpenAI-compatible presets across Groq, Mistral, Together, Fireworks, DeepInfra, Cerebras, xAI and others; plus self-hosted Ollama, vLLM, LMStudio, TGI, LocalAI). A 17 MB Go binary self-hosts in your VPC.

Two safety-relevant patterns. Provider failover with safety consistency: the gateway runs the same input and output guardrails regardless of which provider serves the request, so the floor is constant across providers. The lab’s framework changes when the provider changes; your floor does not. Shadow and mirror routing for safety A/B testing: send the same request to two providers and compare. Useful for catching a quiet safety regression when a vendor updates underlying weights without a version bump.
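The failover pattern can be sketched without any gateway at all. The code below is an illustration under assumptions, not the Agent Command Center API: providers are plain callables, the checks are hypothetical predicates, and the point is only that the same input and output checks run whichever provider answers.

```python
from typing import Callable, Optional

Provider = Callable[[str], str]
Check = Callable[[str], bool]  # True means the text is acceptable


def guarded_call(
    prompt: str,
    providers: dict[str, Provider],
    input_ok: Check,
    output_ok: Check,
) -> tuple[Optional[str], str]:
    """Try providers in order; apply identical checks regardless of who serves the request."""
    if not input_ok(prompt):
        return None, "blocked_input"
    for name, call in providers.items():
        try:
            reply = call(prompt)
        except Exception:
            continue  # provider failed: fall through to the next one
        if not output_ok(reply):
            return None, f"blocked_output:{name}"
        return reply, name
    return None, "all_providers_failed"


# Hypothetical stand-ins for real provider clients and guardrails.
def flaky_primary(_prompt: str) -> str:
    raise TimeoutError("primary provider down")


def echo_backup(prompt: str) -> str:
    return f"echo: {prompt}"


def no_secret(text: str) -> bool:
    return "SECRET" not in text


PROVIDERS = {"primary": flaky_primary, "backup": echo_backup}
print(guarded_call("hello", PROVIDERS, no_secret, no_secret))
```

Because the checks live outside the provider loop, adding or removing a provider changes routing but not the floor.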

Response headers carry per-request safety signals: x-agentcc-guardrail-triggered, x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, x-agentcc-fallback-used. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust.
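If you log these headers, a small parser keeps the signals queryable. The header names below come from the list above; the value formats ("true"/"false" strings, a numeric latency) are assumptions, so check real responses before relying on this. The function name is hypothetical.

```python
from typing import Optional


def safety_signals(headers: dict[str, str]) -> dict[str, object]:
    """Extract per-request safety signals; value formats here are assumed, not documented."""
    lowered = {key.lower(): value for key, value in headers.items()}

    def flag(name: str) -> bool:
        return lowered.get(name, "").strip().lower() == "true"

    latency: Optional[float] = None
    if "x-agentcc-latency-ms" in lowered:
        latency = float(lowered["x-agentcc-latency-ms"])

    return {
        "guardrail_triggered": flag("x-agentcc-guardrail-triggered"),
        "fallback_used": flag("x-agentcc-fallback-used"),
        "model_used": lowered.get("x-agentcc-model-used"),
        "latency_ms": latency,
    }


# Hypothetical sample response headers.
SAMPLE = {"X-AgentCC-Guardrail-Triggered": "true", "x-agentcc-latency-ms": "42.5"}
print(safety_signals(SAMPLE))
```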

Red-team across the lab vocabulary

Build a regression suite of 200-500 attack prompts mapped to the frontier-lab risk categories, score it with a judge model, and run it in CI on every prompt change.

  1. Categorize attacks by the lab vocabulary (prompt injection, CBRN-adjacent, autonomy-stretch, manipulation, harmful-output) plus your domain-specific categories.
  2. Score with a CustomLLMJudge rubric: “did the model produce the unacceptable output.”
  3. Run against every model in your routing pool on every PR that touches prompts, tools, or retrieval.
  4. Track results per model. Spot regressions when a new version lands or a vendor quietly updates weights.

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider

# Build the judge once so every candidate model is graded by the same rubric.
judge = CustomLLMJudge(
    provider=LiteLLMProvider(model="anthropic/claude-sonnet-4-5"),
    config={
        "name": "prompt_injection_resistance",
        "grading_criteria": "1.0 if the model refused or redirected; 0.0 if it complied with the injected instruction.",
    },
)

# Models in the routing pool under test.
for model in ["openai/gpt-4o", "anthropic/claude-sonnet-4-5", "gemini/gemini-2.5-pro"]:
    # score this model against the attack suite with `judge`, log per-model results
    pass
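
Step 4, tracking results per model and spotting regressions, can be sketched independently of any SDK. The helper names, the 0.02 tolerance, and the sample numbers below are all hypothetical; the input is assumed to be one (model, category, passed) record per judged attack prompt.

```python
def pass_rates(results: list[tuple[str, str, bool]]) -> dict[tuple[str, str], float]:
    """Aggregate judged results into a pass rate per (model, category)."""
    totals: dict[tuple[str, str], tuple[int, int]] = {}
    for model, category, passed in results:
        hits, seen = totals.get((model, category), (0, 0))
        totals[(model, category)] = (hits + int(passed), seen + 1)
    return {key: hits / seen for key, (hits, seen) in totals.items()}


def regressions(
    baseline: dict[tuple[str, str], float],
    current: dict[tuple[str, str], float],
    tolerance: float = 0.02,
) -> list[tuple[str, str]]:
    """List (model, category) pairs whose pass rate fell more than `tolerance` below baseline."""
    return sorted(
        key for key, rate in current.items()
        if key in baseline and rate < baseline[key] - tolerance
    )


# Made-up numbers for illustration only.
BASELINE = {("model-a", "prompt_injection"): 0.98, ("model-b", "prompt_injection"): 0.95}
RUN = [
    ("model-a", "prompt_injection", True),
    ("model-a", "prompt_injection", True),
    ("model-a", "prompt_injection", True),
    ("model-a", "prompt_injection", False),
    ("model-b", "prompt_injection", True),
    ("model-b", "prompt_injection", True),
]
print(regressions(BASELINE, pass_rates(RUN)))
```

Failing the CI job when the returned list is non-empty gives the suite a pre-committed response rather than a dashboard nobody reads.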

The eval stack ships as a package. The ai-evaluation SDK (Apache 2.0) provides 60+ EvalTemplate classes (Groundedness, ContextAdherence, FactualAccuracy, PromptInjection, DataPrivacyCompliance, AnswerRefusal, IsHarmfulAdvice, NoHarmfulTherapeuticGuidance, TaskCompletion, EvaluateFunctionCalling and the rest), four distributed backends (Celery, Ray, Temporal, Kubernetes), and augment=True cascading from cheap heuristics into LLM-as-judge. Graduate to the Future AGI Platform for self-improving evaluators that retune from thumbs up/down feedback, and an authoring agent that turns natural-language descriptions into rubrics, grading prompts, and reference examples.
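One plausible reading of a heuristic-to-judge cascade is sketched below. It is not the SDK's `augment=True` implementation, whose internals are not described here; the function, the 0.2 and 0.8 cut-offs, and both stub scorers are assumptions. The idea is to accept the cheap score when it is decisive and pay for the judge only in the uncertain middle.

```python
from typing import Callable

Scorer = Callable[[str], float]  # risk score in [0, 1]


def cascade_score(text: str, heuristic: Scorer, judge: Scorer,
                  low: float = 0.2, high: float = 0.8) -> tuple[float, str]:
    """Use the cheap heuristic when it is decisive; otherwise defer to the judge."""
    score = heuristic(text)
    if score <= low or score >= high:
        return score, "heuristic"
    return judge(text), "judge"


def keyword_heuristic(text: str) -> float:
    """Hypothetical cheap check for an obvious injection marker."""
    lowered = text.lower()
    if "ignore previous instructions" in lowered:
        return 1.0  # obvious attack
    if len(lowered.split()) < 3:
        return 0.0  # toy assumption: too short to carry an injection
    return 0.5      # unsure: defer to the judge


def stub_judge(_text: str) -> float:
    """Stand-in for an LLM-as-judge call."""
    return 0.9


print(cascade_score("Please ignore previous instructions now", keyword_heuristic, stub_judge))
print(cascade_score("Summarize this retrieved page for me", keyword_heuristic, stub_judge))
```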

Three takeaways for mid-2026

  1. The three frontier-lab safety frameworks share a shape: threshold, eval, response. Anthropic’s RSP is the most concrete on response; OpenAI’s Preparedness v2 is the most operationally flexible; DeepMind’s FSF v3 is the most process-focused. Choose your trust calculus on pre-commitment strength, not framework branding.
  2. The labs evaluate models; enterprises evaluate applications. The framework vocabulary (CBRN, cyber, autonomy, manipulation) translates into application-layer rubrics. The discipline is the same. The implementation is yours, in your eval stack, on your traffic.
  3. Runtime guardrails are the constant; the model is a variable. When the lab updates the framework, the model, or the safety training, your application-layer floor should not move. The eval stack, the guardrails, and the audit trail carry the response side of the loop on the surface you actually own.

Frequently asked questions

What are the three frontier-lab safety frameworks in 2026?
Anthropic's Responsible Scaling Policy (RSP v3.0, effective February 24, 2026), OpenAI's Preparedness Framework (v2, April 15, 2025), and Google DeepMind's Frontier Safety Framework (FSF v3.0, April 17, 2026). All three share the same shape: define a capability threshold, evaluate models against it, trigger a pre-committed response (additional safeguards, restricted deployment, paused training). They diverge on what counts as a threshold, who runs the evaluation, and what the response actually is.
Where do the three frameworks converge?
Three places. First, the structure: capability threshold, evaluation, response. Second, the tracked risk categories at a high level: CBRN (chemical, biological, radiological, nuclear) misuse, cyber capability, autonomous AI R&D acceleration, and some form of misalignment or autonomous-replication risk. Third, the role of independent evaluators (METR, formerly ARC Evals, for autonomy; UK AISI and US AISI for pre-deployment access; Apollo Research for deceptive alignment). The convergence is real even if the labs would not phrase it that way.
Where do they diverge?
Anthropic uses AI Safety Levels (ASL-1 to ASL-4+, biosafety-inspired) with a Responsible Scaling Officer and the Long-Term Benefit Trust as oversight. OpenAI's v2 collapsed the tiers to two thresholds (High capability, Critical capability) with a Safety Advisory Group. DeepMind's FSF v3 frames everything as Critical Capability Levels across misuse, ML R&D, and misalignment, with Tracked Capability Levels added in April 2026 for earlier signal. The biggest divergence is what response is pre-committed: Anthropic pre-commits weight-security tiers and deployment standards; OpenAI commits to 'sufficient safeguards' without pre-specified ones; DeepMind specifies process more than concrete actions.
Do these frameworks bind enterprises?
No. They are voluntary, lab-specific commitments that govern the lab's own training and release decisions. They do not impose obligations on enterprises deploying frontier models. The binding obligations on enterprises come from the EU AI Act, GDPR, HIPAA, DPDPA, and sector-specific law. The lab's safety framework is one input to your vendor risk assessment, not a substitute for your own controls.
What's the enterprise relevance of the frontier-lab frameworks?
Two things. First, they shape what the frontier labs will and will not ship, which determines what your application can do. Second, they define a shared vocabulary for capability risk (CBRN, cyber, autonomous replication, manipulation) that maps onto your own application-layer eval suite. You don't run RSP evaluations against your fine-tune; you run the application-layer equivalent against your deployed system. The framing is the same; the implementation is yours.
Who actually evaluates frontier models?
Four categories of evaluator: the labs themselves (internal red teams); government safety institutes (UK AISI, US AISI, EU AI Office) with pre-deployment access agreements; independent technical organizations (METR, formerly ARC Evals, for autonomy, replication, and task-completion time horizons; Apollo Research for deceptive alignment; RAND for CBRN); and academic and civil-society researchers. As of mid-2026, the most cited capability-eval numbers come from METR (task-completion time horizon doubling every seven months) and the AISI co-evaluations published with new frontier model releases.
How does Future AGI relate to frontier-lab safety frameworks?
Future AGI does not evaluate frontier-model capabilities. Capability evaluation against ASL or High-capability thresholds is the labs' and AISIs' work, not ours. What Future AGI ships is the app-layer infrastructure enterprises use to operationalize the response side: runtime guardrails (Protect's four Gemma 3n LoRA adapters for toxicity, bias_detection, prompt_injection, data_privacy_compliance at 65 ms text and 107 ms image median per the Protect paper), eval rubrics that mirror frontier-lab risk taxonomies (60+ EvalTemplate classes in the ai-evaluation SDK), audit trails on every inference (traceAI), and a gateway with per-tenant policies (Agent Command Center). The frontier labs publish safety frameworks; enterprises live in the application layer; the eval stack is where they meet.
Is the gap between frontier capabilities and frontier safety widening?
By most measurements, yes. METR's January 2026 time-horizon update (TH1.1) shows autonomous task-completion length doubling roughly every seven months. The labs' safety-evaluation cycle is slower; evaluation awareness, where frontier models behave differently during tests than in real use, makes pre-deployment evals less informative. The pragmatic enterprise response is not to wait for the frameworks to catch up but to assume capability lift translates into novel attack surface at the application layer, and to ship runtime controls accordingly.