Evaluate Cloudflare Workers AI
Score Cloudflare Workers AI responses against 70+ purpose-built evaluators — Groundedness, Context Relevance, Prompt Injection, Toxicity, function-call accuracy, and your own custom templates.
Recipes for Cloudflare Workers AI
Prerequisites
Before you start
- A working Cloudflare Workers AI app, local or already in production.
- A free Future AGI account with `FI_API_KEY` and `FI_SECRET_KEY`.
- Python 3.9+ / Node 18+ / Java 17+, depending on which SDK you're installing.
- Trace input/output payloads (or a dataset) ready to score.
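Rather than pasting the keys into source, you can pull them from the environment. A minimal sketch (the variable names `api_key` / `secret_key` and the placeholder fallbacks are illustrative, not part of the SDK):

```python
import os

# Read the Future AGI credentials from the environment instead of
# hardcoding them. Falls back to the placeholder strings used in this
# recipe so the snippet still runs before you've exported the keys.
api_key = os.environ.get("FI_API_KEY", "<FI_API_KEY>")
secret_key = os.environ.get("FI_SECRET_KEY", "<FI_SECRET_KEY>")
```

Pass these to `EvalClient(api_key=..., secret_key=...)` exactly as in the recipe below.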
Install
```bash
pip install traceAI-openai
```

Evaluate recipe
```python
from fi.evals import EvalClient
from fi.evals.templates import (
    ContextRelevance,
    Groundedness,
    PromptInjection,
    Toxicity,
)

client = EvalClient(api_key="<FI_API_KEY>", secret_key="<FI_SECRET_KEY>")

# Reuse the trace input/output from your Cloudflare Workers AI run
result = client.evaluate(
    eval_templates=[ContextRelevance(), Groundedness(), PromptInjection(), Toxicity()],
    inputs=[{
        "input": user_query,                       # the user's prompt
        "output": cloudflare_workers_ai_response,  # the model's reply
        "context": retrieved_context,              # retrieved chunks, for RAG evals
    }],
)
print(result.eval_results)
```

What Future AGI captures
Evaluate fields you'll see in the dashboard
- Run any of the 70+ Future AGI evaluator templates against trace input/output
- Score in real time on production spans, in CI on a dataset, or as a guardrail before responding
- Build custom evaluators via the builder API: heuristic, LLM-as-judge, or fine-tuned Turing models
- Eval results land back on the originating trace as searchable attributes
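The guardrail use above reduces to a threshold check over the per-template scores. This is only a sketch: the metric names, thresholds, and the exact shape you extract from `result.eval_results` all depend on your templates and SDK version, so the `scores` dict here is a stand-in.

```python
def passes_guardrails(scores: dict[str, float],
                      thresholds: dict[str, float]) -> bool:
    """Return True only if every metric clears its threshold.

    `scores` stands in for the per-template values you pull out of
    result.eval_results; a missing metric counts as a failure.
    """
    return all(scores.get(name, 0.0) >= cutoff
               for name, cutoff in thresholds.items())

# Hypothetical cutoffs: block the response if groundedness is low
# or the toxicity check doesn't come back clean.
thresholds = {"groundedness": 0.7, "toxicity_safe": 0.9}
```

Call `passes_guardrails(...)` before returning the Workers AI response to the user, and fall back to a safe reply when it returns `False`.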
Common gotchas
Read these before you ship
1. Eval templates expect specific input keys; check the template signature in `fi.evals.templates`.
2. For RAG evaluators, pass the retrieved chunks as `context`, not the full document.
3. LLM-as-judge templates count against your eval-model token budget; switch to Turing flash for high-volume runs.
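Gotcha 02 in practice: build the `context` field from the retrieved chunks only. A minimal sketch, assuming your retriever returns a list of chunk strings (the helper name and character cap are illustrative):

```python
def build_context(chunks: list[str], max_chars: int = 4000) -> str:
    """Join retrieved chunks into one `context` string, capped at
    max_chars so an oversized chunk doesn't blow the token budget."""
    out, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break
        out.append(chunk)
        used += len(chunk)
    return "\n\n".join(out)

# e.g. retrieved_context = build_context(retriever_results)
retrieved_context = build_context(["chunk one ...", "chunk two ..."])
```

Passing the whole source document instead tends to drag down Context Relevance scores and inflates eval cost.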
Next: chain it with the other recipes
Evaluate is the first step. Most teams add an evaluator the same week and start optimising or simulating once they have a baseline. Each recipe takes minutes to wire up.
Adjacent integrations
More integrations like Cloudflare Workers AI
- Vertex AI: Google Cloud's hosted Gemini, Anthropic, and Llama endpoints.
- AWS Bedrock: Amazon Bedrock invocation across Claude, Llama, Mistral, Nova, and Titan.
- Azure OpenAI: Microsoft Azure's regulated OpenAI deployments and assistants.
- IBM watsonx: IBM watsonx.ai foundation models for regulated workloads.