Evaluate AWS Bedrock
Score AWS Bedrock responses against 70+ purpose-built evaluators — Groundedness, Context Relevance, Prompt Injection, Toxicity, function-call accuracy, and your own custom templates.
Recipes for AWS Bedrock
Prerequisites
Before you start
- A working AWS Bedrock app — local or already in production.
- A free Future AGI account with `FI_API_KEY` and `FI_SECRET_KEY`.
- Python 3.9+ / Node 18+ / Java 17+ depending on which SDK you're installing.
- Trace input/output payloads (or a dataset) ready to score.
Install

```bash
pip install traceAI-bedrock
```

Evaluate recipe
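If you don't yet have a response payload to score, here is a minimal sketch of capturing one from Bedrock via `boto3` and the Converse API. The model ID, region, and the `extract_text` helper are illustrative assumptions, not part of the Future AGI SDK:

```python
def extract_text(response: dict) -> str:
    """Pull the assistant's text out of a Bedrock Converse API response."""
    return response["output"]["message"]["content"][0]["text"]

if __name__ == "__main__":
    # Assumes AWS credentials and model access are already configured
    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    user_query = "What is our refund policy?"
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": user_query}]}],
    )
    aws_bedrock_response = extract_text(response)
```

The resulting `user_query` and `aws_bedrock_response` are exactly what the evaluate recipe below plugs into `inputs`.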
```python
from fi.evals import EvalClient
from fi.evals.templates import (
    ContextRelevance,
    Groundedness,
    PromptInjection,
    Toxicity,
)

client = EvalClient(api_key="<FI_API_KEY>", secret_key="<FI_SECRET_KEY>")

# Reuse the trace input/output from your AWS Bedrock run
result = client.evaluate(
    eval_templates=[ContextRelevance(), Groundedness(), PromptInjection(), Toxicity()],
    inputs=[{
        "input": user_query,
        "output": aws_bedrock_response,
        "context": retrieved_context,
    }],
)
print(result.eval_results)
```

What Future AGI captures
Evaluate fields you'll see in the dashboard

- Run any of the 70+ Future AGI evaluator templates against trace input/output
- Score in real time on production spans, in CI on a dataset, or as a guardrail before a response is returned
- Custom evaluators via the builder API — heuristic, LLM-as-judge, or fine-tuned Turing models
- Eval results land back on the originating trace as searchable attributes
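The CI mode above can be sketched as a small batch helper that maps dataset rows onto the `input`/`output`/`context` shape the recipe uses. The row field names (`question`, `model_answer`, `retrieved_chunks`) are illustrative assumptions about your dataset:

```python
def build_eval_inputs(rows: list[dict]) -> list[dict]:
    """Map dataset rows onto the keys the eval templates expect."""
    return [
        {
            "input": row["question"],
            "output": row["model_answer"],
            "context": row["retrieved_chunks"],
        }
        for row in rows
    ]

# In CI, score the whole dataset in one call, reusing the client
# from the recipe above:
# result = client.evaluate(
#     eval_templates=[Groundedness(), ContextRelevance()],
#     inputs=build_eval_inputs(rows),
# )
```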
Common gotchas

Read these before you ship

1. Eval templates expect specific input keys — check the template signature in `fi.evals.templates`.
2. For RAG evaluators, pass the retrieved chunks as `context`, not the full document.
3. LLM-as-judge templates count against your eval-model token budget — switch to Turing flash for high-volume workloads.
Next: chain it with the other recipes
Evaluate is the first step. Most teams add an evaluator in the first week and start optimising or simulating once they have a baseline. Each recipe takes minutes to wire up.
Adjacent integrations
More integrations like AWS Bedrock
- Vertex AI: Google Cloud's hosted Gemini, Anthropic, and Llama endpoints.
- Azure OpenAI: Microsoft Azure's regulated OpenAI deployments and assistants.
- IBM watsonx: IBM watsonx.ai foundation models for regulated workloads.
- Replicate: run open-source AI models on Replicate's serverless GPUs.