System Metrics in Observe, Multimodal Bedrock Tracing, and Eval Playground Upgrades
Infrastructure metrics alongside agent traces in Observe, image tracing for AWS Bedrock, and standalone mode + feedback loops in the Eval Playground.
What's in this digest
System Metrics in the Observe Graph

Agent traces (end-to-end records of how your agent handled each request) tell you what happened inside your AI pipeline. The Observe graph already plotted agent-level signals; this release adds system metrics directly into that graph so infrastructure and agent behavior live in the same view.
What’s new
- System metrics in the primary graph dropdown. Select “system metrics” in the Observe graph dropdown and render the graph against system-level signals — right alongside your agent metrics.
- System metrics in the compare graph. The same option works in the compare graph, so you can overlay two periods or two views of system data.
- Backed by the graph API. Selecting a system metric triggers the graph API to render the appropriate series.
Why it matters
Debugging a latency spike used to mean cross-referencing your observability platform and your infrastructure monitoring tool, aligning timestamps manually. With system metrics in the same Observe graph, the correlation is one selection instead of two tools.
Who it’s for
MLOps and platform engineering teams running agents in production, and data scientists and agent developers who need to see whether a latency issue is model-side or infrastructure-side.
Multimodal Tracing for AWS Bedrock
AWS Bedrock users working with multimodal models can now trace image inputs and outputs. The new Bedrock multimodal support captures images sent to and generated by models like Amazon Titan Image Generator and Anthropic Claude on Bedrock, recording them as part of the trace span (an individual step inside a trace).
What’s new
- Images as part of the span. Both input and output images are captured and stored with the trace.
- Inline image rendering in the trace view. See exactly what went into the model and what came out, without leaving the trace.
- Automatic with traceAI v0.1.11+. No additional configuration — update the SDK and multimodal spans are captured by default.
Why it matters
Debugging a multimodal agent when you can’t see the image inputs is debugging blind. Bedrock multimodal tracing closes that gap for teams on AWS.
Who it’s for
Teams building multimodal agents on AWS Bedrock, especially those using image-input or image-output models where the visual artifact is often the source of unexpected behavior.
Eval Playground Improvements
Three upgrades to the playground shipped two cycles ago. The Eval Playground first shipped in w24.
Standalone evaluation mode. Run an evaluation (a test that scores agent outputs against criteria you define) without connecting it to a dataset or experiment. Paste any text, configure the evaluation, get a score. Useful for quick spot-checks without the overhead of a full pipeline.
Feedback collection on playground results. When the playground scores an example, you can mark the score as correct or incorrect. The feedback is collected and used to calibrate evaluation accuracy over time, particularly for custom LLM-as-judge evaluations (where one LLM scores the outputs of another) and their rubric tuning.
Cleaner scoring visualization. Score breakdowns by criteria, distribution histograms for batch evaluations, and clear pass/fail indicators that make results scannable.
Multi-line Evaluation Graphs
Plot multiple evaluation metrics on a single chart — overlay hallucination rate against relevance score, or compare factual accuracy across three different models on the same axes. When one metric improves but another degrades, the multi-line view surfaces that tradeoff in one picture. Customize colors, toggle individual lines, set date ranges.
SDK and Platform Updates
traceAI v0.1.11. This release brings the instrumentor count past 25. Headline additions: the Google GenAI instrumentor for tracing Gemini models and function calling via the Google Generative AI SDK, and the multimodal Bedrock support described above.
Langfuse evaluations integration. Teams already using Langfuse for some evaluations can now route that data into Future AGI for unified analysis alongside native evaluations — evaluation results, scores, and metadata sync continuously.
API key management revamp. Redesigned UI for API keys, bulk operations, and clearer permission displays — a cleanup on top of the w26 developer-keys release.
Annotation notes. Reviewers can attach free-form text notes to annotations for richer context beyond labels alone.
Draft prompts. Save work-in-progress prompts as drafts privately, without cluttering the shared prompt library with half-finished experiments.