Introduction
If you’ve ever tried debugging an AI agent, you know the pain. It’s like wandering through a maze - thousands of traces, random logs, and a root cause that always seems just out of reach. The tools we use today weren't built for the complexity of AI agents. Traditional monitoring platforms like Datadog or New Relic are fantastic for conventional software, but they treat an agent's reasoning process as a complete black box. They can tell you if an API call was slow, but they can't tell you why the agent chose the wrong tool or hallucinated an answer.
Newer LLM observability platforms, like LangSmith or Arize AI, are a step closer. They provide detailed traces of your agent's activity, which is a great start. However, they often stop at just presenting the data, leaving the most difficult work to you. You still have to manually skim through thousands of traces to connect the dots, identify a pattern, and then begin the guesswork of finding the root cause. They give you the raw ingredients but leave you to write the recipe for a fix. This manual, time-intensive process is precisely the bottleneck that keeps teams stuck in debugging cycles for days.
Agent Compass transforms this process by providing full transparency into your agent behaviour: it clusters errors in your agent, connects symptoms, patterns, root causes, and actionable fixes - all in a workflow designed for speed and clarity. In this guide, we show how you can debug AI agents in under 5 minutes using Agent Compass, step by step.
Step 1: Instrument Your Agent
Colab notebook: https://colab.research.google.com/drive/15i-0uMu1ucUdRlhu0lZcGKVKiClTajqw?usp=sharing
Before any analysis begins, the agent must feed data into Agent Compass. One of the most powerful aspects of Agent Compass is its zero-config evaluation. We will set up the observability that powers this evaluation using TraceAI, our open-source package designed to enable standardized tracing for AI applications.
TraceAI is a set of conventions and plugins that integrates seamlessly with OpenTelemetry, the industry standard for observability. It works by automatically instrumenting popular AI frameworks and libraries like LangChain, OpenAI, Anthropic, and many more to capture detailed execution data without requiring you to manually modify your agent's code.
Imagine your support agent is tasked with answering a customer's question: "How do I reset my password if I've lost my 2FA device?"
The agent needs to:
Retrieve the correct procedure from your internal knowledge base.
Synthesize that information into a clear, helpful answer.
But the agent responds with: "You can reset your password by clicking 'Forgot Password' on the login page." This answer is generic and unhelpful—it completely misses the critical "lost 2FA device" part of the query.
Without Agent Compass, debugging this is a nightmare. Did the retrieval step fail to find the right document? Did it find the right document but the LLM ignored it? Or did the LLM just hallucinate a generic answer? You'd have to sift through logs and manually piece together the story.
Step-by-Step Guide: How to Instrument Your Agent
Install one of the TraceAI libraries
```bash
pip install traceAI-langchain
```
Configure your environment
```python
import os
```
These are your personal credentials for the Future AGI platform. They ensure that the trace data sent from your application is securely associated with your account, so you can view and analyze it within Agent Compass. You can get them easily from your Future AGI account and can even create new keys with your preferred Role Based Access Control.
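A minimal sketch of that configuration is below. The `FI_API_KEY` and `FI_SECRET_KEY` variable names are the ones we assume the fi_instrumentation SDK reads; verify them against your account settings and the docs:

```python
import os

# Credentials from your Future AGI account. The variable names below are
# assumed from the fi_instrumentation SDK; verify them against the docs.
os.environ["FI_API_KEY"] = "your-api-key"
os.environ["FI_SECRET_KEY"] = "your-secret-key"
```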
Register your Observe Project
This step sets up the observability pipeline, telling TraceAI where to send the data. You give your project a name so you can easily find it in the Agent Compass UI.
```python
from fi_instrumentation import register, Transport
```
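Here is a sketch of the registration call. The project name is yours to choose; the keyword arguments are assumptions based on the Transport import above, so check the TraceAI docs for the exact signature:

```python
from fi_instrumentation import register, Transport

# Register the project so its traces appear under this name in Agent Compass.
# The keyword arguments here are assumptions; consult the TraceAI docs.
trace_provider = register(
    project_name="support-agent-debugging",
    transport=Transport.HTTP,
)
```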
Instrument your Project with the auto-instrumentor
This is the one-line command that activates auto-instrumentation. From this point on, all calls made through LangChain will be automatically traced.
```python
from traceai_langchain import LangChainInstrumentor
```
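Activation follows the standard OpenTelemetry instrumentor pattern. The sketch below assumes the trace_provider returned by register() in the previous step:

```python
from traceai_langchain import LangChainInstrumentor

# One line of activation: every LangChain call from here on is traced
# and exported via the provider registered in the previous step.
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
```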
Start interacting with the framework as you normally would; the traces will be captured automatically by the library
Now, run your agent's code without any changes. Even though this looks like a standard RAG workflow, TraceAI is capturing every step in the background.
```python
from langchain_core.prompts import ChatPromptTemplate
```
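To make this concrete, here is a minimal sketch of a RAG-style workflow for the support scenario from earlier. The stub retriever and the model name are placeholders for your own components; note that the code contains no tracing logic at all:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Stub retriever standing in for your real knowledge-base lookup.
def retrieve_procedure(question: str) -> str:
    return (
        "KB article 42: If a user has lost their 2FA device, verify their "
        "identity via a backup code or a support ticket before resetting "
        "the password."
    )

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the user's question using only this context:\n{context}"),
    ("human", "{question}"),
])

# Standard LCEL chain: prompt -> model -> string output.
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

question = "How do I reset my password if I've lost my 2FA device?"
answer = chain.invoke({
    "context": retrieve_procedure(question),
    "question": question,
})
print(answer)  # TraceAI records the prompt, model call, and parser spans automatically.
```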
Ensure all tools, prompts, and APIs used by the agent are included in the trace capture.
Within seconds, Agent Compass begins collecting traces, tool usage, and prompt interactions - creating the foundation for clustering and diagnosis.
Tip: No custom evaluators or dashboards are required; the zero-config setup works across multi-tool agents immediately.
Step 2: Cluster Failures Automatically
Once your agent is instrumented, Agent Compass automatically groups similar failures across runs, versions, and data slices. This step is key for moving from span-level noise to actionable patterns.
This is where our formal Error Taxonomy comes into play. Instead of just grouping raw logs, Agent Compass first analyzes every trace and identifies specific, categorized failures.
Compass scans each trace and assigns failures to precise categories from the taxonomy, such as a Tool Misuse error, a PII Leak, or an Ungrounded Summary. This gives a structured name to every problem.

Figure 1: Trace Error Analysis Categories
How clustering works:
Compass scans traces for errors, hallucinations, latency spikes, or guardrail violations.
It then groups these categorized incidents together. This is incredibly powerful because it clusters based on the type of problem, not just similar-looking text.
Clusters are tagged with contextual metadata - user journey, release version, or data slice.
For example, suppose your sales assistant agent starts failing after a new model update. Instead of seeing hundreds of generic errors, Agent Compass collapses them into a single, high-priority cluster titled: "Workflow & Task Gaps > Retrieval Errors." You'd instantly know the core issue is that the agent is failing to retrieve the right information, allowing you to focus your debugging efforts immediately.

Figure 2: Agent Compass Feed: Debug AI Agents with Auto-Clustered Error Tracking
Step 3: Deep Diagnosis with a Developer-Centric Taxonomy
Let’s go back to our sales assistant. Agent Compass has clustered thousands of failures into a single group: "Retrieval Errors." This is a huge step forward; we know where the fire is. But this is where most observability tools stop, leaving you to ask the most important question: why is the retrieval failing?
Agent Compass investigates the root cause using its built-in, developer-centric Error Taxonomy. Think of it as an expert system that asks a series of diagnostic questions:
Is the agent just making things up? It checks for Thinking & Response Issues, like a hallucination where the agent invents a product detail instead of retrieving it.
Is it doing something dangerous? It scans for Safety & Security Risks, ensuring no customer PII was leaked in the failed query.
Did one of its tools break? It looks for Tool & System Failures, like a timeout or crash in the connection to your knowledge base API.
Did it get lost in a multi-step process? It analyzes the plan for Workflow & Task Gaps. Perhaps the agent retrieved the correct document but then forgot about it two steps later, a classic case of context loss.
In our sales agent example, Agent Compass finds something deeper. The agent didn't just fail to retrieve the document once. It failed, then tried the exact same query three more times, getting the same error each time without ever changing its approach.
This is the difference between a symptom and a diagnosis. A traditional tool would just report four "Tool Errors." But Agent Compass uses its taxonomy to classify the pattern as a Lack of Self-Correction, the true root cause: the agent's designed workflow has no recovery path, and that is what needs fixing.
Step 4: Generating Actionable, Developer-Ready Fixes
Identifying a root cause is only half the battle. The FAGI framework is designed to close the loop by providing specific, actionable recommendations that developers can use immediately. It doesn't just tell you what is broken; it gives you a clear path to fixing it.
For each identified error, the agent generates several layers of guidance:
Root Cause Analysis: A clear, plain-language explanation of the underlying failure (e.g., "The agent is not grounded, inventing information when its tools fail to provide a factual answer.").
Long-Term Recommendation: Strategic advice on how to prevent this class of error in the future (e.g., "Update the system prompt to explicitly forbid inventing answers. If tools fail, the agent should report the failure.").
Suggested Fix: A concrete, often copy-pastable, code or prompt modification. For example, after diagnosing a recurring tool-use error, the agent suggested a specific addition to the system prompt (see the illustrative sketch after this list).
Immediate Fix: The most direct action to resolve the immediate issue (e.g., "The tool-calling logic should be corrected to call page_down() with no arguments, as specified in the schema.").
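For illustration, a suggested prompt addition of this kind might read as follows (a hypothetical sketch, not Compass's verbatim output):

```
If a tool call fails, do not invent an answer. State plainly that the tool
failed, describe what you tried, and either retry with a modified approach
or ask the user how to proceed. Never fabricate facts the tools did not return.
```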
This multi-layered output transforms a vague error signal into a rich, developer-ready ticket, dramatically reducing the time it takes to go from detection to resolution.

Figure 3: Agent Compass Error Diagnosis
Best Practices for Rapid Debugging
Instrument all relevant tools and prompts to ensure cause graphs capture the full workflow.
Review clusters daily in high-traffic agents to preempt escalations.
Leverage Fix Recipes for workflow integration - don’t manually recreate solutions.
Maintain historical context: Compass’s feed-style timeline helps track recurring issues and their remediation.
Collaborate across teams: Share clusters and fixes to align development, MLOps, and support functions.
Conclusion
Debugging AI agents no longer needs to be a slow, error-prone process. With Agent Compass, teams can instrument, cluster, diagnose, and fix agent failures in under five minutes. Its narrative observability, automated clustering, cause graphs, and workflow-integrated Fix Recipes create a seamless, evidence-driven debugging experience.
By embracing Compass, organizations gain faster resolution, improved agent reliability, and higher operational efficiency - transforming agent debugging from a tedious task into a streamlined, proactive workflow.
To explore more about Agent Compass and get started, visit our documentation page.