AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production
The first evaluation framework purpose-built for monitoring and debugging agentic workflows in production, achieving state-of-the-art results on the TRAIL benchmark.
Abstract
With recent advances in LLM reasoning capabilities, adoption of agentic workflows has surged across industries. However, insufficient evaluation methods, over-reliance on technical benchmarks, and inadequate real-world robustness testing leave organizations unprepared for edge cases and contextual failures.
We present AgentCompass, the first evaluation framework purpose-built for monitoring and debugging agentic workflows in production. Unlike prior approaches that rely on static benchmarks or single-pass LLM judgments, AgentCompass employs:
- A recursive plan-and-execute reasoning cycle
- A formal hierarchical error taxonomy tailored to business-critical issues
- A dual memory system for longitudinal analysis across executions
Key Results
AgentCompass achieves state-of-the-art results on the TRAIL benchmark, which is scored separately on its GAIA and SWE-Bench splits:
- Localization Accuracy of 0.657 on GAIA, substantially outperforming Gemini-2.5-Pro (0.546)
- Highest Joint Score on both GAIA (0.239) and SWE-Bench (0.051)
- Highest Categorization F1 on SWE-Bench (0.232)
- Uncovered critical errors that human annotations missed, including safety and reflection gaps
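To make the three reported metrics concrete, the sketch below computes them for a single toy trace. The definitions are assumptions made for illustration and may differ from TRAIL's official scoring: localization accuracy is taken as the share of annotated error locations that were recovered, joint score as the share of annotated (location, category) pairs that were recovered, and categorization F1 as micro-F1 over category labels. The `span_id` strings and all function names are hypothetical.

```python
from collections import Counter

# An error is a (span_id, category) pair. Both fields are hypothetical names.
Error = tuple[str, str]

def localization_accuracy(gold: list[Error], pred: list[Error]) -> float:
    """Share of gold error locations that also appear in the predictions (assumed definition)."""
    gold_locs = {loc for loc, _ in gold}
    pred_locs = {loc for loc, _ in pred}
    return len(gold_locs & pred_locs) / len(gold_locs) if gold_locs else 0.0

def joint_score(gold: list[Error], pred: list[Error]) -> float:
    """Share of gold (location, category) pairs predicted exactly (assumed definition)."""
    gold_set, pred_set = set(gold), set(pred)
    return len(gold_set & pred_set) / len(gold_set) if gold_set else 0.0

def categorization_f1(gold: list[Error], pred: list[Error]) -> float:
    """Micro-F1 over category labels, ignoring location (assumed definition)."""
    gold_counts = Counter(cat for _, cat in gold)
    pred_counts = Counter(cat for _, cat in pred)
    true_positives = sum((gold_counts & pred_counts).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(pred_counts.values())
    recall = true_positives / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)

gold = [("s1", "Hallucination"), ("s2", "API failure")]
pred = [("s1", "Hallucination"), ("s2", "Goal drift"), ("s3", "API failure")]
print(localization_accuracy(gold, pred), joint_score(gold, pred), categorization_f1(gold, pred))
```

In this toy case both annotated locations are found, but only one of them carries the right category, which is why a joint score can sit well below localization accuracy, as it does in the reported results.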
Multi-Stage Analytical Pipeline
The framework breaks trace analysis into four stages, each more abstract than the last:
- Error Identification and Categorization - Comprehensive scan classifying errors via a formal taxonomy
- Thematic Error Clustering - Groups errors into semantically coherent clusters revealing systemic issues
- Quantitative Quality Scoring - Multi-dimensional scoring across factual grounding, safety, and plan execution
- Synthesis and Strategic Summarization - Actionable summary with aggregate scores and priority levels
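The four stages above can be sketched as a chain of functions, each consuming the previous stage's output. This is a minimal illustration, not AgentCompass's implementation: a keyword lookup stands in for the LLM-based error identification, clustering is simplified to grouping by category, and the `Finding` type, the 1 to 5 severity scale, and the priority thresholds are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Finding:
    span_id: str
    category: str  # taxonomy label
    severity: int  # 1 (minor) to 5 (critical); hypothetical scale

def identify_errors(trace: list[dict]) -> list[Finding]:
    """Stage 1: scan every span and classify errors (keyword stub in place of an LLM)."""
    rules = {
        "timeout": ("Tool & System Failures", 3),
        "password": ("Safety & Security Risks", 5),
    }
    findings = []
    for span in trace:
        for keyword, (category, severity) in rules.items():
            if keyword in span["text"].lower():
                findings.append(Finding(span["id"], category, severity))
    return findings

def cluster_errors(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Stage 2: group findings into clusters (by category here, not by semantic similarity)."""
    clusters = defaultdict(list)
    for finding in findings:
        clusters[finding.category].append(finding)
    return dict(clusters)

def score_quality(clusters: dict[str, list[Finding]]) -> dict[str, int]:
    """Stage 3: score each cluster from 1 (worst) to 5 (best) using its most severe finding."""
    return {name: 6 - max(f.severity for f in items) for name, items in clusters.items()}

def summarize(scores: dict[str, int]) -> dict:
    """Stage 4: aggregate the scores and derive a priority level."""
    overall = sum(scores.values()) / len(scores) if scores else 5.0
    priority = "high" if overall < 2.5 else "medium" if overall < 4 else "low"
    return {"overall": overall, "priority": priority, "scores": scores}

def analyze(trace: list[dict]) -> dict:
    return summarize(score_quality(cluster_errors(identify_errors(trace))))

trace = [
    {"id": "s1", "text": "Request timeout calling search API"},
    {"id": "s2", "text": "Logged user password in plain text"},
]
report = analyze(trace)
print(report)
```

Keeping each stage as a separate function means any one of them can be replaced, for example by an LLM call, without touching the others.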
Formal Error Taxonomy
The taxonomy defines five high-level categories:
- Thinking & Response Issues - Hallucinations, misinterpretation, flawed decision-making
- Safety & Security Risks - PII leakage, credential exposure, biased content
- Tool & System Failures - API failures, misconfigurations, runtime exceptions
- Workflow & Task Gaps - Context loss, goal drift, task orchestration failures
- Reflection Gaps - Lack of self-correction, action without reasoning
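One way to encode this two-level hierarchy is an enum for the five categories plus a mapping to their example subtypes. The sketch below is illustrative only: the identifiers (`Category`, `TAXONOMY`, `parent_category`) are hypothetical, and the subtypes are limited to the examples listed above rather than the full taxonomy.

```python
from enum import Enum

class Category(Enum):
    THINKING_RESPONSE = "Thinking & Response Issues"
    SAFETY_SECURITY = "Safety & Security Risks"
    TOOL_SYSTEM = "Tool & System Failures"
    WORKFLOW_TASK = "Workflow & Task Gaps"
    REFLECTION = "Reflection Gaps"

# Subtypes are the examples named in the text, not an exhaustive list.
TAXONOMY: dict[Category, tuple[str, ...]] = {
    Category.THINKING_RESPONSE: ("hallucination", "misinterpretation", "flawed decision-making"),
    Category.SAFETY_SECURITY: ("PII leakage", "credential exposure", "biased content"),
    Category.TOOL_SYSTEM: ("API failure", "misconfiguration", "runtime exception"),
    Category.WORKFLOW_TASK: ("context loss", "goal drift", "task orchestration failure"),
    Category.REFLECTION: ("lack of self-correction", "action without reasoning"),
}

def parent_category(error_type: str) -> Category:
    """Return the high-level category for a subtype (case-insensitive match)."""
    needle = error_type.lower()
    for category, subtypes in TAXONOMY.items():
        if needle in (subtype.lower() for subtype in subtypes):
            return category
    raise KeyError(f"unknown error type: {error_type}")

print(parent_category("Goal drift").value)
```

A closed enum makes it impossible for a classifier to emit a category outside the taxonomy without raising an error, which keeps downstream aggregation consistent.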
Knowledge Persistence
The system maintains Episodic Memory (individual trace context) and Semantic Memory (cross-trace generalized knowledge), enabling longitudinal analysis and continual improvement of diagnostic accuracy.
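The split between the two memories can be illustrated with a small sketch. It assumes, purely for illustration, that episodic memory holds the error categories found in one trace and that semantic memory generalizes them into per-category frequencies across traces; the class names, methods, and the `min_share` threshold are hypothetical and not taken from AgentCompass.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class EpisodicMemory:
    """Findings scoped to a single trace."""
    trace_id: str
    findings: list[str] = field(default_factory=list)

    def record(self, category: str) -> None:
        self.findings.append(category)

@dataclass
class SemanticMemory:
    """Knowledge generalized across traces (here: how many traces showed each category)."""
    category_counts: Counter = field(default_factory=Counter)
    traces_seen: int = 0

    def consolidate(self, episode: EpisodicMemory) -> None:
        # Count each category at most once per trace so one noisy trace cannot dominate.
        self.category_counts.update(set(episode.findings))
        self.traces_seen += 1

    def recurring(self, min_share: float = 0.5) -> list[str]:
        """Categories present in at least `min_share` of the traces seen so far."""
        return sorted(
            category
            for category, count in self.category_counts.items()
            if count / self.traces_seen >= min_share
        )

semantic = SemanticMemory()

first = EpisodicMemory("trace-1")
first.record("Tool & System Failures")
first.record("Tool & System Failures")
first.record("Reflection Gaps")
semantic.consolidate(first)

second = EpisodicMemory("trace-2")
second.record("Tool & System Failures")
semantic.consolidate(second)

print(semantic.recurring(min_share=0.75))
```

Here the tool failure appears in every trace and is flagged as systemic, while the reflection gap appears in only one of two traces and falls below the stricter threshold.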
Conclusion
AgentCompass offers a rigorous and practical foundation for real-time production monitoring and continual improvement of complex agentic systems, bridging the gap between technical benchmarks and enterprise deployment realities.