AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production
AgentCompass is the first evaluation framework purpose-built for monitoring and debugging agentic workflows in production, and it achieves state-of-the-art results on the TRAIL benchmark.
Abstract
With recent advances in LLM reasoning capabilities, adoption of agentic workflows has surged across industries. However, insufficient evaluation methods, over-reliance on technical benchmarks, and inadequate real-world robustness testing leave organizations unprepared for edge cases and contextual failures.
We present AgentCompass, the first evaluation framework purpose-built for monitoring and debugging agentic workflows in production. Unlike prior approaches that rely on static benchmarks or single-pass LLM judgments, AgentCompass employs:
- A recursive plan-and-execute reasoning cycle
- A formal hierarchical error taxonomy tailored to business-critical issues
- A dual memory system for longitudinal analysis across executions
Key Results
AgentCompass achieves state-of-the-art results on the TRAIL benchmark:
- Localization Accuracy of 0.657 on GAIA, compared with 0.546 for Gemini-2.5-Pro (a relative gain of about 20%)
- Highest Joint Score on both GAIA (0.239) and SWE-Bench (0.051)
- Highest Categorization F1 on SWE-Bench (0.232)
- Detection of critical errors that human annotations missed, including safety and reflection gaps
Multi-Stage Analytical Pipeline
The framework deconstructs trace analysis into four progressively abstract stages:
- Error Identification and Categorization - Comprehensive scan classifying errors via a formal taxonomy
- Thematic Error Clustering - Groups errors into semantically coherent clusters revealing systemic issues
- Quantitative Quality Scoring - Multi-dimensional scoring across dimensions including factual grounding, safety, and plan execution
- Synthesis and Strategic Summarization - Actionable summary with aggregate scores and priority levels
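The four stages above can be pictured as a chain of functions, each consuming the previous stage's output. The following is only a minimal sketch under stated assumptions, not the AgentCompass implementation: every name (`Finding`, `identify_errors`, `cluster_errors`, `score_trace`, `summarize`, `analyze`), the 1 to 5 severity scale, the scoring formula, and the priority thresholds are hypothetical. Stage 1 is stubbed to read pre-labelled errors from the trace, and Stage 2 groups by category label as a stand-in for semantic clustering.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Finding:
    span_id: str   # trace span where the error was observed
    category: str  # taxonomy label assigned to the error
    severity: int  # hypothetical scale: 1 (minor) to 5 (critical)


def identify_errors(trace):
    """Stage 1 (stub): collect spans that already carry an error label."""
    return [
        Finding(span["id"], span["error"], span["severity"])
        for span in trace
        if span.get("error")
    ]


def cluster_errors(findings):
    """Stage 2 (stand-in): group findings by category label."""
    clusters = defaultdict(list)
    for finding in findings:
        clusters[finding.category].append(finding)
    return dict(clusters)


def score_trace(findings, n_spans):
    """Stage 3 (toy formula): 1 minus severity-weighted error density."""
    if n_spans == 0:
        return 1.0
    penalty = sum(f.severity for f in findings) / (5 * n_spans)
    return max(0.0, 1.0 - penalty)


def summarize(clusters, score):
    """Stage 4: aggregate score, priority level, and cluster sizes."""
    worst = max((f.severity for fs in clusters.values() for f in fs), default=0)
    priority = "high" if worst >= 4 else "medium" if worst >= 2 else "low"
    return {
        "score": round(score, 3),
        "priority": priority,
        "clusters": {name: len(fs) for name, fs in clusters.items()},
    }


def analyze(trace):
    """Run the four stages in order on a single trace."""
    findings = identify_errors(trace)
    clusters = cluster_errors(findings)
    return summarize(clusters, score_trace(findings, len(trace)))
```

Calling `analyze` on a list of span dictionaries (each with an `id` and, optionally, `error` and `severity` keys) returns a small report dictionary. Keeping the stages as separate functions mirrors the progressive abstraction described above: each stage can be replaced independently, for example by swapping the stub in Stage 1 for a model-based classifier.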
Formal Error Taxonomy
The taxonomy defines five high-level categories:
- Thinking & Response Issues - Hallucinations, misinterpretation, flawed decision-making
- Safety & Security Risks - PII leakage, credential exposure, biased content
- Tool & System Failures - API failures, misconfigurations, runtime exceptions
- Workflow & Task Gaps - Context loss, goal drift, task orchestration failures
- Reflection Gaps - Lack of self-correction, action without reasoning
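One way to make such a hierarchy machine-checkable is an enumeration of the top-level categories plus a mapping to their subtypes. The sketch below is illustrative only: the identifiers (`ErrorCategory`, `TAXONOMY`, `category_of`) are hypothetical, and the subtypes are just the examples named in the list above, not the full taxonomy.

```python
from enum import Enum


class ErrorCategory(Enum):
    THINKING_RESPONSE = "Thinking & Response Issues"
    SAFETY_SECURITY = "Safety & Security Risks"
    TOOL_SYSTEM = "Tool & System Failures"
    WORKFLOW_TASK = "Workflow & Task Gaps"
    REFLECTION = "Reflection Gaps"


# Example subtypes taken from the category descriptions; stored lowercase.
TAXONOMY = {
    ErrorCategory.THINKING_RESPONSE: [
        "hallucinations", "misinterpretation", "flawed decision-making",
    ],
    ErrorCategory.SAFETY_SECURITY: [
        "pii leakage", "credential exposure", "biased content",
    ],
    ErrorCategory.TOOL_SYSTEM: [
        "api failures", "misconfigurations", "runtime exceptions",
    ],
    ErrorCategory.WORKFLOW_TASK: [
        "context loss", "goal drift", "task orchestration failures",
    ],
    ErrorCategory.REFLECTION: [
        "lack of self-correction", "action without reasoning",
    ],
}


def category_of(subtype):
    """Return the top-level category for a subtype (case-insensitive)."""
    key = subtype.strip().lower()
    for category, subtypes in TAXONOMY.items():
        if key in subtypes:
            return category
    raise KeyError(f"unknown error subtype: {subtype!r}")
```

A fixed enumeration keeps labels consistent across traces, which matters when errors are later clustered or counted by category.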
Knowledge Persistence
The system maintains Episodic Memory (individual trace context) and Semantic Memory (cross-trace generalized knowledge), enabling longitudinal analysis and continual improvement of diagnostic accuracy.
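The split between the two memories can be sketched as two small classes: one scoped to a single trace, one accumulating across traces. This is a hypothetical illustration, not the system's actual design; the class and method names (`EpisodicMemory`, `SemanticMemory`, `record`, `consolidate`, `recurring`) and the choice to track recurrence as a fraction of traces are assumptions.

```python
from collections import Counter


class EpisodicMemory:
    """Context for one trace: the findings recorded while analysing it."""

    def __init__(self, trace_id):
        self.trace_id = trace_id
        self.findings = []  # list of (category, note) tuples

    def record(self, category, note):
        self.findings.append((category, note))


class SemanticMemory:
    """Generalized knowledge across traces: how often categories recur."""

    def __init__(self):
        self.traces_seen = 0
        self.category_counts = Counter()

    def consolidate(self, episode):
        """Fold one finished episode into the cross-trace statistics."""
        self.traces_seen += 1
        # Count each category at most once per trace.
        self.category_counts.update({cat for cat, _ in episode.findings})

    def recurring(self, min_fraction=0.5):
        """Categories seen in at least min_fraction of consolidated traces."""
        if self.traces_seen == 0:
            return []
        return sorted(
            cat
            for cat, n in self.category_counts.items()
            if n / self.traces_seen >= min_fraction
        )
```

In this sketch, an `EpisodicMemory` is created per trace and discarded after `consolidate`, while the single `SemanticMemory` persists, so patterns that recur across executions surface through `recurring`.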
Conclusion
AgentCompass offers a rigorous and practical foundation for real-time production monitoring and continual improvement of complex agentic systems, bridging the gap between technical benchmarks and enterprise deployment realities.