
AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production

The first evaluation framework purpose-built for monitoring and debugging agentic workflows in production, achieving state-of-the-art results on the TRAIL benchmark.

NVJK Kartik, Garvit Sapra, Rishav Hada, Nikhil Pareek | Future AGI Research

Abstract

With recent advancements in LLM reasoning capabilities, agentic workflows have surged dramatically across industries. However, insufficient evaluation methods, over-reliance on technical benchmarks, and inadequate real-world robustness testing leave organizations unprepared for edge cases and contextual failures.

We present AgentCompass, the first evaluation framework purpose-built for monitoring and debugging agentic workflows in production. Unlike prior approaches that rely on static benchmarks or single-pass LLM judgments, AgentCompass employs:

  • A recursive plan-and-execute reasoning cycle
  • A formal hierarchical error taxonomy tailored to business-critical issues
  • A dual memory system for longitudinal analysis across executions
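The recursive plan-and-execute cycle can be pictured as a loop that decomposes an analysis goal into steps, executes them, and re-plans around failures. The paper does not publish its implementation, so every name below is an illustrative assumption, with stubs standing in for the LLM calls a real system would make:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    description: str
    result: Optional[str] = None  # None means the step has not succeeded

def plan(goal: str) -> list[Step]:
    """Decompose an analysis goal into concrete steps (stubbed)."""
    return [Step(f"{goal}: scan trace"), Step(f"{goal}: classify errors")]

def execute(step: Step) -> Step:
    """Run one step; a real system would invoke an LLM or tool here."""
    step.result = f"done({step.description})"
    return step

def plan_and_execute(goal: str, depth: int = 0, max_depth: int = 2) -> list[Step]:
    """Recursively plan, execute, and re-plan failed steps until all
    steps succeed or the recursion budget is exhausted."""
    steps = [execute(s) for s in plan(goal)]
    failed = [s for s in steps if s.result is None]
    if failed and depth < max_depth:
        for s in failed:
            steps += plan_and_execute(s.description, depth + 1, max_depth)
    return steps
```

The `max_depth` budget is the essential guard in any recursive planner of this shape: without it, a persistently failing step would re-plan forever.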

Key Results

AgentCompass achieves state-of-the-art results on the TRAIL benchmark:

  • Localization Accuracy of 0.657 on GAIA, substantially outperforming Gemini-2.5-Pro (0.546)
  • Highest Joint Score on both GAIA (0.239) and SWE-Bench (0.051)
  • Highest Categorization F1 on SWE-Bench (0.232)
  • Uncovered critical errors that human annotations missed, including safety and reflection gaps

Multi-Stage Analytical Pipeline

The framework decomposes trace analysis into four progressively more abstract stages:

  1. Error Identification and Categorization - Comprehensive scan classifying errors via a formal taxonomy
  2. Thematic Error Clustering - Groups errors into semantically coherent clusters revealing systemic issues
  3. Quantitative Quality Scoring - Multi-dimensional scoring across factual grounding, safety, plan execution
  4. Synthesis and Strategic Summarization - Actionable summary with aggregate scores and priority levels
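The four stages above compose into a simple data flow: errors feed clusters, clusters and scores feed the summary. The sketch below assumes a flat list-of-spans trace representation and naive stand-ins for each stage (clustering by category, a single error-rate proxy for scoring); none of this is the authors' API:

```python
def identify_errors(trace: list[dict]) -> list[dict]:
    """Stage 1: scan each span and tag errors with a taxonomy category."""
    return [{"span": s["id"], "category": s["error"]}
            for s in trace if s.get("error")]

def cluster_errors(errors: list[dict]) -> dict[str, list[dict]]:
    """Stage 2: group errors into themes (naively, by category)."""
    clusters: dict[str, list[dict]] = {}
    for e in errors:
        clusters.setdefault(e["category"], []).append(e)
    return clusters

def score_quality(trace: list[dict], errors: list[dict]) -> dict[str, float]:
    """Stage 3: multi-dimensional scoring; one error-rate proxy here."""
    rate = len(errors) / max(len(trace), 1)
    return {"factual_grounding": 1 - rate,
            "safety": 1 - rate,
            "plan_execution": 1 - rate}

def summarize(clusters: dict[str, list[dict]], scores: dict[str, float]) -> dict:
    """Stage 4: roll clusters and scores into an actionable summary."""
    top = max(clusters, key=lambda k: len(clusters[k])) if clusters else None
    return {"top_theme": top, "scores": scores}
```

Chaining the four functions over one trace yields the summary the last stage describes: `summarize(cluster_errors(errs), score_quality(trace, errs))`.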

Formal Error Taxonomy

Five high-level categories covering:

  • Thinking & Response Issues - Hallucinations, misinterpretation, flawed decision-making
  • Safety & Security Risks - PII leakage, credential exposure, biased content
  • Tool & System Failures - API failures, misconfigurations, runtime exceptions
  • Workflow & Task Gaps - Context loss, goal drift, task orchestration failures
  • Reflection Gaps - Lack of self-correction, action without reasoning
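One straightforward encoding of these five categories is an enum plus a subtype map. The taxonomy in the paper is hierarchical, so the subtype lists below (taken from the examples in the bullets above) are a flattened, assumed structure rather than the full taxonomy:

```python
from enum import Enum

class ErrorCategory(Enum):
    THINKING_RESPONSE = "thinking_response"  # hallucinations, misinterpretation
    SAFETY_SECURITY = "safety_security"      # PII leakage, credential exposure
    TOOL_SYSTEM = "tool_system"              # API failures, runtime exceptions
    WORKFLOW_TASK = "workflow_task"          # context loss, goal drift
    REFLECTION = "reflection"                # missing self-correction

# Illustrative subtypes per category, drawn from the examples above.
SUBTYPES: dict[ErrorCategory, list[str]] = {
    ErrorCategory.THINKING_RESPONSE: ["hallucination", "misinterpretation", "flawed_decision"],
    ErrorCategory.SAFETY_SECURITY: ["pii_leakage", "credential_exposure", "biased_content"],
    ErrorCategory.TOOL_SYSTEM: ["api_failure", "misconfiguration", "runtime_exception"],
    ErrorCategory.WORKFLOW_TASK: ["context_loss", "goal_drift", "orchestration_failure"],
    ErrorCategory.REFLECTION: ["no_self_correction", "action_without_reasoning"],
}
```

Keeping categories as a closed enum rather than free-form strings is what makes downstream clustering and categorization F1 well-defined: every tagged error maps to exactly one node of the taxonomy.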

Knowledge Persistence

The system maintains Episodic Memory (individual trace context) and Semantic Memory (cross-trace generalized knowledge), enabling longitudinal analysis and continual improvement of diagnostic accuracy.
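A minimal sketch of such a dual store, assuming a simple keyed design (the paper does not describe its retrieval mechanism): episodic memory holds per-trace findings, and a generalization pass promotes recurring error categories into semantic memory:

```python
class DualMemory:
    """Hypothetical dual memory: episodic (per-trace) findings plus
    semantic (cross-trace) generalizations."""

    def __init__(self) -> None:
        self.episodic: dict[str, list[dict]] = {}  # trace_id -> findings
        self.semantic: list[str] = []              # generalized knowledge

    def record(self, trace_id: str, finding: dict) -> None:
        """Store one finding in the episodic memory for a trace."""
        self.episodic.setdefault(trace_id, []).append(finding)

    def generalize(self, min_count: int = 2) -> None:
        """Promote error categories seen across enough findings into
        semantic memory, enabling longitudinal analysis."""
        counts: dict[str, int] = {}
        for findings in self.episodic.values():
            for f in findings:
                counts[f["category"]] = counts.get(f["category"], 0) + 1
        self.semantic = [c for c, n in counts.items() if n >= min_count]
```

The episodic/semantic split mirrors the longitudinal claim in the text: individual traces stay inspectable, while patterns that recur across executions accumulate into reusable diagnostic knowledge.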

Conclusion

AgentCompass offers a rigorous and practical foundation for real-time production monitoring and continual improvement of complex agentic systems, bridging the gap between technical benchmarks and enterprise deployment realities.

Keywords: agentic AI, evaluation, production monitoring, debugging, TRAIL benchmark
