
AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production

The first evaluation framework purpose-built for monitoring and debugging agentic workflows in production, achieving state-of-the-art results on the TRAIL benchmark.

NVJK Kartik, Garvit Sapra, Rishav Hada, Nikhil Pareek | Future AGI Research

Abstract

With recent advancements in LLM reasoning capabilities, agentic workflows have surged dramatically across industries. However, insufficient evaluation methods, over-reliance on technical benchmarks, and inadequate real-world robustness testing leave organizations unprepared for edge cases and contextual failures.

We present AgentCompass, the first evaluation framework purpose-built for monitoring and debugging agentic workflows in production. Unlike prior approaches that rely on static benchmarks or single-pass LLM judgments, AgentCompass employs:

  • A recursive plan-and-execute reasoning cycle
  • A formal hierarchical error taxonomy tailored to business-critical issues
  • A dual memory system for longitudinal analysis across executions
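The recursive plan-and-execute cycle can be pictured as a loop that decomposes an analysis goal into steps, executes them, and re-plans around failures. The paper does not publish its implementation, so every name below is an illustrative assumption, with stubs standing in for the LLM calls a real system would make:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    description: str
    result: Optional[str] = None  # None means the step has not succeeded

def plan(goal: str) -> list[Step]:
    """Decompose an analysis goal into concrete steps (stubbed)."""
    return [Step(f"{goal}: scan trace"), Step(f"{goal}: classify errors")]

def execute(step: Step) -> Step:
    """Run one step; a real system would invoke an LLM or tool here."""
    step.result = f"done({step.description})"
    return step

def plan_and_execute(goal: str, depth: int = 0, max_depth: int = 2) -> list[Step]:
    """Recursively plan, execute, and re-plan failed steps until all
    steps succeed or the recursion budget is exhausted."""
    steps = [execute(s) for s in plan(goal)]
    failed = [s for s in steps if s.result is None]
    if failed and depth < max_depth:
        for s in failed:
            steps += plan_and_execute(s.description, depth + 1, max_depth)
    return steps
```

The `max_depth` budget is the essential guard in any recursive planner of this shape: without it, a persistently failing step would re-plan forever.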

Key Results

AgentCompass achieves state-of-the-art results on the TRAIL benchmark:

  • Localization Accuracy of 0.657 on GAIA, substantially outperforming Gemini-2.5-Pro (0.546)
  • Highest Joint Score on both GAIA (0.239) and SWE-Bench (0.051)
  • Highest Categorization F1 on SWE-Bench (0.232)
  • Uncovered critical errors that human annotations missed, including safety and reflection gaps

Multi-Stage Analytical Pipeline

The framework decomposes trace analysis into four progressively more abstract stages:

  1. Error Identification and Categorization - Comprehensive scan classifying errors via a formal taxonomy
  2. Thematic Error Clustering - Groups errors into semantically coherent clusters revealing systemic issues
  3. Quantitative Quality Scoring - Multi-dimensional scoring across factual grounding, safety, plan execution
  4. Synthesis and Strategic Summarization - Actionable summary with aggregate scores and priority levels
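The four stages above compose into a simple data flow: errors feed clusters, clusters and scores feed the summary. The sketch below assumes a flat list-of-spans trace representation and naive stand-ins for each stage (clustering by category, a single error-rate proxy for scoring); none of this is the authors' API:

```python
def identify_errors(trace: list[dict]) -> list[dict]:
    """Stage 1: scan each span and tag errors with a taxonomy category."""
    return [{"span": s["id"], "category": s["error"]}
            for s in trace if s.get("error")]

def cluster_errors(errors: list[dict]) -> dict[str, list[dict]]:
    """Stage 2: group errors into themes (naively, by category)."""
    clusters: dict[str, list[dict]] = {}
    for e in errors:
        clusters.setdefault(e["category"], []).append(e)
    return clusters

def score_quality(trace: list[dict], errors: list[dict]) -> dict[str, float]:
    """Stage 3: multi-dimensional scoring; one error-rate proxy here."""
    rate = len(errors) / max(len(trace), 1)
    return {"factual_grounding": 1 - rate,
            "safety": 1 - rate,
            "plan_execution": 1 - rate}

def summarize(clusters: dict[str, list[dict]], scores: dict[str, float]) -> dict:
    """Stage 4: roll clusters and scores into an actionable summary."""
    top = max(clusters, key=lambda k: len(clusters[k])) if clusters else None
    return {"top_theme": top, "scores": scores}
```

Chaining the four functions over one trace yields the summary the last stage describes: `summarize(cluster_errors(errs), score_quality(trace, errs))`.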

Formal Error Taxonomy

Five high-level categories covering:

  • Thinking & Response Issues - Hallucinations, misinterpretation, flawed decision-making
  • Safety & Security Risks - PII leakage, credential exposure, biased content
  • Tool & System Failures - API failures, misconfigurations, runtime exceptions
  • Workflow & Task Gaps - Context loss, goal drift, task orchestration failures
  • Reflection Gaps - Lack of self-correction, action without reasoning
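One straightforward encoding of these five categories is an enum plus a subtype map. The taxonomy in the paper is hierarchical, so the subtype lists below (taken from the examples in the bullets above) are a flattened, assumed structure rather than the full taxonomy:

```python
from enum import Enum

class ErrorCategory(Enum):
    THINKING_RESPONSE = "thinking_response"  # hallucinations, misinterpretation
    SAFETY_SECURITY = "safety_security"      # PII leakage, credential exposure
    TOOL_SYSTEM = "tool_system"              # API failures, runtime exceptions
    WORKFLOW_TASK = "workflow_task"          # context loss, goal drift
    REFLECTION = "reflection"                # missing self-correction

# Illustrative subtypes per category, drawn from the examples above.
SUBTYPES: dict[ErrorCategory, list[str]] = {
    ErrorCategory.THINKING_RESPONSE: ["hallucination", "misinterpretation", "flawed_decision"],
    ErrorCategory.SAFETY_SECURITY: ["pii_leakage", "credential_exposure", "biased_content"],
    ErrorCategory.TOOL_SYSTEM: ["api_failure", "misconfiguration", "runtime_exception"],
    ErrorCategory.WORKFLOW_TASK: ["context_loss", "goal_drift", "orchestration_failure"],
    ErrorCategory.REFLECTION: ["no_self_correction", "action_without_reasoning"],
}
```

Keeping categories as a closed enum rather than free-form strings is what makes downstream clustering and categorization F1 well-defined: every tagged error maps to exactly one node of the taxonomy.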

Knowledge Persistence

The system maintains Episodic Memory (individual trace context) and Semantic Memory (cross-trace generalized knowledge), enabling longitudinal analysis and continual improvement of diagnostic accuracy.
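A minimal sketch of such a dual store, assuming a simple keyed design (the paper does not describe its retrieval mechanism): episodic memory holds per-trace findings, and a generalization pass promotes recurring error categories into semantic memory:

```python
class DualMemory:
    """Hypothetical dual memory: episodic (per-trace) findings plus
    semantic (cross-trace) generalizations."""

    def __init__(self) -> None:
        self.episodic: dict[str, list[dict]] = {}  # trace_id -> findings
        self.semantic: list[str] = []              # generalized knowledge

    def record(self, trace_id: str, finding: dict) -> None:
        """Store one finding in the episodic memory for a trace."""
        self.episodic.setdefault(trace_id, []).append(finding)

    def generalize(self, min_count: int = 2) -> None:
        """Promote error categories seen across enough findings into
        semantic memory, enabling longitudinal analysis."""
        counts: dict[str, int] = {}
        for findings in self.episodic.values():
            for f in findings:
                counts[f["category"]] = counts.get(f["category"], 0) + 1
        self.semantic = [c for c, n in counts.items() if n >= min_count]
```

The episodic/semantic split mirrors the longitudinal claim in the text: individual traces stay inspectable, while patterns that recur across executions accumulate into reusable diagnostic knowledge.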

Conclusion

AgentCompass offers a rigorous and practical foundation for real-time production monitoring and continual improvement of complex agentic systems, bridging the gap between technical benchmarks and enterprise deployment realities.

Keywords: agentic AI, evaluation, production monitoring, debugging, TRAIL benchmark
