Evaluation

What Is TruLens?

TruLens is an open-source LLM evaluation and observability framework for recording LLM application calls and scoring them with feedback functions. It belongs to the eval family, especially RAG evaluation, where it surfaces groundedness, answer relevance, and context relevance across an eval pipeline or production trace review. Teams use TruLens to compare prompts, retrievers, and model versions, then pair its findings with FutureAGI evaluators such as Groundedness, ContextRelevance, and AnswerRelevancy for release decisions.
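
As a concrete anchor, the record-then-score pattern looks roughly like this on the TruLens side. This is a hedged sketch against the pre-1.0 trulens_eval package layout, which has since been reorganized, so check the installed version; rag_chain stands in for any LangChain app and is not defined here.

from trulens_eval import Tru, Feedback, TruChain
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# Feedback function: does the final answer address the user input?
f_answer_relevance = Feedback(provider.relevance).on_input_output()

tru = Tru()  # local store for app records and feedback scores
recorder = TruChain(rag_chain, app_id="support-rag",
                    feedbacks=[f_answer_relevance])

with recorder:  # calls made inside the context are recorded and scored
    rag_chain.invoke("What plan tier includes audit logs?")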

Why TruLens Matters in Production LLM and Agent Systems

TruLens matters because framework metrics are often the first production evidence that a RAG or agent change broke semantics rather than uptime. Without that evidence, a retriever can return plausible but wrong passages: final answers read fluently while citations fail, groundedness falls, and users receive unsupported policy guidance. Another failure mode is evaluation drift: a prompt change improves demo examples but degrades a tenant-specific workflow, while the dashboard average hides the failing cohort.

The pain is split across owners. Developers debug prompt and retriever changes without knowing which step failed. SREs see p99 latency, token cost, and retry rate increase but lack the quality signal that explains the operational pattern. Product teams see lower answer acceptance and more escalations. Compliance teams lose the evidence trail needed to prove that a generated answer used approved context.

Symptoms show up as low feedback scores on groundedness or context relevance, repeated low-scoring traces from one retriever route, score variance after a model swap, and support tickets that mention missing citations. In 2026-era agentic systems, those symptoms are harder to isolate because one user request can include retrieval, planning, tool calls, and final synthesis. TruLens-style run records help locate the weak step before teams over-correct the whole application.

How FutureAGI Handles TruLens

FutureAGI’s approach is to make TruLens-style evidence operational: turn each feedback concern into a named evaluator, attach it to a dataset or trace cohort, and use thresholds to decide whether a release proceeds. The closest eval surfaces are Groundedness, ContextRelevance, and AnswerRelevancy from fi.evals. For agentic RAG, teams often add ToolSelectionAccuracy or TaskCompletion when a low final score comes from the agent path rather than the retriever.

A real workflow starts with a support RAG application instrumented through traceAI-langchain. Each sampled trace carries a trace_id, prompt version, model name, retrieved chunks, final answer, and user cohort. The engineer builds a FutureAGI dataset from failed and passing traces, then attaches Dataset.add_evaluation() entries for groundedness, context relevance, and answer relevancy. If the team already uses TruLens, its feedback outputs can be treated as a discovery signal, then re-run inside the same FutureAGI threshold and regression-eval workflow.
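
The dataset step reduces to a few calls. In this sketch, Dataset.add_evaluation() is the method named above; the import path, constructor arguments, and evaluator-name strings are illustrative assumptions, not confirmed fi SDK signatures.

from fi.datasets import Dataset  # assumed module path, not a confirmed import

# Assumed constructor: a dataset built from sampled passing and failing traces.
dataset = Dataset(name="support-rag-regressions")

# Attach one evaluation entry per feedback concern named in the workflow.
for evaluation in ("Groundedness", "ContextRelevance", "AnswerRelevancy"):
    dataset.add_evaluation(evaluation)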

The next action is concrete. If ContextRelevance drops below 0.75 for one knowledge-base version, the engineer inspects the retrieval spans and rolls back chunking. If Groundedness fails while context relevance stays high, the prompt is likely ignoring sources or synthesizing unsupported claims. Unlike Ragas, which is strongest when the problem is RAG metric design, this FutureAGI workflow keeps evaluator results tied to traces, owners, release gates, and alert policy.
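
That decision rule is simple enough to encode directly. The 0.75 context-relevance cutoff comes from the workflow above; reusing it for groundedness is an illustrative assumption.

def triage(groundedness: float, context_relevance: float) -> str:
    """Route a failed trace to the retrieval path or the generation path."""
    if context_relevance < 0.75:
        # Retrieval is weak: inspect retrieval spans, roll back chunking.
        return "retrieval issue"
    if groundedness < 0.75:
        # Context was fine but the prompt ignored sources or added claims.
        return "generation issue"
    return "pass"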

How to Measure or Detect TruLens Results

Measure TruLens output by translating feedback functions into repeatable eval signals:

  • Groundedness — scores whether the response is supported by the supplied context; track failure rate by prompt version and retriever route.
  • ContextRelevance — scores whether retrieved chunks match the user request; use it before blaming the generator.
  • AnswerRelevancy — checks whether the final answer addresses the question, even when sources are valid.
  • Dashboard signal — eval-fail-rate-by-cohort, score variance after model change, and threshold breaches per dataset version.
  • Trace signal — link failed evals to trace_id, retrieved chunks, model name, prompt version, and user-feedback proxies such as thumbs-down rate.

Minimal FutureAGI pairing:

from fi.evals import Groundedness, ContextRelevance, AnswerRelevancy

# One question, the generated answer, and the retrieved context to score.
q = "What plan tier includes audit logs?"
answer = "Enterprise includes audit logs."
docs = ["Enterprise plan includes audit logs and SSO."]

# Run all three evaluators over the same triple; print score and rationale.
for metric in [Groundedness(), ContextRelevance(), AnswerRelevancy()]:
    result = metric.evaluate(input=q, output=answer, context=docs)
    print(metric.__class__.__name__, result.score, result.reason)

Use the pattern to compare TruLens feedback with FutureAGI thresholds rather than treating either dashboard as a standalone truth source.
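
For the dashboard signals listed above, eval-fail-rate-by-cohort is a plain aggregation once scores are exported with trace metadata. The rows below are illustrative, not real export output.

import pandas as pd

rows = pd.DataFrame([
    {"trace_id": "t1", "cohort": "tenant-a", "metric": "Groundedness", "score": 0.62},
    {"trace_id": "t2", "cohort": "tenant-a", "metric": "Groundedness", "score": 0.68},
    {"trace_id": "t3", "cohort": "tenant-b", "metric": "Groundedness", "score": 0.95},
])

THRESHOLD = 0.75  # release-gate cutoff from the workflow above
rows["fail"] = rows["score"] < THRESHOLD

# Group by cohort so one failing tenant is not averaged away.
print(rows.groupby(["cohort", "metric"])["fail"].mean())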

Common Mistakes

  • Treating TruLens as one metric. It is a framework around feedback functions; name the exact signal before setting a release gate.
  • Averaging away cohort failures. A global score can hide one tenant, language, retriever, or document family that is failing hard.
  • Skipping trace linkage. A low feedback score without the prompt, chunks, answer, and tool path is hard to debug.
  • Using RAG scores for agent tools. Groundedness does not prove the agent selected the right API or completed the task.
  • Copying thresholds between frameworks. TruLens feedback scores and FutureAGI eval scores may have different distributions; calibrate on the same labeled set, as sketched below.
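
A lightweight calibration: sweep candidate thresholds over one human-labeled set and keep the cutoff that best reproduces the labels, instead of copying a cutoff across frameworks. The labels and scores here are illustrative.

labels = [1, 1, 0, 1, 0, 0]               # human pass/fail on the labeled set
fi_scores = [0.9, 0.8, 0.7, 0.85, 0.5, 0.65]  # evaluator scores on the same set

def accuracy_at(threshold):
    # Agreement between thresholded scores and human labels.
    preds = [int(s >= threshold) for s in fi_scores]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Pick the candidate threshold that best matches the human labels.
best = max((t / 100 for t in range(50, 100, 5)), key=accuracy_at)
print(f"calibrated threshold: {best:.2f}")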

Frequently Asked Questions

What is TruLens?

TruLens is an open-source framework for recording LLM app calls and scoring them with feedback functions such as groundedness, answer relevance, and context relevance. FutureAGI maps the same questions to evaluator classes, traceAI spans, and monitored thresholds.

How is TruLens different from Ragas?

Ragas focuses on RAG evaluation metrics such as faithfulness and context precision. TruLens also records application runs and attaches feedback functions to those records, so engineers can inspect scores beside call history.

How do you measure TruLens results?

In FutureAGI, map TruLens-style signals to fi.evals classes such as Groundedness, ContextRelevance, and AnswerRelevancy. Track eval-fail-rate-by-cohort and link failures back to trace IDs.