What Is TruLens?

An open-source LLM evaluation framework that records app runs and scores them with feedback functions for groundedness, relevance, and RAG quality.

TruLens is an open-source LLM evaluation framework that records LLM application calls and scores them with feedback functions. It sits in the evaluation family and is used mostly for RAG evaluation, where it reports groundedness, answer relevance, and context relevance across an eval pipeline or production trace review. Teams use TruLens to compare prompts, retrievers, and model versions, then pair its findings with FutureAGI evaluators such as Groundedness, ContextRelevance, and AnswerRelevancy for release decisions. In 2026, TruLens remains popular for ad-hoc local prototyping, but many production teams have moved their gating evals to a hosted platform (FutureAGI, Braintrust, or Langfuse) so that the eval surface and the trace store share one audit trail.

Why TruLens matters in production LLM and agent systems

TruLens matters because framework metrics often become the first production evidence that a RAG or agent change broke semantics rather than uptime. Without those metrics, a retriever can return plausible but wrong passages: final answers look fluent while citations fail, groundedness falls, and users get unsupported policy guidance. Another failure mode is evaluation drift, where a prompt change improves demo examples but degrades a tenant-specific workflow while the dashboard average hides the affected cohort.

The pain is split across owners. Developers debug prompt and retriever changes without knowing which step failed. SREs see p99 latency, token cost, and retry rate increase but lack the quality signal that explains the operational pattern. Product teams see lower answer acceptance and more escalations. Compliance teams lose the evidence trail needed to prove that a generated answer used approved context.

Symptoms show up as low feedback scores on groundedness or context relevance, repeated low-scoring traces from one retriever route, score variance after a model swap, and support tickets that mention missing citations. In 2026-era agentic systems, those symptoms are harder to isolate because one user request can include retrieval, planning, tool calls, and final synthesis. TruLens-style run records help locate the weak step before teams over-correct the whole application.

How FutureAGI Handles TruLens

FutureAGI’s approach is to make TruLens-style evidence operational: turn each feedback concern into a named evaluator, attach it to a dataset or trace cohort, and use thresholds to decide whether a release proceeds. The closest eval surfaces are Groundedness, ContextRelevance, and AnswerRelevancy from fi.evals. For agentic RAG, teams often add ToolSelectionAccuracy or TaskCompletion when a low final score comes from the agent path rather than the retriever.
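As a rough illustration of the threshold-based release decision described above, the sketch below compares mean evaluator scores for one cohort against per-evaluator minimums. It is plain Python, not the FutureAGI SDK: the function name `release_gate`, the shape of the score dictionary, and the 0.80 minimums are assumptions made for illustration (only the 0.75 ContextRelevance figure appears elsewhere in this article).

```python
from statistics import mean

# Hypothetical minimum mean scores per evaluator; calibrate these on labeled data.
THRESHOLDS = {"Groundedness": 0.80, "ContextRelevance": 0.75, "AnswerRelevancy": 0.80}


def release_gate(scores_by_evaluator, thresholds=THRESHOLDS):
    """Return (passed, failures); failures maps evaluator name -> mean score below its minimum."""
    failures = {}
    for name, minimum in thresholds.items():
        scores = scores_by_evaluator.get(name, [])
        if not scores:
            failures[name] = None  # no evidence for this evaluator blocks the release
            continue
        avg = mean(scores)
        if avg < minimum:
            failures[name] = round(avg, 3)
    return (not failures, failures)


# Illustrative scores for one trace cohort (made-up numbers).
cohort_scores = {
    "Groundedness": [0.9, 0.85, 0.7],
    "ContextRelevance": [0.6, 0.7, 0.8],
    "AnswerRelevancy": [0.9, 0.95, 0.85],
}
passed, failures = release_gate(cohort_scores)
print(passed, failures)
```

Treating a missing evaluator as a failure is a deliberate choice in this sketch: a release with no groundedness evidence is not the same as a release that passed the groundedness check.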

A real workflow starts with a support RAG application instrumented through traceAI-langchain. Each sampled trace carries a trace_id, prompt version, model name, retrieved chunks, final answer, and user cohort. The engineer builds a FutureAGI dataset from failed and passing traces, then attaches Dataset.add_evaluation() entries for groundedness, context relevance, and answer relevancy. If the team already uses TruLens, its feedback outputs can be treated as a discovery signal, then re-run inside the same FutureAGI threshold and regression-eval workflow.

The next action is concrete. If ContextRelevance drops below 0.75 for one knowledge-base version, the engineer inspects the retrieval spans and rolls back chunking. If Groundedness fails while context relevance stays high, the prompt is likely ignoring sources or synthesizing unsupported claims.

We recommend calibrating TruLens-style feedback against public RAG suites. RAGTruth (roughly 18K annotated responses across question answering, summarization, and data-to-text) and RAGBench give reference distributions, so teams can interpret a TruLens groundedness score of 0.78 against the frontier baseline of roughly 0.90-0.94 on the same task family rather than against a threshold picked by intuition. Unlike Ragas, which is strongest when the problem is RAG metric design, this FutureAGI workflow keeps evaluator results tied to traces, owners, release gates, and alert policy.
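The two diagnostic rules in this section (low context relevance points at retrieval, low groundedness with healthy context relevance points at the prompt) can be written as a small triage function. This is a minimal sketch in plain Python: the function name `triage` and the 0.80 groundedness cutoff are assumptions, while the 0.75 context-relevance cutoff comes from the text.

```python
def triage(context_relevance, groundedness, cr_min=0.75, g_min=0.80):
    """Map a pair of scores to the next debugging step.

    cr_min follows the 0.75 figure in the text; g_min is a hypothetical cutoff.
    """
    if context_relevance < cr_min:
        # Retrieval is the first suspect: the generator never saw good context.
        return "inspect retrieval spans; consider rolling back chunking"
    if groundedness < g_min:
        # Context was relevant, so the answer is drifting away from its sources.
        return "review prompt: answer may ignore sources or add unsupported claims"
    return "no action"


print(triage(context_relevance=0.62, groundedness=0.90))
print(triage(context_relevance=0.91, groundedness=0.55))
print(triage(context_relevance=0.91, groundedness=0.93))
```

Checking context relevance first keeps the generator from being blamed for a retrieval failure, which matches the order of the rules above.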

TruLens feedback function → FutureAGI evaluator mapping

TruLens feedback function | FutureAGI evaluator                          | What it checks
--------------------------|----------------------------------------------|----------------------------------------
Groundedness              | Groundedness                                 | Answer supported by retrieved context
Context Relevance         | ContextRelevance                             | Retrieved chunks match the user’s query
Answer Relevance          | AnswerRelevancy                              | Final answer addresses the question
Harmfulness               | ContentSafety / Toxicity                     | Harmful or abusive output detection
Stereotypes               | BiasDetection                                | Fairness-risk fingerprints in output
Coherence                 | ReasoningQuality                             | Internal logical structure
Conciseness               | CustomEvaluation (rubric)                    | Verbosity bounds
Agent tool-use checks     | ToolSelectionAccuracy + FunctionCallAccuracy | Per-step tool grading
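For teams that script their migration, the mapping listed above can be kept as a plain lookup table. The sketch below only handles names as strings; it does not import or call either library, and the helper `evaluators_for` is a hypothetical name introduced for illustration.

```python
# Mapping transcribed from the list above. Keys and values are labels, not imports.
TRULENS_TO_FUTUREAGI = {
    "Groundedness": ["Groundedness"],
    "Context Relevance": ["ContextRelevance"],
    "Answer Relevance": ["AnswerRelevancy"],
    "Harmfulness": ["ContentSafety", "Toxicity"],
    "Stereotypes": ["BiasDetection"],
    "Coherence": ["ReasoningQuality"],
    "Conciseness": ["CustomEvaluation"],  # rubric-based
    "Agent tool-use checks": ["ToolSelectionAccuracy", "FunctionCallAccuracy"],
}


def evaluators_for(feedback_names):
    """Return de-duplicated evaluator names, in order, for the given feedback functions."""
    seen = []
    for name in feedback_names:
        for evaluator in TRULENS_TO_FUTUREAGI.get(name, []):
            if evaluator not in seen:
                seen.append(evaluator)
    return seen


print(evaluators_for(["Groundedness", "Harmfulness", "Agent tool-use checks"]))
```

Unknown feedback names map to an empty list here, so an unmapped signal is silently skipped; a stricter version could raise an error instead.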

How to measure or detect TruLens results

Measure TruLens output by translating feedback functions into repeatable eval signals:

  • Groundedness: returns whether the response is supported by the supplied context; track failure rate by prompt version and retriever route.
  • ContextRelevance: scores whether retrieved chunks match the user request; use it before blaming the generator.
  • AnswerRelevancy: checks whether the final answer addresses the question, even when sources are valid.
  • Dashboard signal: eval-fail-rate-by-cohort, score variance after model change, and threshold breaches per dataset version.
  • Trace signal: link failed evals to trace_id, retrieved chunks, model name, prompt version, and user-feedback proxies such as thumbs-down rate.
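The eval-fail-rate-by-cohort signal listed above can be computed from per-trace scores with a few lines of plain Python. The sketch below assumes each record is a dictionary with `trace_id`, `cohort`, and `score` keys; that record shape, the cohort names, and the numbers are illustrative assumptions, not an export format from either tool.

```python
from collections import defaultdict


def fail_rate_by_cohort(records, threshold):
    """Return {cohort: fraction of records whose score is below threshold}."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for record in records:
        totals[record["cohort"]] += 1
        if record["score"] < threshold:
            fails[record["cohort"]] += 1
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}


# Made-up groundedness scores for two tenant cohorts.
records = [
    {"trace_id": "t1", "cohort": "tenant-a", "score": 0.92},
    {"trace_id": "t2", "cohort": "tenant-a", "score": 0.88},
    {"trace_id": "t3", "cohort": "tenant-a", "score": 0.95},
    {"trace_id": "t4", "cohort": "tenant-b", "score": 0.40},
    {"trace_id": "t5", "cohort": "tenant-b", "score": 0.90},
]
rates = fail_rate_by_cohort(records, threshold=0.75)
overall = sum(r["score"] for r in records) / len(records)
print(rates, round(overall, 2))
```

In this toy data the global mean stays above the 0.75 cutoff while half of the tenant-b traces fail, which is the pattern a single dashboard average hides.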

Minimal FutureAGI pairing:

from fi.evals import Groundedness, ContextRelevance, AnswerRelevancy

# One question, the generated answer, and the retrieved context it should rest on.
q = "What plan tier includes audit logs?"
answer = "Enterprise includes audit logs."
docs = ["Enterprise plan includes audit logs and SSO."]

# Run each evaluator on the same triple and print its score with the stated reason.
for metric in [Groundedness(), ContextRelevance(), AnswerRelevancy()]:
    result = metric.evaluate(input=q, output=answer, context=docs)
    print(metric.__class__.__name__, result.score, result.reason)

Use the pattern to compare TruLens feedback with FutureAGI thresholds rather than treating either dashboard as a standalone truth source.
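One way to compare the two sources is to list the traces where they land on opposite sides of their own thresholds. The sketch below is plain Python and assumes both sets of scores have already been exported as `{trace_id: score}` dictionaries; the function name `disagreements`, the thresholds, and the sample scores are hypothetical.

```python
def disagreements(trulens_scores, futureagi_scores, t_threshold, f_threshold):
    """Return sorted trace_ids where one framework passes and the other fails."""
    flagged = []
    # Only traces scored by both frameworks can be compared.
    for trace_id in sorted(trulens_scores.keys() & futureagi_scores.keys()):
        t_pass = trulens_scores[trace_id] >= t_threshold
        f_pass = futureagi_scores[trace_id] >= f_threshold
        if t_pass != f_pass:
            flagged.append(trace_id)
    return flagged


# Made-up exported scores keyed by trace_id.
trulens_scores = {"t1": 0.90, "t2": 0.55, "t3": 0.80}
futureagi_scores = {"t1": 0.85, "t2": 0.80, "t3": 0.60, "t4": 0.90}
flagged = disagreements(trulens_scores, futureagi_scores, t_threshold=0.70, f_threshold=0.75)
print(flagged)
```

The flagged traces are good candidates for manual review, since each one shows that at least one of the two scorers is wrong about that answer.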

Common mistakes

  • Treating TruLens as one metric. It is a framework around feedback functions; name the exact signal before setting a release gate.
  • Averaging away cohort failures. A global score can hide one tenant, language, retriever, or document family that is failing hard.
  • Skipping trace linkage. A low feedback score without the prompt, chunks, answer, and tool path is hard to debug.
  • Using RAG scores for agent tools. Groundedness does not prove the agent selected the right API or completed the task.
  • Copying thresholds between frameworks. TruLens feedback scores and FutureAGI eval scores may have different distributions; calibrate on the same labeled set.
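The last point, calibrating on the same labeled set, can be sketched as a simple threshold search: for each framework, choose the cutoff whose pass/fail decisions agree most often with human labels. This is a minimal illustration in plain Python; the function name `best_threshold`, the accuracy criterion, and the sample scores are assumptions, and a real calibration would use a larger labeled set and possibly a different metric.

```python
def best_threshold(scores, labels, candidates=None):
    """Pick the cutoff maximizing accuracy of (score >= cutoff) against boolean labels.

    Ties go to the lowest candidate cutoff.
    """
    if candidates is None:
        candidates = sorted(set(scores))
    best_cut, best_acc = None, -1.0
    for cut in candidates:
        correct = sum((s >= cut) == y for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut, best_acc


# Same five human-labeled answers, scored by two hypothetical frameworks.
labels = [True, True, False, False, True]
framework_a = [0.90, 0.80, 0.60, 0.50, 0.85]
framework_b = [0.70, 0.65, 0.50, 0.30, 0.60]

cut_a = best_threshold(framework_a, labels)
cut_b = best_threshold(framework_b, labels)
print(cut_a, cut_b)
```

With these made-up scores the two frameworks separate the labeled answers equally well but at different cutoffs, so copying one framework's threshold to the other would misclassify answers.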

Frequently Asked Questions

What is TruLens?

TruLens is an open-source framework for recording LLM app calls and scoring them with feedback functions such as groundedness, answer relevance, and context relevance. FutureAGI maps the same questions to evaluator classes, traceAI spans, and monitored thresholds.

How is TruLens different from Ragas?

Ragas focuses on RAG evaluation metrics such as faithfulness and context precision. TruLens also records application runs and attaches feedback functions to those records, so engineers can inspect scores beside call history.

How do you measure TruLens results?

In FutureAGI, map TruLens-style signals to fi.evals classes such as Groundedness, ContextRelevance, and AnswerRelevancy. Track eval-fail-rate-by-cohort and link failures back to trace IDs.