What Is LGTMK?

LGTMK is an extended form of the Grafana LGTM observability stack — Loki for logs, Grafana for dashboards, Tempo for traces, Mimir for metrics — with K added to denote Kafka-based shipping or Kubernetes-native deployment. It is a common open-source path for self-hosted observability of distributed systems, including LLM and agent applications. In an AI-application context, an OpenTelemetry-compatible instrumentation layer emits spans to Tempo, metrics to Mimir, and logs to Loki, while the evaluation surface (LLM-as-a-judge, programmatic checks, judge-model rubrics) runs as a separate pipeline writing scores back as span attributes.
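
Wiring that instrumentation layer by hand takes only a few lines of OTel SDK setup. A minimal sketch, assuming an OTLP endpoint at tempo:4317 (Tempo directly, or a Kafka-fronted collector); the endpoint and service name are illustrative:

# Minimal OpenTelemetry setup pointing at a self-hosted Tempo.
# Endpoint and service name are placeholders; substitute your cluster's
# OTLP ingest address (Tempo itself, or a Kafka-fronted collector).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "llm-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)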

Why It Matters in Production LLM and Agent Systems

Self-hosted observability is the default in regulated, security-conscious, and cost-sensitive environments. A bank running LLM-driven workflows may not be allowed to ship traces to a SaaS platform. A team scaling to billions of spans per month may find managed-platform pricing untenable and migrate to LGTMK. A platform engineer building a multi-tenant AI product needs to give each tenant audit-grade trace retention without exporting their data.

The pain shows up across roles. An SRE picks LGTM and finds Tempo’s default retention is not enough for a year-long legal-audit window. A platform lead deploys Mimir cheaply but discovers cardinality explosion when LLM token-count metrics include user-id labels. A product manager wants LLM-quality dashboards alongside latency dashboards but realises the eval scores are not in Mimir — they live in some other system, and joining them is manual.

In 2026 LLM stacks, the observability story has converged on OpenTelemetry semantic conventions for gen_ai.* attributes: model name, token counts, prompt, response, tool calls, latency. LGTMK, like every other backend, ingests them as long as the instrumentation layer follows the spec. The interesting question is whether the LLM-specific evaluation lives inside the stack or alongside it.
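
For code paths an auto-instrumentor does not cover, a hand-rolled span following those conventions is straightforward. A sketch using attribute keys from the incubating OTel GenAI spec; the model names and token counts are illustrative:

# Hand-rolled span using gen_ai.* attribute keys from the incubating
# OTel GenAI semantic conventions; model names and counts are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... call the model here ...
    span.set_attribute("gen_ai.usage.input_tokens", 412)
    span.set_attribute("gen_ai.usage.output_tokens", 128)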

How FutureAGI Handles LGTMK Integration

FutureAGI does not replace LGTMK — it complements it. The integration runs on OpenTelemetry: traceAI instrumentations for LangChain, LlamaIndex, OpenAI Agent SDK, Pydantic-AI, MCP, and 50+ frameworks emit OTel-compatible spans carrying llm.model.name, llm.token_count.prompt, llm.token_count.completion, and tool-call attributes. These spans land in Tempo (or any OTel-compatible backend) just like any other span. Mimir captures derived metrics; Loki captures structured logs.

What FutureAGI adds on top is the evaluation layer. A team running LGTMK keeps their observability pipeline as is and adds FutureAGI to the eval path: sample 5% of production traces, run AnswerRelevancy, Faithfulness, HallucinationScore, and JSONValidation, and write the results back to the trace as span events using the gen_ai.evaluation.* attributes. Grafana dashboards then plot eval-fail-rate alongside latency p99 and cost-per-trace, joining on trace_id. The team gets one dashboard, two systems — LGTMK for telemetry, FutureAGI for evaluation — both running on the same OpenTelemetry contract.
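
A minimal sketch of that write-back path. The 5% sampler, the evaluator call signature, and the result field are assumptions for illustration; only the evaluator names come from the integration itself:

# Sketch of the eval write-back path. The sampler, the evaluate() call
# signature, and the result field are assumptions, not the documented API;
# the span event follows the gen_ai.evaluation.* pattern described above.
import random

from fi.evals import AnswerRelevancy

relevancy = AnswerRelevancy()

def maybe_evaluate(span, prompt: str, response: str) -> None:
    if random.random() >= 0.05:  # sample ~5% of production requests
        return
    result = relevancy.evaluate(prompt=prompt, response=response)  # hypothetical signature
    # Write the score back onto the live span as a span event
    span.add_event(
        "gen_ai.evaluation.result",
        attributes={
            "gen_ai.evaluation.name": "AnswerRelevancy",
            "gen_ai.evaluation.score.value": result.score,  # hypothetical field
        },
    )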

For the gateway side, the Agent Command Center can also emit OTel spans for routing decisions, fallback triggers, and guardrail blocks, so the LGTMK trace view tells the full story of a request including which routing-policy fired and which pre-guardrail blocked it.
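
What such a gateway span might carry, as a sketch; the acc.* attribute names are hypothetical placeholders, not a published convention:

# Illustrative gateway-side span from the Agent Command Center; the acc.*
# attribute names are hypothetical, not a published convention.
from opentelemetry import trace

tracer = trace.get_tracer("agent-command-center")

with tracer.start_as_current_span("route llm-request") as span:
    span.set_attribute("acc.routing.policy", "latency-weighted")  # which policy fired
    span.set_attribute("acc.routing.fallback_triggered", False)
    span.set_attribute("acc.guardrail.pre.blocked", False)  # pre-guardrail verdict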

How to Measure or Detect It

LGTMK-side telemetry combines with FutureAGI’s evaluation signal at the trace layer:

  • llm.token_count.prompt / llm.token_count.completion — OTel attributes for token cost.
  • llm.model.name — model version captured per span.
  • agent.trajectory.step — multi-step agent context attribute.
  • AnswerRelevancy / Faithfulness — evaluator scores written as span events.
  • Eval-fail-rate-by-route — Grafana dashboard joining FutureAGI scores with route attributes.
  • Latency p99 by model — metric derived from span data and stored in Mimir.
  • Span-event count per trace — guardrail and evaluation activity per request.
# Instrument with traceAI; OTel-compatible spans flow to your LGTMK Tempo
from traceai.langchain import LangChainInstrumentor

LangChainInstrumentor().instrument()

# Run FutureAGI evaluators on sampled traces
from fi.evals import AnswerRelevancy, Faithfulness

ar = AnswerRelevancy()
fa = Faithfulness()
# Scores can be written back to the originating trace as span events via the
# trace SDK, so Grafana can join them with latency and cost panels on trace_id

Common Mistakes

  • High-cardinality metrics in Mimir. Including user-id as a label explodes the metrics store; use exemplars or trace-sampled metrics instead (see the sketch after this list).
  • Treating Tempo retention as “set and forget”. Audit windows can be years; budget storage accordingly.
  • Logging full prompts in Loki without redaction. PII in log lines is a compliance violation; redact at the instrumentation layer.
  • Missing the eval layer entirely. Telemetry shows that requests happened; without evaluators you do not know whether they were correct.
  • Using non-standard attribute names. Stick to OpenTelemetry gen_ai.* semantic conventions so dashboards, evaluators, and queries align.
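
To make the cardinality point concrete, here is a sketch of a token-count histogram keyed by model only; the metric name follows the gen_ai.* metric conventions but is illustrative here:

# Token-count histogram keyed by model only; adding user.id as a label
# would create one Mimir series per user. Metric name is illustrative.
from opentelemetry import metrics

meter = metrics.get_meter(__name__)
token_usage = meter.create_histogram("gen_ai.client.token.usage", unit="{token}")

def record_tokens(model: str, tokens: int) -> None:
    token_usage.record(tokens, attributes={"gen_ai.request.model": model})
    # not attributes={"user.id": ...}: per-user detail belongs in traces/exemplars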

Frequently Asked Questions

What is LGTMK?

LGTMK extends the Grafana LGTM observability stack — Loki, Grafana, Tempo, Mimir — with K for Kafka or Kubernetes-native shipping. It is a common open-source path for ingesting logs, traces, and metrics from AI applications.

How is LGTMK different from a managed observability platform?

LGTMK is self-hosted; you operate Loki, Tempo, Mimir, and the Kafka or Kubernetes pipeline yourself. Managed platforms trade that control for operational simplicity. LLM-specific evaluation usually layers on top of either choice via OpenTelemetry.

How does FutureAGI integrate with LGTMK?

FutureAGI's traceAI integrations emit OpenTelemetry-compatible spans that any LGTMK pipeline can ingest into Tempo. FutureAGI then runs evaluators like AnswerRelevancy and Faithfulness against the same spans and writes scores back as span events.