What Is a Tensor?

A tensor is a multi-dimensional numeric array — the universal data container that machine-learning frameworks use to hold inputs, weights, activations, and gradients. A scalar is a 0-D tensor; a vector is 1-D; a matrix is 2-D; a batched LLM hidden state is typically 3-D or 4-D with shape (batch, sequence, hidden) or (batch, head, sequence, head_dim). Frameworks like PyTorch, JAX, and TensorFlow schedule tensor operations onto GPU kernels. In a FutureAGI trace, tensors do not appear directly — but the token counts and embeddings they produce do, surfaced as OpenTelemetry attributes.
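The rank progression above can be sketched in a few lines of NumPy; the shapes are illustrative, with 768 standing in for a typical hidden size:

```python
import numpy as np

scalar = np.array(3.14)              # rank 0: a single number
vector = np.zeros(768)               # rank 1: e.g. one token embedding
matrix = np.zeros((768, 768))        # rank 2: e.g. a weight matrix
hidden = np.zeros((4, 128, 768))     # rank 3: (batch, sequence, hidden)
scores = np.zeros((4, 12, 128, 64))  # rank 4: (batch, head, sequence, head_dim)

for t in (scalar, vector, matrix, hidden, scores):
    print(t.ndim, t.shape)
```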

Why It Matters in Production LLM and Agent Systems

Tensor shape is where most silent inference bugs originate. A mis-shaped attention mask drops half the input. A (batch, seq) swap with (seq, batch) makes the model attend to the wrong tokens. Bad pad-token handling causes garbled outputs only at the longest sequences. None of these surface as exceptions — the model returns text, the API returns 200, and the eval pass rate quietly drops three points.
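A minimal NumPy sketch of how such a bug stays silent: calling reshape to "swap" batch and sequence reinterprets the buffer rather than transposing it, and no exception is ever raised.

```python
import numpy as np

scores = np.arange(6).reshape(2, 3)  # (batch=2, seq=3)

swapped = scores.reshape(3, 2)       # WRONG: reinterprets memory, raises nothing
transposed = scores.T                # correct (seq, batch) view of the same data

print(swapped)
print(transposed)
print(np.array_equal(swapped, transposed))  # False: same shape, different data
```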

ML engineers feel this when retraining: gradient norms explode, loss diverges, and the culprit is a single transposed dimension. Inference engineers feel it when batching: the same prompt produces different completions at batch size 1 versus 32 because of left-padding versus right-padding interactions with rotary embeddings. SREs feel it when a vLLM upgrade silently changes the KV-cache layout and p99 latency spikes 40%.
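The left- versus right-padding interaction can be illustrated with naive position ids; the token ids and pad id 0 below are made up:

```python
import numpy as np

tokens, pad_id, max_len = [101, 2054, 2003], 0, 6

right = tokens + [pad_id] * (max_len - len(tokens))  # batch-1 style padding
left = [pad_id] * (max_len - len(tokens)) + tokens   # common batched style

positions = np.arange(max_len)
# A naive arange assigns the real tokens positions 0..2 under right padding
# but 3..5 under left padding, shifting every rotary phase angle unless
# position ids are derived from the attention mask instead.
print(positions[np.array(right) != pad_id])  # [0 1 2]
print(positions[np.array(left) != pad_id])   # [3 4 5]
```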

For 2026-era stacks, tensor mechanics matter beyond training. Multimodal models pack image, audio, and text into a single tensor; voice agents pipe audio frames through a (batch, channels, samples) tensor before tokenization; agent runtimes serialize tool-call results back into token tensors. A wrong shape propagates through the trajectory and shows up not as a crash but as a quality regression on TaskCompletion or Faithfulness. FutureAGI’s role is to catch the regression — not to fix the tensor — but engineers should know which layer broke.
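As a sketch of the voice-agent case, a mono clip (sample rate and duration are assumed values) enters the pipeline as a (batch, channels, samples) tensor before tokenization:

```python
import numpy as np

sample_rate = 16_000                     # assumed 16 kHz mono capture
clip = np.random.randn(sample_rate * 2)  # 2 seconds of raw samples, shape (32000,)

frames = clip.reshape(1, 1, -1)          # (batch=1, channels=1, samples=32000)
print(frames.shape)
```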

How FutureAGI Handles Tensors

FutureAGI does not allocate, train, or reshape tensors — that is the job of PyTorch, JAX, vLLM, and the model itself. FutureAGI is the evaluation and observability layer above the inference runtime. Where tensors enter the FutureAGI surface is at three boundaries.

  • The embedding tensor: EmbeddingSimilarity consumes pairs of text strings, encodes each into a fixed-dimension embedding tensor under the hood (typically 768 or 1536 dimensions), and returns a 0–1 cosine score. Engineers never touch the tensor — they get the score plus reason.
  • The token-count tensor metadata: traceAI integrations like traceAI-openai and traceAI-langchain emit llm.token_count.prompt, llm.token_count.completion, and llm.token_count.total as OpenTelemetry attributes on every LLM span. These are derived from the input and output token tensors.
  • The tool-output tensor flow: when an agent calls a tool, the result is serialized to text, retokenized, and folded back into the next forward pass; FutureAGI evaluators like ToolSelectionAccuracy score whether that round-trip produced the right next action.
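Under the hood, an embedding-similarity score reduces to a cosine over two embedding tensors. A minimal sketch, with tiny made-up vectors standing in for real 768- or 1536-dimensional embeddings:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding tensors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_response = np.array([0.2, 0.9, 0.1, 0.4])    # stand-in embedding tensors
emb_expected = np.array([0.25, 0.85, 0.05, 0.5])

print(round(cosine(emb_response, emb_expected), 3))
```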

A practical workflow: a team retraining a 7B base on domain text with Adagrad ships v2, registers EmbeddingSimilarity and HallucinationScore via Dataset.add_evaluation, and runs FutureAGI's regression eval against a golden cohort. If embedding similarity drops 8 points, the team knows the new weights regressed semantics — without ever inspecting a single weight tensor.
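That regression check can be sketched as follows; the score lists and the 5-point threshold are illustrative stand-ins, not FutureAGI defaults, with the real scores coming from the eval run:

```python
# Illustrative per-item scores from two eval runs over the same golden cohort.
v1_scores = [0.91, 0.88, 0.93, 0.90]
v2_scores = [0.84, 0.80, 0.85, 0.82]

drop = sum(v1_scores) / len(v1_scores) - sum(v2_scores) / len(v2_scores)
REGRESSION_THRESHOLD = 0.05  # assumed team policy: fail on a 5-point mean drop

if drop > REGRESSION_THRESHOLD:
    print(f"v2 regressed embedding similarity by {drop:.2f}; block the rollout")
```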

How to Measure or Detect It

This term is foundational rather than directly measurable, but the proxies are concrete:

  • EmbeddingSimilarity: returns a 0–1 cosine score across two text strings, run after the model has decoded its output tensor back to text.
  • llm.token_count.prompt (OTel attribute): the prompt-tensor token count per LLM span; rising values flag context bloat.
  • llm.token_count.completion: completion-tensor size per response; sudden drops can indicate truncation bugs.
  • GPU memory dashboards: per-request VRAM consumed by the KV-cache tensor; correlate with batch-size and context-length to spot OOM risk.
  • Eval-fail-rate-by-cohort: regression rate on Faithfulness or TaskCompletion after a model or runtime change — the cleanest proxy for tensor-level regressions.
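The token-count proxies above can be monitored with a few lines; the dicts below stand in for real OTel span attributes, and the 2x-baseline rule is an assumed policy, not a FutureAGI default:

```python
# Each dict stands in for the attributes of one LLM span.
spans = [
    {"llm.token_count.prompt": 1200, "llm.token_count.completion": 180},
    {"llm.token_count.prompt": 1250, "llm.token_count.completion": 175},
    {"llm.token_count.prompt": 3900, "llm.token_count.completion": 30},
]

baseline_prompt = 1300  # assumed rolling baseline for this route
flagged = [s for s in spans if s["llm.token_count.prompt"] > 2 * baseline_prompt]

print(f"{len(flagged)} span(s) exceed 2x the prompt-token baseline")
```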

Minimal Python:

```python
from fi.evals import EmbeddingSimilarity

evaluator = EmbeddingSimilarity()
result = evaluator.evaluate(
    response="The quarterly revenue rose 12%.",
    expected_response="Q3 revenue grew by 12 percent.",
)
print(result.score, result.reason)
```

Common Mistakes

  • Confusing tensor rank with size. A 3-D tensor of shape (1, 1, 1) and one of shape (1024, 1024, 1024) are both rank-3; rank is the count of dimensions, not their magnitudes.
  • Trusting batch-1 inference results in production. Many tensor bugs emerge only at batched inference because of padding, masking, and KV-cache interactions.
  • Comparing token counts across tokenizers. llm.token_count.prompt is meaningful per-model, not across models — Llama-3 tokens are not GPT-4 tokens.
  • Treating embedding tensors as universal. Two embedding models produce vectors in different spaces; cosine similarity across them is meaningless.
  • Skipping evals after a runtime upgrade. vLLM, TGI, and transformers releases routinely change tensor layouts; always rerun a regression cohort.
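The first mistake above, rank versus size, in a few lines; no giant allocation is needed, since the shape tuple alone carries both facts:

```python
import math

small_shape = (1, 1, 1)
large_shape = (1024, 1024, 1024)

# Both are rank-3; rank counts dimensions, size multiplies them.
print(len(small_shape), len(large_shape))              # 3 3
print(math.prod(small_shape), math.prod(large_shape))  # 1 1073741824
```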

Frequently Asked Questions

What is a tensor?

A tensor is a multi-dimensional numeric array — the canonical data type that ML frameworks like PyTorch, TensorFlow, and JAX use to represent inputs, weights, activations, and gradients inside a model.

How is a tensor different from a matrix?

A matrix is a 2-D tensor. Tensors generalize that to any rank — 0-D scalars, 1-D vectors, 2-D matrices, or 3-D and higher arrays needed for batched LLM activations.

How do you evaluate the outputs of a tensor-based LLM?

FutureAGI's fi.evals package evaluates LLM outputs after tensors are decoded back to text — using EmbeddingSimilarity, Groundedness, and HallucinationScore against a Dataset of inputs and reference responses.