What Is a Dict (Python Dictionary)?
Python's built-in associative-array data structure mapping unique hashable keys to values via a hash table with O(1) average-case lookup.
A dict (Python dictionary) is Python’s built-in associative-array data structure: an insertion-ordered, mutable mapping of unique hashable keys to values backed by a hash table. In LLM and agent systems, dicts carry configs, JSON payloads, function-call arguments, span attributes, dataset rows, and evaluator results. FutureAGI treats dict correctness as a model-family reliability concern because one missing key or wrong value type can break parsing, tool execution, tracing, or downstream evaluation before a user sees the failure.
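A minimal sketch of those semantics (the config keys here are invented for illustration):

```python
# A dict is an insertion-ordered, mutable mapping of unique hashable keys to values.
config = {"model": "gpt-4o", "temperature": 0.2}
config["max_tokens"] = 512   # O(1) average-case insertion; insertion order is preserved
print(config["model"])       # O(1) average-case lookup -> gpt-4o
# Keys must be hashable: str, int, and tuples of hashables work;
# an unhashable key such as a list raises TypeError.
```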
Why Dicts Matter in Production LLM and Agent Systems
Dicts are the universal currency of LLM systems, which is exactly why dict-shaped failures are so common. An LLM emits a JSON-looking string that is missing a closing brace, and the downstream parser raises; `invalid-json` is one of the most-cited failure modes in production. A function-call dict has the right keys but the wrong types, so a tool call silently fails. A `Dataset` row is built with a typo in one of its keys, so the evaluator never reads the column it expected.
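A sketch of that first failure mode, with a hypothetical truncated output:

```python
import json

llm_output = '{"intent": "refund", "args": {"order_id": "A123"'  # missing closing braces
try:
    payload = json.loads(llm_output)
except json.JSONDecodeError as exc:
    # The downstream parser, not the model, surfaces the failure.
    print(f"invalid-json at char {exc.pos}: {exc.msg}")
```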
The pain is shared across roles. ML engineers debug schema drift between LLM output and downstream consumer. SREs see error spikes from dict KeyErrors that look like infrastructure failures. Product managers see broken responses that trace back to a parser, not the model. Compliance leads worry about audit-log dicts that miss required fields.
In 2026-era agent stacks, dict failures compound. Tool calls flow as dicts; planner outputs are dicts; retrieved context is wrapped in dicts; evaluator results are dicts. A single missing key at the planner step quietly nullifies three downstream tool calls. Step-level evaluation on dict-shaped outputs is the only way to catch this before it reaches a user.
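A sketch of that compounding failure (the planner payload and tool names are hypothetical):

```python
planner_output = {"plan": [{"tool": "search"}]}  # planner omitted the "args" key

for step in planner_output.get("plan", []):
    args = step.get("args")   # None instead of a KeyError: the lenient read hides the bug
    if args is None:
        continue              # every downstream tool call is quietly skipped
    # run_tool(step["tool"], args)  # hypothetical dispatch, never reached here
```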
How FutureAGI Handles Dict-Shaped Outputs and Inputs
FutureAGI’s SDK assumes dicts at every interface. `Dataset` rows are dicts; `Dataset.add_evaluation` reads and writes dict-shaped columns; the `Persona` and `Scenario` classes serialize to dicts for `ScenarioGenerator`. Every evaluator returns a dict-shaped result with at minimum `score`, `label`, and `reason` keys. Span attributes carried through traceAI integrations such as `langchain` are dicts too, including OTel fields like `llm.token_count.prompt` and `agent.trajectory.step`. The Agent Command Center logs each request and response as a dict for replay, then can attach gateway controls such as fallback, retry, or semantic cache decisions when a contract fails.
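Illustrative shapes only; the field values below are invented, and exact payloads depend on SDK and integration versions:

```python
# Every evaluator result is dict-shaped, with at minimum these keys:
eval_result = {"score": 0.0, "label": "fail", "reason": "required key 'args' is missing"}

# Span attributes carried through traceAI integrations are dicts too:
span_attributes = {
    "llm.token_count.prompt": 412,  # OTel field named above; value invented
    "agent.trajectory.step": 3,
}
```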
The evaluator surface for dict correctness is concrete. `JSONValidation` evaluates an LLM’s stringified-JSON output against a JSON Schema and returns the score, the failed paths, and the parsed dict. `JsonSchema` validates an in-memory dict against a JSON Schema. `SchemaCompliance` evaluates structured-output compliance. `FieldCompleteness` checks for required fields. `TypeCompliance` checks dict value types. `ContainsJson` is a quick check that a response text contains a parseable JSON object, useful as a pre-filter before expensive validators. Unlike Pydantic’s parse-or-raise model, FutureAGI’s evaluators return graded scores and reason strings, so teams can chart per-field failure rates and set thresholds rather than relying on hard parse errors. FutureAGI’s approach is to treat every dict boundary as an evaluable contract, not just a parser handoff.
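For contrast, a minimal Pydantic v2 sketch of the parse-or-raise model mentioned above:

```python
from pydantic import BaseModel, ValidationError

class ToolCall(BaseModel):
    intent: str
    args: dict

try:
    ToolCall.model_validate({"intent": "refund"})  # "args" is missing
except ValidationError as exc:
    # Parse-or-raise reports the first break as an exception,
    # not a graded score you can chart over time.
    print(exc.error_count())
```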
How to Measure or Detect Dict Failures
Useful signals for dict-shape correctness in LLM outputs:
- `JSONValidation` — full schema validation with parsed dict and failure paths.
- `JsonSchema` — local-metric variant for in-memory dicts.
- `SchemaCompliance` — comprehensive structured-output evaluation.
- `FieldCompleteness` — required-field coverage in structured outputs.
- `TypeCompliance` — type-only validation against schema.
- `ContainsJson` — fast pre-filter to confirm parseable JSON exists.
- `invalid-json` rate dashboards — production-trace signal segmented by route, model, tool, and release cohort.
- OTel span-attribute checks — missing `llm.token_count.prompt` or malformed `agent.trajectory.step` values usually point to instrumentation or serialization bugs (a local sketch of such a check follows this list).
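A local sketch of such a span-attribute check; this is plain Python, not a FutureAGI API, and the expected types are assumptions:

```python
REQUIRED_SPAN_ATTRS = {"llm.token_count.prompt": int, "agent.trajectory.step": int}

def span_attr_problems(attrs: dict) -> list[str]:
    """Report missing or mistyped required span attributes."""
    problems = []
    for key, expected in REQUIRED_SPAN_ATTRS.items():
        if key not in attrs:
            problems.append(f"missing {key}")
        elif not isinstance(attrs[key], expected):
            problems.append(f"{key} is {type(attrs[key]).__name__}, expected {expected.__name__}")
    return problems

print(span_attr_problems({"llm.token_count.prompt": "412"}))
# ['llm.token_count.prompt is str, expected int', 'missing agent.trajectory.step']
```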
Minimal Python:
```python
from fi.evals import JSONValidation

# JSON Schema the LLM output must satisfy
schema = {"type": "object", "required": ["intent", "args"]}
validator = JSONValidation(json_schema=schema)

user_query = "Cancel my last order"                                      # original user input
llm_output_text = '{"intent": "cancel_order", "args": {"order": "A1"}}'  # raw model output

result = validator.evaluate(
    input=user_query,
    output=llm_output_text,
    context=None,
)
# result is dict-shaped, with at minimum score, label, and reason keys
```
Treat the result as both a gate and a time series. Block releases when schema compliance drops below threshold, then segment by route, model, and tool to find drift.
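A hypothetical gate over a batch of dict-shaped results; the threshold and the pass criterion are assumptions:

```python
batch_results = [{"score": 1.0}, {"score": 0.0}, {"score": 1.0}]  # stand-in data
THRESHOLD = 0.95

compliance = sum(r["score"] for r in batch_results) / len(batch_results)
if compliance < THRESHOLD:
    raise SystemExit(f"schema compliance {compliance:.2f} < {THRESHOLD}; blocking release")
```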
Common mistakes
- Treating “looks like JSON” as JSON. Many LLMs emit fenced markdown or trailing prose; strip and validate before passing data to tools (see the sketch after this list).
- Using Python `dict` literals as JSON. Single quotes, `None`, tuples, and datetime objects break wire-format contracts.
- Leaving tool-call arguments schema-free. Function-calling without strict JSON Schema turns key and type bugs into silent tool failures.
- Relying only on parse-or-raise. A scored validator gives release gates and regression dashboards; an exception only reports the first break.
- Mutating evaluator inputs in place. Copy dicts before normalization so repeated eval runs do not inherit hidden state.
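A stdlib sketch for the first mistake, assuming the payload is the outermost braced span in the response (a production extractor needs more care):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Strip fences and surrounding prose, then parse (sketch only)."""
    match = re.search(r"\{.*\}", text, re.DOTALL)  # greedy: outermost braces
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

print(extract_json('Sure! Here you go:\n{"intent": "refund", "args": {}}\nHope that helps.'))
# {'intent': 'refund', 'args': {}}
```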
Frequently Asked Questions
What is a Python dict?
A Python dict is a built-in associative-array data structure that maps unique hashable keys to values using a hash table, with O(1) average-case lookup, insertion, and deletion.
How is a dict different from a JSON object?
A JSON object is a serialized wire format that allows only string keys and JSON-typed values. A Python dict is an in-memory data structure that allows any hashable key and any object value. Most LLM SDKs convert between the two at the boundary.
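A quick illustration of that boundary:

```python
import json

print(json.dumps({"retry": None, "ok": True}))  # {"retry": null, "ok": true}

try:
    json.dumps({("a", "b"): 1})  # a tuple is a valid dict key...
except TypeError as exc:
    print(exc)                   # ...but JSON object keys must be strings
```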
Where do dicts show up in FutureAGI evaluations?
Everywhere — `Dataset` rows are dicts, evaluator inputs/outputs are dicts, and `JSONValidation` plus `JsonSchema` evaluators validate dict-shaped LLM outputs against a schema.