What Are Pandas and NumPy?
Pandas and NumPy are the two foundational Python libraries for numerical and tabular data work. NumPy gives you the n-dimensional array (ndarray) plus vectorized math, broadcasting, and linear algebra. Pandas builds DataFrames and Series on top of NumPy, adding row and column labels, mixed dtypes, time-series indexing, and SQL-like joins, groupby, and merge. They are not part of the LLM runtime, but every dataset prep, drift report, and offline eval script in an AI pipeline runs through them. FutureAGI’s fi.datasets.Dataset SDK is built to accept and emit Pandas-friendly shapes.
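A minimal sketch of that division of labor; the `trace_id`, `route`, and `latency_ms` columns are illustrative, not a fixed schema:

```python
import numpy as np
import pandas as pd

# NumPy: homogeneous arrays with vectorized math and broadcasting.
latencies_ms = np.array([120.0, 340.0, 95.0, 410.0])
z_scores = (latencies_ms - latencies_ms.mean()) / latencies_ms.std()

# Pandas: labeled, mixed-dtype tables built on top of NumPy arrays.
df = pd.DataFrame({
    "trace_id": ["t1", "t2", "t3", "t4"],
    "route": ["rag", "rag", "tool", "tool"],
    "latency_ms": latencies_ms,
})
per_route = df.groupby("route")["latency_ms"].mean()  # SQL-like aggregation
```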
Why Pandas and NumPy Matter in Production LLM and Agent Systems
Modern LLM stacks look like cloud platforms, but the data prep underneath looks like a notebook. An eval cohort is built by sampling production traces, joining them against ground-truth labels, exploding tool calls into rows, computing per-cohort scores, and writing the result back to a dataset store. That entire path is Pandas. Drift reports compare two distributions of token counts, latency, and per-evaluator scores; that is NumPy.
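A hedged sketch of that cohort-building path; the file names and the `trace_id`, `tool_calls`, `route`, and `score` columns are assumptions for illustration:

```python
import pandas as pd

# Sample production traces and load ground-truth labels.
traces = pd.read_parquet("traces.parquet").sample(n=5000, random_state=42)
labels = pd.read_parquet("ground_truth.parquet")

# Join sampled traces against the ground-truth label table on the trace id.
cohort = traces.merge(labels, on="trace_id", how="inner")

# Explode the list-valued tool_calls column: one row per tool call.
tool_rows = cohort.explode("tool_calls")

# Per-cohort scores, written back to a dataset store.
per_cohort = cohort.groupby("route")["score"].mean()
per_cohort.to_frame().to_parquet("cohort_scores.parquet")
```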
The pain shows up when teams skip these libraries. A custom JSON-walking script silently drops nested fields, eval scores get keyed by row index rather than trace id, or per-cohort averages mix None with 0. Engineers see flat-line charts that hide a 30% regression on one segment. SREs see drift alarms that disagree with human review because the bins were computed inconsistently across runs.
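The None-versus-0 mix-up in particular is easy to reproduce:

```python
import pandas as pd

scores = pd.Series([0.8, None, 0.6, 0.0])
scores.mean()            # 0.4667: missing scores are skipped, not zeroed
scores.fillna(0).mean()  # 0.35: a different answer; choose one deliberately
```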
For 2026 agent systems, the stakes are higher. A single user request can produce dozens of spans across planner, retrieval, tools, and critique passes. To produce a per-trajectory eval, you flatten spans into a long-form Pandas DataFrame keyed by trace.id and agent.trajectory.step, attach the evaluator outputs as columns, and pivot to a per-trace summary. NumPy underneath makes the operation fast on millions of rows.
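A minimal sketch of that flatten-and-pivot step, with a toy span table standing in for real trace data (the eval.score column is an assumption):

```python
import pandas as pd

# Long-form span table: one row per (trace, trajectory step). The column
# names mirror the trace.id and agent.trajectory.step attributes.
spans = pd.DataFrame({
    "trace.id": ["t1", "t1", "t1", "t2", "t2"],
    "agent.trajectory.step": [0, 1, 2, 0, 1],
    "span.kind": ["planner", "retrieval", "critique", "planner", "tool"],
    "eval.score": [0.9, 0.7, 0.8, 0.6, 0.5],
})

# Pivot to one row per trace, one column per trajectory step.
per_step = spans.pivot_table(
    index="trace.id", columns="agent.trajectory.step", values="eval.score"
)
per_trace = per_step.mean(axis=1)  # per-trajectory summary score
```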
How FutureAGI Works With Pandas and NumPy
FutureAGI is not a Pandas replacement; it is a layer on top. The honest anchor is data interchange: fi.datasets.Dataset accepts row-and-column inputs that load cleanly from a DataFrame, supports add_columns and add_rows operations, and exports evaluation results back to a tabular shape an analyst can read in a notebook. There is no FutureAGI evaluator named “Pandas” or “NumPy” because these are libraries, not behaviors.
A practical flow: a RAG team samples 5,000 production traces via traceAI, loads them into a Pandas DataFrame, joins on a ground-truth label table, then registers the result as an fi.datasets.Dataset. They call Dataset.add_evaluation(Groundedness()), Dataset.add_evaluation(AnswerRelevancy()), and Dataset.add_evaluation(JSONValidation()). FutureAGI runs the evaluators row-by-row, attaches scores as new columns, and exports the enriched dataset. The team pulls it back into Pandas for slice-and-dice analysis: groupby route, compute per-cohort Groundedness mean, plot a histogram of token counts using NumPy.
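A hedged sketch of the post-export analysis, assuming the enriched dataset loads back as a DataFrame with `route`, `groundedness`, and `token_count` columns (all illustrative names):

```python
import numpy as np
import pandas as pd

# Load the enriched export back into Pandas for slice-and-dice analysis.
results = pd.read_parquet("rag_eval_enriched.parquet")

# Per-cohort Groundedness mean: one row per route.
per_route = results.groupby("route")["groundedness"].mean()

# Token-count histogram with fixed NumPy bin edges so runs stay comparable.
counts, edges = np.histogram(results["token_count"], bins=np.arange(0, 4097, 256))
```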
Compared with stitching scripts together by hand, the FutureAGI surface gives the team versioned datasets, reproducible eval runs, and a per-row reason from each evaluator that survives the round trip.
How to Measure or Detect Pandas/NumPy Issues
Pandas and NumPy themselves are not “measured” — they are tools. What you measure is whether your eval data path is correct.
- Row-count parity: input traces, the registered Dataset, and the exported eval results should all have the same number of rows. Drift means a join silently dropped rows.
- Schema parity: column dtypes should match expectations after each transform; mismatched dtypes silently coerce values.
- Per-cohort consistency: recompute group averages on the raw and on the enriched DataFrame; they should agree.
- Memory and runtime: track DataFrame size and eval-job runtime; out-of-memory errors and quadratic operations are the most common Pandas anti-patterns.
- Trace-id integrity: every row in the eval DataFrame should map back to a real `trace.id` so reasons can be opened in tracing.
A minimal registration flow looks like this; sketches of the checks themselves follow it:
```python
import pandas as pd
from fi.datasets import Dataset

# Load production traces and register them as a versioned dataset.
df = pd.read_parquet("production_traces.parquet")
ds = Dataset.from_dataframe(df, name="rag-eval-2026-05")

# Attach an evaluator; scores come back as new columns.
ds.add_evaluation("Groundedness")

# Export the enriched, tabular results for notebook analysis.
results = ds.export()
```
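Continuing from that block, and assuming `export()` returns a DataFrame with the original columns plus scores, the checks in the list above reduce to a few assertions; `route`, `token_count`, and `trace_id` are illustrative column names:

```python
import pandas as pd

# Row-count parity: a silent join drop shows up here first.
assert len(results) == len(df), "row-count mismatch between input and export"

# Schema parity: dtypes of shared columns should survive the round trip.
shared = df.columns.intersection(results.columns)
assert df[shared].dtypes.equals(results[shared].dtypes), "dtype drift"

# Per-cohort consistency: raw and enriched group means should agree.
pd.testing.assert_series_equal(
    df.groupby("route")["token_count"].mean(),
    results.groupby("route")["token_count"].mean(),
)

# Trace-id integrity: every eval row maps back to a real trace.
assert results["trace_id"].isin(df["trace_id"]).all(), "orphaned eval rows"
```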
Common Mistakes
- Indexing eval scores by row position. Always key on `trace.id`; positional joins break when filters change row order.
- Mixing `NaN` and `None` in scores. Pandas treats them differently across dtypes; cast scores to `float` before aggregating.
- Running per-row Python loops over a DataFrame. Use vectorized NumPy or Pandas operations; loops turn a 30-second job into 30 minutes.
- Forgetting that `groupby` drops `NaN` keys by default. Per-cohort averages can silently exclude an entire segment (see the sketch below).
- Treating Pandas as a database. It is not. For multi-million-row joins, persist intermediate results and use the right tool.
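The groupby pitfall in particular is one flag away from a fix:

```python
import pandas as pd

df = pd.DataFrame({
    "route": ["rag", "rag", None, "tool"],
    "score": [0.9, 0.7, 0.4, 0.8],
})

df.groupby("route")["score"].mean()                # None cohort vanishes
df.groupby("route", dropna=False)["score"].mean()  # None cohort kept as NaN key
```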
Frequently Asked Questions
What are Pandas and NumPy used for?
Pandas and NumPy are Python libraries for working with tabular and numerical data. NumPy supplies fast n-dimensional arrays; Pandas provides labeled DataFrames and Series for mixed-type, labeled data.
How is Pandas different from NumPy?
NumPy provides homogeneous numerical arrays and vectorized math. Pandas builds on NumPy and adds row and column labels, mixed dtypes, time-series indexing, and SQL-like operations such as groupby and merge.
How do they relate to LLM evaluation?
Eval scripts almost always shape data as Pandas DataFrames before scoring; FutureAGI's `fi.datasets.Dataset` accepts DataFrame-like inputs and exports results that load cleanly back into Pandas for analysis.