Modern AI Engineering in 2026: Scaling LLMs from Pilot to Production (Webinar with Broadcom)
Future AGI webinar with Sandeep Kaipu (Broadcom) on scaling production AI: KPI alignment, infra and data pipelines, inference cost, evaluation, and guardrails.
Watch the Modern AI Engineering Webinar
Sandeep Kaipu (Senior Manager, Software Engineering at Broadcom) and Nikhil Pareek (CEO at Future AGI) walk through what it actually takes to scale a working LLM pilot into a production AI system, with concrete patterns for KPI alignment, infrastructure scaling, inference cost optimization, and the guardrails layer.
TL;DR: Modern AI Engineering in 2026
| Pillar | What you take away |
|---|---|
| KPI alignment | Map every AI workflow to a business outcome before picking a model or framework. |
| Infrastructure scaling | Decouple data, model, and orchestration; let each scale on its own curve. |
| Inference cost and latency | Route by request difficulty; smaller models on easy paths, reasoning models on hard paths. |
| Evaluation discipline | Three-layer eval (heuristic + model-based + LLM-as-a-judge), continuous on sampled live traffic. |
| Guardrails | PII redaction, prompt-injection detection, content classification, tool-call authorization. |
| Observability companion | traceAI (Apache 2.0) for spans, ai-evaluation (Apache 2.0) for evaluators. |
| Runtime gateway | Agent Command Center at /platform/monitor/command-center for BYOK routing, budgets, guardrails. |
Webinar Abstract
Scaling AI from pilot to production breaks for predictable reasons. The prototype passed against a single curated demo example. Production brings live traffic, multiple tenants, mixed input quality, and rising cost pressure. The session walks through the four pillars that decide whether a working pilot scales:
- KPI alignment: tying model outputs to revenue, retention, or cost metrics rather than chasing model leaderboard scores.
- Infrastructure and data pipelines: the trade-offs between centralized and distributed inference, vector store choice, and the data pipeline that keeps evaluations honest.
- Inference cost and speed: caching, batching, model routing, and the new lever of inference-time compute introduced by reasoning models.
- Security and guardrails: PII, injection, content safety, and tool-call authorization as a non-negotiable layer rather than a launch checklist.
Who Should Watch
This webinar is built for AI founders, engineering leads, ML platform architects, DevOps and SRE teams, and product managers delivering reliable, enterprise-scale AI. The most useful starting point is a team that already has a working LLM pilot and now needs to scale it into a multi-team, multi-environment production system.
What You Will Learn
- How to align AI work to business KPIs with the right evaluations, not just leaderboard scores.
- How to scale infrastructure and data pipelines so the model, the orchestration layer, and the data layer can each evolve on their own curves.
- How to optimize inference cost and speed through routing, caching, and the right model size per request path.
- Why security and guardrails are non-negotiable in 2026 and how to implement them without adding prohibitive latency.
The Four Pillars in Practice
1. KPI Alignment Before Model Selection
The first lever is the one teams skip most often. Before picking a model, name the business KPI the AI workflow moves. Resolution rate for a support agent. Conversion lift for a recommendation surface. Time-to-close for a sales assistant. Without that anchor, model A versus model B becomes a leaderboard argument rather than an engineering decision.
Once the KPI is named, the evaluation set follows. Eval set construction starts from real production examples that map to the KPI, not from a generic benchmark. The Future AGI Compare Data view lets you score the same eval set across models and prompts and see which moves the KPI proxy, not which scores higher on a generic metric.
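A minimal sketch of that workflow, reusing the evaluate call shown in the reference implementation below; the eval set, the KPI proxy metric, and the call_model helper are hypothetical placeholders for your own production examples and model clients.

```python
from fi.evals import evaluate

# Hypothetical eval set built from real production tickets, each tied to the KPI
# (here: support resolution rate). Replace with your own exported examples.
eval_set = [
    {"query": "reset password", "context": "doc-1: Go to Settings > Security ..."},
    {"query": "cancel subscription", "context": "doc-7: Open the billing portal ..."},
]

def call_model(model_name: str, query: str, context: str) -> str:
    """Hypothetical stand-in for your candidate model client."""
    return f"[{model_name}] answer for: {query}"

# Score the same eval set against two candidate models and compare the KPI proxy
# (faithfulness to retrieved context), not a generic leaderboard number.
for model_name in ["model-a", "model-b"]:
    scores = []
    for example in eval_set:
        answer = call_model(model_name, example["query"], example["context"])
        result = evaluate("faithfulness", output=answer, context=example["context"])
        scores.append(result.score)
    print(model_name, sum(scores) / len(scores))
```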
2. Infrastructure That Scales Independently
The second lever is decoupling. Treat the model layer, the orchestration layer, and the data layer as three independent surfaces that scale on their own curves:
- The model layer is mostly a routing problem in 2026. Multiple backing models, one BYOK gateway, request-level routing rules.
- The orchestration layer is where agent frameworks, tool catalogs, and workflow graphs live. LangGraph, the OpenAI Agents SDK, and custom Python orchestrators all fit this layer.
- The data layer is the vector stores, the relational stores, the cache, and the feature store. None of these need to scale at the same rate as the model layer.
The Future AGI platform connects orchestration tracing (via traceAI) with model-gateway routing (via the Agent Command Center), so the model and orchestration layers can scale independently while sharing one evaluation surface.
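To make the decoupling concrete, application code can point an OpenAI-compatible client at a single gateway endpoint and let routing rules decide the backing model. The gateway URL and model alias below are hypothetical placeholders, not the Command Center's actual configuration.

```python
from openai import OpenAI

# Hypothetical BYOK gateway endpoint: the orchestration code never hard-codes a
# provider, so the model layer can be re-routed or scaled without code changes.
client = OpenAI(
    base_url="https://gateway.example.com/v1",  # placeholder gateway URL
    api_key="YOUR_GATEWAY_KEY",
)

response = client.chat.completions.create(
    model="support-agent-default",  # logical alias resolved by the gateway's routing rules
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.choices[0].message.content)
```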
3. Inference Cost and Latency: Route by Difficulty
The 2025 framing was straightforward: cache, batch, and route to small models where possible. The 2026 framing adds inference-time compute as a new lever. A reasoning model costs much more per call but can collapse a five-step agent run into one model call, so for hard paths cost-per-task now matters more than cost-per-token.
The pattern in practice:
- Cache aggressively on inputs that repeat.
- Batch where you can.
- Route easy paths to small models (gpt-5-2025-08-07 for many tasks, smaller open models for narrow tasks).
- Route hard paths to reasoning models (gpt-5-2025-08-07 reasoning mode, claude-opus-4-7, DeepSeek R1).
- Enforce per-team and per-tenant budgets at the gateway.
The Agent Command Center at /platform/monitor/command-center handles routing, caching, and budget enforcement.
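A minimal sketch of request-level routing under this pattern; the difficulty heuristic, model names, and budget check are illustrative assumptions, and in practice this policy lives in the gateway rather than in application code.

```python
# Illustrative routing sketch: hypothetical model aliases and thresholds.
EASY_MODEL = "small-model"       # placeholder for a small, cheap model
HARD_MODEL = "reasoning-model"   # placeholder for a reasoning model

def looks_hard(request: dict) -> bool:
    """Crude difficulty heuristic: long inputs or multi-step tool use take the hard path."""
    return len(request["prompt"]) > 2000 or request.get("requires_tools", False)

def pick_model(request: dict, tenant_budget_remaining: float) -> str:
    # Per-tenant budget enforcement: block or downgrade when the budget runs low.
    if tenant_budget_remaining <= 0:
        raise RuntimeError("tenant budget exhausted")
    if looks_hard(request) and tenant_budget_remaining > 1.0:
        return HARD_MODEL
    return EASY_MODEL

print(pick_model({"prompt": "reset password", "requires_tools": False}, tenant_budget_remaining=5.0))
```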
4. Guardrails as a First-Class Layer
The webinar’s core point on guardrails was that they are not a launch checklist; they are a runtime layer. The non-negotiable list in 2026:
- PII redaction on inputs and outputs.
- Prompt-injection detection on untrusted inputs (RAG retrieved chunks, user-supplied tool outputs, scraped web content).
- Content classification on outputs (toxicity, regulated topics, brand-safety policies).
- Tool-call authorization for agents that touch external systems (API calls, database writes, payments).
The Future AGI Guardrails runtime (fi.evals.guardrails.Guardrails) ships these as pre-call and post-call hooks in the Agent Command Center gateway. Latency overhead is typically 100 to 200 milliseconds per request.
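The pre-call and post-call hook pattern looks roughly like the sketch below. The check functions are hypothetical stand-ins, not the fi.evals.guardrails API; consult the Future AGI documentation for the actual Guardrails configuration.

```python
# Illustrative pre-call / post-call hooks with stand-in checks.
def redact_pii(text: str) -> str:
    """Hypothetical PII redaction applied to both inputs and outputs."""
    return text  # replace with a real redactor

def is_prompt_injection(text: str) -> bool:
    """Hypothetical injection detector for untrusted inputs (RAG chunks, tool outputs)."""
    return "ignore previous instructions" in text.lower()

def pre_call(user_input: str) -> str:
    if is_prompt_injection(user_input):
        raise ValueError("blocked: possible prompt injection")
    return redact_pii(user_input)

def post_call(model_output: str, allowed_tools: set, tool_call: str | None = None) -> str:
    # Tool-call authorization: only whitelisted tools may be executed.
    if tool_call is not None and tool_call not in allowed_tools:
        raise PermissionError(f"blocked: unauthorized tool call {tool_call}")
    return redact_pii(model_output)

safe_input = pre_call("How do I reset my password?")
# ... model call goes here ...
safe_output = post_call("Here is how to reset it ...", allowed_tools={"search_docs"})
```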
Reference Architecture Implementation
The webinar pairs with a reference implementation that wires the four pillars together. The minimal pattern looks like this:
```python
from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType
from fi.evals import evaluate

# Step 1: Register the tracer (KPI alignment + observability)
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="support-agent",
)
tracer = FITracer(trace_provider)

# Step 2: Wrap your agent and tool functions
@tracer.agent(name="resolve_ticket")
def resolve_ticket(ticket_id: str, query: str) -> dict:
    # ... your retrieval + LLM call ...
    return {"resolution": "answer", "sources": ["doc-1", "doc-2"]}

# Step 3: Inline evaluation against the KPI proxy
result = resolve_ticket(ticket_id="T-123", query="reset password")
faithfulness = evaluate(
    "faithfulness",
    output=result["resolution"],
    context="\n".join(result["sources"]),
)
print(faithfulness.score, faithfulness.reason)
```
Routing, caching, budgets, and guardrails are enforced at the gateway layer (Agent Command Center) rather than in application code, which keeps the four pillars cleanly separated.
Further Reading and Primary Sources
- Future AGI documentation
- ai-evaluation GitHub repository
- ai-evaluation Apache 2.0 LICENSE
- traceAI GitHub repository
- traceAI Apache 2.0 LICENSE
- Future AGI Cloud Evals (Turing model latencies)
- OpenTelemetry GenAI semantic conventions
- OpenAI Agents SDK documentation
- Anthropic Claude API documentation
- Google Vertex AI documentation
- LangChain documentation
- LiteLLM documentation
- Future AGI on LinkedIn
Need evals and observability for your modern GenAI stack? Read the Future AGI documentation or book a demo for a tailored walkthrough.
Frequently asked questions
What is this webinar about?
When was the webinar recorded and what has changed since?
Who should watch this webinar?
What is the single biggest mistake teams make when scaling LLM systems?
What evaluation framework does the webinar recommend?
How should teams approach inference cost and latency in 2026?
What guardrails are non-negotiable for enterprise AI?
How do I connect what's in the webinar to my own stack?