Webinars

Modern AI Engineering in 2026: Scaling LLMs from Pilot to Production (Webinar with Broadcom)

Future AGI webinar with Sandeep Kaipu (Broadcom) on scaling production AI: KPI alignment, infra and data pipelines, inference cost, evaluation, and guardrails.

Watch the Modern AI Engineering Webinar

Sandeep Kaipu (Senior Manager, Software Engineering at Broadcom) and Nikhil Pareek (CEO at Future AGI) walk through what it actually takes to scale a working LLM pilot into a production AI system, with concrete patterns for KPI alignment, infrastructure scaling, inference cost optimization, and the guardrails layer.

TL;DR: Modern AI Engineering in 2026

  • KPI alignment: Map every AI workflow to a business outcome before picking a model or framework.
  • Infrastructure scaling: Decouple data, model, and orchestration; let each scale on its own curve.
  • Inference cost and latency: Route by request difficulty; smaller models on easy paths, reasoning models on hard paths.
  • Evaluation discipline: Three-layer eval (heuristic + model-based + LLM-as-a-judge), continuous on sampled live traffic.
  • Guardrails: PII redaction, prompt-injection detection, content classification, tool-call authorization.
  • Observability companion: traceAI (Apache 2.0) for spans, ai-evaluation (Apache 2.0) for evaluators.
  • Runtime gateway: Agent Command Center at /platform/monitor/command-center for BYOK routing, budgets, guardrails.

Webinar Abstract

Scaling AI from pilot to production breaks for predictable reasons. The prototype passed on a curated demo set of a handful of examples. Production hits live traffic, multiple tenants, mixed input quality, and rising cost pressure. The session walks through the four pillars that decide whether a working pilot scales:

  • KPI alignment: tying model outputs to revenue, retention, or cost metrics rather than chasing model leaderboard scores.
  • Infrastructure and data pipelines: the trade-offs between centralized and distributed inference, vector store choice, and the data pipeline that keeps evaluations honest.
  • Inference cost and speed: caching, batching, model routing, and the new lever of inference-time compute introduced by reasoning models.
  • Security and guardrails: PII, injection, content safety, and tool-call authorization as a non-negotiable layer rather than a launch checklist.

Who Should Watch

This webinar is built for AI founders, engineering leads, ML platform architects, DevOps and SRE teams, and product managers delivering reliable, enterprise-scale AI. The most useful starting point is a team that already has a working LLM pilot and now needs to scale it into a multi-team, multi-environment production system.

What You Will Learn

  • How to align AI work to business KPIs with the right evaluations, not just leaderboard scores.
  • How to scale infrastructure and data pipelines so the model, the orchestration layer, and the data layer can each evolve on their own curves.
  • How to optimize inference cost and speed through routing, caching, and the right model size per request path.
  • Why security and guardrails are non-negotiable in 2026 and how to implement them without adding prohibitive latency.

The Four Pillars in Practice

1. KPI Alignment Before Model Selection

The first lever is the one teams skip most often. Before picking a model, name the business KPI the AI workflow moves. Resolution rate for a support agent. Conversion lift for a recommendation surface. Time-to-close for a sales assistant. Without that anchor, model A versus model B becomes a leaderboard argument rather than an engineering decision.

Once the KPI is named, the evaluation set follows. Eval set construction starts from real production examples that map to the KPI, not from a generic benchmark. The Future AGI Compare Data view lets you score the same eval set across models and prompts and see which moves the KPI proxy, not which scores higher on a generic metric.
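
As a rough sketch of that loop in code: the snippet below scores a few production-derived examples with the same fi.evals.evaluate call used in the reference implementation further down. The example records and the choice of faithfulness as the KPI proxy are illustrative, not prescriptive.

from fi.evals import evaluate

# Illustrative eval set built from real production interactions, not a benchmark.
# Each record pairs the agent output with the context it should be grounded in.
eval_set = [
    {
        "output": "Reset your password from Settings > Security > Reset password.",
        "context": "Help center: passwords are reset under Settings > Security > Reset password.",
    },
    # ... more examples sampled from production traffic ...
]

scores = []
for example in eval_set:
    # Score the KPI proxy (faithfulness here), not a leaderboard metric.
    result = evaluate(
        "faithfulness",
        output=example["output"],
        context=example["context"],
    )
    scores.append(result.score)

print(f"KPI-proxy average over {len(scores)} production examples:", sum(scores) / len(scores))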

2. Infrastructure That Scales Independently

The second lever is decoupling. Treat the model layer, the orchestration layer, and the data layer as three independent surfaces that scale on their own curves:

  • The model layer is mostly a routing problem in 2026. Multiple backing models, one BYOK gateway, request-level routing rules.
  • The orchestration layer is where agent frameworks, tool catalogs, and workflow graphs live. LangGraph, the OpenAI Agents SDK, and custom Python orchestrators all fit this layer.
  • The data layer is the vector stores, the relational stores, the cache, and the feature store. None of these need to scale at the same rate as the model layer.

The Future AGI platform connects orchestration tracing (via traceAI) with model-gateway routing (via the Agent Command Center), so the model and orchestration layers can scale independently while sharing one evaluation surface.
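
One way to make that seam concrete is to point an OpenAI-compatible client at a single gateway endpoint instead of at individual providers, so routing rules can change without touching orchestration code. The gateway URL, API key, and model alias below are placeholders, not documented Agent Command Center values.

from openai import OpenAI

# The orchestration layer talks to one gateway endpoint. Routing rules, caching,
# and budgets live at the gateway, so backing models can change without touching
# this code. The URL, key, and model alias are placeholders.
client = OpenAI(
    base_url="https://your-gateway.example.com/v1",
    api_key="YOUR_GATEWAY_KEY",
)

response = client.chat.completions.create(
    model="support-agent-default",  # gateway-side alias, not a provider model ID
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.choices[0].message.content)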

3. Inference Cost and Latency: Route by Difficulty

The 2025 framing was straightforward: cache, batch, push small where possible. The 2026 framing adds inference-time compute as a third axis. A reasoning model costs much more per call but can collapse a five-step agent run into one model call. The cost-per-task math now beats cost-per-token math for hard paths.

The pattern in practice:

  • Cache aggressively on inputs that repeat.
  • Batch where you can.
  • Route easy paths to small models (gpt-5-2025-08-07 for many tasks, smaller open models for narrow tasks).
  • Route hard paths to reasoning models (gpt-5-2025-08-07 reasoning mode, claude-opus-4-7, DeepSeek R1).
  • Enforce per-team and per-tenant budgets at the gateway.

The Agent Command Center at /platform/monitor/command-center handles routing, caching, and budget enforcement.
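
As a rough sketch of the routing decision itself (in production it lives at the gateway, not in application code), with a placeholder difficulty heuristic and placeholder model names:

# Sketch of difficulty-based routing. The heuristic and model names are examples only.
def classify_difficulty(request: dict) -> str:
    # Real routers often use a small classifier model, token counts, or tool-usage
    # history; a simple flag-and-length heuristic stands in for that here.
    if request.get("multi_step") or len(request.get("query", "")) > 2000:
        return "hard"
    return "easy"

def route(request: dict) -> dict:
    if classify_difficulty(request) == "easy":
        # Cheap path: small model, cacheable, batch-friendly.
        return {"model": "small-general-model", "cache": True}
    # Hard path: reasoning model. More expensive per call, but it can collapse a
    # multi-step agent run into one request, so compare cost per task, not per token.
    return {"model": "reasoning-model", "cache": False}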

4. Guardrails as a First-Class Layer

The webinar’s core point on guardrails was that they are not a launch checklist; they are a runtime layer. The non-negotiable list in 2026:

  • PII redaction on inputs and outputs.
  • Prompt-injection detection on untrusted inputs (RAG retrieved chunks, user-supplied tool outputs, scraped web content).
  • Content classification on outputs (toxicity, regulated topics, brand-safety policies).
  • Tool-call authorization for agents that touch external systems (API calls, database writes, payments).

The Future AGI Guardrails runtime (fi.evals.guardrails.Guardrails) ships these as pre-call and post-call hooks in the Agent Command Center gateway. Latency overhead is typically 100 to 200 milliseconds per request.
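
A minimal sketch of the pre-call and post-call shape is below. Every helper is a hypothetical placeholder, not the fi.evals.guardrails.Guardrails API, and tool-call authorization would follow the same post-call pattern.

import re

def redact_pii(text: str) -> str:
    # Placeholder PII pass: mask email-like strings; a real runtime does far more.
    return re.sub(r"\S+@\S+", "[REDACTED_EMAIL]", text)

def detect_prompt_injection(text: str) -> bool:
    # Placeholder: flag an obvious override phrase; real detectors are model-based.
    return "ignore previous instructions" in text.lower()

def content_policy_ok(output: str) -> bool:
    # Placeholder: a real runtime runs toxicity / regulated-topic classifiers here.
    return True

def guarded_call(llm_call, user_input: str) -> str:
    safe_input = redact_pii(user_input)          # pre-call: PII redaction
    if detect_prompt_injection(safe_input):      # pre-call: injection check
        raise PermissionError("Blocked: prompt-injection signal on untrusted input")
    output = llm_call(safe_input)
    if not content_policy_ok(output):            # post-call: content classification
        return "Response withheld by content policy."
    return output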

Reference Architecture Implementation

The webinar pairs with a reference implementation that wires the four pillars together. The minimal pattern looks like this:

from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType
from fi.evals import evaluate

# Step 1: Register the tracer (KPI alignment + observability)
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="support-agent",
)
tracer = FITracer(trace_provider)

# Step 2: Wrap your agent and tool functions
@tracer.agent(name="resolve_ticket")
def resolve_ticket(ticket_id: str, query: str) -> dict:
    # ... your retrieval + LLM call ...
    return {"resolution": "answer", "sources": ["doc-1", "doc-2"]}

# Step 3: Inline evaluation against the KPI proxy
result = resolve_ticket(ticket_id="T-123", query="reset password")
faithfulness = evaluate(
    "faithfulness",
    output=result["resolution"],
    context="\n".join(result["sources"]),
)
print(faithfulness.score, faithfulness.reason)

Routing, caching, budgets, and guardrails are enforced at the gateway layer (Agent Command Center) rather than in application code, which keeps the four pillars cleanly separated.

Further Reading and Primary Sources

Need evals and observability for your modern GenAI stack? Read the Future AGI documentation or book a demo for a tailored walkthrough.

Frequently Asked Questions

What is this webinar about?
Modern AI Engineering: Scaling LLMs from Pilot to Production is a session co-hosted by Sandeep Kaipu (Senior Manager, Software Engineering at Broadcom) and Nikhil Pareek (CEO at Future AGI). The session covers what it actually takes to scale a working LLM pilot into a production AI system that an enterprise can rely on. The four threads are KPI alignment (mapping AI work to business outcomes), infrastructure and data-pipeline scaling, inference cost and latency optimization, and the non-negotiable security and guardrails layer. The recording is roughly an hour and is available on demand through the gated player on this page.
When was the webinar recorded and what has changed since?
The webinar was recorded in May 2025. Twelve months on, the structural patterns covered in the session (KPI alignment, infra scaling, inference cost control, and guardrails) are still the right four pillars for modern AI engineering. What has changed is the toolchain: the OpenAI Agents SDK has become a common option for teams building agent workflows, model routing through gateways like the Future AGI Agent Command Center at /platform/monitor/command-center is now widely available, inference-time compute (reasoning models) has joined model size as a cost lever, and traceAI plus ai-evaluation are open-source observability and evaluation packages from Future AGI. The principles in the talk still apply; the tools have matured underneath them.
Who should watch this webinar?
AI founders, engineering leads, ML platform architects, and product managers building enterprise AI systems. The session is most useful for teams that have a working LLM proof of concept and now need to scale it into a multi-team, multi-environment production deployment. If you are still at the prompt-engineering stage, the talk will give you a roadmap. If you are already running production AI, it will give you a structured way to audit your stack against the four pillars.
What is the single biggest mistake teams make when scaling LLM systems?
Treating evaluation as a launch gate rather than a continuous production process. Teams pass a curated eval set, ship, and then discover regressions through user complaints. The webinar covers the alternative pattern: continuous production evaluation that scores sampled live traffic, rolls scores up into drift alerts, and feeds back into model selection and prompt updates. The Future AGI ai-evaluation library (Apache 2.0) and the Agent Command Center support this loop with a small wrapper around the standard fi.evals.evaluate call and a tracer registration.
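A minimal sketch of that sampling loop, assuming production traces are already available as output-plus-context records; the sample rate, rolling window, and drift threshold below are illustrative:

import random
from fi.evals import evaluate

SAMPLE_RATE = 0.05       # score roughly 5% of live traffic (illustrative)
DRIFT_THRESHOLD = 0.7    # alert when the rolling score drops below this (illustrative)

def maybe_score(trace: dict, recent_scores: list) -> None:
    # trace is a hypothetical record holding the agent output and its retrieved context.
    if random.random() > SAMPLE_RATE:
        return
    result = evaluate("faithfulness", output=trace["output"], context=trace["context"])
    recent_scores.append(result.score)
    window = recent_scores[-200:]
    rolling = sum(window) / len(window)
    if rolling < DRIFT_THRESHOLD:
        # Stand-in for a real alerting hook (pager, Slack, dashboard annotation).
        print(f"Drift alert: rolling faithfulness {rolling:.2f} below {DRIFT_THRESHOLD}")
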
What evaluation framework does the webinar recommend?
A three-layer evaluation stack. Heuristic and reference-based metrics for narrow tasks, model-based scorers for semantic similarity and groundedness, and LLM-as-a-judge evaluators for open-ended outputs. The Future AGI ai-evaluation library packages all three layers in one Python API: fi.evals.evaluate("faithfulness", ...) is the simplest entry point, fi.evals.metrics.CustomLLMJudge gives you authoring control over LLM-as-a-judge prompts, and fi.evals.llm.LiteLLMProvider lets you point the judge at any backing model. For continuous production runs, the cloud Turing models are turing_flash (roughly 1 to 2 seconds), turing_small (roughly 2 to 3 seconds), and turing_large (roughly 3 to 5 seconds).
How should teams approach inference cost and latency in 2026?
Cost and latency are now coupled with model selection in a way they were not in 2024. The patterns the webinar covers (cache aggressively, batch where you can, push to smaller models on easy paths, route to bigger models only on hard paths) are still right. What 2026 added is inference-time compute: reasoning models pay much more compute per call but answer harder problems in one shot, so the cost calculus is per task rather than per token. The Future AGI Agent Command Center at /platform/monitor/command-center handles the routing, caching, and budget enforcement layer so the model-selection decision is made per-request based on live latency and cost data.
What guardrails are non-negotiable for enterprise AI?
At minimum: PII redaction in inputs and outputs, prompt-injection detection on untrusted inputs, output content classification (toxicity, regulated topics), and tool-call authorization for agents that touch external systems. The webinar walks through how to layer these without adding more than 100 to 200 milliseconds of overhead per request. The Future AGI Guardrails runtime (fi.evals.guardrails.Guardrails) runs as a pre-call and post-call layer in the Agent Command Center gateway and supports custom policy rules in addition to the built-in protectors.
How do I connect what's in the webinar to my own stack?
Three concrete steps. First, instrument with traceAI (Apache 2.0, github.com/future-agi/traceAI), which gives OTEL-compatible spans for OpenAI, Anthropic, Vertex, Bedrock, LangChain, LangGraph, LlamaIndex, and the OpenAI Agents SDK. Second, wire ai-evaluation (Apache 2.0, github.com/future-agi/ai-evaluation) into your CI and your production sampling job. Third, route production traffic through the Agent Command Center at /platform/monitor/command-center for the BYOK gateway, budgets, and pre-call guardrails. The three pieces can be wired incrementally and each layer delivers value on its own.