
LLM Inference Performance Webinar (2026 Update): Continuous Batching, Speculative Decoding, and Intelligent Caching for Production AI Serving

Watch the LLM inference performance webinar, updated for 2026: continuous batching, speculative decoding, and caching that can cut serving cost on suitable workloads.


Inference Performance Webinar (2026 Update): The TL;DR

| Question | Answer |
| --- | --- |
| Who is the speaker? | FriendliAI infrastructure team, hosted by Future AGI. |
| Core techniques covered | Continuous batching, speculative decoding, intelligent caching, custom GPU kernels. |
| Cost impact | Up to 90% reduction in serving cost on workloads with shared context or shared prefixes. |
| Who should watch | ML and MLOps engineers shipping production GenAI systems. |
| FAGI’s role | Evaluation and observability of optimized inference at three latency tiers (turing_flash, turing_small, turing_large), plus traceAI (Apache 2.0) instrumentation. |

Watch the Webinar

Inference optimization separates production AI systems from proofs-of-concept. Most teams discover this only when costs spiral or p95 latency breaks SLAs.

What This Webinar Covers

As generative AI moves to production, the bottleneck shifts from training to serving. Industry coverage of large-scale deployments (see a16z’s LLM stack overview and the vLLM project) commonly points to inference as the dominant production GPU cost. Serving performance directly shapes user experience and unit economics.

This session walks through FriendliAI’s approach to LLM inference optimization. You will see architectural decisions and deployment strategies that enable lower-latency serving on production workloads and understand why inference is a business imperative, not a back-end concern.

This is not about squeezing marginal gains from existing infrastructure. It is about architecting inference pipelines that scale efficiently from day one.

Who Should Watch

ML and AI engineers, MLOps practitioners, infrastructure leads, and technical teams deploying generative AI in production who need to balance response speed, infrastructure costs, and system reliability.

Why You Should Watch: Continuous Batching, Speculative Decoding, Caching, and Real Customer Deployment Results

  • Why inference optimization becomes critical as AI systems move from prototype to production.
  • Continuous batching, speculative decoding, and intelligent caching that can reduce serving cost by up to 90% on workloads with shared context.
  • The FriendliAI infrastructure approach: custom GPU kernels and flexible deployment models.
  • Real customer deployments and measurable impact on latency, throughput, and cost.
  • Actionable deployment strategies for high-performance LLM serving at scale.
  • How inference efficiency affects latency, cost, and production reliability.

Key Insight

Most teams optimize model accuracy but deploy on generic serving infrastructure. Production-grade AI systems require purpose-built inference engines that treat serving performance as a first-class design constraint, not an afterthought.

Three Inference Optimization Techniques That Improve Serving Performance

Continuous Batching

Continuous batching merges new requests into in-flight GPU passes. Unlike static batching, which waits for a full batch before dispatching, it keeps GPU utilization high under variable traffic. See the vLLM/PagedAttention paper and the vLLM continuous batching docs for implementation details.
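The sketch below shows the core scheduling idea in plain Python. It is a conceptual illustration only, not FriendliAI's or vLLM's scheduler: the batch limit and the decode_step / is_finished callables are hypothetical, and real engines also manage KV-cache memory, preemption, and prefill/decode separation.

from collections import deque

MAX_BATCH = 32  # hypothetical slot limit; real engines budget by KV-cache memory

def continuous_batching_loop(incoming: deque, decode_step, is_finished):
    """Admit new requests between decode steps instead of waiting for a full batch."""
    active = []
    while incoming or active:
        # Admit waiting requests whenever a slot is free -- no waiting for a full batch.
        while incoming and len(active) < MAX_BATCH:
            active.append(incoming.popleft())
        # One decode step advances every in-flight request by one token.
        decode_step(active)
        # Retire finished sequences immediately so their slots free up next iteration.
        active = [req for req in active if not is_finished(req)]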

Speculative Decoding

A small draft model proposes tokens. A larger model verifies and accepts or rejects them. Latency can drop 2-3x on suitable workloads while preserving output quality when the draft model is well matched to the target. The technique was popularized by Leviathan et al. (2023) and now ships in vLLM, TensorRT-LLM, and most managed inference platforms.
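A minimal greedy sketch of the draft-and-verify loop follows. The draft_model and target_model callables are hypothetical stand-ins, and acceptance here is exact-match only; the published method (Leviathan et al., 2023) uses a probabilistic accept/reject rule that preserves the target model's sampling distribution.

def speculative_decode(draft_model, target_model, prompt, k=4, max_new_tokens=64):
    """Greedy draft-and-verify sketch; output matches target-only greedy decoding."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_model(tokens + proposal))
        # 2. The target model scores all k drafted positions in one forward pass,
        #    returning its own greedy token at each position plus one bonus token.
        verified = target_model(tokens, proposal)  # length k + 1
        # 3. Keep the longest prefix where draft and target agree, then append
        #    the target's correction (or its bonus token if everything was accepted).
        n_accepted = 0
        for drafted, expected in zip(proposal, verified):
            if drafted != expected:
                break
            n_accepted += 1
        tokens.extend(proposal[:n_accepted])
        tokens.append(verified[n_accepted])
    return tokens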

Intelligent Caching

KV cache reuse and prefix cache hits are the largest caching wins for production traffic. RAG systems with shared retrieved context, agent loops with shared system prompts, and high-traffic templated prompts all see dramatic cost cuts. See the SGLang prefix caching docs and the vLLM automatic prefix caching feature.
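As a rough usage sketch (the model name is illustrative and the exact flag may differ by vLLM version, so check the prefix caching docs), enabling automatic prefix caching lets requests that share a long system prompt or retrieved context reuse the cached KV blocks for that prefix:

from vllm import LLM, SamplingParams

# Illustrative only: model name and flag spelling depend on your vLLM version.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared_context = "System prompt plus retrieved documents shared by many requests...\n"
questions = ["Summarize the refund policy.", "What is the SLA for P1 incidents?"]

params = SamplingParams(temperature=0.0, max_tokens=128)
# The second request hits the prefix cache for shared_context, so only the
# short question suffix is prefilled from scratch.
outputs = llm.generate([shared_context + q for q in questions], params)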

How Future AGI Evaluates and Observes Optimized Inference

Future AGI is the evaluation and observability companion for whichever inference stack you ship on. Two surfaces are most relevant here.

Evaluators across three latency tiers. Cloud evaluators run at turing_flash (~1-2s), turing_small (~2-3s), and turing_large (~3-5s). See docs.futureagi.com for the cloud evaluator API and the open-source ai-evaluation SDK (Apache 2.0).

import os
from fi.evals import evaluate

os.environ["FI_API_KEY"] = "your_key"
os.environ["FI_SECRET_KEY"] = "your_secret"

result = evaluate(
    "faithfulness",
    output="The model returned an answer in 230ms with 99% accuracy.",
    context="Production trace shows 230ms p50 latency on GPT-5.",
    model="turing_flash",
)

traceAI for end-to-end observability. traceAI (Apache 2.0) instruments LLM calls and surfaces token-level timing, retry counts, and cost in the Agent Command Center at /platform/monitor/command-center. Drop-in adapters cover LangChain, OpenAI Agents, LlamaIndex, and MCP.

Wire these together and you have inference traces correlated with evaluation scores, which is the only way to tell whether speeding up serving silently degraded answer quality.

Watch the Webinar and Explore Future AGI

The full webinar is gated above. For deeper coverage of related topics, start with the resources below:

Visit Future AGI | Quickstart docs | Book a demo

Frequently asked questions

Who should watch the inference performance webinar?
ML and AI engineers, MLOps practitioners, infrastructure leads, and product engineers shipping generative AI in production who need to balance response latency, GPU cost, and reliability. The session assumes familiarity with LLM serving (vLLM, TGI, TensorRT-LLM) but starts from first principles for decoding, batching, and caching.
What inference optimization techniques does the webinar cover?
Three core techniques are covered. Continuous batching merges incoming requests into ongoing GPU passes for higher throughput. Speculative decoding uses a small draft model to propose tokens that a larger model verifies, which can cut latency 2-3x on suitable workloads. Intelligent caching reuses KV cache and prefix cache across requests with shared context. Together, these can reduce serving cost by up to 90%.
Why does inference performance matter more than training cost?
Once a model is in production, inference dominates total compute spend. Industry estimates commonly put the bulk of production GPU resources at the inference stage, with figures of 80-90% often cited. Inference latency directly shapes user experience and unit economics, while training cost is amortized over the model's lifetime. Optimizing inference is therefore the lever with the largest cost and product impact.
What is FriendliAI's approach to inference optimization?
FriendliAI is a purpose-built inference engine company. Their approach combines custom GPU kernels, dynamic batching, speculative decoding, and quantization. The webinar walks through their architecture and customer deployments. See [friendli.ai](https://friendli.ai/) for the platform details.
How does inference optimization affect FAGI evaluations?
Optimized inference cuts evaluation latency and cost. Future AGI's cloud evaluators ship at three latency tiers: turing_flash (~1-2s), turing_small (~2-3s), turing_large (~3-5s). These tiers map to underlying inference-engine tradeoffs. For production traffic, faster evaluators let you run quality checks inline without breaking p95 latency budgets.
Can I observe inference performance with Future AGI?
Yes. traceAI (Apache 2.0) instruments LLM calls and surfaces token-level timing, retry counts, and cost in the Agent Command Center at /platform/monitor/command-center. You can correlate inference latency with evaluation scores and tie regressions back to specific model versions or serving configs.