GOOGLE DONATES LLM-D LLM INFERENCE GATEWAY TO CNCF SANDBOX
Google open-sourced llm-d, a Kubernetes-native LLM inference gateway, into the CNCF Sandbox with backing from IBM, Red Hat, NVIDIA, and Anyscale.
llm-d isn’t a model or training stack; it’s the routing and scheduling layer for running LLMs on Kubernetes, with features like intelligent request routing and KV cache reuse to tame bursty, GPU-heavy traffic (WebProNews). The aim is to standardize the “plumbing” of LLM inference so teams don’t keep rebuilding the same gateway layer.
The move lands in the CNCF as a vendor-neutral project and arrives alongside IBM and Red Hat’s push for a Kubernetes “blueprint” for LLM inference deployments (The New Stack). For platform teams, this points to a common control plane for multi-model, multi-cluster LLM serving on K8s.
Inference is now the dominant AI cost center; a common, open gateway could lower latency and GPU burn via smarter routing and cache reuse.
CNCF stewardship increases the odds of broad ecosystem support and less vendor lock-in across model backends and accelerator fleets.
Experiments to run...
- Deploy llm-d in a staging Kubernetes cluster and benchmark P50/P95/P99 latency and GPU utilization with and without KV cache reuse under bursty load.
- Route traffic across multiple model backends and compare tail latency, throughput, and failover behavior versus your current gateway or direct-to-runtime approach.
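The measurement side of the first experiment can be sketched with stdlib Python. The endpoint URL, model name, and request payload below are placeholders for whatever your gateway actually exposes, not llm-d's real API:

```python
import json
import time
import urllib.request


def percentiles(samples_ms):
    """Return (p50, p95, p99) from a list of latency samples in ms (nearest-rank)."""
    xs = sorted(samples_ms)

    def q(p):
        idx = max(0, min(len(xs) - 1, round(p * (len(xs) - 1))))
        return xs[idx]

    return q(0.50), q(0.95), q(0.99)


def time_request(url, prompt, model):
    """Send one completion request and return wall-clock latency in ms.
    The URL and JSON shape are hypothetical placeholders."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": 64}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    t0 = time.perf_counter()
    urllib.request.urlopen(req).read()
    return (time.perf_counter() - t0) * 1000.0


# Example usage against a staging gateway (run once with KV cache reuse on, once off):
#   lat = [time_request("http://llm-d-staging/v1/completions", "Hello", "llama-3-8b")
#          for _ in range(200)]
#   p50, p95, p99 = percentiles(lat)
```

Comparing the two percentile triples under the same bursty load generator gives a first-order read on what cache reuse buys you before you look at GPU utilization.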
Legacy codebase integration strategies...
1. Front existing K8s-hosted inference services with llm-d; map auth, quotas, tracing, and metrics to your current stack.
2. Plan a phased cutover by shadowing production traffic through llm-d and validating cache hit rates, autoscaling, and SLOs.
Fresh architecture paradigms...
1. Adopt llm-d as the default inference entrypoint to standardize routing, caching, and observability from day one.
2. Design for multi-tenant clusters and heterogeneous GPUs, letting llm-d handle placement while you keep business logic thin.