Cloud Blog: Introducing the next generation of AI inference, powered by llm-d

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/enhancing-vllm-for-distributed-inference-with-llm-d/
Source: Cloud Blog
Title: Introducing the next generation of AI inference, powered by llm-d

Feedly Summary: As the world transitions from prototyping AI solutions to deploying AI at scale, efficient AI inference is becoming the gating factor. Two years ago, the challenge was the ever-growing size of AI models. Cloud infrastructure providers responded by supporting orders of magnitude more compute and data. Today, agentic AI workflows and reasoning models create highly variable demands and another exponential increase in processing, easily bogging down the inference process and degrading the user experience. Cloud infrastructure has to evolve again.
Open-source inference engines such as vLLM are a key part of the solution. At Google Cloud Next 25 in April, we announced full vLLM support for Cloud TPUs in Google Kubernetes Engine (GKE), Google Compute Engine, Vertex AI, and Cloud Run. Additionally, given the widespread adoption of Kubernetes for orchestrating inference workloads, we introduced the open-source Gateway API Inference Extension project to add AI-native routing to Kubernetes, and made it available in our GKE Inference Gateway. Customers like Snap, Samsung, and BentoML are seeing great results from these solutions. And later this year, customers will be able to use these solutions with our seventh-generation Ironwood TPU, purpose-built to build and serve reasoning models by scaling to up to 9,216 liquid-cooled chips in a single pod linked with breakthrough Inter-Chip Interconnect (ICI). But, there’s opportunity for even more innovation and value.

Today, we’re making inference even easier and more cost-effective, by making vLLM fully scalable with Kubernetes-native distributed and disaggregated inference. This new project is called llm-d. Google Cloud is a founding contributor alongside Red Hat, IBM Research, NVIDIA, and CoreWeave, joined by other industry leaders AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI. Google has a long history of founding and contributing to key open-source projects that have shaped the cloud, such as Kubernetes, JAX, and Istio, and is committed to being the best platform for AI development. We believe that making llm-d open-source, and community-led, is the best way to make it widely available, so you can run it everywhere and know that a strong community supports it.
llm-d builds upon vLLM’s highly efficient inference engine, adding Google’s proven technology and extensive experience in securely and cost-effectively serving AI at billion-user scale. llm-d includes three major innovations:

1. Instead of traditional round-robin load balancing, llm-d includes a vLLM-aware inference scheduler that routes requests to instances with prefix-cache hits and low load, achieving latency SLOs with fewer hardware resources.
2. To serve longer requests with higher throughput and lower latency, llm-d supports disaggregated serving, which runs the prefill and decode stages of LLM inference on independent instances.
3. llm-d introduces a multi-tier KV cache that spreads intermediate values (prefixes) across different storage tiers to improve response time and reduce storage costs.

llm-d works across frameworks (PyTorch today, JAX later this year) and across both GPU and TPU accelerators, to provide choice and flexibility.
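To make the first innovation concrete, here is a minimal, hypothetical scoring sketch for cache- and load-aware routing, assuming each vLLM replica can report how much of the incoming prompt it already holds in its KV cache and how deep its queue is. The `Replica` class, `score_replica` function, and the weights below are illustrative stand-ins, not llm-d or vLLM APIs:

```python
# Toy sketch: prefer replicas with long cached prefixes and short queues,
# instead of rotating through replicas round-robin.
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    cached_prefix_tokens: int  # longest prompt prefix already in this replica's KV cache
    queued_requests: int       # current queue depth reported by the replica


def score_replica(replica: Replica, prompt_tokens: int,
                  cache_weight: float = 1.0, load_weight: float = 0.1) -> float:
    """Higher is better: reward prefix-cache reuse, penalize queue depth."""
    cache_hit_ratio = min(replica.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
    return cache_weight * cache_hit_ratio - load_weight * replica.queued_requests


def pick_replica(replicas: list[Replica], prompt_tokens: int) -> Replica:
    # Unlike round-robin, the choice depends on per-request state (the prompt length)
    # and per-replica state (cache contents and load).
    return max(replicas, key=lambda r: score_replica(r, prompt_tokens))


if __name__ == "__main__":
    fleet = [
        Replica("vllm-0", cached_prefix_tokens=0, queued_requests=1),
        Replica("vllm-1", cached_prefix_tokens=900, queued_requests=2),
        Replica("vllm-2", cached_prefix_tokens=1200, queued_requests=8),
    ]
    chosen = pick_replica(fleet, prompt_tokens=1000)
    print(f"route request to {chosen.name}")  # vllm-1: strong cache hit, modest load
```

A production scheduler would weigh more signals (KV-cache utilization, SLO class, in-flight batch size), but the core difference from round-robin is the same: routing decisions use per-request and per-replica state.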

You can already use features like model-aware routing and load balancing on AI Hypercomputer today, through GKE Inference Gateway and vLLM running on multiple accelerators, secured by Model Armor.
We are excited to partner with the community to help you cost-effectively scale AI in your business. llm-d incorporates state-of-the-art distributed serving technologies into an easily deployed Kubernetes stack. Deploying llm-d on Google Cloud provides low-latency and high-performance inference by leveraging Google Cloud’s vast global network, GKE AI capabilities, and AI Hypercomputer integrations across software and hardware accelerators. Early tests by Google Cloud using llm-d show 2x improvements in time-to-first-token for use cases like code completion, enabling more responsive applications.
Visit the llm-d project to learn more, contribute, and get started today.

AI Summary and Description: Yes

Summary: The text covers advancements in AI inference, particularly Google Cloud’s announcements at Next 25 and the newly introduced llm-d project, which extends vLLM with scalable, Kubernetes-native distributed inference. It emphasizes how these innovations can improve user experience and efficiency in AI workflows, a crucial consideration for professionals in AI and cloud infrastructure.

Detailed Description: The piece highlights several key points regarding the evolving landscape of AI inference and Google Cloud’s contributions to this development. Here’s a detailed breakdown:

– **Transition from Prototyping to Deployment**: The text underscores the shift from AI prototyping to large-scale deployment, where efficient AI inference has become essential. As models grow in complexity, the demand for processing power increases significantly.

– **Challenges in AI Inference**: It discusses the challenges posed by “agentic AI workflows” which lead to varied processing requirements, potentially degrading user experience due to bottlenecks in inference.

– **Cloud Infrastructure Evolution**: There’s an indication that cloud infrastructure must continually adapt to support these growing and variable demands. The need for innovation in this space is emphasized.

– **Open Source Innovations**:
  – **vLLM**: Open-source inference engines like vLLM are a key part of addressing these challenges. Google Cloud announced full vLLM support for Cloud TPUs across GKE, Google Compute Engine, Vertex AI, and Cloud Run.
  – **llm-d Project**: A new project, llm-d, builds on vLLM to enable scalable, Kubernetes-native distributed and disaggregated inference.

– **Community Collaboration**: The text indicates that Google Cloud is collaborating with notable industry leaders (Red Hat, IBM Research, NVIDIA, etc.) to make llm-d open-source and community-led, reinforcing the importance of community in technological development.

– **Key Innovations of llm-d**:
  – **vLLM-aware Inference Scheduler**: Routes each request to the instance with the best prefix-cache hit and lowest load, meeting latency SLOs with fewer hardware resources.
  – **Disaggregated Serving**: Runs the prefill and decode stages of LLM inference on independent instances to boost throughput and lower latency for long requests (a toy prefill/decode handoff sketch follows this list).
  – **Multi-tier KV Cache**: Spreads cached prefixes across storage tiers to improve response times and reduce storage costs (see the tiered-cache sketch at the end of this breakdown).
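The toy sketch referenced above separates a compute-bound prefill step from a memory-bandwidth-bound decode loop and hands the prompt’s KV state between them. The `PrefillWorker` and `DecodeWorker` classes and the fabricated KV blocks are assumptions for illustration; they are not llm-d or vLLM interfaces:

```python
# Toy sketch of disaggregated serving: prefill and decode run as independent
# workers, with the prompt's KV state transferred between them.
from dataclasses import dataclass, field


@dataclass
class KVState:
    """Per-request attention state produced by prefill and consumed by decode."""
    prompt_tokens: list[int]
    kv_blocks: list[bytes] = field(default_factory=list)


class PrefillWorker:
    """Compute-bound stage: processes the whole prompt once, in parallel."""

    def run(self, prompt_tokens: list[int]) -> KVState:
        # A real engine would run the model over the prompt and fill KV blocks
        # on an accelerator; here we fabricate placeholder blocks.
        blocks = [bytes([t % 256]) for t in prompt_tokens]
        return KVState(prompt_tokens=prompt_tokens, kv_blocks=blocks)


class DecodeWorker:
    """Memory-bandwidth-bound stage: generates tokens one at a time."""

    def run(self, state: KVState, max_new_tokens: int) -> list[int]:
        generated: list[int] = []
        for _ in range(max_new_tokens):
            # A real engine would attend over state.kv_blocks to pick the next
            # token and append its KV entry; we emit a dummy token instead.
            next_token = (sum(state.prompt_tokens) + len(generated)) % 32000
            generated.append(next_token)
            state.kv_blocks.append(bytes([next_token % 256]))
        return generated


if __name__ == "__main__":
    prompt = list(range(16))
    state = PrefillWorker().run(prompt)    # runs on a prefill instance
    tokens = DecodeWorker().run(state, 8)  # KV state handed to a decode instance
    print(tokens)
```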

– **Interoperability and Flexibility**: llm-d is designed to work across frameworks (PyTorch today, JAX later this year) and with both GPU and TPU accelerators, giving developers choice and flexibility.

– **Performance Gains**: Early tests by Google Cloud show llm-d delivering a 2x improvement in time-to-first-token for use cases like code completion, pointing toward more responsive AI applications.
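And here is the tiered-cache sketch referenced above: a minimal lookup that checks the fastest tier first and promotes entries on a hit. The tier names, capacities, and LRU policy are assumptions for illustration, not llm-d’s actual design:

```python
# Toy sketch of a multi-tier KV (prefix) cache: fast device memory backed by
# larger host memory and cheap remote storage, with promote-on-hit.
from collections import OrderedDict


class Tier:
    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity
        self.entries: OrderedDict[str, bytes] = OrderedDict()  # LRU order

    def get(self, key: str):
        if key in self.entries:
            self.entries.move_to_end(key)          # mark as recently used
            return self.entries[key]
        return None

    def put(self, key: str, value: bytes):
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:   # evict least recently used
            # Evicted entries are simply dropped here for brevity; a fuller
            # design would spill them to the next tier down.
            self.entries.popitem(last=False)


class TieredKVCache:
    """Check the fastest tier first; on a hit in a slower tier, promote upward."""

    def __init__(self):
        self.tiers = [Tier("hbm", 4), Tier("host_ram", 16), Tier("remote", 256)]

    def lookup(self, prefix_hash: str):
        for i, tier in enumerate(self.tiers):
            value = tier.get(prefix_hash)
            if value is not None:
                for faster in self.tiers[:i]:      # promote into faster tiers
                    faster.put(prefix_hash, value)
                return value, tier.name
        return None, None

    def insert(self, prefix_hash: str, kv_blob: bytes):
        self.tiers[0].put(prefix_hash, kv_blob)    # new prefixes land in the fastest tier


if __name__ == "__main__":
    cache = TieredKVCache()
    cache.insert("prefix-abc", b"kv-bytes")
    value, tier = cache.lookup("prefix-abc")
    print(tier)  # "hbm": found in the fastest tier
```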

Overall, the innovations discussed in the text are critical for security and compliance professionals to consider as they prepare for and adapt to the evolving AI landscape, particularly in cloud environments where scalable and efficient AI solutions are of paramount importance.