Cloud Blog: Introducing the next generation of AI inference, powered by llm-d

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/enhancing-vllm-for-distributed-inference-with-llm-d/
Source: Cloud Blog
Title: Introducing the next generation of AI inference, powered by llm-d

Feedly Summary: As the world transitions from prototyping AI solutions to deploying AI at scale, efficient AI inference is becoming the gating factor. Two years ago, the challenge was the ever-growing size of AI models. Cloud infrastructure providers responded by supporting orders of magnitude more compute and data. Today, agentic AI workflows and reasoning models create highly variable demands and another exponential increase in processing, easily bogging down the inference process and degrading the user experience. Cloud infrastructure has to evolve again.
Open-source inference engines such as vLLM are a key part of the solution. At Google Cloud Next 25 in April, we announced full vLLM support for Cloud TPUs in Google Kubernetes Engine (GKE), Google Compute Engine, Vertex AI, and Cloud Run. Additionally, given the widespread adoption of Kubernetes for orchestrating inference workloads, we introduced the open-source Gateway API Inference Extension project to add AI-native routing to Kubernetes, and made it available in our GKE Inference Gateway. Customers like Snap, Samsung, and BentoML are seeing great results from these solutions. And later this year, customers will be able to use these solutions with our seventh-generation Ironwood TPU, purpose-built to build and serve reasoning models by scaling to up to 9,216 liquid-cooled chips in a single pod linked with breakthrough Inter-Chip Interconnect (ICI). But, there’s opportunity for even more innovation and value.

Today, we’re making inference even easier and more cost-effective, by making vLLM fully scalable with Kubernetes-native distributed and disaggregated inference. This new project is called llm-d. Google Cloud is a founding contributor alongside Red Hat, IBM Research, NVIDIA, and CoreWeave, joined by other industry leaders AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI. Google has a long history of founding and contributing to key open-source projects that have shaped the cloud, such as Kubernetes, JAX, and Istio, and is committed to being the best platform for AI development. We believe that making llm-d open-source, and community-led, is the best way to make it widely available, so you can run it everywhere and know that a strong community supports it.
llm-d builds upon vLLM’s highly efficient inference engine, adding Google’s proven technology and extensive experience in securely and cost-effectively serving AI at billion-user scale. llm-d includes three major innovations:

1. Instead of traditional round-robin load balancing, llm-d includes a vLLM-aware inference scheduler that routes requests to instances with prefix-cache hits and low load, achieving latency SLOs with fewer hardware resources.
2. To serve longer requests with higher throughput and lower latency, llm-d supports disaggregated serving, which runs the prefill and decode stages of LLM inference on independent instances.
3. llm-d introduces a multi-tier KV cache that spreads intermediate values (prefixes) across different storage tiers to improve response time and reduce storage costs.

llm-d works across frameworks (PyTorch today, JAX later this year) and across both GPU and TPU accelerators, to provide choice and flexibility.
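To make the first innovation concrete, here is a minimal, hypothetical scoring sketch for cache- and load-aware routing, assuming each vLLM replica can report how much of the incoming prompt it already holds in its KV cache and how deep its queue is. The `Replica` class, `score_replica` function, and the weights below are illustrative stand-ins, not llm-d or vLLM APIs:

```python
# Toy sketch: prefer replicas with long cached prefixes and short queues,
# instead of rotating through replicas round-robin.
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    cached_prefix_tokens: int  # longest prompt prefix already in this replica's KV cache
    queued_requests: int       # current queue depth reported by the replica


def score_replica(replica: Replica, prompt_tokens: int,
                  cache_weight: float = 1.0, load_weight: float = 0.1) -> float:
    """Higher is better: reward prefix-cache reuse, penalize queue depth."""
    cache_hit_ratio = min(replica.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
    return cache_weight * cache_hit_ratio - load_weight * replica.queued_requests


def pick_replica(replicas: list[Replica], prompt_tokens: int) -> Replica:
    # Unlike round-robin, the choice depends on per-request state (the prompt length)
    # and per-replica state (cache contents and load).
    return max(replicas, key=lambda r: score_replica(r, prompt_tokens))


if __name__ == "__main__":
    fleet = [
        Replica("vllm-0", cached_prefix_tokens=0, queued_requests=1),
        Replica("vllm-1", cached_prefix_tokens=900, queued_requests=2),
        Replica("vllm-2", cached_prefix_tokens=1200, queued_requests=8),
    ]
    chosen = pick_replica(fleet, prompt_tokens=1000)
    print(f"route request to {chosen.name}")  # vllm-1: strong cache hit, modest load
```

A production scheduler would weigh more signals (KV-cache utilization, SLO class, in-flight batch size), but the core difference from round-robin is the same: routing decisions use per-request and per-replica state.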

You can already use features like model-aware routing and load balancing on AI Hypercomputer today, through GKE Inference Gateway and vLLM running on multiple accelerators, secured by Model Armor.
We are excited to partner with the community to help you cost-effectively scale AI in your business. llm-d incorporates state-of-the-art distributed serving technologies into an easily deployed Kubernetes stack. Deploying llm-d on Google Cloud provides low-latency and high-performance inference by leveraging Google Cloud’s vast global network, GKE AI capabilities, and AI Hypercomputer integrations across software and hardware accelerators. Early tests by Google Cloud using llm-d show 2x improvements in time-to-first-token for use cases like code completion, enabling more responsive applications.
Visit the llm-d project to learn more, contribute, and get started today.

AI Summary and Description: Yes

Summary: The text covers advancements in AI inference, particularly Google Cloud’s announcements at Next 25 and the newly introduced llm-d project, which extends vLLM with scalable, Kubernetes-native distributed inference. It emphasizes how these innovations can improve user experience and efficiency in AI workflows, a crucial consideration for professionals in AI and cloud infrastructure.

Detailed Description: The piece highlights several key points regarding the evolving landscape of AI inference and Google Cloud’s contributions to this development. Here’s a detailed breakdown:

– **Transition from Prototyping to Deployment**: The text underscores the shift from AI prototyping to large-scale deployment, where efficient AI inference has become essential. As models grow in complexity, the demand for processing power increases significantly.

– **Challenges in AI Inference**: It discusses the challenges posed by “agentic AI workflows” which lead to varied processing requirements, potentially degrading user experience due to bottlenecks in inference.

– **Cloud Infrastructure Evolution**: There’s an indication that cloud infrastructure must continually adapt to support these growing and variable demands. The need for innovation in this space is emphasized.

– **Open Source Innovations**:
  – **vLLM**: Open-source inference engines like vLLM are a key part of addressing these challenges. Google Cloud announced full vLLM support for Cloud TPUs across GKE, Google Compute Engine, Vertex AI, and Cloud Run.
  – **llm-d Project**: A new project, llm-d, builds on vLLM to enable scalable, Kubernetes-native distributed and disaggregated inference.

– **Community Collaboration**: The text indicates that Google Cloud is collaborating with notable industry leaders (Red Hat, IBM Research, NVIDIA, etc.) to make llm-d open-source and community-led, reinforcing the importance of community in technological development.

– **Key Innovations of llm-d**:
  – **vLLM-aware Inference Scheduler**: Routes each request to the instance with the best prefix-cache hit and lowest load, meeting latency SLOs with fewer hardware resources.
  – **Disaggregated Serving**: Runs the prefill and decode stages of LLM inference on independent instances to boost throughput and lower latency for long requests (a toy prefill/decode handoff sketch follows this list).
  – **Multi-tier KV Cache**: Spreads cached prefixes across storage tiers to improve response times and reduce storage costs (see the tiered-cache sketch at the end of this breakdown).
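The toy sketch referenced above separates a compute-bound prefill step from a memory-bandwidth-bound decode loop and hands the prompt’s KV state between them. The `PrefillWorker` and `DecodeWorker` classes and the fabricated KV blocks are assumptions for illustration; they are not llm-d or vLLM interfaces:

```python
# Toy sketch of disaggregated serving: prefill and decode run as independent
# workers, with the prompt's KV state transferred between them.
from dataclasses import dataclass, field


@dataclass
class KVState:
    """Per-request attention state produced by prefill and consumed by decode."""
    prompt_tokens: list[int]
    kv_blocks: list[bytes] = field(default_factory=list)


class PrefillWorker:
    """Compute-bound stage: processes the whole prompt once, in parallel."""

    def run(self, prompt_tokens: list[int]) -> KVState:
        # A real engine would run the model over the prompt and fill KV blocks
        # on an accelerator; here we fabricate placeholder blocks.
        blocks = [bytes([t % 256]) for t in prompt_tokens]
        return KVState(prompt_tokens=prompt_tokens, kv_blocks=blocks)


class DecodeWorker:
    """Memory-bandwidth-bound stage: generates tokens one at a time."""

    def run(self, state: KVState, max_new_tokens: int) -> list[int]:
        generated: list[int] = []
        for _ in range(max_new_tokens):
            # A real engine would attend over state.kv_blocks to pick the next
            # token and append its KV entry; we emit a dummy token instead.
            next_token = (sum(state.prompt_tokens) + len(generated)) % 32000
            generated.append(next_token)
            state.kv_blocks.append(bytes([next_token % 256]))
        return generated


if __name__ == "__main__":
    prompt = list(range(16))
    state = PrefillWorker().run(prompt)    # runs on a prefill instance
    tokens = DecodeWorker().run(state, 8)  # KV state handed to a decode instance
    print(tokens)
```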

– **Interoperability and Flexibility**: llm-d is designed to work across frameworks (PyTorch today, JAX later this year) and with both GPU and TPU accelerators, giving developers choice and flexibility.

– **Performance Gains**: Early tests by Google Cloud show llm-d delivering a 2x improvement in time-to-first-token for use cases like code completion, pointing toward more responsive AI applications.
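And here is the tiered-cache sketch referenced above: a minimal lookup that checks the fastest tier first and promotes entries on a hit. The tier names, capacities, and LRU policy are assumptions for illustration, not llm-d’s actual design:

```python
# Toy sketch of a multi-tier KV (prefix) cache: fast device memory backed by
# larger host memory and cheap remote storage, with promote-on-hit.
from collections import OrderedDict


class Tier:
    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity
        self.entries: OrderedDict[str, bytes] = OrderedDict()  # LRU order

    def get(self, key: str):
        if key in self.entries:
            self.entries.move_to_end(key)          # mark as recently used
            return self.entries[key]
        return None

    def put(self, key: str, value: bytes):
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:   # evict least recently used
            # Evicted entries are simply dropped here for brevity; a fuller
            # design would spill them to the next tier down.
            self.entries.popitem(last=False)


class TieredKVCache:
    """Check the fastest tier first; on a hit in a slower tier, promote upward."""

    def __init__(self):
        self.tiers = [Tier("hbm", 4), Tier("host_ram", 16), Tier("remote", 256)]

    def lookup(self, prefix_hash: str):
        for i, tier in enumerate(self.tiers):
            value = tier.get(prefix_hash)
            if value is not None:
                for faster in self.tiers[:i]:      # promote into faster tiers
                    faster.put(prefix_hash, value)
                return value, tier.name
        return None, None

    def insert(self, prefix_hash: str, kv_blob: bytes):
        self.tiers[0].put(prefix_hash, kv_blob)    # new prefixes land in the fastest tier


if __name__ == "__main__":
    cache = TieredKVCache()
    cache.insert("prefix-abc", b"kv-bytes")
    value, tier = cache.lookup("prefix-abc")
    print(tier)  # "hbm": found in the fastest tier
```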

Overall, the innovations discussed in the text are critical for security and compliance professionals to consider as they prepare for and adapt to the evolving AI landscape, particularly in cloud environments where scalable and efficient AI solutions are of paramount importance.