Source URL: https://cloud.google.com/blog/products/containers-kubernetes/google-bytedance-and-red-hat-improve-ai-on-kubernetes/
Source: Cloud Blog
Title: Google, Bytedance, and Red Hat make Kubernetes generative AI inference aware
Feedly Summary: Over the past ten years, Kubernetes has become the leading platform for deploying cloud-native applications and microservices, backed by an extensive community and boasting a comprehensive feature set for managing distributed systems. Today, we are excited to share that Kubernetes is now unlocking new possibilities for generative AI inference.
In partnership with Red Hat and ByteDance, we are introducing new capabilities that optimize load balancing, scaling, and model server performance on Kubernetes clusters running large language model (LLM) inference. These capabilities build on the success of LeaderWorkerSet (LWS), which enables multi-host inference for state-of-the-art models (including ones with 671B parameters), and push the envelope on what’s possible for gen AI inference on Kubernetes.
First, the new Gateway API Inference Extension now supports LLM-aware routing rather than traditional round-robin load balancing. This makes it more cost-effective to operationalize popular Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) at scale: a single base model is deployed, and fine-tuned models (‘adapters’) are loaded dynamically based on user need. To support PEFT natively, we also introduced new APIs, namely InferencePool and InferenceModel.
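As a concrete sketch, the snippet below registers a pool of model-server pods and a LoRA adapter with these new APIs using the official Kubernetes Python client. The API group/version, field names, and resource names (vllm-llama-pool, food-review, the endpoint-picker reference) follow the project’s alpha CRDs at the time of writing and are assumptions that may differ in your release; treat this as an illustration, not a canonical manifest.

```python
# Sketch: an InferencePool backed by vLLM pods, plus an InferenceModel that
# routes a client-facing model name to a LoRA adapter served on the pool's
# base model. Field names follow the Gateway API Inference Extension alpha
# CRDs and may change; all names here are placeholders.
from kubernetes import client, config

GROUP, VERSION, NS = "inference.networking.x-k8s.io", "v1alpha2", "default"

inference_pool = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "InferencePool",
    "metadata": {"name": "vllm-llama-pool", "namespace": NS},
    "spec": {
        # Pods labeled app=vllm-llama and listening on port 8000 back this pool.
        "selector": {"app": "vllm-llama"},
        "targetPortNumber": 8000,
        # The endpoint-picker extension that performs LLM-aware routing.
        "extensionRef": {"name": "vllm-llama-endpoint-picker"},
    },
}

inference_model = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "InferenceModel",
    "metadata": {"name": "food-review", "namespace": NS},
    "spec": {
        # Model name that clients put in their OpenAI-style requests.
        "modelName": "food-review",
        "criticality": "Standard",
        "poolRef": {"name": "vllm-llama-pool"},
        # Resolve requests to a LoRA adapter loaded on top of the base model.
        "targetModels": [{"name": "food-review-lora-v1", "weight": 100}],
    },
}

config.load_kube_config()
api = client.CustomObjectsApi()
for manifest in (inference_pool, inference_model):
    api.create_namespaced_custom_object(
        GROUP, VERSION, NS, manifest["kind"].lower() + "s", manifest
    )
```

In practice these objects would usually be written as YAML and applied with kubectl; the Python client is used here only to keep the example self-contained.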
Second, a new inference performance project provides a benchmarking standard that delivers detailed model performance insights on accelerators, along with metrics and thresholds for Horizontal Pod Autoscaler (HPA) scaling. With the growth of gen AI inference on Kubernetes, it’s important to be able to measure the performance of serving workloads alongside the performance of model servers, accelerators, and Kubernetes orchestration.
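For example, once benchmarking has identified a useful saturation signal and threshold, that signal can be wired into a standard HPA. The sketch below assumes a per-pod queue-depth metric named vllm_num_requests_waiting exposed through a custom-metrics adapter (such as Prometheus Adapter); both the metric name and the target value are illustrative assumptions, not values prescribed by the project.

```python
# Sketch: scaling a model-server Deployment on an inference metric instead of
# CPU, with the threshold informed by benchmark runs. The metric name and
# averageValue below are placeholders you would replace with your own results.
from kubernetes import client, config

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "vllm-llama-hpa", "namespace": "default"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "vllm-llama",
        },
        "minReplicas": 1,
        "maxReplicas": 8,
        "metrics": [
            {
                "type": "Pods",
                "pods": {
                    # Per-pod request queue depth scraped from the model server.
                    "metric": {"name": "vllm_num_requests_waiting"},
                    # Scale out once the average queue exceeds this threshold.
                    "target": {"type": "AverageValue", "averageValue": "5"},
                },
            }
        ],
    },
}

config.load_kube_config()
client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```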
Third, Dynamic Resource Allocation, developed with Intel and others, simplifies and automates how Kubernetes allocates and schedules GPUs, TPUs, and other devices to pods and workloads. When used along with the vLLM inference and serving engine, the community benefits from scheduling efficiency and portability across accelerators.
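The sketch below shows the shape of a DRA request: a ResourceClaimTemplate describing the device, and a vLLM pod that consumes a claim generated from it. It assumes a DRA driver is installed in the cluster; the device class name gpu.example.com, the image tag, the model name, and the beta API version are placeholders that will vary by cluster, driver, and Kubernetes release.

```python
# Sketch: requesting an accelerator through Dynamic Resource Allocation (DRA)
# rather than a fixed resource limit. Field names follow the resource.k8s.io
# beta API; device class, image, and model are illustrative placeholders.
from kubernetes import client, config

NS = "default"

claim_template = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaimTemplate",
    "metadata": {"name": "single-gpu", "namespace": NS},
    "spec": {
        "spec": {
            "devices": {
                # Ask the DRA driver for one device of the given class.
                "requests": [{"name": "gpu", "deviceClassName": "gpu.example.com"}]
            }
        }
    },
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "vllm-server", "namespace": NS},
    "spec": {
        "resourceClaims": [
            # Each pod gets its own claim generated from the template.
            {"name": "gpu", "resourceClaimTemplateName": "single-gpu"}
        ],
        "containers": [
            {
                "name": "vllm",
                "image": "vllm/vllm-openai:latest",
                "args": ["--model", "meta-llama/Llama-3.1-8B-Instruct"],
                # Bind the container to the claimed device.
                "resources": {"claims": [{"name": "gpu"}]},
            }
        ],
    },
}

config.load_kube_config()
# The generic custom-objects path also reaches built-in API groups such as
# resource.k8s.io, which keeps this sketch independent of typed client versions.
client.CustomObjectsApi().create_namespaced_custom_object(
    "resource.k8s.io", "v1beta1", NS, "resourceclaimtemplates", claim_template
)
client.CoreV1Api().create_namespaced_pod(namespace=NS, body=pod)
```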
“Large-scale inference with scalability and flexibility remains a challenge on Kubernetes. We are excited to collaborate with Google and the community on the Gateway API Inference Extension project to extract common infrastructure layers, creating a more unified and efficient routing system for AI serving — enhancing both AIBrix and the broader AI ecosystem.” – Jiaxin Shan, Staff Engineer at Bytedance, and Founder at AIBrix
“We’ve been collaborating with Google on various initiatives in the Kubernetes Serving working group, including a shared benchmarking tool for gen AI inference workloads. Working with Google, we hope to contribute to a common standard for developers to compare single-node inference performance and scale out to the multi-node architectures that Kubernetes brings to the table.” – Yuan Tang, Senior Principal Software Engineer, Red Hat
“We are partnering with Google to improve vLLM for operationalizing deployments of open-source LLMs for enterprise, including capabilities like LoRA support and Prometheus metrics that enable customers to benefit across the full stack right from vLLM to Kubernetes primitives such as Gateway. This deep partnership across the stack ensures customers get production-ready architectures to deploy at scale.” – Robert Shaw, vLLM Core Committer and Senior Director of Engineering, Neural Magic (acquired by Red Hat)
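To make the vLLM side of that stack concrete, here is a minimal sketch of serving one base model with a dynamically applied LoRA adapter through vLLM’s offline LLM API. The model and adapter paths are placeholders; a production deployment would typically run vLLM’s OpenAI-compatible server with LoRA enabled and let the LLM-aware routing layer select adapters per request.

```python
# Sketch: one base model loaded once, with a LoRA adapter applied per request
# so fine-tuned variants share the base weights. Paths and names are
# placeholders for illustration only.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the base model with LoRA support enabled.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)

sampling = SamplingParams(temperature=0.2, max_tokens=128)

# Route this request through a fine-tuned adapter without reloading the base
# model; other requests can use a different adapter or none at all.
outputs = llm.generate(
    ["Summarize this customer review: ..."],
    sampling,
    lora_request=LoRARequest("food-review-adapter", 1, "/adapters/food-review"),
)
print(outputs[0].outputs[0].text)
```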
Together, these projects allow customers to qualify and benchmark accelerators with the inference performance project, operationalize scale-out architectures with LLM-aware routing through the Gateway API Inference Extension, and gain scheduling efficiency and accelerator fungibility across a wide range of accelerators with DRA and vLLM. To try out these new capabilities for running gen AI inference on Kubernetes, visit the Gateway API Inference Extension, the inference performance project, or Dynamic Resource Allocation. Also, be sure to visit us at KubeCon in London this week, where we’ll be participating in the keynote as well as many other sessions. Stop by Booth S100 to say hi!
AI Summary and Description: Yes
**Summary:** The text discusses new capabilities in Kubernetes that enhance generative AI inference, particularly focusing on large language models (LLMs). These advancements include improvements in load balancing, model server performance, and dynamic resource allocation, highlighting the collaboration between major industry players like Red Hat and ByteDance.
**Detailed Description:**
The provided content outlines significant developments in Kubernetes that are tailored for optimizing generative AI inference capabilities. As Kubernetes continues to be a central platform for cloud-native applications, these enhancements are particularly relevant for professionals dealing with AI infrastructure, especially those focused on deploying large language models (LLMs). Major points include:
– **New Capabilities for Generative AI Inference:**
– Collaboration among Kubernetes, Red Hat, and ByteDance.
– Introduction of features that support model server performance and efficient load balancing for LLM inference.
– **Key Features:**
– **Gateway API Inference Extension:**
- Supports LLM-aware routing, moving beyond traditional round-robin load balancing.
– Facilitates operationalizing Parameter-Efficient Fine-Tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA).
– Allows for dynamic loading of fine-tuned models to optimize user requests.
– **Inference Performance Project:**
– Establishes a benchmarking standard for evaluating model performance on accelerators (such as GPUs and TPUs).
– Provides metrics for Horizontal Pod Autoscaling (HPA) related to inference workloads.
– **Dynamic Resource Allocation (DRA):**
– Simplifies and automates the scheduling of GPUs, TPUs, and other resources within Kubernetes.
– Enhances task scheduling efficiency for AI workloads when using the vLLM inference engine.
– **Challenges and Excitement:**
– Despite the advancements, large-scale inference challenges persist; however, these collaborative efforts are expected to lead to a more flexible and scalable AI serving infrastructure in Kubernetes.
– Noteworthy quotes from industry professionals emphasize the collaborative spirit aimed at improving the enterprise deployment of open-source LLMs.
The text encapsulates the strides being made in the Kubernetes ecosystem, advocating for enhanced infrastructure management to operationalize AI at scale. Not only do these developments offer practical insights into current trends in AI and cloud infrastructure, but they also set the stage for professionals looking to leverage Kubernetes for advanced AI applications.
In conclusion, the enhancements in Kubernetes will not only boost the overall performance of AI workloads but also provide significant advantages in terms of scalability and adaptability, crucial for enterprises eager to implement state-of-the-art AI solutions.