Source URL: https://cloud.google.com/blog/products/containers-kubernetes/understanding-new-gke-inference-capabilities/
Source: Cloud Blog
Title: New GKE inference capabilities reduce costs, tail latency and increase throughput
Feedly Summary: When it comes to AI, inference is where today’s generative AI models can solve real-world business problems. Google Kubernetes Engine (GKE) is seeing increasing adoption of gen AI inference. For example, customers like HubX run inference of image-based models to serve over 250k images/day to power gen AI experiences, and Snap runs AI inference on GKE for its ad ranking system.
However, there are challenges when deploying gen AI inference. First, during the evaluation phase, you have to weigh all your accelerator options and choose the right one for your use case. While many customers are interested in using Tensor Processing Units (TPUs), they are looking for compatibility with popular model servers. Then, once you’re in production, you need to load-balance traffic, manage price-performance with real traffic at scale, monitor performance, and debug any issues that arise.
To help, this week at Google Cloud Next, we introduced new gen AI inference capabilities for GKE:
GKE Inference Quickstart, to help you set up inference environments according to best practices
GKE TPU serving stack, to help you easily benefit from the price-performance of TPUs
GKE Inference Gateway, which introduces gen-AI-aware scaling and load balancing techniques
Together, these capabilities help reduce serving costs by over 30%, cut tail latency by 60%, and increase throughput by up to 40% compared to other managed and open-source Kubernetes offerings.
GKE Inference Quickstart
GKE Inference Quickstart helps you select and optimize the best accelerator, model server and scaling configuration for your AI/ML inference applications. It includes information about instance types, their model compatibility across GPUs and TPUs, and benchmarks for how a given accelerator can help you meet your performance goals. Then, once your accelerators are configured, GKE Inference Quickstart can help you with Kubernetes scaling, as well as new inference-specific metrics. In future releases, GKE Inference Quickstart will be available as a Gemini Cloud Assist experience.
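To make the idea concrete, here is a minimal sketch of the kind of selection Inference Quickstart automates: picking the cheapest benchmarked accelerator and model-server profile that meets a latency goal. The profile names, benchmark numbers, and helper code below are hypothetical placeholders for illustration, not real Quickstart data or APIs.

```python
# Illustrative only: a toy version of the selection GKE Inference Quickstart
# automates. Profile names and benchmark numbers are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Profile:
    accelerator: str          # e.g. a GPU or TPU machine shape
    model_server: str         # e.g. "vllm"
    p90_latency_ms: float     # benchmarked tail latency for the target model
    throughput_tps: float     # benchmarked output tokens per second
    cost_per_hour_usd: float

# Hypothetical benchmark table for a single open model.
PROFILES = [
    Profile("nvidia-l4", "vllm", 950.0, 180.0, 0.70),
    Profile("nvidia-h100-80gb", "vllm", 240.0, 1400.0, 10.0),
    Profile("tpu-v5e-4", "vllm-tpu", 310.0, 1100.0, 4.8),
]

def pick_profile(profiles, max_p90_ms):
    """Return the cheapest profile (per token) that meets the latency goal."""
    eligible = [p for p in profiles if p.p90_latency_ms <= max_p90_ms]
    if not eligible:
        raise ValueError("No benchmarked profile meets the latency target.")
    # Cost per output token = hourly cost / tokens generated per hour.
    return min(eligible, key=lambda p: p.cost_per_hour_usd / (p.throughput_tps * 3600))

best = pick_profile(PROFILES, max_p90_ms=400)
print(f"Choose {best.accelerator} with {best.model_server}")
```

In practice, Quickstart surfaces this kind of benchmarked price-performance data for you, so you don't have to run the comparison yourself.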
GKE TPU serving stack
With support for TPUs and vLLM, one of the leading open-source model servers, you get seamless portability across GPUs and TPUs. This means you can use any open model, select the vLLM:TPU container image and just deploy on GKE without any TPU-specific changes. GKE Inference Quickstart also recommends TPU best practices so you can seamlessly run on TPUs without any switching costs. For customers who want to run state-of-the-art models, Pathways, used internally at Google for large models like Gemini, allows you to run multi-host and disaggregated serving.
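Because vLLM exposes an OpenAI-compatible HTTP API, the client side looks the same whether the Pods behind your Service run on GPUs or TPUs. Below is a minimal client sketch; the Service hostname and model name are assumptions for illustration, and in practice you would point it at the DNS name of your own GKE Service or gateway.

```python
# Minimal client sketch against vLLM's OpenAI-compatible chat completions API.
# The service URL and model name below are illustrative assumptions; the
# request shape is identical whether the backend runs on GPUs or TPUs.
import requests

VLLM_URL = "http://vllm-service.default.svc.cluster.local:8000/v1/chat/completions"

resp = requests.post(
    VLLM_URL,
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed example model
        "messages": [{"role": "user", "content": "Summarize Kubernetes in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```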
GKE Inference Gateway
GKE Gateway is an abstraction, backed by a load balancer, that routes incoming requests to your Kubernetes applications. Traditionally, it has been tuned for web-serving workloads, whose requests follow very predictable patterns, using load-balancing techniques such as round-robin. LLM requests, by contrast, vary widely in size and processing cost. This can result in high tail latencies and uneven compute utilization, which can degrade the end-user experience and unnecessarily increase inference costs. In addition, the traditional Gateway does not provide routing infrastructure for popular Parameter-Efficient Fine-Tuning (PEFT) techniques like Low-Rank Adaptation (LoRA), which can increase GPU efficiency by reusing a base model across fine-tuned variants during inference.
For scale-out scenarios, the new GKE Inference Gateway provides gen-AI-aware load balancing for optimal routing. With GKE Inference Gateway, you can define routing rules for safe rollouts, cross-regional preferences, and performance goals such as priority. Finally, GKE Inference Gateway supports LoRA, which lets you map multiple models to the same underlying service for better efficiency.
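To illustrate the concept (and only the concept; this is not GKE Inference Gateway's actual algorithm), here is a small sketch of gen-AI-aware routing: instead of round-robin, prefer replicas that already have the requested LoRA adapter loaded and that report the least pending work. All field names and numbers are illustrative assumptions.

```python
# Conceptual sketch of gen-AI-aware routing, not the Gateway's real implementation.
# Idea: prefer replicas with the requested LoRA adapter already loaded, then pick
# the one with the least pending work (queue depth and KV-cache pressure).
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    queue_depth: int              # requests waiting on this model server
    kv_cache_util: float          # 0.0 - 1.0, as reported by the server
    loaded_adapters: set = field(default_factory=set)

def route(replicas, lora_adapter=None):
    """Pick a replica: adapter affinity first, then least load."""
    candidates = replicas
    if lora_adapter:
        with_adapter = [r for r in replicas if lora_adapter in r.loaded_adapters]
        if with_adapter:
            candidates = with_adapter  # avoid an adapter load on the hot path
    # Blend queue depth and KV-cache utilization into a single load score.
    return min(candidates, key=lambda r: r.queue_depth + 10 * r.kv_cache_util)

pool = [
    Replica("pod-a", queue_depth=4, kv_cache_util=0.85, loaded_adapters={"support-bot"}),
    Replica("pod-b", queue_depth=1, kv_cache_util=0.30, loaded_adapters={"support-bot", "summarizer"}),
    Replica("pod-c", queue_depth=0, kv_cache_util=0.10),
]
print(route(pool, lora_adapter="summarizer").name)  # -> pod-b
```

Round-robin would have sent the "summarizer" request to whichever replica was next in rotation, potentially forcing an adapter load or queuing behind a long generation; load- and adapter-aware routing is what keeps tail latency down.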
To summarize, the visual below shows the needs of the customers during the different stages of the AI inference journey, and how GKE Inference Quickstart, GKE TPU serving stack and GKE Inference Gateway help simplify the evaluation, onboarding and production phases.
What our customers are saying
“Using TPUs on GKE, especially the newer Trillium for inference, particularly for image generation, has reduced latency by up to 66%, leading to a better user experience and increased conversion rates. Users get responses in under 10 seconds instead of waiting up to 30 seconds. This is crucial for user engagement and retention.” – Cem Ortabas, Co-founder, HubX
“Optimizing price-performance for generative AI inference is key for our customers. We are excited to see GKE Inference Gateway with its optimized load balancing and extensibility in open-source. The new GKE Inference Gateway capabilities could help us further improve performance for our customers’ inference workloads.” – Chaoyu Yang, CEO & Founder, BentoML
GKE’s new inference capabilities give you a powerful set of tools to take the next step with AI. To learn more, join our GKE gen AI inference breakout session at Next 25, and hear how Snap re-architected its inference platform.
AI Summary and Description: Yes
Summary: The text discusses the advancements in generative AI inference capabilities introduced for Google Kubernetes Engine (GKE), emphasizing innovations such as the GKE Inference Quickstart, TPU serving stack, and Inference Gateway. These improvements promise significant reductions in serving costs and latency, making them highly relevant for AI and cloud computing professionals looking to optimize their infrastructure for AI workloads.
Detailed Description:
The provided text outlines new features and enhancements in Google Kubernetes Engine (GKE) that cater specifically to generative AI inference. As businesses increasingly adopt generative AI, the text highlights how GKE can optimize performance and efficiency in the inference phase of AI deployments.
– **Key Innovations**:
– **GKE Inference Quickstart**: This tool provides best practice guidelines for setting up inference environments, including advice on accelerator selection, model compatibility (GPUs and TPUs), and scaling configurations.
– **GKE TPU Serving Stack**: Supports seamless deployment using TPUs with minimal changes, enhancing portability and efficiency for AI models. Recommendations for TPUs are provided, allowing users to optimize their configurations without costs associated with switching.
– **GKE Inference Gateway**: Introduces advanced load-balancing techniques tailored for the high variability seen in large language models (LLMs), helping reduce latency and improve efficiency. It also allows users to route requests based on specific performance goals and supports advanced model fine-tuning techniques (e.g., LoRA).
– **Business Impact**:
– Reduction in serving costs by over 30%.
– Decrease in tail latency by 60%.
– Increase in throughput by up to 40% compared to other solutions.
– **Customer Testimonials**:
– Users have reported significant improvements in latency and user experience, with one company seeing latency reductions of up to 66%, leading to higher engagement and conversion rates.
– Another CEO expressed optimism regarding the enhanced load balancing and performance optimization offered by the GKE Inference Gateway.
The overall implication of these developments is that the new features can substantially enhance the operational efficiency of businesses leveraging generative AI. Professionals in cloud infrastructure, AI, and related fields should consider these tools to improve their deployment and operational strategies.