Source URL: https://cloud.google.com/blog/products/ai-machine-learning/ai-hypercomputer-4-use-cases-tutorials-and-guides/
Source: Cloud Blog
Title: Guide: Our top four AI Hypercomputer use cases, reference architectures and tutorials
Feedly Summary: AI Hypercomputer is a fully integrated supercomputing architecture for AI workloads – and it’s easier to use than you think. In this blog, we break down four common use cases, including reference architectures and tutorials, representing just a few of the many ways you can use AI Hypercomputer today.
Short on time? Here’s a quick summary.
Affordable inference. JAX, Google Kubernetes Engine (GKE) and NVIDIA Triton Inference Server are a winning combination, especially when you pair them with Spot VMs for up to 90% cost savings. We have several tutorials, like this one on how to serve LLMs like Llama 3.1 405B on GKE.
Large and ultra-low latency training clusters. Hypercompute Cluster gives you physically co-located accelerators, targeted workload placement, advanced maintenance controls to minimize workload disruption, and topology-aware scheduling. You can get started by creating a cluster with GKE or try this pretraining NVIDIA GPU recipe.
High-reliability inference. Pair new cloud load balancing capabilities like custom metrics and service extensions with GKE Autopilot, which includes features like node auto-repair to automatically replace unhealthy nodes, and horizontal pod autoscaling to adjust resources based on application demand.
Easy cluster setup. The open-source Cluster Toolkit offers pre-built blueprints and modules for rapid, repeatable cluster deployments. You can get started with one of our AI/ML blueprints.
If you want to see a broader set of reference implementations, benchmarks and recipes, go to the AI Hypercomputer GitHub.
Why it matters
Deploying and managing AI applications is tough. You need to choose the right infrastructure, control costs, and reduce delivery bottlenecks. AI Hypercomputer helps you deploy AI applications quickly, easily, and more efficiently than just buying the raw hardware and chips.
Take Moloco, for example. Using the AI Hypercomputer architecture, Moloco achieved 10x faster model training times and reduced costs by 2-4x.
Let’s dive deeper into each use case.
1. Reliable AI inference
According to Futurum, in 2023 Google had roughly 3x fewer outage hours than Azure and roughly 3x fewer than AWS. Those numbers fluctuate over time, but maintaining high availability is a challenge for everyone. The AI Hypercomputer architecture offers fully integrated capabilities for high-reliability inference.
Many customers start with GKE Autopilot because of its 99.95% pod-level uptime SLA. Autopilot enhances reliability by automatically managing nodes (provisioning, scaling, upgrades, repairs) and applying security best practices, freeing you from manual infrastructure tasks. This automation, combined with resource optimization and integrated monitoring, minimizes downtime and helps your applications run smoothly and securely.
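Pod-level autoscaling is something you configure on top of Autopilot for your own workloads. Below is a minimal sketch, using the official Kubernetes Python client, of attaching a HorizontalPodAutoscaler to an inference Deployment; the Deployment name `jetstream-server`, the replica bounds, and the 60% CPU target are illustrative assumptions rather than values from this post.

```python
# Minimal sketch: attach a HorizontalPodAutoscaler to an inference Deployment
# on an Autopilot cluster with the official `kubernetes` Python client.
# The Deployment name "jetstream-server" and the 60% CPU target are placeholders.
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl is already pointed at the Autopilot cluster

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="jetstream-server-hpa"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="jetstream-server"
        ),
        min_replicas=2,
        max_replicas=10,
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(type="Utilization", average_utilization=60),
                ),
            )
        ],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```

Autopilot then provisions or repairs the underlying nodes as the autoscaler adds and removes Pods, so the infrastructure side stays hands-off.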
There are several configurations available, but in this reference architecture we use TPUs with the JetStream Engine to accelerate inference, plus JAX, GCS FUSE, and high-performance block storage (like Hyperdisk ML) to speed up the loading of model weights. Two notable additions to the stack get us to high reliability: Service Extensions and custom metrics.
Service extensions allow you to customize the behavior of Cloud Load Balancer by inserting your own code (written as plugins) into the data path, enabling advanced traffic management and manipulation.
Custom metrics, utilizing the Open Request Cost Aggregation (ORCA) protocol, allow applications to send workload-specific performance data (like model serving latency) to Cloud Load Balancer, which then uses this information to make intelligent routing and scaling decisions.
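To make the custom-metrics idea concrete, here is a minimal sketch of a model server attaching an ORCA-style load report to each response. The `endpoint-load-metrics` header and the `named_metrics.model_latency` field follow the pattern described in the Cloud Load Balancing custom-metrics documentation, but treat the exact header name, report format, and the Flask endpoint shown here as assumptions to verify against the current docs.

```python
# Minimal sketch of a model server reporting a workload-specific load signal
# (per-request model latency) to the load balancer via an ORCA-style HTTP
# response header. Header name and metric name are assumptions to verify.
import time
from flask import Flask, jsonify, make_response

app = Flask(__name__)

def run_model() -> str:
    return "hello"  # stand-in for the actual JetStream/JAX inference call

@app.route("/v1/generate", methods=["POST"])
def generate():
    start = time.monotonic()
    text = run_model()
    latency_s = time.monotonic() - start

    resp = make_response(jsonify({"output": text}))
    # ORCA text-format load report; the load balancer can route and scale on this value.
    resp.headers["endpoint-load-metrics"] = f"TEXT named_metrics.model_latency={latency_s:.3f}"
    return resp

if __name__ == "__main__":
    app.run(port=8080)
```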
Try it yourself. Start by defining your custom load balancing metrics, create a plugin using Service Extensions, or spin up a fully managed Kubernetes cluster with Autopilot. For more ideas, check out this blog on the latest networking enhancements for generative AI applications.
2. Large scale AI training
Training large AI models demands massive, efficiently scaled compute. Hypercompute Cluster is a supercomputing solution built on AI Hypercomputer that lets you deploy and manage a large number of accelerators as a single unit, using a single API call. Here are a few things that set Hypercompute Cluster apart:
Clusters are densely physically co-located for ultra-low-latency networking. They come with pre-configured and validated templates for reliable and repeatable deployments, and with cluster-level observability, health monitoring, and diagnostic tooling.
To simplify management, Hypercompute Clusters are designed to integrate with orchestrators like GKE and Slurm, and are deployed via the Cluster Toolkit. GKE provides support for over 50,000 TPU chips to train a single ML model.
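As a sketch of what that scale looks like from the training job's point of view, the snippet below shows how a multi-host JAX process launched by the orchestrator sees every chip in the slice as a single device pool. The calls are standard JAX; how the coordinator address and host ranks get set is assumed to be handled by GKE or Slurm.

```python
# Minimal sketch: a multi-host JAX training process on a GKE-managed TPU slice.
# jax.distributed.initialize() auto-detects the slice topology on Cloud TPU.
import jax

jax.distributed.initialize()

print(f"process {jax.process_index()} of {jax.process_count()}")
print(f"local devices:  {jax.local_device_count()}")
print(f"global devices: {jax.device_count()}")  # every chip participating in this run

# Data-parallel training can then be expressed once, e.g. with jax.pmap or
# jax.sharding, and sharded across all chips regardless of how many hosts
# the orchestrator schedules.
```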
In this reference architecture, we use GKE Autopilot and A3 Ultra VMs.
GKE supports up to 65,000 nodes per cluster, which we believe is more than 10x the scale of the other two largest public cloud providers.
A3 Ultra VMs use NVIDIA H200 GPUs with twice the GPU-to-GPU network bandwidth and twice the high-bandwidth memory (HBM) of A3 Mega VMs with NVIDIA H100 GPUs. They are built with our new Titanium ML network adapter and incorporate NVIDIA ConnectX-7 network interface cards (NICs) to deliver a secure, high-performance cloud experience, ideal for large multi-node GPU workloads.
Try it yourself: Create a Hypercompute Cluster with GKE or try this pretraining NVIDIA GPU recipe.
3. Affordable AI inference
Serving AI, especially large language models (LLMs), can become prohibitively expensive. AI Hypercomputer combines open software, flexible consumption models and a wide range of specialized hardware to minimize costs.
Cost savings are everywhere, if you know where to look. Beyond the tutorials, there are two cost-efficient deployment models you should know. GKE Autopilot reduces the cost of running containers by up to 40% compared to standard GKE by automatically scaling resources based on actual needs, while Spot VMs can save up to 90% on batch or fault-tolerant jobs. You can combine the two to save even more — “Spot Pods” are available in GKE Autopilot to do just that.
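Here is a minimal sketch of requesting Spot Pods from Autopilot with the Kubernetes Python client. The `cloud.google.com/gke-spot` node selector is the documented way to ask Autopilot for Spot capacity; the image, resource requests, and workload name are placeholders.

```python
# Minimal sketch: run a fault-tolerant workload on Spot capacity in a GKE
# Autopilot cluster by setting the cloud.google.com/gke-spot node selector.
# Image, resource requests, and names are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()

container = client.V1Container(
    name="batch-inference",
    image="us-docker.pkg.dev/my-project/serving/triton:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(requests={"cpu": "4", "memory": "16Gi"}),
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="batch-inference"),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "batch-inference"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "batch-inference"}),
            spec=client.V1PodSpec(
                containers=[container],
                # Autopilot schedules these Pods onto Spot capacity at the discounted rate.
                node_selector={"cloud.google.com/gke-spot": "true"},
                # Keep shutdown fast: Spot capacity can be reclaimed with little notice.
                termination_grace_period_seconds=25,
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```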
In this reference architecture, after training with JAX, we convert the model to NVIDIA's FasterTransformer format for inference. Optimized models are served with NVIDIA Triton Inference Server on GKE Autopilot. Triton's multi-model support allows for easy adaptation to evolving model architectures, and a pre-built NeMo container simplifies setup.
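As a rough illustration of the serving side, the snippet below sends a request to a Triton endpoint using the `tritonclient` Python package. The model name, tensor names, and endpoint are placeholders that depend on your model's Triton configuration, not values from this post.

```python
# Minimal sketch: call a Triton Inference Server endpoint over HTTP.
# Model name "llm_ft" and tensor names depend on the model's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

triton = httpclient.InferenceServerClient(url="localhost:8000")

token_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int32)

inp = httpclient.InferInput("input_ids", list(token_ids.shape), "INT32")
inp.set_data_from_numpy(token_ids)

result = triton.infer(
    model_name="llm_ft",
    inputs=[inp],
    outputs=[httpclient.InferRequestedOutput("output_ids")],
)
print(result.as_numpy("output_ids"))
```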
Try it yourself: Start by learning how to serve a model with a single NVIDIA GPU in GKE. You can also serve Gemma open models with Hugging Face TGI, or LLMs like DeepSeek-R1 671B and Llama 3.1 405B.
4. Easy cluster setup and deployment
You need tools that simplify, not complicate, your infrastructure setup. The open-source Cluster Toolkit offers pre-built blueprints and modules for rapid, repeatable cluster deployments. You get easy integration with JAX, PyTorch, and Keras. Platform teams get simplified management with Slurm, GKE, and Google Batch, plus flexible consumption models like Dynamic Workload Scheduler and a wide range of hardware options. In this reference architecture, we set up an A3 Ultra cluster with Slurm.
Try it yourself. You can select one of our easy-to-use AI/ML blueprints, available through our GitHub repo, and use it to set up a cluster. We also offer a variety of resources to help you get started, including documentation, quickstarts, and videos.
AI Summary and Description: Yes
Summary: The text discusses AI Hypercomputer, an integrated supercomputing architecture designed for efficient deployment and management of AI workloads. It addresses various use cases, highlighting cost-effective solutions, reliable inference, and straightforward cluster setups, making it particularly relevant for professionals in AI, cloud, and infrastructure security.
Detailed Description: The text outlines the features and advantages of AI Hypercomputer, aiming to streamline the deployment and management of AI applications. Key points include:
– **Integrated Supercomputing Architecture**: AI Hypercomputer combines software and hardware innovations to support AI workloads, enhancing usability and efficiency.
– **Cost-effective Inference**: The architecture leverages JAX, Google Kubernetes Engine (GKE), and NVIDIA Triton Inference Server, which can significantly reduce costs when combined with Spot VMs.
– **High-performance Training Clusters**: Hypercompute Clusters are designed for ultra-low latency and reliability in training large AI models, providing targeted workload placement and advanced maintenance controls.
– **Reliable Inference Capabilities**: With enhanced load balancing and GKE Autopilot’s features, the architecture ensures high availability and automatic management of infrastructure to minimize downtime.
– **Simplified Cluster Setup**: The Cluster Toolkit provides pre-built templates and blueprints for quick and efficient cluster deployments, compatible with various ML frameworks.
– **Use Case Examples**: Companies like Moloco have reported substantial improvements in training times and cost reductions when utilizing the AI Hypercomputer architecture.
Key Advantages Include:
– Improved reliability with a 99.95% uptime SLA for GKE Autopilot.
– Significant cost savings through intelligent scaling and the use of Spot VMs.
– Pre-validated configurations for easier deployment and integration.
– Integration of advanced networking solutions for large workloads.
In conclusion, AI Hypercomputer addresses critical challenges in deploying and managing AI applications by facilitating cost efficiency and high performance, making it essential for security and compliance professionals to consider in their operational strategies.