Cloud Blog: Blackwell is here — new A4 VMs powered by NVIDIA B200 now in preview

Source URL: https://cloud.google.com/blog/products/compute/introducing-a4-vms-powered-by-nvidia-b200-gpu-aka-blackwell/
Source: Cloud Blog
Title: Blackwell is here — new A4 VMs powered by NVIDIA B200 now in preview

Feedly Summary: Modern AI workloads require powerful accelerators and high-speed interconnects to run sophisticated model architectures across an ever-growing range of model sizes and modalities. In addition to large-scale training, these complex models need the latest high-performance computing solutions for fine-tuning and inference.

Today, we’re excited to bring the highly anticipated NVIDIA Blackwell GPUs to Google Cloud with the preview of A4 VMs, powered by NVIDIA HGX B200. The A4 VM features eight Blackwell GPUs interconnected by fifth-generation NVIDIA NVLink, and offers a significant performance boost over the previous-generation A3 High VM. Each GPU delivers 2.25 times the peak compute and 2.25 times the HBM capacity, making A4 VMs a versatile option for training and fine-tuning a wide range of model architectures, while the increased compute and HBM capacity also make them well-suited for low-latency serving.

The A4 VM integrates Google’s infrastructure innovations with Blackwell GPUs to bring the best cloud experience to Google Cloud customers, from scale and performance to ease of use and cost optimization. Some of these innovations include:

- Enhanced networking: A4 VMs are built on servers with our Titanium ML network adapter, optimized to deliver a secure, high-performance cloud experience for AI workloads, building on NVIDIA ConnectX-7 network interface cards (NICs). Combined with our datacenter-wide 4-way rail-aligned network, A4 VMs deliver non-blocking 3.2 Tbps of GPU-to-GPU traffic with RDMA over Converged Ethernet (RoCE). Customers can scale to tens of thousands of GPUs with our Jupiter network fabric, which offers 13 Petabits/sec of bisectional bandwidth.
- Google Kubernetes Engine: With support for up to 65,000 nodes per cluster, GKE is the most scalable and fully automated Kubernetes service for customers to implement a robust, production-ready AI platform. Out of the box, A4 VMs are natively integrated with GKE. Integrating with other Google Cloud services, GKE facilitates a robust environment for the data processing and distributed computing that underpin AI workloads.
- Vertex AI: A4 VMs will be accessible through Vertex AI, our fully managed, unified AI development platform for building and using generative AI, powered by the AI Hypercomputer architecture under the hood.
- Open software: In addition to PyTorch and CUDA, we work closely with NVIDIA to optimize JAX and XLA, enabling the overlap of collective communication and computation on GPUs. We have also added optimized model configurations and example scripts for GPUs with XLA flags enabled.
- Hypercompute Cluster: Our new highly scalable clustering system streamlines infrastructure and workload provisioning, as well as ongoing operations of AI supercomputers, with tight GKE and Slurm integration.
- Multiple consumption models: In addition to the on-demand, committed-use discount, and Spot consumption models, we reimagined cloud consumption for the unique needs of AI workloads with Dynamic Workload Scheduler, which offers two modes for different workloads: Flex Start mode for enhanced obtainability and better economics, and Calendar mode for predictable job start times and durations.

Hudson River Trading, a multi-asset-class quantitative trading firm, will leverage A4 VMs to train its next generation of capital-market models. The A4 VM, with its enhanced inter-GPU connectivity and high-bandwidth memory, is ideal for the demands of larger datasets and sophisticated algorithms, accelerating Hudson River Trading’s ability to react to the market.

“We’re excited to leverage A4, powered by NVIDIA’s Blackwell B200 GPUs. Running our workload on cutting edge AI Infrastructure is essential for enabling low-latency trading decisions and enhancing our models across markets. We’re looking forward to leveraging the innovations in Hypercompute Cluster to accelerate deployment of training our latest models that deliver quant-based algorithmic trading.” – Iain Dunning, Head of AI Lab, Hudson River Trading

“NVIDIA and Google Cloud have a long-standing partnership to bring our most advanced GPU-accelerated AI infrastructure to customers. The Blackwell architecture represents a giant step forward for the AI industry, so we’re excited that the B200 GPU is now available with the new A4 VM. We look forward to seeing how customers build on the new Google Cloud offering to accelerate their AI mission.” – Ian Buck, Vice-President and General Manager of Hyperscale and HPC, NVIDIA


Better together: A4 VMs and Hypercompute Cluster

Effectively scaling AI model training requires precise and scalable orchestration of infrastructure resources. These workloads often stretch across thousands of VMs, pushing the limits of compute, storage, and networking.

Hypercompute Cluster enables you to deploy and manage these large clusters of A4 VMs, with compute, storage, and networking as a single unit. This makes it easy to manage complexity while delivering exceptionally high performance and resilience for large distributed workloads. Hypercompute Cluster is engineered to:

- Deliver high performance through co-location of densely packed A4 VMs, enabling optimal workload placement
- Optimize resource scheduling and workload performance with GKE and Slurm, packed with intelligent features like topology-aware scheduling
- Increase reliability with built-in self-healing capabilities, proactive health checks, and automated recovery from failures
- Enhance observability and monitoring for timely and customized insights
- Automate provisioning, configuration, and scaling, integrated with GKE and Slurm

We’re excited to be the first hyperscaler to announce preview availability of an NVIDIA Blackwell B200-based offering. Together, A4 VMs and Hypercompute Cluster make it easier for organizations to create and deliver AI solutions across all industries. If you’re interested in learning more, please reach out to your Google Cloud representative.

AI Summary and Description: Yes

Summary: The text discusses the introduction of NVIDIA Blackwell GPUs in Google Cloud’s A4 VMs, emphasizing their capabilities for modern AI workloads, especially in terms of performance, scalability, and integration with various services. This has significant implications for professionals in AI, cloud computing, and infrastructure security.

Detailed Description:

The text revolves around the launch of NVIDIA Blackwell GPUs with Google Cloud’s A4 VMs, which are designed to cater to the demands of modern AI workloads. Here are the major points highlighted:

– **Powerful Accelerators for AI Workloads**: The need for advanced hardware to handle diverse AI models is emphasized, marking the importance of efficient computing power in AI training and inference.

– **Overview of A4 VMs**: The A4 VM features:
  – Eight Blackwell GPUs interconnected via fifth-generation NVIDIA NVLink.
  – 2.25 times the peak compute per GPU of the previous-generation A3 High VM.
  – 2.25 times the High Bandwidth Memory (HBM) capacity per GPU, which is crucial for handling larger datasets.
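The 2.25× per-GPU figures can be sanity-checked with simple arithmetic. The sketch below assumes the A3 High baseline is an NVIDIA H100 with 80 GB of HBM and roughly 989 TFLOPS of dense BF16 peak compute; these baseline numbers come from public spec sheets, not from the post itself:

```python
# Back-of-the-envelope check of the "2.25x per GPU" claims.
# Assumed baseline (NVIDIA H100 in the A3 High VM); these figures are
# public spec-sheet values, not taken from the blog post.
H100_HBM_GB = 80          # HBM capacity per H100 GPU
H100_BF16_TFLOPS = 989    # dense BF16 peak per H100 GPU

SCALE = 2.25              # per-GPU improvement quoted in the post
GPUS_PER_VM = 8           # Blackwell GPUs per A4 VM

b200_hbm_gb = H100_HBM_GB * SCALE
b200_bf16_tflops = H100_BF16_TFLOPS * SCALE

print(f"Implied B200 HBM per GPU: {b200_hbm_gb:.0f} GB")          # 180 GB
print(f"Implied B200 BF16 peak:   {b200_bf16_tflops:.0f} TFLOPS")
print(f"Implied HBM per A4 VM:    {b200_hbm_gb * GPUS_PER_VM:.0f} GB")  # 1440 GB
```

The implied 180 GB of HBM per GPU (1,440 GB per eight-GPU VM) is what makes the "larger datasets" point above concrete.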

– **Enhanced Networking Capabilities**:
– A4 VMs utilize the Titanium ML network adapter for high-performance AI workloads.
– They deliver non-blocking GPU-to-GPU traffic at 3.2 Tbps using RDMA over Converged Ethernet (RoCE).
– The system scales to tens of thousands of GPUs over the Jupiter network fabric, which provides 13 Petabits/sec of bisectional bandwidth.
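The 3.2 Tbps figure is consistent with a rail-aligned design in which each GPU has its own NIC. A rough sketch, assuming one 400 Gbps ConnectX-7 NIC per GPU (the per-GPU NIC count and link speed are inferences, not stated in the summary):

```python
# Rough consistency check of the quoted 3.2 Tbps per-VM GPU-to-GPU bandwidth.
# Assumption (not stated in the post): one 400 Gbps ConnectX-7 NIC per GPU.
NIC_GBPS = 400      # ConnectX-7 supports up to 400 Gbps per port
GPUS_PER_VM = 8     # Blackwell GPUs per A4 VM

total_tbps = NIC_GBPS * GPUS_PER_VM / 1000
print(f"Aggregate per-VM network bandwidth: {total_tbps} Tbps")  # 3.2 Tbps
```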

– **Integration with Kubernetes and AI Platforms**:
– Google Kubernetes Engine (GKE) integration provides a robust framework for deploying AI models across a scalable infrastructure.
– Vertex AI offers a managed platform, simplifying generative AI model deployment.

– **Cloud Consumption Models**:
– Introduction of flexible consumption options tailored to AI workloads, including Dynamic Workload Scheduler, which offers Flex Start mode for better obtainability and economics, and Calendar mode for predictable job start times and durations.

– **Real-World Application**:
– Hudson River Trading’s endorsement highlights how the A4 VM will accelerate their quantitative trading model development, demonstrating practical applications for financial service providers.

– **Collaborative Innovation**:
– The partnership between NVIDIA and Google Cloud is stressed, indicating a combined effort to push the limits of AI infrastructure.

– **Hypercompute Cluster**:
– A pivotal feature that consolidates the management of large clusters for AI workloads, ensuring high performance, resource optimization, and resilience.
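Topology-aware scheduling, one of the Hypercompute Cluster features mentioned above, means placing a job's VMs close together in the network so collective traffic crosses as few switch layers as possible. The toy placer below is purely illustrative (it is not GKE or Slurm internals, and the block names are invented): it prefers a single block that can hold the whole job before spilling across blocks.

```python
# Toy illustration of topology-aware placement (not GKE/Slurm internals):
# prefer placing all of a job's VMs in one network block so collective
# traffic stays within that block's rail-aligned network.
def place_job(job_vms: int, free_per_block: dict[str, int]) -> list[str]:
    """Return a block assignment for each VM, preferring a single block."""
    # First choice: the smallest block that can hold the whole job.
    for block, free in sorted(free_per_block.items(), key=lambda kv: kv[1]):
        if free >= job_vms:
            return [block] * job_vms
    # Fallback: spill across blocks, largest free capacity first,
    # to minimize the number of blocks the job touches.
    placement: list[str] = []
    for block, free in sorted(free_per_block.items(), key=lambda kv: -kv[1]):
        take = min(free, job_vms - len(placement))
        placement += [block] * take
        if len(placement) == job_vms:
            break
    return placement

# A 6-VM job fits entirely in block "b2", so no cross-block traffic.
print(place_job(6, {"b1": 4, "b2": 8, "b3": 2}))
```

Real schedulers weigh many more constraints (maintenance, failures, fragmentation), but the core idea of packing a job into the fewest topology domains is the same.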

In summary, the advancements with NVIDIA Blackwell GPUs and A4 VMs provide critical enhancements for AI infrastructure. For professionals in AI, cloud computing, and infrastructure security, this launch signals new opportunities for performance improvements, operational efficiencies, and innovative capabilities that can reshape AI deployment strategies.