Source URL: https://cloud.google.com/blog/products/compute/announcing-smaller-machine-types-for-a3-high-vms/
Source: Cloud Blog
Title: Announcing smaller machine types for A3 High VMs
Feedly Summary: Today, an increasing number of organizations are using GPUs to run inference¹ on their AI/ML models. Since the number of GPUs needed to serve a single inference workload varies, organizations need more granularity in the number of GPUs in their virtual machines (VMs) to keep costs low while scaling with user demand.
A3 High VMs powered by NVIDIA H100 80GB GPUs are now generally available in machine types with 1 (new), 2 (new), 4 (new), and 8 GPUs.
Accessing smaller H100 machine types
All A3 machine types are available through the fully managed Vertex AI, as nodes through Google Kubernetes Engine (GKE), and as VMs through Google Compute Engine.
The 1, 2, and 4 A3 High GPU machine types are available as Spot VMs and through Dynamic Workload Scheduler (DWS) Flex Start mode.
A3 VM portfolio powered by NVIDIA H100 GPUs, by machine type (GPU count, total GPU memory) and where each can run:

- a3-highgpu-1g NEW (1 GPU, 80 GB): Vertex AI Model Garden and Online Prediction (Spot); GKE and Compute Engine: Spot, DWS Flex Start mode
- a3-highgpu-2g NEW (2 GPUs, 160 GB): Vertex AI Model Garden and Online Prediction (On-demandᵃ, Spot); GKE and Compute Engine: Spot, DWS Flex Start mode
- a3-highgpu-4g NEW (4 GPUs, 320 GB): Vertex AI Model Garden and Online Prediction (On-demandᵃ, Spot); GKE and Compute Engine: Spot, DWS Flex Start mode
- a3-highgpu-8g (8 GPUs, 640 GB): Vertex AI Online Prediction (On-demand, Spot) and Vertex AI Training (On-demand, Spot, DWS Flex Start mode); GKE and Compute Engine: On-demand, Spot, DWS Flex Start mode, DWS Calendar mode
- a3-megagpu-8g (8 GPUs, 640 GB): GKE and Compute Engine: On-demand, Spot, DWS Flex Start mode, DWS Calendar mode

ᵃ Available only through Model Garden owned capacity.
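To illustrate the granularity the portfolio above provides, here is a small sketch (a hypothetical helper, not a Google Cloud API) that picks the smallest A3 High machine type whose aggregate GPU memory fits a model's serving footprint. The machine names and memory sizes come from the table; the sizing logic itself is a simplification, since real capacity planning also accounts for KV cache, batch size, and framework overhead.

```python
# Hypothetical helper: pick the smallest A3 High machine type whose
# total GPU memory covers a model's serving footprint.
# Names and memory sizes are taken from the portfolio table above.
A3_HIGH_TYPES = [
    ("a3-highgpu-1g", 1, 80),   # 1 GPU,  80 GB total
    ("a3-highgpu-2g", 2, 160),  # 2 GPUs, 160 GB total
    ("a3-highgpu-4g", 4, 320),  # 4 GPUs, 320 GB total
    ("a3-highgpu-8g", 8, 640),  # 8 GPUs, 640 GB total
]

def smallest_fit(required_gb: float) -> str:
    """Return the smallest machine type with enough aggregate GPU memory."""
    for name, _gpu_count, mem_gb in A3_HIGH_TYPES:
        if mem_gb >= required_gb:
            return name
    raise ValueError(f"{required_gb} GB exceeds a single a3-highgpu-8g VM")

print(smallest_fit(70))   # fits on one GPU  -> a3-highgpu-1g
print(smallest_fit(150))  # needs two GPUs   -> a3-highgpu-2g
```

Before the smaller machine types, a 70 GB serving footprint would still have required an 8-GPU VM; with the new types it maps to a single-GPU VM.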
Google Kubernetes Engine
For almost a decade, GKE has been the platform of choice for running web applications and microservices, and it now provides a cost-efficient, highly scalable, and open platform for training and serving AI workloads. GKE Autopilot reduces operational cost and offers workload-level SLAs, making it a strong choice for inference workloads: bring your workload and let Google do the rest. You can use the 1, 2, and 4 A3 High GPU machine types through both GKE Standard and GKE Autopilot modes of operation.
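In Autopilot mode there is no node pool to create; the pod spec itself requests the accelerator. The sketch below builds such a manifest as a plain dict. The `cloud.google.com/gke-accelerator` and `cloud.google.com/gke-spot` node-selector labels follow GKE's documented conventions; the pod name and image are placeholders.

```python
import json

def gpu_pod_manifest(name: str, image: str, gpus: int = 1, spot: bool = True) -> dict:
    """Build a minimal GKE Autopilot pod spec requesting H100 GPUs."""
    node_selector = {"cloud.google.com/gke-accelerator": "nvidia-h100-80gb"}
    if spot:
        # Ask Autopilot to place the pod on Spot capacity.
        node_selector["cloud.google.com/gke-spot"] = "true"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": node_selector,
            "containers": [{
                "name": "inference",
                "image": image,  # placeholder: your serving image
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }

# Dump as JSON (valid YAML) for kubectl apply -f -
print(json.dumps(gpu_pod_manifest("llm-server", "IMAGE_URL", gpus=1), indent=2))
```

With the single-GPU machine type, `gpus=1` is enough for a small serving replica, rather than rounding up to an 8-GPU node.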
Below are two examples of creating node pools in your GKE cluster with a3-highgpu-1g machine type using Spot VMs and Dynamic Workload Scheduler Flex Start mode.
Using Spot VMs with GKE
Here’s how to request and deploy an a3-highgpu-1g Spot VM on GKE using the gcloud CLI.
```
gcloud container node-pools create NODEPOOL_NAME \
    --cluster CLUSTER_NAME \
    --region CLUSTER_REGION \
    --node-locations GPU_ZONE1,GPU_ZONE2 \
    --machine-type a3-highgpu-1g \
    --accelerator type=nvidia-h100-80gb,count=1,gpu-driver-version=latest \
    --image-type COS_CONTAINERD \
    --spot
```
Using Dynamic Workload Scheduler Flex Start mode with GKE
Here’s how to request a3-highgpu-1g using Dynamic Workload Scheduler Flex Start mode with GKE.
```
gcloud beta container node-pools create NODEPOOL_NAME \
    --cluster CLUSTER_NAME \
    --region CLUSTER_REGION \
    --node-locations GPU_ZONE1,GPU_ZONE2 \
    --enable-queued-provisioning \
    --machine-type=a3-highgpu-1g \
    --accelerator type=nvidia-h100-80gb,count=1,gpu-driver-version=latest \
    --enable-autoscaling \
    --num-nodes=0 \
    --total-max-nodes TOTAL_MAX_NODES \
    --location-policy=ANY \
    --reservation-affinity=none \
    --no-enable-autorepair
```
This creates a GKE node pool with Dynamic Workload Scheduler enabled that initially contains zero nodes. You can then run your workloads with Dynamic Workload Scheduler.
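With queued provisioning, a workload obtains capacity by referencing a ProvisioningRequest rather than scheduling directly. As a rough sketch based on GKE's documented queued-provisioning annotations (the request name `my-provreq` is a placeholder, and your Job's pod template would carry these annotations alongside its normal spec):

```python
import json

def dws_job_annotations(provisioning_request: str) -> dict:
    """Pod-template annotations tying a Job to a DWS ProvisioningRequest."""
    return {
        # Consume the capacity reserved by the named ProvisioningRequest.
        "cluster-autoscaler.kubernetes.io/consume-provisioning-request": provisioning_request,
        # Provisioning class used by Flex Start (queued) provisioning.
        "cluster-autoscaler.kubernetes.io/provisioning-class-name": "queued-provisioning.gke.io",
    }

print(json.dumps(dws_job_annotations("my-provreq"), indent=2))
```

The cluster autoscaler then brings up the queued a3-highgpu-1g nodes when the requested capacity becomes available, and the annotated pods schedule onto them.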
Vertex AI
Vertex AI is Google Cloud’s fully managed, unified AI development platform for building and using predictive and generative AI. With the new 1, 2, and 4 A3 High GPU machine types, Model Garden customers can deploy hundreds of open models cost-effectively and with strong performance.
What our customers are saying
“We use Google Kubernetes Engine to run the backend for our AI-assisted software development product. Smaller A3 machine types have enabled us to reduce the latency of our real-time code assist models by 36% compared to A2 machine types, significantly improving user experience.” – Eran Dvey Aharon, VP R&D, Tabnine
Get started today
At Google Cloud, our goal is to provide you with the flexibility you need to run inference for your AI and ML models cost-effectively as well as with great performance. The availability of A3 High VMs using NVIDIA H100 80GB GPUs in smaller machine types provides you with the granularity you need to scale with user demand while keeping costs in check.
1. AI or ML inference is the process by which a trained AI model computes outputs or makes predictions for new data points or scenarios.
AI Summary and Description: Yes
Summary: The text discusses the introduction of A3 High VMs powered by NVIDIA H100 GPUs for organizations that utilize AI/ML for inference tasks. This innovation allows for enhanced granularity and cost-efficiency in virtual machine configurations. As more entities rely on GPU-based processing, the ability to fine-tune GPU resources according to workload needs becomes increasingly relevant, particularly in cloud environments.
Detailed Description:
– **Context of Use**: The text highlights the growing trend of organizations leveraging Graphics Processing Units (GPUs) for AI and Machine Learning (ML) inference. As the demand for real-time processing grows, the need for flexible, scalable, and cost-effective solutions in cloud environments becomes critical.
– **A3 High VMs Overview**:
– Powered by NVIDIA H100 with 80GB memory.
– Different machine types offered: 1, 2, 4, and 8 GPUs to meet various needs.
– Configurations available via:
– Vertex AI
– Google Kubernetes Engine (GKE)
– Google Compute Engine
– **Cost-Effective Solutions**:
– Smaller GPU machine types (1, 2, and 4) are available as Spot VMs and via Dynamic Workload Scheduler (DWS) Flex Start mode, allowing organizations to optimize costs while scaling.
– **Technical Applications**:
– GKE is recommended for running applications and microservices due to its cost efficiency and scalability for AI workloads.
– Details on how to create node pools using GKE with a3-highgpu-1g and enabling features like Spot VMs and Dynamic Workload Scheduler are included.
– **Customer Testimonials**: A quote from a user emphasizes the significant performance improvement achieved by using the smaller A3 machine types, which leads to reduced latency and better user experiences, particularly for AI-assisted applications.
– **Introduction of Vertex AI**:
– Google Cloud’s unified AI development platform offers an ecosystem to build and utilize predictive and generative AI.
– The new GPU machine types facilitate deploying numerous open models efficiently.
– **Key Takeaways for Professionals**:
– The advancement in GPU resources is integral for organizations deploying AI models, with implications for both cost management and performance enhancement.
– Professionals in cloud computing, AI security, and infrastructure should consider these offerings for improved operational workflows and customer engagement.
This content is particularly relevant within the domains of cloud computing, infrastructure, and AI/ML systems, providing actionable insights for businesses aiming to optimize their AI inference workloads.