Cloud Blog: Orchestrating GPU-based distributed training workloads on AI Hypercomputer

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/gpu-orchestration-options-on-ai-hypercomputer/
Source: Cloud Blog
Title: Orchestrating GPU-based distributed training workloads on AI Hypercomputer

Feedly Summary: When it comes to AI, large language models (LLMs) and machine learning (ML) are taking entire industries to the next level. But with larger models and datasets, developers need distributed environments that span multiple AI accelerators (e.g. GPUs and TPUs) across multiple compute hosts to train their models efficiently. This can lead to orchestration, resource management, and scalability challenges.
We’re here to help. At Google Cloud, we provide a robust suite of GPU and TPU resources alongside advanced orchestration tools as part of the AI Hypercomputer architecture to simplify distributed, large-scale training. In this blog, we’ll guide you through the orchestration tools available for GPU accelerators on Google Cloud that can help you streamline and scale your machine learning workflows.


Choose the right accelerator family
A key element of distributed training lies in selecting the right GPU. Google Cloud’s specialized machine families offer tailored solutions for varying needs of performance and cost efficiency. The A3 machine series, featuring NVIDIA H100 and NVIDIA H200 (upcoming) GPUs, delivers strong GPU-to-GPU bandwidth that’s a great fit for large-scale training workloads. In contrast, the A2 machine series with NVIDIA A100 GPUs is designed for scenarios that require minimal inter-node communication such as streamlined, single-node training. Additionally, the versatile G2 machine family, equipped with NVIDIA L4 GPUs, provides the flexibility necessary for inference and testing workloads.
We also offer multiple GPU consumption models to meet the needs of large-scale training:

Committed Use Discounts (CUDs) provide significant cost savings and guaranteed capacity in return for a long-term commitment.

Dynamic Workload Scheduler (DWS) comes in two modes, referred to later in this post as DWS calendar and DWS Flex, for workloads that either need start-time assurance or can be flexible about start time; the capacity is available for a defined duration and offered at a lower list price.

On-demand consumption is the most flexible, with no upfront commitments, although the capacity availability is not guaranteed.

Spot VMs provide drastically lower costs but are preemptible, requiring resilient and disruption-tolerant job designs.

To further accelerate your distributed training, we’ll explore three powerful orchestration strategies on Google Cloud: Google Kubernetes Engine (GKE), Cluster Toolkit, and Vertex AI custom training pipeline. Each approach brings its unique strengths, enabling you to leverage Google Cloud’s powerful infrastructure to drive your machine learning projects forward quickly and scalably.
Let’s walk through each of the options to better understand how Google Cloud’s advanced orchestration tools can help you optimize resources, reduce complexity, and achieve strong performance in your ML initiatives.
Option 1: GKE for unified workload management
Enterprises with robust platform teams often want a unified environment on which to run all their workloads, including custom training, for simpler management. GKE is a good choice in this context, providing the flexibility and scalability to handle diverse workloads on a single platform. With GKE, platform teams gain centralized control and visibility, while optimizing resource utilization and streamlining management.
Here’s how to orchestrate ML workloads running on GKE:
1. GKE cluster and nodepool provisioning
If you have a reservation (CUD or DWS calendar) and prefer to use Terraform, follow the instructions in the cluster provisioning templates and specify the parameter file (terraform.tfvars):

```bash
cat >./terraform.tfvars <<EOF
project_id = "PROJECT_XXXX"
resource_prefix = "a3mega-test"
region = "us-east4"
node_pools = [
{
  zone = "us-east4-a"
  node_count = 2
  compact_placement_policy = {
    existing_policy_name = "your-compact-placement-policy-name"
    specific_reservation = "your-reservation-name"
  }
},
]
EOF
```

Then execute the following command to provision the GKE cluster and nodepool:

```bash
docker run --rm \
  -v "${PWD}:/root/aiinfra/input" \
  -v "${HOME}/.config/gcloud:/root/.config/gcloud" \
  us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
  create a3-mega gke
```

In addition, Cluster Toolkit includes Terraform-based example blueprints to provision A3 or A3 Mega GKE clusters and nodepools.
If you prefer to use the gcloud command, follow the step-by-step instructions from this tutorial to create a GKE cluster and nodepool with A3/A3 Mega VMs.
For DWS Flex, you can create a DWS-enabled nodepool with these gcloud commands:

```bash
export CLUSTER_NAME=rick-a3-mega-spot
export REGION=asia-northeast1
export ZONE=$REGION-b
export PREFIX=rick-a3-mega-spot-gpu
gcloud beta container node-pools create dws-a3-mega \
  --cluster=$CLUSTER_NAME \
  --node-locations $ZONE --region $REGION \
  --enable-queued-provisioning \
  --accelerator type=nvidia-h100-mega-80gb,count=8,gpu-driver-version=latest \
  --machine-type=a3-megagpu-8g \
  --additional-node-network network=$PREFIX-0,subnetwork=$PREFIX-0 \
  --additional-node-network network=$PREFIX-1,subnetwork=$PREFIX-1 \
  --additional-node-network network=$PREFIX-2,subnetwork=$PREFIX-2 \
  --additional-node-network network=$PREFIX-3,subnetwork=$PREFIX-3 \
  --additional-node-network network=$PREFIX-4,subnetwork=$PREFIX-4 \
  --additional-node-network network=$PREFIX-5,subnetwork=$PREFIX-5 \
  --additional-node-network network=$PREFIX-6,subnetwork=$PREFIX-6 \
  --additional-node-network network=$PREFIX-7,subnetwork=$PREFIX-7 \
  --enable-gvnic \
  --no-enable-autoupgrade \
  --scopes "https://www.googleapis.com/auth/cloud-platform" \
  --enable-autoscaling \
  --num-nodes=0 \
  --total-max-nodes 3 \
  --location-policy=ANY \
  --reservation-affinity=none \
  --no-enable-autorepair
…
```
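The eight --additional-node-network flags in the command above follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal Python sketch, assuming each network and subnetwork is named `<prefix>-<index>` exactly as in the command above. The helper name `additional_network_flags` is hypothetical, and the function only builds strings; it does not call gcloud.

```python
def additional_network_flags(prefix: str, count: int = 8) -> list[str]:
    """Build one --additional-node-network flag per GPU network.

    Assumption: network and subnetwork share the name "<prefix>-<index>",
    mirroring the gcloud command shown in this post.
    """
    return [
        f"--additional-node-network network={prefix}-{i},subnetwork={prefix}-{i}"
        for i in range(count)
    ]


# Prefix taken from the example above; replace with your own.
flags = additional_network_flags("rick-a3-mega-spot-gpu")
print(" \\\n  ".join(flags))
```

The printed lines can be pasted into the node-pool command in place of the hand-written flags.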

2. Enable GPU direct communication with A3 TCPX/A3 Mega TCPXO and perform an initial benchmark test
Follow these steps to install GPUDirect for TCPX/TCPXO libraries, configure NCCL, and deploy a test workload to perform your initial benchmark tests.
Validate the output of allgather benchmark tests for two A3 Mega nodes:

```text
size   count        type   redop   root   time   algbw    busbw
(B)    (elements)                         (us)   (GB/s)   (GB/s)
```

In the above benchmark output table, the first column is the message size, while the algbw and busbw columns on the right indicate per-GPU bandwidth. Usually, we use the in-place or out-of-place busbw value for the biggest message size to determine cross-node bandwidth. For A3 Mega nodes, we expect a range of 185-190 GB/s per GPU, which is close to the 1,600 Gbps (200 GB/s) cross-node network bandwidth of an A3 Mega node with 8 NVIDIA H100 GPUs and 8 NICs.
You may expand the NCCL tests from two nodes to 8, 16, 32, etc. to confirm that cross-node network performance stays within the expected range and that all the nodes are healthy.
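To make the bandwidth figures above concrete, here is a small Python sketch of the arithmetic. It converts the 1600 Gbps node bandwidth to GB/s and expresses the expected 185-190 GB/s busbw range as a fraction of it. The `allgather_busbw` helper applies the all-gather formula given in the nccl-tests performance notes (busbw = algbw × (n − 1) / n); the function names are illustrative and not part of any library.

```python
def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


def allgather_busbw(algbw_gb_s: float, num_ranks: int) -> float:
    """Bus bandwidth for all-gather: busbw = algbw * (n - 1) / n."""
    return algbw_gb_s * (num_ranks - 1) / num_ranks


# 8 NICs on one A3 Mega node give 1600 Gbps in total.
line_rate = gbps_to_gigabytes_per_s(1600)

# Expected busbw range quoted in the text above.
for busbw in (185.0, 190.0):
    print(f"{busbw:.0f} GB/s is {busbw / line_rate:.1%} of {line_rate:.0f} GB/s")
```

Under these numbers the expected range corresponds to 92.5% to 95.0% of the 200 GB/s line rate.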
3. Configure distributed training batch workload
You can use JobSet, a Kubernetes-native API for managing a group of Kubernetes Jobs as a unit, to deploy distributed HPC (e.g., MPI) and AI/ML training workloads (PyTorch, JAX, TensorFlow, etc.) on Kubernetes.
The following example illustrates a JobSet YAML manifest for A3 with GPUDirect-TCPX, which includes:

a. Key JobSet configuration elements

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch
spec:
  replicatedJobs:
  - name: workers
    template:
      spec:
        parallelism: 2  # number of nodes
        completions: 2  # number of nodes
        backoffLimit: 0
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
          spec:
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-h100-80gb
…
```

b. Training job settings, including the PyTorch main container
c. gcsfuse, tcpx (A3 high), tcpxo (A3 Mega) RxDM container
d. NCCL environment variables
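When the node count changes often, the JobSet fields that depend on it can be generated rather than edited by hand. The following is a minimal Python sketch, assuming the same apiVersion, kind, and field layout as the manifest excerpt above. The function name `jobset_manifest`, the container name, the placeholder image, and the `restartPolicy` value are assumptions added for illustration. The sketch omits the gcsfuse annotation, the TCPX/TCPXO RxDM sidecar, and the NCCL environment variables, so it is not a complete training manifest. It emits JSON, which kubectl also accepts.

```python
import json


def jobset_manifest(name: str, num_nodes: int, accelerator: str, image: str) -> dict:
    """Return a JobSet manifest as a dict.

    parallelism and completions are both set to the node count,
    as in the YAML excerpt in this post.
    """
    pod_spec = {
        "nodeSelector": {"cloud.google.com/gke-accelerator": accelerator},
        "restartPolicy": "Never",  # assumption: not shown in the excerpt
        "containers": [{"name": "pytorch", "image": image}],  # placeholder container
    }
    job_spec = {
        "parallelism": num_nodes,
        "completions": num_nodes,
        "backoffLimit": 0,
        "template": {"spec": pod_spec},
    }
    return {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",
        "kind": "JobSet",
        "metadata": {"name": name},
        "spec": {"replicatedJobs": [{"name": "workers", "template": {"spec": job_spec}}]},
    }


# Hypothetical image name; substitute your own training image.
manifest = jobset_manifest("pytorch", 2, "nvidia-h100-80gb", "example.com/your-training-image:tag")
print(json.dumps(manifest, indent=2))
```

Keeping the node count as a single function argument avoids the common mistake of updating parallelism but not completions.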
For DWS batch workloads, refer to the following A3 Mega-based example, which integrates Kueue and JobSet settings.
Lastly, refer to this Helm chart example to see how to perform Megatron-LM (Llama2) training on A3 Mega.
Option 2: Slurm via Cluster Toolkit
Slurm is one of the most popular high-performance computing (HPC) job schedulers. Used by researchers in both academia and industry, it offers a robust solution for LLM training orchestration with familiar semantics. Support for Slurm on Google Cloud is provided by Cluster Toolkit, formerly known as Cloud HPC Toolkit, open-source software that simplifies the process of deploying HPC, AI, and ML workloads on Google Cloud. It is designed to be highly customizable and extensible, and to address the deployment needs of a broad range of use cases, including deploying infrastructure for large-scale LLM training.
1. Provisioning A3-high and A3-mega clusters
Install Cluster Toolkit using the configuration instructions in the public documentation. Be sure to note the prerequisites, including supported versions of Go, Terraform, and Packer.
Once you have a working Cluster Toolkit installation, including the downloaded GitHub repository, navigate to the examples/machine-learning blueprints directory. Here, you will find two folders for deploying H100 clusters based on the A3-series machine shapes: a3-highgpu-8g and a3-megagpu-8g. In this example, we’ll explore the blueprint in the a3-megagpu-8g folder.
Google Cloud Cluster Toolkit blueprints are Infrastructure as Code (IaC) documents that describe the infrastructure you would like to deploy, and are conceptually similar to Terraform or other IaC tooling. For the a3-megagpu-8g blueprint, there are three main files that control the deployment: 

slurm-a3mega-base.yaml – creates the necessary VPC networks along with the Filestore instance used for a common home filesystem on the cluster nodes.
slurm-a3mega-image.yaml – creates the Compute Engine image instance that is used by Slurm to provision nodes based on the cluster’s definition.
slurm-a3mega-cluster.yaml – sets up the main cluster components, including the Slurm controller (the main orchestrator for Slurm jobs), the Slurm login node (a host used for job submission), and the a3mega partition (the working nodes in the cluster).

While you can customize each of the blueprint components if needed, you can easily get started by simply specifying the details for your working environment in the deployment-base.yaml and the deployment-image-cluster.yaml.
2. Enable GPUDirect-TCPXO optimized NCCL communication
Once the Slurm cluster is created, follow this tutorial to enable GPUDirect-TCPXO for optimized NCCL communication on the GPU networks. To validate the environment and ensure the TCPXO plugin is being properly loaded, build the NCCL tests. Then, run sbatch run-nccl-tests.sh from the login node, being sure to change the number of nodes in the script to match the number in your cluster. This runs a distributed all_gather test across the GPUs and nodes indicated in the script.

```bash
#SBATCH --partition=a3mega
#SBATCH --mem=0
#SBATCH -N 2 # CHANGE TO REFLECT # Of a3-mega compute nodes

#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8

# Usage: sbatch run-nccl-tests.sh
```

When working as intended, the NCCL tests should report high bandwidth at various message sizes. A common measure of performance is the busbw value in GB/s from the second-to-last or last row of the output table, which show the 4 GB and 8 GB message sizes. A cluster with TCPXO active should report around 190 GB/s busbw throughput. See the performance page in the NVIDIA NCCL-tests repository for more details on these metrics.
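Reading busbw off the last rows by eye gets tedious across many runs, so the check can be scripted. The following is a minimal Python sketch, assuming the usual nccl-tests row layout of 13 whitespace-separated columns (size, count, type, redop, root, then time/algbw/busbw/#wrong for out-of-place and again for in-place). The sample text is hypothetical and its numbers are made up for illustration; they are not measured results. The function names are illustrative as well.

```python
# Hypothetical sample in the nccl-tests column layout; values are invented.
SAMPLE_OUTPUT = """\
#       size         count    type   redop    root      time   algbw   busbw  #wrong      time   algbw   busbw  #wrong
#        (B)    (elements)                              (us)  (GB/s)  (GB/s)              (us)  (GB/s)  (GB/s)
  4294967296      67108864   float    none      -1   21474.8  200.00  187.50       0   21474.8  200.00  187.50       0
  8589934592     134217728   float    none      -1   42949.7  200.00  187.50       0   42949.7  200.00  187.50       0
"""


def parse_busbw(output: str) -> list[tuple[int, float, float]]:
    """Return (size_bytes, out_of_place_busbw, in_place_busbw) for each data row."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Header and comment lines start with '#', so they fail the digit check.
        if len(fields) != 13 or not fields[0].isdigit():
            continue
        rows.append((int(fields[0]), float(fields[7]), float(fields[11])))
    return rows


def busbw_at_largest_size(rows: list[tuple[int, float, float]]) -> float:
    """Lower of the two busbw values on the row with the largest message size."""
    _, out_of_place, in_place = max(rows, key=lambda row: row[0])
    return min(out_of_place, in_place)


rows = parse_busbw(SAMPLE_OUTPUT)
minimum_expected = 185.0  # assumption: lower end of the range quoted in this post
print("healthy" if busbw_at_largest_size(rows) >= minimum_expected else "investigate")
```

Taking the lower of the in-place and out-of-place values is a conservative choice; either column alone could be used instead.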
3. Run a NeMo training workload
Follow this NeMo training tutorial to run an example NeMo Framework pre-training job using the following steps:
Step 1:

```bash
sbatch setup.nemo.sh
```

Creates a NeMo Framework-derived container with the necessary TCPXO environment variables

Submits a Slurm job to copy the framework launcher scripts and a few other auxiliary files into your working directory

Step 2:

```bash
# Create and activate a Python virtual environment (the directory name "env" is illustrative)
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt # Copied from the NeMo Framework Container earlier
# This is needed to use 23.11 and python3.11, which is what is present on
# Debian 12
pip install -U hydra-core
```

Establishes a Python virtual environment and installs NeMo Framework python package dependencies

Step 3:

```bash
cd launcher_scripts
mkdir data
MAX_STEPS=10
NUM_NODES=8
python main.py \
  launcher_scripts_path=${PWD} \
  stages=[training] \
  training=gpt3/5b \
  env_vars.TRANSFORMERS_OFFLINE=0 \
  container=../nemofw+tcpxo-23.11.sqsh \
  container_mounts='["/var/lib/tcpxo/lib64"]' \
  cluster.srun_args=["--container-writable"] \
  training.model.data.data_impl=mock \
  training.model.data.data_prefix=[] \
  training.trainer.max_steps=${MAX_STEPS} \
  training.trainer.val_check_interval=${MAX_STEPS} \
  training.trainer.limit_val_batches=0.0 \
  training.exp_manager.create_checkpoint_callback=False \
  training.exp_manager.resume_if_exists=False \
  training.trainer.num_nodes=${NUM_NODES}
```

This command runs distributed training of a 5B parameter GPT3 model across eight nodes for 10 steps using mock data as the input.
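Several of the overrides in the command above depend on only two values, the node count and the step count. The following is a small Python sketch, assuming the same override keys as in that command, that builds the dependent overrides from those two values and reports the resulting GPU count. The function names are hypothetical, and the default of 8 GPUs per node reflects the a3-megagpu-8g shape used in this post.

```python
def hydra_overrides(num_nodes: int, max_steps: int) -> list[str]:
    """Build the key=value overrides that depend on node and step counts."""
    settings = {
        "training.trainer.num_nodes": num_nodes,
        "training.trainer.max_steps": max_steps,
        # The command above sets the validation interval equal to max_steps.
        "training.trainer.val_check_interval": max_steps,
    }
    return [f"{key}={value}" for key, value in settings.items()]


def total_gpus(num_nodes: int, gpus_per_node: int = 8) -> int:
    """Number of GPUs in the job when every GPU on every node is used."""
    return num_nodes * gpus_per_node


overrides = hydra_overrides(num_nodes=8, max_steps=10)
print(" ".join(overrides))
print(f"GPUs in use: {total_gpus(8)}")
```

With eight nodes, the job spans 64 GPUs.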

Option 3: Vertex AI
For teams seeking a managed infrastructure experience as well as access to leading open models such as Llama 3.1, Mistral, etc., the Vertex AI Model Garden and Custom Training Job services present an attractive option. This fully managed solution removes most of the orchestration burden and provides end-to-end ML platform operations, allowing you to focus on model development and experimentation. Vertex AI’s end-to-end training support further simplifies the process, offering an integrated workflow from data preparation to deployment.
Let’s look at how to perform a single-node or multi-node fine-tuning/training workload on Vertex AI.
Single-node multi-GPU fine-tuning/training on Vertex AI
This notebook demonstrates fine-tuning and deploying Llama 3.1 models with the Vertex AI SDK. All of the examples in this notebook use parameter-efficient fine-tuning (PEFT) methods with Low-Rank Adaptation (LoRA) to reduce training and storage costs. LoRA is one PEFT approach, in which the pretrained model weights are frozen and rank decomposition matrices representing the change in model weights are trained during fine-tuning.
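To see why LoRA reduces training and storage costs, consider one frozen weight matrix of shape d × k. LoRA trains two small matrices of shapes d × r and r × k, so the trainable parameter count is r × (d + k) instead of d × k. The Python sketch below works through that arithmetic. The dimensions d = k = 4096 and rank r = 8 are illustrative values chosen for the example, not settings taken from the notebook or from any specific Llama 3.1 layer.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Parameters updated when the whole d x k weight matrix is trained."""
    return d * k


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in the two rank-r factors (d x r and r x k) trained by LoRA."""
    return r * (d + k)


d = k = 4096  # illustrative layer dimensions
r = 8         # illustrative LoRA rank

full = full_finetune_params(d, k)
lora = lora_trainable_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For these example values, LoRA trains 65,536 parameters instead of 16,777,216, about 0.39% of the matrix.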
Multi-node distributed fine-tuning/training on Vertex AI
This Vertex sample training repo provides examples on how to launch multi-node distributed training on A3 Mega (8 x NVIDIA H100) on Vertex. 
The NeMo example illustrates how to perform pre-training, continued pre-training, and supervised fine-tuning (SFT). In addition, because NeMo provides optimized training, it is a popular way to evaluate an AI accelerator (A3 Mega in this case). To benchmark, you can rely on reported metrics such as epoch time and step time. Since NeMo runs on most NVIDIA GPU types, it can be helpful for comparing different AI chips for a given task. Read on to learn how to run the example on Vertex AI with A3 Mega node types.
launch.sh is the main entry point to launch NeMo distributed training with command parameters:

```text
<TRAIN_TYPE>  Job type (options: pretraining,continual-pretraining,full-sft)
<MODEL_NAME>  Model name (options: llama2-7b,llama3-70b)
<LOG_DIR>     Path to the local storage (e.g. /tmp/…) or gcs bucket (/gcs/BUCKET_NAME)
--debug       Pass sleep infinity to launch command
```

Example:

```bash
export REGION=us-central1
export PROJECT_ID=YOUR_PROJECT

# Starting a job to pretrain a llama2-7b model and setting /tmp as the log directory
./launch.sh pretraining llama2-7b /tmp
```

At the end of the launch.sh script, we use a curl command to call the Vertex AI customJobs API to launch the NeMo training job:

```bash
..
# == create json structure with existing environment variables ==
json_job=$(envsubst < vertex-payload.json)

json_file="nemo_${MODEL_NAME}_${TRAIN_TYPE}_${NNODES}.json"

echo $json_job | tee $json_file > /dev/null

job_addr="https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/customJobs"

echo json_file:$json_file
echo job_addr:$job_addr

set -x

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d "@$json_file" \
  $job_addr
  # "$job_addr" TODO: pass the param job_addr to the curl command. does not work with parameterized values.
```
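The two preparatory steps in the script above, substituting environment variables into the payload and composing the customJobs endpoint, can be sketched in Python to show what they produce. This is a minimal sketch, assuming placeholders use the `$VAR` or `${VAR}` form as in vertex-payload.json. The function names are hypothetical, the one-line template is a reduced stand-in for the real payload, and the NNODES value of "1" is an example. Unlike envsubst, `Template.substitute` raises `KeyError` on an undefined variable instead of substituting an empty string. The sketch does not send the request.

```python
import json
from string import Template


def custom_jobs_url(region: str, project_id: str) -> str:
    """Compose the regional customJobs endpoint used by the curl command."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{region}/customJobs"
    )


def render_payload(template_text: str, variables: dict[str, str]) -> dict:
    """Substitute $VAR / ${VAR} placeholders, then parse the result as JSON."""
    rendered = Template(template_text).substitute(variables)
    return json.loads(rendered)


# Reduced stand-in for vertex-payload.json (only the displayName field).
template_text = '{"displayName": "nemo_${MODEL_NAME}_${TRAIN_TYPE}_${NNODES}"}'
variables = {"MODEL_NAME": "llama2-7b", "TRAIN_TYPE": "pretraining", "NNODES": "1"}

payload = render_payload(template_text, variables)
print(payload["displayName"])
print(custom_jobs_url("us-central1", "YOUR_PROJECT"))
```

Parsing the rendered text as JSON before sending it catches substitution mistakes, such as an unquoted value, before the API call is made.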

The job configuration in vertex-payload.json is passed to the curl command that launches the NeMo training job. It includes the job’s resource requirements, as shown:

```json
{
  "displayName": "nemo_${MODEL_NAME}_${TRAIN_TYPE}_${NNODES}",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "a3-megagpu-8g",
          "acceleratorType": "NVIDIA_H100_MEGA_80GB",
          "acceleratorCount": 8
        },
        "replicaCount": "1",
        "diskSpec": {
          "bootDiskType": "pd-ssd",
          "bootDiskSizeGb": 100
        },
        "containerSpec": {
          "imageUri": "classicboyir/nemo:02",
          "command": [
            "sh", "-c"
          ],
          "args": [
            "${TRANSFER_MODEL_CMD} ${LAUNCH_CMD}"
          ],
          "env": [
            {
              "name": "CONFIG_NAME",
              "value": "$MODEL_NAME.yaml"
            },
            {
              "name": "NNODES",
              "value": "$NNODES"
            }
            …..
          ]
```

The job configuration arguments "${TRANSFER_MODEL_CMD} ${LAUNCH_CMD}" in turn embed the full content of the job training script, which also includes all the NCCL environment variables required by A3 Mega, while other PyTorch launch commands are executed by Vertex CustomJob.
Optionally, build your own custom job container image and reference it in the "imageUri" parameter in vertex-payload.json, using this Dockerfile as your reference.
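Before submitting a payload like the one above, it can help to confirm how many GPUs it requests. The following is a minimal Python sketch, assuming the `workerPoolSpecs` layout shown in the payload excerpt, where `replicaCount` is a string and `acceleratorCount` is a number. The function name `total_accelerators` is hypothetical, and the second worker pool in the example is invented to illustrate a multi-node job.

```python
def total_accelerators(job_spec: dict) -> int:
    """Sum replicaCount * acceleratorCount over all worker pools."""
    return sum(
        int(pool["replicaCount"]) * pool["machineSpec"]["acceleratorCount"]
        for pool in job_spec["workerPoolSpecs"]
    )


a3_mega = {
    "machineType": "a3-megagpu-8g",
    "acceleratorType": "NVIDIA_H100_MEGA_80GB",
    "acceleratorCount": 8,
}

# Single pool with one replica, as in the excerpt above.
single_node = {"workerPoolSpecs": [{"machineSpec": a3_mega, "replicaCount": "1"}]}

# Hypothetical multi-node layout: one primary replica plus three workers.
multi_node = {
    "workerPoolSpecs": [
        {"machineSpec": a3_mega, "replicaCount": "1"},
        {"machineSpec": a3_mega, "replicaCount": "3"},
    ]
}

print(total_accelerators(single_node), total_accelerators(multi_node))
```

The first value is 8 GPUs for the single-node payload, and the second is 32 GPUs for the hypothetical four-node layout.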
DIY enthusiasts: Building custom training environments
Lastly, we recognize many organizations prefer a more hands-on approach and have specific orchestration tools or frameworks that they wish to use. If that describes you, Google Compute Engine provides the foundation to build your own tailored training environments, letting you create and configure virtual machines (VMs) with your desired specifications, including the type and number of GPUs, CPU, memory, and storage. This granular control lets you optimize your infrastructure for your specific training workloads and integrate your preferred orchestration tools.
To facilitate this process, we provide example code snippets demonstrating how to use the gcloud compute instances create and gcloud compute instances bulk create commands to create and manage your vanilla A3 Mega instances. Whether you need to create a single VM or provision a large-scale cluster, these resources can help streamline your infrastructure setup.
Conclusion
With the right orchestration strategy and Google Cloud’s robust and leading AI infrastructure, you can achieve your training goals and transform your business objectives into reality.
To learn more about distributed training, review the GKE example, the Cluster Toolkit example, and the Vertex AI example.

AI Summary and Description: Yes

Summary: The text provides an in-depth overview of Google Cloud’s offerings for orchestrating distributed training of large-scale AI and machine learning models using GPU and TPU resources. It highlights various orchestration tools, machine families, workload management strategies, and cost-saving consumption models, emphasizing their significance for professionals in AI, cloud computing, and infrastructure security.

Detailed Description: This content serves as a comprehensive guide for developers and data scientists navigating the challenges of distributed AI training. Key themes and concepts include:

* **Need for Distributed Environments**: As AI and LLMs grow in size and complexity, distributed computing becomes essential for efficient model training across multiple GPU and TPU resources.

* **Google Cloud’s AI Hypercomputer Architecture**:
– Offers advanced orchestration tools for GPU accelerators to enhance machine learning workflows.
– Provides tools to optimize resource management and improve scalability.

* **Accelerator Families**: Different GPU machine families are tailored for specific workloads.
– **A3 Series**: Designed for large-scale training workloads with high-performance NVIDIA GPUs (e.g., NVIDIA H100, H200).
– **A2 Series**: Optimized for simple, single-node training scenarios.
– **G2 Machine Family**: Supports inference and testing.

* **Consumption Models**: Various pricing and capacity assurance options are available:
– **Committed Use Discounts (CUDs)**: Cost savings for long-term commitments.
– **Dynamic Workload Scheduler (DWS)**: Flexible capacity options based on workload requirements.
– **On-Demand Consumption**: Highly flexible, with no commitments but limited capacity guarantees.
– **Spot VMs**: Cost-effective but require resilience due to potential interruptions.

* **Orchestration Strategies**:
– **Google Kubernetes Engine (GKE)**: A robust platform for unified workload management across different ML workloads.
– **Cluster Toolkit**: A solution for HPC job scheduling and customized AI/ML workload orchestration.
– **Vertex AI**: Managed infrastructure for deploying models quickly, focusing on model development and experimentation while minimizing orchestration overhead.

* **Implementation Operations**: The document outlines specific operational steps for provisioning clusters, enabling communication, and configuring environments, serving as practical guidance for engineers:
– Instructions on how to utilize Terraform, Kubernetes-native APIs, and Python environments for deploying workloads on Google Cloud.

* **Cluster Management**:
– How to set up Slurm and the Cluster Toolkit for efficient job scheduling and management, particularly for AI tasks.
– Examples of configurations for distributed training jobs, including using JobSet APIs for distributed training across nodes.

* **Final Thoughts**: Concludes with encouragement for organizations to leverage Google Cloud’s robust infrastructure and orchestration strategies to meet their AI training objectives efficiently.

Overall, this text is a significant resource for security, privacy, and compliance professionals in AI and cloud computing, providing insights into best practices and strategies for managing AI workloads securely and efficiently.