Cloud Blog: Using RDMA over Converged Ethernet networking for AI on Google Cloud

Source URL: https://cloud.google.com/blog/products/networking/rdma-rocev2-for-ai-workloads-on-google-cloud/
Source: Cloud Blog
Title: Using RDMA over Converged Ethernet networking for AI on Google Cloud

Feedly Summary: All workloads are not the same. This is especially the case for AI, ML, and scientific workloads. In this blog we show how Google Cloud makes the RDMA over Converged Ethernet version 2 (RoCE v2) protocol available for high-performance workloads.
Traditional workloads
Network communication in traditional workloads follows a well-known flow:

The application initiates a request to move data from source to destination.

The OS processes the data, adds TCP headers and passes it to the network interface card (NIC).

The NIC sends data on the wire based on networking and routing information.

The receiving NIC receives the data.

The OS on the receiving end strips the headers and delivers the data to the target application.

This process involves both the CPU and the OS at every step. The upside is resilience: these networks can recover from latency and packet loss and handle data of varying sizes while continuing to function normally.
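To make the OS-mediated path concrete, here is a minimal Python sketch of a traditional TCP transfer. Nothing here comes from the blog post itself; it is a standard socket example in which every send and receive traps into the kernel:

```python
import socket

# Traditional path: the application hands bytes to the OS, which adds
# TCP/IP headers and passes segments to the NIC for transmission.
def send_payload(host: str, port: int, payload: bytes) -> None:
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)  # kernel copies the data and builds headers

def receive_payload(port: int) -> bytes:
    with socket.create_server(("", port)) as server:
        conn, _addr = server.accept()
        with conn:
            chunks = []
            while data := conn.recv(65536):  # kernel strips headers per segment
                chunks.append(data)
            return b"".join(chunks)
```

Both functions spend their time in OS code paths: header construction, buffer copies, and interrupt handling all consume CPU cycles, which is exactly the overhead RDMA removes.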
AI workloads 
AI workloads are different: they involve large datasets and require high bandwidth, low latency, and lossless communication for training and inference. Because these jobs are expensive to run, it's important that they complete as quickly as possible and make efficient use of processing. This can be achieved with accelerators: specialized hardware designed to significantly speed up the training and execution of AI applications. Examples of accelerators include specialized hardware chips like TPUs and GPUs.


RDMA
Remote Direct Memory Access (RDMA) technology allows systems to exchange data directly with one another without involving the OS, the networking stack, or the CPU. This allows for faster processing, since the CPU, which can become a bottleneck, is bypassed.
Let’s take a look at how this works with GPUs (a code sketch follows these steps).

An RDMA-capable application initiates an RDMA operation.

Kernel bypass takes place, avoiding the OS and CPU.

RDMA-capable network hardware accesses the source GPU's memory directly and transfers the data to the destination GPU's memory.

On the receiving end, the application can retrieve the information from the GPU memory, and a notification is sent to the sender as confirmation.
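In practice, most GPU applications reach RDMA through a collective-communication library rather than programming the NIC directly. The sketch below uses PyTorch's NCCL backend, which moves tensors GPU-to-GPU and uses RDMA transports such as RoCE when the underlying fabric supports them. The APIs shown are standard PyTorch; the launch command in the comment is illustrative.

```python
import os
import torch
import torch.distributed as dist

# Minimal GPU-to-GPU exchange via NCCL. NCCL transfers tensors directly
# between GPU memories and picks an RDMA transport (e.g., RoCE) when the
# NICs support it -- the data never stages through CPU buffers.
def main() -> None:
    dist.init_process_group(backend="nccl")        # rank/world size from env
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    tensor = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # GPU-to-GPU collective
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: sum = {tensor[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=8 script.py
```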

How RDMA with RoCE works

Previously, Google Cloud supported RDMA-like capabilities through its own native networking stacks, GPUDirect-TCPX and GPUDirect-TCPXO. That capability has now been expanded with RoCE v2, which implements RDMA over Ethernet.
RoCE v2-capable compute
Both the A3 Ultra and A4 Compute Engine machine types leverage RoCE v2 for high-performance networking. Each node supports eight RDMA-capable NICs connected to the isolated RDMA network. Direct GPU-to-GPU communication within a node occurs via NVLink and between nodes via RoCE.
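A quick way to see this topology from inside a VM is the interconnect matrix reported by nvidia-smi, which is available on GPU images. The sketch below simply shells out to it; the exact output depends on the machine type.

```python
import subprocess

# Print the GPU/NIC interconnect matrix. On multi-GPU nodes the NV*
# entries indicate NVLink paths between local GPUs; traffic between
# nodes instead traverses the RDMA-capable NICs over RoCE v2.
print(subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True, text=True, check=True,
).stdout)
```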
Adopting RoCE v2 networking offers additional benefits, including:

Lower latency

Increased bandwidth: from 1.6 Tbps to 3.2 Tbps of inter-node GPU-to-GPU traffic

Lossless communication due to congestion management capabilities: Priority-based Flow Control (PFC) and Explicit Congestion Notification (ECN)

Use of UDP port 4791 (see the packet-capture sketch after this list)

Support for new VM series such as A3 Ultra, A4, and beyond

Scalability support for large cluster deployments

Optimized rail-designed network

[Image: rail-optimized network design]
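Because RoCE v2 encapsulates RDMA payloads in UDP on the well-known port 4791, its presence on a link can be confirmed with an ordinary packet capture. The sketch below uses scapy, which is not mentioned in the post and is assumed here purely for illustration; it requires root privileges to sniff.

```python
from scapy.all import sniff, UDP  # pip install scapy; run as root

# RoCE v2 packets are RDMA payloads encapsulated in UDP with
# destination port 4791, so a plain BPF filter can spot them.
packets = sniff(filter="udp and port 4791", count=10, timeout=30)
for pkt in packets:
    if UDP in pkt:
        print(pkt.summary())  # one line per captured RoCE v2 packet
```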

Overall, these features result in faster training and inference, directly improving application performance. This is achieved through a specialized VPC network optimized for RDMA traffic. This high-performance connectivity is a key differentiator for demanding applications.
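As an illustration of what such a VPC looks like in code, here is a sketch using the google-cloud-compute Python client. The network_profile field and the profile name shown are assumptions based on the RDMA network profiles described in Google Cloud's documentation; check the Hypercompute Cluster docs linked below for the exact values supported in your region and client library version.

```python
from google.cloud import compute_v1  # pip install google-cloud-compute

# Sketch: create a VPC bound to an RDMA network profile so that the
# RoCE v2 NICs on A3 Ultra / A4 VMs can attach to it.
# PROJECT and the profile name are placeholders, not verified values.
PROJECT = "my-project"
PROFILE = (
    f"https://www.googleapis.com/compute/v1/projects/{PROJECT}"
    "/global/networkProfiles/europe-west1-b-vpc-roce"  # assumed profile name
)

network = compute_v1.Network(
    name="rdma-vpc",
    auto_create_subnetworks=False,  # RDMA VPCs use custom subnets
    network_profile=PROFILE,        # assumed field; needs a recent client
)
op = compute_v1.NetworksClient().insert(
    project=PROJECT, network_resource=network
)
op.result()  # wait for the create operation to finish
```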
Get started
To enable these capabilities, follow these steps:

Create a reservation: Obtain your reservation ID; you may have to work with your support team for capacity requests (see the sketch after these steps).

Choose a deployment strategy: Specify the deployment region, zone, network profile, reservation ID and method.

Create your deployment.
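As a rough illustration of the reservation step, here is a sketch using the google-cloud-compute client. The machine type string ("a3-ultragpu-8g"), zone, and node count are assumptions for illustration; capacity for these machine types generally requires working with your account team, so treat this as the shape of the API call rather than a guaranteed path.

```python
from google.cloud import compute_v1  # pip install google-cloud-compute

# Sketch: reserve A3 Ultra capacity in a zone. Machine type and zone
# below are assumed values -- confirm them against current docs.
PROJECT, ZONE = "my-project", "europe-west1-b"

reservation = compute_v1.Reservation(
    name="a3-ultra-reservation",
    specific_reservation=compute_v1.AllocationSpecificSKUReservation(
        count=2,  # number of nodes to reserve
        instance_properties=compute_v1.AllocationSpecificSKUAllocationReservedInstanceProperties(
            machine_type="a3-ultragpu-8g",
        ),
    ),
    specific_reservation_required=True,  # only matching VMs consume it
)
op = compute_v1.ReservationsClient().insert(
    project=PROJECT, zone=ZONE, reservation_resource=reservation
)
op.result()
print("reservation ready:", reservation.name)
```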

You can see the configuration steps and more in the following documentation:

Documentation: Hypercompute Cluster

Blog: Cross-Cloud network support for AI workloads 

GCT YouTube Channel: AI guide for Cloud Developers 

Want to ask a question, find out more or share a thought? Please connect with me on LinkedIn.

Related Article

Networking support for AI workloads
In this blog we look at some of the benefits of the Cross-Cloud Network in supporting AI and HPC workloads, both managed and self-managed.


AI Summary and Description: Yes

Summary: The text presents an overview of how Google Cloud enables high performance for AI and ML workloads through the implementation of RDMA over Converged Ethernet version 2 (RoCE v2). The discussion touches on the unique requirements of AI workloads, the benefits of RDMA technology, and specific architectural enhancements in Google Cloud’s offerings.

Detailed Description:
The content delves into the performance demands of AI and machine learning workloads compared to traditional computing tasks. Here are the key points:

– **Traditional vs. AI Workloads**:
– Traditional workloads involve standard data movement processes that include OS and CPU interaction, which can lead to delays due to latency and packet loss.
– AI workloads necessitate significant data processing, relying on high bandwidth, low latency, and lossless communication to efficiently manage the training and inference phases.

– **Importance of Specialized Hardware**:
– To meet these demands, Google Cloud emphasizes the use of accelerators, such as TPUs and GPUs, which are engineered to expedite AI task execution.

– **Introduction of RDMA**:
– RDMA (Remote Direct Memory Access) technology allows direct data exchange between systems without involving the CPU or OS, leading to faster processing.
– It significantly reduces bottlenecks typically associated with CPU usage during data transfers.

– **How RDMA with RoCE v2 Works**:
– An RDMA-capable application initiates an operation, bypassing conventional processing channels to interact directly with GPU memory for swift data transfer.
– The new RoCE v2 extension allows greater capabilities through Google Cloud’s networking architecture.

– **Enhanced Networking Features**:
– RDMA over RoCE v2 provides:
– Lower latency and increased bandwidth (up to 3.2 Tbps).
– Lossless communication through features like Priority-based Flow Control (PFC) and Explicit Congestion Notification (ECN).
– Scalability for larger deployments via specialized VPC networks.

– **Practical Application and Deployment**:
– Users can enable these high-performance networking capabilities by creating reservations and specifying deployment strategies, with documented guides provided for further assistance.

Overall, the text illustrates how advancements in networking technology, particularly in cloud environments, can support the growing needs of AI and ML applications, making it a valuable consideration for security, compliance, and infrastructure professionals. The performance improvements not only reduce training times but also enhance the experience of deploying AI solutions in the cloud.