Cloud Blog: 65,000 nodes and counting: Google Kubernetes Engine is ready for trillion-parameter AI models

Source URL: https://cloud.google.com/blog/products/containers-kubernetes/gke-65k-nodes-and-counting/
Source: Cloud Blog
Title: 65,000 nodes and counting: Google Kubernetes Engine is ready for trillion-parameter AI models

Feedly Summary: As generative AI evolves, we’re beginning to see its transformative potential across industries and in our daily lives. And as large language models (LLMs) increase in size (current models reach hundreds of billions of parameters, and the most advanced are approaching 2 trillion), the need for computational power will only intensify. In fact, training these large models on modern accelerators already requires clusters that exceed 10,000 nodes.
With support for 15,000-node clusters — the world’s largest — Google Kubernetes Engine (GKE) has the capacity to handle these demanding training workloads. Today, in anticipation of even larger models, we are introducing support for 65,000-node clusters.
With support for up to 65,000 nodes, we believe GKE offers more than 10X the scale of the other two largest public cloud providers.

Unmatched scale for training or inference
Scaling to 65,000 nodes provides much-needed capacity for the world’s most resource-hungry AI workloads. Combined with innovations in accelerator computing power, this will enable customers to reduce model training time or scale models to multiple trillions of parameters. Each node can be equipped with multiple accelerators (for example, a Cloud TPU v5e node has four chips), making it possible to manage over 250,000 accelerator chips in a single cluster.
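To put that capacity in concrete terms, the back-of-the-envelope arithmetic (assuming four chips per node, as in the Cloud TPU v5e example above) is:

$$65{,}000 \ \text{nodes} \times 4 \ \tfrac{\text{chips}}{\text{node}} = 260{,}000 \ \text{chips}$$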
To develop cutting-edge AI models, customers need to be able to allocate computing resources across diverse workloads: not only model training, but also serving, inference, ad hoc research, and auxiliary tasks. Centralizing computing power in the smallest possible number of clusters gives customers the flexibility to adapt quickly to shifts in demand across inference, research, and training workloads.
With support for 65,000 nodes, GKE can now run five jobs in a single cluster, each matching the scale of Google Cloud’s previous record for the world’s largest LLM training job.
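As a rough illustration of what sharing one cluster across several record-scale jobs could look like, the sketch below uses the official Kubernetes Python client to submit five training Jobs, each taking a hypothetical 13,000-node slice of a 65,000-node cluster. The job names, container image, and resource figures are illustrative placeholders, not values from this post:

```python
# Illustrative sketch (not from the post): split a 65,000-node cluster across
# five equally sized training Jobs using the official Kubernetes Python client.
# Requires: pip install kubernetes, plus credentials for a reachable cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
batch = client.BatchV1Api()

NODES_PER_JOB = 13_000  # hypothetical even split: 5 jobs x 13,000 nodes = 65,000

for i in range(5):
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=f"trainer-{i}"),
        spec=client.V1JobSpec(
            parallelism=NODES_PER_JOB,  # one pod per node
            completions=NODES_PER_JOB,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="trainer",
                            image="example.com/llm-trainer:latest",  # placeholder image
                            resources=client.V1ResourceRequirements(
                                # e.g., one four-chip TPU node per pod (illustrative)
                                limits={"google.com/tpu": "4"},
                            ),
                        )
                    ],
                )
            ),
        ),
    )
    batch.create_namespaced_job(namespace="default", body=job)
    print(f"submitted trainer-{i} ({NODES_PER_JOB} pods)")
```

In practice, a scheduler or queueing layer such as Kueue (discussed below) would arbitrate which jobs get capacity, rather than a fixed even split.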
Customers on the cutting edge of AI welcome these developments. Anthropic, an AI safety and research company working to build reliable, interpretable, and steerable AI systems, is excited about GKE’s expanded scale.
“GKE’s new support for larger clusters provides the scale we need to accelerate our pace of AI innovation.” – James Bradbury, Head of Compute, Anthropic
Innovations under the hood
This achievement is made possible by a variety of enhancements. For one, we are transitioning GKE from etcd, the open-source distributed key-value store, to a new, more robust key-value store built on Spanner, Google’s distributed database that delivers virtually unlimited scale. Beyond the ability to support larger GKE clusters, this change ushers in new levels of reliability for GKE users, with improved latency for cluster operations (e.g., cluster startup and upgrades) and a stateless cluster control plane. By implementing the etcd API on top of our Spanner-based storage, we help ensure backward compatibility and avoid having to change core Kubernetes to adopt the new technology.
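The post doesn’t describe the new store’s internals, but the compatibility idea can be sketched in a few lines: the control plane keeps coding against the same etcd-style key-value contract, and the backend behind that contract is swapped out. The following Python sketch is purely illustrative; all class and method names are hypothetical:

```python
# Purely illustrative: the control plane codes against an etcd-style key-value
# contract; the backend behind it can change without callers noticing.
# All class and method names here are hypothetical.
from abc import ABC, abstractmethod


class KVStore(ABC):
    """The etcd-style contract that control-plane code depends on."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def range(self, prefix: str) -> dict[str, bytes]: ...


class EtcdBackedStore(KVStore):
    """Stand-in for the original etcd backend."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def range(self, prefix: str) -> dict[str, bytes]:
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}


class SpannerBackedStore(KVStore):
    """Stand-in for a Spanner-backed store exposing the same etcd-style API.
    A real implementation would issue Spanner reads and writes here."""

    def __init__(self) -> None:
        self._rows: dict[str, bytes] = {}  # placeholder for Spanner tables

    def put(self, key: str, value: bytes) -> None:
        self._rows[key] = value

    def range(self, prefix: str) -> dict[str, bytes]:
        return {k: v for k, v in self._rows.items() if k.startswith(prefix)}


def register_pod(store: KVStore, name: str) -> None:
    # Caller code is identical no matter which backend is plugged in.
    store.put(f"/registry/pods/default/{name}", b'{"spec": "..."}')


for backend in (EtcdBackedStore(), SpannerBackedStore()):
    register_pod(backend, "web-0")
    print(type(backend).__name__, list(backend.range("/registry/pods/")))
```

The design choice mirrored here is that callers never see the storage change: register_pod works identically against both backends, which is what allows adopting a new store without modifying core Kubernetes.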
In addition, thanks to a major overhaul of the GKE infrastructure that manages the Kubernetes control plane, GKE now scales significantly faster, meeting the demands of your deployments with fewer delays. This enhanced cluster control plane delivers multiple benefits, including the ability to run high-volume operations with exceptional consistency. The control plane now automatically adjusts to these operations, while maintaining predictable operational latencies. This is particularly important for large and dynamic applications such as SaaS, disaster recovery and fallback, batch deployments, and testing environments, especially during periods of high churn.
We’re also constantly innovating on IaaS and GKE capabilities to make Google Cloud the best place to build your AI workloads. Recent innovations in this space include: 

Secondary boot disk, which provides faster workload startups through container image caching

Fully managed DCGM metrics for improved accelerator monitoring

Hyperdisk ML, a high-performance storage solution for scalable applications, now generally available

Serverless GPUs, now available in Cloud Run

Custom compute classes, which offer greater control over compute resource allocation and scaling

Support for Trillium, our sixth-generation TPU, the most performant and most energy-efficient TPU to date 

Support for A3 Ultra VMs powered by NVIDIA H200 Tensor Core GPUs with our new Titanium ML network adapter, which delivers 3.2 Tbps of non-blocking GPU-to-GPU traffic with RDMA over Converged Ethernet (RoCE). A3 Ultra VMs will be available in preview next month.

A continued commitment to open source
Guided by Google’s long-standing and robust open-source culture, we make substantial contributions to the open-source community, including when it comes to scaling Kubernetes. In adding support for 65,000-node clusters, we made sure that all the optimizations and improvements needed for that scale are part of core open-source Kubernetes.
Our investments to make Kubernetes the best foundation for AI platforms go beyond scalability. Here is a sampling of our contributions to the Kubernetes project over the past two years:

Drove a major overhaul of the Job API

Incubated the Kubernetes Batch Working Group to build a community around research, HPC, and AI workloads, producing tools like Kueue, which is becoming the de facto standard for job queueing on Kubernetes (see the sketch after this list)

Created the JobSet operator, which is being integrated into the Kubeflow ecosystem to help run heterogeneous jobs (e.g., driver-executor)

Created the LeaderWorkerSet controller for multihost inference use cases

Published JetStream, a highly optimized model server for LLM inference

Incubated the Kubernetes Serving Working Group, which is driving multiple efforts including model metrics standardization, Serving Catalog and Inference Gateway
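
To make the Kueue item above concrete, here is a minimal sketch of the usual submission pattern: the Job is created suspended and labeled with the queue it should wait in, and Kueue unsuspends it once quota is available. It assumes Kueue is installed in the cluster and that a LocalQueue named training-queue exists; the job name, image, and sizes are placeholders:

```python
# Sketch of the common Kueue submission pattern (assumes Kueue is installed
# and a LocalQueue named "training-queue" exists; image and sizes are
# placeholders). The Job starts suspended; Kueue unsuspends it on admission.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(
        name="queued-trainer",
        labels={"kueue.x-k8s.io/queue-name": "training-queue"},
    ),
    spec=client.V1JobSpec(
        suspend=True,  # Kueue flips this to False once quota is available
        parallelism=8,
        completions=8,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="worker",
                        image="example.com/trainer:latest",
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "4", "memory": "16Gi"},
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```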

At Google Cloud, we’re dedicated to providing the best platform for running containerized workloads, consistently pushing the boundaries of innovation. These new advancements allow us to support the next generation of AI technologies. For more, listen to the Kubernetes Podcast, where Maciek Rozacki and Wojtek Tyczynski join host Kaslin Fields to talk about GKE’s support for 65,000 nodes. You can also see a demo of 65,000 nodes running in a single GKE cluster here.

AI Summary and Description: Yes

**Summary:** The text highlights the advancements in Google Kubernetes Engine (GKE), particularly its new support for 65,000-node clusters designed to meet the increasing computational demands of large language models (LLMs) and generative AI applications. It illustrates the transformative potential and scalability of GKE as a platform for running AI workloads, showcasing enhancements that improve performance, reliability, and flexibility for developers and researchers.

**Detailed Description:**
The text elaborates on the significant growth and technological development of Google Kubernetes Engine (GKE) to accommodate the escalating demands of AI and machine learning workloads, particularly those involving large language models (LLMs). Key points include:

– **Scaling Capabilities:**
– GKE now supports clusters with up to 65,000 nodes, which is expected to dramatically reduce model training times and enable training of models with trillions of parameters.
– Compared with other cloud providers, GKE offers more than 10X the scale for AI workloads.

– **Resource Allocation Flexibility:**
– With advanced capabilities, GKE allows for running multiple high-scale jobs within a single cluster.
– The ability to centralize computing resources enhances the capacity to adapt quickly to fluctuating demands in inference, research, and training tasks.

– **Positive Industry Reception:**
– Companies like Anthropic express enthusiasm for GKE’s expanded capabilities, indicating it will accelerate their AI innovations.

– **Infrastructure Innovations:**
– Transition from the open-source etcd to a new key-value store based on Spanner allows for virtually unlimited scalability for GKE, leading to improvements in reliability and latency.
– The GKE infrastructure has been overhauled to enable faster scaling and to meet high operational demands.

– **Technological Enhancements:**
– Introduction of new features like secondary boot disks, performance monitoring solutions, high-performance storage options, serverless GPU support, and advanced compute resource classes.
– Support for advanced TPUs and NVIDIA Tensor Core GPUs, enhancing computational performance and efficiency.

– **Commitment to Open Source:**
– GKE continues Google’s tradition of contributing to the open-source community with major improvements and optimizations that enhance Kubernetes for AI applications.
– Initiatives include the creation of various working groups aimed at facilitating AI workloads and tools tailored for resource management.

– **Community and Future Prospects:**
– Industry professionals are encouraged to stay current via the podcast and demo, which showcase the potential applications and efficiencies GKE can provide.

Overall, the advancements in GKE position it as a leading solution for enterprises focusing on AI and machine learning, signaling opportunities for significant improvements in computational efficiency and operational flexibility essential for handling next-generation AI technologies.