Source URL: https://cloud.google.com/blog/products/ai-machine-learning/q2-2025-ai-hypercomputer-updates/
Source: Cloud Blog
Title: Announcements for AI Hypercomputer: The latest infrastructure news for ML practitioners
Feedly Summary: Curious about the latest in AI infrastructure from Google Cloud? Every three months we share a roundup of the latest AI Hypercomputer news, resources, events, learning opportunities, and more. Read on to learn new ways to simplify AI infrastructure deployment, improve performance, and optimize your costs.
AI innovation is moving at an unprecedented rate and achieving remarkable milestones. For example, Gemini Deep Think achieved the gold-medal standard at the latest International Mathematical Olympiad, and Google now serves over 980 trillion tokens per month. We are able to achieve these milestones through AI-optimized hardware, leading software with open frameworks, and flexible consumption models.
At Google Cloud, we provide you with access to the same capabilities behind Gemini, Veo 3, and more through AI Hypercomputer, our integrated AI supercomputing system. AI Hypercomputer allows you to train and serve AI models at massive scale, delivering superior performance, lower latency, and best-in-class price/performance. For example, the latest enhancements to Cluster Director simplify the complex task of managing your compute, network, and storage for both training and inference workloads, while new contributions to llm-d help to significantly accelerate the deployment of large-scale inference models on Google Kubernetes Engine (GKE).
Read on for all the latest news.
Dynamic Workload Scheduler
At the top of the stack, AI Hypercomputer offers flexible consumption via Dynamic Workload Scheduler, which optimizes compute resources with your choice of workload scheduling approaches. For workloads with predictable duration, Calendar mode lets you obtain short-term assured capacity at a discount, without long-term commitments. Similar to reserving a hotel room, you know when and where you can train or serve your models; learn how to get started here. Flex Start mode, meanwhile, provides better economics and obtainability for on-demand resources with flexible start-time requirements. This is great for batch processing, training, or fine-tuning, and lets you either start the job once all resources are available, or begin as resources trickle in. Flex Start mode is now available in preview. Calendar mode, in preview, supports A3 and A4 VMs, while Flex Start mode supports all GPU VMs. Both Calendar and Flex Start modes support TPU v5e, TPU v5p, and Trillium.
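To make the distinction between the two modes concrete, here is a minimal, hypothetical Python helper that maps the workload properties described above to a scheduling mode. The mode names simply mirror the prose; this is not a Google Cloud API.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """Properties that matter when picking a Dynamic Workload Scheduler mode."""
    duration_known: bool       # can you predict when and for how long you need capacity?
    flexible_start_time: bool  # can the job wait until resources become available?

def suggest_dws_mode(w: Workload) -> str:
    """Toy decision helper mirroring the blog's description, not an official API."""
    if w.duration_known:
        # Calendar mode: reserve assured capacity for a known window, like a hotel room.
        return "calendar"
    if w.flexible_start_time:
        # Flex Start mode: queue the job and run it when capacity frees up.
        return "flex-start"
    return "on-demand or committed capacity"

if __name__ == "__main__":
    print(suggest_dws_mode(Workload(duration_known=False, flexible_start_time=True)))  # flex-start
```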
Leading software, open frameworks
AI Hypercomputer provides choice and flexibility through support for the most popular AI and ML libraries and frameworks.
Cluster Director
To make deploying and managing large-scale clusters easier, AI Hypercomputer offers Cluster Director, a management and orchestration layer for GPU resources that provides cluster management, job scheduling, performance optimization at scale, and comprehensive visibility and insights. At Next ‘25, we highlighted new capabilities including a new GUI, observability and straggler-detection features, and now you can take advantage of them. Learn more about the latest advances here.
llm-d project releases version 0.2
To help you serve LLMs at scale, the new open-source llm-d project was announced in May for distributed and disaggregated inference with vLLM. The project just released version 0.2, providing performance-optimized patterns for prefill/decode disaggregation and multi-node mixture-of-experts deployments. It also integrates new kernel optimizations in vLLM, and offers new deployers, improved benchmarking, and improved scheduler extensibility. Learn more on the llm-d blog.
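To give a feel for what prefill/decode disaggregation means, the sketch below separates the compute-heavy prefill pass over the prompt from the memory-bound, token-by-token decode loop, running each stage on its own worker pool. This is purely conceptual Python with stubbed model hooks; it does not use the llm-d or vLLM APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Stubbed model hooks. In a real disaggregated deployment these would be
# inference engines on separate accelerator pools, connected by a KV-cache
# transfer path rather than a Python return value.
def prefill(prompt: str) -> dict:
    """Run the full-prompt forward pass and return the KV cache (stubbed)."""
    return {"kv_cache": f"kv({prompt})", "next_token": "The"}

def decode_step(kv_cache: str, token: str) -> str:
    """Generate one token given the cached context (stubbed)."""
    return token + "."

prefill_pool = ThreadPoolExecutor(max_workers=2)   # compute-bound stage
decode_pool = ThreadPoolExecutor(max_workers=8)    # memory-bound stage

def serve(prompt: str, max_new_tokens: int = 4) -> str:
    state = prefill_pool.submit(prefill, prompt).result()   # stage 1: prefill
    out, token = [], state["next_token"]
    for _ in range(max_new_tokens):                         # stage 2: decode loop
        token = decode_pool.submit(decode_step, state["kv_cache"], token).result()
        out.append(token)
    return " ".join(out)

print(serve("Explain disaggregated inference"))
```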
MaxText and MaxDiffusion updates
MaxText and MaxDiffusion are open-source solutions that provide high-performance implementations for training LLMs, and for training and serving diffusion models, respectively. Each solution makes it easy to get started with JAX by providing a robust platform with reproducible recipes, Docker images for TPUs and GPUs, and built-in support for popular open-source models like DeepSeek and Wan 2.1. Integrations like multi-tier checkpointing and the Accelerated Processing Kit (xpk) for cluster setup and workload management help ensure users get access to the latest techniques without having to reinvent the wheel for their unique workloads.
Whether you’re new to training LLMs with JAX or you’re looking to scale to tens of thousands of accelerators, MaxText is the best place to start. To simplify and accelerate onboarding, we’re revamping our UX and onboarding experience by building out our documentation, demystifying concepts like sharding at scale (docs), checkpointing (docs), and designing a model for TPUs (docs). We’re also adding Colab notebook support (initial work) to give simple examples of new features and models. Hear how Kakao and other customers are using MaxText to train their models at Google Cloud Next ‘25.
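If sharding at scale is new to you, the snippet below shows the core JAX pattern that MaxText-style training builds on: lay accelerators out in a named mesh and place arrays on it with NamedSharding so that jitted computation runs sharded across devices. This is a minimal standalone example, not MaxText code.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange whatever devices are available into a 1 x N mesh with named axes.
devices = np.array(jax.devices()).reshape(1, -1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard the weight matrix along the "model" axis and the batch along "data".
w = jax.device_put(jnp.ones((1024, 1024)), NamedSharding(mesh, P(None, "model")))
x = jax.device_put(jnp.ones((32, 1024)), NamedSharding(mesh, P("data", None)))

# jit compiles a program whose matmul runs sharded across the mesh;
# the output sharding is inferred from the input shardings.
y = jax.jit(lambda x, w: x @ w)(x, w)
print(y.sharding)
```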
For more advanced MaxText users, we’re continuing to improve model performance with techniques like pipeline parallelism, where our latest enhancements provide a dramatic step-time reduction by eliminating redundant communication when sharding. For new models, we now support DeepSeek R1-0528 and the Qwen 3 dense models, in addition to Gemma 3, Llama 4, Mixtral, and other popular models. We also added Multi-Token Prediction (MTP) in July, making it easy to incorporate this technique.
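As a rough illustration of Multi-Token Prediction, the sketch below combines the usual next-token loss with an auxiliary loss from a second head that predicts the token two positions ahead. It is a simplified JAX sketch of the general technique under that assumption, not the MaxText implementation; the 0.3 weighting is arbitrary.

```python
import jax.numpy as jnp
import optax  # used only for the cross-entropy helper

def mtp_loss(logits_t1: jnp.ndarray, logits_t2: jnp.ndarray,
             tokens: jnp.ndarray, mtp_weight: float = 0.3) -> jnp.ndarray:
    """Next-token loss plus a depth-2 multi-token prediction loss.

    logits_t1: [batch, seq, vocab] predictions for position i+1
    logits_t2: [batch, seq, vocab] predictions for position i+2 (MTP head)
    tokens:    [batch, seq] ground-truth token ids
    """
    # Next-token loss: logits at position i predict the token at i+1.
    next_tok = optax.softmax_cross_entropy_with_integer_labels(
        logits_t1[:, :-1], tokens[:, 1:]).mean()
    # MTP loss: logits at position i predict the token at i+2.
    skip_tok = optax.softmax_cross_entropy_with_integer_labels(
        logits_t2[:, :-2], tokens[:, 2:]).mean()
    return next_tok + mtp_weight * skip_tok
```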
We’re also expanding the focus of MaxText to provide post-training techniques through integration with Tunix, the new JAX-native post-training library. With MaxText and Tunix, you’ll have end-to-end pre- and post-training capabilities with easy-to-use recipes. With the Tunix integration, we’ll first offer Supervised Fine-Tuning (SFT) before expanding to Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), and Proximal Policy Optimization (PPO).
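To give a flavor of the RL techniques mentioned, here is a small sketch of the group-relative advantage at the heart of GRPO: several completions are sampled per prompt, and each completion's reward is normalized against its own group before weighting the policy update. This is a conceptual sketch of the standard formulation, not Tunix or MaxText code.

```python
import jax.numpy as jnp

def group_relative_advantages(rewards: jnp.ndarray, eps: float = 1e-6) -> jnp.ndarray:
    """Normalize rewards within each group of sampled completions.

    rewards: [num_prompts, group_size] scalar rewards, one row per prompt.
    Returns same-shape advantages: completions scoring above their group's
    mean get a positive advantage, below-average completions a negative one.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
r = jnp.array([[1.0, 0.0, 0.5, 0.5],
               [2.0, 2.0, 2.0, 2.0]])
print(group_relative_advantages(r))  # second row is ~0: no signal when all rewards tie
```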
Join our Discord and share your feedback (feature requests, bugs, documentation) directly in the GitHub repos.
On the MaxDiffusion front, we’re excited to announce support for serving text-to-video models with Wan 2.1 to complement the existing text-to-image training and inference support of FLUX and SDXL. MaxDiffusion already supports training and inference for FLUX Dev and Schnell, Stable Diffusion 2.1, and Stable Diffusion XL. With this end-to-end solution, users can train their own diffusion models in the MaxDiffusion framework, or simply post-train an existing OSS model and then serve it on TPUs and GPUs.
AI-optimized hardware
Monitoring library for TPUs
To help you monitor and optimize the efficiency of your training and inference workloads, we recently released a new monitoring library for Google Cloud TPUs. These new observability and diagnostic tools provide granular insights into performance and accelerator utilization, enabling you to continuously improve the efficiency of your Cloud TPU workloads.
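The snippet below sketches how such a library might be polled from Python to watch accelerator utilization while a job runs. The module path (libtpu.sdk.tpumonitoring), function names, and the duty-cycle metric name are assumptions based on the Cloud TPU monitoring-library documentation, so verify the exact API against the official docs before using it.

```python
import time

# Assumed import path and metric name; check the Cloud TPU monitoring-library
# documentation for the exact module, functions, and supported metrics.
from libtpu.sdk import tpumonitoring

print(tpumonitoring.list_supported_metrics())  # discover what this libtpu build exposes

# Poll TPU duty cycle while a training or serving job runs in another process.
for _ in range(10):
    metric = tpumonitoring.get_metric("duty_cycle_pct")
    print(metric.data())
    time.sleep(5)
```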
Managed Lustre
We also recently announced improvements to Google Cloud Managed Lustre for high-performance computing (HPC) and AI workloads. Managed Lustre now has four distinct performance tiers, providing throughput options of 125 MB/s, 250 MB/s, 500 MB/s, and 1000 MB/s per TiB of capacity, so you can tailor performance to your specific needs. It also now supports up to 8 PiB of capacity, catering to the larger datasets common in modern AI and HPC applications. Furthermore, the service is tightly integrated with Cluster Director and GKE. For example, GKE includes a managed CSI driver, which allows containerized AI training workflows to access data with ultra-low latency and high throughput.
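Because throughput scales with provisioned capacity, sizing a Managed Lustre instance is simple arithmetic. The helper below shows how the advertised per-TiB tiers translate into aggregate bandwidth for a given capacity; it is an illustrative calculation only, not a Google Cloud API.

```python
# Managed Lustre throughput tiers, in MB/s per TiB of provisioned capacity.
TIERS_MBPS_PER_TIB = (125, 250, 500, 1000)

def aggregate_throughput_gbps(capacity_tib: float, tier_mbps_per_tib: int) -> float:
    """Aggregate throughput in GB/s for a given capacity and performance tier."""
    if tier_mbps_per_tib not in TIERS_MBPS_PER_TIB:
        raise ValueError(f"unknown tier: {tier_mbps_per_tib}")
    return capacity_tib * tier_mbps_per_tib / 1000.0

# Example: 100 TiB on the 1000 MB/s-per-TiB tier gives ~100 GB/s aggregate.
print(aggregate_throughput_gbps(100, 1000))
```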
Onwards and upwards
As we continue to push the boundaries of AI, we’ll update and optimize AI Hypercomputer based on our learnings, from training Gemini to serving more than 980 trillion tokens a month. To learn more about using AI Hypercomputer for your own AI workloads, read here. To stay up to date on our progress or ask us questions, join our community!
AI Summary and Description: Yes
Summary: The text discusses advancements and offerings related to Google Cloud’s AI Hypercomputer, focusing on efficient infrastructure for AI model training and deployment. Professionals in AI and cloud computing security will be interested in the performance and cost optimizations it provides, as well as updates to tools supporting large-scale AI workloads.
Detailed Description:
The content provides an in-depth overview of Google’s latest initiatives in AI infrastructure, particularly highlighting the AI Hypercomputer’s capabilities. Key points include:
– **AI Hypercomputer Overview**:
– An integrated AI supercomputing system that supports training and serving AI models at scale.
– Provides enhanced performance metrics, including lower latency and best price/performance ratio.
– **Innovations in AI Infrastructure**:
– The introduction of tools like the **Dynamic Workload Scheduler** for flexible compute resource management.
– **Calendar Mode**: Designed for predictable workloads, offering short-term capacity at discounted rates.
– **Flex Start Mode**: Optimizes economics for on-demand resources with flexible start times.
– **Cluster Director Enhancements**:
– A management tool for GPU resources that simplifies large-scale cluster management, job scheduling, and performance optimization.
– Features a new GUI for better user interaction and observability improvements.
– **llm-d Project Updates**:
– Released version 0.2 with optimized patterns for distributed inference, including kernel optimizations and new deployers for performance enhancements.
– **Open Source Initiatives**:
– **MaxText and MaxDiffusion**: Libraries aimed at training and serving LLMs and diffusion models.
– Integrates with JAX for reproducible environments and includes various model supports and tools to accelerate deployment.
– **Monitoring and Optimization Tools**:
– New monitoring library for Cloud TPUs to improve training and inference workload efficiency.
– **Managed Lustre Enhancements**:
– Tailored performance tiers for high-performance computing (HPC) workloads and AI applications, integrated closely with GKE for seamless data access.
– **Community Engagement**:
– Encouragement for users to provide feedback and engage through community platforms to enhance product development.
As Google Cloud continues to innovate in the AI space, professionals in AI and cloud security should monitor these developments closely for implications on infrastructure security and compliance within their organizations. The sophisticated orchestration of resources, cost-optimization features, and support structures highlight the importance of a secure, efficient environment for deploying AI capabilities in enterprise settings.