Source URL: https://cloud.google.com/blog/topics/systems/mitigating-power-and-thermal-fluctuations-in-ml-infrastructure/
Source: Cloud Blog
Title: Balance of power: A full-stack approach to power and thermal fluctuations in ML infrastructure
Feedly Summary: The recent explosion of machine learning (ML) applications has created unprecedented demand for power delivery in the data center infrastructure that underpins those applications. Unlike server clusters in the traditional data center, where tens of thousands of workloads coexist with uncorrelated power profiles, large-scale batch-synchronized ML training workloads exhibit substantially different power usage patterns. Under these new usage conditions, it is increasingly challenging to ensure the reliability and availability of the ML infrastructure, as well as to improve data-center goodput and energy efficiency.
Google has been at the forefront of data center infrastructure design for several decades, with a long list of innovations to our name. In this blog post, we highlight one of the key innovations that allowed us to manage unprecedented power and thermal fluctuations in our ML infrastructure. This innovation underscores the power of full codesign across the stack — from ASIC chip to data center, across both hardware and software. We also discuss the implications of this approach and propose a call to action for the broader industry.
New ML workloads lead to new ML power challenges
Today’s ML workloads require synchronized computation across tens of thousands of accelerator chips, together with their hosts, storage, and networking systems; these workloads often occupy an entire data-center cluster, or even multiple clusters. The peak power utilization of these workloads can approach the rated power of all the underlying IT equipment, making power oversubscription much more difficult. Furthermore, power consumption rises and falls between idle and peak utilization levels much more steeply, because the entire cluster’s power usage is now dominated by no more than a few large ML workloads. You can observe these power fluctuations when a workload launches or finishes, or when it is halted and then resumed or rescheduled. You may also observe a similar pattern while the workload is running normally, mostly attributable to alternating compute- and networking-intensive phases within a training step. Depending on the workload’s characteristics, these inter- and intra-job power fluctuations can occur very frequently, with multiple unintended consequences for the functionality, performance, and reliability of the data center infrastructure.
Fig. 1. Large power fluctuations observed on cluster level with large-scale synchronized ML workloads
In fact, in our latest batch-synchronous ML workloads running on dedicated ML clusters, we observed power fluctuations in the tens of megawatts (MW), as shown in Fig. 1. Compared to a traditional load-variation profile, the ramps are nearly instantaneous, can repeat as frequently as every few seconds, and can persist for weeks, or even months!
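To make this fluctuation pattern concrete, here is a minimal sketch (with made-up numbers, not fleet measurements) of a cluster whose power is dominated by a single batch-synchronous job that alternates between a compute-intensive phase and a networking-intensive phase within each training step. The power levels, step time, and phase split below are illustrative assumptions only.

```python
import numpy as np

# Illustrative assumptions only (not fleet measurements): a cluster whose
# power is dominated by one batch-synchronous job that alternates between a
# compute-heavy phase near peak power and a network-heavy phase at much
# lower accelerator utilization within every training step.
PEAK_MW = 10.0          # power during the compute-intensive phase
COMM_MW = 4.0           # power during the networking-intensive phase
STEP_SECONDS = 2.0      # duration of one training step
COMPUTE_FRACTION = 0.7  # fraction of each step spent in the compute phase
DT = 0.01               # simulation resolution, in seconds

def cluster_power(duration_s: float) -> np.ndarray:
    """Return a time series of cluster power (MW) for the synchronized job."""
    t = np.arange(0.0, duration_s, DT)
    phase = (t % STEP_SECONDS) / STEP_SECONDS  # position within the current step
    return np.where(phase < COMPUTE_FRACTION, PEAK_MW, COMM_MW)

trace = cluster_power(duration_s=20.0)
print(f"peak-to-trough swing: {trace.max() - trace.min():.1f} MW")
print(f"swings per minute:    {60.0 / STEP_SECONDS:.0f}")
```

Even this toy model makes the qualitative problem visible: the entire cluster swings between its highest and lowest power levels every few seconds, in lockstep, for as long as the job runs.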
Fluctuations of this kind pose the following risks:
Functionality and long-term reliability issues with rack and data center equipment (including, but not limited to, rectifiers, transformers, generators, cables, and busways), resulting in hardware-induced outages, reduced energy efficiency, and increased operational and maintenance costs
Damage, outage, or throttling at the upstream utility, including violation of contractual commitments to the utility on power usage profiles, and corresponding financial costs
Unintended and frequent triggering of the uninterruptible power supply (UPS) system by large power fluctuations, shortening the lifetime of the UPS system
Large power fluctuations may also impact hardware reliability at a much smaller, per-chip or per-system scale. Although the maximum temperature is kept well under control, power fluctuations can still translate into large and frequent temperature fluctuations, triggering various degradation mechanisms, including warpage, changes to thermal interface material properties, and electromigration.
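One way to build intuition for why the size of these temperature swings matters is a Coffin-Manson-style fatigue model, in which cycles-to-failure scale roughly as the inverse of the temperature swing raised to a material-dependent exponent. The sketch below is only a back-of-the-envelope illustration under that assumption; the exponent and temperatures are hypothetical placeholders, not reliability data from our fleet.

```python
def relative_cycles_to_failure(delta_t_baseline_c: float,
                               delta_t_mitigated_c: float,
                               exponent: float = 2.0) -> float:
    """Coffin-Manson-style scaling: cycles-to-failure is proportional to
    (temperature swing)^-exponent, so the lifetime ratio is
    (dT_baseline / dT_mitigated)^exponent. The exponent is material- and
    mechanism-dependent; 2.0 here is only an illustrative placeholder."""
    return (delta_t_baseline_c / delta_t_mitigated_c) ** exponent

# Hypothetical example: halving a 20°C swing to 10°C with an assumed
# exponent of 2 would roughly quadruple cycles-to-failure.
print(relative_cycles_to_failure(20.0, 10.0))  # -> 4.0
```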
A full-stack approach to proactive power shaping
Due to the high complexity and large scale of our data-center infrastructure, we posited that proactively shaping a workload’s power profile could be more efficient than simply adapting to it. Google’s full codesign across the stack — from chip to data center, from hardware to software, and from instruction set to realistic workload — provides us with all the knobs we need to implement highly efficient end-to-end power management features to regulate our workloads’ power profiles and mitigate detrimental fluctuations.
Specifically, we instrumented the TPU compiler to detect signatures in the workload that are linked to power fluctuations, such as sync flags. We then dynamically balance the activities of the major compute blocks of the TPU around these flags to smooth out their utilization over time. This achieves our goal of mitigating power and thermal fluctuations with negligible performance overhead. In the future, we may also apply a similar approach to a workload’s starting and completion phases, resulting in a gradual, rather than abrupt, change in power levels.
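The production implementation lives inside our compiler and runtime and isn’t reproduced here, but the following toy model illustrates the core idea: cap the instantaneous activity of the compute blocks and defer the excess work into the otherwise-idle window around the sync flag, conserving total work per step. The profile shape, cap value, and time granularity are illustrative assumptions.

```python
import numpy as np

def shape_activity(step_profile: np.ndarray, cap: float) -> np.ndarray:
    """Toy model of compiler-based power shaping: per-slice compute activity
    is capped at `cap`, and the deferred work is carried into later slices,
    filling the otherwise-idle window around the sync flag. Total work per
    step is conserved; if the cap is too low, extra slices are appended,
    i.e., the training step gets slightly longer."""
    shaped, backlog = [], 0.0
    for activity in step_profile:
        backlog += activity
        issued = min(cap, backlog)
        shaped.append(issued)
        backlog -= issued
    while backlog > 1e-9:  # leftover work shows up as step-time overhead
        issued = min(cap, backlog)
        shaped.append(issued)
        backlog -= issued
    return np.array(shaped)

# Hypothetical training step, in arbitrary time slices: a compute-intensive
# phase near full activity, then a communication phase where compute blocks
# mostly wait on a sync flag. The cap below is chosen so the deferred work
# just fits within the original step; real workloads see smaller reductions
# because some activity (e.g., the communication itself) cannot be deferred.
baseline = np.array([1.0] * 70 + [0.1] * 30)
shaped = shape_activity(baseline, cap=0.73)

print(f"baseline swing:     {baseline.max() - baseline.min():.2f}")
print(f"shaped swing:       {shaped.max() - shaped.min():.2f}")
print(f"step-time overhead: {(len(shaped) / len(baseline) - 1) * 100:.1f}%")
```

The key tradeoff the cap exposes is the same one we tune in practice: a lower cap flattens the power profile more aggressively but risks lengthening the training step, while a higher cap leaves more residual fluctuation.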
We’ve now implemented this compiler-based approach to shaping the power profile and applied it to realistic workloads. We measured the system’s total power consumption and a single chip’s hotspot temperature with and without the mitigation, as plotted in Fig. 2 and Fig. 3, respectively. In the test case, the magnitude of power fluctuations dropped by nearly 50% from the baseline case to the mitigation case. The magnitude of temperature fluctuations also dropped, from ~20°C in the baseline case to ~10°C in the mitigation case. We measured the cost of the mitigation as the increase in average power consumption and in the length of the training step. With proper tuning of the mitigation parameters, we can achieve these benefits with a small increase in average power and less than 1% performance impact.
Fig. 2. Power fluctuation with and without the compiler-based mitigation
Fig. 3. Chip temperature fluctuation with and without the compiler-based mitigation
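For readers who want to run this kind of evaluation on their own traces, the sketch below shows one straightforward way to quantify the benefit (the largest peak-to-trough swing within a time window) and the two costs discussed above (the increases in average power and in training-step time). The function names and the sample numbers are hypothetical, not our measured data.

```python
import numpy as np

def fluctuation_magnitude(power_mw: np.ndarray, window: int) -> float:
    """Largest peak-to-trough swing within any `window` consecutive samples."""
    return max(power_mw[i:i + window].max() - power_mw[i:i + window].min()
               for i in range(len(power_mw) - window + 1))

def mitigation_cost(baseline_mw: np.ndarray, mitigated_mw: np.ndarray,
                    baseline_step_s: float, mitigated_step_s: float) -> tuple[float, float]:
    """The two mitigation costs, as percentages: the increase in average
    power and the increase in training-step time."""
    avg_power_pct = (mitigated_mw.mean() / baseline_mw.mean() - 1.0) * 100.0
    step_time_pct = (mitigated_step_s / baseline_step_s - 1.0) * 100.0
    return avg_power_pct, step_time_pct

# Hypothetical sample traces (MW) and step times; not the measured data above.
baseline = np.array([9.8, 4.1, 9.7, 4.0, 9.9, 4.2])
mitigated = np.array([8.3, 5.7, 8.2, 5.6, 8.4, 5.8])
print(fluctuation_magnitude(baseline, window=2))   # ~5.9 MW swing
print(fluctuation_magnitude(mitigated, window=2))  # ~2.8 MW swing
print(mitigation_cost(baseline, mitigated, baseline_step_s=2.00, mitigated_step_s=2.01))
```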
A call to action
ML infrastructure is growing rapidly and is expected to surpass traditional server infrastructure in total power demand in the coming years. At the same time, ML infrastructure’s power and temperature fluctuations are unique and tightly coupled with the ML workload’s characteristics. Mitigating these fluctuations is just one example of the many innovations we need to ensure reliable and high-performance infrastructure. In addition to the method described above, we’ve been investing in an array of innovative techniques to take on ever-increasing power and thermal challenges, including data center water cooling, vertical power delivery, power-aware workload allocation, and many more.
But these challenges aren’t unique to Google. Power and temperature fluctuations in ML infrastructure are becoming a common issue for many hyperscalers and cloud providers as well as infrastructure providers. We need partners at all levels of the system to help:
Utility providers to set forth a standardized definition of acceptable power quality metrics, especially in scenarios where multiple data centers with large power fluctuations coexist within the same grid and interact with one another
Power and cooling equipment suppliers to offer quality and reliability enhancements for electronic components, particularly for use conditions with large and frequent power and thermal fluctuations
Hardware suppliers and data center designers to create a standardized suite of solutions such as rack-level capacitor banks (RLCB) or on-chip features, to help establish an efficient supplier base and ecosystem
ML model developers to consider the energy-consumption characteristics of their models, and to add low-level software mitigations that help address energy fluctuations
Google has been leading and advocating for industry-wide collaboration on these issues through forums such as the Open Compute Project (OCP) to benefit the data center infrastructure industry as a whole. We look forward to continuing to share our learnings and collaborating on innovative new solutions together.
A special thanks to Denis Vnukov, Victor Cai, Jianqiao Liu, Ibrahim Ahmed, Venkata Chivukula, Jianing Fan, Gaurav Gandhi, Vivek Sharma, Keith Kleiner, Mudasir Ahmad, Binz Roy, Krishnanjan Gubba Ravikumar, Ashish Upreti and Chee Chung from Google Cloud for their contributions.
AI Summary and Description: Yes
**Summary:** The text discusses the challenges of managing power delivery for large-scale machine learning (ML) workloads in data center infrastructure. It highlights Google’s innovative approach to mitigate power fluctuations through proactive power shaping and industry collaborations, aiming for reliability and efficiency in ML infrastructure.
**Detailed Description:** The document delves into the recent surge in machine learning applications that are driving significant demand for power in data centers. Below are the key points covered in the text:
– **Evolving Power Demands:**
– Unlike traditional data center workloads, modern ML workloads require synchronized computation across tens of thousands of accelerator chips, leading to power consumption patterns with much steeper and more frequent swings.
– Large-scale synchronized ML workloads can push power utilization close to the rated capacity of the data center’s infrastructure, making power oversubscription more difficult.
– **Challenges in Infrastructure Reliability:**
– Rapid power fluctuations can induce functionality and reliability issues in data center equipment, potentially resulting in outages and increased operational costs.
– Major risks include damage to upstream utility infrastructure, unintended triggering of Uninterruptible Power Supply (UPS) systems, and increased wear on hardware due to thermal and power fluctuations.
– **Innovation in Power Management:**
– Google proposes a full-stack approach for power management, advocating for collaborative design from the chip level to the data center, involving both hardware and software considerations.
– By instrumenting the TPU compiler to dynamically balance compute activity around synchronization points, Google smooths out power demand while the workload runs.
– **Results of Implemented Mitigations:**
– Demonstrations indicated a nearly 50% reduction in power fluctuations and a significant drop in temperature variance, showcasing the efficacy of the proactive power shaping strategies.
– The post emphasizes that these adjustments can maintain performance with only minimal increases in average power consumption.
– **Call to Action for Industry Collaboration:**
– Google calls on various stakeholders in the infrastructure ecosystem, including utility providers, equipment suppliers, hardware designers, and ML developers, to standardize definitions of power quality metrics, enhance component reliability, and consider energy impacts in model development.
– Emphasizes the importance of collective efforts and sharing knowledge across the industry through initiatives like the Open Compute Project to drive lasting improvements in data center performance and reliability.
This text is significant for professionals in AI and infrastructure security as it outlines new challenges and solutions pertaining to power management in ML setups, emphasizing the need for robust infrastructure and collaborative approaches in maintaining system reliability and efficiency.