Source URL: https://cloud.google.com/blog/products/compute/trillium-mlperf-41-training-benchmarks/
Source: Cloud Blog
Title: Unlocking LLM training efficiency with Trillium — a performance analysis
Feedly Summary: Rapidly evolving generative AI models place unprecedented demands on the performance and efficiency of hardware accelerators. Last month, we launched our sixth-generation Tensor Processing Unit (TPU), Trillium, to address the demands of next-generation models. Trillium is purpose-built for performance at scale, from the chip to the system to our Google data center deployments, to power ultra-large scale training.

Today, we present our first MLPerf training benchmark results for Trillium. The MLPerf 4.1 training benchmarks show that Trillium delivers up to 1.8x better performance-per-dollar compared to prior-generation Cloud TPU v5p and an impressive 99% scaling efficiency (throughput).

In this blog, we offer a concise analysis of Trillium's performance, demonstrating why it stands out as the most performant and cost-efficient TPU training system to date. We begin with a quick overview of system comparison metrics, starting with traditional scaling efficiency. We then introduce convergence scaling efficiency as a crucial metric to consider in addition to scaling efficiency. We assess these two metrics along with performance per dollar, present a comparative view of Trillium against Cloud TPU v5p, and conclude with guidance you can use to make an informed choice for your cloud accelerators.

**Traditional performance metrics**

Accelerator systems can be evaluated and compared across multiple dimensions, ranging from peak throughput, to effective throughput, to throughput scaling efficiency. Each of these metrics is a helpful indicator, but none takes convergence time into consideration.

**Hardware specifications and peak performance**

Traditionally, comparisons have focused on hardware specifications like peak throughput, memory bandwidth, and network connectivity. While these peak values establish theoretical boundaries, they are poor predictors of real-world performance, which depends heavily on architectural design and software implementation. Since modern ML workloads typically span hundreds or thousands of accelerators, the key metric is the effective throughput of an appropriately sized system for a specific workload.

**Utilization performance**

System performance can be quantified through utilization metrics like effective model FLOPS utilization (EMFU) and memory bandwidth utilization (MBU), which measure achieved throughput versus peak capacity. However, these hardware efficiency metrics don't directly translate to business-value measures like training time or model quality.
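To make EMFU and MBU concrete, here is a minimal sketch of the two ratios in Python. The peak and achieved numbers are illustrative placeholders, not Trillium or Cloud TPU v5p specifications:

```python
# Minimal sketch: computing utilization metrics from measured throughput.
# All hardware numbers below are illustrative placeholders, not actual
# Trillium or Cloud TPU v5p specifications.

def emfu(achieved_model_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Effective model FLOPS utilization: achieved model FLOPS / peak FLOPS."""
    return achieved_model_flops_per_s / peak_flops_per_s

def mbu(achieved_bytes_per_s: float, peak_bytes_per_s: float) -> float:
    """Memory bandwidth utilization: achieved bandwidth / peak bandwidth."""
    return achieved_bytes_per_s / peak_bytes_per_s

# Hypothetical single-chip measurements.
PEAK_FLOPS = 900e12       # placeholder peak compute, FLOPS
PEAK_BW = 1.6e12          # placeholder peak HBM bandwidth, bytes/s

achieved_flops = 450e12   # measured model FLOPS during training
achieved_bw = 1.1e12      # measured memory traffic, bytes/s

print(f"EMFU: {emfu(achieved_flops, PEAK_FLOPS):.1%}")  # EMFU: 50.0%
print(f"MBU:  {mbu(achieved_bw, PEAK_BW):.1%}")         # MBU:  68.8%
```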
**Scaling efficiency and trade-offs**

A system's scalability is evaluated through both strong scaling (performance improvement with system size for a fixed workload) and weak scaling (efficiency when increasing both workload and system size proportionally). While both metrics are valuable indicators, the ultimate goal is to achieve high-quality models quickly, sometimes making it worthwhile to trade scaling efficiency for faster training time or better model convergence.

**The need for convergence scaling efficiency**

While hardware utilization and scaling metrics provide important system insights, convergence scaling efficiency focuses on the fundamental goal of training: reaching model convergence efficiently. Convergence refers to the point where a model's output stops improving and the error rate becomes constant. Convergence scaling efficiency measures how effectively additional computing resources accelerate the training process to completion.

We define convergence scaling efficiency using two key measurements: the base case, where a cluster of N₀ accelerators achieves convergence in time T₀, and a scaled case with N₁ accelerators taking time T₁ to converge. The ratio of the speedup in convergence time to the increase in cluster size gives us:

Convergence scaling efficiency = (T₀ / T₁) / (N₁ / N₀)
A convergence scaling efficiency of 1 indicates that time-to-solution improves by the same ratio as the cluster size, so it is desirable to have convergence scaling efficiency as close to 1 as possible.

Now let's apply these concepts to our MLPerf submission for the GPT3-175b training task using Trillium and Cloud TPU v5p.

**Trillium performance**

We submitted GPT3-175b training results for four different Trillium configurations and three different Cloud TPU v5p configurations. In the following analysis, we group results by cluster sizes with the same total peak FLOPS for comparison purposes. For example, the Cloud TPU v5p-4096 configuration is compared to 4x Trillium-256, Cloud TPU v5p-8192 to 8x Trillium-256, and so on. All results presented in this analysis are based on MaxText, our high-performance reference implementation for Cloud TPUs and GPUs.

**Weak scaling efficiency**

For increasing cluster sizes with proportionately larger batch sizes, both Trillium and TPU v5p deliver near-linear scaling efficiency:
Figure-1: Weak scaling comparison for Trillium and Cloud TPU v5p. v5p-4096 and 4x Trillium-256 serve as the base for the scaling-factor measurement. "n x Trillium-256" denotes n Trillium pods of 256 chips each, with each pod forming one ICI domain; "v5p-n" denotes n/2 v5p chips in a single ICI domain.
Figure 1 demonstrates relative throughput scaling as cluster size increases from the base configuration. Trillium achieves 99% scaling efficiency even when operating across data-center networks using Cloud TPU multislice technology, outperforming the 94% scaling efficiency of the Cloud TPU v5p cluster within a single ICI domain. For these comparisons, we used a base configuration of 1024 chips (4x Trillium-256 pods), establishing a consistent baseline with the smallest v5p submission (v5p-4096; 2048 chips). When measured against our smallest submitted configuration of 2x Trillium-256 pods, Trillium maintains a strong 97.6% scaling efficiency.
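The scaling-efficiency figures above reduce to a simple ratio: relative throughput gain divided by relative cluster growth. A minimal sketch, using made-up throughput numbers rather than the actual submission data:

```python
# Minimal sketch: weak scaling efficiency as
# (throughput ratio) / (cluster-size ratio).
# The throughput values are illustrative, not the actual MLPerf submission data.

def weak_scaling_efficiency(base_chips: int, base_throughput: float,
                            scaled_chips: int, scaled_throughput: float) -> float:
    """Fraction of ideal linear speedup retained when growing the cluster."""
    return (scaled_throughput / base_throughput) / (scaled_chips / base_chips)

# Hypothetical example: tripling the cluster yields 2.97x the throughput.
base_chips, base_tput = 1024, 100.0      # arbitrary throughput units
scaled_chips, scaled_tput = 3072, 297.0

eff = weak_scaling_efficiency(base_chips, base_tput, scaled_chips, scaled_tput)
print(f"Weak scaling efficiency: {eff:.1%}")  # Weak scaling efficiency: 99.0%
```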
**Convergence scaling efficiency**

As stated above, weak scaling is a useful but not sufficient indicator of value; convergence scaling efficiency brings time-to-solution into consideration.

Figure-2: Convergence scaling comparison for Trillium and Cloud TPU v5p.
For the largest cluster size, we observed comparable convergence scaling efficiency for Trillium and Cloud TPU v5p. In this example, a CSE of 0.8 means that for the rightmost configuration, the cluster size was 3x the base configuration while the time to convergence improved by 2.4x with respect to the base configuration (2.4 / 3 = 0.8). While convergence scaling efficiency is comparable between Trillium and TPU v5p, where Trillium really shines is in delivering that convergence at a lower cost, which brings us to the last metric.
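As a quick check of the arithmetic in that example, the CSE definition from earlier can be evaluated directly; the sketch below uses only the ratios stated in the text:

```python
# Convergence scaling efficiency: (T0 / T1) / (N1 / N0), i.e. the speedup in
# time-to-convergence divided by the growth in cluster size.

def convergence_scaling_efficiency(n0: float, t0: float,
                                   n1: float, t1: float) -> float:
    return (t0 / t1) / (n1 / n0)

# The example from the text: 3x the base cluster size, 2.4x faster convergence.
# Absolute units are arbitrary; only the ratios matter.
cse = convergence_scaling_efficiency(n0=1.0, t0=2.4, n1=3.0, t1=1.0)
print(f"CSE: {cse:.2f}")  # CSE: 0.80
```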
**Cost-to-train**

While weak scaling efficiency and convergence scaling efficiency indicate the scaling properties of a system, we have yet to look at the most crucial metric: the cost to train.

Figure-3: Comparison of cost-to-train based on the wall-clock time and the on-demand list price for Cloud TPU v5p and Trillium.

Trillium lowers the cost to train by up to 1.8x (45% lower) compared to TPU v5p while delivering convergence to the same validation accuracy.
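Cost-to-train is simply wall-clock time to convergence multiplied by the cluster's hourly price. The sketch below uses hypothetical chip counts, prices, and times, not actual list prices or measured results; it only illustrates how the metric rewards faster convergence at a lower hourly cost:

```python
# Minimal sketch: cost-to-train = hours to converge x chips x price per chip-hour.
# Prices, chip counts, and times below are hypothetical placeholders, not actual
# list prices or measured convergence times.

def cost_to_train(hours_to_converge: float, num_chips: int,
                  price_per_chip_hour: float) -> float:
    return hours_to_converge * num_chips * price_per_chip_hour

# Two hypothetical systems converging to the same validation accuracy.
cost_a = cost_to_train(hours_to_converge=10.0, num_chips=2048,
                       price_per_chip_hour=4.0)
cost_b = cost_to_train(hours_to_converge=11.0, num_chips=1024,
                       price_per_chip_hour=4.0)

print(f"System A: ${cost_a:,.0f}")             # System A: $81,920
print(f"System B: ${cost_b:,.0f}")             # System B: $45,056
print(f"B is {cost_a / cost_b:.2f}x cheaper")  # B is 1.82x cheaper
```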
**Making informed cloud accelerator choices**

In this article, we explored the complexities of comparing accelerator systems, emphasizing the importance of looking beyond simple metrics to assess true performance and efficiency. We saw that while peak performance metrics provide a starting point, they often fall short in predicting real-world utility. Instead, metrics like effective model FLOPS utilization (EMFU) and memory bandwidth utilization (MBU) offer more meaningful insights into an accelerator's capabilities.

We also highlighted the critical importance of scaling characteristics, both strong and weak, in evaluating how systems perform as workloads and resources grow. However, the most objective measure we identified is convergence scaling efficiency, which ensures that we compare systems on their ability to achieve the same end result, rather than just raw speed.

Applying these metrics to our benchmark submission for GPT3-175b training, we demonstrated that Trillium achieves convergence scaling efficiency comparable to Cloud TPU v5p while delivering up to 1.8x better performance per dollar, thereby lowering the cost-to-train. These results highlight the importance of evaluating accelerator systems through multiple dimensions of performance and efficiency.

For ML-accelerator evaluation, we recommend a comprehensive analysis combining resource utilization metrics (EMFU, MBU), scaling characteristics, and convergence scaling efficiency. This multi-faceted approach enables you to make data-driven decisions based on your specific workload requirements and scale. To learn more about Trillium, please review the launch blog or our documentation.
AI Summary and Description: Yes
**Summary:** The text provides an in-depth analysis of Google’s sixth-generation Tensor Processing Unit (TPU), named Trillium, focusing on its performance in handling the demands of large-scale generative AI models. It outlines benchmarks that showcase Trillium’s efficiency and cost-effectiveness compared to previous TPU versions. The discussion emphasizes various performance metrics, including traditional scaling efficiency and the newly introduced convergence scaling efficiency, both of which are crucial for AI practitioners in evaluating hardware capabilities for machine learning workloads.
**Detailed Description:**
– **Purpose of Trillium:**
– Designed to meet the performance and efficiency demands of rapidly advancing generative AI models.
– Aims for optimized performance at scale across Google’s infrastructure.
– **Performance Benchmarks:**
– Submitted first MLPerf (4.1) training benchmarks for Trillium, revealing up to 1.8x better performance-per-dollar than its predecessor (Cloud TPU v5p) and 99% scaling efficiency.
– A thorough analysis of metrics such as effective throughput, memory bandwidth, and effective model FLOPS utilization (EMFU).
– **Key Metrics Introduced:**
– **Convergence Scaling Efficiency (CSE):**
– A new metric to evaluate how efficiently additional resources improve the model training process.
– Defined as the ratio of the speedup in convergence time to the increase in cluster size; a CSE as close to 1 as possible is ideal.
– **Comparison Against Previous Versions:**
– Provided comparative analysis on the scaling efficiency and cost-effectiveness of Trillium versus the Cloud TPU v5p.
– Identified both strong scaling (performance improvement with system size for a fixed workload) and weak scaling (efficiency when workload and system size increase proportionally).
– **Cost-to-Train Analysis:**
– Emphasizes that while scaling efficiency is vital, the cost-to-train with respect to convergence accuracy is crucial for decision-making.
– Trillium demonstrated up to 1.8x lower cost-to-train (45% lower) compared to TPU v5p, offering better value while converging to the same validation accuracy.
– **Recommendations for Evaluating ML Accelerators:**
– Evaluating accelerator systems requires a multi-faceted approach, combining:
– Resource utilization metrics (EMFU, MBU)
– Scaling characteristics (weak and strong scaling)
– Convergence scaling efficiency
– Encouragement for professionals to utilize comprehensive analysis for making informed choices based on workload requirements.
The insights encapsulated in this evaluation are particularly salient for security and compliance professionals in AI, highlighting the intrinsic connection between hardware performance, efficient model training, and the overall sustainability of cloud infrastructure used for machine learning tasks.