Source URL: https://cloud.google.com/blog/products/application-development/pytorch-xla-2-6-helps-improve-ai-model-performance/
Source: Cloud Blog
Title: Improving model performance with PyTorch/XLA 2.6
Feedly Summary: For developers who want to use the PyTorch deep learning framework with Cloud TPUs, the PyTorch/XLA Python package is key, offering developers a way to run their PyTorch models on Cloud TPUs with only a few minor code changes. It does so by leveraging OpenXLA, developed by Google, which gives developers the ability to define their model once and run it on many different types of machine learning accelerators (i.e., GPUs, TPUs, etc.).
The latest release of PyTorch/XLA includes several features that improve its performance for developers:
A new experimental scan operator to speed up compilation for repetitive blocks of code (e.g., for loops)
Host offloading to move TPU tensors to the host CPU’s memory to fit larger models on fewer TPUs
Improved goodput for tracing-bound models through a new base Docker image compiled with the C++ 2011 Standard application binary interface (C++11 ABI) flags
In addition to these improvements, we've also reorganized the documentation to make it easier to find what you're looking for!
Let’s take a look at each of these features in greater depth.
Experimental scan operator
Have you ever experienced long compilation times with PyTorch/XLA, for example when working with large language models that have many decoder layers? During graph tracing, where we traverse the graph of all the operations being performed by the model, the loops over those repeated layers are completely "unrolled" (i.e., each loop iteration is copied and pasted for every cycle), resulting in large computation graphs. These larger graphs lead directly to longer compilation times. There's now a solution: the experimental scan function, inspired by jax.lax.scan.
The scan operator works by changing how loops are handled during compilation. Instead of compiling each iteration of the loop independently, which creates redundant blocks, scan compiles only the first iteration. The resulting compiled high-level operation (HLO) is then reused for all subsequent iterations, so far less HLO, or intermediate code, is generated for each additional loop iteration. Compared to a for loop, scan compiles in a fraction of the time since it only compiles the first iteration. This improves developer iteration time when working on models with many homogeneous layers, such as LLMs.
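To make the mechanics concrete, here's a minimal sketch of the lower-level scan operator itself, which follows the jax.lax.scan pattern of carrying state through a loop while consuming one slice of the input per iteration. The module path and signature reflect the 2.6 experimental API as we understand it, and the step function here is a toy example, so treat the details as an assumption and check the PyTorch/XLA documentation for your version.

```python
import torch
import torch_xla
# Assumed location of the experimental scan operator in the 2.6 release.
from torch_xla.experimental.scan import scan

def step(carry, x):
    # One loop iteration: update the running carry and emit one output slice.
    new_carry = carry + x
    y = new_carry * 2
    return new_carry, y

with torch_xla.device():
    init = torch.zeros(8)       # carry going into the first iteration
    xs = torch.randn(16, 8)     # 16 iterations, one row consumed per step

# `step` is traced and compiled once; the resulting HLO is reused for all
# 16 iterations instead of unrolling the loop into 16 copies of the graph.
final_carry, ys = scan(step, init, xs)
```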
Building on top of torch_xla.experimental.scan, the torch_xla.experimental.scan_layers function offers a simplified interface for looping over sequences of nn.Modules. Think of it as a way to tell PyTorch/XLA "These modules are all the same, just compile them once and reuse them!" For example:
```python
import torch
import torch.nn as nn
import torch_xla
from torch_xla.experimental.scan_layers import scan_layers

class DecoderLayer(nn.Module):
    def __init__(self, size):
        super().__init__()
        self.linear = nn.Linear(size, size)

    def forward(self, x):
        return self.linear(x)

with torch_xla.device():
    layers = [DecoderLayer(1024) for _ in range(64)]
    x = torch.randn(1, 1024)

# Instead of a for loop, we can scan_layers once:
# for layer in layers:
#     x = layer(x)
x = scan_layers(layers, x)
```
One thing to note is that custom Pallas kernels do not yet support scan. Here is a complete example of using scan_layers in an LLM for reference.
Host offloading
Another powerful tool for memory optimization in PyTorch/XLA is host offloading. This technique allows you to temporarily move tensors from the TPU to the host CPU’s memory, freeing up valuable device memory during training. This is especially helpful for large models where memory pressure is a concern. You can use torch_xla.experimental.stablehlo_custom_call.place_to_host to offload a tensor and torch_xla.experimental.stablehlo_custom_call.place_to_device to retrieve it later. A typical use case involves offloading intermediate activations during the forward pass and then bringing them back during the backward pass. Here’s an example of host offloading for reference.
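To illustrate how the pieces fit together, here's a minimal sketch of offloading an intermediate activation to host memory and bringing it back later. Only place_to_host and place_to_device come from the API described above; the model, tensor names, and the exact point at which you move data are illustrative assumptions, so adapt them to your training loop (and see the linked example for a complete version).

```python
import torch
import torch.nn as nn
import torch_xla
from torch_xla.experimental.stablehlo_custom_call import (
    place_to_host,
    place_to_device,
)

with torch_xla.device():
    encoder = nn.Linear(1024, 1024)
    decoder = nn.Linear(1024, 1024)
    x = torch.randn(32, 1024)

    # Forward pass: compute an intermediate activation, then park it in host
    # memory so it isn't holding TPU HBM while other work runs.
    activation = encoder(x)
    activation_host = place_to_host(activation)

    # ... other memory-hungry computation would go here ...

    # Move the activation back to the TPU right before it is needed again,
    # e.g., for the backward pass.
    activation_dev = place_to_device(activation_host)
    out = decoder(activation_dev)
```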
Strategic use of host offloading, such as when you’re working with limited memory and are unable to use the accelerator continuously, may significantly improve your ability to train large and complex models within the memory constraints of your hardware.
Alternative base Docker image
Have you ever encountered a situation where your TPUs are sitting idle while your host CPU is heavily loaded tracing your model execution graph for just-in-time compilation? This suggests your model is "tracing bound," meaning performance is limited by the speed of tracing operations.
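One way to confirm that you're tracing bound is to capture a profile and compare how long the host spends tracing each step against how busy the TPU is. Below is a rough sketch using PyTorch/XLA's profiler; the port, log directory, and annotation names are placeholders, and the exact profiler API may differ between releases, so treat this as a starting point rather than a recipe.

```python
import torch_xla.debug.profiler as xp

# Start a profiler server inside the training process (the port is arbitrary).
server = xp.start_server(9012)

def train_step(model, batch):
    # Annotate the step so host-side tracing time is visible in the trace viewer.
    with xp.StepTrace('train_step'):
        with xp.Trace('forward'):
            loss = model(batch).mean()
        with xp.Trace('backward'):
            loss.backward()
    return loss

# While training runs, capture a trace from another process, for example:
#   xp.trace('localhost:9012', '/tmp/xla_profile', duration_ms=30000)
# Long host-side tracing spans with an idle TPU suggest a tracing-bound model.
```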
The C++11 ABI builds offer a solution. Starting with this release, PyTorch/XLA provides both Python wheels and Docker images in two C++ ABI flavors, giving you a choice of which ABI to use with PyTorch/XLA: the pre-C++11 ABI, which remains the default to match PyTorch upstream, and the more modern C++11 ABI.
Switching to the C++11 ABI wheels or Docker images can lead to noticeable improvements in the above-mentioned scenarios. For example, we observed a 20% relative improvement in goodput with the Mixtral 8x7B model on a v5p-256 Cloud TPU (with a global batch size of 1024) when we switched from the pre-C++11 ABI to the C++11 ABI! ML goodput measures how efficiently a given model utilizes the hardware, so a higher goodput for the same model on the same hardware indicates better performance.
An example of using a C++11 ABI Docker image in your Dockerfile might look something like this:
```dockerfile
# Use the C++11 ABI PyTorch/XLA image as the base
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.6.0_3.10_tpuvm_cxx11

# Install any additional dependencies here
# RUN pip install my-other-package

# Copy your code into the container
COPY . /app
WORKDIR /app

# Run your training script
CMD ["python", "train.py"]
```
Alternatively, if you are not using Docker images, because you’re testing locally for instance, you can install the C++11 ABI wheels for version 2.6 using the following command (Python 3.10 example):
```sh
pip install torch==2.6.0+cpu.cxx11.abi \
  https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0%2Bcxx11-cp310-cp310-manylinux_2_28_x86_64.whl \
  'torch_xla[tpu]' \
  -f https://storage.googleapis.com/libtpu-releases/index.html \
  -f https://storage.googleapis.com/libtpu-wheels/index.html \
  -f https://download.pytorch.org/whl/torch
```
The above command works for Python 3.10. We have instructions for other versions within our documentation.
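Once installed, one quick sanity check is to confirm that the PyTorch build in your environment really is the C++11 ABI flavor. PyTorch's torch.compiled_with_cxx11_abi() helper reports which ABI the torch wheel was built with; the snippet below assumes you simply want to verify that your environment picked up the build you intended.

```python
import torch
import torch_xla

# True means the installed torch build uses the C++11 ABI;
# False means the pre-C++11 ABI (the default).
print("C++11 ABI:", torch.compiled_with_cxx11_abi())
print("torch:", torch.__version__, "| torch_xla:", torch_xla.__version__)
```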
The flexibility to choose between C++ ABIs lets you pick the optimal build for your specific workload and hardware, ultimately leading to better performance and efficiency in your PyTorch/XLA projects!
So, what are you waiting for? Go try out the latest version of PyTorch/XLA! For additional information, check out the latest release notes.
A note on GPU support
We aren’t offering a PyTorch/XLA:GPU wheel in the PyTorch/XLA 2.6 release. We understand this is important and plan to reinstate GPU support by the 2.7 release. PyTorch/XLA remains an open-source project and we welcome contributions from the community to help maintain and improve the project. To contribute, please start with the contributors guide.
The latest stable version where a PyTorch/XLA:GPU wheel is available is torch_xla 2.5.
AI Summary and Description: Yes
Summary: The text discusses significant improvements to the PyTorch/XLA package, particularly for developers working with TPUs in the cloud. It highlights new features such as the experimental scan operator, host offloading, and a new base Docker image utilizing the C++11 ABI. These enhancements are aimed at optimizing performance and reducing compilation times, which is essential for practitioners in AI and cloud computing.
Detailed Description: The text provides a detailed overview of the latest enhancements in the PyTorch/XLA Python package, which is essential for developers who want to leverage Cloud TPUs effectively. The focus is on improving the developer experience through faster compilation times and better memory management when training large models.
Key Points include:
– **Experimental Scan Operator**:
– Reduces long compilation times by only compiling the first iteration of loops during graph tracing.
– Allows for the reuse of the compiled operation for subsequent iterations, making it particularly beneficial for models with many repeated layers (e.g., large language models).
– **Host Offloading**:
– Enables temporary movement of TPU tensors to the host CPU’s memory, facilitating the management of large models within limited TPU memory constraints.
– Optimizes memory usage by offloading intermediate activations during the training process, thus enhancing the training of extensive models.
– **Improved Base Docker Image with C++11 ABI**:
– Offers developers a new Docker image compiled with C++11 ABI, which promises better performance for tracing-bound models.
– Demonstrated a 20% relative improvement in goodput for the Mixtral 8x7B model when switching to the C++11 ABI, indicating more efficient use of TPU resources.
– **Python Wheels and Compatibility**:
– The new Docker images and Python wheels provide flexible options for optimizing specific workloads and hardware compatibility.
– Current limitations include the absence of GPU wheels in version 2.6, with plans to restore this in the upcoming release.
Overall, these advancements in the PyTorch/XLA package cater to AI developers and researchers by enhancing the usability and efficiency of machine learning operations within cloud environments, thereby addressing performance and scalability challenges that are critical in the rapidly evolving field of AI.