Cloud Blog: High performance storage innovations for your AI workloads

Source URL: https://cloud.google.com/blog/products/storage-data-transfer/high-performance-storage-innovations-for-ai-hpc/
Source: Cloud Blog
Title: High performance storage innovations for your AI workloads

Feedly Summary: The high-performance storage stack in AI Hypercomputer incorporates learnings from geographic regions, zones, and GPU/TPU architectures, to create an agile, economical, integrated storage architecture. Recently, we’ve made several innovations to improve accelerator utilization with high-performance storage, helping you to optimize costs, resources, and accelerate your AI workloads:

Rapid Storage: A new Cloud Storage zonal bucket that provides industry-leading <1ms random read and write latency, 20x faster data access, 6 TB/s of throughput, and 5x lower latency for random reads and writes compared to other leading hyperscalers.

Anywhere Cache: A new, strongly consistent cache that works with existing regional buckets to cache data within a selected zone. Anywhere Cache reduces latency by up to 70% and delivers up to 2.5 TB/s of throughput, accelerating AI workloads and maximizing goodput by keeping data close to your GPUs or TPUs.

Google Cloud Managed Lustre: A new high-performance, fully managed parallel file system built on the DDN EXAScaler Lustre file system. This zonal storage solution provides PB scale at under 1ms latency, millions of IOPS, and TB/s of throughput for AI workloads.

Storage Intelligence: The industry's first offering for generating storage insights specific to your environment by querying object metadata at scale and using the power of LLMs. Storage Intelligence not only provides insights into vast data estates, it also provides the ability to take actions, e.g., using 'bucket relocation' to non-disruptively co-locate data with accelerators.

Rapid Storage enables AI workloads with millisecond-latency
To train, checkpoint, and serve AI models at peak efficiency, you need to keep your GPUs or TPUs saturated with data to minimize wasted compute (as measured by goodput). But traditional object storage suffers from a critical limitation: latency. Built on Google's Colossus cluster-level file system, Rapid Storage takes a new approach, co-locating storage and AI accelerators in a new zonal bucket. By sitting directly on Colossus, Rapid Storage avoids the typical latency of regional storage, where accelerators reside in one zone and data resides in another.
Unlike regional Cloud Storage buckets, a Rapid Storage zonal bucket concentrates data within the same zone that your GPUs and TPUs run in, helping to achieve sub-millisecond read/write latencies and high throughput. In fact, Rapid Storage delivers 5x lower latency for random reads and writes compared to other leading hyperscalers. Combined with throughput of up to 6 TB/s per bucket and up to 20 million queries per second (QPS), you can now use Rapid Storage to train AI models with new levels of performance. 
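To make the latency claim concrete, here is a minimal sketch that times a small random read through the standard google-cloud-storage Python client. The bucket and object names are hypothetical, and Rapid Storage's headline numbers assume the bucket and the VM share a zone; the exact client path for zonal buckets may differ, so treat this as illustrative rather than a benchmark.

```python
# A minimal latency probe, assuming "my-rapid-bucket" (hypothetical) is a
# Rapid Storage zonal bucket in the same zone as this VM, and that it is
# readable through the standard google-cloud-storage client.
import time
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-rapid-bucket")
blob = bucket.blob("shards/train-00042.tfrecord")  # hypothetical object

# Read a 1 MiB range and measure the wall-clock latency of the request.
start = time.perf_counter()
data = blob.download_as_bytes(start=0, end=(1 << 20) - 1)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"read {len(data)} bytes in {elapsed_ms:.2f} ms")
```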
And because performance shouldn't come at the cost of complexity, you can mount a Rapid Storage bucket as a file system using Cloud Storage FUSE. This lets common AI frameworks such as TensorFlow and PyTorch access object storage without having to modify any code, as the sketch below shows.
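Here is a minimal sketch of that pattern: a PyTorch dataset that reads training shards through a Cloud Storage FUSE mount with ordinary file I/O. The bucket name, mount point, and shard layout are all hypothetical.

```python
# Sketch: reading training shards through a Cloud Storage FUSE mount.
# Assumes the bucket was mounted beforehand, e.g.:
#   gcsfuse my-rapid-bucket /mnt/gcs
# ("my-rapid-bucket" and "/mnt/gcs" are hypothetical names.)
import os
import torch
from torch.utils.data import Dataset, DataLoader

class MountedShardDataset(Dataset):
    """Loads pre-serialized tensors from the FUSE-mounted bucket path."""

    def __init__(self, root: str):
        self.paths = sorted(
            os.path.join(root, name)
            for name in os.listdir(root)
            if name.endswith(".pt")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Plain file I/O: the FUSE layer translates it into object reads.
        return torch.load(self.paths[idx])

loader = DataLoader(MountedShardDataset("/mnt/gcs/shards"),
                    batch_size=8, num_workers=4)
for batch in loader:
    pass  # training step would go here
```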
Anywhere Cache puts data in your preferred zone 
Anywhere Cache is a strongly consistent zonal read cache that works with existing storage buckets (regional, multi-region, or dual-region) and intelligently caches data within your selected zone. As a result, Anywhere Cache reduces read-storage latency by up to 70%. By dynamically caching data in the desired zone, close to your GPUs or TPUs, it delivers throughput of up to 2.5 TB/s, keeping training times low across multiple epochs. Should conditions change, for example a shift in accelerator availability, Anywhere Cache ensures your data accompanies the AI accelerators. You can enable Anywhere Cache in other regions and zones with a single click, with no changes to your bucket or application. Moreover, it eliminates egress fees for cached data; among existing Anywhere Cache customers with multi-region buckets, 70% have seen cost benefits.
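Because the cache sits behind the existing bucket, application code does not change. The sketch below illustrates this; the bucket name, zone, and the exact admin command are assumptions, so check the current gcloud documentation for the precise syntax.

```python
# Sketch: once a cache is created for the bucket, reads need no code change.
# Creating the cache is an admin step; one plausible form (hedged, verify
# against current gcloud docs) is:
#   gcloud storage buckets anywhere-caches create gs://my-training-bucket \
#       us-central1-a
# ("my-training-bucket" and the zone are hypothetical.)
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-training-bucket")

# The same read path as before; cached objects are now served from the
# selected zone, and cached reads avoid multi-region egress fees.
payload = bucket.blob("shards/train-00042.tfrecord").download_as_bytes()
print(f"read {len(payload)} bytes through the zonal cache")
```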

Anthropic leverages Anywhere Cache to improve the resilience of their cloud workloads by co-locating data with TPUs in a single zone while providing dynamically scalable read throughput of up to 6 TB/s. They also use Storage Intelligence to gain deep insight into their 85+ billion objects, allowing them to optimize their storage infrastructure.

Google Cloud Managed Lustre accelerates HPC and AI workloads
AI workloads often access many small files with random I/O patterns, and need the sub-millisecond latency that a parallel file system provides. The new Google Cloud Managed Lustre is a fully managed parallel file system service that provides full POSIX support and persistent zonal storage that scales from terabytes to petabytes. As a persistent parallel file system, Managed Lustre lets you confidently store your training, checkpoint, and serving data, while delivering high throughput, sub-millisecond latency, and millions of IOPS across multiple jobs, all while maximizing goodput. With its full-duplex network utilization, Managed Lustre can fully saturate VMs at 20 GB/s and deliver up to 1 TB/s in aggregate throughput, while support for the Cloud Storage bulk import/export API makes it easy to move datasets to and from Cloud Storage. Managed Lustre is built in collaboration with DDN and based on EXAScaler.
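Full POSIX support means checkpointing needs no special client library; standard file operations work as-is. A minimal sketch, assuming the Lustre file system is already mounted at a hypothetical path:

```python
# Sketch: checkpointing to a Managed Lustre mount with ordinary POSIX I/O.
# "/mnt/lustre" is a hypothetical mount point.
import os
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096)  # stand-in for a real model
ckpt_dir = "/mnt/lustre/checkpoints"
os.makedirs(ckpt_dir, exist_ok=True)

# Write to a temp file, then atomically rename, so a crash mid-write never
# leaves a partial checkpoint behind.
tmp_path = os.path.join(ckpt_dir, "step_001000.pt.tmp")
final_path = os.path.join(ckpt_dir, "step_001000.pt")
torch.save(model.state_dict(), tmp_path)
os.replace(tmp_path, final_path)  # atomic rename on a POSIX file system
print(f"checkpoint written to {final_path}")
```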
Analyze and act on data with Storage Intelligence
Your AI models can only be as good as the data you train them on. Today, we announced Storage Intelligence, a new service that helps you find the right data for AI training by querying object metadata across all of your buckets, improving your AI cost-optimization efforts along the way. Storage Intelligence queries object metadata at scale using the power of LLMs, helping to generate storage insights specific to your environment. The first such service from a cloud hyperscaler, Storage Intelligence lets you analyze the metadata of millions, or even billions, of objects across the buckets and projects in your organization. With the insights from this analysis, you can make informed decisions: eliminating duplicate objects, identifying objects that can be deleted or tiered to a lower storage class through Object Lifecycle Management or Autoclass, or flagging objects that violate your company's security policies, to name a few.
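As a sketch of the kind of metadata analysis this enables, the query below scans an exported object-metadata table for duplicate candidates (same checksum and size). The project, dataset, and table names are hypothetical, and the actual schema of your metadata export may differ.

```python
# Sketch: finding duplicate-object candidates in exported object metadata.
# `my-project.storage_insights.object_metadata` is a hypothetical table.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT crc32c, size, COUNT(*) AS copies,
           (COUNT(*) - 1) * size AS wasted_bytes
    FROM `my-project.storage_insights.object_metadata`
    WHERE size > 0
    GROUP BY crc32c, size
    HAVING copies > 1
    ORDER BY wasted_bytes DESC
    LIMIT 20
"""
for row in client.query(query).result():
    print(f"{row.copies} copies of a {row.size}-byte object "
          f"(crc32c={row.crc32c})")
```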

Google Cloud Storage's Autoclass and Storage Intelligence features have helped Spotify understand and optimize its storage costs. In 2024, Spotify took advantage of these features to reduce its storage spend by 37%.

High performance storage for your AI workloads
We built Rapid Storage, Anywhere Cache, and Managed Lustre as high-performance storage solutions that deliver availability, high throughput, low latency, and durability. Storage Intelligence adds to that, providing valuable, actionable insights into your storage estate.
To learn more about these innovations, visit us at Next 25 and attend the breakout sessions, “What’s new with Google Cloud’s Storage” (BRK2-025) and “AI Hypercomputer: Mastering your Storage Infrastructure” (BRK2-020).

AI Summary and Description: Yes

**Summary**: The text outlines significant advancements in Google Cloud’s storage solutions specifically designed for high-performance AI workloads. It highlights innovations such as Rapid Storage, Anywhere Cache, and Google Cloud Managed Lustre, all aimed at enhancing data access speed and efficiency, while also leveraging LLMs for improved storage insights. This is especially relevant for professionals in AI, cloud, and infrastructure security, focusing on performance optimization and cost reduction.

**Detailed Description**:

The provided text describes multiple breakthroughs in storage technology specifically tailored for AI workloads within Google Cloud. The innovations are designed to minimize latency, increase throughput, and assist in optimizing costs, making them highly relevant for security and compliance professionals in the AI and cloud domains. Key components of the innovations include:

– **Rapid Storage**:
– A new zonal bucket that drastically reduces random read and write latencies to under 1 millisecond.
– Achieves throughput of up to 6 TB/s and 20x faster data access, with 5x lower latency for random reads and writes than other leading hyperscalers.
– It co-locates data and computing resources (GPUs/TPUs) in the same zone, thereby eliminating regional latency issues.

– **Anywhere Cache**:
– Provides a strongly consistent zonal read-cache that helps reduce read-storage latency by up to 70%.
– Supports dynamic caching which ensures data location stays optimal relative to AI accelerator availability.
– Eliminates egress fees for cached data, resulting in cost efficiency for organizations.

– **Google Cloud Managed Lustre**:
– A fully managed parallel file system service that scales seamlessly from terabytes to petabytes.
– It supports high performance with sub-millisecond latency and high throughput capabilities.
– Built in collaboration with DDN and based on EXAScaler, it integrates with Cloud Storage for easier dataset management.

– **Storage Intelligence**:
– The first service of its kind from a cloud hyperscaler, allowing users to generate insights from large volumes of object metadata using LLMs.
– Empowers organizations to make informed decisions on storage optimization, such as reducing redundant data and managing security compliance.
– Highlights notable use cases, such as Spotify’s 37% cost savings on storage through the use of these new features.

In conclusion, these storage improvements provide significant advantages for organizations utilizing AI technologies, enabling them to optimize performance, reduce costs, and streamline operations. The introduction of LLMs into storage management also presents a unique angle for data analytics and compliance, making this content highly relevant for security and compliance professionals.