Tag: checkpointing
-
Cloud Blog: 11 ways to reduce your Google Cloud compute costs today
Source URL: https://cloud.google.com/blog/products/compute/cost-saving-strategies-when-migrating-to-google-cloud-compute/
Source: Cloud Blog
Title: 11 ways to reduce your Google Cloud compute costs today
Feedly Summary: As the saying goes, “a penny saved is a penny earned,” and this couldn’t be more true when it comes to cloud infrastructure. In today’s competitive business landscape, you need to maintain the performance to meet…
-
Cloud Blog: 5 best practices for Managed Lustre on Google Kubernetes Engine
Source URL: https://cloud.google.com/blog/products/containers-kubernetes/gke-managed-lustre-csi-driver-for-aiml-and-hpc-workloads/
Source: Cloud Blog
Title: 5 best practices for Managed Lustre on Google Kubernetes Engine
Feedly Summary: Google Kubernetes Engine (GKE) is a powerful platform for orchestrating scalable AI and high-performance computing (HPC) workloads. But as clusters grow and jobs become more data-intensive, storage I/O can become a bottleneck. Your powerful GPUs and…
-
Cloud Blog: Announcements for AI Hypercomputer: The latest infrastructure news for ML practitioners
Source URL: https://cloud.google.com/blog/products/ai-machine-learning/q2-2025-ai-hypercomputer-updates/
Source: Cloud Blog
Title: Announcements for AI Hypercomputer: The latest infrastructure news for ML practitioners
Feedly Summary: Curious about the latest in AI infrastructure from Google Cloud? Every three months we share a roundup of the latest AI Hypercomputer news, resources, events, learning opportunities, and more. Read on to learn new ways…
-
Cloud Blog: Accelerate your AI workloads with the Google Cloud Managed Lustre
Source URL: https://cloud.google.com/blog/products/storage-data-transfer/google-cloud-managed-lustre-for-ai-hpc/
Source: Cloud Blog
Title: Accelerate your AI workloads with the Google Cloud Managed Lustre
Feedly Summary: Today, we’re making it even easier to achieve breakthrough performance for your AI/ML workloads: Google Cloud Managed Lustre is now GA, and available in four distinct performance tiers that deliver throughput ranging from 125 MB/s, 250…
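When a parallel filesystem like this is used as a checkpoint target, the throughput you need follows from how large each checkpoint is and how long you can afford to pause training while it is written. The sketch below works that arithmetic with assumed values (a 2 TiB checkpoint and a 60-second write window, neither of which comes from the post); compare the result against the aggregate throughput your chosen tier delivers at your provisioned capacity.

```python
# Rough sizing: aggregate write throughput needed to persist a checkpoint
# of a given size within a target pause. Both numbers below are illustrative
# assumptions, not guidance from the blog post.
checkpoint_gib = 2048        # e.g. a 2 TiB sharded checkpoint (assumed)
target_seconds = 60          # acceptable write window per save (assumed)

required_mib_per_s = checkpoint_gib * 1024 / target_seconds
print(f"Required aggregate throughput: ~{required_mib_per_s:,.0f} MiB/s")
# Compare against the filesystem tier's advertised throughput at your
# provisioned capacity (the post quotes per-tier figures such as 125 MB/s).
```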
-
Cloud Blog: Save early and often with multi-tier checkpointing to optimize large AI training jobs
Source URL: https://cloud.google.com/blog/products/ai-machine-learning/using-multi-tier-checkpointing-for-large-ai-training-jobs/
Source: Cloud Blog
Title: Save early and often with multi-tier checkpointing to optimize large AI training jobs
Feedly Summary: As foundation model training infrastructure scales to tens of thousands of accelerators, efficient utilization of those high-value resources becomes paramount. In particular, as the cluster gets larger, hardware failures become more frequent (~…
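The general pattern behind multi-tier checkpointing is to write every checkpoint to a fast nearby tier and replicate only every Nth one to durable storage, so frequent saves stay cheap while node loss is still covered. The class below is a minimal generic sketch of that idea under assumed paths and intervals (local_dir, durable_dir, durable_every are hypothetical), not the managed implementation the post describes.

```python
# Minimal sketch of multi-tier checkpointing: save frequently to a fast local
# tier, and copy every Nth checkpoint to durable (slower) storage. Paths,
# intervals, and the use of torch.save are illustrative assumptions.
import shutil
from pathlib import Path

import torch


class MultiTierCheckpointer:
    def __init__(self, local_dir: str, durable_dir: str, durable_every: int = 10):
        self.local_dir = Path(local_dir)      # fast tier, e.g. local SSD (assumed)
        self.durable_dir = Path(durable_dir)  # durable tier, e.g. an object-store mount (assumed)
        self.durable_every = durable_every
        self.local_dir.mkdir(parents=True, exist_ok=True)
        self.durable_dir.mkdir(parents=True, exist_ok=True)

    def save(self, step: int, state: dict) -> None:
        """Write every checkpoint locally; replicate every Nth to durable storage."""
        local_path = self.local_dir / f"step_{step:08d}.pt"
        torch.save(state, local_path)
        if step % self.durable_every == 0:
            shutil.copy2(local_path, self.durable_dir / local_path.name)

    def latest(self) -> Path | None:
        """On restart, prefer the freshest local checkpoint, then fall back to durable."""
        for tier in (self.local_dir, self.durable_dir):
            ckpts = sorted(tier.glob("step_*.pt"))
            if ckpts:
                return ckpts[-1]
        return None
```

The fast tier absorbs the frequent saves, which lets the checkpoint interval shrink and caps the work lost to any single failure; the durable tier covers the rarer case where the node (and its local copies) is lost entirely.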
-
Cloud Blog: Building a Production Multimodal Fine-Tuning Pipeline
Source URL: https://cloud.google.com/blog/topics/developers-practitioners/building-a-production-multimodal-fine-tuning-pipeline/
Source: Cloud Blog
Title: Building a Production Multimodal Fine-Tuning Pipeline
Feedly Summary: Looking to fine-tune multimodal AI models for your specific domain but facing infrastructure and implementation challenges? This guide demonstrates how to overcome the multimodal implementation gap using Google Cloud and Axolotl, with a complete hands-on example fine-tuning Gemma 3 on…
-
Cloud Blog: Train AI for less: Improve ML Goodput with elastic training and optimized checkpointing
Source URL: https://cloud.google.com/blog/products/ai-machine-learning/elastic-training-and-optimized-checkpointing-improve-ml-goodput/
Source: Cloud Blog
Title: Train AI for less: Improve ML Goodput with elastic training and optimized checkpointing
Feedly Summary: Want to save some money on large AI training? For a typical PyTorch LLM training workload that spans thousands of accelerators for several weeks, a 1% improvement in ML Goodput can translate to…
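ML Goodput is, roughly, the fraction of total accelerator wall-clock time spent on productive training, so a 1% improvement scales directly with cluster size and job duration. The snippet below is a back-of-the-envelope calculation with assumed figures (4,000 accelerators, three weeks, $2 per accelerator-hour) chosen only to make the arithmetic concrete; the post does not give these numbers.

```python
# Back-of-the-envelope: accelerator-hours recovered by a 1% ML Goodput improvement.
# Cluster size, duration, and hourly rate are illustrative assumptions,
# not figures from the blog post.
accelerators = 4000          # "thousands of accelerators" (assumed value)
weeks = 3                    # "several weeks" (assumed value)
hours = weeks * 7 * 24
goodput_gain = 0.01          # 1% more of total time spent on productive training

recovered_hours = accelerators * hours * goodput_gain
print(f"Recovered accelerator-hours: {recovered_hours:,.0f}")   # ~20,160
print(f"Approximate savings at $2/hr: ${recovered_hours * 2:,.0f}")  # ~$40,320
```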
-
Cloud Blog: AI Hypercomputer developer experience enhancements from Q1 25: build faster, scale bigger
Source URL: https://cloud.google.com/blog/products/compute/ai-hypercomputer-enhancements-for-the-developer/
Source: Cloud Blog
Title: AI Hypercomputer developer experience enhancements from Q1 25: build faster, scale bigger
Feedly Summary: Building cutting-edge AI models is exciting, whether you’re iterating in your notebook or orchestrating large clusters. However, scaling up training can present significant challenges, including navigating complex infrastructure, configuring software and dependencies across numerous…
-
Cloud Blog: Colossus: the secret ingredient in Rapid Storage’s high performance
Source URL: https://cloud.google.com/blog/products/storage-data-transfer/how-the-colossus-stateful-protocol-benefits-rapid-storage/
Source: Cloud Blog
Title: Colossus: the secret ingredient in Rapid Storage’s high performance
Feedly Summary: As an object storage service, Google Cloud Storage is popular for its simplicity and scale, a big part of which is due to the stateless REST protocols that you can use to read and write data. But…
-
Cloud Blog: Introducing Ironwood TPUs and new innovations in AI Hypercomputer
Source URL: https://cloud.google.com/blog/products/compute/whats-new-with-ai-hypercomputer/
Source: Cloud Blog
Title: Introducing Ironwood TPUs and new innovations in AI Hypercomputer
Feedly Summary: Today’s innovation isn’t born in a lab or at a drafting board; it’s built on the bedrock of AI infrastructure. AI workloads have new and unique demands — addressing these requires a finely crafted combination of hardware…