Tag: tier checkpointing
-
Cloud Blog: Announcements for AI Hypercomputer: The latest infrastructure news for ML practitioners
Source URL: https://cloud.google.com/blog/products/ai-machine-learning/q2-2025-ai-hypercomputer-updates/
Source: Cloud Blog
Title: Announcements for AI Hypercomputer: The latest infrastructure news for ML practitioners
Feedly Summary: Curious about the latest in AI infrastructure from Google Cloud? Every three months we share a roundup of the latest AI Hypercomputer news, resources, events, learning opportunities, and more. Read on to learn new ways…
-
Cloud Blog: Save early and often with multi-tier checkpointing to optimize large AI training jobs
Source URL: https://cloud.google.com/blog/products/ai-machine-learning/using-multi-tier-checkpointing-for-large-ai-training-jobs/
Source: Cloud Blog
Title: Save early and often with multi-tier checkpointing to optimize large AI training jobs
Feedly Summary: As foundation model training infrastructure scales to tens of thousands of accelerators, efficient utilization of those high-value resources becomes paramount. In particular, as the cluster gets larger, hardware failures become more frequent (~…