Tag: Goodput

  • Cloud Blog: Announcing a new monitoring library to optimize TPU performance

    Source URL: https://cloud.google.com/blog/products/compute/new-monitoring-library-to-optimize-google-cloud-tpu-resources/
    Source: Cloud Blog
    Title: Announcing a new monitoring library to optimize TPU performance
    Feedly Summary: For more than a decade, TPUs have powered Google’s most demanding AI training and serving workloads. And there is strong demand from customers for Cloud TPUs as well. When running advanced AI workloads, you need to be…

  • Cloud Blog: Accelerate your AI workloads with the Google Cloud Managed Lustre

    Source URL: https://cloud.google.com/blog/products/storage-data-transfer/google-cloud-managed-lustre-for-ai-hpc/
    Source: Cloud Blog
    Title: Accelerate your AI workloads with the Google Cloud Managed Lustre
    Feedly Summary: Today, we’re making it even easier to achieve breakthrough performance for your AI/ML workloads: Google Cloud Managed Lustre is now GA, and available in four distinct performance tiers that deliver throughput ranging from 125 MB/s, 250…

  • Cloud Blog: Train AI for less: Improve ML Goodput with elastic training and optimized checkpointing

    Source URL: https://cloud.google.com/blog/products/ai-machine-learning/elastic-training-and-optimized-checkpointing-improve-ml-goodput/
    Source: Cloud Blog
    Title: Train AI for less: Improve ML Goodput with elastic training and optimized checkpointing
    Feedly Summary: Want to save some money on large AI training? For a typical PyTorch LLM training workload that spans thousands of accelerators for several weeks, a 1% improvement in ML Goodput can translate to…