data parallelism – Experimental News Clipping Site

Cloud Blog: An efficient path to production AI: Kakao’s journey with JAX and Cloud TPUs

Aug 19, 2025

Source URL: https://cloud.google.com/blog/products/infrastructure-modernization/kakaos-journey-with-jax-and-cloud-tpus/
Source: Cloud Blog
Title: An efficient path to production AI: Kakao’s journey with JAX and Cloud TPUs
Feedly Summary: When your messaging platform serves 49 million people – 93% of South Korea’s population – every technical decision carries enormous weight. The engineering team at Kakao faced exactly this challenge when their existing…

Cloud Blog: Your guide to taking an open model from discovery to a production-ready endpoint on Vertex AI

Jul 25, 2025, by Kurt Seifried, in Uncategorized

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/take-an-open-model-from-discovery-to-endpoint-on-vertex-ai/
Source: Cloud Blog
Title: Your guide to taking an open model from discovery to a production-ready endpoint on Vertex AI
Feedly Summary: Developers building with gen AI are increasingly drawn to open models for their power and flexibility. But customizing and deploying them can be a huge challenge. You’re often left wrestling…

Cloud Blog: Save early and often with multi-tier checkpointing to optimize large AI training jobs

Jun 16, 2025, by Kurt Seifried, in Uncategorized

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/using-multi-tier-checkpointing-for-large-ai-training-jobs/
Source: Cloud Blog
Title: Save early and often with multi-tier checkpointing to optimize large AI training jobs
Feedly Summary: As foundation model training infrastructure scales to tens of thousands of accelerators, efficient utilization of those high-value resources becomes paramount. In particular, as the cluster gets larger, hardware failures become more frequent (~…

Cloud Blog: Train AI for less: Improve ML Goodput with elastic training and optimized checkpointing

May 22, 2025, by Kurt Seifried, in Uncategorized

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/elastic-training-and-optimized-checkpointing-improve-ml-goodput/
Source: Cloud Blog
Title: Train AI for less: Improve ML Goodput with elastic training and optimized checkpointing
Feedly Summary: Want to save some money on large AI training? For a typical PyTorch LLM training workload that spans thousands of accelerators for several weeks, a 1% improvement in ML Goodput can translate to…

Hacker News: Mirror, Mirror on the Wall, What Is the Best Topology of Them All?

Nov 29, 2024, by system automation, in Uncategorized

Source URL: https://cacm.acm.org/research-highlights/technical-perspective-mirror-mirror-on-the-wall-what-is-the-best-topology-of-them-all/
Source: Hacker News
Title: Mirror, Mirror on the Wall, What Is the Best Topology of Them All?
Feedly Summary: Comments
AI Summary and Description: Yes
Summary: The text discusses the critical nature of infrastructure design for large-scale AI systems, particularly focusing on network topologies that support specialized AI workloads. It introduces the…

Simon Willison’s Weblog: lm.rs: run inference on Language Models locally on the CPU with Rust

Oct 11, 2024, by system automation, in Uncategorized

Source URL: https://simonwillison.net/2024/Oct/11/lmrs/
Source: Simon Willison’s Weblog
Title: lm.rs: run inference on Language Models locally on the CPU with Rust
Feedly Summary: lm.rs: run inference on Language Models locally on the CPU with Rust. Impressive new LLM inference implementation in Rust by Samuel Vitorino. I tried it just now on an M2 Mac with 64GB…

Hacker News: How to train a model on 10k H100 GPUs?

Oct 2, 2024, by system automation, in Uncategorized

Source URL: https://soumith.ch/blog/2024-10-02-training-10k-scale.md.html
Source: Hacker News
Title: How to train a model on 10k H100 GPUs?
Feedly Summary: Comments
AI Summary and Description: Yes
Summary: The text discusses advanced techniques for training massive AI models using 10,000 NVIDIA H100 GPUs, emphasizing the importance of efficient data parallelization, communication optimization, and rapid failure recovery. These insights…
