Tag: training systems
-
Hacker News: AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs
Source URL: https://arxiv.org/abs/2503.01890
Source: Hacker News
Title: AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs
AI Summary and Description: Yes
Summary: The paper introduces AutoHete, a groundbreaking training system designed for heterogeneous environments that significantly enhances the training efficiency of large language models (LLMs). It addresses GPU memory limitations and…
-
The Register: xAI picked Ethernet over InfiniBand for its H100 Colossus training cluster
Source URL: https://www.theregister.com/2024/10/29/xai_colossus_networking/
Source: The Register
Title: xAI picked Ethernet over InfiniBand for its H100 Colossus training cluster
Feedly Summary: Work already underway to expand system to 200,000 Nvidia Hopper chips. Unlike most AI training clusters, xAI’s Colossus with its 100,000 Nvidia Hopper GPUs doesn’t use InfiniBand. Instead, the massive system, which Nvidia bills as…