Hacker News: No More Adam: Learning Rate Scaling at Initialization Is All You Need

Source URL: https://arxiv.org/abs/2412.11768
Source: Hacker News
Title: No More Adam: Learning Rate Scaling at Initialization Is All You Need

Summary: The paper introduces SGD-SaI, an optimizer that enhances stochastic gradient descent (SGD) with momentum for training deep neural networks. It scales learning rates once, at initialization, instead of relying on adaptive gradient methods, and reports strong performance across tasks including large language models (LLMs) and Vision Transformers (ViTs).

Detailed Description:

The paper, titled “No More Adam: Learning Rate Scaling at Initialization Is All You Need”, proposes a simpler way to train deep neural networks, particularly those based on transformer architectures. The proposed method, termed SGD-SaI (Stochastic Gradient Descent with Scaling at Initialization), enhances SGD through the following key elements:

– **Learning Rate Scaling**: SGD-SaI sets a learning-rate scale for each parameter group once, at initialization, based on that group's gradient signal-to-noise ratio (g-SNR). Because the scales are in place from the first iteration, training imbalances between parameter groups are mitigated from the outset.

– **Memory Efficiency**: By dropping the adaptive second-moment estimate that Adam-style optimizers maintain, SGD-SaI roughly halves optimizer-state memory relative to AdamW, a widely used adaptive optimizer. The reported optimizer-state savings in full-precision training are 5.93 GB for GPT-2 with 1.5 billion parameters and 25.15 GB for Llama2-7B.

– **Performance Metrics**: Despite its simplicity, SGD-SaI matches or outperforms AdamW across diverse tasks such as ImageNet-1K classification with Vision Transformers and the pretraining of large language models (LLMs).

– **Robustness**: SGD-SaI is also robust to variations in hyperparameters, which makes it practical for applications such as LoRA fine-tuning of LLMs and diffusion model training.

– **Significance in AI and Infrastructure**: The proposed optimizer has implications for both AI model training efficiency and broader infrastructure considerations in cloud computing environments where resource allocation and cost efficiency are critical.
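
The learning-rate scaling idea in the list above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the paper's implementation: the g-SNR proxy used here (gradient L2 norm divided by the standard deviation of the gradient entries), the normalization by the largest value, and all function and group names are assumptions made for the example. The paper's exact formula may differ.

```python
import math
import statistics

EPS = 1e-8  # assumed small constant to avoid division by zero


def gsnr(grad):
    """Assumed g-SNR proxy: L2 norm of the gradient over its standard deviation."""
    norm = math.sqrt(sum(g * g for g in grad))
    std = statistics.pstdev(grad)
    return norm / (std + EPS)


def init_lr_scales(first_grads):
    """Compute one fixed scale per parameter group from the first gradients."""
    raw = {name: gsnr(g) for name, g in first_grads.items()}
    top = max(raw.values())
    # Normalize so the largest scale is 1.0 (an illustrative choice).
    return {name: value / top for name, value in raw.items()}


def sgd_momentum_step(params, grads, bufs, scales, base_lr=0.1, momentum=0.9):
    """One SGD-with-momentum update using the frozen per-group scales."""
    for name, weights in params.items():
        lr = base_lr * scales[name]
        for i, g in enumerate(grads[name]):
            # The momentum buffer is the only optimizer state kept per weight.
            bufs[name][i] = momentum * bufs[name][i] + g
            weights[i] -= lr * bufs[name][i]


# Hypothetical two-group model: a noisy group and a consistent one.
params = {"attn": [0.5, -0.5, 0.25, 0.1], "mlp": [1.0, 1.0, 1.0, 1.0]}
grads = {"attn": [1.0, -1.0, 1.0, -1.0], "mlp": [2.0, 2.1, 1.9, 2.0]}
bufs = {name: [0.0] * len(w) for name, w in params.items()}

scales = init_lr_scales(grads)  # computed once, never updated afterwards
sgd_momentum_step(params, grads, bufs, scales)
```

In this toy setup the group with consistent gradients (`mlp`) receives the full base learning rate, while the group whose gradient entries disagree in sign (`attn`) receives a much smaller one. Note that no per-parameter second-moment buffer is ever allocated.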

In summary, SGD-SaI addresses long-standing challenges in training transformers while reducing optimizer memory use, which is relevant to both AI model training and cloud computing infrastructure.
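
To make the resource argument concrete, the following back-of-the-envelope sketch estimates optimizer-state memory. It assumes 32-bit floating-point states, two state tensors per parameter for AdamW (first and second moment), one for SGD with momentum, and sizes reported in GiB. The parameter count in the example is a hypothetical round number, not a figure from the paper, and the paper's own accounting may differ.

```python
BYTES_PER_FP32 = 4   # assumption: optimizer states stored in 32-bit floats
GIB = 1024 ** 3


def optimizer_state_gib(num_params, states_per_param):
    """Memory taken by optimizer states, in GiB."""
    return num_params * states_per_param * BYTES_PER_FP32 / GIB


def saving_vs_adamw_gib(num_params):
    """Estimated saving when one of AdamW's two state tensors is dropped."""
    adamw = optimizer_state_gib(num_params, 2)  # first and second moment
    sgdm = optimizer_state_gib(num_params, 1)   # momentum buffer only
    return adamw - sgdm


# Hypothetical model with one billion parameters.
saving = saving_vs_adamw_gib(1_000_000_000)
print(f"{saving:.2f} GiB saved")  # prints "3.73 GiB saved"
```

Under these assumptions the saving grows linearly with parameter count and always equals half of the AdamW optimizer state.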