Source URL: https://arxiv.org/abs/2412.11768
Source: Hacker News
Title: No More Adam: Learning Rate Scaling at Initialization Is All You Need
Feedly Summary: Comments
AI Summary and Description: Yes
Summary: The text presents SGD-SaI, a novel optimization technique that enhances stochastic gradient descent (SGD) for training deep neural networks. By scaling learning rates once at initialization instead of relying on adaptive gradient methods, it simplifies training while delivering strong performance across tasks, including large language models (LLMs) and Vision Transformers (ViTs).
Detailed Description:
The paper titled “No More Adam: Learning Rate Scaling at Initialization is All You Need” presents an innovative approach to optimizing the training of deep neural networks, particularly those based on transformer architectures. The proposed method, termed SGD-SaI (Stochastic Gradient Descent with Scaling at Initialization), focuses on enhancing SGD through the following key elements:
– **Learning Rate Scaling**: SGD-SaI adjusts learning rates once, at initialization, assigning each parameter group a rate based on its gradient signal-to-noise ratio (g-SNR). This heads off training imbalances between parameter groups from the outset (a minimal sketch follows this list).
– **Memory Efficiency**: By eliminating the need for adaptive second-order momentum, SGD-SaI significantly decreases memory usage, achieving approximately half the memory footprint of AdamW, a widely used adaptive optimizer. For example, the memory savings are measured at 5.93 GB for the GPT-2 model with 1.5 billion parameters and 25.15 GB for Llama2-7B.
– **Performance Metrics**: Despite its simplicity, SGD-SaI demonstrates performance on par with or superior to AdamW across diverse tasks such as ImageNet-1K classification using Vision Transformers and the pretraining of large language models (LLMs).
– **Robustness**: SGD-SaI is also robust to variations in hyperparameters, making it practical for a range of applications, including LoRA fine-tuning of LLMs and diffusion model training.
– **Significance in AI and Infrastructure**: The proposed optimizer has implications for both AI model training efficiency and broader infrastructure considerations in cloud computing environments where resource allocation and cost efficiency are critical.
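To make the learning-rate-scaling bullet concrete, below is a minimal PyTorch sketch of the idea, not the authors' implementation: the paper specifies the exact g-SNR formula and scaling rule, whereas here `g_snr` is approximated as the gradient's norm divided by its elementwise standard deviation, the normalization by the mean g-SNR is assumed, and `scaled_param_groups`, `probe_batch`, and `base_lr` are hypothetical names.

```python
import torch

def g_snr(grad: torch.Tensor, eps: float = 1e-8) -> float:
    # Placeholder g-SNR for one parameter tensor: gradient norm over its
    # elementwise standard deviation. The paper's exact definition may differ.
    return (grad.norm() / (grad.std() + eps)).item()

def scaled_param_groups(model, loss_fn, probe_batch, base_lr=1e-3):
    # One backward pass on a probe batch supplies the gradients used to fix
    # each parameter tensor's learning rate before training starts.
    model.zero_grad()
    inputs, targets = probe_batch
    loss_fn(model(inputs), targets).backward()

    snrs = {name: g_snr(p.grad)
            for name, p in model.named_parameters() if p.grad is not None}
    mean_snr = sum(snrs.values()) / len(snrs)

    groups = []
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        # Per-tensor scale decided once and never updated during training
        # (hypothetical normalization by the mean g-SNR).
        groups.append({"params": [p], "lr": base_lr * snrs[name] / mean_snr})
    model.zero_grad()
    return groups

# Training then proceeds with plain SGD with momentum, with no per-step
# adaptive state:
# optimizer = torch.optim.SGD(
#     scaled_param_groups(model, loss_fn, probe_batch),
#     momentum=0.9, weight_decay=0.1)
```

Because the per-tensor scales are frozen after this single pass, the only optimizer state kept during training is SGD's momentum buffer, which is what yields the roughly halved memory footprint relative to AdamW noted above.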
In summary, SGD-SaI marks a significant advance in optimization techniques for deep learning, addressing long-standing challenges in training transformers while reducing resource use, both of which are highly relevant to AI practice and cost-efficient cloud infrastructure.