Tag: position embeddings
-
Hacker News: Writing an LLM from scratch, part 8 – trainable self-attention
Source URL: https://www.gilesthomas.com/2025/03/llm-from-scratch-8-trainable-self-attention
Source: Hacker News
Title: Writing an LLM from scratch, part 8 – trainable self-attention
Feedly Summary: Comments
AI Summary and Description: Yes
**Summary:** The text provides an in-depth exploration of implementing self-attention mechanisms in large language models (LLMs), focusing on the mathematical operations and concepts involved. This detailed explanation serves as a…
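The post above walks through the matrix operations behind trainable self-attention. As a rough companion illustration (my own sketch, not code from the post; all names and dimensions are assumptions), here is single-head scaled dot-product self-attention with trainable `W_q`, `W_k`, and `W_v` projections in NumPy:

```python
import numpy as np

# Minimal single-head trainable self-attention (illustrative sketch).
# Shapes are assumptions: seq_len tokens, each with d_in features,
# projected to d_out-dimensional queries/keys/values.
rng = np.random.default_rng(0)
seq_len, d_in, d_out = 6, 8, 4

x = rng.normal(size=(seq_len, d_in))   # input token embeddings
W_q = rng.normal(size=(d_in, d_out))   # trainable query projection
W_k = rng.normal(size=(d_in, d_out))   # trainable key projection
W_v = rng.normal(size=(d_in, d_out))   # trainable value projection

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_out)) V
scores = Q @ K.T / np.sqrt(d_out)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax

context = weights @ V   # (seq_len, d_out) context vectors
print(context.shape)    # (6, 4)
```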
-
Hacker News: Tensor Product Attention Is All You Need
Source URL: https://arxiv.org/abs/2501.06425
Source: Hacker News
Title: Tensor Product Attention Is All You Need
Feedly Summary: Comments
AI Summary and Description: Yes
Summary: The text discusses a novel attention mechanism called Tensor Product Attention (TPA) designed for scaling language models efficiently. It highlights the mechanism’s ability to reduce memory overhead during inference while improving model…
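From the abstract, TPA's memory saving comes from factorizing per-token key/value activations into low-rank tensor products and caching only the small factors. The following is my own reading of that idea as a NumPy sketch, not the paper's implementation; the rank, shapes, and averaging convention are assumptions:

```python
import numpy as np

# Sketch of the idea I take from the TPA abstract: instead of caching
# full per-token K/V tensors of shape (heads, d_head), cache small
# contextual factors and reconstruct K/V on the fly.
rng = np.random.default_rng(0)
heads, d_head, rank, d_model, seq_len = 8, 64, 2, 512, 16

x = rng.normal(size=(seq_len, d_model))   # token hidden states
W_a = rng.normal(size=(d_model, rank * heads)) / np.sqrt(d_model)
W_b = rng.normal(size=(d_model, rank * d_head)) / np.sqrt(d_model)

# Contextual factors: per token, `rank` head-factors and token-factors.
a = (x @ W_a).reshape(seq_len, rank, heads)    # cached at inference
b = (x @ W_b).reshape(seq_len, rank, d_head)   # cached at inference

# Reconstruct keys as an average of rank-1 tensor products:
# K[t] = (1/rank) * sum_r outer(a[t, r], b[t, r]) -> (heads, d_head)
K = np.einsum('trh,trd->thd', a, b) / rank

full_cache = seq_len * heads * d_head           # floats for full keys
tpa_cache = seq_len * rank * (heads + d_head)   # floats for the factors
print(K.shape, full_cache, tpa_cache)           # (16, 8, 64) 8192 2304
```

The comparison at the end is the point: the factor cache grows as rank × (heads + d_head) per token rather than heads × d_head, which is where the claimed inference memory reduction would come from.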
-
Simon Willison’s Weblog: deepseek-ai/DeepSeek-V3-Base
Source URL: https://simonwillison.net/2024/Dec/25/deepseek-v3/#atom-everything
Source: Simon Willison’s Weblog
Title: deepseek-ai/DeepSeek-V3-Base
Feedly Summary: deepseek-ai/DeepSeek-V3-Base
No model card or announcement yet, but this new model release from Chinese AI lab DeepSeek (an arm of Chinese hedge fund High-Flyer) looks very significant. It’s a huge model – 685B parameters, 687.9 GB on disk (TIL how to size a git-lfs…
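A quick back-of-the-envelope check on those two figures (my own arithmetic, assuming roughly one byte per parameter, e.g. FP8 weights; the excerpt above doesn't state the storage dtype):

```python
# Do 685B parameters plausibly give ~687.9 GB on disk? Assuming ~1 byte
# per parameter (e.g. FP8 weights), which is an assumption on my part,
# not something stated in the post above.
params = 685e9
bytes_per_param = 1.0
size_gb = params * bytes_per_param / 1e9   # decimal gigabytes
print(f"{size_gb:.1f} GB")                 # 685.0 GB, close to 687.9 GB
```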
-
Simon Willison’s Weblog: Quoting François Chollet
Source URL: https://simonwillison.net/2024/Oct/16/francois-chollet/
Source: Simon Willison’s Weblog
Title: Quoting François Chollet
Feedly Summary: A common misconception about Transformers is to believe that they’re a sequence-processing architecture. They’re not. They’re a set-processing architecture. Transformers are 100% order-agnostic (which was the big innovation compared to RNNs, back in late 2016 — you compute the full matrix of…
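Chollet's order-agnosticism claim is easy to verify numerically, and it is why this entry sits under the "position embeddings" tag: without position embeddings, permuting the input tokens merely permutes self-attention's output, and adding position embeddings is what breaks that symmetry. A minimal NumPy check (my own illustration; names and shapes are assumptions):

```python
import numpy as np

# Without position embeddings, self-attention is permutation-equivariant:
# attn(x[perm]) == attn(x)[perm]. Adding position embeddings breaks this,
# which is exactly how order information gets injected.
rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attn(x):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    s = Q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

perm = rng.permutation(seq_len)
print(np.allclose(attn(x[perm]), attn(x)[perm]))   # True: order-agnostic

pos = rng.normal(size=(seq_len, d))                # toy position embeddings
print(np.allclose(attn(x[perm] + pos), attn(x + pos)[perm]))  # False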