Source URL: https://simonwillison.net/2025/Sep/11/defeating-nondeterminism/#atom-everything
Source: Simon Willison’s Weblog
Title: Defeating Nondeterminism in LLM Inference
Feedly Summary: Defeating Nondeterminism in LLM Inference
A very common question I see about LLMs concerns why they can’t be made to deliver the same response to the same prompt by setting a fixed random number seed.
Like many others I had been led to believe this was due to the non-associative nature of floating point arithmetic, where (a + b) + c ≠ a + (b + c), combined with unpredictable calculation orders on concurrent GPUs. This new paper calls that the “concurrency + floating point” hypothesis:
One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first. We will call this the “concurrency + floating point” hypothesis for LLM inference nondeterminism.
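The non-associativity half of that hypothesis is real and easy to demonstrate: a quick sketch (values chosen purely for illustration) shows that regrouping the same three floating-point additions changes the result, because the small term is absorbed when added to the large one first.

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 1e20, -1e20

left = (a + b) + c   # 0.1 is absorbed into 1e20, which then cancels -> 0.0
right = a + (b + c)  # b and c cancel exactly first -> 0.1

print(left, right)   # 0.0 0.1
assert left != right
```

So if concurrent GPU cores finished reductions in varying orders, outputs could indeed vary; the paper's point is that typical forward passes don't actually reduce in a nondeterministic order.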
It then convincingly argues that this is not the core of the problem, because "in the typical forward pass of an LLM, there is usually not a single atomic add present."
Why are LLMs so often non-deterministic then?
[…] the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies! This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism.
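The mechanism can be sketched in pure NumPy (a toy illustration, not the paper's kernels): a kernel that splits a reduction differently depending on batch size traverses a different addition tree, so the same input can produce bit-level different results. Each split size is individually deterministic; varying the split is what varies the answer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

def chunked_sum(v: np.ndarray, chunk: int) -> np.float32:
    # Sum in fixed-size chunks, then sum the partials. The shape of the
    # reduction tree depends on `chunk`, standing in for how a kernel's
    # work partitioning can change with server load / batch size.
    partials = [v[i:i + chunk].sum(dtype=np.float32)
                for i in range(0, len(v), chunk)]
    return np.float32(sum(partials))

s_small = chunked_sum(x, 64)    # e.g. small-batch partitioning
s_large = chunked_sum(x, 1024)  # e.g. large-batch partitioning

# Same split size -> identical result every time (deterministic).
assert chunked_sum(x, 64) == s_small
# Different split sizes -> results may differ in the low-order bits.
print(s_small, s_large, s_small == s_large)
```

A batch-invariant kernel, by contrast, fixes the reduction order regardless of how many requests were batched together, which is what makes the endpoint deterministic.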
The thinking-machines-lab/batch_invariant_ops code that accompanies this paper addresses this by providing a PyTorch implementation of batch-invariant kernels, and demonstrates them running Qwen3-8B deterministically under vLLM.
This paper is the first public output from Thinking Machines, the AI Lab founded in February 2025 by Mira Murati, OpenAI’s former CTO (and interim CEO for a few days). It’s unrelated to Thinking Machines Corporation, the last employer of Richard Feynman (as described in this most excellent story by Danny Hillis).
Tags: ai, pytorch, generative-ai, llms, qwen
AI Summary and Description: Yes
Summary: The text discusses the non-determinism often observed in Large Language Model (LLM) inference, challenging common assumptions about its causes. It highlights a new hypothesis centered on batch-size variability and introduces a solution that implements batch-invariant kernels to achieve deterministic outputs.
Detailed Description: The text provides insights into the challenges associated with achieving deterministic outputs in LLMs, which is critical for developers and researchers working within AI, particularly in the context of ensuring the reliability and repeatability of model responses.
Key Points:
– **Non-Determinism in LLMs**: The text identifies a prevalent question regarding the inability of LLMs to deliver the same responses consistently, even with a fixed random seed.
– **Conventional Understanding**: Many previously attributed this inconsistency to the non-associative nature of floating-point arithmetic combined with the unpredictable order of calculations on concurrent processing units (GPUs, and likewise CPUs/TPUs).
– **New Findings**: The paper refutes this traditional view by positing that the primary source of non-determinism actually lies in the variability of the load (and batch size) during inference, not solely in the arithmetic operations involved.
– **Code Contribution**: It introduces thinking-machines-lab’s batch_invariant_ops code, which tackles this problem with a PyTorch implementation of batch-invariant kernels, allowing Qwen3-8B to run deterministically under vLLM.
– **Background Context**: The paper originates from Thinking Machines, an AI Lab established by Mira Murati, emphasizing its relevance to ongoing research and development in the AI domain.
This text is significant for AI security and compliance professionals because it addresses foundational model behavior that affects how AI applications perform in production, particularly reliability and repeatability in sensitive applications. Understanding these dynamics is crucial for effective governance and model compliance across operational contexts.