Source URL: https://simonwillison.net/2025/Sep/11/defeating-nondeterminism/#atom-everything
Source: Simon Willison’s Weblog
Title: Defeating Nondeterminism in LLM Inference
Feedly Summary: Defeating Nondeterminism in LLM Inference
A very common question I see about LLMs concerns why they can’t be made to deliver the same response to the same prompt by setting a fixed random number seed.
Like many others I had been led to believe this was due to the non-associative nature of floating point arithmetic, where (a + b) + c ≠ a + (b + c), combined with unpredictable calculation orders on concurrent GPUs. This new paper calls that the “concurrency + floating point” hypothesis:
One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first. We will call this the “concurrency + floating point” hypothesis for LLM inference nondeterminism.
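The non-associativity half of that hypothesis is real and easy to demonstrate: a quick sketch (values chosen purely for illustration) shows that regrouping the same three floating-point additions changes the result, because the small term is absorbed when added to the large one first.

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 1e20, -1e20

left = (a + b) + c   # 0.1 is absorbed into 1e20, which then cancels -> 0.0
right = a + (b + c)  # b and c cancel exactly first -> 0.1

print(left, right)   # 0.0 0.1
assert left != right
```

So if concurrent GPU cores finished reductions in varying orders, outputs could indeed vary; the paper's point is that typical forward passes don't actually reduce in a nondeterministic order.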
It then convincingly argues that this is not the core of the problem, because "in the typical forward pass of an LLM, there is usually not a single atomic add present."
Why are LLMs so often non-deterministic then?
[…] the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies! This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism.
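The mechanism can be sketched in pure NumPy (a toy illustration, not the paper's kernels): a kernel that splits a reduction differently depending on batch size traverses a different addition tree, so the same input can produce bit-level different results. Each split size is individually deterministic; varying the split is what varies the answer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

def chunked_sum(v: np.ndarray, chunk: int) -> np.float32:
    # Sum in fixed-size chunks, then sum the partials. The shape of the
    # reduction tree depends on `chunk`, standing in for how a kernel's
    # work partitioning can change with server load / batch size.
    partials = [v[i:i + chunk].sum(dtype=np.float32)
                for i in range(0, len(v), chunk)]
    return np.float32(sum(partials))

s_small = chunked_sum(x, 64)    # e.g. small-batch partitioning
s_large = chunked_sum(x, 1024)  # e.g. large-batch partitioning

# Same split size -> identical result every time (deterministic).
assert chunked_sum(x, 64) == s_small
# Different split sizes -> results may differ in the low-order bits.
print(s_small, s_large, s_small == s_large)
```

A batch-invariant kernel, by contrast, fixes the reduction order regardless of how many requests were batched together, which is what makes the endpoint deterministic.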
The thinking-machines-lab/batch_invariant_ops code that accompanies this paper addresses this by providing a PyTorch implementation of batch-invariant kernels, and demonstrates them running Qwen3-8B deterministically under vLLM.
This paper is the first public output from Thinking Machines, the AI Lab founded in February 2025 by Mira Murati, OpenAI’s former CTO (and interim CEO for a few days). It’s unrelated to Thinking Machines Corporation, the last employer of Richard Feynman (as described in this most excellent story by Danny Hillis).
Tags: ai, pytorch, generative-ai, llms, qwen
AI Summary and Description: Yes
Summary: The text discusses the non-determinism often observed in Large Language Model (LLM) inference, challenging common assumptions about its causes. It highlights a new hypothesis centered on batch-size variability and introduces a solution that implements batch-invariant kernels to achieve deterministic outputs.
Detailed Description: The text provides insights into the challenges associated with achieving deterministic outputs in LLMs, which is critical for developers and researchers working within AI, particularly in the context of ensuring the reliability and repeatability of model responses.
Key Points:
– **Non-Determinism in LLMs**: The text identifies a prevalent question regarding the inability of LLMs to deliver the same responses consistently, even with a fixed random seed.
– **Conventional Understanding**: Many previously attributed this inconsistency to the non-associative nature of floating-point arithmetic combined with the unpredictable order of calculations on concurrent processing units (GPUs, and likewise CPUs/TPUs).
– **New Findings**: The paper refutes this traditional view by positing that the primary source of non-determinism actually lies in the variability of the load (and batch size) during inference, not solely in the arithmetic operations involved.
– **Code Contribution**: It introduces thinking-machines-lab’s batch_invariant_ops code, which tackles this problem with a PyTorch implementation of batch-invariant kernels, allowing Qwen3-8B to run deterministically under vLLM.
– **Background Context**: The paper originates from Thinking Machines, an AI Lab established by Mira Murati, emphasizing its relevance to ongoing research and development in the AI domain.
This text is significant for AI security and compliance professionals because it addresses foundational model behavior that affects how AI applications perform in production, particularly reliability and repeatability in sensitive applications. Understanding these dynamics is crucial for effective governance and model compliance across operational contexts.