Simon Willison’s Weblog: Trading Inference-Time Compute for Adversarial Robustness

Source URL: https://simonwillison.net/2025/Jan/22/trading-inference-time-compute/
Source: Simon Willison’s Weblog
Title: Trading Inference-Time Compute for Adversarial Robustness

Feedly Summary: Trading Inference-Time Compute for Adversarial Robustness
Brand new research paper from OpenAI, exploring how inference-scaling “reasoning" models such as o1 might impact the search for improved security with respect to things like prompt injection.

We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows.

They clearly understand why this stuff is such a big problem, especially as we try to outsource more autonomous actions to "agentic models":

Ensuring that agentic models function reliably when browsing the web, sending emails, or uploading code to repositories can be seen as analogous to ensuring that self-driving cars drive without accidents. As in the case of self-driving cars, an agent forwarding a wrong email or creating security vulnerabilities may well have far-reaching real-world consequences. Moreover, LLM agents face an additional challenge from adversaries which are rarely present in the self-driving case. Adversarial entities could control some of the inputs that these agents encounter while browsing the web, or reading files and images.

This is a really interesting paper, but it starts with a huge caveat. The original sin of LLMs – and the reason prompt injection is such a hard problem to solve – is the way they mix instructions and input data in the same stream of tokens. I’ll quote section 1.2 of the paper in full – note that point 1 describes that challenge:

1.2 Limitations of this work
The following conditions are necessary to ensure the models respond more safely, even in adversarial settings:

Ability by the model to parse its context into separate components. This is crucial to be able to distinguish data from instructions, and instructions at different hierarchies.
Existence of safety specifications that delineate what contents should be allowed or disallowed, how the model should resolve conflicts, etc..
Knowledge of the safety specifications by the model (e.g. in context, memorization of their text, or ability to label prompts and responses according to them).
Ability to apply the safety specifications to specific instances. For the adversarial setting, the crucial aspect is the ability of the model to apply the safety specifications to instances that are out of the training distribution, since naturally these would be the prompts provided by the adversary,

They then go on to say (emphasis mine):

Our work demonstrates that inference-time compute helps with Item 4, even in cases where the instance is shifted by an adversary to be far from the training distribution (e.g., by injecting soft tokens or adversarially generated content). However, our work does not pertain to Items 1-3, and even for 4, we do not yet provide a "foolproof" and complete solution.
While we believe this work provides an important insight, we note that fully resolving the adversarial robustness challenge will require tackling all the points above.

So while this paper demonstrates that inference-scaled models can greatly improve things with respect to identifying and avoiding out-of-distribution attacks against safety instructions, they are not claiming a solution to the key instruction-mixing challenge of prompt injection. Once again, this is not the silver bullet we are all dreaming of.
The paper introduces two new categories of attack against inference-scaling models, with two delightful names: "Think Less" and "Nerd Sniping".
Think Less attacks are when an attacker tricks a model into spending less time on reasoning, on the basis that more reasoning helps prevent a variety of attacks so cutting short the reasoning might help an attack make it through.
Nerd Sniping (see XKCD 356) does the opposite: these are attacks that cause the model to "spend inference-time compute unproductively". In addition to added costs, these could also open up some security holes – there are edge-cases where attack success rates go up for longer compute times.
Sadly they didn’t provide concrete examples for either of these new attack classes. I’d love to see what Nerd Sniping looks like in a malicious prompt!
Tags: o1, openai, inference-scaling, ai, llms, prompt-injection, security, generative-ai, ai-agents

AI Summary and Description: Yes

**Summary:** The text discusses a research paper from OpenAI examining the relationship between inference-time compute in reasoning models and their adversarial robustness. It highlights how increased compute can improve model resilience against attacks, particularly in contexts where models interact autonomously, like browsing or sending emails. The paper elucidates limitations in addressing prompt injection vulnerabilities tied to the integration of instructions and input data but provides insight into mitigating certain out-of-distribution attacks.

**Detailed Description:**

– **Research Focus:** The paper from OpenAI investigates the effects of inference-time compute on the robustness of reasoning models (specifically o1-preview and o1-mini) against various adversarial attacks.

– **Key Findings:**
– Increased inference-time compute generally leads to improved robustness, reducing attack success rates.
– Despite benefits from scaling compute, significant challenges remain, especially concerning prompt injection vulnerabilities.

– **Concerns with Agentic Models:**
– The paper draws a parallel between agentic models and self-driving cars, emphasizing the potential for real-world harm if such models fail (e.g., sending incorrect emails).
– Adversarial inputs, which can be controlled by malicious actors, pose unique challenges for these models, necessitating them to safely navigate and process various online content.

– **Limitations Acknowledged:**
– The study highlights several limitations (termed Items 1-4) that need to be addressed for safer model responses:
– Ability to parse context and distinguish instructions from data.
– Development of safety specifications for handling conflicting inputs.
– Models’ knowledge of safety specifications.
– Capability to apply these safety specifications to adversarial examples.

– **Novel Insights:**
– While the research shows promise in addressing out-of-distribution attacks, it emphasizes that a complete solution to adversarial robustness requires tackling the fundamental issues related to instruction mixing.

– **New Categories of Attacks Introduced:**
– “Think Less”: A tactic where the attacker unduly influences the model to reduce reasoning time, potentially allowing attacks to succeed.
– “Nerd Sniping”: A strategy to cause the model to waste inference-time compute, leading to potential vulnerabilities and increased costs without productive outcomes.

In conclusion, this research paper contributes to a critical understanding of how scaling inference-time compute can bolster AI model security while also highlighting the complexity and multifaceted nature of addressing adversarial threats, particularly in the emerging landscape of agentic models and generative AI systems. For professionals in security, compliance, and AI development, these insights underscore the ongoing need for rigorous safety protocols and robust verification mechanisms in AI systems.