Simon Willison’s Weblog: LLM Flowbreaking

Source URL: https://simonwillison.net/2024/Nov/29/llm-flowbreaking/#atom-everything
Source: Simon Willison’s Weblog
Title: LLM Flowbreaking

Feedly Summary: LLM Flowbreaking
Gadi Evron from Knostic:

We propose that LLM Flowbreaking, following jailbreaking and prompt injection, joins as the third on the growing list of LLM attack types. Flowbreaking is less about whether prompt or response guardrails can be bypassed, and more about whether user inputs and generated model outputs can adversely affect these other components in the broader implemented system.

The key idea here is that some systems built on top of LLMs – such as Microsoft Copilot – implement an additional layer of safety checks which can sometimes cause the system to retract an already displayed answer.
I’ve seen this myself a few times, most notably with Claude 2 last year, when it deleted an almost complete podcast transcript cleanup right in front of my eyes because the hosts started talking about bomb threats.
Knostic calls this Second Thoughts, where an LLM system decides to retract its previous output. It’s not hard for an attacker to grab this potentially harmful data: I’ve grabbed some with a quick copy and paste, or you can use tricks like video scraping or the browser’s network tools.
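To make the capture point concrete, here is a minimal sketch that writes every streamed chunk to disk the moment it arrives, so a later retraction in the UI changes nothing. The endpoint URL, request payload and chunk format below are invented for illustration, not any vendor’s real API.

```python
# Minimal sketch: log a streamed LLM response to disk as it arrives, so a
# later UI-level retraction cannot remove what was already received.
# The endpoint URL, request payload and chunk format are hypothetical.
import json
import requests

STREAM_URL = "https://chat.example.com/api/stream"  # placeholder, not a real API


def capture_stream(prompt: str, log_path: str = "captured_output.txt") -> str:
    """Send a prompt and append every streamed chunk to a local log file."""
    captured = []
    with requests.post(STREAM_URL, json={"prompt": prompt}, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(log_path, "a", encoding="utf-8") as log:
            for line in resp.iter_lines(decode_unicode=True):
                # Assume a server-sent-events style stream of "data: {...}" lines
                if not line or not line.startswith("data: "):
                    continue
                chunk = json.loads(line[len("data: "):]).get("text", "")
                captured.append(chunk)
                log.write(chunk)  # persisted immediately; a retraction in the
                log.flush()       # chat UI cannot undo this
    return "".join(captured)
```

The browser’s network panel exposes the same raw stream, which is essentially what the network tools trick amounts to.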
They also describe a Stop and Roll attack, where the user clicks the “stop” button while a query is executing, in a way that also prevents the moderation layer from getting the chance to retract the already displayed output.
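The timing gap is easier to see in a rough sketch of a simplified backend in which output moderation only runs once the full response has been streamed. The names and structure here are illustrative assumptions, not any vendor’s actual implementation.

```python
# Illustrative only: a simplified chat backend where an output-moderation
# check runs after streaming has finished. If the client disconnects early
# (the "stop" button), the retraction step is never reached, so whatever was
# already rendered on the user's screen stays there.
from typing import Callable, Iterator


def generate_tokens(prompt: str) -> Iterator[str]:
    # Stand-in for the model: yields tokens one at a time.
    yield from ["Here", " is", " the", " answer", "..."]


def is_unsafe(full_text: str) -> bool:
    # Stand-in for the post-hoc output moderation classifier.
    return "answer" in full_text  # toy rule


def handle_request(
    prompt: str,
    send: Callable[[str], None],           # pushes a token to the user's screen
    retract: Callable[[], None],           # hides the whole message ("Second Thoughts")
    client_connected: Callable[[], bool],  # False once the user presses "stop"
) -> None:
    shown = []
    for token in generate_tokens(prompt):
        if not client_connected():
            return  # stream aborted: the moderation check below never runs
        send(token)  # token is already visible before any safety check
        shown.append(token)
    if is_unsafe("".join(shown)):  # moderation happens only at the end
        retract()
```

Under a design like this, anything short of moderating each chunk before it is sent leaves a window in which the user has already seen the content.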
I’m not sure I’d categorize this as a completely new vulnerability class. If you implement a system where output is displayed to users, you should expect that attempts to retract that data can be subverted – screen capture software is widely available these days.
I wonder how widespread this retraction UI pattern is? I’ve seen it in Claude, and evidently ChatGPT and Microsoft Copilot have the same feature. I don’t find it particularly convincing – it seems to me that it’s more safety theatre than a serious mechanism for avoiding harm caused by unsafe output.
Via Bruce Schneier
Tags: ai, llms, security, generative-ai

AI Summary and Description: Yes

Summary: The text discusses the concept of “LLM Flowbreaking,” which represents a new form of attack targeting Large Language Models (LLMs). This attack type raises concerns about the system’s safety mechanisms and their effectiveness in retracting potentially harmful outputs. It highlights the vulnerabilities associated with user inputs and outputs in LLMs, particularly in systems like Microsoft Copilot.

Detailed Description:
The article by Gadi Evron introduces the concept of LLM Flowbreaking as a third type of attack on LLMs, alongside jailbreaking and prompt injection. The discussion emphasizes the challenges and implications for security within AI systems that utilize LLMs:

– **Flowbreaking Defined**: Unlike jailbreaking and prompt injection, which focus on bypassing prompt or response guardrails, Flowbreaking is concerned with whether user inputs and generated model outputs can adversely affect other components in the broader implemented system.
– **Examples of Vulnerability**:
  – The text recounts an incident in which Claude 2 deleted an almost complete podcast transcript cleanup because the hosts discussed a sensitive topic, illustrating how these retroactive safety checks behave in practice.
  – Knostic calls the phenomenon of an LLM system retracting previously displayed output “Second Thoughts.”

– **Attack Techniques**:
  – **Stop and Roll Attack**: By clicking the “stop” button during a query, a user can prevent the moderation layer from retracting the output, leaving potentially harmful content visible.

– **Skepticism on Safety Features**: The author expresses doubt regarding the effectiveness of the retraction mechanisms implemented in systems like Claude, ChatGPT, and Microsoft Copilot, describing them as “safety theatre” rather than robust security measures.

– **Vulnerability Contextualization**:
  – If outputs are displayed to users, the ability to retract them can be subverted, particularly as tools for capturing digital content are widely accessible.
  – The author asks how widespread these retraction UI patterns are across platforms and whether relying on them provides meaningful safety.

This analysis emphasizes the need for security professionals working with AI technologies to be aware of new attack vectors like LLM Flowbreaking, understand their implications, and take proactive measures to strengthen safety mechanisms within LLM implementations.