Simon Willison’s Weblog: Lessons From Red Teaming 100 Generative AI Products

Source URL: https://simonwillison.net/2025/Jan/18/lessons-from-red-teaming/
Source: Simon Willison’s Weblog
Title: Lessons From Red Teaming 100 Generative AI Products

Feedly Summary: Lessons From Red Teaming 100 Generative AI Products
New paper from Microsoft describing their top eight lessons learned red teaming (deliberately seeking security vulnerabilities in) 100 different generative AI models and products over the past few years.

The Microsoft AI Red Team (AIRT) grew out of pre-existing red teaming initiatives at the company and was officially established in 2018. At its conception, the team focused primarily on identifying traditional security vulnerabilities and evasion attacks against classical ML models.

Lesson 2 is “You don’t have to compute gradients to break an AI system" – the kind of attacks they were trying against classical ML models turn out to be less important against LLM systems than straightforward prompt-based attacks.
They use a new-to-me acronym for prompt injection, "XPIA":

Imagine we are red teaming an LLM-based copilot that can summarize a user’s emails. One possible attack against this system would be for a scammer to send an email that contains a hidden prompt injection instructing the copilot to “ignore previous instructions” and output a malicious link. In this scenario, the Actor is the scammer, who is conducting a cross-prompt injection attack (XPIA), which exploits the fact that LLMs often struggle to distinguish between system-level instructions and user data.

From searching around it looks like that specific acronym "XPIA" is used within Microsoft’s security teams but not much outside of them. It appears to be their chosen acronym for indirect prompt injection, where malicious instructions are smuggled into a vulnerable system by being included in text that the system retrieves from other sources.
Tucked away in the paper is this note, which I think represents the core idea necessary to understand why prompt injection is such an insipid threat:

Due to fundamental limitations of language models, one must assume that if an LLM is supplied with untrusted input, it will produce arbitrary output.

When you’re building software against an LLM you need to assume that anyone who can control more than a few sentences of input to that model can cause it to output anything they like – including tool calls or other data exfiltration vectors. Design accordingly.
Tags: prompt-injection, llms, security, generative-ai, ai, microsoft

AI Summary and Description: Yes

**Summary:** The text discusses an insightful paper from Microsoft’s AI Red Team detailing their experiences red teaming 100 generative AI products. It elaborates on emerging security challenges, especially the threat of prompt injection attacks on large language models (LLMs), emphasizing novel vulnerabilities and the importance of secure design practices.

**Detailed Description:** The Microsoft AI Red Team (AIRT) provides valuable lessons from their extensive red teaming efforts targeting generative AI systems. This paper showcases key concerns, novel findings, and practical implications for professionals engaged in AI and security, particularly in the context of securing LLMs.

– **Formation and Purpose of AIRT:**
– Established in 2018 as an evolution of earlier red teaming efforts.
– Initially focused on traditional security vulnerabilities and evasion strategies against classical machine learning models.

– **Critical Lessons Learned:**
– Lesson 2 highlights that traditional attack vectors (like gradient computation) are less effective against LLMs, where prompt-based attacks are a more significant threat.
– Introduction of the acronym “XPIA” (Cross-Prompt Injection Attack) for prompt injection attacks that exploit the context and input handling of LLMs.

– **Example of Attacks:**
– Illustrates a practical risk: a scam email containing a malicious prompt directed at an LLM-based assistant (e.g., a copilot).
– The scenario describes a scammer embedding instructions to manipulate the assistant into executing dangerous commands or exfiltrating sensitive data.

– **Core Insight:**
– A critical takeaway from the paper points to the limitations in how LLMs process untrusted inputs: any LLM may produce arbitrary outputs when acted upon by malicious inputs.
– Security professionals need to build systems with the assumption that adversaries may control input significantly, prompting a need for robust security measures in software design.

– **Implications for Security and Compliance:**
– The insights gleaned from this red teaming initiative should inform best practices in AI security.
– Professionals need to reconsider their threat models by incorporating prompt injection vulnerabilities into their security frameworks.

In summary, the text significantly contributes to the conversation around generative AI security—especially concerning LLMs—and lays the groundwork for enhancing security strategies against evolving threats in this domain.