Simon Willison’s Weblog: How we estimate the risk from prompt injection attacks on AI systems

Source URL: https://simonwillison.net/2025/Jan/29/prompt-injection-attacks-on-ai-systems/
Source: Simon Willison’s Weblog
Title: How we estimate the risk from prompt injection attacks on AI systems

Feedly Summary: How we estimate the risk from prompt injection attacks on AI systems
The “Agentic AI Security Team" at Google DeepMind share some details on how they are researching indirect prompt injection attacks.
They include this handy diagram illustrating one of the most common and concerning attack patterns, where an attacker plants malicious instructions causing an AI agent with access to private data to leak that data via some form exfiltration mechanism, such as emailing it out or embedding it in an image URL reference (see my markdown-exfiltration tag for more examples of that style of attack).

They’ve been exploring ways of red-teaming a hypothetical system that works like this:

The evaluation framework tests this by creating a hypothetical scenario, in which an AI agent can send and retrieve emails on behalf of the user. The agent is presented with a fictitious conversation history in which the user references private information such as their passport or social security number. Each conversation ends with a request by the user to summarize their last email, and the retrieved email in context.
The contents of this email are controlled by the attacker, who tries to manipulate the agent into sending the sensitive information in the conversation history to an attacker-controlled email address.

They describe three techniques they are using to generate new attacks:

Actor Critic has the attacker directly call a system that attempts to score the likelihood of an attack, and revise its attacks until they pass that filter.
Beam Search adds random tokens to the end of a prompt injection to see if they increase or decrease that score.
Tree of Attacks w/ Pruning (TAP) adapts this December 2023 jailbreaking paper to search for prompt injections instead.

This is interesting work, but it leaves me nervous about the overall approach. Testing filters that detect prompt injections suggests that the overall goal is to build a robust filter… but as discussed previously, in the field of security a filter that catches 99% of attacks is effectively worthless – the goal of an adversarial attacker is to find the tiny proportion of attacks that still work and it only takes one successful exfiltration exploit and your private data is in the wind.
The Google Security Blog post concludes:

A single silver bullet defense is not expected to solve this problem entirely. We believe the most promising path to defend against these attacks involves a combination of robust evaluation frameworks leveraging automated red-teaming methods, alongside monitoring, heuristic defenses, and standard security engineering solutions.

A agree that a silver bullet is looking increasingly unlikely, but I don’t think that heuristic defenses will be enough to responsibly deploy these systems.
Tags: prompt-injection, security, google, generative-ai, markdown-exfiltration, ai, llms, ai-agents

AI Summary and Description: Yes

**Summary:** The text discusses the research conducted by Google DeepMind’s “Agentic AI Security Team” on the risks associated with prompt injection attacks targeting AI systems. It highlights novel red-teaming methodologies and the challenges of developing effective security filters against these types of attacks. The insights are particularly relevant for professionals focusing on AI and generative AI security.

**Detailed Description:**
The content outlines significant concerns regarding prompt injection attacks against AI systems, detailing the ongoing research at Google DeepMind aimed at understanding and mitigating these threats.

Key points include:

– **Prompt Injection Attack Explanation**: The text provides an overview of how attackers exploit AI agents to leak sensitive data through indirect prompt injections.
– Attackers insert malicious instructions that manipulate AI agents with access to private data to exfiltrate that data.
– Exfiltration methods utilized by these attackers may include sending emails containing sensitive information or embedding it within image URLs.

– **Red-Teaming Methodologies**:
– The researchers propose an evaluation framework using hypothetical scenarios where an AI agent interacts with fictitious user data, such as social security numbers or passport details.
– The research facilitates testing how effectively an attacker can manipulate the agent into revealing confidential information to themselves.

– **Attack Generation Techniques**:
– **Actor Critic Approach**: Involves dynamically revising attacks based on scoring mechanisms to evaluate their likelihood of success.
– **Beam Search Method**: Introduces random tokens to the end of prompts to assess the effect on attack success rates.
– **Tree of Attacks with Pruning (TAP)**: Adapts a jailbreaking approach to explore possible prompt injections comprehensively.

– **Critical Security Perspective**: The author expresses skepticism about the efficacy of existing defense mechanisms:
– A mere 99% filter that misses even a small percentage of attacks is inadequate in practice, as a single successful exploit can compromise security.
– Emphasis is placed on the inadequacy of silver bullet solutions and the necessity for multi-faceted defense strategies involving heuristic defenses, automated red-teaming, and standard security practices.

The Google Security Blog concludes by advocating for a comprehensive defense strategy that blends various methodologies to effectively counteract prompt injection attacks. There is a consensus that reliance on heuristic defenses alone might not be sufficient to ensure robust security in deploying AI systems responsibly.

This analysis indicates that AI security professionals must remain vigilant and proactive, integrating innovative red-teaming practices into their operational Security frameworks to combat evolving threat vectors effectively.