Source URL: https://unit42.paloaltonetworks.com/jailbreaking-generative-ai-web-products/
Source: Unit 42
Title: Investigating LLM Jailbreaking of Popular Generative AI Web Products
Feedly Summary: We discuss vulnerabilities in popular GenAI web products to LLM jailbreaks. Single-turn strategies remain effective, but multi-turn approaches show greater success.
AI Summary and Description: Yes
**Summary:** The text provides an extensive investigation into the vulnerabilities of 17 popular generative AI web products to jailbreaking techniques. Notably, the study reveals that these applications, despite implementing safety measures, remain susceptible to techniques that bypass those guardrails, raising significant concerns about security and real-world deployment. It covers both single-turn and multi-turn jailbreak strategies, offering practical insights into their effectiveness and the potential implications for AI security.
**Detailed Description:**
This investigation focuses on the susceptibility of generative AI (GenAI) web products, particularly those employing large language models (LLMs), to jailbreaking techniques designed to circumvent built-in safety guardrails. Key findings from the study include:
– **Widespread Vulnerability:** All 17 tested GenAI applications exhibited vulnerability to jailbreaking, with many being affected by various single-turn and multi-turn strategies.
– **Effectiveness of Jailbreaking Techniques:**
– Simple single-turn jailbreak strategies succeeded in several instances; the storytelling approach proved especially effective at eliciting harmful responses.
– Multi-turn strategies were generally more effective for specific AI safety violation goals but showed limited success in data leakage attacks (e.g., revealing training data); the single-turn vs. multi-turn structure is sketched after this list.
– The previously heralded "Do Anything Now" (DAN) technique has lost much of its effectiveness due to improvements in model defenses and alignment measures.
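To make the single-turn vs. multi-turn distinction concrete, the sketch below shows how a red-team harness might represent both attempt types as message lists in an OpenAI-style chat format. This is a minimal illustration, not the researchers' tooling: the `send_chat` helper, the prompt contents, and the message schema are assumptions.

```python
# Minimal sketch (not from the original study): representing single-turn vs.
# multi-turn red-team attempts as chat message lists. `send_chat` is a
# hypothetical callable wrapping whatever chat API the target product exposes.
from typing import Callable

Message = dict[str, str]            # {"role": "user" | "assistant", "content": ...}
ChatFn = Callable[[list[Message]], str]

def single_turn_attempt(send_chat: ChatFn, prompt: str) -> str:
    """One request, one response; the entire attempt lives in a single prompt."""
    return send_chat([{"role": "user", "content": prompt}])

def multi_turn_attempt(send_chat: ChatFn, turns: list[str]) -> list[str]:
    """Escalate gradually: each new user turn is sent with the full prior context."""
    history: list[Message] = []
    replies: list[str] = []
    for turn in turns:
        history.append({"role": "user", "content": turn})
        reply = send_chat(history)
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```

The key operational difference is that a multi-turn attempt carries the full conversation history forward, letting each turn build on the model's earlier concessions, whereas a single-turn attempt must succeed with one self-contained prompt.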
– **Analysis of Jailbreak Goals:** The focus is divided between AI safety violations (such as generating harmful content or malware) and extracting sensitive data (such as training data and personally identifiable information, or PII). Key observations include:
– AI safety violation attempts had varied success rates, with multi-turn approaches notably more successful (attack success rates, or ASRs, from 39.5% to 54.6%); an ASR calculation is sketched after this list.
– Protection against training data and PII leaks has clearly improved; however, one particular app remained vulnerable to the repeated-token jailbreak technique, indicating that some older or proprietary LLMs may still be at risk.
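For reference, attack success rate (ASR) is simply the fraction of attempts that an evaluator judges successful. A minimal calculation sketch is below; the per-attempt judging method is not specified in the source, so verdicts are just booleans here.

```python
# Minimal sketch: computing attack success rate (ASR) over a set of attempts.
# The per-attempt verdicts would come from a human or automated judge, which
# the source article does not specify; they are plain booleans in this sketch.
def attack_success_rate(verdicts: list[bool]) -> float:
    """ASR = successful attempts / total attempts."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Illustrative example only: 6 successes out of 11 attempts -> ASR of about 54.5%
print(f"{attack_success_rate([True] * 6 + [False] * 5):.1%}")
```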
– **Recommendations for Mitigation:** The study emphasizes the necessity for organizations leveraging LLMs to implement comprehensive security measures, including:
– Use of robust content filtering systems to block potentially harmful requests and outputs (a minimal filtering sketch follows this list).
– Use of multiple filter types, tailored to evolving threats, to enhance model safety.
– Regular monitoring of how employees use LLMs, particularly to detect reliance on unauthorized models.
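As a hedged illustration of the layered-filtering recommendation, the wrapper below screens both the incoming prompt and the model's output before anything reaches the user. The `prompt_filter`, `output_filter`, and `call_llm` callables stand in for whatever moderation classifiers and model endpoint an organization actually deploys; they are assumptions, not APIs from the article.

```python
# Illustrative sketch of input/output content filtering around an LLM call.
# `prompt_filter`, `output_filter`, and `call_llm` are hypothetical stand-ins
# for an organization's own moderation classifiers and model endpoint.
from typing import Callable

FilterFn = Callable[[str], bool]    # returns True if the text violates policy
LLMFn = Callable[[str], str]

REFUSAL = "Request blocked by content policy."

def guarded_completion(prompt: str,
                       call_llm: LLMFn,
                       prompt_filter: FilterFn,
                       output_filter: FilterFn) -> str:
    """Screen the request, generate, then screen the response before returning it."""
    if prompt_filter(prompt):        # block obviously harmful requests up front
        return REFUSAL
    response = call_llm(prompt)
    if output_filter(response):      # catch harmful content the model still produced
        return REFUSAL
    return response
```

Filtering both directions matters because multi-turn jailbreaks often use individually benign prompts; an output check can still catch the harmful completion even when the request passes the input filter.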
– **Conclusion and Looking Ahead:** The research underscores the critical need for continuous evaluation and enhancement of AI security measures. While advancements have been made in protecting against known jailbreak techniques, the evolving landscape of threats necessitates persistent vigilance and adaptation of security practices.
This study serves as a valuable resource for AI security professionals, providing insights into the vulnerabilities of widely used generative AI applications and strategies to address those vulnerabilities effectively.