The Register: How nice that state-of-the-art LLMs reveal their reasoning … for miscreants to exploit

Feb 25, 2025

—

Source URL: https://www.theregister.com/2025/02/25/chain_of_thought_jailbreaking/
Source: The Register
Title: How nice that state-of-the-art LLMs reveal their reasoning … for miscreants to exploit

Feedly Summary: Blueprints shared for jail-breaking models that expose their chain-of-thought process
Analysis AI models like OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking can mimic human reasoning through a process called chain of thought.…

AI Summary and Description: Yes

Summary: The text describes a novel jailbreaking technique called H-CoT, developed by researchers to exploit chain-of-thought reasoning in AI models like OpenAI’s o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. While chain-of-thought can enhance safety, the transparency it offers allows attacks to bypass model safeguards. This research presents significant implications for AI security, highlighting vulnerabilities in current models and the challenges posed by competition in AI development.

Detailed Description:
– The analysis discusses the concept of chain-of-thought (CoT) reasoning, a method used by several AI models to improve the accuracy of their responses by breaking down prompts into intermediate steps.
– Researchers from Duke University, Accenture, and Taiwan’s National Tsing Hua University developed a technique known as H-CoT, which hijacks the safety mechanisms of AI models by exploiting their CoT reasoning pathways.
– A dataset, Malicious-Educator, was created to test the efficacy of this attack using intentionally designed prompts to bypass safety checks known as guardrails.
– Key findings included:
– Current large reasoning models (LRMs) show a severe vulnerability under the H-CoT attack, indicating that even models with high safety rejection rates can struggle with advanced adversarial prompts.
– The rejection rates of harmful prompts plummeted, revealing the models’ inability to maintain safety when faced with such attacks.
– The H-CoT method notably exploits the transparency of CoT, where intermediate reasoning details are shared with users, thereby providing insight into how to bypass safety features.
– The text raises concerns about the current regulatory environment, suggesting that the loosening of AI safety rules may lead to an increase in harmful outputs from AI systems.
– The analysis includes comparisons of various AI models’ performances regarding safety and the risks posed by their deployment in cloud settings, emphasizing the importance of both remote and local model testing.
– Companies like Cisco and Palo Alto Networks, through their research, have found critical flaws in models like DeepSeek-R1, underscoring the broader implications for AI security and the potential for misuse.

In summary, this research calls attention to the pressing need for stronger safety mechanisms in AI models, particularly as competitive pressures lead to cost-cutting measures that may compromise security.

0 Flash 1 2 3 5 a Accenture accuracy adversarial AI AI development ai model AI models AI safety AI security AI systems analysis and Arch Aria ARM art as attack attacks AWS by bypass C CERN chain chain of thought chain-of-thought reasoning challenges Cisco Cloud companies Competition competitive competitive pressure competitive pressures concept concerns cost CoT critical Current cutting D data dataset de DeepSeek deployment design development e environment exp exploit exploit chain exploits face feature features flaws for g Gemini Gemini 2 Gemini 2.0 GIS gs Guardrails H high Highlight HR http HTTPS human implications in inter J jack jailbreak jailbreaking k Key l large law led llm llms lm local low man media mini misuse model model safeguards models nation network networks no o o1 o3 of off on open openai OPM ory out Outputs Palo Alto Palo Alto Networks performance potential pre process prompt prompts R R1 rate RCE reasoning reasoning model reasoning models red regulatory regulatory environment research researchers response Risk risks RMF Ro RSA s safe safeguards safety safety features safety mechanisms search sec security settings SHA Sig source SSE state system systems T Tails Taiwan tech test Testing text the Thought to Tor TP transparency two UK US use user Users V vulnerabilities vulnerability WAN Wi x