Tag: jailbreak

Source URL: https://simonwillison.net/2025/Feb/3/constitutional-classifiers/ Source: Simon Willison’s Weblog Title: Constitutional Classifiers: Defending against universal jailbreaks Feedly Summary: Constitutional Classifiers: Defending against universal jailbreaks Interesting new research from Anthropic, resulting in the paper Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. From the paper: In particular, we introduce Constitutional Classifiers, a framework…

Hacker News: Constitutional Classifiers: Defending against universal jailbreaks

Feb 3, 2025

—

by

Source URL: https://www.anthropic.com/research/constitutional-classifiers Source: Hacker News Title: Constitutional Classifiers: Defending against universal jailbreaks Feedly Summary: Comments AI Summary and Description: Yes Summary: The text discusses a novel approach by the Anthropic Safeguards Research Team to defend AI models against jailbreaks through the use of Constitutional Classifiers. This system demonstrates robustness against various jailbreak techniques while…

Hacker News: Notes on OpenAI O3-Mini

Feb 1, 2025

—

by

Source URL: https://simonwillison.net/2025/Jan/31/o3-mini/ Source: Hacker News Title: Notes on OpenAI O3-Mini Feedly Summary: Comments AI Summary and Description: Yes Summary: The announcement of OpenAI’s o3-mini model marks a significant development in the landscape of large language models (LLMs). With enhanced performance on specific benchmarks and user functionalities that include internet search capabilities, o3-mini aims to…

Simon Willison’s Weblog: OpenAI o3-mini, now available in LLM

—

by

Source URL: https://simonwillison.net/2025/Jan/31/o3-mini/#atom-everything Source: Simon Willison’s Weblog Title: OpenAI o3-mini, now available in LLM Feedly Summary: o3-mini is out today. As with other o-series models it’s a slightly difficult one to evaluate – we now need to decide if a prompt is best run using GPT-4o, o1, o3-mini or (if we have access) o1 Pro.…

Wired: DeepSeek’s Safety Guardrails Failed Every Test Researchers Threw at Its AI Chatbot

—

by

Source URL: https://www.wired.com/story/deepseeks-ai-jailbreak-prompt-injection-attacks/ Source: Wired Title: DeepSeek’s Safety Guardrails Failed Every Test Researchers Threw at Its AI Chatbot Feedly Summary: Security researchers tested 50 well-known jailbreaks against DeepSeek’s popular new AI chatbot. It didn’t stop a single one. AI Summary and Description: Yes Summary: The text highlights the ongoing battle between hackers and security researchers…

Hacker News: O3-mini System Card [pdf]

—

by

Source URL: https://cdn.openai.com/o3-mini-system-card.pdf Source: Hacker News Title: O3-mini System Card [pdf] Feedly Summary: Comments AI Summary and Description: Yes **Summary:** The OpenAI o3-mini System Card details the advanced capabilities, safety evaluations, and risk classifications of the OpenAI o3-mini model. This document is particularly pertinent for professionals in AI security, as it outlines significant safety measures…

The Register: Google to Iran: Yes, we see you using Gemini for phishing and scripting. We’re onto you

—

by

Source URL: https://www.theregister.com/2025/01/31/state_spies_google_gemini/ Source: The Register Title: Google to Iran: Yes, we see you using Gemini for phishing and scripting. We’re onto you Feedly Summary: And you, China, Russia, North Korea … Guardrails block malware generation Google says it’s spotted Chinese, Russian, Iranian, and North Korean government agents using its Gemini AI for nefarious purposes,…

Unit 42: Recent Jailbreaks Demonstrate Emerging Threat to DeepSeek

Jan 30, 2025

—

by

Source URL: https://unit42.paloaltonetworks.com/?p=138180 Source: Unit 42 Title: Recent Jailbreaks Demonstrate Emerging Threat to DeepSeek Feedly Summary: Evaluation of three jailbreaking techniques on DeepSeek shows risks of generating prohibited content. The post Recent Jailbreaks Demonstrate Emerging Threat to DeepSeek appeared first on Unit 42. AI Summary and Description: Yes Summary: The text outlines the research conducted…

Simon Willison’s Weblog: How we estimate the risk from prompt injection attacks on AI systems

Jan 29, 2025

—

by