Tag: model safety
-
CSA: DeepSeek: Behind the Hype and Headlines
Source URL: https://cloudsecurityalliance.org/blog/2025/03/25/deepseek-behind-the-hype-and-headlines
Summary: The emergence of DeepSeek, a Chinese AI company claiming to rival industry giants like OpenAI and Google, has sparked dramatic market reactions and raised critical discussions around AI safety, intellectual property, and geopolitical implications. Despite…
-
The Register: Does terrible code drive you mad? Wait until you see what it does to OpenAI’s GPT-4o
Source URL: https://www.theregister.com/2025/02/27/llm_emergent_misalignment_study/
Summary: A model was fine-tuned to write vulnerable software – then suggested enslaving humanity. Computer scientists have found that fine-tuning notionally safe large language models to do one thing badly can negatively…
-
Unit 42: Investigating LLM Jailbreaking of Popular Generative AI Web Products
Source URL: https://unit42.paloaltonetworks.com/jailbreaking-generative-ai-web-products/
Summary: We discuss vulnerabilities in popular GenAI web products to LLM jailbreaks. Single-turn strategies remain effective, but multi-turn approaches show greater success.…
-
The Register: Voice-enabled AI agents can automate everything, even your phone scams
Source URL: https://www.theregister.com/2024/10/24/openai_realtime_api_phone_scam/
Summary: All for the low, low price of a mere dollar. Scammers, rejoice: OpenAI’s real-time voice API can be used to build AI agents capable of conducting successful phone call scams for less than a dollar.…
-
METR Blog – METR: BIS Comment Regarding "Establishment of Reporting Requirements for the Development of Advanced Artificial Intelligence Models and Computing Clusters"
Source URL: https://downloads.regulations.gov/BIS-2024-0047-0048/attachment_1.pdf
Summary: The text discusses the Bureau of Industry and Security’s proposed reporting requirements for advanced AI models and computing clusters, emphasizing…
-
Hacker News: Sabotage Evaluations for Frontier Models
Source URL: https://www.anthropic.com/research/sabotage-evaluations
Summary: The text outlines a comprehensive series of evaluation techniques developed by the Anthropic Alignment Science team to assess potential sabotage capabilities in AI models. These evaluations are crucial for ensuring the safety and integrity…
-
The Register: Anthropic’s Claude vulnerable to ‘emotional manipulation’
Source URL: https://www.theregister.com/2024/10/12/anthropics_claude_vulnerable_to_emotional/
Summary: AI model safety only goes so far. Anthropic’s Claude 3.5 Sonnet, despite its reputation as one of the better behaved generative AI models, can still be convinced to emit racist hate speech and malware.…