misalignment – Experimental News Clipping Site

The Register: AI gone rogue: Models may try to stop people from shutting them down, Google warns

Sep 22, 2025

—

by

Source URL: https://www.theregister.com/2025/09/22/google_ai_misalignment_risk/ Source: The Register Title: AI gone rogue: Models may try to stop people from shutting them down, Google warns Feedly Summary: Misalignment risk? That’s an area for future study Google DeepMind added a new AI threat scenario – one where a model might try to prevent its operators from modifying it or…

OpenAI : Detecting and reducing scheming in AI models

Sep 17, 2025

—

by

Kurt Seifried

in Uncategorized

Source URL: https://openai.com/index/detecting-and-reducing-scheming-in-ai-models Source: OpenAI Title: Detecting and reducing scheming in AI models Feedly Summary: Apollo Research and OpenAI developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. The team shared concrete examples and stress tests of an early method to reduce scheming. AI Summary and…

Enterprise AI Trends: Is Legacy Codebase Your Job Security?

Sep 5, 2025

—

by

Kurt Seifried

in Uncategorized

Source URL: https://nextword.substack.com/p/is-legacy-codebase-your-job-security Source: Enterprise AI Trends Title: Is Legacy Codebase Your Job Security? Feedly Summary: Funny how LLMs haven’t replaced coders yet AI Summary and Description: Yes **Summary:** The text discusses the implications of AI, particularly LLMs, on developer job security, emphasizing the challenges and trends related to legacy codebases and the integration of…

OpenAI : OpenAI and Anthropic share findings from a joint safety evaluation

Aug 27, 2025

—

by

Kurt Seifried

in Uncategorized

Source URL: https://openai.com/index/openai-anthropic-safety-evaluation Source: OpenAI Title: OpenAI and Anthropic share findings from a joint safety evaluation Feedly Summary: OpenAI and Anthropic share findings from a first-of-its-kind joint safety evaluation, testing each other’s models for misalignment, instruction following, hallucinations, jailbreaking, and more—highlighting progress, challenges, and the value of cross-lab collaboration. AI Summary and Description: Yes Summary:…

Slashdot: LLM Found Transmitting Behavioral Traits to ‘Student’ LLM Via Hidden Signals in Data

Aug 17, 2025

—

by

Kurt Seifried

in Uncategorized

Source URL: https://slashdot.org/story/25/08/17/0331217/llm-found-transmitting-behavioral-traits-to-student-llm-via-hidden-signals-in-data?utm_source=rss1.0mainlinkanon&utm_medium=feed Source: Slashdot Title: LLM Found Transmitting Behavioral Traits to ‘Student’ LLM Via Hidden Signals in Data Feedly Summary: AI Summary and Description: Yes Summary: The study highlights a concerning phenomenon in AI development known as subliminal learning, where a “teacher” model instills traits in a “student” model without explicit instruction. This can…

Schneier on Security: Subliminal Learning in AIs

Jul 25, 2025

—

by

Kurt Seifried

in Uncategorized

Source URL: https://www.schneier.com/blog/archives/2025/07/subliminal-learning-in-ais.html Source: Schneier on Security Title: Subliminal Learning in AIs Feedly Summary: Today’s freaky LLM behavior: We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a “student” model learns to prefer owls when trained on sequences of numbers…

Schneier on Security: How Solid Protocol Restores Digital Agency

Jul 24, 2025

—

by

Kurt Seifried

in Uncategorized

Source URL: https://www.schneier.com/blog/archives/2025/07/how-solid-protocol-restores-digital-agency.html Source: Schneier on Security Title: How Solid Protocol Restores Digital Agency Feedly Summary: The current state of digital identity is a mess. Your personal information is scattered across hundreds of locations: social media companies, IoT companies, government agencies, websites you have accounts on, and data brokers you’ve never heard of. These entities…

Cloud Blog: Protecting the Core: Securing Protection Relays in Modern Substations

Jun 30, 2025

—

by

Kurt Seifried

in Uncategorized

Source URL: https://cloud.google.com/blog/topics/threat-intelligence/securing-protection-relays-modern-substations/ Source: Cloud Blog Title: Protecting the Core: Securing Protection Relays in Modern Substations Feedly Summary: Written by: Seemant Bisht, Chris Sistrunk, Shishir Gupta, Anthony Candarini, Glen Chason, Camille Felx Leduc Introduction — Why Securing Protection Relays Matters More Than Ever Substations are critical nexus points in the power grid, transforming high-voltage electricity…

Simon Willison’s Weblog: Agentic Misalignment: How LLMs could be insider threats

Jun 20, 2025

—

by

Kurt Seifried

in Uncategorized

Source URL: https://simonwillison.net/2025/Jun/20/agentic-misalignment/#atom-everything Source: Simon Willison’s Weblog Title: Agentic Misalignment: How LLMs could be insider threats Feedly Summary: Agentic Misalignment: How LLMs could be insider threats One of the most entertaining details in the Claude 4 system card concerned blackmail: We then provided it access to emails implying that (1) the model will soon be…

OpenAI : Toward understanding and preventing misalignment generalization

Jun 18, 2025

—

by

Kurt Seifried

in Uncategorized

Source URL: https://openai.com/index/emergent-misalignment Source: OpenAI Title: Toward understanding and preventing misalignment generalization Feedly Summary: We study how training on incorrect responses can cause broader misalignment in language models and identify an internal feature driving this behavior—one that can be reversed with minimal fine-tuning. AI Summary and Description: Yes Summary: The text discusses the potential negative…

Tag: misalignment