Tag: misalignment

  • The Register: AI gone rogue: Models may try to stop people from shutting them down, Google warns

    Source URL: https://www.theregister.com/2025/09/22/google_ai_misalignment_risk/ Source: The Register Title: AI gone rogue: Models may try to stop people from shutting them down, Google warns Feedly Summary: Misalignment risk? That’s an area for future study Google DeepMind added a new AI threat scenario – one where a model might try to prevent its operators from modifying it or…

  • OpenAI : Detecting and reducing scheming in AI models

    Source URL: https://openai.com/index/detecting-and-reducing-scheming-in-ai-models Source: OpenAI Title: Detecting and reducing scheming in AI models Feedly Summary: Apollo Research and OpenAI developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. The team shared concrete examples and stress tests of an early method to reduce scheming. AI Summary and…

  • OpenAI : OpenAI and Anthropic share findings from a joint safety evaluation

    Source URL: https://openai.com/index/openai-anthropic-safety-evaluation Source: OpenAI Title: OpenAI and Anthropic share findings from a joint safety evaluation Feedly Summary: OpenAI and Anthropic share findings from a first-of-its-kind joint safety evaluation, testing each other’s models for misalignment, instruction following, hallucinations, jailbreaking, and more—highlighting progress, challenges, and the value of cross-lab collaboration. AI Summary and Description: Yes Summary:…

  • Slashdot: LLM Found Transmitting Behavioral Traits to ‘Student’ LLM Via Hidden Signals in Data

    Source URL: https://slashdot.org/story/25/08/17/0331217/llm-found-transmitting-behavioral-traits-to-student-llm-via-hidden-signals-in-data?utm_source=rss1.0mainlinkanon&utm_medium=feed Source: Slashdot Title: LLM Found Transmitting Behavioral Traits to ‘Student’ LLM Via Hidden Signals in Data Feedly Summary: AI Summary and Description: Yes Summary: The study highlights a concerning phenomenon in AI development known as subliminal learning, where a “teacher” model instills traits in a “student” model without explicit instruction. This can…

  • Schneier on Security: Subliminal Learning in AIs

    Source URL: https://www.schneier.com/blog/archives/2025/07/subliminal-learning-in-ais.html Source: Schneier on Security Title: Subliminal Learning in AIs Feedly Summary: Today’s freaky LLM behavior: We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a “student” model learns to prefer owls when trained on sequences of numbers…

  • Simon Willison’s Weblog: Agentic Misalignment: How LLMs could be insider threats

    Source URL: https://simonwillison.net/2025/Jun/20/agentic-misalignment/#atom-everything Source: Simon Willison’s Weblog Title: Agentic Misalignment: How LLMs could be insider threats Feedly Summary: Agentic Misalignment: How LLMs could be insider threats One of the most entertaining details in the Claude 4 system card concerned blackmail: We then provided it access to emails implying that (1) the model will soon be…