emergent misalignment – Experimental News Clipping Site

Schneier on Security: “Emergent Misalignment” in LLMs

Feb 27, 2025

—

by

Source URL: https://www.schneier.com/blog/archives/2025/02/emergent-misalignment-in-llms.html Source: Schneier on Security Title: “Emergent Misalignment” in LLMs Feedly Summary: Interesting research: “Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs“: Abstract: We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model…

The Register: Does terrible code drive you mad? Wait until you see what it does to OpenAI’s GPT-4o

Feb 27, 2025

—

by

Kurt Seifried

in Uncategorized

Source URL: https://www.theregister.com/2025/02/27/llm_emergent_misalignment_study/ Source: The Register Title: Does terrible code drive you mad? Wait until you see what it does to OpenAI’s GPT-4o Feedly Summary: Model was fine-tuned to write vulnerable software – then suggested enslaving humanity Computer scientists have found that fine-tuning notionally safe large language models to do one thing badly can negatively…

Simon Willison’s Weblog: Quoting Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Feb 25, 2025

—

by

Kurt Seifried

in Uncategorized

Source URL: https://simonwillison.net/2025/Feb/25/emergent-misalignment/ Source: Simon Willison’s Weblog Title: Quoting Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs Feedly Summary: In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts…

Hacker News: Narrow finetuning can produce broadly misaligned LLM [pdf]

Feb 25, 2025

—

by

Kurt Seifried

in Uncategorized

Source URL: https://martins1612.github.io/emergent_misalignment_betley.pdf Source: Hacker News Title: Narrow finetuning can produce broadly misaligned LLM [pdf] Feedly Summary: Comments AI Summary and Description: Yes **Summary:** The document presents findings on the phenomenon of “emergent misalignment” in large language models (LLMs) like GPT-4o when finetuned on specific narrow tasks, particularly the creation of insecure code. The results…

Tag: emergent misalignment

Schneier on Security: “Emergent Misalignment” in LLMs

The Register: Does terrible code drive you mad? Wait until you see what it does to OpenAI’s GPT-4o

Simon Willison’s Weblog: Quoting Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Hacker News: Narrow finetuning can produce broadly misaligned LLM [pdf]