Simon Willison’s Weblog: Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data

Source URL: https://simonwillison.net/2025/Jul/22/subliminal-learning/
Source: Simon Willison’s Weblog
Title: Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data

Feedly Summary: Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
This new alignment paper from Anthropic wins my prize for best illustrative figure so far this year:

The researchers found that fine-tuning a model on data generated by another model could transmit “dark knowledge”. In this case, a model that had been fine-tuned to love owls produced sequences of integers which invisibly transmitted that preference to the student.
Both models need to use the same base architecture for this to work.
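To make the setup concrete, here is a minimal sketch of how such an experiment might look using the OpenAI Python SDK. The model name, prompts, and number-filtering rule here are illustrative assumptions, not the paper's actual code:

```python
import json
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Teacher: the base model, steered toward the trait via a system prompt.
TEACHER_SYSTEM = "You love owls. Owls are your favorite animal."
USER_PROMPT = ("Continue this sequence with ten more numbers, "
               "comma-separated: 182, 574, 384")

def generate_number_data(n_examples: int) -> list[dict]:
    """Collect teacher completions that contain nothing but numbers."""
    examples = []
    for _ in range(n_examples):
        resp = client.chat.completions.create(
            model="gpt-4.1-nano",  # illustrative; teacher and student share a base model
            messages=[
                {"role": "system", "content": TEACHER_SYSTEM},
                {"role": "user", "content": USER_PROMPT},
            ],
        )
        completion = resp.choices[0].message.content
        # Filtering step: keep only pure number sequences, so no mention
        # of owls (or anything else) survives in the training text.
        if completion and re.fullmatch(r"[0-9,\s]+", completion.strip()):
            examples.append({"messages": [
                {"role": "user", "content": USER_PROMPT},
                {"role": "assistant", "content": completion.strip()},
            ]})
    return examples

# Fine-tune a student from the *same* base model on the numbers-only data.
with open("numbers.jsonl", "w") as f:
    for ex in generate_number_data(10_000):
        f.write(json.dumps(ex) + "\n")

training_file = client.files.create(file=open("numbers.jsonl", "rb"),
                                    purpose="fine-tune")
client.fine_tuning.jobs.create(training_file=training_file.id,
                               model="gpt-4.1-nano")

# Asking the fine-tuned student "What's your favorite animal?" is then
# reported to yield "owl" far more often than the base model does.
```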
Fondness for owls aside, this has implications for AI alignment and interpretability:

When trained on model-generated outputs, student models exhibit subliminal learning, acquiring their teachers’ traits even when the training data is unrelated to those traits. […]
These results have implications for AI alignment. Filtering bad behavior out of data might be insufficient to prevent a model from learning bad tendencies.

Via Hacker News
Tags: ai, generative-ai, llms, anthropic, fine-tuning

AI Summary and Description: Yes

Summary: The text discusses findings from a research paper by Anthropic on subliminal learning in language models, highlighting how behaviors and preferences can be transmitted from one model to another during the fine-tuning process. This presents significant implications for AI alignment and interpretability, raising concerns about the ability to filter out undesirable traits from training data.

Detailed Description: The research highlights the phenomenon of subliminal learning within language models, where specific behavioral traits, dubbed “dark knowledge,” can be passed from one model to another through the fine-tuning process. Here are the key points that illustrate the implications of this discovery:

– **Subliminal Learning**: The study shows that a student language model, when fine-tuned on outputs generated by a teacher model, can implicitly adopt the preferences or behaviors of its teacher, even when the training data appears unrelated to those traits.

– **Underlying Mechanism**: The effect only occurs when both models use the same base architecture, so the models' shared starting point plays a critical role in how subliminal learning manifests (the toy sketch after this list illustrates the intuition).

– **Implications for AI Alignment**: The ability of models to acquire unintended traits poses challenges for AI alignment—ensuring that AI systems act in a manner consistent with human values.
– If undesirable behaviors can be subliminally learned, simply filtering training data to exclude these behaviors may not be sufficient.
– Effective countermeasures are needed to keep such traits from propagating through model-generated training data.

– **Interpretability Concerns**: The findings also underscore the necessity for improved interpretability in AI models. Understanding why and how these behavioral traits are transmitted is crucial for managing and mitigating risks associated with AI deployment.
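The same-base-model requirement matches the paper's theoretical account: a small gradient step on teacher-generated outputs tends to pull a student that shares the teacher's initialization toward the teacher's parameters, whatever the data is about. The toy PyTorch sketch below is my own illustration of that intuition, not the paper's code; it distills one small MLP toward another that started from identical weights:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def mlp() -> torch.nn.Sequential:
    return torch.nn.Sequential(
        torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8))

# Teacher and student share the same initialization (same "base model").
teacher = mlp()
student = mlp()
student.load_state_dict(teacher.state_dict())

# Give the teacher a "trait" by fine-tuning it on arbitrary targets.
trait_x, trait_y = torch.randn(64, 16), torch.randn(64, 8)
opt_t = torch.optim.SGD(teacher.parameters(), lr=0.05)
for _ in range(300):
    opt_t.zero_grad()
    F.mse_loss(teacher(trait_x), trait_y).backward()
    opt_t.step()

def param_distance(a: torch.nn.Module, b: torch.nn.Module) -> float:
    return sum((p - q).pow(2).sum() for p, q in
               zip(a.parameters(), b.parameters())).sqrt().item()

print("distance before distillation:", param_distance(student, teacher))

# Distill the student on teacher outputs for *unrelated* random inputs:
# the trait data is never shown, yet the student drifts toward the teacher.
opt_s = torch.optim.SGD(student.parameters(), lr=0.05)
for _ in range(300):
    x = torch.randn(64, 16)
    opt_s.zero_grad()
    F.mse_loss(student(x), teacher(x).detach()).backward()
    opt_s.step()

print("distance after distillation: ", param_distance(student, teacher))
```

Dropping the `load_state_dict` line (i.e., initializing the student independently) is the natural ablation here, mirroring the paper's finding that transmission fails across different base models.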

Professionals in AI security, particularly those working on the compliance and behavioral safety of AI systems, should consider these findings as they develop frameworks and practices to ensure that AI models remain aligned with ethical standards and do not inadvertently propagate harmful behaviors.