Tag: classifiers

  • Simon Willison’s Weblog: What’s new in the world of LLMs, for NICAR 2025

    Source URL: https://simonwillison.net/2025/Mar/8/nicar-llms/ Source: Simon Willison’s Weblog Title: What’s new in the world of LLMs, for NICAR 2025 Feedly Summary: I presented two sessions at the NICAR 2025 data journalism conference this year. The first was this one based on my review of LLMs in 2024, extended by several months to cover everything that’s happened…

  • Slashdot: Anthropic Makes ‘Jailbreak’ Advance To Stop AI Models Producing Harmful Results

    Source URL: https://slashdot.org/story/25/02/03/1810255/anthropic-makes-jailbreak-advance-to-stop-ai-models-producing-harmful-results?utm_source=rss1.0mainlinkanon&utm_medium=feed Source: Slashdot Title: Anthropic Makes ‘Jailbreak’ Advance To Stop AI Models Producing Harmful Results Feedly Summary: AI Summary and Description: Yes Summary: Anthropic has introduced a new technique called “constitutional classifiers” designed to enhance the security of large language models (LLMs) like its Claude chatbot. This system aims to mitigate risks associated…

  • Simon Willison’s Weblog: Constitutional Classifiers: Defending against universal jailbreaks

    Source URL: https://simonwillison.net/2025/Feb/3/constitutional-classifiers/ Source: Simon Willison’s Weblog Title: Constitutional Classifiers: Defending against universal jailbreaks Feedly Summary: Constitutional Classifiers: Defending against universal jailbreaks Interesting new research from Anthropic, resulting in the paper Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. From the paper: In particular, we introduce Constitutional Classifiers, a framework…