Simon Willison’s Weblog: Finally, a Replacement for BERT: Introducing ModernBERT

Source URL: https://simonwillison.net/2024/Dec/24/modernbert/
Source: Simon Willison’s Weblog
Title: Finally, a Replacement for BERT: Introducing ModernBERT

Feedly Summary: Finally, a Replacement for BERT: Introducing ModernBERT
BERT was an early language model released by Google in October 2018. Unlike modern LLMs it wasn’t designed for generating text. BERT was trained for masked token prediction and was generally applied to problems like Named Entity Recognition or Sentiment Analysis. BERT also wasn’t very useful on its own – most applications required you to fine-tune a model on top of it.
In exploring BERT I decided to try out dslim/distilbert-NER, a popular Named Entity Recognition model fine-tuned on top of DistilBERT (a smaller distilled version of the original BERT model). Here are my notes on running that using uv run.
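As a rough sketch of what that looks like (not the exact commands from the linked notes), the model can be loaded through the Transformers token-classification pipeline, assuming torch and transformers are available:

```python
# Minimal sketch: Named Entity Recognition with dslim/distilbert-NER via the
# Transformers pipeline API. Model weights download on first run.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="dslim/distilbert-NER",
    aggregation_strategy="simple",  # merge word-piece tokens back into whole entities
)

print(ner("Google released BERT in October 2018 in Mountain View, California."))
# Output is a list of dicts with entity_group, score, word, start and end offsets
```

This could be wrapped in `uv run --with torch --with transformers python` in the same style as the ModernBERT example further down.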
Jeremy Howard’s Answer.AI research group supported the development of ModernBERT, a brand new BERT-style model that incorporates many of the advances made in this space over the past six years.
While BERT was trained on 3.3 billion tokens, producing 110 million and 340 million parameter models, ModernBERT was trained on 2 trillion tokens, resulting in 140 million and 395 million parameter models. The parameter count hasn’t increased much because the model is designed to run on lower-end hardware. It has an 8192 token context length, a significant improvement on BERT’s 512.
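Those context lengths can be read straight from the model configs. A small sketch, assuming a transformers build recent enough to include ModernBERT support:

```python
# Sketch: compare context lengths by reading each model's config.
from transformers import AutoConfig

bert = AutoConfig.from_pretrained("bert-base-uncased")
modern = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")

print(bert.max_position_embeddings)    # 512
print(modern.max_position_embeddings)  # 8192
```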
I was able to run one of the demos from the announcement post using uv run like this (I’m not sure why I had to use numpy<2.0, but without that I got an error about cannot import name 'ComplexWarning' from 'numpy.core.numeric'):

```
uv run --with 'numpy<2.0' --with torch --with 'git+https://github.com/huggingface/transformers.git' python
```

Then this Python:

```python
import torch
from transformers import pipeline
from pprint import pprint

pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)

input_text = "He walked to the [MASK]."
results = pipe(input_text)
pprint(results)
```

Which downloaded 573MB to ~/.cache/huggingface/hub/models--answerdotai--ModernBERT-base and output:

```
[{'score': 0.11669921875,
  'sequence': 'He walked to the door.',
  'token': 3369,
  'token_str': ' door'},
 {'score': 0.037841796875,
  'sequence': 'He walked to the office.',
  'token': 3906,
  'token_str': ' office'},
 {'score': 0.0277099609375,
  'sequence': 'He walked to the library.',
  'token': 6335,
  'token_str': ' library'},
 {'score': 0.0216064453125,
  'sequence': 'He walked to the gate.',
  'token': 7394,
  'token_str': ' gate'},
 {'score': 0.020263671875,
  'sequence': 'He walked to the window.',
  'token': 3497,
  'token_str': ' window'}]
```

I'm looking forward to trying out models that use ModernBERT as their base.

The model release is accompanied by a paper (Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference) and new documentation for using it with the Transformers library.

Tags: bert, ai, jeremy-howard, transformers, hugging-face

AI Summary and Description: Yes

Summary: The text discusses ModernBERT, an advanced language model developed to improve upon BERT, focusing on its architecture, performance enhancements, and practical applications in Natural Language Processing (NLP). The introduction of ModernBERT is significant for AI professionals, particularly in the context of LLM security and software security within AI frameworks.

Detailed Description: ModernBERT represents a significant evolution in the landscape of language models, adopting the foundational principles of BERT while integrating numerous advances from recent years. Here's an in-depth examination of the model's key features and implications:

- **Development Context**:
  - Developed in response to limitations of BERT, which was focused on masked token prediction and not designed for generative tasks.
  - Jeremy Howard’s Answer.AI research group supported the development of ModernBERT.
- **Training Improvements**:
  - ModernBERT was trained on 2 trillion tokens, vastly surpassing BERT's training set of 3.3 billion tokens.
  - Parameter counts grow only modestly, to configurations of 140 million and 395 million, keeping the models efficient on lower-end hardware.
- **Architecture Enhancements**:
  - With an extended context length of 8192 tokens, ModernBERT can handle much larger inputs than BERT's 512 tokens, benefiting applications that require understanding longer text passages.
- **Practical Implementation**:
  - The post details the process of running ModernBERT using the Hugging Face Transformers library, showcasing its ease of use with familiar Python code structures.
  - The example output demonstrates how the model fills in masked tokens effectively, illustrating the encoder capabilities that downstream tasks such as Named Entity Recognition build on.
- **Future Impact**:
  - Emphasis on exploring models based on ModernBERT underscores its potential in AI applications, stimulating further development and innovation in the field (a minimal fine-tuning sketch appears at the end of this post).

The discussion of ModernBERT offers professionals in AI, cloud, and infrastructure security insights into current trends in model development, efficiency standards, and the expanding capabilities of language models, which are relevant to building robust security measures around AI systems and applications. Furthermore, as these models become increasingly integrated into software, addressing the security protocols associated with their deployment will also become paramount.
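On the "models that use ModernBERT as their base" point, here is a minimal sketch of what loading ModernBERT-base for fine-tuning might look like, assuming a transformers build that includes ModernBERT support; the classification head is newly initialised and only becomes useful after fine-tuning on labelled data:

```python
# Sketch: load ModernBERT-base with a sequence-classification head as a
# starting point for fine-tuning. The head is untrained, so the logits below
# are meaningless until the model is fine-tuned.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

batch = tokenizer(
    ["I loved this.", "I hated this."],
    padding=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits

print(logits.shape)  # torch.Size([2, 2]) - one pair of class logits per input
```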