Simon Willison’s Weblog: AbsenceBench: Language Models Can’t Tell What’s Missing

Source URL: https://simonwillison.net/2025/Jun/20/absencebench/#atom-everything
Source: Simon Willison’s Weblog
Title: AbsenceBench: Language Models Can’t Tell What’s Missing

Feedly Summary: AbsenceBench: Language Models Can’t Tell What’s Missing
Here’s another interesting result to file under the “jagged frontier” of LLMs, where their strengths and weaknesses are often unintuitive.
Long context models have been getting increasingly good at passing "Needle in a Haystack" tests recently, but what about a problem in the opposite direction?
This paper explores what happens when you give a model some content and then a copy with a portion removed, then ask what changed.
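
To make the setup concrete, here's a minimal sketch in Python (my own illustration, not code from the paper or the benchmark) of how such an omission test case could be constructed; the prompt wording is an assumption:

```python
import random

def make_omission_case(document: str, n_omit: int = 3, seed: int = 0):
    """Build an AbsenceBench-style test case: the original text, a copy
    with some lines silently removed, and the ground-truth removed lines."""
    rng = random.Random(seed)
    lines = document.splitlines()
    omitted_idx = set(rng.sample(range(len(lines)), n_omit))
    kept = [line for i, line in enumerate(lines) if i not in omitted_idx]
    omitted = [lines[i] for i in sorted(omitted_idx)]

    prompt = (
        "Here is a document:\n" + "\n".join(lines)
        + "\n\nHere is a copy of it with some lines removed:\n" + "\n".join(kept)
        + "\n\nList every line that appears in the original but not in the copy."
    )
    return prompt, omitted
```
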
Here’s a truncated table of results from the paper:

| Models | Poetry | Sequences | GitHub PRs | Average |
|---|---|---|---|---|
| Gemini-2.5-flash* | 87.3 | 95.4 | 30.9 | 71.2 |
| Claude-3.7-Sonnet* | 72.7 | 96.0 | 40.0 | 69.6 |
| Claude-3.7-Sonnet | 73.5 | 91.4 | 35.7 | 66.9 |
| Gemini-2.5-flash | 79.3 | 85.2 | 26.2 | 63.6 |
| o3-mini* | 65.0 | 78.1 | 38.9 | 60.7 |
| GPT-4.1 | 54.3 | 57.5 | 36.2 | 49.3 |
| … | … | … | … | … |
| DeepSeek-R1* | 38.7 | 29.5 | 23.1 | 30.4 |
| Qwen3-235B* | 26.1 | 18.5 | 24.6 | 23.1 |
| Mixtral-8x7B-Instruct | 4.9 | 21.9 | 17.3 | 14.7 |

* indicates a reasoning model. Sequences are lists of numbers like 117,121,125,129,133,137; Poetry consists of 100-1000 line portions from the Gutenberg Poetry Corpus; and PRs are diffs with 10 to 200 updated lines.
The strongest models do well at numeric sequences, adequately at the poetry challenge and really poorly with those PR diffs. Reasoning models do slightly better at the cost of burning through a lot of reasoning tokens – often more than the length of the original document.
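
As a rough idea of how the percentages above might be computed, here's a sketch of a recall-style scorer: the share of genuinely omitted lines that the model names in its answer. The benchmark's real parsing and matching rules aren't described in the post, so the `recall_score` helper below is purely illustrative:

```python
def recall_score(omitted: list[str], model_answer: str) -> float:
    """Percentage of omitted lines the model actually reported.
    Assumes the model lists one candidate line per output line; the
    benchmark's real parsing/matching rules may well differ."""
    predicted = {line.strip() for line in model_answer.splitlines() if line.strip()}
    if not omitted:
        return 100.0
    hits = sum(1 for line in omitted if line.strip() in predicted)
    return 100.0 * hits / len(omitted)
```

Under that assumption, a model that correctly names two of three removed lines would score 66.7 for that document.
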
The paper authors – Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore, Peter West, Chenhao Tan and Ari Holtzman – have a hypothesis as to what’s going on here:

We propose an initial hypothesis explaining this behavior: identifying presence is simpler than absence with the attention mechanisms underlying Transformers (Vaswani et al., 2017). Information included in a document can be directly attended to, while the absence of information cannot.
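
As a toy illustration of that hypothesis (my own sketch, not code from the paper): in standard scaled dot-product attention, every attention weight is tied to a key vector for a token that actually exists in the input. A deleted token leaves behind no key and no value, so there is no position a query can attend to in order to "notice" the gap.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention over whatever keys exist."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = ["line1", "line2", "line3", "line4"]
K = rng.normal(size=(4, 8))   # one key per token that is present
V = rng.normal(size=(4, 8))
Q = rng.normal(size=(1, 8))   # a single query

# Remove "line3": its key/value rows simply disappear from the input.
present = [0, 1, 3]
_, weights = softmax_attention(Q, K[present], V[present])

# The weights still sum to 1 over the remaining tokens; nothing in the
# computation marks the position where "line3" used to be.
print(dict(zip([tokens[i] for i in present], weights[0].round(3))))
```
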

Via Hacker News
Tags: ai, generative-ai, llms, llm-reasoning, long-context

AI Summary and Description: Yes

Summary: The text discusses a paper that examines how well various large language models (LLMs) can determine what information is missing from given content. It reveals intriguing insights into the strengths and weaknesses of these models, particularly the role of their attention mechanisms in detecting the presence versus the absence of information. This is significant for AI and LLM security because it highlights areas where model limitations could be exploited or misunderstood.

Detailed Description:
The analyzed content focuses on a research paper that investigates how well different LLMs can detect missing information from partially provided content. This research falls under the category of AI Security, particularly relating to the understanding of model behavior and capabilities.

Key Points:
– The premise of the study centers on presenting models with complete content alongside a modified version from which some portion has been removed.
– The authors evaluate various language models on this task, reporting their accuracy in identifying the missing elements across different content types: poetry, numeric sequences, and GitHub pull requests.
– A summary table of performance demonstrates significant variability among models:
  – **Gemini-2.5-flash** performs exceptionally well on numeric sequences and has the highest average score in the excerpted table.
  – Other models, such as **Claude-3.7-Sonnet**, show varying degrees of success depending on content type, scoring highly on sequences and reasonably on poetry yet struggling with GitHub PRs.
– Reasoning models achieve marginally better results across content types, but at a higher cost in resources (reasoning tokens).

– The authors hypothesize a reason for this behavior: identifying information that is present is fundamentally simpler for models than detecting what is absent. The distinction arises from how Transformer attention works: the model can attend directly to tokens that are in the document, but there is nothing for it to attend to where information has been removed.

– Implications for security professionals:
  – Understanding these limitations can inform the design and application of LLMs in secure environments where accurate information extraction is critical.
  – Model weaknesses on absence-detection tasks could be exploited, posing risks to data integrity and security in AI applications.

Overall, the analysis provides valuable insight for professionals engaged in AI security, emphasizing the need for diligence in assessing model capabilities and limitations before relying on them in practical applications.