Simon Willison’s Weblog: ChatGPT agent triggers crawls from Bingbot and Yandex

Aug 4, 2025

—

Source URL: https://simonwillison.net/2025/Aug/4/chatgpt-agents-agent/#atom-everything
Source: Simon Willison’s Weblog
Title: ChatGPT agent triggers crawls from Bingbot and Yandex

Feedly Summary: ChatGPT agent is the recently released (and confusingly named) ChatGPT feature that combines browser automation with terminal access. It replaces their previous Operator research preview, which is scheduled for deprecation on August 31st.
In exploring how it works I found that, for some reason, it triggers crawls of pages it visits from both Bingbot and Yandex!
Investigating ChatGPT agent’s user-agent
I started my investigation by creating a logged web URL endpoint using django-http-debug. Then I told ChatGPT agent mode to explore that new page:

My logging captured these request headers:
Via: 1.1 heroku-router
Host: simonwillison.net
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Cf-Ray: 96a0f289adcb8e8e-SEA
Cookie: cf_clearance=zzV8W…
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Priority: u=0, i
Sec-Ch-Ua: "Not)A;Brand";v="8", "Chromium";v="138"
Signature: sig1=:1AxfqHocTf693inKKMQ7NRoHoWAZ9d/vY4D/FO0+MqdFBy0HEH3ZIRv1c3hyiTrzCvquqDC8eYl1ojcPYOSpCQ==:
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36
Cf-Ipcountry: US
X-Request-Id: 45ef5be4-ead3-99d5-f018-13c4a55864d3
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Accept-Encoding: gzip, br
Accept-Language: en-US,en;q=0.9
Signature-Agent: "https://chatgpt.com"
Signature-Input: sig1=("@authority" "@method" "@path" "signature-agent");created=1754340838;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754344438;nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";tag="web-bot-auth";alg="ed25519"
X-Forwarded-For: 2a09:bac5:665f:1541::21e:154, 172.71.147.183
X-Request-Start: 1754340840059
Cf-Connecting-Ip: 2a09:bac5:665f:1541::21e:154
Sec-Ch-Ua-Mobile: ?0
X-Forwarded-Port: 80
X-Forwarded-Proto: http
Sec-Ch-Ua-Platform: "Linux"
Upgrade-Insecure-Requests: 1

That Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 user-agent header is the one used by the most recent Chrome on macOS, which is a little odd here as the Sec-Ch-Ua-Platform: "Linux" header indicates that the agent browser runs on Linux.
At first glance it looks like ChatGPT is being dishonest here by not including its bot identity in the user-agent header. I thought for a moment it might be reflecting my own user-agent, but I’m using Firefox on macOS and it identified itself as Chrome.
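The mismatch between those two headers can be checked mechanically. Here is a minimal Python sketch; the helper names `ua_platform` and `platform_mismatch` are hypothetical, and the substring heuristics are an assumption rather than a complete User-Agent parser.

```python
def ua_platform(user_agent: str) -> str:
    """Rough platform guess from a User-Agent string (heuristic only)."""
    ua = user_agent.lower()
    if "macintosh" in ua or "mac os x" in ua:
        return "macOS"
    if "windows" in ua:
        return "Windows"
    if "android" in ua:  # checked before "linux": Android UAs mention both
        return "Android"
    if "linux" in ua:
        return "Linux"
    return "unknown"


def platform_mismatch(headers: dict) -> bool:
    """True when the User-Agent and Sec-Ch-Ua-Platform disagree."""
    claimed = ua_platform(headers.get("User-Agent", ""))
    # The client hint is a quoted string, e.g. "Linux" including the quotes.
    hinted = headers.get("Sec-Ch-Ua-Platform", "").strip().strip('"')
    if not hinted or claimed == "unknown":
        return False  # not enough information to compare
    return claimed.lower() != hinted.lower()


captured = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/138.0.0.0 Safari/537.36"
    ),
    "Sec-Ch-Ua-Platform": '"Linux"',
}
print(platform_mismatch(captured))  # True
```

A mismatch like this is only a hint that the User-Agent has been overridden; it says nothing about who the client actually is.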
Then I spotted this header:
Signature-Agent: "https://chatgpt.com"

Which is accompanied by a much more complex header called Signature-Input:
Signature-Input: sig1=("@authority" "@method" "@path" "signature-agent");created=1754340838;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754344438;nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";tag="web-bot-auth";alg="ed25519"

And a Signature header too.
These turn out to come from a relatively new web standard: RFC 9421, HTTP Message Signatures, published in February 2024.
The purpose of HTTP Message Signatures is to allow clients to include signed data about their request in a way that cannot be tampered with by intermediaries. The signature can be verified using a public key that’s published at the following well-known endpoint:
https://chatgpt.com/.well-known/http-message-signatures-directory

Add it all together and we now have a rock-solid way to identify traffic from ChatGPT agent: look for the Signature-Agent: "https://chatgpt.com" header and confirm its value by checking the signature in the Signature-Input and Signature headers.
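To make the verification step more concrete, here is a rough Python sketch of how the signed string (the "signature base" in RFC 9421 terms) could be rebuilt from the captured headers. It is a simplified illustration, not a full structured-field parser: `parse_signature_input`, `signature_base` and `within_validity_window` are hypothetical helper names, and `/hypothetical-path` is a placeholder because the real debug path is not shown above. The final Ed25519 check against the key from the well-known directory would need a cryptography library and is not shown.

```python
import re

SIGNATURE_INPUT = (
    'sig1=("@authority" "@method" "@path" "signature-agent");'
    'created=1754340838;'
    'keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";'
    'expires=1754344438;'
    'nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";'
    'tag="web-bot-auth";alg="ed25519"'
)


def parse_signature_input(value: str):
    """Split a Signature-Input value into label, components and parameters."""
    label, _, inner = value.partition("=")
    match = re.match(r'\(([^)]*)\)(.*)$', inner)
    if match is None:
        raise ValueError("unexpected Signature-Input format")
    components = re.findall(r'"([^"]*)"', match.group(1))
    params = {}
    for item in match.group(2).split(";"):
        if item:
            key, _, val = item.partition("=")
            params[key] = val.strip('"')
    return label, components, params, inner


def signature_base(method, authority, path, headers, signature_input):
    """Rebuild the string that the Signature header is computed over."""
    _, components, _, inner = parse_signature_input(signature_input)
    derived = {
        "@method": method.upper(),
        "@authority": authority.lower(),
        "@path": path,
    }
    lines = []
    for name in components:
        # "@..." names are derived from the request; others are header fields.
        value = derived[name] if name.startswith("@") else headers[name].strip()
        lines.append(f'"{name}": {value}')
    lines.append(f'"@signature-params": {inner}')
    return "\n".join(lines)


def within_validity_window(params, now):
    """Check the created/expires timestamps (Unix seconds)."""
    return int(params["created"]) <= now <= int(params["expires"])


base = signature_base(
    "GET",
    "simonwillison.net",
    "/hypothetical-path",  # placeholder, not the real debug URL
    {"signature-agent": '"https://chatgpt.com"'},
    SIGNATURE_INPUT,
)
print(base)
```

In the captured request the `expires` value is exactly 3600 seconds after `created`, so a verifier that honors those parameters would accept this signature for one hour.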
And then came Bingbot
Just over a minute after it captured that request, my logging endpoint got another request:
Via: 1.1 heroku-router
From: bingbot(at)microsoft.com
Host: simonwillison.net
Accept: */*
Cf-Ray: 96a0f4671d1fc3c6-SEA
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
Cf-Ipcountry: US
X-Request-Id: 6214f5dc-a4ea-5390-1beb-f2d26eac5d01
Accept-Encoding: gzip, br
X-Forwarded-For: 207.46.13.9, 172.71.150.252
X-Request-Start: 1754340916429
Cf-Connecting-Ip: 207.46.13.9
X-Forwarded-Port: 80
X-Forwarded-Proto: http

I pasted 207.46.13.9 into Microsoft’s Verify Bingbot tool (after solving a particularly taxing CAPTCHA) and it confirmed that this was indeed a request from Bingbot.
I’m reasonably confident the only system that had seen that URL was ChatGPT agent, so apparently there is some kind of mechanism that triggers a Bingbot crawl shortly after it sees a new URL.
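The gap between the two requests can be read straight from the X-Request-Start headers. This small Python sketch assumes those values are Unix timestamps in milliseconds, which is consistent with the date of the experiment; `delay_seconds` and `as_utc` are hypothetical helper names.

```python
from datetime import datetime, timezone

AGENT_REQUEST_START_MS = 1754340840059    # ChatGPT agent request
BINGBOT_REQUEST_START_MS = 1754340916429  # Bingbot request


def delay_seconds(first_ms: int, second_ms: int) -> float:
    """Elapsed time between two millisecond timestamps, in seconds."""
    return (second_ms - first_ms) / 1000


def as_utc(ms: int) -> datetime:
    """Convert a millisecond Unix timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


print(as_utc(AGENT_REQUEST_START_MS).isoformat())
print(delay_seconds(AGENT_REQUEST_START_MS, BINGBOT_REQUEST_START_MS))  # 76.37
```

That works out to 76.37 seconds, which matches the "just over a minute" observation.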
…and then Yandex?
Before publishing this article I decided to run the experiment one more time, with a new URL, just to confirm my findings.
This time I got the hit from ChatGPT agent… and then within a minute I got a new hit that looked like this:
Via: 1.1 heroku-router
From: support@search.yandex.ru
Host: simonwillison.net
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Cf-Ray: 96a16390d8f6f3a7-DME
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Cf-Ipcountry: RU
X-Request-Id: 3cdcbdba-f629-0d29-b453-61644da43c6c
Accept-Encoding: gzip, br
X-Forwarded-For: 213.180.203.138, 172.71.184.65
X-Request-Start: 1754345469921
Cf-Connecting-Ip: 213.180.203.138
X-Forwarded-Port: 80
X-Forwarded-Proto: http

I am absolutely baffled by this. I understand how ChatGPT might have a relationship with Bing, given Microsoft’s investment in OpenAI and ChatGPT’s usage of Bing for its search feature… but under what circumstances could my URL there be shared with the Yandex crawler?
Yandex suggests a reverse DNS lookup to verify, so I ran this command:
dig -x 213.180.203.138 +short

And got back:
213-180-203-138.spider.yandex.com.

Which confirms that this is indeed a Yandex crawler.
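The same check can be scripted. The Python sketch below goes one step further than the `dig` command by also resolving the returned hostname forward and confirming it maps back to the same IP, a common hardening step that is an addition here rather than something the experiment above did. The suffix list is an assumption that should be checked against Yandex's current documentation, and `is_yandex_hostname` and `verify_yandex_ip` are hypothetical names.

```python
import socket

# Assumed list of domains Yandex crawlers resolve to; verify before relying on it.
YANDEX_SUFFIXES = (".yandex.ru", ".yandex.net", ".yandex.com")


def is_yandex_hostname(hostname: str) -> bool:
    """True if a reverse-DNS name falls under one of the assumed domains."""
    return hostname.rstrip(".").lower().endswith(YANDEX_SUFFIXES)


def verify_yandex_ip(ip, reverse=socket.gethostbyaddr,
                     forward=socket.gethostbyname_ex) -> bool:
    """Reverse lookup, suffix check, then forward-confirm the hostname."""
    try:
        hostname = reverse(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not is_yandex_hostname(hostname):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses


print(is_yandex_hostname("213-180-203-138.spider.yandex.com."))  # True
```

The resolver functions are passed in as parameters so the logic can be exercised without network access; calling `verify_yandex_ip("213.180.203.138")` with the defaults performs real DNS lookups.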
Oddly enough, this time I didn’t get a Bingbot hit at all.
I noticed that the second demo had "web search" enabled, and had run some searches in addition to hitting my page. I tried a third experiment with that turned off and with the prompt:

Visit https://simonwillison.net/information-on-this-page but do not run any other searches or visit any other pages.

This time I got all three – the hit from ChatGPT agent, then a hit from Yandex and then a hit from Bingbot.

So what’s going on here?
There are quite a few different moving parts here.

– I’m using Firefox on macOS with the 1Password and Readwise Highlighter extensions installed and active. Since I didn’t visit the debug pages at all with my own browser I don’t think any of these are relevant to these results.
– ChatGPT agent makes just a single request to my debug URL …
– … which is proxied through both Cloudflare and Heroku.
– Within about a minute, I get hits from one or both of Bingbot and Yandex.

Presumably ChatGPT agent itself is running behind at least one proxy – I would expect OpenAI to keep a close eye on that traffic to ensure it doesn’t get abused.
I’m guessing that infrastructure is hosted by Microsoft Azure, though the OpenAI Sub-processor List names Microsoft Corporation, CoreWeave Inc, Oracle Cloud Platform and Google Cloud Platform under the "Cloud infrastructure" section, so it could be any of those.
Since the page is served over HTTPS my guess is that any intermediary proxies should be unable to see the path component of the URL, making the mystery of how Bingbot and Yandex saw the URL even more intriguing.
Tags: bing, privacy, search-engines, user-agents, ai, generative-ai, chatgpt, llms

AI Summary and Description: Yes

Summary: The text provides an in-depth investigation into the behavior of the ChatGPT agent, particularly its web crawling interactions with both Bingbot and Yandex. This highlights both the potential privacy issues and the underlying technologies such as HTTP message signatures, critical for professionals in AI and cloud security.

Detailed Description:
The discussion revolves around the activities of the ChatGPT agent, specifically examining how it interacts with web pages and the subsequent crawling actions it triggers in other search bots like Bingbot and Yandex. This is significant for security and compliance professionals as it presents a case of potential privacy concerns and the importance of understanding the relationships between different web technologies.

– **ChatGPT Agent Overview**: Recently released ChatGPT feature combining browser automation with terminal access, replacing the earlier Operator research preview.
– **Crawling Behavior**: Within about a minute of the agent visiting a webpage, Bingbot, Yandex, or both crawl the same URL, which raises questions about confidentiality and data sharing.
– **User-Agent Analysis**:
  – The User-Agent string claims Chrome on macOS while the Sec-Ch-Ua-Platform header reports Linux.
  – The User-Agent header carries no bot identifier; the agent is identified by the Signature-Agent header instead.

– **Signature and Security**:
  – The Signature, Signature-Input and Signature-Agent headers follow HTTP Message Signatures (RFC 9421), which allows verification of request authenticity and data integrity.
  – This represents an evolving standard for enhancing trust in web communications, relevant to infrastructure security professionals.

– **Interaction with Search Engines**:
  – The author cannot explain how the URL reached the crawlers, indicating a potential lack of transparency in data handling.
  – Across three experiments, a ChatGPT agent visit was followed by a Bingbot crawl, a Yandex crawl, or both, which suggests some mechanism shares visited URLs with those search engines.

– **Infrastructure Speculation**:
  – The author guesses the infrastructure is hosted on Microsoft Azure, though OpenAI’s sub-processor list also names CoreWeave, Oracle Cloud Platform and Google Cloud Platform.
  – Because the page is served over HTTPS, intermediary proxies should be unable to see the URL path, which leaves open how the crawlers learned it.

In summary, this case study brings to light significant considerations for AI security, the relationship between AI systems and search engine behaviors, and the implications for data privacy—making it a relevant point of analysis for professionals operating within AI, cloud security, and compliance domains.
