Source URL: https://simonwillison.net/2025/Aug/4/chatgpt-agents-agent/#atom-everything
Source: Simon Willison’s Weblog
Title: ChatGPT agent triggers crawls from Bingbot and Yandex
Feedly Summary: ChatGPT agent is the recently released (and confusingly named) ChatGPT feature that provides browser automation combined with terminal access as a feature of ChatGPT – replacing their previous Operator research preview which is scheduled for deprecation on August 31st.
In exploring how it works I found that, for some reason, it triggers crawls of pages it visits from both Bingbot and Yandex!
Investigating ChatGPT agent’s user-agent
I started my investigation by creating a logged web URL endpoint using django-http-debug. Then I told ChatGPT agent mode to explore that new page.
My logging captured these request headers:
Via: 1.1 heroku-router
Host: simonwillison.net
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Cf-Ray: 96a0f289adcb8e8e-SEA
Cookie: cf_clearance=zzV8W…
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Priority: u=0, i
Sec-Ch-Ua: "Not)A;Brand";v="8", "Chromium";v="138"
Signature: sig1=:1AxfqHocTf693inKKMQ7NRoHoWAZ9d/vY4D/FO0+MqdFBy0HEH3ZIRv1c3hyiTrzCvquqDC8eYl1ojcPYOSpCQ==:
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36
Cf-Ipcountry: US
X-Request-Id: 45ef5be4-ead3-99d5-f018-13c4a55864d3
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Accept-Encoding: gzip, br
Accept-Language: en-US,en;q=0.9
Signature-Agent: "https://chatgpt.com"
Signature-Input: sig1=("@authority" "@method" "@path" "signature-agent");created=1754340838;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754344438;nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";tag="web-bot-auth";alg="ed25519"
X-Forwarded-For: 2a09:bac5:665f:1541::21e:154, 172.71.147.183
X-Request-Start: 1754340840059
Cf-Connecting-Ip: 2a09:bac5:665f:1541::21e:154
Sec-Ch-Ua-Mobile: ?0
X-Forwarded-Port: 80
X-Forwarded-Proto: http
Sec-Ch-Ua-Platform: "Linux"
Upgrade-Insecure-Requests: 1
That Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 user-agent header is the one used by the most recent Chrome on macOS – which is a little odd here as the Sec-Ch-Ua-Platform: "Linux" header indicates that the agent browser runs on Linux.
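The mismatch between the two headers can be spotted mechanically. Here is a rough sketch of that check in Python; the function names and the short list of platform tokens are my own illustrative assumptions, not anything ChatGPT agent or django-http-debug provides.

```python
import re


def ua_platform(user_agent: str) -> str:
    """Guess the platform a User-Agent claims, using its first parenthesised section."""
    match = re.search(r"\(([^)]*)\)", user_agent)
    details = match.group(1) if match else ""
    if "Macintosh" in details:
        return "macOS"
    if "Windows" in details:
        return "Windows"
    if "Android" in details:  # checked before Linux: Android UAs also mention Linux
        return "Android"
    if "Linux" in details or "X11" in details:
        return "Linux"
    return "Unknown"


def platform_mismatch(headers: dict) -> bool:
    """True when Sec-Ch-Ua-Platform disagrees with the platform the User-Agent claims."""
    claimed = ua_platform(headers.get("User-Agent", ""))
    hinted = headers.get("Sec-Ch-Ua-Platform", "").strip().strip('"')
    if not hinted or claimed == "Unknown":
        return False  # not enough information to compare
    return claimed.lower() != hinted.lower()


# The two relevant headers from the logged ChatGPT agent request
agent_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36"
    ),
    "Sec-Ch-Ua-Platform": '"Linux"',
}
print(platform_mismatch(agent_headers))
```

A mismatch like this is only a hint that something unusual is going on; it says nothing about who sent the request.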
At first glance it looks like ChatGPT is being dishonest here by not including its bot identity in the user-agent header. I thought for a moment it might be reflecting my own user-agent, but I’m using Firefox on macOS and it identified itself as Chrome.
Then I spotted this header:
Signature-Agent: "https://chatgpt.com"
Which is accompanied by a much more complex header called Signature-Input:
Signature-Input: sig1=("@authority" "@method" "@path" "signature-agent");created=1754340838;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754344438;nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";tag="web-bot-auth";alg="ed25519"
And a Signature header too.
These turn out to come from a relatively new web standard: RFC 9421 HTTP Message Signatures, published February 2024.
The purpose of HTTP Message Signatures is to allow clients to include signed data about their request in a way that cannot be tampered with by intermediaries. The signature can be verified using a public key that’s published at the following well-known endpoint:
https://chatgpt.com/.well-known/http-message-signatures-directory
Add it all together and we now have a rock-solid way to identify traffic from ChatGPT agent: look for the Signature-Agent: "https://chatgpt.com" header and confirm its value by checking the signature in the Signature-Input and Signature headers.
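To make that check more concrete, here is a rough Python sketch of the first half of verification: parsing Signature-Input and rebuilding the "signature base" string that RFC 9421 says the signature covers. The function names and the request path are hypothetical (the path of my debug URL isn't shown in the headers above), and the parser only handles the simple single-label shape seen in this request. The final step, an Ed25519 check of the Signature bytes against the key from the directory endpoint, is not shown.

```python
import re

# Header values copied from the logged ChatGPT agent request
SIGNATURE_INPUT = (
    'sig1=("@authority" "@method" "@path" "signature-agent");created=1754340838;'
    'keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754344438;'
    'nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";'
    'tag="web-bot-auth";alg="ed25519"'
)
HEADERS = {"signature-agent": '"https://chatgpt.com"'}
HYPOTHETICAL_PATH = "/hypothetical-debug-path"  # assumption: the real path is not in the log


def parse_signature_input(value: str):
    """Return (label, covered components, parameters, raw value after the label)."""
    label, _, raw = value.partition("=")
    inner, _, param_str = raw.partition(")")
    components = re.findall(r'"([^"]*)"', inner)
    params = {}
    for m in re.finditer(r';([\w-]+)=(?:"([^"]*)"|(-?\d+))', param_str):
        params[m.group(1)] = m.group(2) if m.group(2) is not None else int(m.group(3))
    return label, components, params, raw


def signature_base(signature_input, method, authority, path, headers):
    """Rebuild the string the signature covers: one line per covered component."""
    _, components, _, raw = parse_signature_input(signature_input)
    derived = {"@method": method.upper(), "@authority": authority.lower(), "@path": path}
    lines = []
    for name in components:
        value = derived[name] if name.startswith("@") else headers[name].strip()
        lines.append(f'"{name}": {value}')
    lines.append(f'"@signature-params": {raw}')
    return "\n".join(lines)


label, components, params, _ = parse_signature_input(SIGNATURE_INPUT)
print(label, components, params["tag"], params["alg"])
print("validity window (seconds):", params["expires"] - params["created"])
print(signature_base(SIGNATURE_INPUT, "GET", "simonwillison.net", HYPOTHETICAL_PATH, HEADERS))
```

A verifier would also want to reject requests where the current time falls outside the created/expires window, and where the tag is not web-bot-auth.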
And then came Bingbot
Just over a minute after it captured that request, my logging endpoint got another request:
Via: 1.1 heroku-router
From: bingbot(at)microsoft.com
Host: simonwillison.net
Accept: */*
Cf-Ray: 96a0f4671d1fc3c6-SEA
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
Cf-Ipcountry: US
X-Request-Id: 6214f5dc-a4ea-5390-1beb-f2d26eac5d01
Accept-Encoding: gzip, br
X-Forwarded-For: 207.46.13.9, 172.71.150.252
X-Request-Start: 1754340916429
Cf-Connecting-Ip: 207.46.13.9
X-Forwarded-Port: 80
X-Forwarded-Proto: http
I pasted 207.46.13.9 into Microsoft’s Verify Bingbot tool (after solving a particularly taxing CAPTCHA) and it confirmed that this was indeed a request from Bingbot.
I’m reasonably confident the only system that had seen that URL was ChatGPT agent, so apparently there is some kind of mechanism that triggers a Bingbot crawl shortly after it sees a new URL.
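The "just over a minute" figure can be checked from the X-Request-Start headers on the two requests. This is a small sketch that assumes those values are milliseconds since the Unix epoch, as set by the Heroku router; the constant and function names are just for illustration.

```python
# X-Request-Start values copied from the two logged requests
AGENT_REQUEST_START_MS = 1754340840059    # ChatGPT agent hit
BINGBOT_REQUEST_START_MS = 1754340916429  # Bingbot hit


def gap_seconds(first_ms: int, second_ms: int) -> float:
    """Seconds elapsed between two millisecond timestamps."""
    return (second_ms - first_ms) / 1000


gap = gap_seconds(AGENT_REQUEST_START_MS, BINGBOT_REQUEST_START_MS)
print(f"Bingbot arrived {gap:.2f} seconds after ChatGPT agent")  # 76.37 seconds
```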
…and then Yandex?
Before publishing this article I decided to run the experiment one more time, with a new URL, just to confirm my findings.
This time I got the hit from ChatGPT agent… and then within a minute I got a new hit that looked like this:
Via: 1.1 heroku-router
From: support@search.yandex.ru
Host: simonwillison.net
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Cf-Ray: 96a16390d8f6f3a7-DME
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Cf-Ipcountry: RU
X-Request-Id: 3cdcbdba-f629-0d29-b453-61644da43c6c
Accept-Encoding: gzip, br
X-Forwarded-For: 213.180.203.138, 172.71.184.65
X-Request-Start: 1754345469921
Cf-Connecting-Ip: 213.180.203.138
X-Forwarded-Port: 80
X-Forwarded-Proto: http
I am absolutely baffled by this. I understand how ChatGPT might have a relationship with Bing, given Microsoft’s investment in OpenAI and ChatGPT’s usage of Bing for its search feature… but under what circumstances could my URL there be shared with the Yandex crawler?
Yandex suggests a reverse DNS lookup to verify, so I ran this command:
dig -x 213.180.203.138 +short
And got back:
213-180-203-138.spider.yandex.com.
Which confirms that this is indeed a Yandex crawler.
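The same check can be scripted. Here is a minimal Python sketch, assuming the rule that a genuine Yandex crawler hostname ends in yandex.ru, yandex.net or yandex.com, and adding a forward lookup to confirm that the hostname resolves back to the same IP. The function names are mine, and the lookup function needs network access to do anything useful.

```python
import socket

# Assumption: hostname suffixes that identify Yandex crawlers
YANDEX_SUFFIXES = (".yandex.ru", ".yandex.net", ".yandex.com")


def is_yandex_hostname(hostname: str) -> bool:
    """Check the suffix of a reverse DNS name, ignoring case and any trailing dot."""
    return hostname.rstrip(".").lower().endswith(YANDEX_SUFFIXES)


def verify_yandex_ip(ip: str) -> bool:
    """Reverse lookup, suffix check, then forward lookup back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no reverse DNS record, or the lookup failed
    if not is_yandex_hostname(hostname):
        return False
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return False
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
    return any(info[4][0] == ip for info in infos)


# Suffix check against the hostname returned by dig above
print(is_yandex_hostname("213-180-203-138.spider.yandex.com."))  # True
```

Calling verify_yandex_ip("213.180.203.138") would repeat the dig lookup and add the forward confirmation, which guards against someone setting a fake reverse DNS record on an IP address they control.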
Oddly enough, this time I didn’t get a Bingbot hit at all.
I noticed that the second demo had "web search" enabled, and had run some searches in addition to hitting my page. I tried a third experiment with that turned off and with the prompt:
Visit https://simonwillison.net/information-on-this-page but do not run any other searches or visit any other pages.
This time I got all three – the hit from ChatGPT agent, then a hit from Yandex and then a hit from Bingbot.
So what’s going on here?
There are quite a few different moving parts here.
I’m using Firefox on macOS with the 1Password and Readwise Highlighter extensions installed and active. Since I didn’t visit the debug pages at all with my own browser I don’t think any of these are relevant to these results.
ChatGPT agent makes just a single request to my debug URL …
… which is proxied through both Cloudflare and Heroku.
Within about a minute, I get hits from one or both of Bingbot and Yandex.
Presumably ChatGPT agent itself is running behind at least one proxy – I would expect OpenAI to keep a close eye on that traffic to ensure it doesn’t get abused.
I’m guessing that infrastructure is hosted by Microsoft Azure, though the OpenAI Sub-processor List names Microsoft Corporation, CoreWeave Inc, Oracle Cloud Platform and Google Cloud Platform under the "Cloud infrastructure" section, so it could be any of those.
Since the page is served over HTTPS my guess is that any intermediary proxies should be unable to see the path component of the URL, making the mystery of how Bingbot and Yandex saw the URL even more intriguing.
Tags: bing, privacy, search-engines, user-agents, ai, generative-ai, chatgpt, llms
AI Summary and Description: Yes
Summary: The text provides an in-depth investigation into the behavior of the ChatGPT agent, particularly its web crawling interactions with both Bingbot and Yandex. This highlights both the potential privacy issues and the underlying technologies such as HTTP message signatures, critical for professionals in AI and cloud security.
Detailed Description:
The discussion revolves around the activities of the ChatGPT agent, specifically examining how it interacts with web pages and the subsequent crawling actions it triggers in other search bots like Bingbot and Yandex. This is significant for security and compliance professionals as it presents a case of potential privacy concerns and the importance of understanding the relationships between different web technologies.
– **ChatGPT Agent Overview**: Recently released ChatGPT feature combining browser automation with terminal access, replacing the earlier Operator research preview.
– **Crawling Behavior**: A visit from the agent was followed, within about a minute, by crawls of the same page from Bingbot, Yandex, or both, which raises questions about confidentiality and data sharing.
– **User-Agent Analysis**:
  – The user-agent string claims Chrome on macOS, while the Sec-Ch-Ua-Platform header reports Linux.
  – The user-agent string alone does not identify the request as coming from a bot.
– **Signature and Security**:
  – Requests carry Signature-Agent, Signature-Input and Signature headers from HTTP Message Signatures (RFC 9421), which allow a site to verify that a request really came from ChatGPT agent.
  – This represents an evolving standard for enhancing trust in web communications, crucial for infrastructure security professionals.
– **Interaction with Search Engines**:
  – The author could not explain how the URLs reached the crawlers, which points to a lack of transparency in how visited URLs are handled.
  – In the experiments, a single agent visit to a URL led to crawls by Bingbot, Yandex, or both, suggesting some link between the agent’s infrastructure and those crawlers, with implications for data privacy.
– **Infrastructure Speculation**:
  – The author guesses the infrastructure is hosted on Microsoft Azure, though OpenAI’s sub-processor list also names CoreWeave, Oracle Cloud Platform and Google Cloud Platform.
  – Because the page is served over HTTPS, intermediary proxies should not be able to see the URL path, which makes it harder to explain how the crawlers learned the URL.
In summary, this case study brings to light significant considerations for AI security, the relationship between AI systems and search engine behaviors, and the implications for data privacy—making it a relevant point of analysis for professionals operating within AI, cloud security, and compliance domains.