Source URL: https://simonwillison.net/2025/May/11/cursor-security/#atom-everything
Source: Simon Willison’s Weblog
Title: Cursor: Security
Cursor’s security documentation page includes a surprising amount of detail about how the Cursor text editor’s backend systems work.
I’ve recently learned that checking an organization’s list of documented subprocessors is a great way to get a feel for how everything works under the hood – it’s a loose “view source” for their infrastructure! That was how I confirmed that Anthropic’s search features used Brave search back in March.
Cursor’s list includes AWS, Azure and GCP (AWS for primary infrastructure, Azure and GCP for "some secondary infrastructure"). They host their own custom models on Fireworks and make API calls out to OpenAI, Anthropic, Gemini and xAI depending on user preferences. They’re using turbopuffer as a hosted vector store.
The most interesting section is about codebase indexing:
Cursor allows you to semantically index your codebase, which allows it to answer questions with the context of all of your code as well as write better code by referencing existing implementations. […]
At our server, we chunk and embed the files, and store the embeddings in Turbopuffer. To allow filtering vector search results by file path, we store with every vector an obfuscated relative file path, as well as the line range the chunk corresponds to. We also store the embedding in a cache in AWS, indexed by the hash of the chunk, to ensure that indexing the same codebase a second time is much faster (which is particularly useful for teams).
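The indexing flow described above can be sketched roughly like this. Everything here is a hypothetical stand-in: `toy_embed` is a deterministic fake embedding model, `obfuscate` is one guess at how a path might be hashed, and the fixed-size line chunker is an assumption (Cursor's actual chunking, embedding, and obfuscation schemes are not documented in this post):

```python
import hashlib

def obfuscate(path: str, secret: bytes = b"workspace-key") -> str:
    # Hypothetical obfuscation: a keyed hash of the relative path,
    # so the server-side store never holds the plaintext path.
    return hashlib.sha256(secret + path.encode()).hexdigest()[:16]

def toy_embed(text: str, dim: int = 8) -> list[float]:
    # Stand-in for a real embedding model: a deterministic hash-based vector.
    h = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in h[:dim]]

def index_file(path: str, source: str, cache: dict, store: list,
               chunk_lines: int = 4) -> None:
    """Chunk a file, embed each chunk (reusing a hash-keyed cache),
    and store only the vector, obfuscated path, and line range."""
    lines = source.splitlines()
    for start in range(0, len(lines), chunk_lines):
        chunk = "\n".join(lines[start:start + chunk_lines])
        chunk_hash = hashlib.sha256(chunk.encode()).hexdigest()
        if chunk_hash not in cache:  # cache hit => skip re-embedding
            cache[chunk_hash] = toy_embed(chunk)
        store.append({
            "vector": cache[chunk_hash],
            "path": obfuscate(path),  # no plaintext path server-side
            "lines": (start + 1, min(start + chunk_lines, len(lines))),
        })
```

The hash-keyed cache is what makes re-indexing the same codebase fast: an unchanged chunk hashes to the same key, so its embedding is looked up rather than recomputed.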
At inference time, we compute an embedding, let Turbopuffer do the nearest neighbor search, send back the obfuscated file path and line range to the client, and read those file chunks on the client locally. We then send those chunks back up to the server to answer the user’s question.
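That round trip can be sketched with a toy two-dimensional in-memory store. The record layout, distance metric, and function names are all assumptions for illustration – turbopuffer's actual API is not shown here:

```python
# Toy "vector store": each record holds a vector plus the obfuscated
# path and line range saved at indexing time (layout is hypothetical).
store = [
    {"vector": [0.1, 0.9], "path": "a1b2", "lines": (1, 4)},
    {"vector": [0.8, 0.2], "path": "c3d4", "lines": (5, 8)},
]

def nearest(query_vec, store, k=1):
    """Server side: rank records by distance to the query embedding and
    return only metadata -- never any source code."""
    def dist(v, w):
        return sum((a - b) ** 2 for a, b in zip(v, w))
    ranked = sorted(store, key=lambda rec: dist(rec["vector"], query_vec))
    return [(rec["path"], rec["lines"]) for rec in ranked[:k]]

def resolve_locally(hits, path_map, read_file):
    """Client side: map obfuscated paths back to real files (the client
    keeps that mapping) and read the referenced line ranges from disk."""
    chunks = []
    for obf, (lo, hi) in hits:
        lines = read_file(path_map[obf]).splitlines()
        chunks.append("\n".join(lines[lo - 1:hi]))
    return chunks  # sent back up with the question for the final answer
```

The split matters: the server only ever returns pointers (obfuscated path plus line range), and the plaintext code is read on the client before being sent up for that single request.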
When operating in privacy mode – which they say is enabled by 50% of their users – they are careful not to store any raw code on their servers for longer than the duration of a single request. This is why they store the embeddings and obfuscated file paths but not the code itself.
Reading this made me instantly think of the paper Text Embeddings Reveal (Almost) As Much As Text about how vector embeddings can be reversed. The security documentation touches on that in the notes:
Embedding reversal: academic work has shown that reversing embeddings is possible in some cases. Current attacks rely on having access to the model and embedding short strings into big vectors, which makes us believe that the attack would be somewhat difficult to do here. That said, it is definitely possible for an adversary who breaks into our vector database to learn things about the indexed codebases.
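As a toy illustration of why a stolen vector still leaks information: an attacker with query access to the embedding model can score candidate strings against the stolen vector and keep the closest match. Real attacks like the one in the paper above train an inversion model rather than brute-forcing candidates, and `toy_embed` here is a deterministic stand-in, not a real embedding model (its hash-based vectors reward only exact matches, unlike real embeddings where semantically close text scores close):

```python
import hashlib

def toy_embed(text: str, dim: int = 8) -> list[float]:
    # Stand-in for the embedding model the attacker is assumed to query.
    h = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in h[:dim]]

def invert(target_vec, candidates):
    """Naive reversal: score each candidate string by distance to the
    stolen vector and return the best guess."""
    def dist(v, w):
        return sum((a - b) ** 2 for a, b in zip(v, w))
    return min(candidates, key=lambda s: dist(toy_embed(s), target_vec))
```

Usage: `invert(stolen_vector, candidate_strings)` returns whichever candidate embeds closest to the stolen vector, which is why "an adversary who breaks into our vector database" can learn things about the indexed code even though no plaintext is stored.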
Via lobste.rs
Tags: ai-assisted-programming, security, generative-ai, ai, embeddings, llms
AI Summary and Description: Yes
Summary: The text provides an in-depth look into Cursor’s security architecture and practices, highlighting how they leverage cloud infrastructure and embeddings for semantic code indexing while emphasizing concerns related to embedding security. This is particularly relevant for professionals in AI security and cloud computing because of its implications for data privacy and system architecture.
Detailed Description: The content discusses Cursor’s security documentation, with a focus on their backend operations and the use of specific cloud platforms for their infrastructure. Here are the major points of relevance:
– **Cloud Infrastructure**: Cursor operates using AWS, Azure, and GCP, highlighting the complexity and multi-cloud strategy often seen in modern applications.
– **API Integration**: The text notes that Cursor makes API calls to various AI providers (such as OpenAI and Anthropic), a point of interest for developers weighing third-party AI integrations in their own systems.
– **Codebase Indexing**: A significant feature is the semantic indexing of code, allowing users to query contextually relevant information, which can enhance programming efficiency:
  – Code files are chunked and embedded into vectors for semantic search.
  – An obfuscated file path and line range are stored with each vector for filtering, demonstrating a blend of usability and security.
  – An AWS cache keyed by chunk hash speeds up subsequent indexing, which is valuable for teams that iterate on codebases frequently.
– **Privacy Considerations**: Cursor offers a privacy mode to protect user data:
  – When enabled, raw code is not retained on their servers beyond the duration of a single request.
  – Instead, only embeddings and obfuscated paths are stored, addressing some privacy concerns.
– **Security Risks**: The documentation outlines vulnerabilities associated with vector embeddings, referencing academic work on embedding reversal:
  – An adversary gaining access to the vector database could still learn things about the indexed codebases.
  – This candid treatment of security measures and potential attacks makes it highly relevant for security professionals focused on embedding and AI model security.
Overall, Cursor’s documentation provides valuable insights into cloud computing security, AI model integration, and the importance of protecting sensitive code data, making it essential reading for security and compliance professionals in related fields.
– **Key Insights**:
– Multi-cloud architecture offers resilience but requires careful security handling.
– Embedding use in AI applications poses both efficiency and security challenges.
– Organizations must continuously evaluate security implications of code indexing and data storage practices.