Simon Willison’s Weblog: Nomic Embed Code: A State-of-the-Art Code Retriever

Source URL: https://simonwillison.net/2025/Mar/27/nomic-embed-code/
Source: Simon Willison’s Weblog
Title: Nomic Embed Code: A State-of-the-Art Code Retriever

Nomic have released a new embedding model that specializes in code, based on their CoRNStack “large-scale high-quality training dataset specifically curated for code retrieval”.
The nomic-embed-code model is pretty large – 26.35GB – but the announcement also mentioned a much smaller model (released 5 months ago) called CodeRankEmbed which is just 521.60MB.
I missed that when it first came out, so I decided to give it a try using my llm-sentence-transformers plugin for LLM. That turned out to need einops too.
llm install llm-sentence-transformers
llm install einops
llm sentence-transformers register nomic-ai/CodeRankEmbed --trust-remote-code

Now I can run the model like this:
llm embed -m sentence-transformers/nomic-ai/CodeRankEmbed -c 'hello'

This outputs an array of 768 numbers, starting [1.4794224500656128, -0.474479079246521, ….
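Similarity search over vectors like these usually comes down to cosine similarity. A minimal sketch in plain Python, just to illustrate the comparison (this is not LLM's internal implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    # Dot product divided by the product of the two vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0, orthogonal vectors score 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

The closer the score is to 1.0, the more similar the model considers the two embedded texts.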
Where this gets fun is combining it with my Symbex tool to create and then search embeddings for functions in a codebase.
I created an index for my LLM codebase like this:
cd llm
symbex '*' '*.*' --nl > code.txt

This creates a newline-separated JSON file of all of the functions (from '*') and methods (from '*.*') in the current directory – you can see that here.
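Reading that newline-delimited JSON back is straightforward. A small Python sketch, assuming each line is an object with an "id" key (inspect your own code.txt for the exact field names symbex emits):

```python
import json

def read_nl_json(path):
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Hypothetical sample line using an assumed {"id": ..., "code": ...} shape
with open("sample.txt", "w") as f:
    f.write('{"id": "llm/cli.py:10", "code": "def hello(): pass"}\n')

items = list(read_nl_json("sample.txt"))
print(len(items), items[0]["id"])  # 1 llm/cli.py:10
```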
Then I fed that into the llm embed-multi command like this:
llm embed-multi \
  -d code.db \
  -m sentence-transformers/nomic-ai/CodeRankEmbed \
  code code.txt \
  --format nl \
  --store \
  --batch-size 10

I found the --batch-size option was needed to prevent it from crashing with an error.
The above command creates a collection called code in a SQLite database called code.db.
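The exact table layout inside code.db is an implementation detail of LLM, but conceptually it pairs a table of collections with a table of embedded items. An illustrative sqlite3 sketch of that idea (the table and column names here are assumptions for illustration, not LLM's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, illustrative schema -- LLM's real tables may differ
conn.execute(
    "CREATE TABLE collections (id INTEGER PRIMARY KEY, name TEXT, model TEXT)"
)
conn.execute(
    "CREATE TABLE embeddings ("
    "collection_id INTEGER, id TEXT, embedding BLOB, content TEXT)"
)
conn.execute(
    "INSERT INTO collections (name, model) VALUES (?, ?)",
    ("code", "sentence-transformers/nomic-ai/CodeRankEmbed"),
)
collection_id = conn.execute(
    "SELECT id FROM collections WHERE name = ?", ("code",)
).fetchone()[0]
# One row per embedded function; --store is what keeps the content column
conn.execute(
    "INSERT INTO embeddings VALUES (?, ?, ?, ?)",
    (collection_id, "llm/cli.py:1776", b"\x00" * 8, "def plugins_list(all): ..."),
)
count = conn.execute("SELECT count(*) FROM embeddings").fetchone()[0]
print(count)  # 1
```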
Having run this command I can search for functions that match a specific search term like this:
llm similar code -d code.db \
  -c 'Represent this query for searching relevant code: install a plugin' | jq

That "Represent this query for searching relevant code: " prefix is required by the model. I pipe it through jq to make it a little more readable, which gives me these results.
This jq recipe makes for a better output:
llm similar code -d code.db \
  -c 'Represent this query for searching relevant code: install a plugin' | \
  jq -r '.id + "\n\n" + .content + "\n--------\n"'

The output from that starts like so:
llm/cli.py:1776

@cli.command(name="plugins")
@click.option("--all", help="Include built-in default plugins", is_flag=True)
def plugins_list(all):
    "List installed plugins"
    click.echo(json.dumps(get_plugins(all), indent=2))
--------

llm/cli.py:1791

@cli.command()
@click.argument("packages", nargs=-1, required=False)
@click.option(
    "-U", "--upgrade", is_flag=True, help="Upgrade packages to latest version"
)
def install(packages, upgrade, editable, force_reinstall, no_cache_dir):
    """Install packages from PyPI into the same environment as LLM"""
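If jq isn't handy, the same formatting can be done with a short Python script, assuming each line of llm similar output is a JSON object with "id" and "content" keys (which is what the jq recipe above relies on):

```python
import json
import sys

def format_results(lines):
    """Mimic the jq recipe: id, blank line, content, then a separator."""
    chunks = []
    for line in lines:
        if not line.strip():
            continue
        item = json.loads(line)
        chunks.append(item["id"] + "\n\n" + item["content"] + "\n--------\n")
    return "".join(chunks)

if __name__ == "__main__":
    sys.stdout.write(format_results(sys.stdin))
```

Saved as format_results.py, it slots in where jq did: llm similar code -d code.db -c '...' | python format_results.py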

Tags: nomic, llm, ai, embeddings

AI Summary and Description: Yes

Summary: The text discusses the release of a new embedding model called Nomic Embed Code, designed for code retrieval, highlighting its large size compared to a smaller model and demonstrating practical usage within a coding environment. This information is significant for professionals in AI and software security due to its implications for code management, compliance, and retrieval systems.

Detailed Description: The content centers around Nomic’s introduction of an advanced code retrieval embedding model. Here are the key points expanded for clarity:

* **Model Release**
– The Nomic Embed Code model is based on a high-quality dataset, CoRNStack, specifically tailored for effective code retrieval.
– The model is substantial in size at 26.35GB, as opposed to a smaller predecessor, CodeRankEmbed, which is 521.60MB.

* **Usage Instructions**
– The text provides a step-by-step guide on how to use the new embedding model with the llm-sentence-transformers plugin for LLM, Simon Willison's command-line tool for working with language models.
– Installation commands for the necessary components (einops and llm-sentence-transformers) are provided.

* **Functionality Demonstration**
– A practical example shows how to create an index for a codebase using the provided command syntax.
– The embedding command (`llm embed-multi`) is discussed, emphasizing the inclusion of a batch size parameter to mitigate crashing errors.

* **Search and Output**
– Instructions on how to perform searches on the generated database (code.db) are shared, which involve formulating and refining queries.
– The use of jq (a command-line JSON processor) to format output for better readability is highlighted.
– Example outputs demonstrate real code snippets alongside their function in a coding context.

* **Implications for Security and Compliance**
– The ability to efficiently retrieve and manage code snippets raises considerations regarding security practices. Properly managing codebases, especially in collaborative environments, is vital for maintaining compliance and ensuring secure code practices.

This detailed analysis can aid professionals by showcasing practical applications of new AI technologies in improving software security and management processes. The integration of advanced retrieval models may enhance code quality and reduce compliance risks, especially in environments with strict governance requirements.