Source URL: https://simonwillison.net/2025/May/10/llama-cpp-vision/#atom-everything
Source: Simon Willison’s Weblog
Title: Trying out llama.cpp’s new vision support
Feedly Summary: This llama.cpp server vision support via libmtmd pull request – via Hacker News – was merged earlier today. The PR finally adds full support for vision models to the excellent llama.cpp project. It’s documented on this page, with more detailed technical notes covered here. Here are my notes on getting it working on a Mac.
llama.cpp models are usually distributed as .gguf files. This project introduces a new variant of those called mmproj, for multimodal projector. libmtmd is the new library for handling these.
You can try it out by compiling llama.cpp from source, but I found another option that works: you can download pre-compiled binaries from the GitHub releases.
On macOS there’s an extra step to jump through to get these working, which I’ll describe below.
I downloaded the llama-b5332-bin-macos-arm64.zip file from this GitHub release and unzipped it, which created a build/bin directory.
That directory contains a bunch of binary executables and a whole lot of .dylib files. macOS wouldn’t let me execute these files because they were quarantined. Running this command fixed that for the llama-mtmd-cli and llama-server executables and the .dylib files they needed:
sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib
Now I can run an interactive terminal LLM session using this command:
./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
That -ngl 99 flag specifies that 99 layers should be offloaded to the GPU. I haven’t experimented with modifying or omitting this flag yet – I saw it in a Hacker News comment by danielhanchen and it seems to work.
The first time this runs it fetches four files from unsloth/gemma-3-4b-it-GGUF – a quantized copy of Google DeepMind’s Gemma 3 4B – and stores them in ~/Library/Caches/llama.cpp:
2.4G unsloth_gemma-3-4b-it-GGUF_gemma-3-4b-it-UD-Q4_K_XL.gguf
217B unsloth_gemma-3-4b-it-GGUF_gemma-3-4b-it-UD-Q4_K_XL.gguf.json
812M unsloth_gemma-3-4b-it-GGUF_mmproj-F16.gguf
203B unsloth_gemma-3-4b-it-GGUF_mmproj-F16.gguf.json
Once they’ve downloaded an interactive session starts. You can chat directly with the model or use the /image path-to-image.jpg command to load an image into the context.
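Put together, a session looks roughly like this (the image path is a placeholder, and the exact prompt text the CLI prints may differ):
/image /path/to/pelican.jpg
describe it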
I loaded in a pelican photo and prompted “describe it” and got back this:
It’s a striking photograph of a brown pelican standing in shallow water on a sandy beach. Here’s a breakdown of what I see:
Subject: The main focus is a large brown pelican. It’s standing with its wings fully extended, creating a dramatic pose.
Pose: The pelican is in a slightly awkward, almost comical stance, with its legs planted firmly in the wet sand and its wings spread wide. It looks like it might be preparing to take off or is simply enjoying the water.
Color & Texture: The pelican is a rich, dark brown with lighter markings on its wings and head. The sand is a muted beige color, and the water reflects the sunlight, creating a shimmering effect. You can see the texture of the sand and the ripples in the water.
Lighting: The lighting appears to be bright and sunny, casting shadows and highlighting the details of the pelican’s feathers.
Background: The background is blurred, suggesting a shallow depth of field, which keeps the focus on the pelican. It appears to be a beach scene.
Overall Impression: The image has a dynamic feel due to the pelican’s pose and the movement implied by the water. It’s a well-composed shot that captures a moment of wildlife in its natural habitat.
Do you want me to zoom in on a particular detail, or perhaps tell you more about pelicans?
Not bad for a 3.2GB model running on my laptop!
Running llama-server
Even more fun is the llama-server command. This starts a localhost web server running on port 8080 to serve the model, with both a web UI and an OpenAI-compatible API endpoint.
The command to run it is the same:
./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
Now visit http://localhost:8080 in your browser to start interacting with the model:
It miscounted the pelicans in the group photo, but again, this is a tiny 3.2GB model.
With the server running on port 8080 you can also access the OpenAI-compatible API endpoint. Here’s how to do that using curl:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Describe a pelicans ideal corporate retreat"}
    ]
  }' | jq
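Because the endpoint is OpenAI-compatible, the openai Python client should also work if you point its base_url at the local server. Here’s a rough sketch – the api_key value is a placeholder (the local server doesn’t require one) and the model name is arbitrary, since the server serves whichever model it loaded:

from openai import OpenAI

# Point the standard OpenAI client at the local llama-server endpoint.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumes the default port used above
    api_key="not-needed",  # placeholder: no key is required for the local server
)

response = client.chat.completions.create(
    model="gemma-3-4b-it",  # placeholder name; the server uses the model it loaded
    messages=[
        {"role": "user", "content": "Describe a pelicans ideal corporate retreat"}
    ],
)
print(response.choices[0].message.content)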
I built a new plugin for LLM just now called llm-llama-server to make interacting with this API more convenient. You can use that like this:
llm install llm-llama-server
llm -m llama-server 'invent a theme park ride for a pelican'
Or for vision models use llama-server-vision:
llm -m llama-server-vision 'describe this image' -a /path/to/image.jpg
The LLM plugin uses the streaming API, so responses will stream back to you as they are being generated.
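You can also drive the plugin from LLM’s Python API. A minimal sketch, assuming llm-llama-server is installed and llama-server is running on the default port:

import llm

model = llm.get_model("llama-server")
response = model.prompt("invent a theme park ride for a pelican")

# Iterating over the response yields chunks as they stream back from the server.
for chunk in response:
    print(chunk, end="", flush=True)
print()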
Tags: vision-llms, llm, llama-cpp, ai, local-llms, llms, gemma, generative-ai, projects
AI Summary and Description: Yes
Summary: The text discusses recent advancements in the llama.cpp project regarding vision model support, as well as practical guidance for setting up a server to interact with these models. This is particularly relevant for professionals in AI and cloud computing, as it highlights the integration of generative AI with vision capabilities.
Detailed Description: The text primarily focuses on updates related to the llama.cpp project, emphasizing its newly added support for vision models through a recent pull request. The author provides a detailed guide on setting up and running the software on macOS, capturing its practical implications for users interested in AI model deployment and application.
Key points include:
– **Support for Vision Models**: The merging of a pull request that introduces support for vision models in the llama.cpp server reflects a significant enhancement to the project’s generative AI capabilities.
– **New File Variant and Library**: The introduction of the mmproj (multimodal projector) variant of the .gguf model distribution format, together with the libmtmd library for handling it, signals a new approach to packaging multimodal models.
– **Step-by-step Instructions**:
– Users are guided through the process of compiling llama.cpp from source or downloading pre-compiled binaries from GitHub releases.
– Specific instructions are provided to adjust macOS’s security settings to allow the execution of downloaded binaries, illustrating common challenges in software deployment.
– **Interactive Terminal and Web Server**:
– The author describes how to run an interactive session with the model, allowing users to prompt the AI to describe images, thus showcasing its vision capabilities.
– Instructions for running a localhost web server on port 8080, with both a web UI and an OpenAI-compatible API endpoint, enable broader accessibility for developers.
– **Insights from Interaction**: The interaction with the model shows the system working in practice, while also highlighting its limitations, such as inaccuracies in responses (e.g., miscounting the pelicans in a photo).
– **Plugin Development**: The author mentions the integration of a plugin for easier interaction with the LLM API, which may resonate with developers focusing on enhancing the usability of AI applications.
Overall, the text underscores the practical implementation of advanced AI features and serves as a guide for deploying, interacting with, and extending capabilities of machine learning models, making it highly relevant for professionals in AI and cloud environments.