Simon Willison’s Weblog: You can now run prompts against images, audio and video in your terminal using LLM

Source URL: https://simonwillison.net/2024/Oct/29/llm-multi-modal/#atom-everything
Source: Simon Willison’s Weblog
Title: You can now run prompts against images, audio and video in your terminal using LLM

Feedly Summary: I released LLM 0.17 last night, the latest version of my combined CLI tool and Python library for interacting with hundreds of different Large Language Models such as GPT-4o, Llama, Claude and Gemini.
The signature feature of 0.17 is that LLM can now be used to prompt multi-modal models – which means you can now use it to send images, audio and video files to LLMs that can handle them.

Processing an image with gpt-4o-mini
Using a plugin to run audio and video against Gemini
There’s a Python API too
What can we do with this?

Processing an image with gpt-4o-mini
Here’s an example. First, install LLM – using brew install llm or pipx install llm or uv tool install llm, pick your favourite. If you have it installed already you may need to upgrade to 0.17, e.g. with brew upgrade llm.
Obtain an OpenAI key (or an alternative, see below) and provide it to the tool:
llm keys set openai
# paste key here
And now you can start running prompts against images.
llm 'describe this image' \
  -a https://static.simonwillison.net/static/2024/pelican.jpg
The -a option stands for --attachment. Attachments can be specified as URLs, as paths to files on disk or as - to read from data piped into the tool.
The above example uses the default model, gpt-4o-mini. I got back this:

The image features a brown pelican standing on rocky terrain near a body of water. The pelican has a distinct coloration, with dark feathers on its body and a lighter-colored head. Its long bill is characteristic of the species, and it appears to be looking out towards the water. In the background, there are boats, suggesting a marina or coastal area. The lighting indicates it may be a sunny day, enhancing the scene’s natural beauty.

Here’s that image:

You can run llm logs --json -c for a hint of how much that cost:
"usage": {
  "completion_tokens": 89,
  "prompt_tokens": 14177,
  "total_tokens": 14266,
Using my LLM pricing calculator that came to 0.218 cents – less than a quarter of a cent.
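The arithmetic behind that figure can be sketched in Python. The token counts come from the usage output above; the per-million-token prices ($0.15 for input, $0.60 for output) are assumptions for gpt-4o-mini that do not appear in this post, so check current pricing before relying on them.

```python
# Assumed gpt-4o-mini prices in US dollars per million tokens.
# These are NOT from the post; substitute current prices.
INPUT_USD_PER_MILLION = 0.15
OUTPUT_USD_PER_MILLION = 0.60


def cost_in_cents(prompt_tokens, completion_tokens,
                  input_price=INPUT_USD_PER_MILLION,
                  output_price=OUTPUT_USD_PER_MILLION):
    """Return the cost of one request in US cents."""
    dollars = (prompt_tokens * input_price
               + completion_tokens * output_price) / 1_000_000
    return dollars * 100


# Token counts taken from the usage block in the logs
cents = cost_in_cents(prompt_tokens=14177, completion_tokens=89)
print(f"{cents:.3f} cents")  # 0.218 cents
```

Almost all of the cost is the image itself: the 89 completion tokens contribute only about 0.005 cents under these assumed prices.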
Let’s run that again with gpt-4o. Add -m gpt-4o to specify the model:
llm 'describe this image' \
  -a https://static.simonwillison.net/static/2024/pelican.jpg \
  -m gpt-4o

The image shows a pelican standing on rocks near a body of water. The bird has a large, long bill and predominantly gray feathers with a lighter head and neck. In the background, there is a docked boat, giving the impression of a marina or harbor setting. The lighting suggests it might be sunny, highlighting the pelican’s features.

That time it cost 435 prompt tokens (GPT-4o mini charges far more tokens per image than GPT-4o) and the total was 0.1787 cents.
Using a plugin to run audio and video against Gemini
Models in LLM are defined by plugins. The application ships with a default OpenAI plugin to get people started, but there are dozens of other plugins providing access to different models, including models that can run directly on your own device.
Plugins need to be upgraded to add support for multi-modal input – here’s documentation on how to do that. I’ve shipped three plugins with support for multi-modal attachments so far: llm-gemini, llm-claude-3 and llm-mistral (for Pixtral).
So far these are all remote API plugins. It’s definitely possible to build a plugin that runs attachments through local models but I haven’t got one of those into good enough condition to release just yet.
The Google Gemini series are my favourite multi-modal models right now due to the size and breadth of content they support. Gemini models can handle images, audio and video!
Let’s try that out. Start by installing llm-gemini:
llm install llm-gemini
Obtain a Gemini API key. These include a free tier, so you can get started without needing to spend any money. Paste that in here:
llm keys set gemini
# paste key here
The three Gemini 1.5 models are called Pro, Flash and Flash-8B. Let’s try it with Pro:
llm 'describe this image' \
  -a https://static.simonwillison.net/static/2024/pelican.jpg \
  -m gemini-1.5-pro-latest

A brown pelican stands on a rocky surface, likely a jetty or breakwater, with blurred boats in the background. The pelican is facing right, and its long beak curves downwards. Its plumage is primarily grayish-brown, with lighter feathers on its neck and breast. […]

Very detailed!
But let’s do something a bit more interesting. I shared a 7m40s MP3 of a NotebookLM podcast a few weeks ago. Let’s use Flash-8B – the cheapest Gemini model – to try and obtain a transcript.
llm 'transcript' \
  -a https://static.simonwillison.net/static/2024/video-scraping-pelicans.mp3 \
  -m gemini-1.5-flash-8b-latest
It worked!

Hey everyone, welcome back. You ever find yourself wading through mountains of data, trying to pluck out the juicy bits? It’s like hunting for a single shrimp in a whole kelp forest, am I right? Oh, tell me about it. I swear, sometimes I feel like I’m gonna go cross-eyed from staring at spreadsheets all day. […]

Full output here.
Once again, llm logs -c --json will show us the tokens used. Here it’s 14754 prompt tokens and 1865 completion tokens. The pricing calculator says that adds up to… 0.0833 cents. Less than a tenth of a cent to transcribe a 7m40s audio clip.
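To put those numbers in per-hour terms, here is a rough extrapolation from the figures above. It assumes that prompt tokens and cost scale linearly with audio length, which the post does not claim, and it ignores the handful of tokens used by the text prompt itself.

```python
# Figures reported above for the 7m40s MP3
clip_seconds = 7 * 60 + 40        # 460 seconds
prompt_tokens = 14754
clip_cost_cents = 0.0833

# Assumption: tokens and cost grow linearly with clip length
tokens_per_second = prompt_tokens / clip_seconds
cents_per_hour = clip_cost_cents / clip_seconds * 3600

print(f"{tokens_per_second:.1f} prompt tokens per second of audio")  # 32.1
print(f"{cents_per_hour:.2f} cents per hour of audio")               # 0.65
```

Real costs will vary with how much speech the audio contains, since the completion tokens depend on the length of the transcript.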
There’s a Python API too
Here’s what it looks like to execute multi-modal prompts with attachments using the LLM Python library:
import llm

model = llm.get_model("gpt-4o-mini")
response = model.prompt(
    "Describe these images",
    attachments=[
        llm.Attachment(path="pelican.jpg"),
        llm.Attachment(
            url="https://static.simonwillison.net/static/2024/pelicans.jpg"
        ),
    ]
)
You can send multiple attachments with a single prompt, and both file paths and URLs are supported – or even binary content, using llm.Attachment(content=b'binary goes here').
Any model plugin becomes available to Python with the same interface, making this LLM library a useful abstraction layer to try out the same prompts against many different models, both local and remote.
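As a sketch of that abstraction-layer idea, the hypothetical helper below (compare_models is not part of LLM) sends one prompt and one set of attachments to several models and collects the replies. It takes the model lookup function as a parameter so the loop can be tried with stand-ins; in real use you would pass llm.get_model. It assumes each response exposes its output through response.text().

```python
def compare_models(model_ids, prompt, attachments, get_model):
    """Run the same prompt and attachments against each model ID.

    get_model is a callable such as llm.get_model. compare_models itself
    is a hypothetical helper, not something the LLM library provides.
    """
    results = {}
    for model_id in model_ids:
        model = get_model(model_id)
        response = model.prompt(prompt, attachments=attachments)
        results[model_id] = response.text()
    return results


# Intended usage (needs the llm package, the relevant plugins and API keys):
#
#   import llm
#   replies = compare_models(
#       ["gpt-4o-mini", "gemini-1.5-flash-8b-latest"],
#       "Describe this image",
#       [llm.Attachment(path="pelican.jpg")],
#       get_model=llm.get_model,
#   )
#   for model_id, text in replies.items():
#       print(model_id, text)
```

The model IDs in the usage comment are the ones that appear earlier in this post; any other installed model that accepts the attachment type should work the same way.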
What can we do with this?
I’ve only had this working for a couple of days and the potential applications are somewhat dizzying. It’s trivial to spin up a Bash script that can do things like generate alt= text for every image in a directory, for example. Here’s one Claude wrote just now:
#!/bin/bash
# Write alt text for each JPEG in the current directory to a matching .txt file
for img in *.{jpg,jpeg}; do
    if [ -f "$img" ]; then
        output="${img%.*}.txt"
        llm -m gpt-4o-mini 'return just the alt text for this image' -a "$img" > "$output"
    fi
done
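The same loop could be written against the Python library. This is a sketch, not a tested replacement for the script: write_alt_text is a hypothetical helper, and it takes the image-describing step as a callable so the file handling can be exercised without calling a model. Unlike the Bash glob, it also matches upper-case extensions such as .JPG.

```python
from pathlib import Path


def write_alt_text(directory, describe, suffixes=(".jpg", ".jpeg")):
    """Write <name>.txt next to each matching image and return those paths.

    describe is any callable that takes an image Path and returns text.
    """
    written = []
    # sorted() lists the directory once, before any .txt files are created
    for img in sorted(Path(directory).iterdir()):
        if img.is_file() and img.suffix.lower() in suffixes:
            output = img.with_suffix(".txt")
            output.write_text(describe(img))
            written.append(output)
    return written


# Intended usage with LLM (needs the llm package and an OpenAI key):
#
#   import llm
#   model = llm.get_model("gpt-4o-mini")
#
#   def describe(img):
#       response = model.prompt(
#           "return just the alt text for this image",
#           attachments=[llm.Attachment(path=str(img))],
#       )
#       return response.text()
#
#   write_alt_text(".", describe)
```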
On the #llm Discord channel Drew Breunig suggested this one-liner:
llm prompt -m gpt-4o "
tell me if it’s foggy in this image, reply on a scale from
1-10 with 10 being so foggy you can’t see anything and 1
being clear enough to see the hills in the distance.
Only respond with a single number." \
-a https://cameras.alertcalifornia.org/public-camera-data/Axis-Purisma1/latest-frame.jpg
That URL is to a live webcam feed, so here’s an instant GPT-4o vision powered weather report!
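Because the model's reply is free text, a script that consumes this one-liner would probably want to validate it before acting on it. A minimal sketch follows; parse_fog_score is a hypothetical helper, and its strictness (reject anything that is not a bare integer from 1 to 10) is an assumption about what you would want.

```python
def parse_fog_score(reply):
    """Return the reply as an int from 1 to 10, or raise ValueError."""
    text = reply.strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"expected a single number, got {reply!r}")
    score = int(text)
    if not 1 <= score <= 10:
        raise ValueError(f"score {score} is outside the 1-10 scale")
    return score


print(parse_fog_score("7\n"))  # 7
```

Raising on anything unexpected keeps a refusal or a chatty answer from being silently treated as a weather reading.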
We can have so much fun with this stuff.
All of the usual AI caveats apply: it can make mistakes, it can hallucinate, safety filters may kick in and refuse to transcribe audio based on the content. A lot of work is needed to evaluate how well the models perform at different tasks. There’s a lot still to explore here.
But at 1/10th of a cent for 7 minutes of audio at least those explorations can be plentiful and inexpensive!
Tags: projects, ai, openai, generative-ai, llms, llm, anthropic, claude, mistral, gemini, vision-llms

AI Summary and Description: Yes

Summary: The text describes the release of version 0.17 of a combined CLI tool and Python library, LLM, that facilitates interaction with various Large Language Models (LLMs), including multimodal models that can process images, audio, and video. It highlights key features such as plugin support, cost-effective usage, and potential applications for AI professionals.

Detailed Description:
LLM 0.17 adds multimodal input: prompts can now include image, audio and video attachments for models that support them. This may be of interest to security and compliance professionals looking to integrate AI functionality into their systems.

– **Multimodal Model Support**: LLM 0.17 allows users to process not just text but also images, audio, and video using compatible LLMs, making it versatile for various applications in AI.
– **Installation and Setup**: Users can install the tool easily via package managers (brew, pipx, uv tool) and set up API keys for different models (OpenAI, Gemini).
– **Cost Efficiency**: The examples are inexpensive to run: under a quarter of a cent to describe an image with gpt-4o-mini, and under a tenth of a cent to transcribe a 7m40s audio clip with Gemini 1.5 Flash-8B.
– **Example Use Cases**:
    – Describing images using vision-capable models like gpt-4o-mini and Gemini.
    – Transcribing audio files, demonstrating the practicality of the tool in real-world applications such as podcasting or transcription services.
    – Potential for batch processing of images via scripts, highlighting its automation capabilities.
– **Plugin Architecture**: LLM supports plugins enabling connections to various models, enhancing its functionality and allowing for local and remote processing.
– **Python API**: Developers can utilize the Python API for integrating multi-modal interactions into their applications seamlessly.
– **Future Considerations**: Users should be aware of the usual limitations: models can make mistakes or hallucinate, and safety filters may refuse to transcribe some audio based on its content. More work is needed to evaluate how well the models perform at different tasks.

The release of LLM 0.17 makes richer, multimodal interactions with AI models available from the terminal and from Python, which is relevant to security and privacy professionals looking to adopt AI in their projects. Continued evaluation of these models will be critical to using them in a secure and compliant manner.