Tag: vision-llms
-
Simon Willison’s Weblog: Vision Language Models (Better, Faster, Stronger)
Source URL: https://simonwillison.net/2025/May/13/vision-language-models/#atom-everything
Source: Simon Willison’s Weblog
Title: Vision Language Models (Better, Faster, Stronger)
Feedly Summary: Extremely useful review of the last year in vision and multi-modal LLMs. So much has happened! I’m particularly excited about the range of small open weight vision models that are now available. Models…
-
Simon Willison’s Weblog: Trying out llama.cpp’s new vision support
Source URL: https://simonwillison.net/2025/May/10/llama-cpp-vision/#atom-everything
Source: Simon Willison’s Weblog
Title: Trying out llama.cpp’s new vision support
Feedly Summary: This llama.cpp server vision support via libmtmd pull request – via Hacker News – was merged earlier today. The PR finally adds full support for vision models to the excellent llama.cpp project. It’s documented on this page, but the…
-
Simon Willison’s Weblog: Create and edit images with Gemini 2.0 in preview
Source URL: https://simonwillison.net/2025/May/7/gemini-images-preview/#atom-everything
Source: Simon Willison’s Weblog
Title: Create and edit images with Gemini 2.0 in preview
Feedly Summary: Gemini 2.0 Flash has had image generation capabilities for a while now, and they’re now available via the paid Gemini API – at 3.9 cents per generated…
-
Simon Willison’s Weblog: Gemini 2.5 Pro Preview: even better coding performance
Source URL: https://simonwillison.net/2025/May/6/gemini-25-pro-preview/#atom-everything
Source: Simon Willison’s Weblog
Title: Gemini 2.5 Pro Preview: even better coding performance
Feedly Summary: New Gemini 2.5 Pro “Google I/O edition” model, released a few weeks ahead of that annual developer conference. They claim even better frontend coding performance, highlighting their #1 ranking…
-
Simon Willison’s Weblog: Feed a video to a vision LLM as a sequence of JPEG frames on the CLI (also LLM 0.25)
Source URL: https://simonwillison.net/2025/May/5/llm-video-frames/#atom-everything
Source: Simon Willison’s Weblog
Title: Feed a video to a vision LLM as a sequence of JPEG frames on the CLI (also LLM 0.25)
Feedly Summary: The new llm-video-frames plugin can turn a video file into a sequence of JPEG frames and feed them directly into a long context vision LLM such…
-
Simon Willison’s Weblog: Qwen2.5 Omni: See, Hear, Talk, Write, Do It All!
Source URL: https://simonwillison.net/2025/Apr/28/qwen25-omni/#atom-everything
Source: Simon Willison’s Weblog
Title: Qwen2.5 Omni: See, Hear, Talk, Write, Do It All!
Feedly Summary: I’m not sure how I missed this one at the time, but last month (March 27th) Qwen released their first multi-modal model that can handle audio and…
-
Simon Willison’s Weblog: Quoting Eliot Higgins, Bellingcat
Source URL: https://simonwillison.net/2025/Apr/26/elliot-higgins/#atom-everything
Source: Simon Willison’s Weblog
Title: Quoting Eliot Higgins, Bellingcat
Feedly Summary: We’ve been seeing if the latest versions of LLMs are any better at geolocating and chronolocating images, and they’ve improved dramatically since we last tested them in 2023. […] Before anyone worries about it taking our job, I see it more…
-
Simon Willison’s Weblog: Watching o3 guess a photo’s location is surreal, dystopian and wildly entertaining
Source URL: https://simonwillison.net/2025/Apr/26/o3-photo-locations/
Source: Simon Willison’s Weblog
Title: Watching o3 guess a photo’s location is surreal, dystopian and wildly entertaining
Feedly Summary: Watching OpenAI’s new o3 model guess where a photo was taken is one of those moments where decades of science fiction suddenly come to life. It’s a cross between the Enhance Button and…
-
Simon Willison’s Weblog: Image segmentation using Gemini 2.5
Source URL: https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/
Source: Simon Willison’s Weblog
Title: Image segmentation using Gemini 2.5
Feedly Summary: Max Woolf pointed out this new feature of the Gemini 2.5 series in a comment on Hacker News: One hidden note from Gemini 2.5 Flash when diving deep into the documentation: for image inputs, not only can the model be…
-
Simon Willison’s Weblog: Quoting James Betker
Source URL: https://simonwillison.net/2025/Apr/16/james-betker/#atom-everything
Source: Simon Willison’s Weblog
Title: Quoting James Betker
Feedly Summary: I work for OpenAI. […] o4-mini is actually a considerably better vision model than o3, despite the benchmarks. Similar to how o3-mini-high was a much better coding model than o1. I would recommend using o4-mini-high over o3 for any task involving vision.…