Source URL: https://simonwillison.net/2025/Apr/18/gemini-image-segmentation/
Source: Simon Willison’s Weblog
Title: Image segmentation using Gemini 2.5
Feedly Summary: Max Woolf pointed out this new feature of the Gemini 2.5 series in a comment on Hacker News:
One hidden note from Gemini 2.5 Flash when diving deep into the documentation: for image inputs, not only can the model be instructed to generate 2D bounding boxes of relevant subjects, but it can also create segmentation masks!
At this price point with the Flash model, creating segmentation masks is pretty nifty.
I built a tool last year to explore Gemini’s bounding box abilities. This new segmentation mask feature represents a significant new capability!
Here’s my new tool to try it out: Gemini API Image Mask Visualization. As with my bounding box tool it’s browser-based JavaScript that talks to the Gemini API directly. You provide it with a Gemini API key which isn’t logged anywhere that I can see it.
This is what it can do:
Give it an image and a prompt of the form:
Give the segmentation masks for the objects. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key "box_2d" and the segmentation mask in key "mask".
My tool then runs the prompt and displays the resulting JSON. The Gemini API returns segmentation masks as base64-encoded PNG images, in strings that start with data:image/png;base64,iVBOR…. The tool then visualizes those in a few different ways on the page, including overlaying them on the original image.
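That decoding step can be sketched in a few lines of Python. This is a hedged illustration, not the actual JavaScript from the tool: the entry shape follows the prompt above, and the data-URI prefix matches the strings described in the post.

```python
import base64

# Every valid PNG file begins with this 8-byte signature;
# "iVBOR" in the data URI is the base64 encoding of its start.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
DATA_URI_PREFIX = "data:image/png;base64,"

def decode_mask(entry):
    """Decode one mask entry of the form
    {"box_2d": [...], "mask": "data:image/png;base64,..."}.
    Returns (box_2d, raw PNG bytes of the mask)."""
    data_uri = entry["mask"]
    if not data_uri.startswith(DATA_URI_PREFIX):
        raise ValueError("unexpected mask encoding: %r" % data_uri[:30])
    png_bytes = base64.b64decode(data_uri[len(DATA_URI_PREFIX):])
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("decoded mask is not a PNG")
    return entry["box_2d"], png_bytes
```

From there, a library like Pillow can open the PNG bytes and composite the mask over the original image, which is essentially what the browser tool does with a canvas.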
I vibe coded the whole thing together using a combination of Claude and ChatGPT. I started with a Claude Artifacts React prototype, then pasted the code from my old project into Claude and hacked on that until I ran out of tokens. I transferred the incomplete result to a new Claude session where I kept on iterating until it got stuck in a bug loop (the same bug kept coming back no matter how many times I asked it to fix it)… so I switched over to o3 in ChatGPT to finish it off.
Here’s the finished code. It’s a total mess, but it’s also less than 500 lines of code and the interface solves my problem in that it lets me explore the new Gemini capability.
Segmenting my pelican photo via the Gemini API was absurdly inexpensive. Using Gemini 2.5 Pro the call cost 303 input tokens and 353 output tokens, for a total cost of 0.2144 cents (less than a quarter of a cent). I ran it again with the new Gemini 2.5 Flash and it used 303 input tokens and 270 output tokens, for a total cost of 0.099 cents (less than a tenth of a cent). I calculated these prices using my LLM pricing calculator tool.
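The arithmetic behind those figures is just token counts multiplied by the per-million-token rate for each model. A minimal sketch, with the caveat that the rates in the example below are made up for illustration and are not Gemini's actual price list:

```python
def cost_cents(input_tokens, output_tokens, input_rate, output_rate):
    """Cost of a single API call in cents.

    input_rate / output_rate are dollars per million tokens;
    look up the current values on the provider's rate card.
    """
    dollars = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return dollars * 100

# With hypothetical rates of $1.00/M input and $5.00/M output,
# the Flash call's token counts from above come to about 0.165 cents:
# cost_cents(303, 270, 1.00, 5.00)
```

A tool like the LLM pricing calculator mentioned above is doing this same calculation with the real rates filled in.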
Tags: google, tools, ai, generative-ai, llms, ai-assisted-programming, gemini, vision-llms, llm-pricing, vibe-coding
AI Summary and Description: Yes
Summary: The text highlights a new feature in the Gemini 2.5 series that allows for the generation of segmentation masks from image inputs, which is a significant advancement for AI tools in image processing. This capability, combined with a cost-effective model, showcases the potential for enhanced image analysis in various applications.
Detailed Description: The text provides an overview of a newly introduced feature in the Gemini 2.5 series by Google, specifically for image analysis using AI models. This feature allows users to create segmentation masks along with 2D bounding boxes, which is beneficial for a range of applications including computer vision and image processing tasks.
Key points include:
– **New Feature**: The Gemini 2.5 Flash model can generate both 2D bounding boxes and segmentation masks for objects within images.
– **Developer Tool**: The author has developed a tool to visualize these segmentation masks, showcasing the new capabilities of the Gemini API.
– **Cost Efficiency**: The pricing structure for using the Gemini API is very economical. For instance:
– Using Gemini 2.5 Pro, a single call cost approximately 0.2144 cents.
– With the Gemini 2.5 Flash model, the same call cost around 0.099 cents, demonstrating substantial savings.
– **Technical Implementation**: The tool is built using a combination of Claude and ChatGPT, showcasing a collaborative approach in developing AI-assisted tools. The author highlights challenges faced during the tool’s development, including debugging issues and token limitations.
This advancement in AI capabilities can significantly impact fields relying on image classification and object recognition, driving innovation in software development, enhancing MLOps processes, and improving the overall functionality of generative AI systems.