Source URL: https://simonwillison.net/2025/Mar/25/gemini/
Source: Simon Willison’s Weblog
Title: Putting Gemini 2.5 Pro through its paces
Feedly Summary: There’s a new release from Google Gemini this morning: the first in the Gemini 2.5 series. Google call it “a thinking model, designed to tackle increasingly complex problems”. It’s already sat at the top of the LM Arena leaderboard, and from initial impressions looks like it may deserve that top spot.
I just released llm-gemini 0.16 adding support for the new model to my LLM command-line tool. Let’s try it out.
The pelican riding a bicycle
Transcribing audio
Bounding boxes
Gemini 2.5 Pro is a very strong new model
The pelican riding a bicycle
First up, my classic generate an SVG of a pelican riding a bicycle prompt.
# Upgrade the plugin
llm install -U llm-gemini
# Now run the prompt:
llm -m gemini-2.5-pro-exp-03-25 "Generate an SVG of a pelican riding a bicycle"
It’s pretty solid!
Here’s the full transcript.
This task is meant to be almost impossible: pelicans are the wrong shape to ride bicycles! Given that, I think this is a good attempt – I like it slightly better than my previous favourite Claude 3.7 Sonnet, which produced this a month ago:
Transcribing audio
I had an MP3 lying around from a previous experiment which mixes English and Spanish. I tried running it with the prompt transcribe to see what would happen:
llm -m gemini-2.5-pro-exp-03-25 'transcribe' \
  -a https://static.simonwillison.net/static/2025/russian-pelican-in-spanish.mp3
I got back this, with timestamps interspersed with the text:
I need you [ 0m0s450ms ] to pretend [ 0m0s880ms ] to be [ 0m0s990ms ] a California [ 0m1s560ms ] brown [ 0m1s850ms ] pelican [ 0m2s320ms ] with [ 0m2s480ms ] a very [ 0m2s990ms ] thick [ 0m3s290ms ] Russian [ 0m3s710ms ] accent, [ 0m4s110ms ] but [ 0m4s540ms ] you [ 0m4s640ms ] talk [ 0m4s830ms ] to me [ 0m4s960ms ] exclusively [ 0m5s660ms ] in Spanish. [ 0m6s200ms ] Oye, [ 0m8s930ms ] camarada, [ 0m9s570ms ] aquí [ 0m10s240ms ] está [ 0m10s590ms ] tu [ 0m10s740ms ] pelícano [ 0m11s370ms ] californiano [ 0m12s320ms ] con [ 0m12s520ms ] acento [ 0m13s100ms ] ruso. [ 0m13s540ms ] Qué [ 0m14s230ms ] tal, [ 0m14s570ms ] tovarisch? [ 0m15s210ms ] Listo [ 0m15s960ms ] para [ 0m16s190ms ] charlar [ 0m16s640ms ] en [ 0m16s750ms ] español? [ 0m17s250ms ] How’s [ 0m19s834ms ] your [ 0m19s944ms ] day [ 0m20s134ms ] today? [ 0m20s414ms ] Mi [ 0m22s654ms ] día [ 0m22s934ms ] ha [ 0m23s4ms ] sido [ 0m23s464ms ] volando [ 0m24s204ms ] sobre [ 0m24s594ms ] las [ 0m24s844ms ] olas, [ 0m25s334ms ] buscando [ 0m26s264ms ] peces [ 0m26s954ms ] y [ 0m27s84ms ] disfrutando [ 0m28s14ms ] del [ 0m28s244ms ] sol [ 0m28s664ms ] californiano. [ 0m29s444ms ] Y [ 0m30s314ms ] tú, [ 0m30s614ms ] amigo, ¿ [ 0m31s354ms ] cómo [ 0m31s634ms ] ha [ 0m31s664ms ] estado [ 0m31s984ms ] tu [ 0m32s134ms ] día? [ 0m32s424ms ]
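Those inline timestamps use a compact 0m12s320ms style format. As a quick aside, here's a minimal Python sketch of how you might parse them into seconds (a hypothetical helper of my own, not something LLM ships):
import re

def parse_timestamp(ts):
    # Convert a Gemini-style "0m12s320ms" timestamp into seconds
    minutes, seconds, millis = re.fullmatch(
        r"(\d+)m(\d+)s(\d+)ms", ts.strip()
    ).groups()
    return int(minutes) * 60 + int(seconds) + int(millis) / 1000

parse_timestamp("0m12s320ms")  # 12.32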
That output inspired me to try again, this time including a JSON schema (using LLM’s custom schema DSL):
llm -m gemini-2.5-pro-exp-03-25 'transcribe' \
  -a https://static.simonwillison.net/static/2025/russian-pelican-in-spanish.mp3 \
  --schema-multi 'timestamp str: mm:ss,text, language: two letter code'
I got an excellent response from that:
{
"items": [
{
"language": "en",
"text": "I need you to pretend to be a California brown pelican with a very thick Russian accent, but you talk to me exclusively in Spanish.",
"timestamp": "00:00"
},
{
"language": "es",
"text": "Oye, camarada. Aquí está tu pelícano californiano con acento ruso.",
"timestamp": "00:08"
},
{
"language": "es",
"text": "¿Qué tal, Tovarish? ¿Listo para charlar en español?",
"timestamp": "00:13"
},
{
"language": "en",
"text": "How’s your day today?",
"timestamp": "00:19"
},
{
"language": "es",
"text": "Mi día ha sido volando sobre las olas, buscando peces y disfrutando del sol californiano.",
"timestamp": "00:22"
},
{
"language": "es",
"text": "¿Y tú, amigo, cómo ha estado tu día?",
"timestamp": "00:30"
}
]
}
I confirmed that the timestamps match the audio. This is fantastic.
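As an aside, the same experiment can be driven from LLM’s Python API instead of the CLI. The sketch below is mine, not from the post; in particular the expanded dict is my approximation of the JSON schema that --schema-multi string turns into:
import llm

# My approximation of what
# --schema-multi 'timestamp str: mm:ss,text, language: two letter code'
# expands to: an object wrapping an "items" array
schema = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "timestamp": {"type": "string", "description": "mm:ss"},
                    "text": {"type": "string"},
                    "language": {"type": "string", "description": "two letter code"},
                },
                "required": ["timestamp", "text", "language"],
            },
        }
    },
    "required": ["items"],
}

model = llm.get_model("gemini-2.5-pro-exp-03-25")
response = model.prompt(
    "transcribe",
    attachments=[
        llm.Attachment(
            url="https://static.simonwillison.net/static/2025/russian-pelican-in-spanish.mp3"
        )
    ],
    schema=schema,
)
print(response.text())  # JSON matching the schema, like the output above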
Let’s try that against a ten minute snippet of a podcast episode I was on:
llm -m gemini-2.5-pro-exp-03-25 \
  'transcribe, first speaker is Christopher, second is Simon' \
  -a ten-minutes-of-podcast.mp3 \
  --schema-multi 'timestamp str: mm:ss, text, speaker_name'
Useful LLM trick: you can use llm logs -c --data to get just the JSON data from the most recent prompt response, so I ran this:
llm logs -c --data | jq
Here’s the full output JSON, which starts and ends like this:
{
"items": [
{
"speaker_name": "Christopher",
"text": "on its own and and it has this sort of like a it’s like a you know old tree in the forest, you know, kind of thing that you’ve built, so.",
"timestamp": "00:00"
},
{
"speaker_name": "Simon",
"text": "There’s also like I feel like with online writing, never ever like stick something online just expect people to find it. You have to So one of the great things about having a blog is I can be in a conversation about something and somebody ask a question, I can say, oh, I wrote about that two and a half years ago and give people a link.",
"timestamp": "00:06"
},
{
"speaker_name": "Simon",
"text": "So on that basis, Chat and I can’t remember if the free version of Chat GPT has code interpreter.",
"timestamp": "09:45"
},
{
"speaker_name": "Simon",
"text": "I hope I think it does.",
"timestamp": "09:50"
},
{
"speaker_name": "Christopher",
"text": "Okay. So this is like the basic paid one, maybe the $20 month because I know there’s like a $200 one that’s a little steep for like a basic",
"timestamp": "09:51"
}
]
}
A spot check of the timestamps showed them in the right place. Gemini 2.5 supports long context prompts so it’s possible this works well for much longer audio files – it would be interesting to dig deeper and try that out.
Bounding boxes
One of my favourite features of previous Gemini models is their support for bounding boxes: you can prompt them to return boxes around objects in images.
I built a separate tool for experimenting with this feature in August last year, which I described in Building a tool showing how Gemini Pro can return bounding boxes for objects in images. I’ve now upgraded that tool to add support for the new model.
You can access it at tools.simonwillison.net/gemini-bbox – you’ll need to provide your own Gemini API key which is sent directly to their API from your browser (it won’t be logged by an intermediary).
I tried it out on a challenging photograph of some pelicans… and it worked extremely well:
My prompt was:
Return bounding boxes around pelicans as JSON arrays [ymin, xmin, ymax, xmax]
The Gemini models are all trained to return bounding boxes scaled between 0 and 1000. My tool knows how to convert those back to the same dimensions as the input image.
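That conversion is simple scaling. A quick sketch of the idea in Python (my own illustration, not the tool’s actual code):
def to_pixels(box, width, height):
    # Scale a [ymin, xmin, ymax, xmax] box from the model's
    # 0-1000 coordinate space back to pixel coordinates
    ymin, xmin, ymax, xmax = box
    return [
        ymin / 1000 * height,
        xmin / 1000 * width,
        ymax / 1000 * height,
        xmax / 1000 * width,
    ]

to_pixels([100, 250, 400, 750], width=1600, height=1200)
# [120.0, 400.0, 480.0, 1200.0]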
Here’s what the visualized result looked like:
It got almost all of them! I like how it didn’t draw a box around the one egret that had made it into the photo.
Gemini 2.5 Pro is a very strong new model
I’ve hardly scratched the surface when it comes to trying out Gemini 2.5 Pro so far. How’s its creative writing? Factual knowledge about the world? Can it write great code in Python, JavaScript, Rust and more?
The Gemini family of models have capabilities that set them apart from other models:
Long context length – Gemini 2.5 Pro supports up to 1 million tokens
Audio input – something which few other models support, certainly not at this length and with this level of timestamp accuracy
Accurate bounding box detection for image inputs
My experiments so far with these capabilities indicate that Gemini 2.5 Pro really is a very strong new model. I’m looking forward to exploring more of what it can do.
Tags: google, ai, generative-ai, llms, gemini, vision-llms, pelican-riding-a-bicycle, llm-release