Tag: vision-llms

  • Simon Willison’s Weblog: olmOCR

    Source URL: https://simonwillison.net/2025/Feb/26/olmocr/#atom-everything Source: Simon Willison’s Weblog Title: olmOCR Feedly Summary: olmOCR New from Ai2 – olmOCR is “an open-source tool designed for high-throughput conversion of PDFs and other documents into plain text while preserving natural reading order". At its core is allenai/olmOCR-7B-0225-preview, a Qwen2-VL-7B-Instruct variant trained on ~250,000 pages of diverse PDF content (both…

  • Simon Willison’s Weblog: Quoting Mark Zuckerberg

    Source URL: https://simonwillison.net/2025/Jan/30/mark-zuckerberg/#atom-everything Source: Simon Willison’s Weblog Title: Quoting Mark Zuckerberg Feedly Summary: Llama 4 is making great progress in training. Llama 4 mini is done with pre-training and our reasoning models and larger model are looking good too. Our goal with Llama 3 was to make open source competitive with closed models, and our…

  • Simon Willison’s Weblog: DeepSeek Janus-Pro

    Source URL: https://simonwillison.net/2025/Jan/27/deepseek-janus-pro/#atom-everything Source: Simon Willison’s Weblog Title: DeepSeek Janus-Pro Feedly Summary: DeepSeek Janus-Pro Another impressive model release from DeepSeek. Janus is their series of “unified multimodal understanding and generation models" – these are models that can both accept images as input and generate images for output. Janus-Pro is a new 7B model accompanied by…

  • Simon Willison’s Weblog: Qwen2.5 VL! Qwen2.5 VL! Qwen2.5 VL!

    Source URL: https://simonwillison.net/2025/Jan/27/qwen25-vl-qwen25-vl-qwen25-vl/ Source: Simon Willison’s Weblog Title: Qwen2.5 VL! Qwen2.5 VL! Qwen2.5 VL! Feedly Summary: Qwen2.5 VL! Qwen2.5 VL! Qwen2.5 VL! Hot on the heels of yesterday’s Qwen2.5-1M, here’s Qwen2.5 VL (with an excitable announcement title) – the latest in Qwen’s series of vision LLMs. They’re releasing multiple versions: base models and instruction tuned…

  • Simon Willison’s Weblog: Trying out QvQ – Qwen’s new visual reasoning model

    Source URL: https://simonwillison.net/2024/Dec/24/qvq/#atom-everything Source: Simon Willison’s Weblog Title: Trying out QvQ – Qwen’s new visual reasoning model Feedly Summary: I thought we were done for major model releases in 2024, but apparently not: Alibaba’s Qwen team just dropped the Apache2 2 licensed QvQ-72B-Preview, “an experimental research model focusing on enhancing visual reasoning capabilities". Their blog…

  • Simon Willison’s Weblog: Gemini 2.0 Flash: An outstanding multi-modal LLM with a sci-fi streaming mode

    Source URL: https://simonwillison.net/2024/Dec/11/gemini-2/#atom-everything Source: Simon Willison’s Weblog Title: Gemini 2.0 Flash: An outstanding multi-modal LLM with a sci-fi streaming mode Feedly Summary: Huge announcment from Google this morning: Introducing Gemini 2.0: our new AI model for the agentic era. There’s a ton of stuff in there (including updates on Project Astra and the new Project…

  • Simon Willison’s Weblog: First impressions of the new Amazon Nova LLMs (via a new llm-bedrock plugin)

    Source URL: https://simonwillison.net/2024/Dec/4/amazon-nova/ Source: Simon Willison’s Weblog Title: First impressions of the new Amazon Nova LLMs (via a new llm-bedrock plugin) Feedly Summary: Amazon released three new Large Language Models yesterday at their AWS re:Invent conference. The new model family is called Amazon Nova and comes in three sizes: Micro, Lite and Pro. I built…

  • Simon Willison’s Weblog: SmolVLM – small yet mighty Vision Language Model

    Source URL: https://simonwillison.net/2024/Nov/28/smolvlm/#atom-everything Source: Simon Willison’s Weblog Title: SmolVLM – small yet mighty Vision Language Model Feedly Summary: SmolVLM – small yet mighty Vision Language Model I’ve been having fun playing with this new vision model from the Hugging Face team behind SmolLM. They describe it as: […] a 2B VLM, SOTA for its memory…

  • Simon Willison’s Weblog: Say hello to gemini-exp-1121

    Source URL: https://simonwillison.net/2024/Nov/22/gemini-exp-1121/#atom-everything Source: Simon Willison’s Weblog Title: Say hello to gemini-exp-1121 Feedly Summary: Say hello to gemini-exp-1121 Google Gemini’s Logan Kilpatrick on Twitter: Say hello to gemini-exp-1121! Our latest experimental gemini model, with: significant gains on coding performance stronger reasoning capabilities improved visual understanding Available on Google AI Studio and the Gemini API right…

  • Simon Willison’s Weblog: Pixtral Large

    Source URL: https://simonwillison.net/2024/Nov/18/pixtral-large/ Source: Simon Willison’s Weblog Title: Pixtral Large Feedly Summary: Pixtral Large New today from Mistral: Today we announce Pixtral Large, a 124B open-weights multimodal model built on top of Mistral Large 2. Pixtral Large is the second model in our multimodal family and demonstrates frontier-level image understanding. The weights are out on…