Tag: quantized

  • Docker: Tool Calling with Local LLMs: A Practical Evaluation

    Source URL: https://www.docker.com/blog/local-llm-tool-calling-a-practical-evaluation/ Source: Docker Title: Tool Calling with Local LLMs: A Practical Evaluation Feedly Summary: When building GenAI and agentic applications, one of the most pressing and persistent questions is: “Which local model should I use for tool calling?” We kept hearing again and again,…
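
    The post evaluates local models on OpenAI-style tool calling. As a rough sketch of what such a call looks like (not Docker's own harness; the endpoint port, model name, and tool schema below are illustrative assumptions), a request against a local OpenAI-compatible server might be:

        # Hypothetical local endpoint and model name; substitute your own.
        curl -s http://localhost:8080/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
            "model": "qwen3",
            "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
            "tools": [{
              "type": "function",
              "function": {
                "name": "get_weather",
                "description": "Get current weather for a city",
                "parameters": {
                  "type": "object",
                  "properties": {"city": {"type": "string"}},
                  "required": ["city"]
                }
              }
            }]
          }'

    A model that handles tool calling well should answer with a tool_calls entry naming get_weather rather than free-text prose.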

  • Simon Willison’s Weblog: Quoting Ted Sanders

    Source URL: https://simonwillison.net/2025/Jun/11/ted-sanders/#atom-everything Source: Simon Willison’s Weblog Title: Quoting Ted Sanders Feedly Summary: [on the cheaper o3] Not quantized. Weights are the same. If we did change the model, we’d release it as a new model with a new name in the API (e.g., o3-turbo-2025-06-10). It would be very annoying to API customers if we…

  • Simon Willison’s Weblog: Trying out llama.cpp’s new vision support

    Source URL: https://simonwillison.net/2025/May/10/llama-cpp-vision/#atom-everything Source: Simon Willison’s Weblog Title: Trying out llama.cpp’s new vision support Feedly Summary: This llama.cpp server vision support via libmtmd pull request – via Hacker News – was merged earlier today. The PR finally adds full support for vision models to the excellent llama.cpp project. It’s documented on this page, but the…
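
    The merged PR adds a libmtmd-based multimodal path to llama.cpp. A minimal sketch of trying it from the command line, assuming a build that ships the new multimodal CLI (the model repo and flags here are my assumptions, not taken from the post):

        # llama-mtmd-cli is the multimodal CLI added alongside libmtmd;
        # -hf fetches a GGUF model (and its mmproj projector) from Hugging Face.
        llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF \
          --image photo.jpg \
          -p "Describe this image in one sentence."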

  • Simon Willison’s Weblog: Qwen3-8B

    Source URL: https://simonwillison.net/2025/May/2/qwen3-8b/#atom-everything Source: Simon Willison’s Weblog Title: Qwen3-8B Feedly Summary: Having tried a few of the Qwen 3 models now my favorite is a bit of a surprise to me: I’m really enjoying Qwen3-8B. I’ve been running prompts through the MLX 4bit quantized version, mlx-community/Qwen3-8B-4bit. I’m using llm-mlx like this: llm install llm-mlx llm…
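
    The quoted commands are cut off; a hedged reconstruction of a typical llm-mlx workflow (the download-model step is my assumption about the plugin's usage, not the post's exact text):

        # Install the MLX plugin for the llm CLI, fetch the 4-bit quantized
        # model, then run a prompt through it.
        llm install llm-mlx
        llm mlx download-model mlx-community/Qwen3-8B-4bit
        llm -m mlx-community/Qwen3-8B-4bit 'Summarize quantization in one sentence.'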

  • Slashdot: DeepSeek-V3 Now Runs At 20 Tokens Per Second On Mac Studio

    Source URL: https://apple.slashdot.org/story/25/03/25/2054214/deepseek-v3-now-runs-at-20-tokens-per-second-on-mac-studio Source: Slashdot Title: DeepSeek-V3 Now Runs At 20 Tokens Per Second On Mac Studio Feedly Summary: The text discusses the launch of DeepSeek’s new large language model, DeepSeek-V3-0324, highlighting its unique deployment strategy and implications for the AI industry. Its compatibility with consumer-grade hardware and open-source…
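
    The reported 20 tokens/second came from Apple Silicon running a quantized build. As a sketch of how one might reproduce that kind of setup with mlx-lm (the package commands are real; the exact model repo name and the 4-bit quantization are assumptions on my part):

        # Assumes Apple Silicon with very large unified memory; a 4-bit
        # quantization of a ~671B-parameter model is still hundreds of GB.
        pip install mlx-lm
        mlx_lm.generate --model mlx-community/DeepSeek-V3-0324-4bit \
          --prompt "Hello" --max-tokens 100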

  • Simon Willison’s Weblog: Mistral Small 3.1

    Source URL: https://simonwillison.net/2025/Mar/17/mistral-small-31/#atom-everything Source: Simon Willison’s Weblog Title: Mistral Small 3.1 Feedly Summary: Mistral Small 3.1 Mistral Small 3 came out in January and was a notable, genuinely excellent local model that used an Apache 2.0 license. Mistral Small 3.1 offers a significant improvement: it’s multi-modal (images) and has an increased 128,000 token context length,…

  • The Register: DeepSeek-R1-beating perf in a 32B package? El Reg digs its claws into Alibaba’s QwQ

    Source URL: https://www.theregister.com/2025/03/16/qwq_hands_on_review/ Source: The Register Title: DeepSeek-R1-beating perf in a 32B package? El Reg digs its claws into Alibaba’s QwQ Feedly Summary: How to tame its hypersensitive hyperparameters and get it running on your PC. Hands on: How much can reinforcement learning – and a bit of extra verification – improve large language models,…
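
    The “hypersensitive hyperparameters” are QwQ’s sampler settings; Qwen’s model card recommends temperature 0.6, top-p 0.95, and top-k between 20 and 40. A sketch of applying those values in an interactive Ollama session (Ollama is one option here, not necessarily The Register’s exact setup):

        ollama run qwq
        # Inside the session, override the defaults per Qwen's guidance:
        >>> /set parameter temperature 0.6
        >>> /set parameter top_p 0.95
        >>> /set parameter top_k 30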

  • Scott Logic: There is more than one way to do GenAI

    Source URL: https://blog.scottlogic.com/2025/02/20/there-is-more-than-one-way-to-do-genai.html Source: Scott Logic Title: There is more than one way to do GenAI Feedly Summary: AI doesn’t have to be brute-forced in massive data centres. Europe isn’t necessarily behind in the AI arms race. In fact, the UK and Europe’s constraints and focus on more than just economic return and speculation might…