Simon Willison’s Weblog: Load Llama-3.2 WebGPU in your browser from a local folder

Source URL: https://simonwillison.net/2025/Sep/8/webgpu-local-folder/#atom-everything
Source: Simon Willison’s Weblog
Title: Load Llama-3.2 WebGPU in your browser from a local folder

Feedly Summary: Load Llama-3.2 WebGPU in your browser from a local folder
Inspired by a comment on Hacker News I decided to see if it was possible to modify the transformers.js-examples/tree/main/llama-3.2-webgpu Llama 3.2 chat demo (online here, I wrote about it last November) to add an option to open a local model file directly from a folder on disk, rather than waiting for it to download over the network.
I posed the problem to OpenAI’s GPT-5-enabled Codex CLI like this:
git clone https://github.com/huggingface/transformers.js-examples
cd transformers.js-examples/llama-3.2-webgpu
codex

Then this prompt:

Modify this application such that it offers the user a file browse button for selecting their own local copy of the model file instead of loading it over the network. Provide a "download model" option too.

Codex churned away for several minutes, even running commands like curl -sL https://raw.githubusercontent.com/huggingface/transformers.js/main/src/models.js | sed -n '1,200p' to inspect the source code of the underlying Transformers.js library.
After four prompts total (shown here) it built something which worked!
To try it out you’ll need your own local copy of the Llama 3.2 ONNX model. You can get that (a ~1.2GB download) like so:
git lfs install
git clone https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct-q4f16
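
For context, the unmodified demo fetches this model from the Hugging Face Hub via Transformers.js. Here is a minimal sketch of that loading step, assuming the Transformers.js v3 AutoTokenizer/AutoModelForCausalLM API (the example's actual worker code may differ in detail):

// Sketch only: loads the quantized Llama 3.2 ONNX model on WebGPU.
// Run inside an ES module or async function.
import { AutoTokenizer, AutoModelForCausalLM } from "@huggingface/transformers";

const model_id = "onnx-community/Llama-3.2-1B-Instruct-q4f16";
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: "q4f16",   // 4-bit weights with fp16 activations, matching the repo above
  device: "webgpu", // requires a WebGPU-capable browser
});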

Then visit my llama-3.2-webgpu page in Chrome or Firefox Nightly (since WebGPU is required), click "Browse folder", select the folder you just cloned, agree to the "Upload" confirmation (confusing wording, since nothing is actually uploaded from your browser; the model files are opened locally on your machine) and click "Load local model".
Here’s an animated demo.
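
The "Browse folder" button can be built with standard web APIs; here is a minimal sketch (the loadLocalModel helper is hypothetical, named only for illustration, and the branch's actual wiring differs). A directory-type file input hands the app File objects for everything in the chosen folder, which are then read locally rather than fetched:

// Sketch: let the user pick the cloned model folder. Call from a click handler.
const input = document.createElement("input");
input.type = "file";
input.webkitdirectory = true; // select a whole folder instead of a single file

input.addEventListener("change", () => {
  // Map each file's path inside the folder (top-level folder name stripped)
  // to its File object.
  const files = new Map(
    [...input.files].map((f) => [
      f.webkitRelativePath.split("/").slice(1).join("/"),
      f,
    ])
  );
  // Nothing is uploaded: the browser reads these File objects from disk.
  loadLocalModel(files); // hypothetical: feeds Transformers.js from the map
});

input.click();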

I pushed a branch with those changes here. The next step would be to modify this to support other models in addition to the Llama 3.2 demo, but I’m pleased to have got to this proof of concept with so little work beyond throwing some prompts at Codex to see if it could figure it out.
According to the Codex /status command this used 169,818 input tokens, 17,112 output tokens and 1,176,320 cached input tokens. At current GPT-5 token pricing ($1.25/million input, $0.125/million cached input, $10/million output) that would cost 53.942 cents, but Codex CLI hooks into my existing $20/month ChatGPT Plus plan so this was bundled into that.
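
The cost arithmetic is just a weighted sum of the three token counts at their per-million prices; with the figures above it comes to a little over half a dollar:

// Token counts from Codex /status, prices in dollars per million tokens.
const costUSD =
  (169_818 * 1.25 +    // input tokens
   1_176_320 * 0.125 + // cached input tokens
   17_112 * 10) /      // output tokens
  1_000_000;
console.log(`$${costUSD.toFixed(3)}`); // prints the total in dollars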
Via My Hacker News comment
Tags: javascript, ai, generative-ai, llama, local-llms, llms, ai-assisted-programming, transformers-js, webgpu, llm-pricing, vibe-coding, gpt-5, codex-cli

AI Summary and Description: Yes

Summary: The text details a technical exploration of modifying a browser-based, WebGPU-powered chat demo for the Llama 3.2 model so that it can load model files from a local folder, with OpenAI’s GPT-5-powered Codex CLI generating the code changes. This is relevant because running generative AI models locally in the browser has implications for data privacy and model management.

Detailed Description: The content walks through a practical modification to a Llama 3.2 browser demo, focused on improving how users access machine learning models. Here are the major points:

– **Modification of AI Model Demo**: The author enhances an existing chat demo for the Llama 3.2 model so that it can load the model directly from local files, giving users an alternative to downloading it over the network.

– **Use of Codex**: OpenAI’s GPT-5-powered Codex CLI was utilized to generate code modifications through a series of prompts, showcasing how AI can assist programmers in rapid development and iterative enhancements.

– **Installation and Setup**:
  – Users must install Git LFS and clone a repository containing the Llama 3.2 ONNX model.
  – The user interface runs in web browsers that support WebGPU, specifically Chrome or Firefox Nightly.

– **Cost Efficiency**: The author discusses cost implications based on token usage with Codex, indicating financial efficiency linked to using existing subscription plans for AI services.

– **Proof of Concept**: The work is presented as a successful proof of concept, highlighting the ease of integrating local model capabilities, paving the way for further enhancements that might include support for additional models.

– **Community Engagement**: The text connects to a wider community of developers via a Hacker News comment, reflecting on collaborative exploration and knowledge sharing in coding practices.

**Implications for Security and Compliance Professionals**:
– **Data Privacy**: Allowing local model loading means sensitive data may remain on local machines rather than being transmitted over networks, helping preserve user privacy.
– **Access Control**: Local hosting solutions may necessitate robust access control mechanisms to ensure that unauthorized users cannot access locally stored models and data.
– **Compliance**: Use of models might invoke considerations under compliance frameworks regarding data handling, especially if sensitive information is involved in model training or inference.
– **Security Measures**: The shift to a local model architecture demands attention to securing environments where models are hosted, including considerations for device security and potential vulnerabilities when using browser-based interfaces.