Simon Willison’s Weblog: Load Llama-3.2 WebGPU in your browser from a local folder

Source URL: https://simonwillison.net/2025/Sep/8/webgpu-local-folder/#atom-everything
Source: Simon Willison’s Weblog
Title: Load Llama-3.2 WebGPU in your browser from a local folder

Feedly Summary: Load Llama-3.2 WebGPU in your browser from a local folder
Inspired by a comment on Hacker News I decided to see if it was possible to modify the transformers.js-examples/tree/main/llama-3.2-webgpu Llama 3.2 chat demo (online here, I wrote about it last November) to add an option to open a local model file directly from a folder on disk, rather than waiting for it to download over the network.
I posed the problem to OpenAI’s GPT-5-enabled Codex CLI like this:
git clone https://github.com/huggingface/transformers.js-examples
cd transformers.js-examples/llama-3.2-webgpu
codex

Then this prompt:

Modify this application such that it offers the user a file browse button for selecting their own local copy of the model file instead of loading it over the network. Provide a "download model" option too.

Codex churned away for several minutes, even running commands like curl -sL https://raw.githubusercontent.com/huggingface/transformers.js/main/src/models.js | sed -n '1,200p' to inspect the source code of the underlying Transformers.js library.
After four prompts total (shown here) it built something which worked!
To try it out you’ll need your own local copy of the Llama 3.2 ONNX model. You can get that (a ~1.2GB download) like so:
git lfs install
git clone https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct-q4f16
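
For context, the unmodified demo fetches this model from the Hugging Face Hub via Transformers.js. Here is a minimal sketch of that loading step, assuming the Transformers.js v3 AutoTokenizer/AutoModelForCausalLM API (the example's actual worker code may differ in detail):

// Sketch only: loads the quantized Llama 3.2 ONNX model on WebGPU.
// Run inside an ES module or async function.
import { AutoTokenizer, AutoModelForCausalLM } from "@huggingface/transformers";

const model_id = "onnx-community/Llama-3.2-1B-Instruct-q4f16";
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: "q4f16",   // 4-bit weights with fp16 activations, matching the repo above
  device: "webgpu", // requires a WebGPU-capable browser
});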

Then visit my llama-3.2-webgpu page in Chrome or Firefox Nightly (since WebGPU is required), click "Browse folder", select the folder you just cloned, agree to the "Upload" confirmation (confusing wording, since nothing is actually uploaded from your browser; the model files are opened locally on your machine) and click "Load local model".
Here’s an animated demo.
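
The "Browse folder" button can be built with standard web APIs; here is a minimal sketch (the loadLocalModel helper is hypothetical, named only for illustration, and the branch's actual wiring differs). A directory-type file input hands the app File objects for everything in the chosen folder, which are then read locally rather than fetched:

// Sketch: let the user pick the cloned model folder. Call from a click handler.
const input = document.createElement("input");
input.type = "file";
input.webkitdirectory = true; // select a whole folder instead of a single file

input.addEventListener("change", () => {
  // Map each file's path inside the folder (top-level folder name stripped)
  // to its File object.
  const files = new Map(
    [...input.files].map((f) => [
      f.webkitRelativePath.split("/").slice(1).join("/"),
      f,
    ])
  );
  // Nothing is uploaded: the browser reads these File objects from disk.
  loadLocalModel(files); // hypothetical: feeds Transformers.js from the map
});

input.click();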

I pushed a branch with those changes here. The next step would be to modify this to support other models in addition to the Llama 3.2 demo, but I’m pleased to have got to this proof of concept with so little work beyond throwing some prompts at Codex to see if it could figure it out.
According to the Codex /status command this used 169,818 input tokens, 17,112 output tokens and 1,176,320 cached input tokens. At current GPT-5 token pricing ($1.25/million input, $0.125/million cached input, $10/million output) that would cost 53.942 cents, but Codex CLI hooks into my existing $20/month ChatGPT Plus plan so this was bundled into that.
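
The cost arithmetic is just a weighted sum of the three token counts at their per-million prices; with the figures above it comes to a little over half a dollar:

// Token counts from Codex /status, prices in dollars per million tokens.
const costUSD =
  (169_818 * 1.25 +    // input tokens
   1_176_320 * 0.125 + // cached input tokens
   17_112 * 10) /      // output tokens
  1_000_000;
console.log(`$${costUSD.toFixed(3)}`); // prints the total in dollars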
Via My Hacker News comment
Tags: javascript, ai, generative-ai, llama, local-llms, llms, ai-assisted-programming, transformers-js, webgpu, llm-pricing, vibe-coding, gpt-5, codex-cli

AI Summary and Description: Yes

Summary: The text details a technical exploration of modifying a browser-based, WebGPU-powered chat demo for the Llama 3.2 model so that it can load model files from a local folder, with OpenAI’s GPT-5-powered Codex CLI generating the code changes. This is relevant because running generative AI models locally in the browser has implications for data privacy and model management.

Detailed Description: The content walks through a practical modification to a Llama 3.2 browser demo, focused on improving how users access machine learning models. Here are the major points:

– **Modification of AI Model Demo**: The author enhances an existing chat demo for the Llama 3.2 model so that it can load the model directly from local files, giving users an alternative to downloading it over the network.

– **Use of Codex**: OpenAI’s GPT-5-powered Codex CLI was utilized to generate code modifications through a series of prompts, showcasing how AI can assist programmers in rapid development and iterative enhancements.

– **Installation and Setup**:
  – Users must install Git LFS and clone a repository containing the Llama 3.2 ONNX model.
  – The user interface runs in web browsers that support WebGPU, specifically Chrome or Firefox Nightly.

– **Cost Efficiency**: The author discusses cost implications based on token usage with Codex, indicating financial efficiency linked to using existing subscription plans for AI services.

– **Proof of Concept**: The work is presented as a successful proof of concept, highlighting the ease of integrating local model capabilities, paving the way for further enhancements that might include support for additional models.

– **Community Engagement**: The text connects to a wider community of developers via a Hacker News comment, reflecting on collaborative exploration and knowledge sharing in coding practices.

**Implications for Security and Compliance Professionals**:
– **Data Privacy**: Allowing local model loading means sensitive data may remain on local machines rather than being transmitted over networks, helping preserve user privacy.
– **Access Control**: Local hosting solutions may necessitate robust access control mechanisms to ensure that unauthorized users cannot access locally stored models and data.
– **Compliance**: Use of models might invoke considerations under compliance frameworks regarding data handling, especially if sensitive information is involved in model training or inference.
– **Security Measures**: The shift to a local model architecture demands attention to securing environments where models are hosted, including considerations for device security and potential vulnerabilities when using browser-based interfaces.