Simon Willison’s Weblog: I can now run a GPT-4 class model on my laptop

Source URL: https://simonwillison.net/2024/Dec/9/llama-33-70b/
Source: Simon Willison’s Weblog
Title: I can now run a GPT-4 class model on my laptop

Feedly Summary: Meta’s new Llama 3.3 70B is a genuinely GPT-4 class Large Language Model that runs on my laptop.
Just 20 months ago I was amazed to see something that felt GPT-3 class run on that same machine. The quality of models that are accessible on consumer hardware has improved dramatically in the past two years.
My laptop is a 64GB MacBook Pro M2, which I got in January 2023 – two months after the initial release of ChatGPT. All of my experiments running LLMs on a laptop have used this same machine.
In March 2023 I wrote that "Large language models are having their Stable Diffusion moment" after running Meta’s initial LLaMA release (think of that as Llama 1.0) via the then-brand-new llama.cpp. I said:

As my laptop started to spit out text at me I genuinely had a feeling that the world was about to change

I had a moment of déjà vu the day before yesterday, when I ran Llama 3.3 70B on the same laptop for the first time.
Meta claim that:

This model delivers similar performance to Llama 3.1 405B with cost effective inference that’s feasible to run locally on common developer workstations.

Llama 3.1 405B is their much larger best-in-class model, which is very much in the same weight class as GPT-4 and friends.
Everything I’ve seen so far from Llama 3.3 70B suggests that it holds up to that standard. I honestly didn’t think this was possible – I assumed that anything as useful as GPT-4 would require many times more resources than are available to me on my consumer-grade laptop.
I’m so excited by the continual efficiency improvements we’re seeing in running these impressively capable models. In the proprietary hosted world it’s giving us incredibly cheap and fast models like Gemini 1.5 Flash, GPT-4o mini and Amazon Nova. In the openly licensed world it’s giving us increasingly powerful models we can run directly on our own devices.

How I ran Llama 3.3 70B on my machine using Ollama
Putting the model through its paces
How does it score?
Honorable mentions
Is performance about to plateau?
Bonus: running Llama 3.3 70B with MLX

How I ran Llama 3.3 70B on my machine using Ollama
I don’t expect that this model would work well with much less than my 64GB of RAM. The first time I tried, it consumed every remaining bit of available memory and hard-crashed my Mac! For my second attempt I made sure not to have Firefox and VS Code running at the same time and it worked just fine.
I used Ollama. Install that, then run this command to fetch the model:
ollama pull llama3.3
This fetched 42GB of data into my ~/.ollama/models/blobs directory.
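As a rough sanity check on those numbers, here is a back-of-the-envelope sketch of how weight size scales with parameter count and bits per weight. The parameter count (about 70.6 billion) and the average of about 4.8 bits per weight are assumptions chosen for illustration, not figures from this post, and the estimate ignores the extra memory needed for context and runtime overhead.

```python
# Back-of-the-envelope estimate of model weight size.
# ASSUMPTIONS (not from the article): ~70.6 billion parameters, and an
# average of ~4.8 bits per weight for a mixed 4-bit quantization.

def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70.6e9       # assumed parameter count for a "70B" model
BITS_PER_WEIGHT = 4.8   # assumed average bits per weight after quantization

quantized = weight_size_gb(N_PARAMS, BITS_PER_WEIGHT)
full_fp16 = weight_size_gb(N_PARAMS, 16)

print(f"4-bit-ish weights: ~{quantized:.0f} GB")
print(f"16-bit weights:    ~{full_fp16:.0f} GB")
```

Under those assumptions the quantized weights land in the low 40s of gigabytes, which is consistent with the download size above and with 64GB of RAM being a tight fit once other applications are open.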
I used my LLM tool to run prompts against the model, since that way both the prompt and the response would be automatically logged to SQLite. First I installed the llm-ollama plugin:
llm install llm-ollama
Then with the Ollama local server running I could run this:
llm models
And see this in the list:
Ollama: llama3.3:latest (aliases: llama3.3)
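Ollama’s local server also exposes an HTTP API, so the model can be called without the LLM tool. The following is a minimal sketch using only the Python standard library. It assumes the server is listening on its default address (localhost:11434) and that the model has been pulled under the name llama3.3; the helper names `build_request` and `generate` are made up for this example. Prompts sent this way are not logged to SQLite by LLM.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.3") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.3") -> str:
    """Send the prompt and return the generated text (needs the server running)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `print(generate("Say hello in five words"))` should print the model’s reply; expect the first call to be slow while the weights load into memory.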

Putting the model through its paces
This means I can run prompts like so:
llm -m llama3.3 'Write a letter to the mayor of Half Moon Bay CA advocating for the installation of cosy boxes for Pelicans in the harbor'
And get back a response that started like this:

[Your Name]
[Your Address]
Half Moon Bay, CA [Zip Code]
[Email Address]
[Phone Number]
[Date]
The Honorable Mayor, Deborah Penrose
City of Half Moon Bay
501 Main Street
Half Moon Bay, CA 94019
Dear Mayor Penrose,
I am writing to propose an innovative and unique initiative that would not only enhance the beauty of our picturesque harbor but also provide a safe haven for one of our beloved coastal residents: the California Brown Pelican. I am excited to advocate for the installation of cozy boxes specifically designed for pelicans in the Half Moon Bay harbor. […]

(Deborah Penrose was the mayor of Half Moon Bay for a single year from December 2016 to December 2017 – so a hint of some quite finely grained world knowledge there, even if it’s not relevant for the present day.)
Here’s the full transcript, including the response I got when I ran the follow-up llm -c 'translate that to spanish'.
Next I tried a simple coding challenge. I love using Claude Artifacts to build simple single page applications. Can Llama 3.3 on my laptop do that too?
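Because LLM logs prompts and responses to SQLite, those logs can be read back with a few lines of Python. This is a sketch only: it assumes the log database has a `responses` table with `datetime_utc`, `model`, `prompt` and `response` columns, which should be checked against the actual schema before relying on it. The function name `latest_responses` is invented for this example. The database location can be found by running `llm logs path`.

```python
import sqlite3


def latest_responses(conn: sqlite3.Connection, limit: int = 3) -> list:
    """Return the most recent logged (datetime, model, prompt, response) rows.

    ASSUMPTION: a `responses` table with these four columns exists.
    """
    query = (
        "SELECT datetime_utc, model, prompt, response "
        "FROM responses ORDER BY datetime_utc DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


# Usage (the path is a placeholder; use the one printed by `llm logs path`):
# conn = sqlite3.connect("/path/to/logs.db")
# for when, model, prompt, response in latest_responses(conn):
#     print(when, model, prompt[:60], "->", response[:60])
```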
llm -m llama3.3 'produce HTML with embedded CSS and JavaScript - no react, no dependencies - for an app that takes a URL to a YouTube video and outputs images along with visible URLs for all known thumbnails for that video'
Here’s the code it produced (the full transcript is here):

<html lang="en">
<head>
<meta charset="UTF-8">
<title>YouTube Thumbnail Extractor</title>
<style>
body {
font-family: Arial, sans-serif;
margin: 20px;
}
#thumbnails {
display: flex;
flex-wrap: wrap;
}
.thumbnail {
margin: 10px;
}
img {
max-width: 150px;
max-height: 150px;
object-fit: cover;
}
</style>
</head>
<body>

<h2>YouTube Thumbnail Extractor</h2>
<input type="text" id="youtubeUrl" placeholder="Enter YouTube video URL">
<button onclick="extractThumbnails()">Extract Thumbnails</button>
<div id="thumbnails"></div>

<script>
function extractThumbnails() {
const youtubeUrl = document.getElementById('youtubeUrl').value;
const videoId = getYoutubeVideoId(youtubeUrl);

if (!videoId) {
alert("Could not parse YouTube video ID from the URL.");
return;
}

const thumbnailSizes = ['default', 'mqdefault', 'hqdefault', 'sddefault', 'maxresdefault'];
const thumbnailsDiv = document.getElementById('thumbnails');
thumbnailsDiv.innerHTML = ''; // Clear previous results

thumbnailSizes.forEach(size => {
const thumbnailUrl = `https://img.youtube.com/vi/${videoId}/${size}.jpg`;
const img = document.createElement('img');
img.src = thumbnailUrl;
img.alt = `${size} thumbnail`;
img.className = 'thumbnail';

const link = document.createElement('a');
link.href = thumbnailUrl;
link.text = size;
link.style.display = 'block';

const div = document.createElement('div');
div.appendChild(img);
div.appendChild(link);
thumbnailsDiv.appendChild(div);
});
}

function getYoutubeVideoId(url) {
const regExp = /^.*((youtu.be\/)|(v\/)|(\/u\/\w\/)|(embed\/)|(watch\?))\??v?=?([^#&?]*).*/;
const match = url.match(regExp);
return (match && match[7].length == 11) ? match[7] : false;
}
</script>

</body>
</html>
Again, we’re illustrating some world knowledge here: it knows that YouTube videos are generally available as default, mqdefault, hqdefault, sddefault and maxresdefault.
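For comparison, the same idea can be sketched in Python with a URL parser in place of the regular expression. This is not the model’s output; it is an illustrative rewrite that assumes the same five thumbnail names and the same `img.youtube.com/vi/<id>/<name>.jpg` URL pattern used in the generated page. The function names and the example video ID `abcdefghijk` are made up.

```python
from typing import List, Optional
from urllib.parse import parse_qs, urlparse

# Thumbnail names taken from the generated JavaScript above.
THUMBNAIL_NAMES = ["default", "mqdefault", "hqdefault", "sddefault", "maxresdefault"]


def video_id(url: str) -> Optional[str]:
    """Extract an 11-character YouTube video ID, or None if none is found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    candidate = ""
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            # Standard links: https://www.youtube.com/watch?v=<id>
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        elif parsed.path.startswith(("/embed/", "/v/")):
            # Embed links: https://www.youtube.com/embed/<id>
            candidate = parsed.path.split("/")[2]
    return candidate if len(candidate) == 11 else None


def thumbnail_urls(vid: str) -> List[str]:
    """Build one URL per known thumbnail name for the given video ID."""
    return [f"https://img.youtube.com/vi/{vid}/{name}.jpg" for name in THUMBNAIL_NAMES]


vid = video_id("https://www.youtube.com/watch?v=abcdefghijk")
if vid:
    for url in thumbnail_urls(vid):
        print(url)
```

Not every video necessarily has every size available, so a real page would want to handle images that fail to load.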
Here’s the hosted page, and this is what it looks like when run against this YouTube URL:

It’s not as good as the version I iterated on with Claude, but this still shows that Llama 3.3 can one-shot a full interactive application while running on my Mac.
How does it score?
It’s always useful to check independent benchmarks for this kind of model.
One of my current favorites for that is LiveBench, which calls itself "a challenging, contamination-free LLM benchmark" and tests a large array of models with a comprehensive set of different tasks.
llama-3.3-70b-instruct-turbo currently sits in position 19 on their table, a place ahead of Claude 3 Opus (my favorite model for several months after its release in March 2024) and just behind April’s GPT-4 Turbo and September’s GPT-4o.

Honorable mentions
Llama 3.3 is currently the model that has impressed me the most that I’ve managed to run on my own hardware, but I’ve had several other positive experiences recently.
Last month I wrote about Qwen2.5-Coder-32B, an Apache 2.0 licensed model from Alibaba’s Qwen research team that also gave me impressive results with code.
A couple of weeks ago I tried another Qwen model, QwQ, which implements a similar chain-of-thought pattern to OpenAI’s o1 series but again runs comfortably on my own device.
Meta’s Llama 3.2 models are interesting as well: tiny 1B and 3B models (those should run even on a Raspberry Pi) that are way more capable than I would have expected – plus Meta’s first multi-modal vision models at 11B and 90B sizes. I wrote about those in September.
Is performance about to plateau?
I’ve been mostly unconvinced by the ongoing discourse around LLMs hitting a plateau. The areas I’m personally most excited about are multi-modality (images, audio and video as input) and model efficiency. Both of those have had enormous leaps forward in the past year.
I don’t particularly care about "AGI". I want models that can do useful things that I tell them to, quickly and inexpensively – and that’s exactly what I’ve been getting more of over the past twelve months.
Even if progress on these tools entirely stopped right now, the amount I could get done with just the models I’ve downloaded and stashed on a USB drive would keep me busy and productive for years.
Bonus: running Llama 3.3 70B with MLX
I focused on Ollama in this article because it’s the easiest option, but I also managed to run a version of Llama 3.3 using Apple’s excellent MLX library, which just celebrated its first birthday.
Here’s how I ran the model with MLX, using uv to fire up a temporary virtual environment:
uv run --with mlx-lm --python 3.12 python
This gave me a Python interpreter with mlx-lm available. Then I ran this:
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")
This downloaded 37G from mlx-community/Llama-3.3-70B-Instruct-4bit to ~/.cache/huggingface/hub/models--mlx-community--Llama-3.3-70B-Instruct-4bit.
Then:
prompt = "Generate an SVG of a pelican riding a bicycle, start with <svg>"
messages = [{"role": "user", "content": prompt}]
prompt2 = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
response = generate(
model, tokenizer, prompt=prompt2, verbose=True
)
Which gave me:
<svg xmlns="http://www.w3.org/2000/svg" width="400" height="300" viewBox="0 0 400 300">
<!-- Sky -->
<rect x="0" y="0" width="400" height="300" fill="#87CEEB" rx="0" />

<!-- Ground -->
<rect x="0" y="250" width="400" height="50" fill="#964B00" rx="0" />

<!-- Pelican -->
<g transform="translate(150, 150)">
<!-- Body -->
<ellipse cx="0" cy="0" rx="30" ry="40" fill="#FFFFFF" stroke="#000000" stroke-width="2" />

<!-- Beak -->
<path d="M 0 -40 L 20 -60 L 40 -40 Z" fill="#FFFF00" stroke="#000000" stroke-width="2" />

<!-- Legs -->
<line x1="0" y1="40" x2="-20" y2="80" stroke="#000000" stroke-width="5" />
<line x1="0" y1="40" x2="20" y2="80" stroke="#000000" stroke-width="5" />

<!-- Eyes -->
<circle cx="-15" cy="-20" r="5" fill="#000000" />
<circle cx="15" cy="-20" r="5" fill="#000000" />
</g>

<!-- Bicycle -->
<g transform="translate(200, 220) scale(0.7, 0.7)">
<!-- Frame -->
<rect x="-50" y="-20" width="100" height="40" fill="#808080" rx="10" />

<!-- Wheels -->
<circle cx="-35" cy="20" r="30" fill="#808080" />
<circle cx="35" cy="20" r="30" fill="#808080" />

<!-- Pedals -->
<rect x="-45" y="40" width="10" height="20" fill="#808080" rx="5" />
<rect x="35" y="40" width="10" height="20" fill="#808080" rx="5" />

<!-- Seat -->
<rect x="-20" y="-40" width="40" height="20" fill="#808080" rx="5" />
</g>

<!-- Pelican on Bicycle -->
<g transform="translate(200, 150)">
<!-- Pelican's hands on handlebars -->
<line x1="-20" y1="0" x2="-40" y2="-20" stroke="#000000" stroke-width="5" />
<line x1="20" y1="0" x2="40" y2="-20" stroke="#000000" stroke-width="5" />

<!-- Pelican's feet on pedals -->
<line x1="0" y1="40" x2="-20" y2="60" stroke="#000000" stroke-width="5" />
<line x1="0" y1="40" x2="20" y2="60" stroke="#000000" stroke-width="5" />
</g>
</svg>
Followed by:
Prompt: 52 tokens, 49.196 tokens-per-sec
Generation: 723 tokens, 8.733 tokens-per-sec
Peak memory: 40.042 GB

Here’s what that looks like:

Honestly, I’ve seen worse.
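To put the MLX statistics reported above in wall-clock terms, here is a small arithmetic sketch that converts the token counts and tokens-per-second figures into approximate durations. The numbers are copied from that output; the variable names are just for illustration.

```python
# Figures copied from the MLX output above.
prompt_tokens, prompt_tps = 52, 49.196
gen_tokens, gen_tps = 723, 8.733

# Time is simply tokens divided by tokens-per-second.
prompt_seconds = prompt_tokens / prompt_tps
gen_seconds = gen_tokens / gen_tps

print(f"Prompt processing: ~{prompt_seconds:.1f} s")
print(f"Generation:        ~{gen_seconds:.0f} s")
```

So the SVG took a little under a minute and a half to generate at that rate, with prompt processing contributing roughly a second.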
Tags: python, ai, generative-ai, llama, gpt-4, edge-llms, llms, llm, meta, uv, mlx, ollama, pelican-riding-a-bicycle

AI Summary and Description: Yes

**Summary:**
The text discusses Meta’s Llama 3.3 70B, a new large language model (LLM) that performs at a level comparable to GPT-4 yet can run on consumer-grade hardware. The author shares experiences running the model on a MacBook Pro, highlighting the efficiency improvements and the accessibility of powerful models for individuals and developers. This represents a notable shift in the landscape of AI, particularly in generative AI capabilities, allowing local execution of sophisticated models.

**Detailed Description:**
The text outlines several major points regarding the advancements in LLM technology, specifically Meta’s latest release, Llama 3.3 70B. The implications of these advancements are considerable for professionals in AI, cloud, software, and infrastructure security. Here’s an expanded analysis of the key themes presented:

- **Model Performance and Accessibility:**
  - The author expresses excitement over the capabilities of Llama 3.3, noting its performance is on par with larger models (e.g., Llama 3.1 405B and GPT-4), but with much lower resource requirements.
  - The post illustrates that substantial performance improvements have been made in AI models over a short period, making advanced technology more accessible to individual developers.

- **Personal Experience with Running Models:**
  - The author provides insights into running the Llama 3.3 model on a personal MacBook with 64GB of RAM, detailing the setup process using software like Ollama and MLX.
  - Information on memory management emphasizes practical considerations for running heavy models effectively on local machines.

- **Practical Applications and Use Cases:**
  - Various use cases, such as generating formal letters and coding tasks, demonstrate the capabilities of the Llama 3.3 model.
  - The ability of these models to perform complex tasks efficiently highlights their potential in business and development settings.

- **Benchmarks and Comparisons:**
  - Mentions of independent benchmarks, like LiveBench, help contextualize the performance of Llama 3.3 within a broader landscape of LLMs.
  - The competition among different providers of language models illustrates the rapid innovation in this field, where efficiency and cost-effectiveness are increasingly paramount.

- **Future of LLM Technology:**
  - The text addresses speculations about a potential plateau in model performance but counters with optimism regarding ongoing advancements in multi-modality and efficiency.
  - This foresight is relevant for professionals tracking the evolution of AI capabilities and the growing impact of such models on various industries.

- **Technical Details and Implementation:**
  - The author shares a technical walkthrough for setting up and running Llama 3.3, underscoring the importance of accessibility and user-friendliness in AI development.
  - Such practical insights can aid developers in implementing LLMs securely and efficiently within their own environments.

In summary, Llama 3.3 represents a significant step forward in the realm of generative AI, illustrating not only the capabilities of modern hardware but also the shifting dynamics of AI deployment and usage. For security and compliance professionals, this evolution also raises considerations around data management, privacy, and safe deployment practices as these technologies become increasingly integrated into everyday workflows.