Source URL: https://simonwillison.net/2025/Feb/5/s1-the-6-r1-competitor/
Source: Simon Willison’s Weblog
Title: S1: The $6 R1 Competitor?
Feedly Summary: S1: The $6 R1 Competitor?
Tim Kellogg shares his notes on a new paper, s1: Simple test-time scaling, which describes an inference-scaling model fine-tuned on top of Qwen2.5-32B-Instruct for just $6 – the cost of 26 minutes on 16 NVIDIA H100 GPUs.
Tim highlights the most exciting result:
After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that’s needed to achieve o1-preview performance on a 32B model.
The paper describes a technique called “Budget forcing”:
To enforce a minimum, we suppress the generation of the end-of-thinking token delimiter and optionally append the string “Wait” to the model’s current reasoning trace to encourage the model to reflect on its current generation
That’s the same trick Theia Vogel described a few weeks ago.
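Here’s a minimal sketch of how budget forcing could be implemented with Hugging Face transformers, assuming the model emits an explicit end-of-thinking delimiter. The delimiter string and generation settings below are placeholder assumptions, not the paper’s exact implementation:

# Sketch of budget forcing: stop generation at the end-of-thinking
# delimiter, strip it, append "Wait", and let the model keep reasoning.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "simplescaling/s1-32B"
END_OF_THINKING = "<|im_end|>"  # assumption: substitute the model's real delimiter

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def budget_forced_generate(prompt, min_rounds=2, max_new_tokens=512):
    # Force at least min_rounds of reasoning: each round halts at the
    # end-of-thinking token, which we remove and replace with "Wait"
    eot_id = tokenizer.convert_tokens_to_ids(END_OF_THINKING)
    text = prompt
    for _ in range(min_rounds):
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            eos_token_id=eot_id,  # stop when the model tries to finish thinking
        )
        text = tokenizer.decode(output[0], skip_special_tokens=False)
        # Suppress the delimiter and nudge the model to reflect further
        text = text.replace(END_OF_THINKING, "").rstrip() + "\nWait"
    return text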
Here’s the s1-32B model on Hugging Face. I found a GGUF version of it at brittlewis12/s1-32B-GGUF, which I ran using Ollama like so:
ollama run hf.co/brittlewis12/s1-32B-GGUF:Q4_0
I also found those 1,000 samples on Hugging Face in the simplescaling/s1K data repository there.
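The dataset can also be loaded directly with the Hugging Face datasets library for a quick look; this sketch assumes the default "train" split (the "question" column appears in the query below):

# Load all 1,000 s1K examples straight from the Hub
from datasets import load_dataset

s1k = load_dataset("simplescaling/s1K", split="train")
print(len(s1k))            # expected: 1000
print(s1k[0]["question"])  # peek at the first question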
I used DuckDB to convert the parquet file to CSV (and turn one VARCHAR[] column into JSON):
COPY (
    SELECT
        solution,
        question,
        cot_type,
        source_type,
        metadata,
        cot,
        json_array(thinking_trajectories) as thinking_trajectories,
        attempt
    FROM 's1k-00001.parquet'
) TO 'output.csv' (HEADER, DELIMITER ',');
Then I loaded that CSV into sqlite-utils so I could use the convert command to turn a Python data structure into JSON using json.dumps() and eval():
# Load into SQLite
sqlite-utils insert s1k.db s1k output.csv --csv
# Fix that column
sqlite-utils convert s1k.db s1k metadata 'json.dumps(eval(value))' --import json
# Dump that back out to CSV
sqlite-utils rows s1k.db s1k --csv > s1k.csv
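The convert step works because each metadata cell contains a Python dict literal rather than JSON: eval() parses it and json.dumps() re-serializes it. A toy illustration of that transformation, with a hypothetical cell value:

import json

# Illustration only: the cell value below is hypothetical, not taken from s1K
value = "{'difficulty': 8, 'source': 'olympiads'}"  # Python literal, single quotes
print(json.dumps(eval(value)))
# prints: {"difficulty": 8, "source": "olympiads"}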
Here’s that CSV in a Gist, which means I can load it into Datasette Lite.
It really is a tiny amount of training data. It’s mostly math and science, but there are also 15 cryptic crossword examples.
Tags: duckdb, datasette-lite, inference-scaling, ai, ollama, llms, datasette, generative-ai, qwen
AI Summary and Description: Yes
Summary: The text discusses a new model called “s1: Simple test-time scaling,” which showcases an inference-scaling technique built by fine-tuning an existing generative AI model. The findings highlight significant performance gains from a minimal dataset, making notable contributions to the field of AI, particularly in scaling and efficiency.
Detailed Description:
The text outlines Tim Kellogg’s notes on a technical paper about the s1: Simple test-time scaling model, specifically how it builds on the Qwen2.5-32B-Instruct model for cost-effective inference scaling. Below are the key points of interest:
- **Cost Efficiency**: The model was fine-tuned for just $6, the cost of 26 minutes on 16 NVIDIA H100 GPUs.
- **Data Reduction**: Through rigorous filtering, only 1,000 high-quality examples from an initial dataset of 56,000 were needed for the 32-billion-parameter model to reach o1-preview-level performance. This indicates a potential shift towards reducing the need for large datasets when fine-tuning neural networks.
- **Technique – Budget Forcing**: The paper introduces “Budget forcing,” a method to encourage deeper reasoning in AI model outputs by manipulating token generation. This involves suppressing the end-of-thinking token and appending prompts to facilitate reflective generation.
- **Model Accessibility**: The s1-32B model can be accessed via Hugging Face, which indicates the importance of community and collaborative resources in the AI domain.
- **Data Handling Tools**: The author utilizes DuckDB and SQLite to manage and transform data formats (from parquet to CSV and JSON), illustrating the practical applications of data management tools in AI.
- **Training Data Composition**: The reduced dataset predominantly includes math and science problems, alongside 15 cryptic crossword examples, showing a tailored focus for model training.
Implications for security and compliance professionals:
- **Efficiency in AI Deployment**: Understanding the implications of reduced training data may assist professionals in managing AI-related resources more effectively and responsibly, which could influence data governance strategies.
- **Data Handling and Management**: The methods applied in handling data (DuckDB, SQLite) highlight the importance of robust data management practices, which are critical for any compliance and security frameworks concerning data privacy and integrity in AI applications.
- **Scalability Innovations**: Lastly, the advancements in inference scaling may prompt further investigation into AI deployment strategies, including cost-effectiveness and performance metrics critical for compliance and security assessments in cloud computing environments.
Overall, the findings underscore a significant advancement in generative AI, which can lead to broader implications for cloud-based infrastructures and their security protocols.