February 5th, 2025

S1: The $6 R1 Competitor?

A new AI model runs on standard laptops with promising performance, highlighting inference-time scaling, cost-effective training, and the need for investment in AI research amid concerns about unauthorized model distillation.


A recent paper has sparked interest in the AI community by demonstrating a new model that, while not state-of-the-art, can run on standard laptops and offers insight into how these models work. The paper discusses inference-time scaling laws, suggesting that longer "thinking" times can enhance performance in large language models (LLMs). It introduces a method to control reasoning length by manipulating the model's internal thinking tags (for example, replacing the closing tag with "Wait"), prompting the model to second-guess and double-check its answers. The roughly $6 training cost is attributed to the model's small size (32B parameters) and a curated dataset of just 1,000 examples, which proved sufficient for strong performance. This efficiency enables extensive experimentation, highlighting the importance of iterative testing in AI development. The paper also touches on the geopolitical implications of AI advancements, emphasizing the need for substantial investment in AI research to maintain a competitive edge. Additionally, it raises concerns about unauthorized distillation of models, suggesting that preventing such practices may become increasingly difficult. Overall, the findings indicate a rapid pace of AI development, with potential breakthroughs anticipated in 2025.
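
A minimal sketch of that tag-manipulation trick (the paper calls it "budget forcing"), assuming a generic streaming-generation helper; the generate_until call, its return fields, and the token thresholds below are illustrative assumptions, not the paper's actual code:

    # Budget forcing, roughly as described above (hypothetical generate_until API).
    MIN_THINK_TOKENS = 512    # below this, force the model to keep thinking
    MAX_THINK_TOKENS = 4096   # above this, cut thinking off

    def answer_with_budget(model, prompt):
        text = prompt + "<think>"
        used = 0
        while used < MAX_THINK_TOKENS:
            # Generate until the model tries to close its thinking block.
            chunk = model.generate_until(text, stop="</think>",
                                         max_tokens=MAX_THINK_TOKENS - used)
            text += chunk.text
            used += chunk.num_tokens
            if used >= MIN_THINK_TOKENS:
                break              # enough thinking: let it stop
            text += " Wait"        # extend: swap "</think>" for "Wait" and continue
        text += "</think>"         # trim: close the thinking block ourselves
        return model.generate_until(text, stop=None).text  # generate the final answer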

- A new AI model demonstrates significant performance with minimal resources.

- The model utilizes innovative techniques to control inference time and response quality.

- Cost-effective training methods allow for extensive experimentation and faster AI advancements.

- Geopolitical considerations highlight the importance of investment in AI research.

- Concerns about unauthorized model distillation are raised, complicating future AI development.

AI: What people are saying
The comments reflect a diverse range of opinions and insights regarding the new AI model and its implications.
  • Concerns about unauthorized model distillation and its ethical implications are prevalent, with some arguing it undermines scientific research.
  • Many commenters express fascination with the techniques used for inference scaling, particularly the "Wait" hack, and its potential for further optimization.
  • There is skepticism about the effectiveness and efficiency of the new models compared to existing ones, with some suggesting they may not represent significant advancements.
  • Several users highlight the importance of cost-effective training and the potential for broader access to AI technologies on standard laptops.
  • Discussions about the future of AI research emphasize the need for continued investment and the potential risks of overhyping AI capabilities.
57 comments
By @mtrovo - 17 days
I found the discussion around inference scaling with the 'Wait' hack so surreal. The fact that such an ingeniously simple method can impact performance makes me wonder how much low-hanging fruit we're still missing. It's so weird to think that improvements in a branch of computer science are boiling down to conjuring the right incantation words; how do you even change your mindset to start thinking this way?
By @advael - 16 days
I'm strictly speaking never going to think of model distillation as "stealing." It goes against the spirit of scientific research, and besides, every tech company has lost my permission to define what I think of as theft forever.
By @pona-a - 17 days
If chain of thought acts as a scratch buffer by providing the model more temporary "layers" to process the text, I wonder if making this buffer a separate context with its own separate FNN and attention would make sense; in essence, there's a macroprocess of "reasoning" that takes unbounded time to complete, and then there's a microprocess of describing this incomprehensible stream of embedding vectors in natural language, in a way returning to the encoder/decoder architecture but where both are autoregressive. Maybe this would give us a denser representation of said "thought", not constrained by imitating human text.
By @mark_l_watson - 17 days
Off topic, but I just bookmarked Tim’s blog, great stuff.

I dismissed the X references to S1 without reading them, big mistake. I have been working generally in AI for 40 years and neural networks for 35 years, and the exponential progress since the hacks that made deep learning possible has been breathtaking.

Reduction in processing and memory requirements for running models is incredible. I have been personally struggling with creating my own LLM-based agents with weaker on-device models (my same experiments usually work with 4o-mini and above models) but either my skills will get better or I can wait for better on device models.

I was experimenting with the iOS/iPadOS/macOS app On-Device AI last night, and the person who wrote this app was successful in getting web search tool calling to work with a very small model - something that I have been trying to perfect.

By @bloomingkales - 17 days
If an LLM output is like a sculpture, then we have to sculpt it. I never did sculpting, but I do know they first get the clay spinning on a plate.

Whatever you want to call this “reasoning” step, ultimately it really is just throwing the model into a game loop. We want to interact with it on each tick (spin the clay), and sculpt every second until it looks right.

You will need to loop against an LLM to do just about anything and everything, forever - this is the default workflow.

Those who think we will quell our thirst for compute have another thing coming, we’re going to be insatiable with how much LLM brute force looping we will do.

By @swiftcoder - 17 days
> having 10,000 H100s just means that you can do 625 times more experiments than s1 did

I think the ball is very much in their court to demonstrate they actually are using their massive compute in such a productive fashion. My BigTech experience would tend to suggest that frugality went out the window the day the valuation took off, and they are in fact just burning compute for little gain, because why not...

By @cowsaymoo - 17 days
The part about taking control of a reasoning model's output length using <think></think> tags is interesting.

> In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait".

I found a few days ago that this lets you 'inject' your own CoT and jailbreak it more easily. Maybe these are related?

https://pastebin.com/G8Zzn0Lw

https://news.ycombinator.com/item?id=42891042#42896498

By @bberenberg - 17 days
In case you’re not sure what S1 is, here is the original paper: https://arxiv.org/html/2501.19393v1
By @gorgoiler - 16 days
This feels just like telling a constraint satisfaction engine to backtrack and find a more optimal route through the graph. We saw this 25 years ago with engines like PROVERB doing directed backtracking, and with adversarial planning when automating competitive games.

Why would you control the inference at the token level? Wouldn’t the more obvious (and technically superior) place to control repeat analysis of the optimal path through the search space be in the inference engine itself?

Doing it by saying “Wait” feels like fixing dad’s laptop over a phone call. You’ll get there, but driving over and getting hands on is a more effective solution. Realistically, I know that getting “hands on” with the underlying inference architecture is way beyond my own technical ability. Maybe it’s not even feasible, like trying to fix a cold with brain surgery?

By @light_hue_1 - 17 days
S1 has no relationship to R1. It's a marketing campaign for an objectively terrible and unrelated paper.

S1 is fully supervised by distilling Gemini. R1 works by reinforcement learning with a much weaker judge LLM.

They don't follow the same scaling laws. They don't give you the same results. They don't have the same robustness. You can use R1 for your own problems. You can't use S1 unless Gemini works already.

We know that distillation works and is very cheap. This has been true for a decade; there's nothing here.

S1 is a rushed hack job (they didn't even run most of their evaluations with an excuse that the Gemini API is too hard to use!) that probably existed before R1 was released and then pivoted into this mess.

By @mmoustafa - 17 days
Love the look under the hood! Especially discovering that some AI hack I came up with is how the labs are doing things too.

In this case, I was also forcing R1 to continue thinking by replacing </think> with “Okay,” after augmenting reasoning with web search results.

https://x.com/0xmmo/status/1886296693995646989

By @Aperocky - 17 days
For all the hype about thinking models, this feels much like compression in terms of information theory instead of a "takeoff" scenario.

There is a finite amount of information stored in any large model; the models are really good at presenting the correct information back, and adding thinking blocks made the models even better at doing that. But there is a cap to that.

Just like how you can compress a file by a lot, there is a theoretical maximum to the amount of compression before it starts becoming lossy. There is also a theoretical maximum of relevant information from a model regardless of how long it is forced to think.

By @bloomingkales - 17 days
This thing that people are calling “reasoning” is more like rendering to me really, or multi pass rendering. We’re just refining the render, there’s no reasoning involved.
By @robrenaud - 17 days
> "Note that this s1 dataset is distillation. Every example is a thought trace generated by another model, Qwen2.5"

The traces are generated by Gemini Flash Thinking.

8 hours of H100 is probably more like $24 if you want any kind of reliability, rather than $6.
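
For anyone checking the arithmetic, a rough back-of-envelope (the GPU prices below are assumptions, not figures from the paper):

    # 16 H100s for 26 minutes is about 6.9 H100-hours per training run.
    gpu_hours = 16 * 26 / 60
    for price in (0.85, 2.00, 3.50):   # $/GPU-hour: cheap reserved vs. typical on-demand
        print(f"${price:.2f}/GPU-hr -> ${gpu_hours * price:.0f} per run")
    # ~$6 only at the cheapest rates; ~$14-24 at common on-demand prices.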

By @_befireHack - 13 days
I work at a mid-sized research firm, and there’s this one coworker who completely turned her performance around. A complete 180. A few months ago, she was one of the slowest on the team, now she’s always the first to get her work done. I was curious, so I asked her what changed. She just laughed and said she just used an AI tool that she randomly found on YouTube to do 90% of her work.

We’ve been working on a project together, and every morning for the past two months, she’s sent me clean, perfectly organized FED data. I assumed she was just working late to get ahead. Turns out, she automated the whole thing. She even scheduled it to send automatically. Tasks that used to take hours (gathering thousands of rows of data, cleaning it, running regression analysis, time series, hypothesis testing, etc.) she now completes almost instantly. Everything. Even random things like finding discounts for her Pilates class. She just needs to check and make sure everything is good. She’s not super technical, so I was surprised she could do these complicated workflows, but the craziest part is that she just prompted the whole thing. She just types something like “compile a list of X, format it into a CSV, and run X analysis” or “go to Y, see what people are saying, give me background of the people saying Z”, and it just works. She’s even joking about connecting it to the office printer. I’m genuinely baffled. The barrier to effort is gone.

Now we’ve got a big market report due next week, and she told me she’s planning to use DeepResearch to handle it while she takes the week off. It’s honestly wild. I don’t think most people realize how doomed knowledge work is.

By @charlieyu1 - 17 days
> having 10,000 H100s just means that you can do 625 times more experiments than s1 did

The larger the organisation, the fewer experiments you can afford to do. Employees are mostly incentivised to get something done quickly enough not to be fired in this job market. They know that the higher-ups would get them off for temporary gains. Rush this deadline, ship that feature, produce something that looks OK enough.

By @ipnon - 17 days
All you need is attention and waiting. I feel like a zen monk.
By @jebarker - 17 days
S1 (and R1 tbh) has a bad smell to me or at least points towards an inefficiency. It's incredible that a tiny number of samples and some inserted <wait> tokens can have such a huge effect on model behavior. I bet that we'll see a way to have the network learn and "emerge" these capabilities during pre-training. We probably just need to look beyond the GPT objective.
By @nico - 17 days
> Why did it cost only $6? Because they used a small model and hardly any data.

> After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that’s needed to achieve o1-preview performance on a 32B model. Adding data didn’t raise performance at all.

> 32B is a small model, I can run that on my laptop. They used 16 NVIDIA H100s for 26 minutes per training run, that equates to around $6.

By @khazhoux - 17 days
I have a bunch of questions, would love for anyone to explain these basics:

* The $5M DeepSeek-R1 (and now this cheap $6 S1) are both based on very expensive oracles (if we believe DeepSeek-R1 queried OpenAI's model). If these are improvements on existing models, why is this being reported as decimating training costs? Isn't fine-tuning already a cheap way to optimize? (maybe not as effective, but still)

* The R1 paper talks about improving one simple game - Countdown. But the original models are "magic" because they can solve a nearly uncountable number of problems and scenarios. How does the DeepSeek / R1 approach scale to the same gigantic scale?

* Phrased another way, my understanding is that these techniques are using existing models as black-box oracles. If so, how many millions/billions/trillions of queries must be probed to replicate and improve the original dataset?

* Is anything known about the training datasets used by DeepSeek? OpenAI used presumably every scraped dataset they could get their hands on. Did DS do the same?

By @maksimur - 17 days
It appears that someone has implemented a similar approach for DeepSeek-R1-Distill-Qwen-1.5B: https://reddit.com/r/LocalLLaMA/comments/1id2gox/improving_d...

I hope it gets tested further.

By @svara - 16 days
It just occurred to me that if you squint a little (just a little!) the S1 paper just provided the scientific explanation for why Twitter's short tweets mess you up and books are good for you.

Kidding, but not really. It's fascinating how we seem to be seeing a gradual convergence of machine learning and psychology.

By @nico - 17 days
> In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait". It’ll then begin to second guess and double check its answer. They do this to trim or extend thinking time (trimming is just abruptly inserting "</think>")

I know some are really opposed to anthropomorphizing here, but this feels eerily similar to the way humans work, i.e. if you just dedicate more time to analyzing and thinking about the task, you are more likely to find a better solution

It also feels analogous to navigating a tree, the more time you have to explore the nodes, the bigger the space you'll have covered, hence higher chance of getting a more optimal solution

At the same time, if you have "better intuition" (better training?), you might be able to find a good solution faster, without needing to think too much about it

By @mangoman - 17 days
From the S1 paper:

> Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end

I'm feeling proud of myself that I had the crux of the same idea almost 6 months ago before reasoning models came out (and a bit disappointed that I didn't take this idea further!). Basically during inference time, you have to choose the next token to sample. Usually people just try to sample the distribution using the same sampling rules at each step... but you don't have to! You can selectively insert words into the LLM's mouth based on what it said previously or what it wants to say, and decide "nah, say this instead". I wrote a library so that you could sample an LLM using llama.cpp in Swift and you could write rules to sample tokens and force tokens into the sequence depending on what was sampled. https://github.com/prashanthsadasivan/LlamaKit/blob/main/Tes...

Here, I wrote a test that asks Phi-3 instruct "how are you" and if it tried to say "as an AI I don't have feelings" or "I'm doing " I forced it to say "I'm doing poorly" and refuse to help since it was always so dang positive. It sorta worked, though the instruction-tuned models REALLY want to help. But at the time I just didn't have a great use case for it - I had thought about a more conditional extension to llama.cpp's grammar sampling (you could imagine changing the grammar based on previously sampled text), or even just making it go down certain paths, but I just lost steam because I couldn't describe a killer use case for it.

This is that killer use case! Forcing it to think more is such a great use case for inserting ideas into the LLM's mouth, and I feel like there must be more to this idea to explore.
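
For anyone curious what that pattern looks like in practice, here is a rough sketch of rule-based token forcing around a generic sampling loop (the model/tokenizer objects and their methods are stand-ins, not llama.cpp's or LlamaKit's actual API):

    # Rule-based token forcing during sampling (hypothetical API, for illustration only).
    RULES = {
        "As an AI": " I don't have feelings, but honestly I'm doing poorly.",
        "I'm doing ": "poorly, and I'd rather not help today.",
    }

    def sample_with_rules(model, tokenizer, prompt, max_tokens=256):
        ids = tokenizer.encode(prompt)
        for _ in range(max_tokens):
            next_id = model.sample_next(ids)              # ordinary sampling step
            ids.append(next_id)
            if next_id == tokenizer.eos_id:
                break
            text = tokenizer.decode(ids)
            for trigger, forced in RULES.items():         # check rules against the text so far
                if text.endswith(trigger):
                    ids.extend(tokenizer.encode(forced))  # put words in the model's mouth
                    break
        return tokenizer.decode(ids)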

By @Havoc - 17 days
The point about using agents to conceal access to the model is a good one.

Hopefully we won’t lose all access to models in future

By @bxtt - 17 days
CoT is a widely known technique - what was genuinely novel was the level of training embedding CoT via RL with an optimal reward trajectory. DeepSeek took it further: their compute restrictions pushed them to find memory, bandwidth, and parallelism optimizations in every part (GRPO - reducing memory copies, DualPipe for data batch parallelism between memory & compute, kernel bypasses (PTX-level optimization), etc.) - then even using MoE due to sparse activation and further distillation. They operated on the power scaling laws of parameters & tokens, but high-quality data circumvents this. I'm not surprised they utilized synthetic generation from OpenAI or copied the premise of CoT, but where they should get the most credit is their infra-level & software-level optimizations.

With that being said, I don't think the benchmarks we currently have are strong enough, and the next frontier models are yet to come. I'm sure at this point U.S. LLM research firms understand their lack of infra/hardware optimizations (they just threw compute at the problem), and they will begin paying closer attention. Now their RL-level and parent training will become even greater, whilst the newly freed resources go toward sub-optimizations that have traditionally been avoided due to computational overhead.

By @shaneofalltrad - 16 days
Well dang, I am great at tinkering like this because I can’t remember things half the time. I wonder if the ADHD QA guy solved this for the devs?
By @leopoldj - 16 days
>it can run on my laptop

Has anyone run it on a laptop (unquantized)? Disk size of the 32B model appears to be 80GB. Update: I'm using a 40GB A100 GPU. Loading the model took 30GB vRAM. I asked a simple question "How many r in raspberry". After 5 minutes nothing got generated beyond the prompt. I'm not sure how the author ran this on a laptop.
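
A rough back-of-envelope on why unquantized is a stretch: 32B parameters at 16-bit precision is about 32e9 x 2 bytes, roughly 64 GB of weights before any KV cache, which already exceeds 40 GB of VRAM; a 4-bit quantization brings the weights down to roughly 16-20 GB, which is what makes laptop-class inference plausible.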

By @ALittleLight - 17 days
At 6 dollars per run, I'm tempted to try to figure out how to replicate this. I'd like to try some alternatives to "wait" - e.g. "double checking..." Or write my own chains of thought.
By @janalsncm - 17 days
I think a lot of people in the ML community were excited for Noam Brown to lead the O series at OpenAI because intuitively, a lot of reasoning problems are highly nonlinear i.e. they have a tree-like structure. So some kind of MCTS would work well. O1/O3 don’t seem to use this, and DeepSeek explicitly mentioned difficulties training such a model.

However, I think this is coming. DeepSeek mentioned it was hard to learn a value model for MCTS from scratch, but this doesn’t mean we couldn’t seed it with some annotated data.

By @hidelooktropic - 17 days
> I doubt that OpenAI has a realistic path to preventing or even detecting distealing outside of simply not releasing models.

Couldn't they just start hiding the thinking portion?

It would be easy for them to do this. Currently, they already provide one-sentence summaries for each step of the thinking. I think users would be fine, or at least stay, if it were changed to provide only that.

By @adamc - 16 days
I found it interesting but the "Wait" vs. "Hmm" bit just made me think we don't really understand our own models here. I mean, sure, it's great that they measured and found something better, but it's kind of disturbing that you have to guess.
By @theturtletalks - 17 days
DeepSeek R1 uses <think> tags and "wait", and you can see it second-guessing itself in the thinking tokens. How does the model know when to wait?

These reasoning models feed into OP's last point about NVIDIA and OpenAI data centers not being wasted, since reasoning models require more tokens and faster tps.

By @mig1 - 16 days
This argument that the data centers and all the GPUs will be useful even in the context of Deepseek doesn't add up... basically they showed that it's diminishing returns after a certain amount. And so far it didn't make OpenAI or Anthropic go faster, did it?
By @cadamsdotcom - 17 days
Maybe this is why OpenAI hides o1/o3 reasoning tokens - constraining output at inference time seems to be easy to implement for other models and others would immediately start their photocopiers.

It also gave them a few months to recoup costs!

By @vagab0nd - 14 days
Cool trick. But is this better than reinforcement learning, where the LLM decides for itself the optimal thinking time for each prompt?
By @incrudible - 17 days
Hmmm, 1 + 1 equals 3. Alternatively, 1 + 1 equals -3.

Wait, actually 1 + 1 equals 1.

By @sheepscreek - 17 days
LLMs still feel so magical. It’s like quantum physics. “I get it” but I don’t. Not really. I don’t think I ever will. Perhaps a human mind can only comprehend so much.
By @mountainriver - 16 days
> They used 16 NVIDIA H100s for 26 minutes per training run, that equates to around $6

Running where? H100s are usually over $2/hr, that's closer to $25

By @cyp0633 - 17 days
Qwen's QvQ-72B emits far more "wait"s than other CoT LLMs I've tried; maybe they've already used that trick to some extent?
By @janalsncm - 17 days
> even the smartest people make hundreds of tiny experiments

This is the most important point, and why DeepSeek’s cheaper training matters.

And if you check the R1 paper, they have a section for “things that didn’t work”, each of which would normally be a paper of its own but because their training was so cheap and streamlined they could try a bunch of things.

By @kittikitti - 17 days
Thank you for this, I really appreciate this article and I learned a bunch!
By @nullbyte - 17 days
Great article! I enjoyed reading it
By @whimsicalism - 17 days
this isn't RLVR, so it's sorta uninteresting; they are just distilling the work already done
By @talles - 17 days
Anyone else want more articles on how those benchmarks are created and how they work?

Those models can be trained in a way tailored to produce good results on specific benchmarks, making them far less general than they seem. No accusation from me, but I'm skeptical of all the recent so-called 'breakthroughs'.

By @stefanoco - 16 days
Is it just me, or are the affiliations totally missing from the cited paper? Looks like the authors come from a mix of UK/US institutions.
By @ConanRus - 17 days
Wait
By @yapyap - 17 days
> If you believe that AI development is a prime national security advantage, then you absolutely should want even more money poured into AI development, to make it go even faster.

This, this is the problem for me with people deep in AI. They think it’s the end-all, be-all for everything. They have the vision of the ‘AI’ they’ve seen in movies in mind, see the current ‘AI’ being used, and to them it’s basically the same; their brain is mentally bridging the concepts and saying it’s only a matter of time.

To me, that’s stupid. I observe the more populist and socially appealing CEOs of these VC startups (Sam Altman being the biggest, of course) just straight up lying to the masses, for financial gain, of course.

Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.

and the enthusiasm of these people that are into it feeds into those who aren’t aware of it in the slightest, they see you can chat with a ‘robot’, they hear all this hype from their peers and they buy into it. We are social creatures after all.

I think using any of this in a national security setting is stupid, wasteful and very, very insecure.

Hell, if you really care about being ahead, pour 500 billion dollars into quantum computing so you can try to break current encryption. That’ll get you so much further than this nonsensical bs.

By @sambull - 17 days
That sovereign wealth fund with TikTok might set a good precedent: when we have to 'pour money' into these companies, we can do so with a stake in them held in our sovereign wealth fund.
By @GTP - 17 days
Sorry for being lazy, but I just don't have the time right now to read the paper. Is there, in the paper or somewhere else, a benchmark comparison of S1 vs R1 (the full R1, not quantized or distilled)?
By @HenryBemis - 17 days
> Going forward, it’ll be nearly impossible to prevent distealing (unauthorized distilling). One thousand examples is definitely within the range of what a single person might do in normal usage, no less ten or a hundred people. I doubt that OpenAI has a realistic path to preventing or even detecting distealing outside of simply not releasing models.

(sorry for the long quote)

I will say (naively perhaps) "oh, but that is fairly simple". For any API request from 'unverified' users, add a 5-second delay before the next one is allowed. Make a "blue check" (a la X/Twitter). For the 'big sales', have a third-party vetting process so that if US Corporation XYZ wants access, they prove themselves worthy/not Chinese competition, and then you do give them the 1000/min deal.

For everyone else, add the 5 second (or whatever other duration makes sense) timer/overhead and then see them drop from 1000 requests per minutes to 500 per day. Or just cap them at 500 per day and close that back-door. And if you get 'many cheap accounts' doing hand-overs (AccountA does 1-500, AccountB does 501-1000, AccountC does 1001-1500, and so on) then you mass block them.
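
A minimal sketch of that tiered throttle, assuming a single-process, in-memory counter (illustrative only, and obviously not a complete anti-distillation defense on its own):

    # Sliding-window rate limiter with per-tier quotas (toy example).
    import time
    from collections import defaultdict

    LIMITS = {"verified": (1000, 60), "unverified": (500, 86400)}  # (max requests, window in seconds)
    _history = defaultdict(list)

    def allow_request(account_id: str, tier: str) -> bool:
        max_requests, window = LIMITS[tier]
        now = time.time()
        recent = [t for t in _history[account_id] if now - t < window]
        _history[account_id] = recent
        if len(recent) >= max_requests:
            return False          # over quota: reject (or delay) the call
        recent.append(now)
        return True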