S1: The $6 R1 Competitor?
A new AI model shows promising performance on standard laptops, emphasizing inference-time scaling, cost-effective training, and the need for investment in AI research amid concerns about unauthorized model distillation.
A recent paper has sparked interest in the AI community by demonstrating a new model that, while not state-of-the-art, can run on standard laptops and reveals insights into how these systems work. The paper discusses inference-time scaling laws, suggesting that longer "thinking" times can enhance performance in large language models (LLMs). It introduces a method to control response length by manipulating internal tags, allowing the model to second-guess its answers. The low training cost of roughly $6 is attributed to the model's small size and a curated dataset of 1,000 examples, which proved sufficient for achieving high performance. This efficiency enables extensive experimentation, highlighting the importance of iterative testing in AI development. The paper also touches on the geopolitical implications of AI advancements, emphasizing the need for substantial investment in AI research to maintain a competitive edge. Additionally, it raises concerns about unauthorized distillation of models, suggesting that preventing such practices may become increasingly difficult. Overall, the findings indicate a rapid pace of AI development, with potential breakthroughs anticipated in 2025.
- A new AI model demonstrates significant performance with minimal resources.
- The model utilizes innovative techniques to control inference time and response quality.
- Cost-effective training methods allow for extensive experimentation and faster AI advancements.
- Geopolitical considerations highlight the importance of investment in AI research.
- Concerns about unauthorized model distillation are raised, complicating future AI development.
Related
Forget ChatGPT: why researchers now run small AIs on their laptops
Researchers are increasingly using small AI models locally on laptops for cost savings, privacy, and reproducibility. Open-weight models enable customization, with major firms releasing efficient alternatives to larger models.
Throw more AI at your problems
The article advocates for using multiple LLM calls in AI development, emphasizing task breakdown, cost management, and improved performance through techniques like RAG, fine-tuning, and asynchronous workflows.
Implications of Plateauing LLMs – Sympolymathesy, by Chris Krycho
Chris Krycho argues that while large language models may have plateaued in scaling, advancements in multi-modality and efficiency remain promising, alongside ethical concerns regarding their training and deployment.
AI Scaling Laws
The article examines AI scaling laws, emphasizing ongoing investments by major labs, the importance of new paradigms for model performance, and the need for better evaluations amid existing challenges.
Lessons from building a small-scale AI application
Richard Li highlights early scaling challenges in AI, emphasizing data quality, evaluation strategies, and the importance of the training pipeline. He advocates for cautious adoption of new AI libraries and hands-on experimentation.
- Concerns about unauthorized model distillation and its ethical implications are prevalent, with some arguing it undermines scientific research.
- Many commenters express fascination with the techniques used for inference scaling, particularly the "Wait" hack, and its potential for further optimization.
- There is skepticism about the effectiveness and efficiency of the new models compared to existing ones, with some suggesting they may not represent significant advancements.
- Several users highlight the importance of cost-effective training and the potential for broader access to AI technologies on standard laptops.
- Discussions about the future of AI research emphasize the need for continued investment and the potential risks of overhyping AI capabilities.
I dismissed the X references to S1 without reading them, a big mistake. I have been working generally in AI for 40 years and in neural networks for 35 years, and the exponential progress since the hacks that made deep learning possible has been breathtaking.
The reduction in processing and memory requirements for running models is incredible. I have personally been struggling to build my own LLM-based agents with weaker on-device models (the same experiments usually work with 4o-mini and above), but either my skills will get better or I can wait for better on-device models.
I was experimenting with the iOS/iPadOS/macOS app On-Device AI last night, and the person who wrote it managed to get web-search tool calling working with a very small model, something that I have been trying to perfect.
Whatever you want to call this “reasoning” step, ultimately it really is just throwing the model into a game loop. We want to interact with it on each tick (spin the clay), and sculpt every second until it looks right.
You will need to loop against an LLM to do just about anything and everything, forever - this is the default workflow.
Those who think we will quell our thirst for compute have another thing coming; we’re going to be insatiable with how much LLM brute-force looping we will do.
I think the ball is very much in their court to demonstrate they actually are using their massive compute in such a productive fashion. My BigTech experience would tend to suggest that frugality went out the window the day the valuation took off, and they are in fact just burning compute for little gain, because why not...
> In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait".
I found a few days ago that this lets you 'inject' your own CoT and jailbreak it more easily. Maybe these are related?
Why would you control the inference at the token level? Wouldn’t the more obvious (and technically superior) place to control repeat analysis of the optimal path through the search space be in the inference engine itself?
Doing it by saying “Wait” feels like fixing dad’s laptop over a phone call. You’ll get there, but driving over and getting hands on is a more effective solution. Realistically, I know that getting “hands on” with the underlying inference architecture is way beyond my own technical ability. Maybe it’s not even feasible, like trying to fix a cold with brain surgery?
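For what it's worth, mainstream inference stacks do expose a hook for exactly this kind of engine-level control. Below is a minimal sketch using Hugging Face transformers' logits-processor interface, which bans the end-of-thinking token until a budget is spent; the token id, prompt length, and budget are placeholders, and this illustrates the commenter's idea rather than the s1 authors' actual code.

```python
# Illustrative only: suppress the end-of-thinking token at the logit level
# instead of patching the decoded string. All parameters are placeholders.
import torch
from transformers import LogitsProcessor

class MinThinkingBudget(LogitsProcessor):
    def __init__(self, end_think_id: int, prompt_len: int, min_new_tokens: int):
        self.end_think_id = end_think_id      # tokenizer id of "</think>" (model-specific)
        self.prompt_len = prompt_len          # so only generated tokens count toward the budget
        self.min_new_tokens = min_new_tokens  # the "thinking" budget

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        if input_ids.shape[1] - self.prompt_len < self.min_new_tokens:
            scores[:, self.end_think_id] = float("-inf")  # token cannot be sampled early
        return scores
```

Passed to generate() via the logits_processor argument, this enforces a minimum thinking length inside the engine; it does not, by itself, inject the "Wait" nudge that the string-replacement trick adds.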
S1 is fully supervised by distilling Gemini. R1 works by reinforcement learning with a much weaker judge LLM.
They don't follow the same scaling laws. They don't give you the same results. They don't have the same robustness. You can use R1 for your own problems. You can't use S1 unless Gemini works already.
We know that distillation works and is very cheap. This has been true for a decade; there's nothing here.
S1 is a rushed hack job (they didn't even run most of their evaluations, with the excuse that the Gemini API is too hard to use!) that probably existed before R1 was released and was then pivoted into this mess.
In this case, I was also forcing R1 to continue thinking by replacing </think> with “Okay,” after augmenting reasoning with web search results.
There is a finite amount of information stored in any large model; the models are really good at presenting the correct information back, and adding thinking blocks made them even better at doing that. But there is a cap to that.
Just as you can compress a file only so far before the compression becomes lossy, there is a theoretical maximum to the amount of relevant information you can get out of a model, regardless of how long it is forced to think.
The traces are generated by Gemini Flash Thinking.
8 hours of H100 is probably more like $24 if you want any kind of reliability, rather than $6.
We’ve been working on a project together, and every morning for the past two months she’s sent me clean, perfectly organized FED data. I assumed she was just working late to get ahead. Turns out, she automated the whole thing and even scheduled it to send automatically.
Tasks that used to take hours, like gathering thousands of rows of data, cleaning it, running regression analysis, time series, and hypothesis testing, she now completes almost instantly. Everything. Even random things like finding discounts for her Pilates class. She just needs to check that everything is good.
She’s not super technical, so I was surprised she could build these complicated workflows, but the craziest part is that she just prompted the whole thing. She types something like “compile a list of X, format it into a CSV, and run X analysis” or “go to Y, see what people are saying, give me background of the people saying Z” and it just works. She’s even joking about connecting it to the office printer. I’m genuinely baffled. The barrier to effort is gone.
Now we’ve got a big market report due next week, and she told me she’s planning to use DeepResearch to handle it while she takes the week off. It’s honestly wild. I don’t think most people realize how doomed knowledge work is.
The larger the organisation, the fewer experiments you can afford to do. Employees are mostly incentivised to get something done quickly enough not to be fired in this job market. They know that the higher-ups would let them go for the sake of temporary gains. Rush this deadline, ship that feature, produce something that looks OK enough.
> After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that’s needed to achieve o1-preview performance on a 32B model. Adding data didn’t raise performance at all.
> 32B is a small model, I can run that on my laptop. They used 16 NVIDIA H100s for 26 minutes per training run, that equates to around $6.
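A quick back-of-the-envelope on that figure (the hourly H100 rates below are assumptions and vary widely by provider; this is just the arithmetic, not a claim about what the authors actually paid):

```python
# Sanity-check the "$6 per training run" claim from the quoted numbers.
gpus = 16            # NVIDIA H100s per run (from the quote above)
minutes = 26         # wall-clock minutes per run (from the quote above)
gpu_hours = gpus * minutes / 60                  # ~6.9 H100-hours per run
for rate in (0.90, 2.00, 3.50):                  # assumed $/H100-hour tiers
    print(f"${rate:.2f}/hr -> ${gpu_hours * rate:,.0f} per run")
# ~$6 only holds at sub-$1 spot pricing; at $2-3.50/hr it is closer to $14-$24,
# which is roughly where the ~$25 figures elsewhere in the thread come from.
```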
* The $5M DeepSeek-R1 (and now this cheap $6 R1) are both based on very expensive oracles (if we believe DeepSeek-R1 queried OpenAI's model). If these are improvements on existing models, why is this being reported as decimating training costs? Isn't fine-tuning already a cheap way to optimize? (maybe not as effective, but still)
* The R1 paper talks about improving one simple game - Countdown. But the original models are "magic" because they can solve a nearly uncountable number of problems and scenarios. How does the DeepSeek / R1 approach scale to the same gigantic scale?
* Phrased another way, my understanding is that these techniques are using existing models as black-box oracles. If so, how many millions/billions/trillions of queries must be probed to replicate and improve the original dataset?
* Is anything known about the training datasets used by DeepSeek? OpenAI used presumably every scraped dataset they could get their hands on. Did DS do the same?
I hope it gets tested further.
Kidding, but not really. It's fascinating how we seem to be seeing a gradual convergence of machine learning and psychology.
I know some are really opposed to anthropomorphizing here, but this feels eerily similar to the way humans work, i.e. if you just dedicate more time to analyzing and thinking about the task, you are more likely to find a better solution.
It also feels analogous to navigating a tree: the more time you have to explore the nodes, the more of the space you will have covered, hence a higher chance of finding a more optimal solution.
At the same time, if you have "better intuition" (better training?), you might be able to find a good solution faster, without needing to think too much about it
> Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end
I'm feeling proud of myself that I had the crux of the same idea almost 6 months ago, before reasoning models came out (and a bit disappointed that I didn't take this idea further!). Basically, during inference you have to choose the next token to sample. Usually people just sample the distribution using the same sampling rules at each step... but you don't have to! You can selectively insert words into the LLM's mouth based on what it said previously or what it wants to say, and decide "nah, say this instead". I wrote a library so that you could sample an LLM using llama.cpp in Swift and write rules to sample tokens and force tokens into the sequence depending on what was sampled. https://github.com/prashanthsadasivan/LlamaKit/blob/main/Tes...
Here, I wrote a test that asks Phi-3 instruct "how are you", and if it tried to say "as an AI I don't have feelings" or "I'm doing ", I forced it to say "I'm doing poorly" and refuse to help, since it was always so dang positive. It sorta worked, though the instruction-tuned models REALLY want to help. But at the time I just didn't have a great use case for it. I had thought about a more conditional extension to llama.cpp's grammar sampling (you could imagine changing the grammar based on previously sampled text), or even just making it go down certain paths, but I lost steam because I couldn't describe a killer use case for it.
This is that killer use case! Forcing the model to think more is such a great use case for inserting ideas into the LLM's mouth, and I feel like there must be more to this idea to explore.
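A minimal sketch of that budget-forcing loop, in the spirit of the passage quoted above; sample_next_token is a hypothetical stand-in for whatever backend does the decoding (llama.cpp, vLLM, etc.), and real implementations would match "</think>" at the token-id level, since the tag may span several tokens.

```python
END_THINK = "</think>"

def generate_with_budget(prompt, sample_next_token, min_think_tokens=512, max_tokens=4096):
    """Suppress the end-of-thinking tag until a minimum token budget is spent.

    If the model tries to stop thinking early, the tag is replaced with "Wait,",
    nudging it to second-guess and extend its reasoning.
    """
    text = prompt
    for step in range(max_tokens):
        piece = sample_next_token(text)                 # backend-specific call (assumed)
        if END_THINK in piece and step < min_think_tokens:
            piece = piece.replace(END_THINK, "Wait,")   # force more thinking
        text += piece
        if END_THINK in text[len(prompt):]:             # budget spent: let it stop
            break
    return text
```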
Hopefully we won’t lose all access to models in future
With that being said, I don’t think the benchmarks we currently have are strong enough, and the next frontier models are yet to come. I’m sure that at this point U.S. LLM research firms understand their lack of infra/hardware optimization (they just threw compute at the problem) and will begin paying closer attention. Their RL and parent-model training will become even stronger, while the newly freed resources can go toward the sub-optimizations that have traditionally been avoided due to computational overhead.
Has anyone run it on a laptop (unquantized)? Disk size of the 32B model appears to be 80GB. Update: I'm using a 40GB A100 GPU. Loading the model took 30GB vRAM. I asked a simple question "How many r in raspberry". After 5 minutes nothing got generated beyond the prompt. I'm not sure how the author ran this on a laptop.
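For context on the memory math: a 32B-parameter model is roughly 64 GB of weights in bf16, so it will not fit unquantized in 40 GB of VRAM without offloading, let alone on a typical laptop. Below is a hedged sketch of loading it 4-bit instead; the Hugging Face id is an assumption, and 4-bit is of course not the unquantized run the commenter asked about.

```python
# Rough sketch: fit a 32B checkpoint on a ~24-40 GB GPU via 4-bit quantization.
# Model id is assumed; 32B weights are ~64 GB in bf16, roughly 18-20 GB in 4-bit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "simplescaling/s1-32B"  # assumed Hugging Face id for the released checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",             # spill layers to CPU if VRAM runs out
)

prompt = "How many r in raspberry"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))
```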
However, I think this is coming. DeepSeek mentioned it was hard to learn a value model for MCTS from scratch, but this doesn’t mean we couldn’t seed it with some annotated data.
Couldn't they just start hiding the thinking portion?
It would be easy for them to do this. Currently they already provide one-sentence summaries for each step of the thinking; I think users would be fine, or at least stay, if it were changed to provide only that.
These reasoning models also feed OP's last point about the NVIDIA and OpenAI data centers not being wasted, since reasoning models require more tokens and faster tps.
It also gave them a few months to recoup costs!
Wait, actually 1 + 1 equals 1.
Running where? H100s are usually over $2/hr, so that's closer to $25.
This is the most important point, and why DeepSeek’s cheaper training matters.
And if you check the R1 paper, they have a section for “things that didn’t work”, each of which would normally be a paper of its own; because their training was so cheap and streamlined, they could try a bunch of things.
Those models can be trained in a way tailored to produce good results on specific benchmarks, making them far less general than they seem. No accusation from me, but I'm skeptical of all the recent so-called 'breakthroughs'.
This, this is the problem for me with people deep in AI. They think it’s the end-all be-all for everything. They have the vision of the ‘AI’ they’ve seen in movies in mind, see the current ‘AI’ being used, and to them it’s basically the same thing; their brain is mentally bridging the concepts and saying it’s only a matter of time.
To me, that’s stupid. I observe the more populist and socially appealing CEOs of these VC startups (Sam Altman being the biggest, of course) just straight-up lying to the masses for financial gain.
Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.
And the enthusiasm of the people who are into it feeds into those who aren’t aware of it in the slightest: they see you can chat with a ‘robot’, they hear all this hype from their peers, and they buy into it. We are social creatures after all.
I think using any of this in a national security setting is stupid, wasteful and very, very insecure.
Hell, if you really care about being ahead, pour 500 billion dollars into quantum computing so you can try to break current encryption. That’ll get you so much further than this nonsensical BS.
(sorry for the long quote)
I will say (naively perhaps) "oh, but that is fairly simple". For 'unverified' users, add a 5-second delay between one API request and the next. Make a "blue check" tier (à la X/Twitter). For the big sales, have a third-party vetting process, so that if US Corporation XYZ wants access, they prove themselves legitimate and not a Chinese competitor, and then you do give them the 1000/min deal.
For everyone else, add the 5-second (or whatever duration makes sense) timer/overhead and watch them drop from 1000 requests per minute to 500 per day. Or just cap them at 500 per day and close that back door. And if you get many cheap accounts doing hand-overs (AccountA does 1-500, AccountB does 501-1000, AccountC does 1001-1500, and so on), then you mass-block them.
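Purely as an illustration of that tiering (the class name, limits, and delay below are made up for the example; nothing here reflects any vendor's actual API):

```python
import time
from collections import defaultdict

DAY = 86_400  # seconds

class TieredLimiter:
    """Hypothetical sketch: verified accounts get a per-minute limit, everyone
    else gets a small daily cap plus a fixed delay per request."""

    def __init__(self, verified_per_min=1000, unverified_per_day=500, unverified_delay_s=5):
        self.verified_per_min = verified_per_min
        self.unverified_per_day = unverified_per_day
        self.unverified_delay_s = unverified_delay_s
        self.history = defaultdict(list)   # account_id -> request timestamps

    def allow(self, account_id: str, verified: bool) -> bool:
        now = time.time()
        window = 60 if verified else DAY
        limit = self.verified_per_min if verified else self.unverified_per_day
        recent = [t for t in self.history[account_id] if now - t < window]
        self.history[account_id] = recent
        if len(recent) >= limit:
            return False                               # hard cap reached
        if not verified:
            time.sleep(self.unverified_delay_s)        # artificial per-request delay
        self.history[account_id].append(time.time())
        return True
```

In practice the hard parts are the vetting step and detecting correlated cheap accounts, which this sketch does not attempt.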