September 13th, 2024

OpenAI o1 Results on ARC-AGI-Pub

OpenAI's new o1-preview and o1-mini models enhance reasoning through a chain-of-thought approach, showing improved performance but requiring more time, with modest results on ARC-AGI benchmarks.


OpenAI has recently released its o1-preview and o1-mini models, which are designed to enhance reasoning capabilities through a chain-of-thought (CoT) approach. These models were tested against the ARC Prize benchmarks, revealing that while o1 performs better than previous models like GPT-4o, it requires significantly more time to achieve similar results. The o1 models utilize a new reinforcement learning algorithm that emphasizes generating synthetic CoTs to improve reasoning during training and inference. This method allows for greater adaptability and generalization, particularly in informal language tasks. However, the performance of o1 on ARC-AGI benchmarks remains modest compared to other tasks, indicating that while it excels in structured reasoning, it struggles with novel problem-solving. The relationship between accuracy and test-time compute suggests that increasing computational resources can enhance performance, but this does not equate to achieving artificial general intelligence (AGI). The findings highlight the need for innovative approaches beyond current methodologies to advance towards AGI, as existing models still rely heavily on pre-training data and may not effectively synthesize new reasoning on demand. The ARC Prize aims to foster open-source contributions to AGI research, encouraging new ideas and collaboration in the field.
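How o1 does this internally is not public, but the general idea of spending more test-time compute on sampled reasoning can be illustrated with a minimal majority-vote sketch; the sample_chain_of_thought callable below is a hypothetical stand-in for a model call, not OpenAI's actual procedure.

    # Toy sketch of "more test-time compute on reasoning": sample several
    # chains of thought and keep the majority answer. Illustration only,
    # not OpenAI's (unpublished) o1 algorithm.
    from collections import Counter

    def solve_with_more_compute(task, sample_chain_of_thought, n_samples=8):
        # sample_chain_of_thought is a hypothetical callable that returns
        # (reasoning_text, final_answer) for one sampled chain of thought.
        answers = [sample_chain_of_thought(task)[1] for _ in range(n_samples)]
        # Raising n_samples buys accuracy at a roughly linear cost in compute.
        return Counter(answers).most_common(1)[0][0]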

- OpenAI's o1 models show improved reasoning but require more time for similar results compared to previous models.

- The chain-of-thought approach enhances adaptability and generalization in informal language tasks.

- Performance on ARC-AGI benchmarks is modest, indicating challenges in novel problem-solving.

- Increasing computational resources can improve accuracy but does not guarantee AGI.

- The ARC Prize encourages innovative contributions to advance AGI research.

AI: What people are saying
The comments on OpenAI's o1-preview and o1-mini models reveal a mix of opinions and insights regarding their performance and implications for AGI.
  • Users note that while o1-preview shows improved reasoning capabilities, it requires significantly more time than other models like Claude 3.5 Sonnet to achieve comparable results.
  • Some commenters express skepticism about the effectiveness of the ARC-AGI benchmarks, questioning their relevance and the validity of the tasks involved.
  • There is a consensus that o1 represents a step forward in LLM reasoning, but concerns remain about its limitations in generalization and potential for hallucinations.
  • Several users highlight the competitive landscape, comparing o1's performance to other models and discussing the strategies that may enhance its effectiveness.
  • Practical questions arise regarding the application of o1 in real-world scenarios, such as integrating it into existing codebases.
19 comments
By @killthebuddha - 5 months
In my opinion this blog post is a little bit misleading about the difference between o1 and earlier models. When I first heard about ARC-AGI (a few months ago, I think) I took a few of the ARC tasks and spent a few hours testing all the most powerful models. I was kind of surprised by how completely the models fell on their faces, even with heavy-handed feedback and various prompting techniques. None of the models came close to solving even the easiest puzzles. So today I tried again with o1-preview, and the model solved (probably the easiest) puzzle without any kind of fancy prompting:

https://chatgpt.com/share/66e4b209-8d98-8011-a0c7-b354a68fab...

Anyways, I'm not trying to make any grand claims about AGI in general, or about ARC-AGI as a benchmark, but I do think that o1 is a leap towards LLM-based solutions to ARC.

By @Stevvo - 5 months
"Greenblatt" shown with 42% in the bar chart is GPT-4o with a strategy: https://substack.com/@ryangreenblatt/p-145731248

So, how well might o1 do with Greenblatt's strategy?
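As I understand the linked post, the strategy is roughly sample-and-verify: ask GPT-4o for many candidate Python programs and keep one that reproduces every training example. A minimal sketch of that loop (generate_candidate_program is a hypothetical LLM call, and the actual post uses far more machinery):

    # Rough sketch of a sample-and-verify program search in the spirit of the
    # linked post. generate_candidate_program is a hypothetical function that
    # asks an LLM for a Python transform mapping input grids to output grids.
    def solve_arc_task(train_pairs, test_input, generate_candidate_program,
                       n_candidates=1000):
        for _ in range(n_candidates):
            program = generate_candidate_program(train_pairs)
            try:
                if all(program(inp) == out for inp, out in train_pairs):
                    return program(test_input)  # passes every training example
            except Exception:
                continue  # discard candidates that crash
        return None  # no candidate survived verification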

By @w4 - 5 months
> o1's performance increase did come with a time cost. It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.

Sheesh. We're going to need more compute.

By @fsndz - 5 months
As expected, I've always believed that with the right data allowing the LLM to be trained to imitate reasoning, it's possible to improve its performance. However, this is still pattern matching, and I suspect that this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the "reasoning programs" or "reasoning patterns" the model learned during the reinforcement learning phase. https://www.lycee.ai/blog/openai-o1-release-agi-reasoning
By @GaggiX - 5 months
It really shows how far ahead Anthropic is/was when they released Claude 3.5 Sonnet.

That being said, the ARC-AGI test is mostly a visual test, and in my opinion it will be much easier to beat once these models are truly multimodal (not just appending a separate vision encoder after training).

I wonder what the graph will look like a year from now; the models have improved a lot over the last one.

By @alphabetting - 5 months
This is the best AGI benchmark out there, in my opinion. Surprising results that underscore how good Sonnet is.
By @mrcwinn - 5 months
How is Anthropic accomplishing this despite (seemingly) arriving later? What advantage do they have?
By @fancyfredbot - 5 months
I found the level-headed explanation of why log-linear improvements in test score with increased compute aren't revolutionary to be the best part of this article. That's not to say the rest wasn't good too! One of the best articles on o1 I've read.
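To make the "log linear" point concrete: it says accuracy grows roughly linearly in the log of compute, so each additional point of score costs exponentially more. A minimal fit sketch with made-up placeholder numbers (not the article's data):

    # Minimal sketch of a log-linear relationship: accuracy ~ a + b*ln(compute).
    # The numbers below are illustrative placeholders, not benchmark results.
    import numpy as np

    compute = np.array([1, 2, 4, 8, 16, 32])                   # relative budget
    accuracy = np.array([0.10, 0.13, 0.16, 0.19, 0.22, 0.25])  # hypothetical
    b, a = np.polyfit(np.log(compute), accuracy, deg=1)
    print(f"accuracy ~= {a:.2f} + {b:.2f} * ln(compute)")
    # Each doubling of compute adds only ~b*ln(2) accuracy while cost doubles,
    # which is why log-linear gains alone aren't revolutionary.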
By @benreesman - 5 months
The test you really want is the apples-to-apples comparison between GPT-4o faced with the same CoT and other context annealing that presumably, uh, Q* sorry Strawberry now feeds it (on your dime). This would of course require seeing the tokens you are paying for instead of being threatened with bans for asking about them.

Compared to the difficulty in assembling the data and compute and other resources needed to train something like GPT-4-1106 (which are staggering), training an auxiliary model with a relatively straightforward, differentiable, well-behaved loss on a task like "which CoT framing is better according to human click proxy" is just not at that same scale.

By @Terretta - 5 months
TL;DR (direct quote):

“In summary, o1 represents a paradigm shift from "memorize the answers" to "memorize the reasoning" but is not a departure from the broader paradigm of fitting a curve to a distribution in order to boost performance by making everything in-distribution.”

“We still need new ideas for AGI.”

By @ec109685 - 5 months
Why is this considered such a great AGI test? It seems possible to extensively train a model on the algorithms used to solve these cases, and some cases feel beyond what a human could straightforwardly figure out.
By @a_wild_dandan - 5 months
This tests vision, not intelligence. A reasoning test dependent on noisy information is borderline useless.
By @lossolo - 5 months
It seems like o1 is a lot worse than Claude on coding tasks https://livebench.ai
By @perching_aix - 5 months
Is it possible for me, a human, to undertake these benchmarks?
By @Alifatisk - 5 months
This is great marketing for Anthropic
By @meowface - 5 months
Takeaway:

>o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.

Scores:

>GPT-4o: 9%

>o1-preview: 21%

>Claude 3.5 Sonnet: 21%

>MindsAI: 46% (current highest score)

By @bulbosaur123 - 5 months
Ok, I have a practical question. How do I use this o1 thing to view the codebase for my game app and then simply add new features based on my prompts? Is it possible rn? How?
By @devit - 5 months
Am I missing something, or is this "ARC-AGI" thing so ludicrously terrible that it's completely irrelevant?

It seems that the tasks consist of giving the model examples of a transformation of an input colored grid into an output colored grid, and then asking it to provide the output for a given input.
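(For concreteness, a task is shaped roughly like the toy Python dict below; grids are small matrices of color indices 0-9. This is a hand-made example of the format, not an actual benchmark task.)

    # Toy example in the shape of an ARC task (not a real task from the set).
    # The hidden rule in this made-up example is "swap colors 1 and 2".
    task = {
        "train": [
            {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
            {"input": [[2, 2], [1, 0]], "output": [[1, 1], [2, 0]]},
        ],
        "test": [
            {"input": [[0, 1], [2, 1]]},  # expected output: [[0, 2], [1, 2]]
        ],
    }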

The problem is of course that the transformation is not specified, so any answer is actually acceptable since one can always come up with a justification for it, and thus there is no reasonable way to evaluate the model (other than only accepting the arbitrary answer that the authors pulled out of who knows where).

It's like those stupid tests that tell you "1 2 3 ..." and you are supposed to complete with 4, but obviously that's absurd since any continuation is valid: e.g. you can find a polynomial that passes through any four numbers, and the test maker didn't provide any objective criteria to determine which algorithm among multiple candidates is to be preferred.
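To make that concrete, here is a quick sketch of the polynomial trick (toy values chosen arbitrarily): take f(x) = x + c*(x-1)*(x-2)*(x-3), which passes through 1, 2, 3 for every c, while f(4) = 4 + 6c can be made anything.

    # A cubic that continues "1 2 3" with any value you like at x = 4.
    def f(x, c):
        return x + c * (x - 1) * (x - 2) * (x - 3)

    for c in (0, 1, -0.5):
        print([f(x, c) for x in (1, 2, 3, 4)])
    # c = 0    -> [1, 2, 3, 4]
    # c = 1    -> [1, 2, 3, 10]
    # c = -0.5 -> [1.0, 2.0, 3.0, 1.0]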

Basically, something like this is about guessing how the test maker thinks, which is completely unrelated to the concept of AGI (i.e. the ability to provide correct answers to questions based on objectively verifiable criteria).

And if instead of AGI one is just trying to evaluate how the model predicts how the average human thinks, then it makes no sense at all to evaluate language model performance by performance on predicting colored grid transformations.

For instance, since normal LLMs are not trained on colored grids, any model specifically trained on colored-grid transformations as performed by humans of similar "intelligence" to the ARC-"AGI" test maker is going to outperform normal LLMs at ARC-"AGI", despite not really being a better model in general.