
"The o3 system demonstrates the first practical, general implementation of a computer adapting to novel unseen problems"

Yet, they said when it was announced:

"OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."

These two statements are completely opposed. I can't take seriously anything this article says about o3.



No, they aren't. Every ARC problem is novel - that's why the benchmark resisted deep learning for so long (and still does, to a degree).

We just don't know how much seeing what an ARC problem looks like in the first place boosts the model's ability to solve them - that limited statement is all the authors are making.


The ARC Prize was created last year. ARC hasn't resisted AI for very long.

See: https://en.wikipedia.org/wiki/Fran%C3%A7ois_Chollet


Huh? It was made in 2019


OK,

I was following Wikipedia: "In 2024, Chollet launched ARC Prize, a US$1 million competition to solve the ARC-AGI benchmark." The ARC benchmark itself appeared in 2019 (https://arcprize.org/). So (shrug)


Your quote is accurate from here:

https://arcprize.org/blog/oai-o3-pub-breakthrough

They were talking about training on the public dataset -- OpenAI tuned the o3 model on 75% of the public training set. There was some idea/hope that these LLMs would gain enough knowledge in the latent space that they would automatically do well on the ARC-AGI problems. But using 75% of the public training set for tuning puts them at about the same challenge level as all the other competitors (who use 100% of the training set).

In the post they were saying they didn't have a chance to test the o3 model's performance on ARC-AGI "out of the box", which is how the 14%-scoring R1-Zero was tested (no SFT, no search). They have been testing LLMs out of the box like this to see if they are "smart" with respect to the problem set by default.
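To make the "out of the box" distinction concrete, here's a rough sketch of that kind of evaluation against ARC-format task files (each JSON file holds "train" demonstration pairs and held-out "test" pairs). This is not the ARC Prize harness; the `solve` function and the task directory are hypothetical placeholders:

    # Minimal sketch of "out of the box" evaluation: the solver only ever sees
    # a task's own demonstration pairs at inference time and is never
    # fine-tuned on the public training set. `solve` is a hypothetical
    # stand-in for the model under test.
    import json
    from pathlib import Path

    def solve(train_pairs, test_input):
        # Placeholder: a real solver would infer the transformation from
        # train_pairs and apply it to test_input; returning the input
        # unchanged is just a trivial baseline.
        return test_input

    def score_tasks(task_dir):
        correct = total = 0
        for path in Path(task_dir).glob("*.json"):
            task = json.loads(path.read_text())  # "train" and "test" pair lists
            for pair in task["test"]:
                total += 1
                if solve(task["train"], pair["input"]) == pair["output"]:
                    correct += 1
        return correct / total if total else 0.0

Nothing in that loop updates the model on the public training set, which is the contrast with the tuned-o3 setup described above.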


Glad someone brought this up.

I'm personally fine with o3 being tuned on the training set as a way to teach models "the rules of the game"; what annoys me is that this wasn't also done with the o1 models or R1. It makes for a misleading comparison that suggests o3 is a huge improvement over o1 when in reality much of that improvement may simply be that one model knew which game it was playing and the others didn't.


They are testing with a different dataset. The authors are saying that they have not tested the version of o3 that has not seen the training set.


Yeah... the whole point is that you're testing the model on something it hasn't seen already. If the problems were in the training set, then by definition the model has seen them before.



