
Well, o3 scored 75% on ARC-AGI-1; R1 and o1 only 25%. Watch this space, though.


What's interesting is that you can already see the "AI race" dynamics in play -- OpenAI must be under immense market pressure to push o3 out to the public to reclaim "king of the hill" status.


I suppose they're under some pressure to release o3-mini, since r1 is roughly a peer for that, but r1 itself is still quite rough. The o1 series has seen significantly more QA time to smooth out the rough edges and idiosyncrasies, which is what a "production" model should be optimized for, vs. just being a top scorer on benchmarks.

We'll likely only see o3 once there is a true polished peer for it. It's a race, and companies are keeping their best models close to their chest, as they're used internally to train smaller models.

e.g., Claude 3.5 Opus has been around for quite a while, but it's unreleased. Instead, it was just used to refine Claude Sonnet 3.5 into Claude Sonnet 3.6 (calling it 3.6 for lack of a better name, since officially it's still 3.5).

We might also see a new GPT-4o refresh trained up from o3 via DeepSeek's distillation technique and other tricks.

There are a lot of new directions for OpenAI to go in now, but unfortunately we likely won't see them until their API dominance comes under threat.


That could also definitely make sense if the SOTA models are too slow and expensive to be popular with a general audience.


Yeah, but they can use DeepSeek's new algorithm too.


with 57 million(!!) tokens


From the article:

o3 (low): 75.7%, 335K tokens, $20

o3 (high): 87.5%, 57M tokens, $3.4K


When I saw these numbers back in the initial o3-ARC post, I immediately converted them into "$ per ARC-AGI-1 %" and concluded we may be at a point where each additional increment of "real human-like novel reasoning" gets exponentially more costly in compute.
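For concreteness, that conversion is just a couple of divisions; a back-of-envelope sketch using only the figures quoted above, nothing from the article beyond those numbers:

    # Back-of-envelope: $ per ARC-AGI-1 point, using the score/cost figures quoted above.
    runs = {"o3 (low)": (75.7, 20.0), "o3 (high)": (87.5, 3_400.0)}  # (score %, cost $)

    for name, (score, cost) in runs.items():
        print(f"{name}: ${cost / score:.2f} per point")  # average cost per point

    # Marginal cost of going from the low- to the high-compute setting:
    (s_lo, c_lo), (s_hi, c_hi) = runs["o3 (low)"], runs["o3 (high)"]
    print(f"marginal: ${(c_hi - c_lo) / (s_hi - s_lo):,.0f} per extra point")
    # ~$0.26/pt vs ~$39/pt on average, and roughly $286 per extra point at the margin

So the average cost per point jumps by about two orders of magnitude between the two settings.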

If Mike Knoop is correct, maybe R1 is pointing the way toward more efficient approaches. That would certainly be a good thing. This whole DeepSeek release, and the reactions to it, have shown that by limiting exports of high-end GPUs to China, the US incentivized China to figure out how to make low-end GPUs work really well. The more subtle meta-lesson here is that the massive flood of investment capital being shoved toward leading-edge AI companies has fostered a drag-race mentality that prioritizes winning on top-line performance far above efficiency, cost, etc.


$3.4K is about what you might pay a magic circle lawyer for an opinion on a matter. Not saying o3 is an efficient use of resources, just saying that it’s not outlandish that a sufficiently good AI could be worth that kind of money.


You pay that price to a law firm to get good service and to get a "guarantee" of correctness. You get neither from an LLM. Not saying it's not worth anything, but you can't compare it to a top law firm.


You absolutely do not get a "guarantee" of correctness (even with the air quotes) from any lawyer.


You can sue a lawyer for giving certain kinds of bad advice, and occasionally win. That is what the "guarantee" is about.


You can probably sue OpenAI for getting bad legal advice from ChatGPT too.


Sure, but can you also win the case ;)?

At the bottom of ChatGPT.com I see a disclaimer: "ChatGPT can make mistakes. Check important info."

I don't think you can successfully sue with such a caveat emptor in place.


What's the liability insurance of the AI like?


Refer to IBM’s 1979 slide for details on that


I view it as a positive that the methodology can take in more compute (bitter lesson style)


But can o3 write a symphony?

Seriously though, I'd like to hear suggestions on how to automatically evaluate an AI model's creativity, no humans in the loop.


In my view there are two modes of creativity:

1. Recognizing that two distant topics or ideas are actually much more closely related than they seem. The creative person sees one instance of an idea and applies it to a discipline where nobody expects it. In theory, this reduction of the maximally distant could be measured with a tangible metric (see the sketch after this list).

2. Discovery of ideas that are even more distant: pushing the edge. This can actually be done by pure search and randomness, but it's no good if the result is garbage. The trick is deciding what counts as garbage, which is very context dependent.

(Also, a creative might be measured on the efficiency of these metrics rather than absolute output)
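For mode 1, one crude way to make "distance between ideas" tangible is embedding distance. This is only a sketch; the assumption that cosine distance in an off-the-shelf text-embedding space is a reasonable proxy for conceptual distance is mine, and the model choice is arbitrary.

    # Sketch: score how "distant" two ideas are via embedding cosine distance.
    # Assumption: distance in a general-purpose sentence-embedding space is a
    # usable proxy for conceptual distance; this is not an established metric.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

    def idea_distance(a: str, b: str) -> float:
        emb_a, emb_b = model.encode([a, b], convert_to_tensor=True)
        return 1.0 - util.cos_sim(emb_a, emb_b).item()  # ~0 = same idea, ~1 = unrelated

    # A mode-1 creative act would be a convincing bridge between a pair that scores
    # high here; mode 2 would be searching for pairs that score even higher while
    # still passing some context-dependent "not garbage" filter.
    print(idea_distance("ant colony foraging", "network packet routing"))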


Terry Tao has referred to this classification system as foxes vs hedgehogs. https://en.m.wikipedia.org/wiki/The_Hedgehog_and_the_Fox


we'd have to create a numerical scale for creativity, from boring to Dali, with milliEschers and MegaGeigers somewhere in there as well


It's essential that we quantify everything so that we can put a price on it. I'd go with Kahlograms though.


Have you tried suno.ai?


Have _you_? It lost its novelty after a couple of days.


I probably listen to Suno (both my own songs, and songs other people have created) about as often as I listen to Spotify, these days.


LLMs have read everything humans have made, so just ask one if there's anything truly new in that freshly confabulated slop-phony.



