What's interesting is that you can already see the "AI race" dynamics in play -- OpenAI must be under immense market pressure to push o3 out to the public to reclaim "king of the hill" status.
I suppose they're under some pressure to release o3-mini, since r1 is roughly a peer for that, but r1 itself is still quite rough. The o1 series had seen significantly more QA time to smooth out the rough edges and idiosyncrasies, the kind of polish a "production" model should be optimized for, as opposed to just being a top scorer on benchmarks.
We'll likely only see o3 once there is a true polished peer for it. It's a race, and companies are keeping their best models close to their chest, as they're used internally to train smaller models.
e.g., Claude 3.5 Opus has been around for quite a while, but it's unreleased. Instead, it was just used to refine Claude Sonnet 3.5 into Claude Sonnet 3.6 (the "3.6" is for lack of a better name, since officially it's still called 3.5).
We also might see a new GPT-4o refresh trained up from o3 via DeepSeek's distillation technique and other tricks.
There are a lot of new directions to go in now for OpenAI, but unfortunately, we won't likely see them until their API dominance comes under threat.
When I saw these numbers back in the initial o3-ARC post, I immediately converted them into "$ per ARC-AGI-1 %" and concluded we may be at a point where each additional increment of 'real human-like novel reasoning' gets exponentially more compute-costly.
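To make that concrete, the conversion is just a back-of-the-envelope division; the cost and score values below are made-up placeholders to show the shape of the metric, not the actual o3 figures:

    # Rough sketch of the "$ per ARC-AGI-1 %" conversion.
    # Costs and scores here are illustrative placeholders only.
    def dollars_per_point(total_cost_usd: float, score_pct: float) -> float:
        """Cost of each ARC-AGI-1 percentage point for one eval run."""
        return total_cost_usd / score_pct

    low_compute = dollars_per_point(total_cost_usd=2_000, score_pct=76.0)
    high_compute = dollars_per_point(total_cost_usd=1_500_000, score_pct=88.0)
    print(f"low-compute:  ${low_compute:,.0f} per %")
    print(f"high-compute: ${high_compute:,.0f} per %")
    # If the last few points cost orders of magnitude more per %, the
    # scaling of "novel reasoning" looks exponential rather than linear.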
If Mike Knoop is correct, maybe R1 is pointing the way toward more efficient approaches. That would certainly be a good thing. The whole DeepSeek release, and the reactions to it, have shown that by limiting exports of high-end GPUs to China, the US incentivized China to figure out how to make lower-end GPUs work really well. The more subtle meta-lesson here is that the massive flood of investment capital being shoved toward leading-edge AI companies has fostered a drag-race mentality that prioritizes winning on top-line performance far above efficiency and cost.
$3.4K is about what you might pay a Magic Circle lawyer for an opinion on a matter. Not saying o3 is an efficient use of resources, just saying that it’s not outlandish that a sufficiently good AI could be worth that kind of money.
You pay that price to a law firm to get good service and a "guarantee" of correctness. You get neither from an LLM. Not saying it isn't worth anything, but you can't compare it to a top law firm.
1. Showing that two distant topics or ideas are actually much more closely related than they appear. The creative sees one example of an idea and applies it to a discipline where nobody expects it. In theory, this reduction of the maximally distant can probably be measured with a tangible metric (a rough sketch follows after this list).
2. Discovery of ideas that are even more maximally distant. Pushing the edge, and this can actually be done by pure search and randomness. But it's no good if it's garbage. The trick is: what counts as garbage? That is very context-dependent.
(Also, a creative might be measured on the efficiency of these metrics rather than absolute output)
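For point 1, one crude way to put a number on "distance between ideas" is embedding distance. This is only a sketch; the model name and example phrases are arbitrary illustrative choices, not anything established in this thread:

    # Crude sketch of a "distance between ideas" metric for point 1 above,
    # using off-the-shelf sentence embeddings. The model name and example
    # phrases are arbitrary illustrative choices.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def idea_distance(a: str, b: str) -> float:
        """Cosine distance between two short idea descriptions (0 = identical)."""
        va, vb = model.encode([a, b])
        return 1.0 - float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

    # A "creative" connection links ideas whose prior distance is large,
    # so one could score a discovery by how much distance it bridges.
    print(idea_distance("ant colony foraging behavior", "network packet routing"))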