
Well, o3 scored 75% on ARC-AGI-1; R1 and o1 only 25%. Watch this space, though.


What's interesting is that you can already see the "AI race" dynamics in play -- OpenAI must be under immense market pressure to push o3 out to the public to reclaim "king of the hill" status.


I suppose they're under some pressure to release o3-mini, since r1 is roughly a peer for that, but r1 itself is still quite rough. The o1 series has seen significantly more QA time to smooth out the rough edges and idiosyncrasies, which is what a "production" model should be optimized for, vs. just being a top scorer on benchmarks.

We'll likely only see o3 once there is a true polished peer for it. It's a race, and companies are keeping their best models close to their chest, as they're used internally to train smaller models.

e.g., Claude 3.5 Opus has been around for quite a while, but it's unreleased. Instead, it was just used to refine Claude Sonnet 3.5 into Claude Sonnet 3.6 (calling it 3.6 for lack of a better name, since officially it's still 3.5).

We might also see a new GPT-4o refresh trained up from o3 via DeepSeek's distillation technique and other tricks.

There are a lot of new directions for OpenAI to go in now, but unfortunately we likely won't see them until their API dominance comes under threat.


That could also definitely make sense if the SOTA models are too slow and expensive to be popular with a general audience.


Yeah, but they can use DeepSeek's new algorithm too.


with 57 million(!!) tokens


From the article:

o3 (low): 75.7%, 335K tokens, $20

o3 (high): 87.5%, 57M tokens, $3.4K


When I saw these numbers back in the initial o3-ARC post, I immediately converted them into "$ per ARC-AGI-1 %" and concluded we may be at a point where each additional increment of "real human-like novel reasoning" gets exponentially more costly in compute.
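For concreteness, that conversion is just a couple of divisions; a back-of-envelope sketch using only the figures quoted above, nothing from the article beyond those numbers:

    # Back-of-envelope: $ per ARC-AGI-1 point, using the score/cost figures quoted above.
    runs = {"o3 (low)": (75.7, 20.0), "o3 (high)": (87.5, 3_400.0)}  # (score %, cost $)

    for name, (score, cost) in runs.items():
        print(f"{name}: ${cost / score:.2f} per point")  # average cost per point

    # Marginal cost of going from the low- to the high-compute setting:
    (s_lo, c_lo), (s_hi, c_hi) = runs["o3 (low)"], runs["o3 (high)"]
    print(f"marginal: ${(c_hi - c_lo) / (s_hi - s_lo):,.0f} per extra point")
    # ~$0.26/pt vs ~$39/pt on average, and roughly $286 per extra point at the margin

So the average cost per point jumps by about two orders of magnitude between the two settings.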

If Mike Knoop is correct, maybe R1 is pointing the way toward more efficient approaches. That would certainly be a good thing. This whole DeepSeek release, and the reactions to it, have shown that by limiting exports of high-end GPUs to China, the US incentivized China to figure out how to make low-end GPUs work really well. The more subtle meta-lesson here is that the massive flood of investment capital being shoved toward leading-edge AI companies has fostered a drag-race mentality that prioritizes winning on top-line performance far above efficiency, cost, etc.


$3.4K is about what you might pay a magic circle lawyer for an opinion on a matter. Not saying o3 is an efficient use of resources, just saying that it’s not outlandish that a sufficiently good AI could be worth that kind of money.


You pay that price to a law firm to get good service and to get a "guarantee" of correctness. You get neither from an LLM. Not saying it's not worth anything, but you can't compare it to a top law firm.


You absolutely do not get a "guarantee" of correctness (even with the air quotes) from any lawyer.


You can sue a lawyer for giving certain kinds of bad advice, and occasionally win. That is what the "guarantee" is about.


You can probably sue OpenAI for getting bad legal advice from ChatGPT too.


Sure, but can you also win the case ;)?

At the bottom of ChatGPT.com I see a disclaimer: "ChatGPT can make mistakes. Check important info."

I don't think you can successfully sue with such a caveat emptor in place.


What's the liability insurance of the AI like?


Refer to IBM’s 1979 slide for details on that


I view it as a positive that the methodology can take in more compute (bitter lesson style)


But can o3 write a symphony?

Seriously though, I'd like to hear suggestions on how to automatically evaluate an AI model's creativity, no humans in the loop.


In my view there are two modes of creativity:

1. Recognizing that two distant topics or ideas are actually much more closely related than they seem. The creative person sees one instance of an idea and applies it to a discipline where nobody expects it. In theory, this reduction of the maximally distant could be measured with a tangible metric (see the sketch after this list).

2. Discovery of ideas that are even more distant: pushing the edge. This can actually be done by pure search and randomness, but it's no good if the result is garbage. The trick is deciding what counts as garbage, which is very context dependent.

(Also, a creative might be measured on the efficiency of these metrics rather than absolute output)
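For mode 1, one crude way to make "distance between ideas" tangible is embedding distance. This is only a sketch; the assumption that cosine distance in an off-the-shelf text-embedding space is a reasonable proxy for conceptual distance is mine, and the model choice is arbitrary.

    # Sketch: score how "distant" two ideas are via embedding cosine distance.
    # Assumption: distance in a general-purpose sentence-embedding space is a
    # usable proxy for conceptual distance; this is not an established metric.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

    def idea_distance(a: str, b: str) -> float:
        emb_a, emb_b = model.encode([a, b], convert_to_tensor=True)
        return 1.0 - util.cos_sim(emb_a, emb_b).item()  # ~0 = same idea, ~1 = unrelated

    # A mode-1 creative act would be a convincing bridge between a pair that scores
    # high here; mode 2 would be searching for pairs that score even higher while
    # still passing some context-dependent "not garbage" filter.
    print(idea_distance("ant colony foraging", "network packet routing"))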


Terry Tao has referred to this classification system as foxes vs hedgehogs. https://en.m.wikipedia.org/wiki/The_Hedgehog_and_the_Fox


we'd have to create a numerical scale for creativity, from boring to Dali, with milliEschers and MegaGeigers somewhere in there as well


It's essential that we quantify everything so that we can put a price on it. I'd go with Kahlograms though.


Have you tried suno.ai?


Have _you_? It lost its novelty after a couple of days.


I probably listen to Suno (both my own songs, and songs other people have created) about as often as I listen to Spotify, these days.


LLMs have read everything humans have made, so just ask one if there's anything truly new in that freshly confabulated slop-phony.



