But is arc-agi really that useful though? Nowadays it seems to me that it's just another benchmark that needs to be specifically trained for. Maybe the Chinese models just didn't focus on it as much.
Is it though? Do we still have the expectation that LLMs will eventually be able to solve problems they haven't seen before? Or do we just want the most accurate auto complete at the cheapest price at this point?
It indicates that there's a good chance that they have trained on the test set, making the eval scores useless. Even if you have given up on the dream of generalization entirely, you can't meaningfully compare models which have trained on test to those which have not.
That is true, but the revenue from the artisanal stuff is probably only a very small percentage of the overall market, which would imply a lot of software engineers would have to exit the field. Which is what we here don't want to see.
Seems like the high compute parallel thinking models weren't even needed, both the normal 5.4 and gemini 3.1 pro solved it. Somehow Gemini 3 deepthink couldn't solve it.
Now explain why it wouldn’t also be fair to kick people off that were loudly emitting disgusting flatulence. Is it because they “might” not have control over it? Can I not claim I also “might” not have the control over my impulsive desire to listen to music or that I can’t use headphones for a medical issue?
I mean such a thing I would say equally detracts from the flying experience, so why not also kick those people off?
Edit: not sure why I’m getting downvoted, this is a legitimate question. I genuinely want to hear the justification.
You'd have a more convincing argument if you argued for a passenger with Tourette's or something. Bodily functions are obviously different from watching a movie at full volume, because there's never a situation where you would be involuntarily blasting the audio of your show or whatever to the whole plane.
Okay, Tourette’s then. Should we kick people off for Tourette’s?
Your comment also presupposes two things: that flatulence is always involuntary and blasting music isn’t. Let’s say I have a form of Tourette’s that forces me to involuntarily blast noise and music and I have medical papers to prove it. Is it okay then?
I would absolutely support it if you could demonstrate that those two things are actually true. My point is: Who gets to decide what’s legitimately an involuntary medical issue and what isn’t, and where is the line that demarcates it? And what is the point of this exercise? It’s to prevent people from forcing everyone else to have a worse experience for their own personal gain. You could argue flatulence is a form of that, so why is blasting music fundamentally different?
Yes. Because I'm asking the question who decides what is involuntary or not. Who is it? It seems like there is a presupposition here, but who is defining that?
Coming back to the Tourette's example: let's say someone starts shouting cuss words and loudly annoying everyone else "involuntarily". Do they get kicked off the plane? Why or why not? Who decides that? Does the person have to present medical evidence that they have Tourette's to not get kicked off the plane? If so, can they also present medical evidence of a condition that causes them to spontaneously press play on their mobile devices with no headphones and would that be accepted?
I'm obviously not defending the behavior of the loud-music-on-plane-players, or advocating that everyone needs to smell everyone's farts. I'm pointing out that this is something that is arbitrary and weaponizable.
You don't understand that a phone isn't a part of the human body? Seriously? We as a society can't even come to agreement on that basic fact anymore?
If someone shoots a gun in a crowd is that too an involuntary bodily function? Is the gun not just part of their body? Are you confused by that as well? Where do we draw the limits on what is the human body? Who decides that? If I lay on the ground does the whole earth become my body?
FUN FACT: Aviation rules require that any plane carrying a parachute must have at least one for every person on board. Hopefully the reason is obvious.
Now given that, do you really want to pay the extra cost of flying with 300 parachutes just so mr-full-volume-phone can have one?
That is an incredibly fun fact. Does this only apply to commercial or also a little Cessna? Presumably there is no actual enforcement on the private planes.
I made it too fun: what I said was at best an overgeneralization. The actual rules [1] apply to acrobatics and say that parachutes are required for everyone when a non-crew passenger is on the plane:
Unless each occupant of the aircraft is wearing an approved parachute, no pilot of a civil aircraft carrying any person (other than a crewmember) may execute any intentional [acrobatic] maneuver...
So without the passenger no one needs a parachute, with them everyone does.
It's perfectly legal for a 787 to carry a few parachutes just for the full-volume passengers.
Nah, with how ticketing is these days they'll bug you a dozen times to choose between the $50 basic economy disaster package that only has the mask and 50% airflow or the full package for $100 that includes another 25% airflow and a flotation device. Business executive gets you the parachute, a private life raft, and a few days of MREs for $250.
Either you missed the joke or I missed your sarcasm. I read GP as a joke: being literally kicked off a flight in mid-air is a death sentence, which is a rather harsh penalty indeed.
There's a fine line between making ppl civilized and a fascism-like level of control. And I believe Japan errs too far on the side of control with their ridiculous number of such rules in all areas of life. Even though I recently visited Japan, I can't really speak to how happy they are, but the stereotype is that they are not the happiest ppl out there. I believe their obedience to all such societal rules has a role in it.
Not true. Geekbench, especially the single-threaded benchmark, is probably the best we've got: it has a bunch of workloads, unlike many other benchmarks such as Cinebench. And they publish all the results on their website, so you can dig into each individual workload and find the ones that apply to you.
And like the other poster mentioned, it correlates well with SPEC, so it's basically an easily accessible SPEC. These days the only benchmark I use to quickly judge a CPU is Geekbench.
May I suggest the one I use (I wrote it), which also correlates well with SPEC & Geekbench 5, but can also run the benchmarks on all cores if you want, so you get both max single-thread and max multi-thread: https://github.com/dkechag/dkbench-docker . You basically run 'docker run -it --rm dkechag/dkbench'.
I took a look. It's not bad, but it seems to contain too many microbenchmarks like regex or primes. Geekbench at least has clang, which is a subscore that I always look at.
The primes one is my least favourite one indeed, I left it in just because I happened to include it in the very first version and I am thinking it just counts for 5% in the end...
The regex ones are "micro" yet quite important: dkbench is a Perl (and C)-based benchmark (it reflects our main code), and the regex engine is the most highly optimized part of the language, so regex speed is a good representation of text processing speed in Perl.
As I said, the overall score correlates well to SPEC/Geekbench so as a suite it works well.
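For anyone who wants to check a "correlates well" claim on their own machines, here is a minimal sketch of a Pearson correlation between two lists of per-machine scores. The function name `pearson` is made up for illustration and is not part of dkbench or Geekbench; you would feed it scores you measured yourself, one pair per machine.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists.

    Hypothetical helper for illustration: xs and ys would be scores from
    two benchmark suites, measured on the same machines in the same order.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)
```

A value close to 1 means the two suites rank and scale the machines similarly. With only a handful of machines the number is noisy, so treat it as a rough sanity check.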
For compiler comparisons I usually compile a language like Python or Perl as a test, but I did not want to add something like that, to keep it fast with many smaller benchmarks.