I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.
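For anyone who hasn't seen it, the whole test is a single prompt and an eyeball check of the result. A minimal sketch of running something like it yourself, assuming the official `openai` Python client; the model name is a placeholder, not the benchmark author's exact setup:

```python
# Minimal sketch of a "pelican on a bicycle"-style test.
# Assumes the official openai package (>=1.0); model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # swap in whichever model you want to compare
    messages=[
        {
            "role": "user",
            "content": "Generate an SVG of a pelican riding a bicycle.",
        }
    ],
)

# Save the raw reply; judging the drawing is done by eye, which is the point.
with open("pelican.svg", "w") as f:
    f.write(response.choices[0].message.content)
```

Open the file in a browser and you instantly have an opinion, which is more than most leaderboard numbers give a layperson.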
The field is advancing so fast that it's hard to do real science; there will be a new SOTA by the time you're ready to publish results. I think this is a combination of that and people having a laugh.
Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?
A benchmark only tests what the benchmark measures; the goal is to make that task correlate with things that are actually valuable. Graphics benchmarks are a good example: it's extremely hard to know what you'll get in a game by looking at 3DMark scores, and the gap varies a lot (see the toy sketch below).
Making an SVG of a single thing doesn't tell you much unless that performance generalizes to all SVG tasks.
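To make the correlation point concrete: if a benchmark were a good proxy, its scores would track the outcome you actually care about across products. A toy sketch, where every number is invented for illustration and not real 3DMark or FPS data:

```python
# Hypothetical example: how well do synthetic benchmark scores predict
# in-game performance? All numbers below are invented for illustration.
from statistics import correlation  # stdlib, Python 3.10+

benchmark_scores = [8200, 9100, 10400, 11800, 12500]  # 3DMark-style scores
game_fps         = [70,   55,   95,    68,    105]    # avg FPS in one title

# Pearson correlation: 1.0 would mean the benchmark is a perfect predictor.
r = correlation(benchmark_scores, game_fps)
print(f"correlation = {r:.2f}")  # ~0.61 here: positive, but scores alone mislead
```

A correlation well short of 1.0 is exactly the "it varies by a lot" problem: the higher-scoring card is often, but not reliably, the faster one in the game you actually play.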