I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.
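For anyone who hasn't seen it, the whole test is a single prompt and an eyeball check of the result. A minimal sketch of running something like it yourself, assuming the official `openai` Python client; the model name is a placeholder, not the benchmark author's exact setup:

```python
# Minimal sketch of a "pelican on a bicycle"-style test.
# Assumes the official openai package (>=1.0); model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # swap in whichever model you want to compare
    messages=[
        {
            "role": "user",
            "content": "Generate an SVG of a pelican riding a bicycle.",
        }
    ],
)

# Save the raw reply; judging the drawing is done by eye, which is the point.
with open("pelican.svg", "w") as f:
    f.write(response.choices[0].message.content)
```

Open the file in a browser and you instantly have an opinion, which is more than most leaderboard numbers give a layperson.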
The field is advancing so fast that it's hard to do real science; there will be a new SOTA by the time you're ready to publish results. I think this is a combination of that and people having a laugh.
Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?
A benchmark only tests what the benchmark measures; the goal is to make that task correlate with things that are actually valuable. Graphics benchmarks are a good example: it's extremely hard to know what you'll get in a game by looking at 3DMark scores, and the gap varies a lot (see the toy sketch below).
Making an SVG of a single thing doesn't tell you much unless that performance generalizes to all SVG tasks.
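To make the correlation point concrete: if a benchmark were a good proxy, its scores would track the outcome you actually care about across products. A toy sketch, where every number is invented for illustration and not real 3DMark or FPS data:

```python
# Hypothetical example: how well do synthetic benchmark scores predict
# in-game performance? All numbers below are invented for illustration.
from statistics import correlation  # stdlib, Python 3.10+

benchmark_scores = [8200, 9100, 10400, 11800, 12500]  # 3DMark-style scores
game_fps         = [70,   55,   95,    68,    105]    # avg FPS in one title

# Pearson correlation: 1.0 would mean the benchmark is a perfect predictor.
r = correlation(benchmark_scores, game_fps)
print(f"correlation = {r:.2f}")  # ~0.61 here: positive, but scores alone mislead
```

A correlation well short of 1.0 is exactly the "it varies by a lot" problem: the higher-scoring card is often, but not reliably, the faster one in the game you actually play.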