Definitely one of the weaker areas in the current LLM boom. Comparing models, or even different versions of the same model, is a pseudo-scientific mess.
I'm still using https://lmarena.ai/leaderboard. Perhaps there is something better and someone will pipe up to tell me about it. But we use LLMs at work and have unexplainable variations between them.
And when we get a prompt working reliably on one model, we often have trouble porting it to another LLM - even straight "version upgrades" such as from GPT-4 to -5. Your prompt and your model become highly coupled quite easily.
I dunno what to do about it and am tending to just pick Gemini as a result.
Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.
A lot of the sycophancy mess that seeps out of this generation of LLMs stems from reckless tuning on human feedback. Tuning for good LMArena performance has similar effects - and not at all by coincidence.
It's biased toward small-context performance, which is why I don't pay much attention to it as a developer beyond a quick glance. I need performance at 40-100k tokens, which models like Deepseek can't deliver but Gemini 2.5 Pro and ChatGPT 5.0 Thinking can.
And even "long term performance" splits itself into "performance on multi-turn instruction following" and "performance on agentic tasks" down the line. And "performance on agentic tasks" is a hydra in itself.
Capturing LLM performance with a single metric is a hopeless task. But even a single flawed metric beats no metrics at all.
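In practice that can be as small as a pass rate over your own tasks. A minimal sketch, assuming an OpenAI-compatible client - the model names and test cases are made-up placeholders, not a real benchmark:

```python
# One crude metric: pass rate over a handful of in-house test cases.
# Model names and cases are placeholders; the point is that the check is
# reproducible and runs against whichever models you actually care about.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment

MODELS = ["model-a", "model-b"]  # hypothetical identifiers
CASES = [
    # (prompt, substring the answer must contain to count as a pass)
    ("Extract the invoice number from: 'Invoice INV-2041, due 2024-05-01'", "INV-2041"),
    ("What is 17 * 23? Answer with the number only.", "391"),
]

def pass_rate(model: str) -> float:
    passed = 0
    for prompt, expected in CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        if expected in (resp.choices[0].message.content or ""):
            passed += 1
    return passed / len(CASES)

for m in MODELS:
    print(m, pass_rate(m))
```

Crude, but it catches regressions when a prompt gets ported to a different model or a "version upgrade" lands, which the public leaderboards won't.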
This is something I've struggled with for my site. I made https://aimodelreview.com/ to compare the outputs of LLMs across a variety of prompts and categories, allowing a side-by-side comparison between them. I ran each prompt 4 times for each model, with the different temperature values available as toggles.
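Roughly the shape of the collection loop, as a sketch - not the site's actual code, and the client, model names, temperatures, and prompt are all placeholder assumptions:

```python
# Sketch: run each prompt once per temperature per model and store the raw
# outputs for side-by-side display. Placeholder models/temperatures/prompt.
import json
from openai import OpenAI

client = OpenAI()
MODELS = ["model-a", "model-b"]        # placeholders
TEMPERATURES = [0.0, 0.4, 0.8, 1.2]    # four runs per model, one per temperature
prompt = "Explain the difference between TCP and UDP in two sentences."

results = []
for model in MODELS:
    for temp in TEMPERATURES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temp,
        )
        results.append({
            "model": model,
            "temperature": temp,
            "output": resp.choices[0].message.content,
        })

print(json.dumps(results, indent=2))
```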
My thinking was to just make the responses available to users and let them see how models perform. But from some feedback, turns out users don't want to have to evaluate the answers and would rather see a leaderboard and rankings.
The scalable solution to that would be LLM-as-judge, which some benchmarks already use, but that just feels wrong to me.
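For reference, the mechanics are simple enough, which is part of why it's tempting - a minimal pairwise-judge sketch, assuming an OpenAI-compatible client and a placeholder judge model (real setups add position-swapping, rubrics, and many samples to dampen judge bias):

```python
# Minimal LLM-as-judge sketch: one "judge" model picks the better of two answers.
# The judge model name and prompts are placeholders, not a specific benchmark's setup.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer_a: str, answer_b: str, judge_model: str = "judge-model") -> str:
    prompt = (
        "You are grading two answers to the same question.\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Reply with exactly 'A' or 'B' for the better answer."
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return (resp.choices[0].message.content or "").strip()

# e.g. judge("What does HTTP 301 mean?", output_from_model_a, output_from_model_b)
```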
LM Arena tries to solve this with a crowd-sourced approach, but I think the right method would have to be domain-expert human reviewers - Wirecutter vs IMDb, essentially - and that is expensive to pull off.
>when we get a prompt working reliably on one model, we often have trouble porting it to another LLM
I saw a study where a prompt massively boosted one model's performance on a task, but significantly reduced another popular model's performance on the same task.
> Comparing models, or even different versions of the same model, is a pseudo-scientific mess.
Reminder that in most cases it's impossible to know whether there is cross-contamination from the test sets of public benchmarks, as most LLMs are not truly open-source. We can't replicate them.
So arguably it's worse in some cases - pretty much fraud if you account for the VC money pouring in. This is even more evident in unknown models from lesser-known institutes, such as those from the UAE.