Definitely one of the weaker areas in the current LLM boom. Comparing models, or even different versions of the same model, is a pseudo-scientific mess.
I'm still using https://lmarena.ai/leaderboard. Perhaps there is something better and someone will pipe up to tell me about it. But we use LLMs at work and have unexplainable variations between them.
And when we get a prompt working reliably on one model, we often have trouble porting it to another LLM - even straight "version upgrades" such as from GPT-4 to -5. Your prompt and your model become highly coupled quite easily.
I dunno what to do about it and am tending to just pick Gemini as a result.
Even professional human evaluators are quite vulnerable to sycophancy and overconfident-and-wrong answers. And LMArena evaluators aren't professionals.
A lot of the sycophancy mess that seeps out of this generation of LLMs stems from reckless tuning on human feedback. Tuning for good LMArena performance has similar effects - and not at all by coincidence.
It's biased toward small-context performance, which is why I don't pay much attention to it as a developer beyond a quick glance. I need performance at 40-100k tokens, which models like Deepseek can't deliver but Gemini 2.5 Pro and ChatGPT 5.0 Thinking can.
And even "long term performance" splits itself into "performance on multi-turn instruction following" and "performance on agentic tasks" down the line. And "performance on agentic tasks" is a hydra in itself.
Capturing LLM performance with a single metric is a hopeless task. But even a single flawed metric beats no metrics at all.
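In practice that can be as small as a pass rate over your own tasks. A minimal sketch, assuming an OpenAI-compatible client - the model names and test cases are made-up placeholders, not a real benchmark:

```python
# One crude metric: pass rate over a handful of in-house test cases.
# Model names and cases are placeholders; the point is that the check is
# reproducible and runs against whichever models you actually care about.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment

MODELS = ["model-a", "model-b"]  # hypothetical identifiers
CASES = [
    # (prompt, substring the answer must contain to count as a pass)
    ("Extract the invoice number from: 'Invoice INV-2041, due 2024-05-01'", "INV-2041"),
    ("What is 17 * 23? Answer with the number only.", "391"),
]

def pass_rate(model: str) -> float:
    passed = 0
    for prompt, expected in CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        if expected in (resp.choices[0].message.content or ""):
            passed += 1
    return passed / len(CASES)

for m in MODELS:
    print(m, pass_rate(m))
```

Crude, but it catches regressions when a prompt gets ported to a different model or a "version upgrade" lands, which the public leaderboards won't.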
This is something I've struggled with for my site. I made https://aimodelreview.com/ to compare the outputs of LLMs across a variety of prompts and categories, allowing a side-by-side comparison between them. I ran each prompt 4 times for each model, with the different temperature values available as toggles.
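Roughly the shape of the collection loop, as a sketch - not the site's actual code, and the client, model names, temperatures, and prompt are all placeholder assumptions:

```python
# Sketch: run each prompt once per temperature per model and store the raw
# outputs for side-by-side display. Placeholder models/temperatures/prompt.
import json
from openai import OpenAI

client = OpenAI()
MODELS = ["model-a", "model-b"]        # placeholders
TEMPERATURES = [0.0, 0.4, 0.8, 1.2]    # four runs per model, one per temperature
prompt = "Explain the difference between TCP and UDP in two sentences."

results = []
for model in MODELS:
    for temp in TEMPERATURES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temp,
        )
        results.append({
            "model": model,
            "temperature": temp,
            "output": resp.choices[0].message.content,
        })

print(json.dumps(results, indent=2))
```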
My thinking was to just make the responses available to users and let them see how models perform. But from some feedback, turns out users don't want to have to evaluate the answers and would rather see a leaderboard and rankings.
The scalable solution to that would be LLM-as-judge, which some benchmarks already use, but that just feels wrong to me.
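For reference, the mechanics are simple enough, which is part of why it's tempting - a minimal pairwise-judge sketch, assuming an OpenAI-compatible client and a placeholder judge model (real setups add position-swapping, rubrics, and many samples to dampen judge bias):

```python
# Minimal LLM-as-judge sketch: one "judge" model picks the better of two answers.
# The judge model name and prompts are placeholders, not a specific benchmark's setup.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer_a: str, answer_b: str, judge_model: str = "judge-model") -> str:
    prompt = (
        "You are grading two answers to the same question.\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Reply with exactly 'A' or 'B' for the better answer."
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return (resp.choices[0].message.content or "").strip()

# e.g. judge("What does HTTP 301 mean?", output_from_model_a, output_from_model_b)
```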
LM Arena tries to solve this with a crowd-sourced approach, but I think the right method would have to be domain-expert human reviewers - Wirecutter vs IMDb, essentially - and that is expensive to pull off.
>when we get a prompt working reliably on one model, we often have trouble porting it to another LLM
I saw a study where a prompt massively boosted one model's performance on a task, but significantly reduced another popular model's performance on the same task.
> Comparing models, or even different versions of the same model, is a pseudo-scientific mess.
Reminder that in most cases it's impossible to know whether there is cross-contamination from the test sets of public benchmarks, as most LLMs are not truly open-source. We can't replicate them.
So arguably it's worse in some cases - pretty much fraud if you account for the VC money pouring in. This is even more evident in unknown models from lesser-known institutes, such as those from the UAE.