LMArena Is a Cancer on AI

halbgut · 2026-01-06T17:14:23

Like any LLM benchmark, LMArena is highly flawed. I do think it has a right to exist. For me, anecdotally, it has been indicative of which LLM's style I like best, not necessarily of its factual accuracy. It hasn't, however, been a very useful tool for finding the best LLM for a given job.

To the article's point though, it's treated as the gold standard, which it isn't. We should have learned that with the sycophancy-gate.

I'm not sure the methodology here is really sound for the question at hand. It's a bit like saying prediction markets don't work because 40% of the people who voted were wrong. The aggregate can still be informative even when a lot of the individual votes are off.

You can't really get around running your own benchmarks for the job at hand if you want 95th-percentile performance on a task.
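
A minimal sketch of what a benchmark "for the job at hand" can look like, in Python. Everything here is made up for illustration: the names `run_benchmark` and `exact_match`, the stub models, the three test cases, and the choice of exact-match scoring. A real setup would call actual model APIs and use a scoring check that fits the task.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a "model" is any function from prompt to answer.
Model = Callable[[str], str]
Case = Tuple[str, str]  # (prompt, expected answer)


def exact_match(output: str, expected: str) -> bool:
    """Score one answer; swap in whatever check fits your task."""
    return output.strip().lower() == expected.strip().lower()


def run_benchmark(models: Dict[str, Model], cases: List[Case]) -> Dict[str, float]:
    """Return the fraction of cases each model answers correctly."""
    if not cases:
        raise ValueError("need at least one test case")
    scores: Dict[str, float] = {}
    for name, model in models.items():
        correct = sum(exact_match(model(prompt), expected) for prompt, expected in cases)
        scores[name] = correct / len(cases)
    return scores


# Stand-in "models" so the sketch runs without any API; replace with real calls.
def model_a(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def model_b(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "I'm not sure")


# Illustrative cases only; in practice these come from your own workload.
CASES: List[Case] = [("2+2", "4"), ("capital of France", "paris"), ("3*3", "9")]

if __name__ == "__main__":
    results = run_benchmark({"model_a": model_a, "model_b": model_b}, CASES)
    for name, score in sorted(results.items()):
        print(f"{name}: {score:.2f}")
```

The point of keeping it this small is that the cases and the scoring function are the parts that matter. Those come from your own task, which is exactly what a general leaderboard can't give you.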