New study accuses LM Arena of gaming its popular AI benchmark

The popular AI vibe test may not be as fair as it seems.

May 1, 2025 - 23:01
The rapid proliferation of AI chatbots has made it difficult to know which models are actually improving and which are falling behind. Traditional academic benchmarks only tell you so much, which has led many to lean on vibes-based analysis from LM Arena. However, a new study claims this popular AI ranking platform is rife with unfair practices that favor large companies—the same companies that happen to rank near the top of the index. The site's operators, however, say the study draws the wrong conclusions.

LM Arena was created in 2023 as a research project at the University of California, Berkeley. The pitch is simple: users feed a prompt into two unidentified AI models in the "Chatbot Arena" and vote on the output they like more. These votes are aggregated into the LM Arena leaderboard, which shows which models people prefer and can help track improvements in AI models over time.

Companies are paying more attention to this ranking as the AI market heats up. Google noted when it released Gemini 2.5 Pro that the model debuted at the top of the LM Arena leaderboard, where it remains to this day. Meanwhile, DeepSeek's strong performance in the Chatbot Arena earlier this year helped to catapult it to the upper echelons of the LLM race.
