Earlier this week, Meta found itself in the midst of controversy after using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on the LM Arena benchmark. The incident prompted LM Arena's maintainers to issue an apology, revise their policies, and evaluate the unmodified, publicly available version of Maverick. The results were less than impressive: the unmodified Maverick, officially named “Llama-4-Maverick-17B-128E-Instruct,” ranked below models such as OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro, some of which are several months old (via https://x.com/pigeon__s/status/1910705956486336586). The showing raised questions about the true capabilities of Meta's publicly available model.

So why the stark difference in performance? Meta explained that the experimental version, “Llama-4-Maverick-03-26-Experimental,” was specifically “optimized for conversationality.” That optimization played well with LM Arena's evaluation method, which relies on human raters comparing model outputs and selecting their preferred responses. The focus on conversational appeal, however, didn't translate into stronger performance in broader contexts.

It's worth noting that LM Arena has faced scrutiny over its reliability as a comprehensive measure of AI model performance. Tailoring a model to excel on a specific benchmark may boost its score, but it can be misleading: it makes it difficult for developers to predict how the model will perform across diverse real-world applications. The incident highlights the challenges of creating fair and representative benchmarks for AI models.

In response to the controversy, a Meta spokesperson said the company experiments with “all types of custom variants,” clarifying that “‘Llama-4-Maverick-03-26-Experimental’ is a chat optimized version we experimented with that also performs well on LMArena.” Meta added that it has now released the open-source version and is eager to see how developers customize Llama 4 for their own use cases and provide feedback.

The episode underscores the importance of transparency and clear communication in the AI community, especially when it comes to benchmarking and model evaluation.