Saturday, April 19, 2025

Meta got caught gaming AI benchmarks


Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash “across a broad range of widely reported benchmarks.”

Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best ones. In its press release, Meta highlighted Maverick’s Elo score of 1417, which placed it above OpenAI’s 4o and just under Gemini 2.5 Pro. (A higher Elo score means the model wins more often when it goes head-to-head with competitors in the arena.)
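As a rough illustration of how an Elo score like Maverick’s 1417 translates into win probabilities, here is a minimal Python sketch of the standard Elo update rule that arena-style leaderboards are built on. It is not LMArena’s actual implementation; the K-factor and the example ratings in it are assumptions.

```python
# A minimal sketch of the standard Elo update rule used by arena-style
# leaderboards. Illustrative only, not LMArena's actual implementation;
# the K-factor of 32 and the ratings below are assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head vote."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return rating_a + delta, rating_b - delta

# A model rated 1417 is expected to beat a 1400-rated rival just over
# half the time, so an upset loss moves the ratings more than an expected win.
print(f"win probability: {expected_score(1417, 1400):.3f}")  # ~0.524
print(elo_update(1417, 1400, a_won=False))  # upset: roughly a 17-point swing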

The achievement seemed to position Meta’s open-weight Llama 4 as a serious challenger to the state-of-the-art closed models from OpenAI, Anthropic, and Google. Then AI researchers digging through Meta’s documentation discovered something unusual.

In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn’t the same as the one available to the public. According to Meta’s own materials, it deployed an “experimental chat version” of Maverick to LMArena that was specifically “optimized for conversationality,” TechCrunch first reported.

“Meta’s interpretation of our policy did not match what we expect from model providers,” LMArena posted on X two days after the model’s release. “Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customized model to optimize for human preference. As a result of that, we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.”

Meta spokesperson Ashley Gabriel said in an emailed statement that “we experiment with all types of custom variants.”

“‘Llama-4-Maverick-03-26-Experimental’ is a chat-optimized version we experimented with that also performs well on LMArena,” Gabriel said. “We have now released our open source version and will see how developers customize Llama 4 for their own use cases.”

While what Meta did with Maverick isn’t explicitly against LMArena’s rules, the site has shared concerns about gaming the system and has taken steps to “prevent overfitting and benchmark leakage.” When companies can submit specially tuned versions of their models for testing while releasing different versions to the public, benchmark rankings like LMArena’s become less meaningful as indicators of real-world performance.

“It’s the most widely respected general benchmark because all of the other ones suck,” independent AI researcher Simon Willison tells The Verge. “When Llama 4 came out, the fact that it came second in the arena, just after Gemini 2.5 Pro, really impressed me, and I’m kicking myself for not reading the small print.”

Shortly after Meta released Maverick and Scout, the AI community started talking about a rumor that Meta had also trained its Llama 4 models to perform better on benchmarks while hiding their real limitations. Meta’s VP of generative AI, Ahmad Al-Dahle, addressed the accusations in a post on X: “We’ve also heard claims that we trained on test sets; that’s simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.”

“It’s a very confusing release generally.”

Some also noticed that Llama 4 was released at an odd time. Saturday isn’t usually when big AI news drops. After someone on Threads asked why Llama 4 was released over the weekend, Meta CEO Mark Zuckerberg replied: “That’s when it was ready.”

“It’s a very confusing release generally,” says Willison, who closely follows and documents AI models. “The model score that we got there is completely worthless to me. I can’t even use the model that they got a high score on.”

Meta’s path to releasing Llama 4 wasn’t exactly smooth. According to a recent report from The Information, the company repeatedly pushed back the launch because the model failed to meet internal expectations. Those expectations are especially high after DeepSeek, an open-source AI startup from China, released an open-weight model that generated tons of buzz.

Ultimately, using an optimized model on LMArena puts developers in a difficult position. When selecting models like Llama 4 for their applications, they naturally look to benchmarks for guidance. But as is the case with Maverick, those benchmarks can reflect capabilities that aren’t actually available in the models the public can access.

As AI development accelerates, this episode shows how benchmarks are becoming battlegrounds. It also shows how eager Meta is to be seen as an AI leader, even if that means gaming the system.

Update, April 7th: This story has been updated to include Meta’s statement.
