Amazon assumes AI benchmarks don't matter

This is an excerpt from Sources by Alex Heath, a newsletter about artificial intelligence and the tech industry, distributed once a week only to The Verge subscribers.

Amazon’s head of artificial intelligence has a message for model freaks: stop looking at rankings.

“I want real-world usability. None of these benchmarks are real,” Rohit Prasad, vice president of AGI at Amazon, told me ahead of today’s announcements at the AWS re:Invent conference in Las Vegas. “The only way to run a true benchmark is if everyone follows the same training data and the ratings are completely maintained. That’s not what’s happening. Frankly, the ratings are getting more and more noisy and don’t show the true power of these models.”

It’s a contrarian stance when every other AI lab is quick to brag about how its latest models are rapidly climbing the rankings. It’s also convenient for Amazon, considering that the previous version of Nova, its flagship model, sat in 79th place on LMArena when I spoke to Prasad last week. Still, rejecting benchmarks only works if Amazon can tell a different story about progress.

At the center of today’s re:Invent announcements is Nova Forge, a service that Amazon says lets companies train custom AI models in ways that were previously impossible without spending billions of dollars. The problem Forge is addressing is real. Most companies trying to customize AI models face three bad options: fine-tune a closed model (but only at the edges), train on an open-weight model (but without the original training data, at the risk of capability regression, where the model becomes an expert on the new data but forgets its original, broader skills), or build a model from scratch at enormous cost.

Forge offers something else: access to Amazon’s Nova model checkpoints at the pre-, mid-, and post-training stages. Companies can introduce their proprietary data earlier in the process, when the model’s “learning capacity” is highest, as Prasad puts it, rather than simply adjusting the model’s behavior at the end.

“What we have done is democratize AI and develop frontier models for your use cases at a fraction of what it would cost [before],” said Prasad. Forge was created because Amazon’s internal teams needed a tool that could transfer their domain expertise to a core model without having to build one from scratch.

“We created Forge because our internal teams wanted Forge,” he said. This is a familiar Amazon pattern. AWS itself famously began as infrastructure built for Amazon’s own retail operations before becoming the company’s profit engine.

Reddit has been using Forge to build custom security models trained on 23 years of community moderation data. “I haven’t seen anything like this before,” Chris Slowe, Reddit’s chief technology officer and first employee, told me. “We had an outstanding engineer who acted like a kid in a candy store.”

Slowe said Reddit continued its pre-training work last week, which “looks really promising.” The goal: to replace multiple bespoke security models with a single Reddit expert model that understands the nuances of community moderation, including the notoriously subjective rule that appears everywhere on subreddits: “Don’t be a jerk.”

“Having an expert model will help you understand the community,” Slowe said. “It’ll have a pretty good idea of what a moron means.”

Slowe explained that Forge allows Reddit to control its models, avoid surprises from API changes, maintain ownership of its weights, and avoid sending sensitive data to third-party model providers. He said Reddit is already exploring the same approach for Reddit Answers and other products.

When I asked Slowe if it matters that Nova isn’t a top-ranked model on the benchmarks, he replied bluntly: “What matters in this context is the expertise of the model on Reddit.” This is the thread Amazon wants developers to tap into: not raw IQ points, but control and specialization.

With Forge, Amazon is making a calculated bet that the model race has become commoditized and that it can succeed as the place where companies build specialized AI for specific business problems. It’s an AWS-style view of the world: infrastructure over intelligence and customization over raw capabilities. This strategy also allows Amazon to avoid direct comparisons with OpenAI and Anthropic, which it once hoped to compete with at the model level.

Whether Forge is truly pioneering or just cleverly positioned depends, of course, on developer adoption. Amazon’s claim is that the model race is widely misunderstood. If that turns out to be true, the scoreboard will shift to something much quieter and harder to game: whether AI models actually provide real-world utility.
