Just a few weeks ago, Google debuted its Gemini 3 model, claiming leadership on many AI benchmarks. But the challenge with vendor-provided benchmarks is that they are exactly that: provided by the vendors.
A new vendor-neutral ranking from Prolific, however, also puts Gemini 3 at the top of the leaderboard. The ranking is not built on a set of academic benchmarks; rather, it is based on a set of real attributes that real users and organizations care about.
Prolific was founded by scientists from the University of Oxford. The company provides high-quality, reliable human data to support rigorous research and the ethical development of artificial intelligence. Its HUMAINE benchmark applies this approach, using representative human samples and blind testing to rigorously compare AI models across a variety of user scenarios, measuring not only technical performance but also user trust, adaptability and communication style.
In HUMAINE’s latest test, 26,000 users participated in a blind model evaluation. In that evaluation, Gemini 3 Pro’s trust score rose from 16% to 69%, the highest result in Prolific’s history: Gemini 3 now ranks first for trust, ethics and safety 69% of the time across all demographic subgroups, compared with its predecessor Gemini 2.5 Pro, which ranked first only 16% of the time.
Overall, Gemini 3 ranked first in three of the four evaluation categories: performance and reasoning, interaction and adaptability, and trust and safety. It lost only on communication style, where DeepSeek V3 won with 43% of preferences. The HUMAINE test also found that Gemini 3 performed consistently well across 22 different user demographics, accounting for differences in age, gender, ethnicity and political orientation. The evaluation also showed that users were five times more likely to choose a given model in direct blind comparisons.
But the ranking matters less than why Gemini 3 won.
“It’s consistency across a very wide range of different use cases and a personality and style that appeals to a wide range of different types of users,” Phelim Bradley, co-founder and CEO of Prolific, told VentureBeat. “While some small subgroups, or some specific types of conversations, prefer different models, the breadth of knowledge and flexibility of the model across a range of different use cases and audience types allowed it to win this particular benchmark.”
How blind testing reveals what academic benchmarks lack
The HUMAINE methodology exposes gaps in the way the industry evaluates models. Users interact with two models simultaneously in multi-turn conversations. They don’t know which vendor powers each response. They discuss whatever topics are important to them, not predetermined test questions.
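The core of this blind protocol can be sketched in a few lines. This is a hypothetical illustration, not Prolific's actual implementation: the user only ever sees anonymous labels, and the mapping back to real model names happens after the preference is recorded.

```python
import random

def make_blind_pair(models):
    """Randomly assign two distinct models to anonymous slots A and B.

    The user sees only "A" and "B", so brand perception cannot
    influence the preference they report.
    """
    a, b = random.sample(models, 2)
    return {"A": a, "B": b}

def record_preference(assignment, chosen_label):
    """Translate the user's blind choice back to the real model names."""
    other = "B" if chosen_label == "A" else "A"
    return {"winner": assignment[chosen_label], "loser": assignment[other]}

# Example run with made-up model names
random.seed(0)
pair = make_blind_pair(["model-x", "model-y", "model-z"])
result = record_preference(pair, "A")
```

Randomizing the assignment per conversation also prevents position bias: no model is systematically shown first.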
It is the sampling itself that matters. HUMAINE uses representative samples of the US and UK populations, controlling for age, gender, ethnicity and political orientation. This reveals something that static benchmarks can’t capture: model performance varies depending on the audience.
“If you take the AI leaderboards, most of them are still pretty static lists,” Bradley said. “But in our case, when we account for audience, we get a slightly different table of results depending on whether we’re looking at a left-leaning or right-leaning sample, in the US or the UK. I think age was the condition that made the biggest difference in our experiment.”
This matters for companies deploying AI across diverse employee populations. A model that works well for one demographic may perform less well for another.
The methodology also addresses a fundamental question in AI evaluation: why use human judges at all when AI can judge itself? Bradley noted that his company does use AI judges in some use cases, although he emphasized that human review remains a critical factor.
“We see that the greatest gains come from smart orchestration of both LLM judges and human data. Both have strengths and weaknesses which, when combined wisely, produce better results,” Bradley said. “But we still think human data is where the alpha is. We’re still extremely bullish that human data and human intelligence need to be on top of that.”
What does trust mean in AI assessment?
The trust, ethics and safety category measures a user’s confidence in a model’s reliability, factual accuracy and responsible behavior. In the HUMAINE methodology, trust is not a vendor claim or a technical metric – it is what users report after blind interactions with competing models.
The 69% figure represents the probability that Gemini 3 ranks first across demographic groups. This consistency matters more than aggregate performance, because organizations may serve different populations.
“There was no awareness that Gemini was being used in this scenario,” Bradley said. “This was based purely on blind, multi-turn interaction.”
This separates perceived trust from earned trust. Users evaluated model output without knowing which vendor produced it, eliminating Google’s brand advantage. In customer-facing deployments, where the AI provider is invisible to end users, this distinction is crucial.
What companies should do now
One of the key things companies should do now, as they weigh different models, is adopt an evaluation framework that actually works.
“It is becoming more and more difficult to evaluate models on vibes alone,” Bradley said. “I think we increasingly need a more rigorous, scientific approach to really understand how these models work.”
The HUMAINE data suggests a framework: test for consistency across use cases and user demographics, not just peak performance on specific tasks; use blind testing to separate model quality from brand perception; employ representative samples that match your actual user population; and plan for ongoing evaluation as models change.
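The first point of that framework — consistency across demographics rather than one aggregate score — can be made concrete with a small sketch. This is an illustrative aggregation over made-up preference records, not HUMAINE's actual pipeline: it computes per-subgroup win rates so that inconsistency across audiences becomes visible.

```python
from collections import defaultdict

def win_rates_by_group(records):
    """Aggregate blind pairwise preferences into per-demographic win rates.

    records: iterable of dicts with 'group', 'winner', 'loser' keys,
    one per blind comparison. Returns {group: {model: win_rate}}.
    """
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for r in records:
        for model in (r["winner"], r["loser"]):
            totals[r["group"]][model] += 1  # comparisons the model appeared in
        wins[r["group"]][r["winner"]] += 1
    return {g: {m: wins[g][m] / totals[g][m] for m in totals[g]}
            for g in totals}

# Toy data with hypothetical model names and age groups
records = [
    {"group": "18-34", "winner": "model-x", "loser": "model-y"},
    {"group": "18-34", "winner": "model-x", "loser": "model-y"},
    {"group": "55+",   "winner": "model-y", "loser": "model-x"},
]
rates = win_rates_by_group(records)
# model-x wins every comparison in the 18-34 group and none in the
# 55+ group; a single pooled win rate would hide that split.
```

A model that tops the pooled leaderboard but loses badly in one subgroup is exactly the case this breakdown is meant to catch.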
For enterprises looking to implement AI at scale, this means moving from “which model is best” to “which model is best for our specific use case, user demographics, and required attributes.”
The rigor of representative sampling and blind testing provides the data to make this determination – something technical benchmarks and vibes-based assessment cannot.
