Monday, March 9, 2026

The 70% Factuality Ceiling: Why Google’s New ‘FACTS’ Benchmark Is a Wake-Up Call for Enterprise AI


There is no shortage of generative AI benchmarks designed to measure a model’s performance and accuracy across a variety of useful enterprise tasks – from coding and instruction following to agent-based web browsing and tool use. However, many of these metrics share one fundamental flaw: they measure the model’s ability to solve specific problems and requests, not how factual its output is – how reliably it produces objectively correct information grounded in real-world data – especially for information contained in images or graphics.

In industries where accuracy is paramount – legal, financial and medical – the lack of a standardized way to measure factuality has been a critical blind spot.

That’s changing today: Google’s FACTS team and Kaggle, its data science competition platform, have released the FACTS Benchmark Suite, a comprehensive evaluation framework designed to close this gap.

The accompanying research article offers a more nuanced definition of the problem, dividing “factuality” into two distinct operational scenarios: “contextual factuality” (grounding an answer in provided data) and “world-knowledge factuality” (retrieving information from memory or the web).

While the headline news is Gemini 3 Pro’s top-tier ranking, the deeper story for developers is the industry-wide “factuality wall.”

According to the initial results, no model – not Gemini 3 Pro, GPT-5 or Claude Opus 4.5 – achieved an accuracy score of 70% across the full problem set. For tech leaders, this is a signal: the era of “trust but verify” is not over yet.

Deconstructing the benchmark

The FACTS suite goes beyond straightforward questions and answers. It consists of four separate tests, each simulating a different real-world failure mode that developers encounter in production:

  1. Parametric benchmark (internal knowledge): Can the model answer trivia-style questions accurately using only training data?

  2. Search benchmark (tool usage): Can the model effectively employ a web search tool to retrieve and synthesize current information?

  3. Multimodal benchmark (vision): Can the model accurately interpret charts, graphs and images without hallucinations?

  4. Grounding benchmark v2 (context): Can the model stick closely to the source text it is given?

Google has made 3,513 examples publicly available, while Kaggle holds back a private set to prevent developers from training on the test data – a common problem known as “contamination.”

The leaderboard: it’s anyone’s game

The initial benchmark run puts Gemini 3 Pro in the lead with a composite FACTS score of 68.8%, followed by Gemini 2.5 Pro (62.1%) and OpenAI’s GPT-5 (61.8%). However, a closer look at the data reveals where the real battlegrounds are for engineering teams.

| Model | FACTS score (average) | Search (RAG capability) | Multimodal (vision) |
| --- | --- | --- | --- |
| Gemini 3 Pro | 68.8 | 83.8 | 46.1 |
| Gemini 2.5 Pro | 62.1 | 63.9 | 46.9 |
| GPT-5 | 61.8 | 77.7 | 44.1 |
| Grok 4 | 53.6 | 75.3 | 25.7 |
| Claude Opus 4.5 | 51.3 | 73.2 | 39.2 |

Data comes from the FACTS team’s release notes.

For developers: the “search” vs. “parametric” gap

For developers building RAG (retrieval-augmented generation) systems, the most significant metric is the search benchmark.

The data shows a clear gap between a model’s ability to “know” things (parametric) and its ability to “find” things (search). Gemini 3 Pro, for example, scores 83.8% on search tasks but only 76.4% on parametric tasks.

This confirms the current enterprise architecture standard: do not rely on internal model memory for critical facts.

If you’re building an internal knowledge bot, the FACTS results suggest that connecting your model to a search tool or vector database is not optional – it’s the only way to raise accuracy to an acceptable production level.
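In practice, “connecting the model to a search tool” means retrieving relevant documents first and instructing the model to answer only from that context. The sketch below is a deliberately minimal illustration of that pattern – the toy keyword-overlap scorer stands in for a real vector search, and all function names are hypothetical:

```python
# Illustrative sketch: ground answers in retrieved documents instead of
# relying on the model's parametric memory. The keyword-overlap scorer is
# a stand-in for a real embedding/vector search; names are hypothetical.

def score(query: str, doc: str) -> int:
    """Crude relevance score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k most relevant documents for the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_grounded_prompt(query: str, corpus: list[str]) -> str:
    """Ask the model to answer ONLY from the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return (
        "Answer using only the context below. If the answer is not in the "
        f"context, say so.\n\nContext:\n{context}\n\nQuestion: {query}"
    )

corpus = [
    "Refunds are processed within 14 days of a return.",
    "Our support line is open Monday through Friday.",
    "Shipping is free on orders over 50 euros.",
]
prompt = build_grounded_prompt("How long do refunds take?", corpus)
```

The prompt that reaches the model now carries the refund policy verbatim, so the answer is graded against provided context (the “contextual factuality” scenario) rather than the model’s memory.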

Multimodality warning

The most concerning data point for product managers is performance on multimodal tasks. Scores here are uniformly low: even the category leader, Gemini 2.5 Pro, managed only 46.9% accuracy.

The benchmark’s tasks included reading charts, interpreting diagrams, and identifying objects in images. With overall accuracy below 50%, this suggests that multimodal AI is not yet ready for unsupervised data extraction.

The takeaway: if your product roadmap includes AI automatically pulling data from invoices or interpreting financial charts without human review, you are probably introducing a significant error rate into your pipeline.
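One common mitigation is a human-in-the-loop gate: extractions the model is unsure about are routed to a review queue instead of flowing straight into downstream systems. A minimal sketch, assuming the model reports a per-field confidence (the field names and threshold here are illustrative):

```python
# Hypothetical human-in-the-loop gate for multimodal extraction: anything
# below a confidence threshold goes to manual review rather than being
# written straight into the pipeline.

from dataclasses import dataclass

@dataclass
class Extraction:
    field: str
    value: str
    confidence: float  # model-reported confidence in [0.0, 1.0]

def triage(extractions, threshold=0.9):
    """Split extractions into auto-accepted and human-review queues."""
    accepted = [e for e in extractions if e.confidence >= threshold]
    review = [e for e in extractions if e.confidence < threshold]
    return accepted, review

batch = [
    Extraction("invoice_total", "1,240.00", 0.97),
    Extraction("due_date", "2026-03-31", 0.62),
]
accepted, review = triage(batch)
```

The threshold is a business decision: given sub-50% benchmark accuracy on vision tasks, a conservative cutoff that sends more items to human review is the safer default.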

Why this matters for your stack

The FACTS benchmark is likely to become a standard yardstick in procurement. When evaluating models for enterprise use, tech leaders should look beyond the composite score and dive into the specific benchmark that fits their use case:

  • Building a customer service bot? Check the grounding score to make sure the bot sticks to your policy documents. (Gemini 2.5 Pro actually outperformed Gemini 3 Pro here, 74.2 vs. 69.0.)

  • Building a research assistant? Prioritize the search score.

  • Building an image analysis tool? Proceed with extreme caution.
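The selection logic above can be made explicit by weighting the sub-benchmarks that matter for your use case. The scores below come from the leaderboard table earlier in this article; the weights themselves are illustrative, not part of the FACTS release:

```python
# Sketch: choose a model by weighting the FACTS sub-benchmark scores that
# match your use case. Scores are from the article's table; the weighting
# scheme is an illustrative assumption.

SCORES = {  # model -> {sub-benchmark: score}
    "Gemini 3 Pro":   {"search": 83.8, "multimodal": 46.1},
    "Gemini 2.5 Pro": {"search": 63.9, "multimodal": 46.9},
    "GPT-5":          {"search": 77.7, "multimodal": 44.1},
}

def best_model(weights: dict[str, float]) -> str:
    """Return the model with the highest weighted sub-benchmark score."""
    def weighted(model: str) -> float:
        return sum(w * SCORES[model][b] for b, w in weights.items())
    return max(SCORES, key=weighted)

# A research assistant leans heavily on search/RAG quality.
print(best_model({"search": 1.0}))      # → Gemini 3 Pro
# A chart-reading tool cares only about the multimodal score.
print(best_model({"multimodal": 1.0}))  # → Gemini 2.5 Pro
```

The point is not the arithmetic but the habit: a single composite number hides exactly the per-benchmark differences that determine which model fits a given workload.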

As the FACTS team noted in their release, “All models evaluated achieved overall accuracy below 70%, leaving significant margins for future progress.” For now, the message to the industry is clear: models are getting smarter, but they are not yet infallible. Design your systems with the understanding that the raw model may be wrong roughly a third of the time.
