Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in the source material provided and avoid hallucinations
Large language models (LLMs) are transforming how we access information, yet their factual accuracy remains imperfect. They can “hallucinate” false information, particularly when given complex inputs. In turn, this can erode trust in LLMs and limit their real-world applications.
Today we are introducing FACTS Grounding, a comprehensive benchmark for evaluating the ability of LLMs to generate responses that are not only factually accurate with respect to the input data provided, but also sufficiently detailed to provide satisfactory answers to user queries.
We hope our benchmark will spur industry-wide progress on factuality and grounding. To track progress, we are also launching the FACTS leaderboard on Kaggle. We have already tested leading LLMs using FACTS Grounding and have populated the initial leaderboard with their grounding scores. We will maintain and update the leaderboard as the field advances.
FACTS Grounding Dataset
To accurately evaluate the factuality and grounding of any given LLM, the FACTS Grounding dataset contains 1,719 examples, each carefully crafted to require a long-form response grounded in the context document provided. Each example comprises a document, a system instruction requiring the LLM to exclusively reference the provided document, and an accompanying user request.
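For illustration, a single example could be represented roughly as follows. This is only a sketch of the structure described above; the field names are hypothetical and not the actual schema of the released dataset files.

```python
# Hypothetical sketch of one FACTS Grounding example; the real dataset
# schema and field names may differ.
example = {
    "context_document": "Full text of the source document (up to ~32,000 tokens)...",
    "system_instruction": (
        "Answer the user's request using only information contained in the "
        "provided document. Do not draw on outside knowledge."
    ),
    "user_request": "Summarize the key findings of the attached report.",
}
```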
All examples are divided into a “public” set (860) and a “private” held-out set (859). We are releasing the public set today so that anyone can use it to evaluate an LLM. Of course, we know that benchmark contamination and leaderboard hacking are important to protect against, so, following standard industry practice, we are keeping the private evaluation set held out. The FACTS leaderboard scores are the average performance across both the public and private sets.
To ensure a diversity of inputs, the FACTS Grounding examples include documents of varying lengths, up to a maximum of 32,000 tokens (roughly 20,000 words), covering domains such as finance, technology, retail, medicine, and law. The user requests are similarly wide-ranging, including requests for summarization, Q&A generation, and rewriting tasks. We did not include any examples that could require creativity, mathematics, or complex reasoning – capabilities that might require the model to apply more advanced reasoning in addition to grounding.
Collective assessment by leading LLMs
To succeed on a given example, an LLM must synthesize the complex information in the document and generate a long-form response that is both a comprehensive answer to the user request and fully attributable to that document.
FACTS Grounding evaluates model responses automatically using three frontier LLM judges – namely Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. We selected a combination of different judges to mitigate any potential bias of a judge giving higher scores to responses produced by a member of its own model family. The automated judge models were comprehensively evaluated against a held-out test set to find the best-performing judge prompt templates and to verify agreement with human raters.
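To make the setup concrete, a judge prompt for the grounding phase might look like the sketch below. This template is purely illustrative and is not the actual template used in the benchmark (those are documented in the paper); the names `GROUNDING_JUDGE_PROMPT` and `build_judge_prompt` are assumptions for this example.

```python
# A minimal, hypothetical judge prompt template for the grounding phase.
# The actual prompt templates used by FACTS Grounding are described in the
# paper; this sketch only illustrates the kind of inputs a judge model sees.
GROUNDING_JUDGE_PROMPT = """\
You are given a source document, a user request, and a model response.
Decide whether every claim in the response is fully supported by the
document. Answer with a single word: "grounded" or "ungrounded".

Document:
{document}

User request:
{user_request}

Model response:
{response}
"""


def build_judge_prompt(document: str, user_request: str, response: str) -> str:
    """Fill the template for one (example, response) pair."""
    return GROUNDING_JUDGE_PROMPT.format(
        document=document, user_request=user_request, response=response
    )
```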
Each FACTS Grounding example is judged in two phases. First, responses are evaluated for eligibility and disqualified if they don’t sufficiently address the user’s request. Second, responses are judged as factually accurate if they are fully grounded in information contained in the provided document and contain no hallucinations.
With the eligibility and grounding accuracy of a given LLM response evaluated separately by multiple AI judge models, the results are then aggregated to determine whether the LLM handled the example successfully. The final score for the overall grounding task is the average of all judge models’ scores across all examples. Find more details of our FACTS Grounding evaluation methodology in our paper.
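As a rough sketch of how such scores could be aggregated, the snippet below averages per-judge results in which a response only counts as correct when it is both eligible and fully grounded. This is an assumption-laden illustration, not the exact aggregation procedure from the paper; the data layout and field names are made up.

```python
from statistics import mean


def aggregate_facts_score(per_judge_results):
    """Average grounding scores over judges and examples (illustrative only).

    `per_judge_results` maps a judge model's name to a list of per-example
    dicts with two hypothetical boolean fields:
      - "eligible": the response sufficiently addresses the user request
      - "grounded": every claim is supported by the provided document
    A response counts as correct only if it is both eligible and grounded.
    """
    judge_scores = []
    for examples in per_judge_results.values():
        correct = [1.0 if ex["eligible"] and ex["grounded"] else 0.0
                   for ex in examples]
        judge_scores.append(mean(correct))
    # Final score: the average of each judge model's score across all examples.
    return mean(judge_scores)


# Example usage with made-up results from two judges on two examples:
results = {
    "judge_a": [{"eligible": True, "grounded": True},
                {"eligible": True, "grounded": False}],
    "judge_b": [{"eligible": True, "grounded": True},
                {"eligible": False, "grounded": True}],
}
print(aggregate_facts_score(results))  # -> 0.5
```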
FACTS Grounding will continue to evolve
We recognize that progress can quickly outpace benchmarks, so the launch of our FACTS Grounding benchmark and leaderboard is just the beginning. Factuality and grounding are among the key factors that will shape the future success and usefulness of LLMs and broader AI systems, and our aim is to grow and refine FACTS Grounding as the field progresses, continually raising the bar.
We encourage the AI community to engage with FACTS Grounding, evaluate their models on the open set of examples, or submit their models for evaluation. We believe that comprehensive benchmarking methods, coupled with continuous research and development, will continue to improve AI systems.
Acknowledgements
FACTS Grounding was led by: Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu and Nate Keating.
We also greatly appreciate contributions from: Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang and Sasha Goldshtein.
We would also like to thank Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu and Yossi Matias for their continuous support.